Configure Hadoop and Start Cluster Services Using an Ansible Playbook

Simran Shrivas
Dec 15, 2020

Hello Learner👨🏻‍💻!!!!

Here is one more article. As we all know, Ansible is an intelligent automation tool, so in this article we will configure a Hadoop cluster using an Ansible playbook, without logging in to the DataNode and NameNode manually.

So let's start the article.

First, let's discuss: what is Hadoop? What is a NameNode, and what is a DataNode?

🔰 Hadoop

Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

🔰 DataNode

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, inexpensive hardware that does not need to be high-end or highly available.

🔰 NameNode

NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

So in this module, we launch the Hadoop cluster using an Ansible playbook on AWS. It is not difficult to perform.

Let us understand, step by step, how we can complete task 11.1:

For this, we can use an Amazon Linux AMI available on AWS.


So for this task, we create three instances: one controller node and two managed nodes for configuring the Hadoop cluster.

🔰 11.1 Configure Hadoop and start cluster services using Ansible Playbook

First of all, we install Ansible on the controller node. For that, we switch to the root user on Amazon Linux and install Ansible using the command below:

# sudo amazon-linux-extras install ansible2
Now, we check whether Ansible was installed successfully using the command below:

# ansible --version

(Output: the Ansible version is displayed, confirming a successful installation.)

Creating an inventory on the controller node to manage and configure the other nodes

An Ansible inventory file defines the hosts and groups of hosts upon which commands, modules, and tasks in a playbook operate.

For this, first write the details in the ansible.cfg file as shown below:

# vim /etc/ansible/ansible.cfg

When you open the file, you will find the [defaults] section; write the settings there as shown below.

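The configuration screenshot is not reproduced here; as a minimal sketch, the ansible.cfg entries for this setup could look like this (the inventory path is an assumption, based on the ip.txt file the article uses later):

[defaults]
inventory = /root/ip.txt      # assumed location of the inventory file
host_key_checking = False     # skip the interactive SSH fingerprint prompt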

Now create an inventory file and write the details of the IP address, username, and connection type for each node in the file.

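As a sketch, the ip.txt inventory could look like the following, with placeholder IPs; the group names are assumptions that match how the playbook sketch later in this article addresses the two nodes:

[namenode]
<NameNode-public-IP>  ansible_user=ec2-user  ansible_connection=ssh

[datanode]
<DataNode-public-IP>  ansible_user=ec2-user  ansible_connection=ssh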

On AWS, connectivity is handled a bit differently: we have not provided any security key so far, so the next step of setting up key-based access differs from working with local VMs.

For that, on the controller node, go to the .ssh directory and generate a key pair:

# cd .ssh
# ssh-keygen

Then copy the contents of the generated public key and paste them into the authorized_keys file on each target node.

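A minimal sketch of that step, assuming the default file names generated by ssh-keygen:

On the controller node, print the public key and copy it:

# cat ~/.ssh/id_rsa.pub

On each target node, append the copied line to the authorized_keys file of the user you connect as:

# echo "<public-key-line>" >> ~/.ssh/authorized_keys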

Now come back to the controller node and check the connectivity using the command below:

# ansible all -m ping
(Output: both nodes connected successfully.)
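For each reachable node, Ansible's ping module replies with pong, so the output looks roughly like this (the IP is a placeholder):

<node-IP> | SUCCESS => {
    "changed": false,
    "ping": "pong"
}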

Now, we write the Playbook to configure the Hadoop cluster.

For that, the playbook needs to do the following:

  • Install the JDK.
  • Install Hadoop.
  • Configure the core-site.xml and hdfs-site.xml files.
  • Create the storage directory.
  • Format the NameNode only.
  • Start the Hadoop services on both nodes.
  • Check the report.

I wrote one playbook covering all of these steps and ran it against both target nodes. Of the two target nodes, one is the slave node (DataNode) and the other is the master node (NameNode); we list both of them, under different group names with their IPs, in the ip.txt file. A sketch of such a playbook follows below.

(Screenshots: the playbook setup for the DataNode and the NameNode.)
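Since the screenshots are not reproduced here, below is a minimal sketch of such a playbook. It is not the exact playbook from the repository linked at the end: the installer file names, the /nn and /dn directories, and port 9001 are assumptions for illustration, and the rpm tasks are not idempotent (re-running them reports errors once the packages are installed).

- hosts: all
  become: yes
  tasks:
    - name: Copy the JDK and Hadoop installers to the node
      copy:
        src: "{{ item }}"
        dest: /root/
      loop:
        - jdk-8u171-linux-x64.rpm        # hypothetical installer names
        - hadoop-1.2.1-1.x86_64.rpm
    - name: Install the JDK
      command: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Install Hadoop
      command: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force

- hosts: namenode
  become: yes
  tasks:
    - name: Create the NameNode storage directory
      file:
        path: /nn
        state: directory
    - name: Set the storage directory in hdfs-site.xml
      copy:
        dest: /etc/hadoop/hdfs-site.xml
        content: |
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>/nn</value>
            </property>
          </configuration>
    - name: Set the HDFS address in core-site.xml
      copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://0.0.0.0:9001</value>
            </property>
          </configuration>
    - name: Format the NameNode (only once)
      shell: echo Y | hadoop namenode -format
    - name: Start the NameNode daemon
      command: hadoop-daemon.sh start namenode

- hosts: datanode
  become: yes
  tasks:
    - name: Create the DataNode storage directory
      file:
        path: /dn
        state: directory
    - name: Set the storage directory in hdfs-site.xml
      copy:
        dest: /etc/hadoop/hdfs-site.xml
        content: |
          <configuration>
            <property>
              <name>dfs.data.dir</name>
              <value>/dn</value>
            </property>
          </configuration>
    - name: Point core-site.xml at the NameNode
      copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://<NameNode-IP>:9001</value>
            </property>
          </configuration>
    - name: Start the DataNode daemon
      command: hadoop-daemon.sh start datanode

Run it from the controller node with (hadoop.yml is a placeholder file name):

# ansible-playbook hadoop.yml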

Here is the output of configuring both nodes:

(Screenshots: the playbook run output for both nodes.)
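To verify the cluster from the NameNode (the last step in the checklist above), the standard Hadoop 1.x report command can be used; one live DataNode should appear in its output:

# hadoop dfsadmin -report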

So here, task 11.1 is successfully completed.

Thank you for reading this article.

The complete YAML code is available on my GitHub profile:

https://github.com/Simi16/Configure_Hadoop_using_Playbook

Keep Learning👨🏻‍💻
