Configure Hadoop and Start Cluster Services Using an Ansible Playbook

Simran Shrivas
Dec 15, 2020

Hello Learner👨🏻‍💻!!!!

Here is one more article. As we all know, Ansible is an intelligent automation tool, so in this article we will configure a Hadoop cluster using an Ansible playbook, without logging in to the DataNode and NameNode manually.

So let's start the article.

First, let's discuss: what is Hadoop? What is a NameNode, and what is a DataNode?

🔰 Hadoop

Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

🔰 DataNode

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, inexpensive hardware that does not need to be high-end or highly available.

🔰 NameNode

NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

So in this module, we launch the Hadoop cluster using an Ansible playbook on AWS. It is not difficult to perform.

Let us understand, step by step, how we can complete task 11.1:

For this, we can use an Amazon Linux AMI available on AWS.


So for this task, we create three instances: one controller node and two managed nodes for configuring the Hadoop cluster.

🔰 11.1 Configure Hadoop and start cluster services using Ansible Playbook

First of all, we install Ansible on the controller node. For that, we switch to the root user on Amazon Linux and install Ansible using the command below:

# sudo amazon-linux-extras install ansible2
Now, we check whether Ansible was installed successfully using the command below:

# ansible --version

(Output: the Ansible version is displayed, confirming a successful installation.)

Creating an inventory on the controller node to manage and configure the other nodes

An Ansible inventory file defines the hosts and groups of hosts upon which commands, modules, and tasks in a playbook operate.

For this, first write the details in the ansible.cfg file as shown below:

# vim /etc/ansible/ansible.cfg

When you open the file, you will find the [defaults] section; write the settings there as shown below.

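The configuration screenshot is not reproduced here; as a minimal sketch, the ansible.cfg entries for this setup could look like this (the inventory path is an assumption, based on the ip.txt file the article uses later):

[defaults]
inventory = /root/ip.txt      # assumed location of the inventory file
host_key_checking = False     # skip the interactive SSH fingerprint prompt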

Now create an inventory file and write the details of the IP address, username, and connection type for each node in the file.

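As a sketch, the ip.txt inventory could look like the following, with placeholder IPs; the group names are assumptions that match how the playbook sketch later in this article addresses the two nodes:

[namenode]
<NameNode-public-IP>  ansible_user=ec2-user  ansible_connection=ssh

[datanode]
<DataNode-public-IP>  ansible_user=ec2-user  ansible_connection=ssh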

On AWS, connectivity is handled a bit differently: we have not provided any security key so far, so the next step of setting up key-based access differs from working with local VMs.

For that, on the controller node, go to the .ssh directory and generate a key pair:

# cd .ssh
# ssh-keygen

Then copy the contents of the generated public key and paste them into the authorized_keys file on each target node.

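A minimal sketch of that step, assuming the default file names generated by ssh-keygen:

On the controller node, print the public key and copy it:

# cat ~/.ssh/id_rsa.pub

On each target node, append the copied line to the authorized_keys file of the user you connect as:

# echo "<public-key-line>" >> ~/.ssh/authorized_keys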

Now come back to the controller node and check the connectivity using the command below:

# ansible all -m ping
(Output: both nodes connected successfully.)
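For each reachable node, Ansible's ping module replies with pong, so the output looks roughly like this (the IP is a placeholder):

<node-IP> | SUCCESS => {
    "changed": false,
    "ping": "pong"
}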

Now, we write the Playbook to configure the Hadoop cluster.

For that, the playbook needs to do the following:

  • Install the JDK.
  • Install Hadoop.
  • Configure the core-site.xml and hdfs-site.xml files.
  • Create the storage directory.
  • Format the NameNode only.
  • Start the Hadoop services on both nodes.
  • Check the report.

I wrote one playbook covering all of these steps and ran it against both target nodes. Of the two target nodes, one is the slave node (DataNode) and the other is the master node (NameNode); we list both of them, under different group names with their IPs, in the ip.txt file. A sketch of such a playbook follows below.

(Screenshots: the playbook setup for the DataNode and the NameNode.)
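Since the screenshots are not reproduced here, below is a minimal sketch of such a playbook. It is not the exact playbook from the repository linked at the end: the installer file names, the /nn and /dn directories, and port 9001 are assumptions for illustration, and the rpm tasks are not idempotent (re-running them reports errors once the packages are installed).

- hosts: all
  become: yes
  tasks:
    - name: Copy the JDK and Hadoop installers to the node
      copy:
        src: "{{ item }}"
        dest: /root/
      loop:
        - jdk-8u171-linux-x64.rpm        # hypothetical installer names
        - hadoop-1.2.1-1.x86_64.rpm
    - name: Install the JDK
      command: rpm -i /root/jdk-8u171-linux-x64.rpm
    - name: Install Hadoop
      command: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force

- hosts: namenode
  become: yes
  tasks:
    - name: Create the NameNode storage directory
      file:
        path: /nn
        state: directory
    - name: Set the storage directory in hdfs-site.xml
      copy:
        dest: /etc/hadoop/hdfs-site.xml
        content: |
          <configuration>
            <property>
              <name>dfs.name.dir</name>
              <value>/nn</value>
            </property>
          </configuration>
    - name: Set the HDFS address in core-site.xml
      copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://0.0.0.0:9001</value>
            </property>
          </configuration>
    - name: Format the NameNode (only once)
      shell: echo Y | hadoop namenode -format
    - name: Start the NameNode daemon
      command: hadoop-daemon.sh start namenode

- hosts: datanode
  become: yes
  tasks:
    - name: Create the DataNode storage directory
      file:
        path: /dn
        state: directory
    - name: Set the storage directory in hdfs-site.xml
      copy:
        dest: /etc/hadoop/hdfs-site.xml
        content: |
          <configuration>
            <property>
              <name>dfs.data.dir</name>
              <value>/dn</value>
            </property>
          </configuration>
    - name: Point core-site.xml at the NameNode
      copy:
        dest: /etc/hadoop/core-site.xml
        content: |
          <configuration>
            <property>
              <name>fs.default.name</name>
              <value>hdfs://<NameNode-IP>:9001</value>
            </property>
          </configuration>
    - name: Start the DataNode daemon
      command: hadoop-daemon.sh start datanode

Run it from the controller node with (hadoop.yml is a placeholder file name):

# ansible-playbook hadoop.yml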

Here is the output of configuring both nodes:

(Screenshots: the playbook run output for both nodes.)
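To verify the cluster from the NameNode (the last step in the checklist above), the standard Hadoop 1.x report command can be used; one live DataNode should appear in its output:

# hadoop dfsadmin -report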

So here, task 11.1 is successfully completed.

Thank you for reading this article.

The complete YAML code is available on my GitHub profile:

https://github.com/Simi16/Configure_Hadoop_using_Playbook

Keep Learning👨🏻‍💻
