Planning to install a multi-node Hadoop cluster and confused about which Hadoop platform to install, how to install it, and which components to choose?
Installing a multi-node Hadoop cluster for production can be overwhelming because of the number of services involved in the different Hadoop platforms.
Three flavors of Hadoop distribution are available in the market:
- Apache Hadoop.
- Hortonworks Data Platform (HDP).
- Cloudera Hadoop.
In this article, we will see in detail how to build a production-grade multi-node Hadoop cluster from scratch on CentOS 7.
Before we proceed further, please check the prerequisites below, which need to be fulfilled before beginning the installation.
The Ambari host should have at least 1 GB RAM, with 500 MB free. To check the available memory on any host, run:
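The command itself is not shown in the source; a common way to check on CentOS 7, for example:

```shell
# Show total, used, and free memory in megabytes
free -m
```

Look at the "free" (and "available") columns of the Mem row to confirm at least 500 MB is free.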
Maximum Open Files Requirements
The recommended maximum number of open file descriptors is 10000, or more. To check the current value set for the maximum number of open file descriptors, execute the following shell commands on each host:
ulimit -Sn
ulimit -Hn
If the output is not greater than 10000, run the following command to set it to a suitable default:
ulimit -n 10000
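Note that ulimit -n only changes the limit for the current shell session. To persist it across logins (an addition not covered in the source, but the usual approach on CentOS 7), you can append entries to /etc/security/limits.conf:

```shell
# Persist the open-files limit for all users; takes effect at next login
echo "* soft nofile 10000" >> /etc/security/limits.conf
echo "* hard nofile 10000" >> /etc/security/limits.conf
```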
Check hostname and FQDN
Please ensure each host is configured with a complete FQDN, and that the FQDN resolves with both forward and reverse DNS lookup queries.
Setup Password-Less SSH
Passwordless SSH needs to be set up from the host where you are going to install the Ambari server to the target hosts, which will be DataNodes, Secondary NameNodes, or hosts for other HDP services.
Note: This process should be completed with the user you are going to use for the Hadoop installation and the ambari-server setup. If it is a non-root user, you have to follow a slightly longer process and update some commands and configuration for the ambari-agents in the sudoers file.
1. Generate public and private SSH keys on the Ambari Server host.
2. Copy the SSH Public Key (id_rsa.pub) to the root account on your target hosts.
3. Add the SSH Public Key to the authorized_keys file on your target hosts.
Note: This step should also be completed for the host running the Ambari server itself, in addition to the other target hosts.
cat id_rsa.pub >> authorized_keys
4. Depending on your version of SSH, you may need to set permissions on the .ssh directory (to 700) and the authorized_keys file in that directory (to 600) on the target hosts.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
5. From the Ambari Server, make sure you can connect to each host in the cluster using SSH, without having to enter a password.
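The five steps above can be sketched as one sequence run from the Ambari Server host (the node*.example.com hostnames are placeholders; ssh-copy-id conveniently performs steps 2-4, including the permission fixes, on most systems):

```shell
# 1. Generate a key pair on the Ambari Server host (empty passphrase for automation)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# 2-4. Copy the public key into authorized_keys on each target host
for host in node1.example.com node2.example.com node3.example.com; do
    ssh-copy-id "root@$host"
done

# 5. Verify passwordless login works from the Ambari Server
ssh root@node1.example.com hostname -f
```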
Enable NTP on the Cluster and on the Browser Host
The clocks of all the nodes in your cluster and the machine that runs the browser through which you access the Ambari Web interface must be able to synchronize with each other.
To install the NTP service and ensure it’s started on boot, run the following commands on each host:
yum install -y ntp
systemctl enable ntpd
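You may also want to start the service immediately and confirm that the host is actually synchronizing (ntpq ships with the ntp package):

```shell
systemctl start ntpd   # start now, in addition to enabling at boot
ntpq -p                # list the peers the daemon is polling
```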
Check DNS and NSCD (Name Service Caching Daemon)
Update the IP address and host FQDN in the /etc/hosts file on each host. Add a line of the form:
<ip.address> <fully.qualified.domain.name> <short-hostname>
hostname <fully.qualified.domain.name>
hostname -f
The hostname -f command should return the FQDN you just set.
Edit the Network Configuration File
Modify the HOSTNAME property to set the fully qualified domain name.
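The source does not name the file; on CentOS 7 this is typically /etc/sysconfig/network. A minimal sketch, with a placeholder FQDN:

```shell
# /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=<fully.qualified.domain.name>
```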
For Ambari to communicate during setup with the hosts it deploys to and manages, certain ports must be open and available.
systemctl disable firewalld
service firewalld stop
Disable SELinux and PackageKit and check the umask Value
You must disable SELinux for the Ambari setup to function. On each host in your cluster, enter:
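The command is omitted in the source; per the standard Ambari prerequisite, SELinux can be set to permissive for the current boot and disabled permanently via its config file:

```shell
setenforce 0                                                  # permissive until next reboot
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config  # persist across reboots
```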
UMASK (User Mask or User file creation MASK) sets the default (base) permissions granted when a new file or folder is created on a Linux machine. Most Linux distros set 022 as the default umask value. A umask value of 022 results in permissions of 755 for new folders (new files, which start from a base of 666, get 644).
A umask value of 027 results in permissions of 750 for new folders (640 for new files). Ambari, HDP, and HDF support umask values of 022 (0022 is functionally equivalent) and 027 (0027 is functionally equivalent). These values must be set on all hosts.
Setting the umask for your current login session:
umask 0022
Checking your current umask:
umask
Permanently changing the umask for all interactive users:
echo umask 0022 >> /etc/profile
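A quick demonstration of the effect (the /tmp paths here are just illustrative scratch names):

```shell
# With umask 0022, a newly created file gets 644 and a new directory gets 755
umask 0022
touch /tmp/umask_demo_file
mkdir -p /tmp/umask_demo_dir
stat -c '%a' /tmp/umask_demo_file   # expect 644 for a fresh file
stat -c '%a' /tmp/umask_demo_dir    # expect 755 for a fresh directory
rm -rf /tmp/umask_demo_file /tmp/umask_demo_dir
```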
If you do not have internet access in your environment, you will have to follow the procedure to set up a local repository.
If you have internet access, please follow the below procedure.
Downloading Ambari Repositories RHEL/CentOS/Oracle Linux 7.
1. Log in to your host as root.
2. Download the Ambari repository file to a directory on your installation host.
wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.0.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
Install the Ambari Server
Install the Ambari bits. This also installs the default PostgreSQL Ambari database.
Note: This should be done with sudo or root user.
yum install ambari-server
Set Up the Ambari Server
Before starting the Ambari Server, you must set up the Ambari Server. Setup configures Ambari to talk to the Ambari database, installs the JDK and allows you to customize the user account the Ambari Server daemon will run as.
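The setup itself is a single interactive command run as root (or with sudo); it prompts for the JDK choice, the database, and the daemon user account mentioned above:

```shell
ambari-server setup
```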
Note: If you wish to run the Ambari server as a non-root user, you have to configure /etc/sudoers and add some configuration entries to allow non-root users to run the Ambari server.
Note: By default, Ambari Server runs under root. Accept the default (n) at the Customize user account for ambari-server daemon prompt, to proceed as root.
For a non-root user, add the following configuration to the /etc/sudoers file on each target host.
# hadoopadmin: Customizable Users
hadoopadmin ALL=(ALL) NOPASSWD:SETENV: /bin/su hdfs *,/bin/su ambari-qa *,/bin/su ranger *,/bin/su zookeeper *,/bin/su knox *,/bin/su falcon *,/bin/su ams *,/bin/su flume *,/bin/su hbase *,/bin/su spark *,/bin/su accumulo *,/bin/su hive *,/bin/su hcat *,/bin/su kafka *,/bin/su mapred *,/bin/su oozie *,/bin/su sqoop *,/bin/su storm *,/bin/su tez *,/bin/su atlas *,/bin/su yarn *,/bin/su kms *,/bin/su activity_analyzer *,/bin/su livy *,/bin/su zeppelin *,/bin/su infra-solr *,/bin/su logsearch *,/bin/su druid *,/bin/su superset *

# hadoopadmin: Core System Commands
hadoopadmin ALL=(ALL) NOPASSWD:SETENV: /usr/bin/yum,/usr/bin/zypper,/usr/bin/apt-get, /bin/mkdir, /usr/bin/test, /bin/ln, /bin/ls, /bin/chown, /bin/chmod, /bin/chgrp, /bin/cp, /usr/sbin/setenforce, /usr/bin/stat, /bin/mv, /bin/sed, /bin/rm, /bin/kill, /bin/readlink, /usr/bin/pgrep, /bin/cat, /usr/bin/unzip, /bin/tar, /usr/bin/tee, /bin/touch, /usr/bin/mysql, /sbin/service mysqld *, /usr/bin/dpkg *, /bin/rpm *, /usr/sbin/hst *, /sbin/service rpcbind *, /sbin/service portmap *

# hadoopadmin: Hadoop and Configuration Commands
hadoopadmin ALL=(ALL) NOPASSWD:SETENV: /usr/bin/hdp-select, /usr/bin/conf-select, /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh, /usr/lib/hadoop/bin/hadoop-daemon.sh, /usr/lib/hadoop/sbin/hadoop-daemon.sh, /usr/bin/ambari-python-wrap *

# hadoopadmin: System User and Group Commands
hadoopadmin ALL=(ALL) NOPASSWD:SETENV: /usr/sbin/groupadd, /usr/sbin/groupmod, /usr/sbin/useradd, /usr/sbin/usermod

# hadoopadmin: Knox Commands
hadoopadmin ALL=(ALL) NOPASSWD:SETENV: /usr/bin/python2.6 /var/lib/ambari-agent/data/tmp/validateKnoxStatus.py *, /usr/hdp/current/knox-server/bin/knoxcli.sh

# hadoopadmin: Ranger Commands
hadoopadmin ALL=(ALL) NOPASSWD:SETENV: /usr/hdp/*/ranger-usersync/setup.sh, /usr/bin/ranger-usersync-stop, /usr/bin/ranger-usersync-start, /usr/hdp/*/ranger-admin/setup.sh *, /usr/hdp/*/ranger-knox-plugin/disable-knox-plugin.sh *, /usr/hdp/*/ranger-storm-plugin/disable-storm-plugin.sh *, /usr/hdp/*/ranger-hbase-plugin/disable-hbase-plugin.sh *, /usr/hdp/*/ranger-hdfs-plugin/disable-hdfs-plugin.sh *, /usr/hdp/current/ranger-admin/ranger_credential_helper.py, /usr/hdp/current/ranger-kms/ranger_credential_helper.py, /usr/hdp/*/ranger-*/ranger_credential_helper.py

# hadoopadmin: Infra and LogSearch Commands
hadoopadmin ALL=(ALL) NOPASSWD:SETENV: /usr/lib/ambari-infra-solr/bin/solr *, /usr/lib/ambari-logsearch-logfeeder/run.sh *, /usr/sbin/ambari-metrics-grafana *, /usr/lib/ambari-infra-solr-client/solrCloudCli.sh *

# Sudo defaults for the Ambari agents
Defaults exempt_group = hadoopadmin
Defaults !env_reset,env_delete-=PATH
Defaults: hadoopadmin !requiretty
Start the Ambari Server
- Run the following command on the Ambari Server host:
ambari-server start
Note: Start the ambari-server with the user you configured in the step above.
- To check the Ambari Server processes:
ambari-server status
- To stop the Ambari Server:
ambari-server stop
Once setup completes and the Ambari server has started successfully, the next step is to log in to the Ambari console at http://<ambari-server-host>:8080. The default credentials for the Ambari console are admin/admin.
When you log in to the Ambari console for the first time, there will be an option called Launch Install Wizard; just click on it.
Name your cluster and select version
Choose a single repository for your OS and remove the other links. In our case, keep redhat7 and remove all the others.
Check the "Skip Repository Base URL Validation" option.
On the Install Options page, enter your target hostnames, one per line, to set up a multi-node cluster.
In the private key field, provide the SSH key you created on the Ambari host for the user you are going to use to install and connect to the target hosts.
Enter the username and port, then click Next.
Ambari will then try to install the agent on the remote target hosts; if everything is configured correctly, this succeeds. It may, however, fail with the error below.
ERROR 2017-07-21 14:33:56,892 NetUtil.py:84 - EOF occurred in violation of protocol (_ssl.c:765)
ERROR 2017-07-21 14:33:56,892 NetUtil.py:85 - SSLError: Failed to connect. Please check openssl library versions.
Update the file /etc/ambari-agent/conf/ambari-agent.ini under the [security] header.
sudo vi /etc/ambari-agent/conf/ambari-agent.ini
Add below setting under [security] header.
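The setting itself is missing from the source; the commonly documented workaround for this SSLError on hosts with a newer OpenSSL (an assumption here; verify against your Ambari version's documentation) is to force TLSv1.2 for agent-server communication and restart the agent:

```shell
# Append under the [security] header in /etc/ambari-agent/conf/ambari-agent.ini:
#   force_https_protocol=PROTOCOL_TLSv1_2
# then restart the agent so the change takes effect:
ambari-agent restart
```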
After this step, select the Hadoop services you wish to install in your cluster, assign the nodes for the master services, and click Next.
Assign the nodes for slaves and clients and click next.
At the Customize Services step, you can define many properties, such as the username and password for the Hive Metastore DB and the Oozie DB. You can also set the NameNode directories and the DataNode directory paths.
Please note: when entering directory paths for DataNode machines, always use non-LVM-based mount points. Click Next.
Under Review Section check the summary presented and click Deploy.
Log in to the Ambari server and check whether all Hadoop services are running successfully. If there are alarms, you have to fix the individual components and restart the affected services.
So, the installation initially looks complex, but Ambari makes it simpler.
If you were to install each component manually on every target host, it would consume a lot of time and be very complex.
With the second approach, where we install Ambari first and then deploy the HDP cluster through Ambari, all that hassle is taken care of by Ambari.