Ubuntu box.

The main goal of this tutorial is to get a more sophisticated Hadoop installation up and running, namely a multi-node cluster built from two Ubuntu boxes.

This tutorial has been tested with the following software versions: Ubuntu Linux 1… LTS (deprecated: 8… LTS, 8.04, 7.1…) and Hadoop 1.0.3, released May 2…

Figure 1: Cluster of machines running Hadoop at Yahoo (Source: Yahoo)

Tutorial approach and structure

From two single-node clusters to a multi-node cluster: We will build a multi-node cluster using two Ubuntu boxes. In my humble opinion, the best way to do this for starters is to install, configure and test a Hadoop setup on each of the two Ubuntu boxes, and in a second step to merge these two single-node clusters into one multi-node cluster in which one Ubuntu box will become the designated master but also act as a slave, and the other box will become only a slave. It's much easier to track down any problems this way.

Figure 2: Tutorial approach and structure

Let's get started!

Prerequisites

Configuring single-node clusters first

The tutorial approach outlined above means that you should now read my previous tutorial on how to set up a Hadoop single-node cluster and follow it to build a single-node Hadoop cluster on each of the two Ubuntu boxes. It is recommended that you use the same settings on both machines. Just keep in mind while setting them up that the two machines will later be merged into one cluster.

Done? Let's continue then! Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make one Ubuntu box the master (which will also act as a slave) and the other Ubuntu box a slave.

Note: We will call the designated master machine just the master from now on and the slave-only machine the slave. We will also give the two machines these respective hostnames in their networking setup, most notably in /etc/hosts. If the hostnames of your machines are different, adapt the settings in this tutorial accordingly.

Shut down each single-node cluster with bin/stop-all.sh before continuing.

Networking

This should hardly come as a surprise, but for the sake of completeness I have to point out that both machines must be able to reach each other over the network. The easiest way is to put both machines in the same network. To make it simple, we will assign the IP address 1…
Update /etc/hosts on both machines (master AND slave) so that the hostnames master and slave resolve to the IP addresses assigned above.

SSH access

The hduser user on the master (aka hduser@master) must be able to connect a) to its own user account on the master and b) to the hduser user account on the slave (aka hduser@slave) via a password-less SSH login. If you followed my single-node cluster tutorial, you just have to add the public SSH key of hduser@master (which should be in $HOME/…) to the authorized_keys file of hduser@slave (in that user's $HOME/.ssh/authorized_keys). You can do this manually or use the following SSH command.

Distribute the SSH public key of hduser@master:

hduser@master:~$ ssh-copy-id -i $HOME/…

This command will prompt you for the login password for user hduser on slave, then copy the public SSH key for you.

The final step is to test the SSH setup by connecting with user hduser from the master to the user account hduser on the slave. The step is also needed to save slave's host key fingerprint to the known_hosts file of hduser@master.

So, connect from master to master first. On the first connection SSH reports that the authenticity of host master cannot be verified, prints the host's RSA key fingerprint, and asks "Are you sure you want to continue connecting (yes/no)?". After you confirm, it prints "Warning: Permanently added 'master' (RSA) to the list of known hosts." and logs you in to master.

Then connect from master to slave. The dialogue is the same: SSH shows the RSA key fingerprint of slave, asks "Are you sure you want to continue connecting (yes/no)?", and then prints "Warning: Permanently added 'slave' (RSA) to the list of known hosts."

Hadoop

Cluster Overview (aka the goal)

The next sections will describe how to configure one Ubuntu box as a master node and the other Ubuntu box as a slave node. The master node will also act as a slave because we only have two machines available in our cluster but still want to spread data storage and processing across both of them.

Figure 3: How the final multi-node cluster will look

The master node will run the master daemons for each layer: NameNode for the HDFS storage layer, and JobTracker for the MapReduce processing layer. Both machines will run the slave daemons: DataNode for the HDFS layer, and TaskTracker for the MapReduce processing layer. Basically, the master daemons are responsible for coordination and management of the slave daemons, while the slave daemons do the actual data storage and data processing work.

Masters vs. Slaves

Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively.
These are the actual master nodes. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves or worker nodes. (Source: Hadoop 1.x documentation)

Configuration

conf/masters (master only)

Despite its name, the conf/masters file defines on which machines Hadoop will start secondary NameNodes in our multi-node cluster. In our case, this is just the master machine. The primary NameNode and the JobTracker will always be the machines on which you run the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively. The primary NameNode and the JobTracker will be started on the same machine if you run bin/start-all.sh.

Note: You can also start a Hadoop daemon manually on a machine via bin/hadoop-daemon.sh.

Here are more details regarding the conf/masters file:

The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in the conf/masters file. (Source: Hadoop HDFS user guide)

Again, the machine on which bin/start-dfs.sh is run will become the primary NameNode.

On master, update conf/masters so that it contains a single line with the hostname master.

conf/slaves (master only)

The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will be run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data.

On master, update conf/slaves so that it lists the two hostnames master and slave, one per line.

If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.

Note: The conf/slaves file on master is used only by scripts like bin/start-dfs.sh. For example, if you want to add DataNodes on the fly (which is not described in this tutorial yet), you can manually start the DataNode daemon on a new slave machine via bin/hadoop-daemon.sh. Using the conf/slaves file on the master simply helps you to make full cluster restarts easier.
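To make the "one hostname per line" format concrete, here is a minimal Python sketch that renders and re-reads such a host list. It is only an illustration, not part of Hadoop: the helper names parse_host_list and render_host_list are hypothetical, and the handling of "#" comments and blank lines is an assumption about how the start scripts treat the file.

```python
def parse_host_list(text):
    """Return the hostnames from a conf/slaves-style file (one host per line)."""
    hosts = []
    for line in text.splitlines():
        # Assumption: '#' starts a comment and blank lines are ignored.
        host = line.split("#", 1)[0].strip()
        if host:
            hosts.append(host)
    return hosts


def render_host_list(hosts):
    """Render hostnames back into the one-host-per-line file format."""
    return "".join(f"{host}\n" for host in hosts)


# Hostnames used in this tutorial.
masters = ["master"]          # conf/masters: where the secondary NameNode runs
slaves = ["master", "slave"]  # conf/slaves: where DataNode and TaskTracker run

slaves_file = render_host_list(slaves)
print(slaves_file, end="")

# Adding another slave later is just one more line in conf/slaves.
# "anotherslave01" is a made-up hostname for illustration.
print(render_host_list(slaves + ["anotherslave01"]), end="")
```

Reading the rendered text back with parse_host_list yields the same list of hosts, which is the list a full cluster restart would iterate over.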
conf/*-site.xml (all machines)

You must change the configuration files conf/core-site.xml, conf/mapred-site.xml and conf/hdfs-site.xml on ALL machines as follows.

First, we have to change the fs.default.name parameter (in conf/core-site.xml), which specifies the NameNode (the HDFS master) host and port. In our case, this is the master machine.

conf/core-site.xml (ALL machines), description of fs.default.name:
The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.

Second, we have to change the mapred.job.tracker parameter (in conf/mapred-site.xml), which specifies the JobTracker (the MapReduce master) host and port. Again, this is the master in our case.

conf/mapred-site.xml (ALL machines), description of mapred.job.tracker:
The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.

Third, we change the dfs.replication parameter (in conf/hdfs-site.xml), which specifies the default block replication. It defines how many machines a single file should be replicated to. If you set this to a value higher than the number of available DataNodes, you will start seeing a lot of "Zero targets found, forbidden…" errors.

The default value of dfs.replication is 3. However, we have only two nodes available, so we set dfs.replication to 2.

conf/hdfs-site.xml (ALL machines), description of dfs.replication:
Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.

Additional Settings

There are some other configuration options worth studying. The following information is taken from the Hadoop API Overview.

In file conf/mapred-site.xml:

mapred.local.dir: Determines where temporary MapReduce data is written.
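The three required changes (fs.default.name, mapred.job.tracker and dfs.replication) can be sketched as a small Python script that renders the property entries in Hadoop's configuration/property/name/value XML layout. This is a hedged illustration only: the port numbers below are hypothetical placeholders, not the values used in this tutorial, and the helpers site_xml and safe_replication are made up for this example.

```python
import xml.etree.ElementTree as ET


def site_xml(properties):
    """Render a Hadoop *-site.xml document from a name -> value mapping."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


def safe_replication(wanted, datanode_count):
    """Cap the replication factor at the number of available DataNodes."""
    return max(1, min(wanted, datanode_count))


MASTER = "master"
HDFS_PORT = 9000        # hypothetical placeholder, substitute your own port
JOBTRACKER_PORT = 9001  # hypothetical placeholder, substitute your own port
datanodes = ["master", "slave"]  # both boxes run a DataNode

configs = {
    "conf/core-site.xml": {
        "fs.default.name": f"hdfs://{MASTER}:{HDFS_PORT}",
    },
    "conf/mapred-site.xml": {
        "mapred.job.tracker": f"{MASTER}:{JOBTRACKER_PORT}",
    },
    "conf/hdfs-site.xml": {
        # The default of 3 exceeds our two DataNodes, so it is capped.
        "dfs.replication": safe_replication(3, len(datanodes)),
    },
}

for path, props in configs.items():
    print(path)
    print(site_xml(props))
```

Deriving dfs.replication from the DataNode list keeps the value from exceeding the number of machines that can actually hold a replica, which is the situation the "Zero targets found" errors point to.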