################################### CS435 Cluster Installation Tutorial 2 ###################################

1. Copy the VM from the previous tutorial into a new VM. You may call this copy master. We will make
changes to this VM so that it becomes part of a cluster consisting of 1 master node and 3 slave nodes.

#########################################################################################################

2. Rename the machine to hadoop1 (this will be the master). Use gedit to do the following

sudo gedit /etc/hostname

#This opens the file containing the name of your machine. Type hadoop1 and save the file.
#Log out and log back in for the change to take effect.

#########################################################################################################

3. Re-configure hadoop

#configure hadoop so it knows which hosts are workers
gedit /usr/local/hadoop/etc/hadoop/workers

#add the following to this file
hadoop1
hadoop2
hadoop3
hadoop4

#save the file.

<---------------------------<---------------------------<--------------------------->

#we have already configured the hadoop xml files. You need to edit the core-site.xml file.
#Change the value of the fs.defaultFS property to hdfs://hadoop1:9000

<---------------------------<---------------------------<--------------------------->

Save the file.

<---------------------------<---------------------------<--------------------------->

#make sure the tmp directories are clean. Run the following in a terminal
cd /usr/local/hadoop_tmp
rm -R *
mkdir n
mkdir d
ls -al
chmod 755 n
chmod 755 d

#This cleans the tmp folders so we can work on the cluster.

#########################################################################################################

4. Let's prepare the network package and tools. Use apt to install the net-tools package.

sudo apt install net-tools

#check the IP address of the host
ifconfig

#look for your ethernet interface entry. It is usually eth0 or ens33 or similar.

#we will set up our nodes with these IP addresses
#192.168.5.131   hadoop1   which is the master
#192.168.5.132   hadoop2   which is a slave
#192.168.5.133   hadoop3   which is a slave
#192.168.5.134   hadoop4   which is a slave

#########################################################################################################
##########################################  I M P O R T A N T  ##########################################
#########################################################################################################

5. Now change the IP address of your host

sudo ifconfig ens33 192.168.5.131

<---------------------------<---------------------------<--------------------------->

#we can edit the interfaces file so the changes become permanent
sudo gedit /etc/network/interfaces

#type the following
auto lo
iface lo inet loopback

auto ens33
iface ens33 inet static
address 192.168.5.131
netmask 255.255.255.0

<---------------------------<---------------------------<--------------------------->

# we will now reset the hosts file
sudo gedit /etc/hosts

#type in the following to overwrite the existing info
127.0.0.1       localhost
192.168.5.131   hadoop1
192.168.5.132   hadoop2
192.168.5.133   hadoop3
192.168.5.134   hadoop4

<---------------------------<---------------------------<--------------------------->
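#Optional: before rebooting in the next step, you can double-check the files you just edited. This is
#only a quick sanity-check sketch; it assumes the interface name ens33 from step 4 (adjust if yours differs).

cat /etc/hostname                #should contain hadoop1
getent hosts hadoop2             #should print 192.168.5.132  hadoop2 (resolved from /etc/hosts)
ifconfig ens33 | grep "inet "    #should show inet 192.168.5.131

#If any of these do not match, re-open the corresponding file above and correct it before continuing.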
#########################################################################################################

6. Reboot the machine so the changes take effect

sudo reboot now

#ssh to the machine once to confirm that passwordless ssh works
ssh hadoop1

#########################################################################################################

7. Now shut down your VM. This VM is an Ubuntu host that serves as a node in the hadoop cluster.
On your host machine, make 3 copies of this node/VM. Change the name of each of these VMs appropriately.

#########################################################################################################
#########################################################################################################
#########################################################################################################

8. The following are instructions to prepare the worker node. Repeat the same instructions for
hadoop2, hadoop3 and hadoop4. Start the VM. Log in as before, and make the following changes:

#change the machine hostname to hadoop2, where 2 is the slave number
sudo nano /etc/hostname

#The nano editor opens. Change the name to hadoop2. Press Ctrl+O (or Ctrl+S) to save, then Ctrl+X to quit.

# check the name of your machine
hostname

#It should show hadoop2

###################################

9. For this VM, we will change the network settings:

sudo ifconfig ens33 192.168.5.132

#note, we changed the IP address to 192.168.5.132

###################################

10. Test if you can ssh to this machine

ssh hadoop2

#check the IP address
ifconfig

#The IP should be 192.168.5.132

#########################################################################################################
##########  Repeat steps 8-9-10 for VM3 and VM4 with hostnames hadoop3 and hadoop4  ##################
#########################################################################################################

11. We assume that all 4 of the VMs are running on your host machine. We will now enter hadoop1,
which serves as the master. We will connect to the other machines using ssh.

ssh hadoop2

#This connects you to hadoop2. To go back -> exit
#Test this for all machines hadoop1, 2, 3 and 4.

#########################################################################################################

12. Start up the cluster

#go to the hadoop1 master machine and format the namenode
hdfs namenode -format

#make sure there are no errors. If all is well, start the cluster

#start hdfs and yarn
start-all.sh

#Once the prompt becomes available do:
jps

# You will see a list with NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager on hadoop1 (node-master)
# Switch to any other worker VM; jps will list a DataNode and a NodeManager on each of hadoop2, hadoop3 and hadoop4.

#########################################################################################################

13. You can see the web UI here:

#for hdfs, open a browser and type
http://hadoop1:9870/

#for yarn
http://hadoop1:8088/
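#Optional: besides the web UI, you can check the daemons on every node from hadoop1 in one pass instead
#of switching VMs. This is only a sketch; it assumes the passwordless ssh set up earlier. If jps is not
#found over ssh, replace it with the full path to jps under your JDK's bin directory.

for node in hadoop1 hadoop2 hadoop3 hadoop4; do
    echo "== $node =="
    ssh $node jps
done

#hadoop1 should list the five daemons from step 12; hadoop2, hadoop3 and hadoop4 should each list a
#DataNode and a NodeManager.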
###################################

14. You are familiar with the pi program run in the first tutorial; here we run the MapReduce wordcount program

#Let's make some folders and files in hdfs
hdfs dfs -mkdir books
hdfs dfs -ls -R /

#This will make the directory books and show all files therein

<---------------------------<---------------------------<--------------------------->

#download books from the Project Gutenberg website
#assuming that you downloaded these files: alice.txt holmes.txt frankenstein.txt
hdfs dfs -put alice.txt holmes.txt frankenstein.txt books

#this will copy the 3 files to the dfs in the books folder

#let's see the directory
hdfs dfs -ls books

<---------------------------<---------------------------<--------------------------->

#run the wordcount program. This will read all the files in the dfs books/ folder and write the result to the output folder
hadoop jar hadoop-mapreduce-examples-3.3.6.jar wordcount "books/*" output

<---------------------------<---------------------------<--------------------------->

#download the output folder from the dfs. It will create the folder /home/hadoop1/output
hdfs dfs -get output /home/hadoop1/output

#use gedit to open the resulting file from the output folder

#########################################################################################################

15. Cluster status reports and shutdown

#get a report on your hdfs
hdfs dfsadmin -report

#check yarn cluster details
yarn node -list

#########################################################################################################

16. Closing the cluster.

#stop the cluster safely
stop-all.sh

###################################
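#Note: when you bring the cluster back up later, do not run hdfs namenode -format again; reformatting
#generates a new cluster ID and the datanodes may refuse to register. A minimal restart sketch:

start-all.sh     #run on hadoop1; starts hdfs and yarn on every node listed in the workers file
jps              #verify the daemons as in step 12
stop-all.sh      #stop the cluster safely again when you are done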