There are different options for playing around with Hadoop. The easiest way to get it up and running quickly is to use a distribution. This is basically a pre-packaged version of the "Hadoop Ecosystem" with batteries included (configurations pre-set, applications pre-installed). Otherwise you would have to install Java, install Hadoop, and set various environment variables yourself: all things you are probably not very concerned with when first learning Hadoop and simply wanting to play around in it.
Three main distribution vendors exist: Cloudera, Hortonworks and MapR. Of these, Cloudera is the oldest, Hortonworks distributes a 100% open source distribution of Apache Hadoop, and MapR has created their own version of the MapReduce component of Apache Hadoop.
Hortonworks offers a 'sandbox' for learning Hadoop and its derivatives, and we will be using that to learn the environment. To keep things consistent we will use Docker and a Docker image of the Hortonworks Sandbox (the standalone HDP distribution). This way we can switch distributions quickly in the future, and we do not have to install the Hadoop Ecosystem on our own machine but run a virtual image instead.
If you do not have Docker installed, you can download and install it from the Docker CE download page (Windows/Mac/Linux versions). For more information, see the Docker documentation.
We will mainly be following the installation guide provided by Hortonworks; the steps below are described there in more detail.
Step 1 ~ Download the distro:
Download the Hortonworks Hadoop Sandbox called HDP (Hortonworks Data Platform), not to be confused with HDF (Hortonworks DataFlow, a different distribution for streaming data)! I am using Windows, so I downloaded the Docker Windows image/configuration file. This results in a ZIP-file (in my case called start-sandbox-hdp-standalone_2-6-4.ps1.zip) containing a PowerShell script (.ps1). Unzip this ZIP-file. For Linux/Mac, the downloaded ZIP-file contains the equivalent SH script.
Step 2 ~ Install Docker and run the distro:
For step 2, Docker CE needs to be installed and running (see the download link above). For users of the Docker GUI (Windows and Mac only), Hortonworks recommends increasing the amount of RAM assigned to Docker to a minimum of 8 GB. Assuming you have done that, you can now run the script that you downloaded in step 1.
Change directory to the location of the script (the .sh or .ps1 file).
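Before running the script, it is worth checking that the Docker CLI is actually available. A minimal sketch (it checks only for the CLI on the PATH, not that the Docker daemon itself is running):

```shell
# Sanity check before running the sandbox script: verify the Docker CLI
# is installed; print a hint otherwise.
if command -v docker >/dev/null 2>&1; then
    DOCKER_CHECK="$(docker --version)"
else
    DOCKER_CHECK="Docker not found: install Docker CE first"
fi
echo "$DOCKER_CHECK"
```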
For Mac/Linux, just run:
sh start-sandbox-hdp-standalone_{version}.sh
Replace {version} with the version number of the specific distro, contained in the filename.
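To illustrate, the placeholder is just the version string embedded in the downloaded filename. A small shell sketch, assuming the Linux script follows the same naming as the Windows download from step 1:

```shell
# Derive the {version} part from the downloaded filename. The filename
# below is an example; adjust it to whatever you actually downloaded.
FILE="start-sandbox-hdp-standalone_2-6-4.sh"
VERSION="${FILE#start-sandbox-hdp-standalone_}"   # strip the prefix
VERSION="${VERSION%.sh}"                          # strip the extension
echo "sh start-sandbox-hdp-standalone_${VERSION}.sh"
```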
For Windows users:
You have to run the PowerShell script with a flag that bypasses the default script-execution policy, since downloaded scripts are blocked by default:
powershell -ExecutionPolicy ByPass -File start-sandbox-hdp-standalone_{version}.ps1
This process starts the Docker containers, loads config files and gets all the applications within the sandbox environment (HDFS, YARN, Ambari) up and running. It could take a while depending on your hardware specifications and the amount of resources assigned to Docker. When finished, the output should display:
Started Hortonworks HDP container
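If you want to verify that the container is actually up, you can ask Docker directly. A hedged sketch (the container name contains 'sandbox' with the default script; yours may differ):

```shell
# List running containers whose name contains 'sandbox'; falls back to a
# message when the Docker CLI is not available or nothing is running.
if command -v docker >/dev/null 2>&1; then
    SANDBOX_STATUS="$(docker ps --filter name=sandbox --format '{{.Names}}: {{.Status}}')"
    [ -n "$SANDBOX_STATUS" ] || SANDBOX_STATUS="no sandbox container running"
else
    SANDBOX_STATUS="docker CLI not available"
fi
echo "$SANDBOX_STATUS"
```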
Step 3 ~ Connect to Hadoop:
Now that the Hadoop Sandbox is up and running, we should be able to connect to it. This and the following steps are also described in the Hortonworks connection guide. The Docker container that has been spun up is forwarding ports to localhost. We can connect to the Sandbox with SSH on localhost port 2222 (alternatively you can map the container's IP address to a desired hostname, as the guide specifies). On Windows you can use PuTTY or a different SSH client (I am using Cmder, which has SSH capabilities built in).
ssh root@localhost -p 2222
This prompts for the root password, which is hadoop by default. It will then prompt you to change this password; remember or write down the new one!
Another way to connect to the HDP Sandbox is the web terminal: browsing to http://localhost:4200 should display a terminal in the browser that you can log in to just like over SSH. Once logged in, the prompt should look like this:
[root@sandbox-hdp ~]#
We should now be inside the Sandbox CLI environment, ready to run commands! For instance, we could list the folders in the '/' directory of the Hadoop filesystem:
[root@sandbox-hdp /]# hadoop fs -ls /
Found 12 items
drwxrwxrwx   - yarn   hadoop          0 2018-04-10 12:53 /app-logs
drwxr-xr-x   - hdfs   hdfs            0 2018-02-01 10:32 /apps
drwxr-xr-x   - yarn   hadoop          0 2018-02-01 10:24 /ats
drwxr-xr-x   - hdfs   hdfs            0 2018-02-01 10:39 /demo
drwxr-xr-x   - hdfs   hdfs            0 2018-02-01 10:24 /hdp
drwx------   - livy   hdfs            0 2018-02-01 10:27 /livy2-recovery
drwxr-xr-x   - mapred hdfs            0 2018-02-01 10:24 /mapred
drwxrwxrwx   - mapred hadoop          0 2018-02-01 10:25 /mr-history
drwxr-xr-x   - hdfs   hdfs            0 2018-02-01 10:24 /ranger
drwxrwxrwx   - spark  hadoop          0 2018-04-12 11:16 /spark2-history
drwxrwxrwx   - hdfs   hdfs            0 2018-02-01 10:47 /tmp
drwxr-xr-x   - hdfs   hdfs            0 2018-04-10 08:58 /user
Congratulations, you have successfully spun up Hadoop as a single-node cluster and executed your first command!
There are also other users/roles, predefined by Hortonworks, that we could switch to. Browse the different users to get a feel for the common roles and rights of different Hadoop users.
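As a sketch of how to do that (run inside the sandbox; the user names are HDP defaults), you could switch to the hdfs superuser and inspect the per-user home directories. Guarded so it only runs where the hdfs client exists:

```shell
# Switch to the hdfs superuser and list home directories under /user.
# Only meaningful inside the sandbox; elsewhere it prints a hint instead.
if command -v hdfs >/dev/null 2>&1; then
    USERS_OUT="$(su - hdfs -c 'hdfs dfs -ls /user')"
else
    USERS_OUT="hdfs client not found: run this inside the sandbox"
fi
echo "$USERS_OUT"
```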
Step 4 ~ Login to Ambari:
Hortonworks bundles its HDP distribution with Ambari to manage the cluster. Ambari is a tool that manages the different nodes and gives users a web interface to run commands, handle files, and monitor usage of the Hadoop cluster.
A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Apache Ambari takes the guesswork out of operating Hadoop. Ambari makes Hadoop management simpler by providing a consistent, secure platform for operational control. Ambari provides an intuitive Web UI as well as a robust REST API, which is particularly useful for automating cluster operations.
~ https://hortonworks.com/apache/ambari/
We can use Ambari for a lot of operations that are harder to visualize on the CLI. On the HDP distro, it is hosted on port 8080 and a tutorial-section of the website is hosted on port 8888. Connecting to http://localhost:8888 in your browser takes you to this tutorial/friendly-setup page. Here you can either follow a tutorial (enable pop-ups) or view the other applications that are hosted by the HDP container.
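To see which ports the container actually forwards, you can ask Docker. A small sketch, assuming the default container name sandbox-hdp (yours may differ):

```shell
# Show the port mappings of the sandbox container; prints a hint when the
# container (or the Docker CLI itself) is not available.
PORTS="$(docker port sandbox-hdp 2>/dev/null \
    || echo 'sandbox-hdp container not running')"
echo "$PORTS"
```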
Ambari welcome page
Click the "Launch Dashboard" button on the left.
This should take you to the login screen of Ambari. We now need to create the password for the admin user. To do that, switch back to the terminal that connects us to our Hadoop Sandbox; if you closed it, reconnect by typing ssh root@localhost -p 2222 and providing the password you created previously. Then, logged in as root, execute the following command:
ambari-admin-password-reset
This prompts you to create a new password and confirm it. The web server will then restart, after which it listens on port 8080 again. When the prompt confirms this, you can go back to the login screen and log in with user admin and the password you just created. This should take you to the Ambari Dashboard.
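Besides the web UI, Ambari also exposes a REST API on the same port. A hedged smoke test (replace yourpassword with the password you just set; /api/v1/clusters is part of Ambari's standard REST API):

```shell
# Query Ambari's REST API for the list of clusters; prints a hint when
# Ambari is not reachable on localhost:8080.
AMBARI_OUT="$(curl -s -u admin:yourpassword http://localhost:8080/api/v1/clusters 2>/dev/null \
    || echo 'Ambari not reachable on localhost:8080')"
echo "$AMBARI_OUT"
```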