Set up Apache Spark 1.6 cluster on CentOS 6.8

Apache Spark 3-node Cluster (CentOS 6.8)


Edit Hosts file (all machines)

This allows us to refer to other nodes using names instead of IP addresses.

  1. sudo nano /etc/hosts
  2. Example on master:
    127.0.0.1 master localhost
    192.168.1.5 slave01
    192.168.1.6 slave02
  3. Control+X, then Y to save.
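
For reference, each worker needs the mirror-image entries. Assuming the master's address is 192.168.1.4 (the value used later for SPARK_MASTER_IP), slave01's /etc/hosts might look like:
    127.0.0.1 slave01 localhost
    192.168.1.4 master
    192.168.1.6 slave02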

Add and copy SSH keys (all machines)

So the master can start Spark on the other servers over SSH without repeatedly prompting for a password.

  1. ssh-keygen
  2. Hit enter. Do not specify a file name.
  3. Press enter again twice to skip passphrase.
  4. Copy the key to every other VM: ssh-copy-id youruser@slave01 (repeat for slave02). Note: if the master cannot connect to a worker, check the permissions on the home and SSH directories (commands below). For the example user ufo:
    /home/ufo should be mode 700
    /home/ufo/.ssh should be mode 700
    /home/ufo/.ssh/authorized_keys should be mode 600
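
If those permissions are off, the following commands reset them; this is a minimal sketch assuming the ufo account from the note above, so substitute your own user:
    chmod 700 /home/ufo
    chmod 700 /home/ufo/.ssh
    chmod 600 /home/ufo/.ssh/authorized_keys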

Firewall (all machines)

  1. Review the current firewall rules: sudo iptables -vnL --line-numbers
  2. Spark requires ports 4040, 6066, 7077, 8080, and 8081 to be open. The ACCEPT rules for these ports must sit above the catch-all REJECT rule, which in this setup is usually line 5.
  3. Repeat the following for each port on all machines (or use the loop shown after this list):
    sudo iptables -I INPUT 5 -i eth0 -p tcp --dport 8080 -m state --state NEW,ESTABLISHED -j ACCEPT
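
As a shortcut, the same rule can be inserted for every required port and then persisted across reboots. This is a sketch assuming the stock CentOS 6 iptables service and the eth0 interface used above:
    for port in 4040 6066 7077 8080 8081; do
      sudo iptables -I INPUT 5 -i eth0 -p tcp --dport $port -m state --state NEW,ESTABLISHED -j ACCEPT
    done
    sudo service iptables save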

Install Scala (all machines)

Default language for Spark

  1. wget http://downloads.typesafe.com/scala/2.11.7/scala-2.11.7.tgz
  2. tar xvf scala-2.11.7.tgz
  3. sudo mv scala-2.11.7 /usr/lib
  4. sudo ln -s /usr/lib/scala-2.11.7 /usr/lib/scala
  5. export PATH=$PATH:/usr/lib/scala/bin
  6. Verify the installation: scala -version
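
The export in step 5 only lasts for the current shell session. One way to make it permanent is to append it to ~/.bashrc (adjust for your shell of choice):
    echo 'export PATH=$PATH:/usr/lib/scala/bin' >> ~/.bashrc
    source ~/.bashrc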

Install Apache Spark 1.6.0 (all machines)

The star of the party

  1. wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
  2. tar xvf spark-1.6.0-bin-hadoop2.6.tgz
  3. export SPARK_HOME=$HOME/spark-1.6.0-bin-hadoop2.6
  4. export PATH=$PATH:$SPARK_HOME/bin
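
As with the Scala PATH above, these exports can be added to ~/.bashrc so they survive a new login. Before wiring up the cluster, you can also sanity-check the download by running one of the bundled examples in local mode; it should print an approximate value of Pi:
    cd $SPARK_HOME
    ./bin/run-example SparkPi 10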

Add slaves file (master)

So the primary node can start workers remotely

  1. cd spark-1.6.0-bin-hadoop2.6/conf
  2. touch slaves
  3. sudo nano slaves
  4. Enter the following:
    slave01
    slave02
  5. Control+X, then Y to save.
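
start-all.sh will SSH into every host listed in slaves (and generally expects Spark to be installed at the same path on each of them), so it is worth confirming the master can reach the workers without a password prompt:
    ssh slave01 hostname
    ssh slave02 hostname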

Add Spark Environment config (master)

Spark had trouble communicating with the master node by name, so the author added this step.

  1. Skip this step if you are already in the conf directory: cd spark-1.6.0-bin-hadoop2.6/conf
  2. touch spark-env.sh
  3. sudo nano spark-env.sh
  4. Enter the following (replace master IP address with actual):
    SPARK_MASTER_IP=192.168.1.4
  5. Control+X, then Y to save.
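
The same file can also carry optional worker limits; these only take effect if spark-env.sh is placed in conf/ on the workers as well. A sketch with illustrative values:
    # Address the standalone master binds to (required for this setup)
    SPARK_MASTER_IP=192.168.1.4
    # Optional caps on what each worker offers the cluster (illustrative values)
    SPARK_WORKER_CORES=2
    SPARK_WORKER_MEMORY=2g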

Start Apache Spark cluster (master)

  1. From the Spark directory on the master, run: sbin/start-all.sh
  2. Navigate to http://192.168.1.4:8080 (replace with your master node's IP address) to reach the Spark administration page.
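
Once the web UI shows both workers as ALIVE, you can attach an interactive shell to the cluster and shut everything down when finished (replace the IP with your master's):
    # Connect spark-shell to the standalone master on its default port
    ./bin/spark-shell --master spark://192.168.1.4:7077
    # Stop the master and all workers listed in conf/slaves
    ./sbin/stop-all.sh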

