https://lipn.univ-paris13.fr/bigdata/index.php/How_to_use_Spark_on_Grid5000
https://github.com/mliroz/hadoop_g5k/wiki
https://github.com/mliroz/hadoop_g5k/wiki/spark_g5k
Prepare the needed files by downloading:
- Spark, e.g.: https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
- Compatible Hadoop, e.g.: https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
You will need them as archives, so do not extract their content.
Install Execo Using Pip
(No proxy is needed, unlike what the old tutorial says; easy_install no longer seems to be supported on Grid5000.)
(frontend)$ python -m pip install --user execo
Retrieve the hadoop_g5k sources from GitHub and unzip them.
(frontend)$ wget https://github.com/mliroz/hadoop_g5k/archive/master.zip
(frontend)$ unzip master.zip
Update util.py
to avoid a Python error when checking the Java version (whose format has changed since the package was released).
(frontend)$ nano hadoop_g5k-master/hadoop_g5k/util/util.py
Then make check_java_version
return True
unconditionally, commenting out the rest of the function's body. hadoop_g5k works with the version of OpenJDK installed by default anyway, so the check is unnecessary.
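The patched function can be sketched as below. This is a minimal stub under the assumption that only the body needs to change: keep whatever parameter list check_java_version already declares in your copy of util.py (shown here generically with *args/**kwargs), and comment out the version-parsing code beneath it.

```python
# Minimal sketch of the patched check_java_version in
# hadoop_g5k/util/util.py. Keep the function's original
# parameter list; only the body changes.
def check_java_version(*args, **kwargs):
    # Skip the version parsing, which breaks on recent OpenJDK
    # version strings; the default JDK on Grid5000 works fine,
    # so simply report success.
    return True
```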
Inside the hadoop_g5k-master
folder, launch the Python setup command.
(frontend)$ python setup.py install --user
Depending on your Python configuration, the scripts may be installed in a directory that is not on your PATH. Add that directory to the PATH to be able to call them from anywhere.
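If you are unsure where pip placed the scripts, the per-user base directory can be queried from Python itself; console scripts installed with --user land in its bin subdirectory (typically ~/.local/bin on Linux). This snippet uses only the standard library and is independent of hadoop_g5k:

```python
import os
import site

# pip's --user installs go under the user base directory;
# console scripts such as hg5k and spark_g5k end up in its
# "bin" subdirectory.
scripts_dir = os.path.join(site.getuserbase(), "bin")
print(scripts_dir)
```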
To add it to the PATH automatically each time you connect to Grid5000, add the following lines to your .bash_profile
file.
PATH="/home/$USER/.local/bin:$PATH"
export PATH
From a frontend, reserve your nodes as usual. For example:
$ oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2
Then, from inside your reservation, create and initialize the hadoop cluster.
# --version 2 indicates we are working with a Hadoop 2.x.y release
$ hg5k --create $OAR_NODEFILE --version 2
# Change the Hadoop archive path to yours
$ hg5k --bootstrap /home/$USER/hadoop-2.7.7.tar.gz
$ hg5k --initialize --start
Now create the Spark cluster in STANDALONE mode (hadoop_g5k no longer works well in YARN mode).
$ spark_g5k --create STANDALONE --hid 1
Then install Spark on every cluster node, making sure its bundled Hadoop dependency matches the Hadoop version installed in the previous steps.
# Change the Spark archive path to yours; the -hadoopX.Y suffix must match the Hadoop version deployed above
$ spark_g5k --bootstrap /home/$USER/spark-2.4.5-bin-hadoop2.7.tgz
Finally, initialize the Spark cluster and start it so that it is ready to process jobs.
$ spark_g5k --initialize --start
You are ready to submit a job from its assembly jar. For example:
$ spark_g5k --scala_job /home/$USER/some-spark-assembly.jar --main_class Main
After all your jobs are done, you should clean up all temporary files created during the previous phases.
$ spark_g5k --delete
$ hg5k --delete