- Install DSE
- In the cassandra.yaml file, ensure the datacenter and cluster name match your analytics datacenter
- In the cassandra-env.sh file, add this configuration line toward the bottom:
JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
This makes your DSE node a coordinator only: it will not own any data. You can use this node to submit jobs to DSE locally without needing to know which node is the master.
- Start DSE
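One way to confirm the node came up as a coordinator only is that `nodetool status` reports it owning 0% of the data. A minimal sketch of checking that programmatically (the sample output below is illustrative, not captured from a real cluster):

```python
# Illustrative: inspect `nodetool status`-style output and check that a
# given node owns 0% of the data. The sample text is a made-up example
# of the command's output format.
sample = """Datacenter: Analytics
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Owns   Host ID  Rack
UN  10.0.0.5   120 KB   0.0%   a1b1c1   rack1
"""

def owns_no_data(status_text, address):
    """Return True if the node at `address` reports 0.0% ownership."""
    for line in status_text.splitlines():
        parts = line.split()
        if address in parts:
            # The ownership column is the token ending in '%'.
            pct = next((p for p in parts if p.endswith("%")), None)
            return pct == "0.0%"
    return False

print(owns_no_data(sample, "10.0.0.5"))  # → True for a coordinator-only node
```

In practice you would feed this the real output, e.g. `subprocess.check_output(["nodetool", "status"], text=True)`.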
- Install python
- Install virtualenv
> virtualenv .jupyter
> source .jupyter/bin/activate
> pip install ipython
> pip install jupyter
> PYSPARK_SUBMIT_ARGS="$PYSPARK_SUBMIT_ARGS pyspark-shell" IPYTHON_OPTS="notebook --ip='*' --no-browser" dse pyspark
You can use something like supervisord
to keep jupyter running in the background.
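A minimal supervisord program entry for this could look like the following sketch; the paths, user, and virtualenv location are assumptions for illustration:

```ini
[program:jupyter]
; Activate the virtualenv, then launch the DSE pyspark shell under Jupyter.
; All paths and the user below are illustrative -- adjust to your setup.
command=/bin/bash -c 'source /home/analytics/.jupyter/bin/activate && PYSPARK_SUBMIT_ARGS="pyspark-shell" IPYTHON_OPTS="notebook --ip=* --no-browser" dse pyspark'
directory=/home/analytics
user=analytics
autostart=true
autorestart=true
stopasgroup=true
```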
If you are getting a permission denied error when starting pyspark that looks like this:
OSError: [Errno 13] Permission denied: '/run/user/505/jupyter'
This is because XDG_RUNTIME_DIR points to a runtime directory owned by your logged-in user; in that case, set the following environment variable before starting pyspark:
JUPYTER_RUNTIME_DIR="$HOME/.jupyter/runtime"