Install

To run PySpark in a Jupyter notebook using Docker, we use a Docker image curated by the Jupyter project: jupyter/pyspark-notebook, one of the images in jupyter/docker-stacks.

Link to jupyter/pyspark-notebook on Dockerhub: https://hub.docker.com/r/jupyter/pyspark-notebook/

Link to jupyter/pyspark-notebook on Github: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook

Link to jupyter/docker-stacks on Github: https://github.com/jupyter/docker-stacks

Get the Docker image

The short version: get the Docker image using docker pull:

$ docker pull jupyter/pyspark-notebook

That's it. There is no long version.

To run it, we need to pass traffic from port 8888 on our machine into port 8888 on the Docker container:

$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
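The -p flag takes the form host_port:container_port, so if port 8888 is already taken on your machine, only the left-hand number needs to change. As a rough sketch, the Python snippet below looks for a free host port and prints a matching docker run command. The helper names host_port_is_free and first_free_port are made up for illustration, and the check only probes the loopback address, so treat the result as a hint rather than a guarantee.

```python
import socket


def host_port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to (host, port), i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Binding failed, most likely because the port is already in use.
            return False
    return True


def first_free_port(start=8888, attempts=10):
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if host_port_is_free(port):
            return port
    raise RuntimeError("no free port found in the range tried")


# The container side stays 8888; only the host side changes.
port = first_free_port()
print(f"docker run -it --rm -p {port}:8888 jupyter/pyspark-notebook")
```

If the printed host port is not 8888, use that port in the browser URL instead.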

Fire it up

Fire up the Docker container with the command above:

$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook

This prints the URL for the Jupyter notebook. If you want to allow others to access the notebook, you can also pass in a custom SSL certificate and key, as in the command below. These options are detailed in the jupyter/pyspark-notebook README, under the section "Notebook Options": https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook#notebook-options

$ docker run -d -p 8888:8888 \
    -v /some/host/folder:/etc/ssl/notebook \
    jupyter/pyspark-notebook start-notebook.sh \
    --NotebookApp.keyfile=/etc/ssl/notebook/notebook.key \
    --NotebookApp.certfile=/etc/ssl/notebook/notebook.crt

Test it out

Once you have your notebook open, run the following Python code to check that PySpark works:

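import pyspark

# 'local[*]' runs Spark locally, with as many worker threads as there are cores
sc = pyspark.SparkContext('local[*]')

# do something to prove it works: draw 5 random elements, without replacement
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)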

Persistence and getting data in

One of the challenges of this approach is that the Docker container resets every time you shut it down: with the --rm flag, the container and anything saved inside it are deleted on exit. We can address this by mounting a local directory into the container. In fact, we can mount the local directory into the specific container directory where Jupyter always starts, so that as soon as we fire up our container, we see all of our notebooks, data sets, etc. Even better, we can put that directory under version control (e.g., with git), and get persistence and version control for our notebooks and data, while keeping the easy, instant-on setup of the Docker container.

Here's how we do that:

Create the local directory where your notebooks will live:

$ cd ~/
$ mkdir spark-stuff/
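The next step, mounting that directory into the container, might look like the following sketch. It builds (but does not run) a docker run command in Python, using Docker's -v host_dir:container_dir flag. The container path /home/jovyan/work is an assumption based on the docker-stacks convention, and docker_run_command is a made-up helper name; check the image README for the directory Jupyter actually starts in.

```python
import os
import shlex

IMAGE = "jupyter/pyspark-notebook"
# Assumption: Jupyter in this image starts in /home/jovyan/work.
CONTAINER_DIR = "/home/jovyan/work"


def docker_run_command(host_dir, host_port=8888):
    """Build the argv list for docker run with host_dir mounted in the container."""
    # Docker wants an absolute host path for a bind mount, so expand "~" first.
    host_path = os.path.abspath(os.path.expanduser(host_dir))
    return [
        "docker", "run", "-it", "--rm",
        "-p", f"{host_port}:8888",
        "-v", f"{host_path}:{CONTAINER_DIR}",
        IMAGE,
    ]


# Print the command for the directory created above; paste it into a shell to run it.
cmd = docker_run_command("~/spark-stuff")
print(shlex.join(cmd))
```

Notebooks saved under the mounted directory then live in ~/spark-stuff on the host, so they survive the container being removed.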


Docker/Jupyter PySpark

From charlesreid1
