Install

To run PySpark in a Jupyter notebook using Docker, we use a Docker image curated by the Jupyter project: jupyter/docker-stacks.

Link to jupyter/pyspark-notebook on Docker Hub: https://hub.docker.com/r/jupyter/pyspark-notebook/

Link to jupyter/pyspark-notebook on GitHub: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook

Link to jupyter/docker-stacks on GitHub: https://github.com/jupyter/docker-stacks

Get the Docker image

The short version: get the Docker image using docker pull:

$ docker pull jupyter/pyspark-notebook

That's it. There is no long version.

To run it, we need to pass traffic from port 8888 on our machine to port 8888 inside the Docker container:

$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
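
In the -p flag, the left-hand number is the host port and the right-hand number is the container port. If port 8888 is already in use on your machine, you can map any other host port (9999 below is an arbitrary choice) and browse to localhost:9999 instead:

$ docker run -it --rm -p 9999:8888 jupyter/pyspark-notebook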

Fire it up

Fire up the Docker container with the command above:

$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook

This will print out the URL for the Jupyter notebook. There are also ways to pass in a custom SSL certificate, if you want to let others access the Jupyter notebook securely. These are all detailed in the jupyter/pyspark-notebook README, under the section "Notebook Options": https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook#notebook-options

docker run -d -p 8888:8888 \
    -v /some/host/folder:/etc/ssl/notebook \
    jupyter/pyspark-notebook start-notebook.sh \
    --NotebookApp.keyfile=/etc/ssl/notebook/notebook.key \
    --NotebookApp.certfile=/etc/ssl/notebook/notebook.crt
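
If you don't already have a key and certificate, one way to generate a self-signed pair for testing is with openssl, writing the files into the host folder mapped above so the filenames match the flags:

$ openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /some/host/folder/notebook.key \
    -out /some/host/folder/notebook.crt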

Test it out

Once you have your notebook open, execute the following Python code to ensure it works ok:

import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
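
If you want a slightly meatier check, here's a small sketch (reusing the sc from the block above) that estimates pi with a Monte Carlo sample, exercising parallelize, filter, and count:

import random

def inside(_):
    # Sample a random point in the unit square and check whether
    # it falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Estimate pi as 4 * (points inside the quarter circle) / (total points)
n = 100000
count = sc.parallelize(range(n)).filter(inside).count()
print(4.0 * count / n)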

Persistence and getting data in

One of the challenges of this approach is that the Docker container resets every time you shut it down: since we run it with --rm, the container and everything in it are removed on exit. We can address this by mounting a local directory into the container. In fact, we can mount the local directory onto the specific container directory where Jupyter always starts, so that as soon as we fire up our container, we always see all of our notebooks, data sets, etc. Even better, we can put that directory under version control (e.g., with git), getting persistence and version control for our work while keeping the easy, instant-on setup of the Docker container.

Here's how we do that:

When you start up your notebook container, you'll notice a message that's printed on the screen:

Serving notebooks from local directory: /home/jovyan

This is the container directory we'll map our local directory TO, so that our stuff will always be there when we start the container.

Create the local directory where your notebooks will live:

$ cd ~/
$ mkdir spark-stuff/

Now run the docker container, and map your spark-stuff directory to /home/jovyan:

$ docker run -it --rm -p 8888:8888 -v $HOME/spark-stuff:/home/jovyan jupyter/pyspark-notebook

The first time it starts up, your directory will be empty. But as you create and save notebooks, everything will be stored locally. Check it out: create a notebook with a simple print(2+2) statement, save it, and shut down the container. You'll see Untitled.ipynb still sitting in your spark-stuff directory, even after the container is stopped:

$ docker run -it --rm -p 8888:8888 -v $HOME/spark-stuff:/home/jovyan jupyter/pyspark-notebook
Execute the command: jupyter notebook
[I 02:18:37.670 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[W 02:18:37.697 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 02:18:37.730 NotebookApp] JupyterLab alpha preview extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
JupyterLab v0.27.0
Known labextensions:
[I 02:18:37.735 NotebookApp] Running the core application with no additional extensions or settings
[I 02:18:37.741 NotebookApp] Serving notebooks from local directory: /home/jovyan
[I 02:18:37.741 NotebookApp] 0 active kernels
[I 02:18:37.742 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315
[I 02:18:37.743 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 02:18:37.748 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315
[I 02:18:39.090 NotebookApp] 302 GET /api/contents/Untitled.ipynb?content=0&_=1506478678548 (172.17.0.1) 4.63ms
[I 02:19:43.202 NotebookApp] 302 GET /?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315 (172.17.0.1) 0.96ms
[I 02:19:46.813 NotebookApp] Creating new notebook in
[I 02:19:46.861 NotebookApp] Writing notebook-signing key to /home/jovyan/.local/share/jupyter/notebook_secret
[I 02:19:47.696 NotebookApp] Kernel started: 37bb4f61-0d18-4b15-8686-fa382777663d
[I 02:19:48.748 NotebookApp] Adapting to protocol v5.1 for kernel 37bb4f61-0d18-4b15-8686-fa382777663d
[I 02:19:54.408 NotebookApp] Saving file at /Untitled.ipynb
[I 02:19:56.723 NotebookApp] Kernel shutdown: 37bb4f61-0d18-4b15-8686-fa382777663d
^C[I 02:20:00.021 NotebookApp] interrupted
Serving notebooks from local directory: /home/jovyan
0 active kernels
The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315
Shutdown this notebook server (y/[n])? y
[C 02:20:01.186 NotebookApp] Shutdown confirmed
[I 02:20:01.189 NotebookApp] Shutting down kernels

$ ls spark-stuff/
Untitled.ipynb
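
Since the spark-stuff directory lives on the host, putting it under version control is as simple as running git there. A minimal sketch, committing the notebook we just created:

$ cd ~/spark-stuff
$ git init
$ git add Untitled.ipynb
$ git commit -m "Add first notebook"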

We can also copy CSV files, text files, sqlite databases, or polished notebooks into the spark-stuff directory, or even check out a git repository there, and everything will be passed through to the Docker container transparently.
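
For example, here's a minimal sketch of reading a data file from inside a notebook; my_data.csv is a hypothetical file you've copied into ~/spark-stuff on the host, which shows up under /home/jovyan inside the container:

import pyspark

# Reuse the running SparkContext if the notebook already has one.
sc = pyspark.SparkContext.getOrCreate()

# /home/jovyan is the container-side view of ~/spark-stuff on the host;
# my_data.csv is a hypothetical file copied there.
lines = sc.textFile('/home/jovyan/my_data.csv')
print(lines.take(5))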

Flags