Docker/Jupyter PySpark
From charlesreid1
Install
To run PySpark in a Jupyter notebook using Docker, we use the jupyter/pyspark-notebook image, one of the Docker images maintained by the Jupyter project in jupyter/docker-stacks.
Link to jupyter/pyspark-notebook on Dockerhub: https://hub.docker.com/r/jupyter/pyspark-notebook/
Link to jupyter/pyspark-notebook on Github: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook
Link to jupyter/docker-stacks on Github: https://github.com/jupyter/docker-stacks
Get the docker container
The short version: get the docker image using docker pull:
$ docker pull jupyter/pyspark-notebook
That's it. There is no long version.
To run it, we need to pass traffic from port 8888 on our machine into port 8888 on the Docker container:
$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
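Once the container is running, a quick way to confirm that the port mapping works is to try a TCP connection to port 8888 from the host. The following is a minimal sketch using only the Python standard library; the helper name port_is_open is made up for illustration and is not part of Docker or Jupyter.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# 8888 is the host side of the -p 8888:8888 mapping used above.
print("notebook port reachable:", port_is_open("localhost", 8888))
```

If this prints False while the container is up, the -p flag is the first thing to check.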
Fire it up
Fire up the Docker container with the command above:
$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
This will print out the URL for the Jupyter notebook. There are also ways to pass in a custom certificate, if you want to allow others to access the Jupyter notebook. These are all detailed in the jupyter/pyspark-notebook README, under the section "Notebook Options": https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook#notebook-options
docker run -d -p 8888:8888 \
    -v /some/host/folder:/etc/ssl/notebook \
    jupyter/pyspark-notebook start-notebook.sh \
    --NotebookApp.keyfile=/etc/ssl/notebook/notebook.key \
    --NotebookApp.certfile=/etc/ssl/notebook/notebook.crt
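Because the command above runs the container detached (-d), the startup URL is not printed to the terminal. The startup output can be retrieved with docker logs, and the tokenized URL pulled out of it. Here is a rough sketch of the second step in Python. It assumes the log format shown in the sample output later on this page (other notebook versions may word it differently), and find_token_url is a hypothetical helper, not part of Jupyter.

```python
import re
from typing import Optional

# Assumed URL shape, based on the startup log shown on this page:
#   http://localhost:8888/?token=<hex digits>
TOKEN_URL = re.compile(
    r"https?://(?:localhost|127\.0\.0\.1):\d+/\S*?\?token=[0-9a-f]+"
)


def find_token_url(log_text: str) -> Optional[str]:
    """Return the last tokenized local URL found in log_text, or None."""
    matches = TOKEN_URL.findall(log_text)
    return matches[-1] if matches else None


# Sample taken from the notebook server output on this page.
sample = """[C 02:18:37.748 NotebookApp]
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315
"""
print(find_token_url(sample))
```

In practice the text passed to find_token_url would be whatever docker logs prints for the container.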
Test it out
Once you have your notebook open, execute the following Python code to ensure it works ok:
import pyspark
sc = pyspark.SparkContext('local[*]')
# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
Persistence and getting data in
One of the challenges of this approach is that, because we run the container with --rm, it is deleted every time we shut it down, and anything saved inside it is lost. This problem can be addressed by mounting a shared directory between our container and our local disk. In fact, we can mount the local directory onto the specific container directory where Jupyter always starts, so that as soon as we fire up our container, we see all of our notebooks, data sets, etc. Even better, we can put that directory under version control (e.g., with git), and get persistence and version control for our notebooks and data, while also getting the easy, instant-on setup of the Docker container.
Here's how we do that:
When you start up your notebook container, you'll notice a message that's printed on the screen:
Serving notebooks from local directory: /home/jovyan
This is where we'll map the local directory TO, so that our stuff will always be there when we start the container.
Create the local directory where your notebooks will live:
$ cd ~/
$ mkdir spark-stuff/
Now run the docker container, and map your spark-stuff directory to /home/jovyan:
$ docker run -it --rm -p 8888:8888 -v $HOME/spark-stuff:/home/jovyan jupyter/pyspark-notebook
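The -v flag takes the form host_path:container_path, and for a directory mount like this one the host side should be an absolute path (which is why the command uses $HOME). As an illustration only, here is a small Python sketch that assembles the same command from a host directory; docker_run_args is a hypothetical helper and the defaults simply mirror the values used on this page.

```python
import shlex
from pathlib import Path
from typing import List


def docker_run_args(
    host_dir: Path,
    image: str = "jupyter/pyspark-notebook",
    port: int = 8888,
    container_dir: str = "/home/jovyan",
) -> List[str]:
    """Build the argument list for the docker run command shown above."""
    # Expand ~ and make the path absolute before handing it to -v.
    host = host_dir.expanduser().resolve()
    return [
        "docker", "run", "-it", "--rm",
        "-p", f"{port}:{port}",
        "-v", f"{host}:{container_dir}",
        image,
    ]


args = docker_run_args(Path("~/spark-stuff"))
print(shlex.join(args))  # the command as a single shell-quoted string
```

The list could be passed to subprocess.run(args) to actually start the container, assuming Docker is installed.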
The first time it starts up, your directory will be empty. But when you start to create notebooks, and save them, everything will be stored locally. Check it out: try creating a notebook that has a simple print(2+2) statement, and save it. Now close out the container. You'll see Untitled.ipynb still sitting in your spark-stuff directory, even after the container is stopped:
$ docker run -it --rm -p 8888:8888 -v $HOME/spark-stuff:/home/jovyan jupyter/pyspark-notebook
Execute the command: jupyter notebook
[I 02:18:37.670 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[W 02:18:37.697 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[I 02:18:37.730 NotebookApp] JupyterLab alpha preview extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
JupyterLab v0.27.0
Known labextensions:
[I 02:18:37.735 NotebookApp] Running the core application with no additional extensions or settings
[I 02:18:37.741 NotebookApp] Serving notebooks from local directory: /home/jovyan
[I 02:18:37.741 NotebookApp] 0 active kernels
[I 02:18:37.742 NotebookApp] The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315
[I 02:18:37.743 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 02:18:37.748 NotebookApp]
    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315
[I 02:18:39.090 NotebookApp] 302 GET /api/contents/Untitled.ipynb?content=0&_=1506478678548 (172.17.0.1) 4.63ms
[I 02:19:43.202 NotebookApp] 302 GET /?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315 (172.17.0.1) 0.96ms
[I 02:19:46.813 NotebookApp] Creating new notebook in
[I 02:19:46.861 NotebookApp] Writing notebook-signing key to /home/jovyan/.local/share/jupyter/notebook_secret
[I 02:19:47.696 NotebookApp] Kernel started: 37bb4f61-0d18-4b15-8686-fa382777663d
[I 02:19:48.748 NotebookApp] Adapting to protocol v5.1 for kernel 37bb4f61-0d18-4b15-8686-fa382777663d
[I 02:19:54.408 NotebookApp] Saving file at /Untitled.ipynb
[I 02:19:56.723 NotebookApp] Kernel shutdown: 37bb4f61-0d18-4b15-8686-fa382777663d
^C[I 02:20:00.021 NotebookApp] interrupted
Serving notebooks from local directory: /home/jovyan
0 active kernels
The Jupyter Notebook is running at: http://[all ip addresses on your system]:8888/?token=7566095c85634bf37160afdfa74debe0d918ab43bfd65315
Shutdown this notebook server (y/[n])? y
[C 02:20:01.186 NotebookApp] Shutdown confirmed
[I 02:20:01.189 NotebookApp] Shutting down kernels
$ ls spark-stuff/
Untitled.ipynb
We can also copy in CSV files, text files, sqlite databases, polished notebooks, or even check out a git repository into the spark-stuff directory, and everything will be passed through to the Docker container invisibly.
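To check from inside a notebook that the mounted files are really visible, we can list what is sitting under the notebook directory. A minimal sketch, assuming the container path /home/jovyan from the log message above; list_notebooks is a made-up helper name.

```python
from pathlib import Path
from typing import List


def list_notebooks(root: Path) -> List[str]:
    """Return notebook paths under root (relative to root), sorted."""
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*.ipynb")
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" not in p.parts
    )


# Inside the container the mount point is /home/jovyan;
# on the host the same files live in ~/spark-stuff.
root = Path("/home/jovyan")
if root.is_dir():
    print(list_notebooks(root))
```

Running the same function on the host against ~/spark-stuff should give the same list, since both paths point at the same files.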
