From charlesreid1

Revision as of 02:09, 27 September 2017 by Admin

Install

To run PySpark in a Jupyter notebook using Docker, we use the jupyter/pyspark-notebook image, one of the Docker images maintained by the Jupyter project in jupyter/docker-stacks.

Link to jupyter/pyspark-notebook on Dockerhub: https://hub.docker.com/r/jupyter/pyspark-notebook/

Link to jupyter/pyspark-notebook on Github: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook

Link to jupyter/docker-stacks on Github: https://github.com/jupyter/docker-stacks

Get the Docker image

The short version: get the docker image using docker pull:

$ docker pull jupyter/pyspark-notebook

That's it. There is no long version.

To run it, we need to forward traffic from port 8888 on our machine to port 8888 in the Docker container:

$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
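
Once the container is running, a quick way to confirm the port mapping is to check from the host that something is listening on port 8888. The following is a minimal sketch using only the Python standard library. The helper name port_is_open is hypothetical (it is not part of Docker or Jupyter), and it only checks that a TCP connection succeeds, not that the notebook server itself is healthy.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


# 8888 is the host side of the -p 8888:8888 mapping used above
print(port_is_open("127.0.0.1", 8888))
```

If the container was started with a different host port (the first number in the -p argument), pass that port instead.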

Fire it up

Fire up the Docker container with the command above:

$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook

This prints the URL for the Jupyter notebook, which you can open in a browser. If you want to allow others to access the notebook, you can also pass in a custom certificate, as in the command below. The available options are detailed in the jupyter/pyspark-notebook README, under the section "Notebook Options": https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook#notebook-options

docker run -d -p 8888:8888 \
    -v /some/host/folder:/etc/ssl/notebook \
    jupyter/pyspark-notebook start-notebook.sh \
    --NotebookApp.keyfile=/etc/ssl/notebook/notebook.key \
    --NotebookApp.certfile=/etc/ssl/notebook/notebook.crt
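
The command above assumes that the mounted host folder already holds the key and certificate under the file names given to --NotebookApp.keyfile and --NotebookApp.certfile. As a small sanity check before starting the container, something like the following could be run on the host. This is only a sketch: the helper missing_cert_files is hypothetical, and /some/host/folder is the placeholder path from the command, not a real location.

```python
from pathlib import Path

# File names taken from the docker run command; adjust if yours differ
EXPECTED_FILES = ("notebook.key", "notebook.crt")


def missing_cert_files(folder, names=EXPECTED_FILES):
    """Return the expected file names that are not present in folder."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


# /some/host/folder is a placeholder for the folder passed to -v
print(missing_cert_files("/some/host/folder"))
```

An empty list means both files were found in the folder that will be mounted at /etc/ssl/notebook.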

Test it out

Once the notebook is open, run the following Python code to check that Spark works:

import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
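
Because takeSample returns a random sample, its output differs from run to run. For a check with a known answer, one option is to compare a Spark computation against the same computation in plain Python. The sketch below assumes the sc created above is still active. The name check_sum_of_squares is a hypothetical helper, not part of PySpark.

```python
def check_sum_of_squares(sc, n=1000):
    """Compare Spark's sum of squares over range(n) with plain Python."""
    expected = sum(x * x for x in range(n))
    # map is applied to the distributed data; sum returns a single number
    actual = sc.parallelize(range(n)).map(lambda x: x * x).sum()
    return actual == expected
```

In the notebook, calling check_sum_of_squares(sc) should evaluate to True if the Spark context is working.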
