Docker/Jupyter PySpark
From charlesreid1
Install
To run PySpark in a Jupyter notebook using Docker, we use jupyter/pyspark-notebook, one of the Docker images curated by the Jupyter project in jupyter/docker-stacks.
Link to jupyter/pyspark-notebook on Dockerhub: https://hub.docker.com/r/jupyter/pyspark-notebook/
Link to jupyter/pyspark-notebook on Github: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook
Link to jupyter/docker-stacks on Github: https://github.com/jupyter/docker-stacks
Get the docker container
The short version: get the docker image using docker pull:
$ docker pull jupyter/pyspark-notebook
That's it. There is no long version.
To run it, we need to pass traffic from port 8888 on our machine to port 8888 in the Docker container:
$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
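The port mapping is the one part of this command that is easy to get backwards, so here is a minimal sketch in Python that assembles the same command as an argument list. The function name docker_run_command and its parameters are hypothetical helpers for illustration; only the docker flags themselves come from the command above.

```python
def docker_run_command(host_port=8888, container_port=8888,
                       image="jupyter/pyspark-notebook"):
    """Build the argument list for `docker run` with one published port.

    Hypothetical helper for illustration; it only assembles the command
    and does not run Docker.
    """
    return [
        "docker", "run", "-it", "--rm",
        # -p takes HOST:CONTAINER, so the host side comes first
        "-p", f"{host_port}:{container_port}",
        image,
    ]


# The same command as above
print(" ".join(docker_run_command()))
# Publish the notebook on host port 9999 (the container still listens on 8888)
print(" ".join(docker_run_command(host_port=9999)))
```

If the host port is changed, the notebook is reached on that port from the host machine, while the container side stays at 8888.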
Fire it up
Fire up the Docker container with the command above:
$ docker run -it --rm -p 8888:8888 jupyter/pyspark-notebook
This will print out the URL for the Jupyter notebook. If you want to allow others to access the notebook, you can also pass in a custom certificate. The options are detailed in the jupyter/pyspark-notebook README, under the section "Notebook Options": https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook#notebook-options
For example, the following command mounts a host folder containing the key and certificate into the container and passes both files to the notebook server:
$ docker run -d -p 8888:8888 \
-v /some/host/folder:/etc/ssl/notebook \
jupyter/pyspark-notebook start-notebook.sh \
--NotebookApp.keyfile=/etc/ssl/notebook/notebook.key \
--NotebookApp.certfile=/etc/ssl/notebook/notebook.crt
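Once the container is running, a quick way to confirm that the published port is reachable from the host is to attempt a TCP connection to it. The sketch below uses only the Python standard library; the function name port_is_open is a hypothetical helper, and the host 127.0.0.1 is an assumption that the container runs on the local machine. A successful connection only shows that something is listening on the port, not that it is the Jupyter server.

```python
import socket


def port_is_open(host="127.0.0.1", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError if the connection is refused or times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if port_is_open():
    print("Something is listening on port 8888")
else:
    print("Nothing is listening on port 8888 yet")
```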
Test it out
Once you have your notebook open, execute the following Python code to confirm that PySpark works:
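import pyspark
# 'local[*]' runs Spark locally, with as many worker threads as there are cores
sc = pyspark.SparkContext('local[*]')
# do something to prove it works:
# distribute the numbers 0..999 and draw 5 of them without replacement
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)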
Persistence and getting data in
One of the challenges of this approach is that the Docker container resets every time you shut it down: with the --rm flag, the container is removed when it exits, and anything saved only inside it is lost. This problem can be addressed by mounting a directory that is shared between the container and our local disk. In fact, we can mount the local directory onto the specific container directory where Jupyter always starts, so that as soon as we fire up the container, we see all of our notebooks, data sets, etc. Even better, we can put that directory under version control (e.g., with git) and get persistence and version control for our data, while keeping the easy, instant-on setup of the Docker container.
Here's how we do that:
Create the local directory where your notebooks will live:
$ cd ~/
$ mkdir spark-stuff/
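The mounting step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the container path /home/jovyan/work is assumed to be the notebook work directory in the Jupyter docker-stacks images and is not given in the text above, and the function name mounted_run_command is a hypothetical helper. The sketch creates the local directory if it is missing and builds a docker run command that uses the -v flag in HOST:CONTAINER form; it does not run Docker itself.

```python
from pathlib import Path


def mounted_run_command(host_dir, container_dir="/home/jovyan/work",
                        image="jupyter/pyspark-notebook"):
    """Create host_dir if needed and build a `docker run` command that mounts it.

    container_dir is an assumption about where the image keeps notebooks.
    """
    # Docker needs an absolute host path for a bind mount
    host_path = Path(host_dir).expanduser().resolve()
    host_path.mkdir(parents=True, exist_ok=True)
    return [
        "docker", "run", "-it", "--rm",
        "-p", "8888:8888",
        # -v takes HOST_PATH:CONTAINER_PATH
        "-v", f"{host_path}:{container_dir}",
        image,
    ]
```

Calling mounted_run_command("~/spark-stuff") returns the argument list for the directory created above, which can be joined with spaces and pasted into a shell, or passed to subprocess.run.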