Revision as of 01:19, 27 September 2017

Installing

Docker

Easiest solution: use Docker.

Use the Jupyter PySpark notebook Docker container: https://hub.docker.com/r/jupyter/pyspark-notebook/

The image comes bundled with Apache Mesos, a cluster resource management framework, so you can connect to a Mesos-managed cluster and use compute resources on that cluster.

This Docker image is provided courtesy of the Jupyter project on GitHub: https://github.com/jupyter/docker-stacks

The PySpark notebook image's README has a nice explanation of how to set it up with either a standalone (single-node) or Mesos cluster: https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook

Basically, here are the first few lines of a standalone notebook:

import pyspark
sc = pyspark.SparkContext('local[*]')

# do something to prove it works
rdd = sc.parallelize(range(1000))
rdd.takeSample(False, 5)
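
The first argument to takeSample is withReplacement, so False asks for five elements drawn without replacement. As a rough plain-Python analogy (not Spark code, and needing no Spark install), the same idea can be sketched with the standard library; the names data and sample are only illustrative:

```python
import random

# Plain-Python analogy for rdd.takeSample(False, 5):
# draw 5 elements from range(1000) without replacement.
data = range(1000)
sample = random.sample(data, 5)
print(sample)
```

The output differs on every run, just as it does with takeSample when no seed is given.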

Mac

Ensure you have the following software installed:

Python 3.x distribution
Jupyter notebook
Java 8 JDK (link: https://www.java.com/en/download/faq/java_mac.xml)

Installing with Homebrew

Install Apache Spark using Homebrew:

$ brew install apache-spark

This should put pyspark on your path:

$ which pyspark
/usr/local/bin/pyspark

I was still having problems importing pyspark from Python, so I also ended up installing it with pip:

$ pip3 install pyspark
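
Having the pyspark launcher on your path does not by itself mean that a given Python interpreter can import the package. A minimal sketch for checking this from the interpreter you plan to use (the helper name module_importable is hypothetical, not part of PySpark):

```python
import importlib.util

def module_importable(name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None

# Run this with the same python3 that Jupyter will use.
print("pyspark importable:", module_importable("pyspark"))
```

If this prints False, the pip3 install above (run against the same interpreter) is one way to fix it.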

Linux

Ensure you have the following software installed:

Python 3.x distribution
Jupyter notebook
Java 8 JDK (link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)
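
A quick way to check the prerequisites listed above is a short script like the following sketch. It only checks that the interpreter is Python 3 and that java and jupyter executables are on the path; it does not verify the Java version. The function name missing_prerequisites is hypothetical.

```python
import shutil
import sys

def missing_prerequisites():
    """Return the names of prerequisites that could not be found."""
    missing = []
    # The interpreter running this script should be Python 3.x.
    if sys.version_info < (3,):
        missing.append("python3")
    # Look for the java and jupyter executables on the PATH.
    for tool in ("java", "jupyter"):
        if shutil.which(tool) is None:
            missing.append(tool)
    return missing

print("missing:", missing_prerequisites())
```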

Download the Spark source from this page: http://spark.apache.org/downloads.html

Now add the repository for sbt, the Scala build tool, to apt (see https://stackoverflow.com/questions/35529913/how-to-install-sbt-on-ubuntu-debian-with-apt-get):

$ echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
$ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
$ sudo apt-get update
$ sudo apt-get install sbt

Now unzip the Spark source, enter the directory, and run:

$ sbt assembly

Ensure Spark was built correctly by running this command from the same directory:

$ bin/pyspark

Now set the $SPARK_HOME environment variable to wherever your Spark lives:

export SPARK_HOME="/path/to/unzipped/spark-2.2"
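
To confirm the variable is set the way you expect, a small check like this sketch can help. It assumes the layout used above, where the launcher lives at bin/pyspark inside the Spark directory; the function name check_spark_home is hypothetical.

```python
import os

def check_spark_home(environ=os.environ):
    """Return a list of problems found with SPARK_HOME (empty if none)."""
    problems = []
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    if not os.path.isdir(spark_home):
        problems.append("SPARK_HOME is not a directory: " + spark_home)
        return problems
    # The same launcher that was run as bin/pyspark after the build.
    launcher = os.path.join(spark_home, "bin", "pyspark")
    if not os.path.isfile(launcher):
        problems.append("missing launcher: " + launcher)
    return problems

for problem in check_spark_home():
    print(problem)
```

The script prints nothing when SPARK_HOME points at a directory containing bin/pyspark. Note that export only affects the current shell session, so the check has to run in the same session (or after adding the export line to your shell startup file).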

Testing Out PySpark

Test it out by running the pyspark command. This should start what looks like an ordinary Python shell, but with a Spark splash message:

$ pyspark
Python 2.7.10 (default, Feb  7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/26 17:53:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/26 17:53:16 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/09/26 17:53:16 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
17/09/26 17:53:17 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.10 (default, Feb  7 2017 00:08:15)
SparkSession available as 'spark'.
>>>

Verify that it works by checking that the sc variable holds a Spark context:

>>> sc
<SparkContext master=local[*] appName=PySparkShell>

Set Up PySpark With Jupyter

To use PySpark through a Jupyter notebook instead of the command line, first make sure Jupyter is up to date:

$ pip3 install --upgrade jupyter

Now create a Jupyter profile for PySpark notebooks:

$ jupyter profile create pyspark

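Oops. This fails: as far as I can tell, Jupyter has no profile subcommand. Profiles were an IPython feature (ipython profile create) and did not carry over to the jupyter command.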


PySpark: Difference between revisions

From charlesreid1