As I was playing around with Spark on my MacBook for a school project, I decided I didn’t want to use the version our professor gave us, which was a Java project that had to be coded in Scala, a language I’m not really familiar with.
I love Python, and since I knew Spark supports Python, I decided to look at my options for installing and running it for testing and analysis of small datasets on my MacBook.
So… for my own reference when I inevitably reinstall it and forget the steps, and for you, who are presumably looking to install Spark and use it from Python, here’s how I did it!
Requirements
In order to follow the guide step by step you’ll need:
- Python 3.5 installed (2.7 should also work, choose the one you prefer to code with; note that Python 3.6 is incompatible with the current Spark version).
- Py4J
- Jupyter Notebook and IPython installed
On my MacBook I installed all the requirements via Anaconda:
- Download Anaconda from: https://www.continuum.io/downloads#macos and install it.
- Open Anaconda-Navigator and create a new virtual environment with python 3.5 (Environments->Create New->Python Version:3.5)
- Install Jupyter Notebook and qtConsole
- Open a terminal and activate the virtual environment you just created; say you named it ‘python35’, type:

$ source activate python35

- Install Py4J via pip:

$ pip install py4j
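Before moving on, a quick sanity check (my own addition, nothing official) confirms that the interpreter and Py4J ended up in the right environment:

# Run inside the activated python35 environment, e.g. in a python or ipython session
import sys
import py4j          # raises ImportError if pip installed it into another environment

print(sys.version_info[:2])   # expect (3, 5)
print(py4j.__file__)          # shows where Py4J ended up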
Perfect, the requirements are in place; time to move on to Spark!
Spark & Python
Spark supports Python via the “pyspark” module, which lets you write Python applications that interface with Spark.
Applications can either be run as standalone apps, importing pyspark into the Python app, or you can run an interactive Python (or IPython / Jupyter) shell in a Spark context.
Install
We wanted to install Spark, didn’t we? Let’s do it then!
- Download Spark with Hadoop from: http://spark.apache.org/downloads.html (at the time of writing the latest version is spark-2.1.0-bin-hadoop2.7)
- Unpack it in the ~/Applications directory
Done!
It was easy, wasn’t it?
Open a terminal, activate the python35 environment, and you’re ready to go: you can start by launching the “pyspark” executable in the bin folder.
$ ./bin/pyspark
This will open an interactive Python shell which you can use to start playing around.
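For example, the shell already provides a SparkContext as sc, so a tiny smoke test (my own example, not from the official docs) could look like this:

# Typed directly into the pyspark shell; `sc` is created for you at startup
rdd = sc.parallelize(range(1, 11))            # distribute a small list of numbers
squares = rdd.map(lambda x: x * x)            # transform each element
print(squares.collect())                      # [1, 4, 9, ..., 100]
print(squares.reduce(lambda a, b: a + b))     # 385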
But that’s not really handy if you have a big project or want to present some data as a notebook, is it? In fact, my real objective was to be able to launch self-contained apps, as well as to work inside Jupyter notebooks, which are really good for playing around with ideas before formalizing them in an app.
So, how does it work?
Self-contained App
To be able to use Spark in a self-contained app, we first have to import the SparkContext into it:
from pyspark import SparkContext

# "local" specifies that spark be run on your computer, as opposed to a remote cluster.
# "MySparkApp" is just the name of the app in the spark console
sc = SparkContext("local", "MySparkApp")

# Your spark code, eg. a word count:
book = sc.textFile("1984.txt")
word_count = book.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda x, y: x + y)

# print the first 10 elements of the RDD
for word in word_count.take(10):
    print(word)
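To unpack the example: flatMap splits every line of the book into single words, map turns each word into a (word, 1) pair, and reduceByKey adds up the ones belonging to the same word, giving the classic word count. When the app is finished it’s also good practice to call sc.stop() to shut the context down cleanly.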
Then, to run it (again from within our python35 environment):
$ YOUR_SPARK_HOME/bin/spark-submit --master local[4] MySparkApp.py

(Where local[4] is telling Spark to run on localhost with 4 threads)
IPython / Jupyter
To run in IPython/Jupyter we’ll need to add some variables to the shell environment:
$ vim ~/.bash_profile

(Or your favorite editor)
# Append this
# SPARK
export SPARK_HOME=~/Applications/spark-2.1.0-bin-hadoop2.7
alias py_spark="$SPARK_HOME/bin/pyspark"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PACKAGES="com.databricks:spark-csv_2.11:1.4.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777"
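In short: SPARK_HOME points at the folder you unpacked, PYTHONPATH makes the pyspark module importable from plain Python, PYSPARK_SUBMIT_ARGS pulls in the spark-csv package for reading CSV files, and the two PYSPARK_DRIVER_* variables tell pyspark to launch an IPython notebook server on port 7777 instead of the plain Python shell.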
Done!
Restart the shell, activate the environment again, and type:
$ py_spark
You’ll be greeted by IPython starting the notebook and prompting you to open the browser; just open the link and create a new notebook. (Beware: Jupyter only has access to the folder and subfolders it was started in, so start it where your project resides.)
(python35) mypc:spark-2.1.0-bin-hadoop2.7 francesco$ ./bin/pyspark
[TerminalIPythonApp] WARNING | Subcommand `ipython notebook` is deprecated and will be removed in future versions.
[TerminalIPythonApp] WARNING | You likely want to use `jupyter notebook` in the future
[I 19:48:31.622 NotebookApp] Serving notebooks from local directory: /Users/francesco/Applications/spark-2.1.0-bin-hadoop2.7
[I 19:48:31.622 NotebookApp] 0 active kernels
[I 19:48:31.622 NotebookApp] The Jupyter Notebook is running at: http://localhost:7777/?token=670540452905476h49dj06sk75402h743026s709485j2
[I 19:48:31.622 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:48:31.624 NotebookApp] Copy/paste this URL into your browser when you connect for the first time, to login with a token:
    http://localhost:7777/?token=670540452905476h49dj06sk75402h743026s709485j2
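In a new notebook you can quickly confirm everything is wired up; this first cell is just my own sanity check (sc is created by pyspark itself, you don’t have to build it):

# First cell of the notebook: the SparkContext already exists as `sc`
print(sc.version)   # should match the Spark you unpacked, e.g. 2.1.0
print(sc.master)    # the master the shell started with (local[*] by default)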
That’s all, enjoy 🙂
P.S.
For your fun: 1984.txt