PySpark -> Python + Spark = ❤️🐍

As I was playing around with Spark on my MacBook for a school project, I decided I didn’t want to use the version our professor gave us: a Java project that had to be coded in Scala, a language I’m not really familiar with.

I love Python, really, and since I knew Spark supported it, I decided to look at my options for installing and running it to test and analyze small datasets on my MacBook.

So… for my own reference when I inevitably reinstall it and forget the steps, and for you, who are presumably looking to install Spark and use it with Python, here’s how I did it!

Requirements

In order to follow the guide step by step you’ll need:

  • Python 3.5 installed (2.7 should also work, choose whichever you prefer to code with; note that Python 3.6 is incompatible with the current Spark version).
  • Py4J
  • Jupyter Notebook and IPython installed

On my macbook I installed all the requirements via Anaconda:

  1. Download Anaconda from: https://www.continuum.io/downloads#macos and install it.
  2. Open Anaconda-Navigator and create a new virtual environment with python 3.5 (Environments->Create New->Python Version:3.5)
  3. Install Jupyter Notebook and qtConsole
  4. Open a terminal and activate the virtual environment you just created (let’s say you named it ‘python35’)
  5. Install Py4J via pip
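The commands for steps 4 and 5 would look something like this (a sketch: the environment name ‘python35’ matches the one created above, and older conda versions use `source activate` rather than `conda activate`):

```shell
# Activate the virtual environment created with Anaconda
source activate python35

# Install Py4J into the active environment
pip install py4j
```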

Perfect, the requirements are in place, it’s time to pass on to Spark!

Spark & Python

Spark supports Python via the “pyspark” module, which lets you write Python applications that interface with Spark.

Applications can either run as standalone apps, importing pyspark into the Python app, or you can run an interactive Python (or IPython / Jupyter) shell in a Spark context.

Install

We wanted to install Spark, didn’t we? Let’s do it then!

  1. Download Spark with Hadoop from: http://spark.apache.org/downloads.html (at the time of writing the latest version is spark-2.1.0-bin-hadoop2.7)
  2. Unpack it in the ~/Applications directory
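Assuming the archive was saved to ~/Downloads (adjust the file name to the version you downloaded), the unpacking step looks like:

```shell
# Create the target directory and unpack the Spark archive into it
mkdir -p ~/Applications
tar -xzf ~/Downloads/spark-2.1.0-bin-hadoop2.7.tgz -C ~/Applications
```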

Done!

It was easy, wasn’t it?
Open a terminal, activate the python35 environment, and you’re ready to go: you can start by launching the “pyspark” executable in the bin folder.
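For example (assuming you unpacked Spark into ~/Applications as above):

```shell
# Activate the environment, then launch the interactive PySpark shell
source activate python35
~/Applications/spark-2.1.0-bin-hadoop2.7/bin/pyspark
```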

This will open an interactive shell in python which you can use to start playing around.

But that’s not very handy if you have a big project or want to present some data as a notebook, is it? In fact, my real objective was to be able to launch self-contained apps, as well as work inside Jupyter Notebook, which is really good for playing around with ideas before formalizing them in an app.

So, how does it work?

Self-contained App

To be able to use Spark in a self-contained app, we first have to import the SparkContext into it:
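A minimal sketch of such an app; the file name SimpleApp.py, the chosen word, and the use of 1984.txt (from the P.S. below) as input are just examples:

```python
# SimpleApp.py -- a minimal self-contained PySpark app
from pyspark import SparkConf, SparkContext

# Configure a local Spark context (use all local cores)
conf = SparkConf().setAppName("SimpleApp").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Load a text file and count the lines containing the word "Ministry"
lines = sc.textFile("1984.txt")
count = lines.filter(lambda line: "Ministry" in line).count()
print("Lines containing 'Ministry': %d" % count)

sc.stop()
```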

Then to run it (always from our environment):
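For instance, via Spark’s own spark-submit launcher (path as in the install step above):

```shell
# Submit the app to Spark for execution
~/Applications/spark-2.1.0-bin-hadoop2.7/bin/spark-submit SimpleApp.py
```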

IPython / Jupyter

To run in IPython/Jupyter we’ll need to add some variables to the shell environment:
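Something like the following, appended to ~/.bash_profile (SPARK_HOME assumes the install location used above; PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are the variables Spark reads to pick the driver shell):

```shell
# Point to the Spark installation and put its binaries on the PATH
export SPARK_HOME=~/Applications/spark-2.1.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

# Tell pyspark to start a Jupyter notebook instead of the plain shell
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
```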

Done!
Restart the shell, source the environment and type:
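With the variables above in place:

```shell
# This now starts Jupyter with a Spark context available
pyspark
```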

You’ll be greeted by IPython starting the notebook and prompting you to open the browser; just open the link and create a new notebook. (Beware: Jupyter only gets access to the folder and subfolders you start it in, so start it where your project resides.)

That’s all, enjoy 🙂

P.S.
For your fun: 1984.txt
