June 10, 2025
When getting started with
PySpark, it can be
difficult just to get up and running. You create a fresh Python virtual
environment and pip install pyspark
as you would with any other Python
package, but then you encounter errors relating to Java version incompatibility,
environment variable misconfiguration, Py4J dependencies, and so on.
Thankfully, the Jupyter team have created
Jupyter Docker Stacks,
a set of ready-to-run Docker images containing Jupyter applications and
interactive computing tools. Their
jupyter/pyspark-notebook
image comes packed with Spark, JupyterLab, and many popular packages from the
scientific Python ecosystem (pandas, matplotlib, scikit-learn, etc.).
To use the image, make sure you have Docker running on your machine and run the following command:
docker run -p 8888:8888 quay.io/jupyter/pyspark-notebook:latest
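Note that anything you create inside the container is lost when the container is
removed. If you want your notebooks to persist, you can mount a local directory
into the image's default working directory (/home/jovyan/work in the Jupyter
Docker Stacks images), along these lines:
docker run -p 8888:8888 -v "$PWD":/home/jovyan/work quay.io/jupyter/pyspark-notebook:latest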
Docker will pull the image, which may take a minute or two. Once complete, a
container will start with a Jupyter server running. In the terminal output you
should see a localhost:8888/lab URL with a token query string.
Jupyter Server 2.16.0 is running at:
http://localhost:8888/lab?token=8110043283b6474e06d8f22cf1f78e2b79a61804552c0620
Open the URL in your browser to access JupyterLab. You can then open a new Python 3 notebook, import PySpark, and write all the Spark code your heart desires.
from datetime import date
from pyspark.sql import Row, SparkSession
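# Create (or reuse) a local SparkSession, the entry point for DataFrame operations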
spark = SparkSession.builder.getOrCreate()
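# Build a small DataFrame from a list of Row objects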
df = spark.createDataFrame([
Row(id=1, name="Tom", birthday=date(1998, 4, 7)),
Row(id=2, name="Alice", birthday=date(2002, 10, 22)),
Row(id=4, name="Brett", birthday=date(1973, 8, 16))
])
df.show()
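If everything is working, df.show() should print a small ASCII table along these lines:
+---+-----+----------+
| id| name|  birthday|
+---+-----+----------+
|  1|  Tom|1998-04-07|
|  2|Alice|2002-10-22|
|  4|Brett|1973-08-16|
+---+-----+----------+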
If you prefer using a dev container, I’ve created a minimal repo that you can try out.
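If you'd rather wire one up yourself, a devcontainer.json pointing at the same
image is most of what you need. A minimal sketch (not necessarily identical to
what's in the repo) might look like:
{
  "name": "pyspark-notebook",
  "image": "quay.io/jupyter/pyspark-notebook:latest",
  "forwardPorts": [8888]
}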