Various Docker resources for using data analytics tools like Spark, etc., in teaching. They are not intended for production use!
A standard installation of Apache Spark (2.3) that can start up either as a master or a worker node as required. Each worker is limited to a maximum of two cores (change this in `spark-defaults.conf` and rebuild the image if necessary).
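For orientation, a core cap of this kind is expressed as a line in the configuration file. A minimal sketch follows; the exact property the image sets may differ, so treat these names as illustrative rather than a copy of the repository's config:

```
# spark-defaults.conf -- illustrative core limits (actual property may differ)
spark.executor.cores  2
spark.cores.max       2
```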
A standard installation of Apache Kafka (2.2) that uses the built-in Zookeeper instance and is configured to listen for plaintext connections.
Based on the Spark image, so that the Spark libraries are available.
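A plaintext listener is normally configured in Kafka's `server.properties`. The sketch below shows the usual form; the bind address and port here are Kafka defaults combined with the `kafka` hostname from the compose file, not values copied from the image:

```
# server.properties -- plaintext listener (illustrative values)
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://kafka:9092
```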
A standard installation of PySpark that includes a PySpark kernel for Jupyter. The default user is `pyspark`, with a working directory of
It also installs the `sparkmonitor` extension, but that doesn't always seem to work properly; the project hasn't been updated since June 2018. TODO: PixieDust looks like a more robust and better-supported alternative, but requires a new kernel to be installed.
Based on the Spark image, so that the Spark libraries are available. (Kafka is not included.)
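As a quick orientation, connecting a notebook session to the cluster might look like the sketch below. The master URL is inferred from the compose setup described later; nothing here is taken verbatim from the image:

```python
# Illustrative only: open a SparkSession against the compose cluster's master
# (hostname and port as defined by the spark-master service below).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # spark-master service, port 7077
    .appName("example")
    .getOrCreate()
)
print(spark.range(5).count())  # trivial job to confirm the connection works
```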
The compose file sets up a Spark cluster and associated Kafka and PySpark instances, running on the network `spark-network`. It defines four services:
- `spark-master`: Creates a single Spark master node with the hostname `spark-master`, exposing ports 7077 and 8080.
- `spark-worker`: Creates a Spark worker node with 2 GB of memory (set by the environment variable `SPARK_WORKER_MEMORY`). Scalable as required.
- `kafka`: Creates a single Kafka node with the hostname `kafka`, using its built-in Zookeeper instance.
- `pyspark`: Creates a PySpark/Jupyter instance, exposing port 8888. Scalable as required.
All four services mount `~/tmp/sparkdata` from the host. A sketch of a compose file with this shape appears below.
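The following is a rough sketch only; image names, the container-side mount point, and other details are assumptions rather than copies of the repository's `docker-compose.yml`:

```yaml
# Illustrative sketch -- see docker-compose.yml for the real definitions.
services:
  spark-master:
    image: spark                          # assumed image name
    hostname: spark-master
    ports: ["7077:7077", "8080:8080"]
    volumes: ["~/tmp/sparkdata:/data"]    # container path assumed
    networks: [spark-network]
  spark-worker:
    image: spark                          # assumed image name
    environment:
      SPARK_WORKER_MEMORY: 2g
    volumes: ["~/tmp/sparkdata:/data"]
    networks: [spark-network]
  kafka:
    image: kafka                          # assumed image name
    hostname: kafka
    volumes: ["~/tmp/sparkdata:/data"]
    networks: [spark-network]
  pyspark:
    image: pyspark                        # assumed image name
    ports: ["8888"]                       # host port auto-assigned so it can scale
    volumes: ["~/tmp/sparkdata:/data"]
    networks: [spark-network]
networks:
  spark-network:
```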
You can, of course, run any combination of these services as desired. Examples:

- `docker compose up --scale spark-worker=2` will create a complete Spark + Kafka + PySpark stack with two Spark worker nodes.
- `docker compose up pyspark` will run a standalone PySpark instance.
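Once the stack is up, the Spark master's web UI is reachable on the exposed port 8080, and the stack is torn down in the usual way:

```shell
# The Spark master's web UI is published on port 8080:
#   http://localhost:8080

# Stop and remove the containers and the network when finished.
docker compose down
```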