Nigel Stanger authored on 17 May 2019

# docker-analytics

Various Docker resources for using data analytics tools such as Spark and Kafka in teaching. They are not intended for production use!

## Spark

A standard installation of Apache Spark that can start up as either a master or a worker node as required. Each worker is limited to a maximum of two cores (change this in `spark-defaults.conf` and rebuild the image if necessary).
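
The README does not show the exact setting used, but a typical way to cap cores in `spark-defaults.conf` looks like the fragment below. The property choice is an assumption; a standalone deployment may instead set `SPARK_WORKER_CORES` in `spark-env.sh`.

```
# spark-defaults.conf: cap each application at two cores (assumed setting)
spark.cores.max    2
```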

## Kafka

A standard installation of Apache Kafka that uses the built-in ZooKeeper instance and is configured to listen for plaintext connections.
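
A plaintext listener configuration of this kind typically lives in Kafka's `server.properties`; the values below are a sketch based on the compose hostname `kafka` and Kafka's default port, not the image's actual file.

```
# server.properties: accept unencrypted connections on the default port
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://kafka:9092
```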

## PySpark

A standard installation of PySpark that includes a PySpark kernel for Jupyter. The default user is `pyspark`, with a working directory of `/home/pyspark/work`.

It also installs the sparkmonitor extension, but that does not always seem to work properly; the project has not been updated since June 2018. TODO: PixieDust looks like a more robust and better-supported alternative, but requires a new kernel to be installed.
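
As a sketch of how a notebook user might connect to the cluster from this container: the master URL below assumes the compose hostname `spark-master` and Spark's default standalone port 7077, and the helper name is hypothetical.

```python
# Hypothetical notebook cell: open a SparkSession against the cluster's
# standalone master, addressed by its compose hostname.
MASTER_URL = "spark://spark-master:7077"

def build_session(master=MASTER_URL, app_name="example"):
    """Create (or reuse) a SparkSession bound to the given master."""
    # Imported lazily so this snippet can be read without pyspark installed.
    from pyspark.sql import SparkSession
    return (
        SparkSession.builder
        .master(master)
        .appName(app_name)
        .getOrCreate()
    )
```

Note that a notebook running the PySpark kernel may already provide a ready-made `spark` session, in which case no helper like this is needed.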

## Compose

The compose file sets up a Spark cluster and associated Kafka and PySpark instances, running on the network `spark-network`. It defines four services:

- `spark-master`: Creates a single Spark master node with the hostname `spark-master`, exposing ports 7077 and 8080.
- `spark-worker`: Creates a Spark worker node with 2 GB of memory (set by the environment variable `SPARK_WORKER_MEMORY`). Can be scaled as required.
- `kafka`: Creates a single Kafka node with the hostname `kafka`, using its built-in ZooKeeper instance.
- `pyspark`: Creates a PySpark/Jupyter instance, exposing port 8888. Can be scaled as required.
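
The stack can be started and scaled with Docker Compose; for example (service names are taken from the compose file, and the worker count is illustrative):

```
# Start the whole stack in the background
docker-compose up -d

# Scale out to three Spark workers, leaving everything else as-is
docker-compose up -d --scale spark-worker=3
```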

All four services mount `~/tmp/sparkdata` on the host at `/mnt/sparkdata` inside the container.
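
In `docker-compose.yml`, that shared bind mount would appear on each service roughly as follows (a sketch; the actual file may differ):

```yaml
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata
```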