nigel.stanger / docker-analytics
Added docker-compose examples
Also included Spark and Kafka version numbers.
commit f92647b7907f669972382dd922e11e45e56e73ad (1 parent 9a1e20c)
Nigel Stanger authored on 17 May 2019
Showing 1 changed file: README.md
# docker-analytics

Various Docker resources for using data analytics tools such as Spark and Kafka in teaching. They are *not* intended for production use!

## Spark

A standard installation of Apache Spark (2.3) that can start up either as a master or a worker node as required. Each worker is limited to a maximum of two cores (change this in `spark-defaults.conf` and rebuild the image if necessary).

## Kafka

A standard installation of Apache Kafka (2.2) that uses the built-in Zookeeper instance and is configured to listen for plain text. It is based on the Spark image, so that the Spark libraries are available.

## PySpark

A standard installation of PySpark that includes a PySpark kernel for Jupyter. The default user is `pyspark`, with a working directory of `/home/pyspark/work`. It also installs the [sparkmonitor](https://github.com/krishnan-r/sparkmonitor) extension, but that doesn’t always seem to work properly. The project hasn’t been updated since June 2018. **TODO:** [PixieDust](https://github.com/pixiedust/pixiedust) looks like a more robust and supported solution, but requires a new kernel to be installed.

It is based on the Spark image, so that the Spark libraries are available. (Kafka is *not* included.)

## Compose

The compose file sets up a Spark cluster and associated Kafka and PySpark instances, running on the network `spark-network`. It defines four services:

* `spark-master`: Creates a single Spark master node with the hostname `spark-master`, exposing ports 7077 and 8080.
* `spark-worker`: Creates a Spark worker node with 2 GB of memory (set by the environment variable `SPARK_WORKER_MEMORY`). Scalable as required.
* `kafka`: Creates a single Kafka node with the hostname `kafka`, using its built-in Zookeeper instance.
* `pyspark`: Creates a PySpark/Jupyter instance, exposing port 8888. Scalable as required.

All four services map `/mnt/sparkdata` in the container to `~/tmp/sparkdata` on the host. You can, of course, run any combination of these services as desired. Examples:

* `docker-compose up --scale spark-worker=2` will create a complete Spark + Kafka + PySpark stack with two Spark worker nodes.
* `docker-compose up pyspark` will run a standalone PySpark instance.
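For reference, below is a minimal sketch of what such a compose file might look like. The service names, the `SPARK_WORKER_MEMORY` setting, the published ports and the `~/tmp/sparkdata` mapping follow the description above; the image names, the compose file version and the Kafka listener port (9092) are assumptions and may differ from the actual `docker-compose.yml`.

```yaml
# Hypothetical sketch only -- image names, compose version and the Kafka
# port are assumptions; the real docker-compose.yml may differ.
version: "3"

services:
  spark-master:
    image: docker-analytics/spark      # assumed image name
    hostname: spark-master
    ports:
      - "7077:7077"                    # Spark master port
      - "8080:8080"                    # master web UI
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata
    networks:
      - spark-network

  spark-worker:
    image: docker-analytics/spark      # assumed image name
    environment:
      - SPARK_WORKER_MEMORY=2g
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata
    networks:
      - spark-network

  kafka:
    image: docker-analytics/kafka      # assumed image name
    hostname: kafka
    ports:
      - "9092:9092"                    # assumed plaintext listener port
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata
    networks:
      - spark-network

  pyspark:
    image: docker-analytics/pyspark    # assumed image name
    ports:
      - "8888:8888"                    # Jupyter notebook
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata
    networks:
      - spark-network

networks:
  spark-network:
```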