diff --git a/README.md b/README.md
index 1e4b8f1..781cdd3 100644
--- a/README.md
+++ b/README.md
@@ -4,11 +4,13 @@
 
 ## Spark
 
-A standard installation of Apache Spark that can start up either as a master or a worker node as required. Each worker is limited to a maximum of two cores (change this in `spark-defaults.conf` and rebuild the image if necessary).
+A standard installation of Apache Spark (2.3) that can start up either as a master or a worker node as required. Each worker is limited to a maximum of two cores (change this in `spark-defaults.conf` and rebuild the image if necessary).
 
 ## Kafka
 
-A standard installation of Apache Kafka that uses the built-in Zookeeper instance, and is configured to listen for plain text.
+A standard installation of Apache Kafka (2.2) that uses the built-in Zookeeper instance, and is configured to listen for plain text.
+
+Based on the Spark image, so that the Spark libraries are available.
 
 ## PySpark
 
@@ -16,6 +18,8 @@
 
 It also installs the [sparkmonitor](https://github.com/krishnan-r/sparkmonitor) extension, but that doesn’t always seem to work properly. The project hasn’t been updated since June 2018. **TODO:** [PixieDust](https://github.com/pixiedust/pixiedust) looks like a more robust and supported solution, but requires a new kernel to be installed.
 
+Based on the Spark image, so that the Spark libraries are available. (Kafka is *not* included.)
+
 ## Compose
 
 The compose file sets up a Spark cluster and associated Kafka and PySpark instances, running on the network `spark-network`. It defines four services:
@@ -26,3 +30,8 @@
 * `pyspark`: Creates a PySpark/Jupyter instance, exposing port 8888. Scalable as required.
 
 All four services map `/mnt/sparkdata` to `~/tmp/sparkdata` on the host.
+
+You can, of course, run any combination of these services as desired. Examples:
+
+* `docker-compose up --scale spark-worker=2` will create a complete Spark + Kafka + PySpark stack with two Spark worker nodes.
+* `docker-compose up pyspark` will run a standalone PySpark instance.
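
For reference, a minimal sketch of what a compose file matching the description above could look like. Only the `spark-worker` and `pyspark` service names, the `spark-network` network, port 8888, and the `~/tmp/sparkdata` to `/mnt/sparkdata` mapping come from the README text; the `spark-master` and `kafka` service names, the image names, and the start-up commands are assumptions for illustration, not the repository's actual file.

```yaml
# Hypothetical sketch only. Image names, the spark-master/kafka service names,
# and the start-up commands are assumptions for illustration.
version: "3"

services:
  spark-master:
    image: spark                      # assumed local image name
    command: master                   # assumed: the image starts as master or worker on request
    networks: [spark-network]
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata

  spark-worker:
    image: spark                      # assumed local image name
    command: worker                   # assumed
    depends_on: [spark-master]
    networks: [spark-network]
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata

  kafka:
    image: kafka                      # assumed: built on the Spark image
    networks: [spark-network]
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata

  pyspark:
    image: pyspark                    # assumed: built on the Spark image
    ports:
      - "8888:8888"                   # Jupyter, as described above
    networks: [spark-network]
    volumes:
      - ~/tmp/sparkdata:/mnt/sparkdata

networks:
  spark-network:
```

With a file along these lines, the scaling example above (`docker-compose up --scale spark-worker=2`) would bring up one master, two workers, Kafka, and a PySpark/Jupyter instance, all attached to `spark-network` and sharing the `/mnt/sparkdata` mount.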