docker-analytics/examples/README.md at f93eab3191af1a422566393237927c542e0ac832

nigel.stanger / docker-analytics

Nigel Stanger on 17 May 2019 2 KB Added relevant documentation links

# Examples

## “Hello world”

This is a basic “hello world” demonstration of Spark Structured Streaming using an Apache Kafka data source.

The script `sample_producer.py` repeatedly sends the string “Hello *n*” to the `sample` topic in Kafka, where *n* is an incrementing sequence number.
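A producer of this kind can be sketched as follows. This is only an illustration, not the contents of `sample_producer.py`: it assumes the kafka-python client, and the broker address `localhost:9092`, the one-second delay, and the function names are all hypothetical.

```python
import itertools
import time


def hello_messages(start=0):
    """Yield b"Hello 0", b"Hello 1", ... as UTF-8 encoded bytes."""
    for n in itertools.count(start):
        yield f"Hello {n}".encode("utf-8")


def run_producer(bootstrap_servers="localhost:9092", topic="sample", delay=1.0):
    """Send an endless stream of "Hello n" messages to a Kafka topic.

    The broker address and delay are assumed values; adjust them to
    match your own Kafka setup.
    """
    # Imported here so hello_messages() is usable without kafka-python.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    try:
        for message in hello_messages():
            producer.send(topic, value=message)
            time.sleep(delay)
    finally:
        # Make sure buffered messages are delivered before exiting.
        producer.flush()
        producer.close()
```

Calling `run_producer()` starts the loop, which runs until interrupted.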

The Jupyter notebook `sample-consumer.ipynb` subscribes to the `sample` topic and displays the results of the query.
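The consuming side might look roughly like the sketch below. It assumes an existing `SparkSession` with the Spark Kafka integration package on its classpath and a broker at `localhost:9092`; the function names and the choice of the console sink are illustrative and not taken from the notebook.

```python
def kafka_source_options(bootstrap_servers="localhost:9092", topic="sample"):
    """Options for Spark's built-in "kafka" streaming source.

    The default broker address is an assumption, not the project's setting.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_console_query(spark, options):
    """Read the topic as a stream and print each message to the console."""
    messages = (
        spark.readStream
        .format("kafka")
        .options(**options)
        .load()
        # Kafka delivers keys and values as binary; cast the value to text.
        .selectExpr("CAST(value AS STRING) AS message")
    )
    return (
        messages.writeStream
        .format("console")
        .outputMode("append")
        .start()
    )
```

With a session named `spark`, `query = start_console_query(spark, kafka_source_options())` starts the stream and `query.stop()` ends it. The console sink writes to the driver's standard output, so a notebook may prefer a different sink for display.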

## Clickstream

This is a clickstream processing demo using Apache Kafka and Spark Structured Streaming, based on the original Scala version described at [Clickstream Analysis using Apache Spark and Apache Kafka](https://github.com/IBM/kafka-streaming-click-analysis) (IBM).

The clickstream data is from the [Wikipedia Clickstream](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream) project, and is streamed line-by-line by the script `kafka_producer.py` into the `clickstream` topic in Kafka. Each line comprises four tab-separated values: the previous page visited (`prev`), the current page (`curr`), the type of referral (`type`), and the number of clicks for that navigation path (`n`). The output is a rank-ordered list of Wikipedia pages with the most hits.
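The ranking itself can be illustrated in plain Python, without Kafka or Spark. This sketch is not the demo's implementation, which performs the aggregation as a streaming query; the function names and the sample rows are made up for illustration.

```python
from collections import Counter


def parse_line(line):
    """Split one clickstream line into (prev, curr, type, n)."""
    prev, curr, link_type, n = line.rstrip("\n").split("\t")
    return prev, curr, link_type, int(n)


def top_pages(lines, limit=10):
    """Return the `limit` pages with the most clicks, highest first."""
    hits = Counter()
    for line in lines:
        _prev, curr, _type, n = parse_line(line)
        hits[curr] += n  # total clicks arriving at each current page
    return hits.most_common(limit)


# Illustrative rows only, not real clickstream data.
sample = [
    "other-search\tMain_Page\texternal\t100\n",
    "Main_Page\tPython\tlink\t30\n",
    "other-search\tPython\texternal\t50\n",
]
print(top_pages(sample))  # [('Main_Page', 100), ('Python', 80)]
```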

The example uses the January 2017 dump (`2017_01_en_clickstream.tsv.gz`) from the original Wikipedia Clickstream data set ([doi:10.6084/m9.figshare.1305770](https://doi.org/10.6084/m9.figshare.1305770)), but should work with later dumps available from <https://dumps.wikimedia.org/other/clickstream/>.

## Resources

* [kafka-python module](https://kafka-python.readthedocs.io/) documentation
* [pyspark.sql module](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html) documentation
* [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
* [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
* [Structured Streaming + Kafka Integration Guide](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
* [Spark Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)
* [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html)