Nigel Stanger authored on 17 May 2019

Examples

“Hello world”

This is a basic “hello world” demonstration of Apache Spark Structured Streaming using an Apache Kafka data source.

The script sample_producer.py repeatedly sends the string “Hello n” to the sample topic in Kafka, where n is an incrementing sequence number.
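As a rough illustration of what such a producer might look like, here is a minimal sketch. It is not the contents of sample_producer.py: the function names, the `localhost:9092` broker address, the one-second delay, and the use of the kafka-python client are all assumptions made for this example.

```python
import itertools
import time


def hello_messages(start=1):
    """Yield the UTF-8 payloads b"Hello 1", b"Hello 2", ... indefinitely."""
    for n in itertools.count(start):
        yield f"Hello {n}".encode("utf-8")


def send_hello(producer, topic="sample", limit=None, delay=0.0):
    """Send "Hello n" messages to `topic`.

    `producer` is any object with a send(topic, value) method, such as
    kafka-python's KafkaProducer. `limit=None` means send forever.
    Returns the number of messages sent.
    """
    sent = 0
    for payload in itertools.islice(hello_messages(), limit):
        producer.send(topic, payload)
        sent += 1
        if delay:
            time.sleep(delay)
    return sent


def main():
    # Assumption: kafka-python is installed and a broker is reachable
    # at localhost:9092; adjust the address for your own setup.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    send_hello(producer, topic="sample", delay=1.0)
```

Calling `main()` would then send one message per second until interrupted. Passing the producer in as an argument keeps the message-generation logic independent of the Kafka client.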

The Jupyter notebook sample_consumer.ipynb subscribes to the sample topic and displays the results of the query.
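The consumer side might be sketched as follows. This is an illustration rather than the notebook's actual code: the helper names, the `localhost:9092` broker address, and the choice of a console sink are assumptions. It also assumes that `spark` is an existing SparkSession started with the Spark Kafka integration package (spark-sql-kafka) on its classpath.

```python
def kafka_source_options(topic="sample", servers="localhost:9092"):
    """Build the options for Spark's Kafka source (values are assumptions)."""
    return {
        "kafka.bootstrap.servers": servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_console_query(spark, topic="sample", servers="localhost:9092"):
    """Subscribe to `topic` and print each message value to the console.

    Returns the running StreamingQuery so the caller can stop it later.
    """
    messages = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(topic, servers))
        .load()
        # Kafka delivers values as bytes; cast them to readable strings.
        .selectExpr("CAST(value AS STRING) AS message")
    )
    return messages.writeStream.format("console").outputMode("append").start()
```

In a notebook, a memory sink queried with SQL may be more convenient than the console sink, since console output goes to the Spark driver's standard output rather than to the notebook cell.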

Clickstream

This is a clickstream processing demo using Apache Kafka and Spark Structured Streaming, based on the original Scala version described at Clickstream Analysis using Apache Spark and Apache Kafka (IBM).

The clickstream data is from the Wikipedia Clickstream project, and is streamed line-by-line by the script clickstream_producer.py into the clickstream topic in Kafka. Each line comprises four tab-separated values: the previous page visited (prev), the current page (curr), the type of page (type), and the number of clicks for that navigation path (n). The output is a rank-ordered list of Wikipedia pages with the most hits.
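The record format and the ranking can be illustrated in plain Python, without Spark or Kafka. This is only a sketch of the logic, not the streaming query used in the notebook; the function names and the sample records are made up for this example.

```python
from collections import Counter


def parse_line(line):
    """Split one tab-separated record into (prev, curr, type, n)."""
    prev, curr, kind, n = line.rstrip("\n").split("\t")
    return prev, curr, kind, int(n)


def top_pages(lines, k=10):
    """Sum the click counts per current page and return the k most-hit pages."""
    hits = Counter()
    for line in lines:
        _prev, curr, _kind, n = parse_line(line)
        hits[curr] += n
    return hits.most_common(k)


# Hypothetical records, for illustration only (not real clickstream data).
sample = [
    "other-search\tMain_Page\texternal\t10",
    "Main_Page\tPython\tlink\t4",
    "other-empty\tPython\texternal\t3",
    "Python\tMain_Page\tlink\t2",
]
print(top_pages(sample, k=2))  # [('Main_Page', 12), ('Python', 7)]
```

In the streaming version the same aggregation is incremental: the ranking is updated as new lines arrive on the topic rather than computed once over a complete file.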

The example uses the January 2017 dump (2017_01_en_clickstream.tsv.gz) from the original Wikipedia Clickstream data set (doi:10.6084/m9.figshare.1305770), but should work with later dumps available from https://dumps.wikimedia.org/other/clickstream/.

Resources