This is a basic “hello world” demonstration of Apache Spark Structured Streaming using an Apache Kafka data source.
The script sample_producer.py repeatedly sends the string “Hello n” to the sample topic in Kafka, where n is an incrementing sequence number.
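As a rough illustration of what such a producer might look like, the sketch below generates the “Hello n” messages and sends them to Kafka. It is not the contents of sample_producer.py: the use of the kafka-python client, the broker address `localhost:9092`, the one-second delay, and the function names `hello_messages` and `run_producer` are all assumptions made for this example.

```python
import itertools
import time


def hello_messages(start=0):
    """Yield b'Hello 0', b'Hello 1', ... as UTF-8 encoded bytes, forever."""
    for n in itertools.count(start):
        yield f"Hello {n}".encode("utf-8")


def run_producer(bootstrap_servers="localhost:9092", topic="sample",
                 delay=1.0, limit=None):
    """Send 'Hello n' messages to a Kafka topic.

    Assumptions: the kafka-python package is installed and a broker is
    reachable at bootstrap_servers. limit=None means "send forever".
    """
    # Imported here so the message generator can be used without Kafka.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    try:
        for message in itertools.islice(hello_messages(), limit):
            producer.send(topic, message)
            time.sleep(delay)
    finally:
        # Make sure buffered messages are delivered before exiting.
        producer.flush()
        producer.close()
```

Calling `run_producer(limit=5)` would send five messages and then stop, which can be handy for a quick smoke test against a local broker.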
The Jupyter notebook sample_consumer.ipynb subscribes to the sample topic and displays the results of the query.
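For orientation, a consumer along these lines could be sketched in PySpark as follows. This is an assumption-laden outline rather than the notebook's actual code: the broker address, the `startingOffsets` setting, the console sink, and the helper names `kafka_source_options` and `start_console_query` are hypothetical, and the Spark session must be started with the `spark-sql-kafka` connector package available.

```python
def kafka_source_options(topic="sample", bootstrap_servers="localhost:9092"):
    """Build the options for Spark's Kafka source (values are assumptions)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_console_query(options):
    """Subscribe to the topic and print each message to the console.

    Assumes pyspark is installed and the Kafka connector is on the classpath.
    """
    # Imported here so the options helper works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sample-consumer").getOrCreate()

    reader = spark.readStream.format("kafka")
    for key, value in options.items():
        reader = reader.option(key, value)
    stream = reader.load()

    # Kafka delivers the payload as binary, so cast it to a string.
    messages = stream.selectExpr("CAST(value AS STRING) AS message")

    return messages.writeStream.format("console").outputMode("append").start()
```

In a notebook, a memory or display sink may be more convenient than the console sink used here; the console sink was chosen only to keep the sketch short.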
This is a clickstream processing demo using Apache Kafka and Spark Structured Streaming, based on the original Scala version described at Clickstream Analysis using Apache Spark and Apache Kafka (IBM).
The clickstream data is from the Wikipedia Clickstream project, and is streamed line-by-line by the script clickstream_producer.py into the clickstream topic in Kafka. Each line comprises four tab-separated values: the previous page visited (prev), the current page (curr), the type of page (type), and the number of clicks for that navigation path (n). The output is a rank-ordered list of Wikipedia pages with the most hits.
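The record format and the ranking step can be illustrated without Kafka or Spark. The following plain-Python sketch parses lines in the four-column layout described above and sums the clicks per current page; it is a stand-in for the streaming query, not the notebook's implementation, and the names `Click`, `parse_line`, and `top_pages` are invented for this example.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Click(NamedTuple):
    prev: str   # previous page visited
    curr: str   # current page
    type: str   # type of page
    n: int      # number of clicks for this navigation path


def parse_line(line: str) -> Click:
    """Split one tab-separated clickstream line into its four fields."""
    prev, curr, link_type, n = line.rstrip("\n").split("\t")
    return Click(prev, curr, link_type, int(n))


def top_pages(lines: Iterable[str], k: int = 10) -> List[Tuple[str, int]]:
    """Return the k current pages with the most clicks, highest first."""
    hits = Counter()
    for line in lines:
        click = parse_line(line)
        hits[click.curr] += click.n
    return hits.most_common(k)
```

In the streaming version, the same aggregation would be expressed as a grouped sum over `curr` followed by a descending sort, with Spark updating the result as new lines arrive.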
The example uses the January 2017 dump (2017_01_en_clickstream.tsv.gz) from the original Wikipedia Clickstream data set (doi:10.6084/m9.figshare.1305770), but should work with later dumps available from https://dumps.wikimedia.org/other/clickstream/.