diff --git a/examples/README.md b/examples/README.md
new file mode 100644
index 0000000..377e648
--- /dev/null
+++ b/examples/README.md
@@ -0,0 +1,17 @@
+# Examples
+
+## “Hello world”
+
+This is a basic “hello world” demonstration of Apache Spark Structured Streaming using an Apache Kafka data source.
+
+The script `sample_producer.py` repeatedly sends the string “Hello *n*” to the `sample` topic in Kafka, where *n* is an incrementing sequence number.
+
+The Jupyter notebook `sample-consumer.ipynb` subscribes to the `sample` topic and displays the results of the query.
+
+## Clickstream
+
+This is a clickstream processing demo using Apache Kafka and Spark Structured Streaming, based on the original Scala version described at [Clickstream Analysis using Apache Spark and Apache Kafka](https://github.com/IBM/kafka-streaming-click-analysis) (IBM).
+
+The clickstream data is from the [Wikipedia Clickstream](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream) project, and is streamed line by line by the script `kafka_producer.py` into the `clickstream` topic in Kafka. Each line comprises four tab-separated values: the previous page visited (`prev`), the current page (`curr`), the type of page (`type`), and the number of clicks for that navigation path (`n`). The output is a rank-ordered list of Wikipedia pages with the most hits.
+
+The example uses the January 2017 dump (`2017_01_en_clickstream.tsv.gz`) from the original Wikipedia Clickstream data set ([doi:10.6084/m9.figshare.1305770](https://doi.org/10.6084/m9.figshare.1305770)), but should work with later dumps available from .
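
Review note: the four-field line format and the ranked output described in the Clickstream section can be illustrated without Kafka or Spark. The sketch below is plain Python, not the notebook's actual Structured Streaming query. The names `Click`, `parse_line`, and `top_pages` and the sample rows are hypothetical, and it assumes that "most hits" means summing `n` per `curr` page.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Click(NamedTuple):
    prev: str  # previous page visited
    curr: str  # current page
    type: str  # value of the `type` column
    n: int     # number of clicks for this prev -> curr path


def parse_line(line: str) -> Click:
    """Split one tab-separated clickstream line into its four fields."""
    prev, curr, type_, n = line.rstrip("\n").split("\t")
    return Click(prev, curr, type_, int(n))


def top_pages(lines: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Sum clicks per current page (an assumed reading of "most hits")
    and return the highest-ranked pages first."""
    hits = Counter()
    for line in lines:
        click = parse_line(line)
        hits[click.curr] += click.n
    return hits.most_common(limit)


# Hypothetical sample rows, not taken from the real dump.
sample = [
    "other-search\tApache_Kafka\texternal\t120",
    "Apache_Spark\tApache_Kafka\tlink\t30",
    "other-search\tApache_Spark\texternal\t100",
]
print(top_pages(sample))  # [('Apache_Kafka', 150), ('Apache_Spark', 100)]
```

In the streaming demo the same aggregation would presumably be expressed as a group-by on `curr` over the Kafka source; the batch version here is only meant to make the record layout concrete.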