diff --git a/examples/README.md b/examples/README.md
index 377e648..a59a0f7 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -15,3 +15,13 @@
 The clickstream data is from the [Wikipedia Clickstream](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream) project, and is streamed line-by-line by the script `kafka_producer.py` into the `clickstream` topic in Kafka. Each line comprises four tab-separated values: the previous page visited (`prev`), the current page (`curr`), the type of page (`type`), and the number of clicks for that navigation path (`n`). The output is a rank-ordered list of Wikipedia pages with the most hits.
 
 The example uses the January 2017 dump (`2017_01_en_clickstream.tsv.gz`) from the original Wikipedia Clickstream data set ([doi:10.6084/m9.figshare.1305770](https://doi.org/10.6084/m9.figshare.1305770)), but should work with later dumps available from .
+
+## Resources
+
+* [kafka-python module](https://kafka-python.readthedocs.io/) documentation
+* [pyspark.sql module](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html) documentation
+* [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
+* [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
+* [Structured Streaming + Kafka Integration Guide](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
+* [Spark Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)
+* [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html)
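
The README paragraph in this hunk describes each record as four tab-separated fields (`prev`, `curr`, `type`, `n`) and the output as a ranking of the pages with the most hits. As a rough offline illustration of that record format, the sketch below parses a few made-up lines and ranks pages in plain Python. It is not the repository's Kafka or Spark code: the names `Click`, `parse_line`, and `top_pages` are hypothetical, the sample rows are invented, and summing `n` per `curr` page is an assumption about what "most hits" means.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Click(NamedTuple):
    """One clickstream record, using the field names from the README."""
    prev: str
    curr: str
    type: str
    n: int


def parse_line(line: str) -> Click:
    # Each record is four tab-separated values: prev, curr, type, n.
    prev, curr, link_type, n = line.rstrip("\n").split("\t")
    return Click(prev, curr, link_type, int(n))


def top_pages(lines: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    # Assumption: a page's "hits" are the sum of n over all rows
    # where that page appears as the current page (curr).
    hits: Counter = Counter()
    for line in lines:
        click = parse_line(line)
        hits[click.curr] += click.n
    return hits.most_common(limit)


# Invented sample rows, for illustration only.
sample = [
    "other-search\tPython_(programming_language)\texternal\t120\n",
    "Main_Page\tApache_Kafka\tlink\t30\n",
    "Apache_Spark\tApache_Kafka\tlink\t45\n",
]
ranking = top_pages(sample)
print(ranking)
```

The same grouping and ordering would presumably be expressed in the streaming job as an aggregation over `curr`, but that is outside what this diff shows.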