nigel.stanger/docker-analytics
Added relevant documentation links
master
spark3
1 parent f92647b, commit f93eab3191af1a422566393237927c542e0ac832
Nigel Stanger authored on 17 May 2019
Showing 1 changed file: examples/README.md
# Examples

## “Hello world”

This is a basic “hello world” demonstration of Apache Spark Structured Streaming using an Apache Kafka data source. The script `sample_producer.py` repeatedly sends the string “Hello *n*” to the `sample` topic in Kafka, where *n* is an incrementing sequence number. The Jupyter notebook `sample-consumer.ipynb` subscribes to the `sample` topic and displays the results of the query.

## Clickstream

This is a clickstream processing demo using Apache Kafka and Spark Structured Streaming, based on the original Scala version described at [Clickstream Analysis using Apache Spark and Apache Kafka](https://github.com/IBM/kafka-streaming-click-analysis) (IBM).

The clickstream data is from the [Wikipedia Clickstream](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream) project, and is streamed line by line by the script `kafka_producer.py` into the `clickstream` topic in Kafka. Each line comprises four tab-separated values: the previous page visited (`prev`), the current page (`curr`), the type of page (`type`), and the number of clicks for that navigation path (`n`). The output is a rank-ordered list of Wikipedia pages with the most hits.

The example uses the January 2017 dump (`2017_01_en_clickstream.tsv.gz`) from the original Wikipedia Clickstream data set ([doi:10.6084/m9.figshare.1305770](https://doi.org/10.6084/m9.figshare.1305770)), but should work with later dumps available from <https://dumps.wikimedia.org/other/clickstream/>.
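The record format and ranking described above can be sketched in plain Python, without Kafka or Spark. This is only an illustration, not the code used by the demo: the `Click` type, the `parse_line` and `top_pages` helpers, and the sample records are all hypothetical, and it assumes that the “hits” for a page are the sum of `n` over all lines sharing the same `curr`.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, NamedTuple


class Click(NamedTuple):
    """One clickstream record (hypothetical helper type)."""
    prev: str   # previous page visited
    curr: str   # current page
    type: str   # type of page
    n: int      # number of clicks for this navigation path


def parse_line(line: str) -> Click:
    """Split one tab-separated clickstream line into its four fields."""
    prev, curr, page_type, n = line.rstrip("\n").split("\t")
    return Click(prev, curr, page_type, int(n))


def top_pages(lines: Iterable[str], limit: int = 10) -> list[tuple[str, int]]:
    """Sum clicks per current page and return the pages with the most hits.

    Assumption: hits for a page are the total of `n` over all lines
    with that page as `curr`.
    """
    hits: Counter[str] = Counter()
    for line in lines:
        click = parse_line(line)
        hits[click.curr] += click.n
    return hits.most_common(limit)


# Made-up sample records for illustration; not taken from the real data set.
sample = [
    "other-google\tMain_Page\texternal\t120\n",
    "Main_Page\tApache_Kafka\tlink\t30\n",
    "other-empty\tApache_Kafka\texternal\t45\n",
    "Apache_Kafka\tApache_Spark\tlink\t50\n",
]

ranking = top_pages(sample)
for page, total in ranking:
    print(f"{page}\t{total}")
```

In the streaming version the same grouping and ordering would be expressed as a Structured Streaming query over the `clickstream` topic rather than a loop over a list.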
## Resources

* [kafka-python module](https://kafka-python.readthedocs.io/) documentation
* [pyspark.sql module](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html) documentation
* [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
* [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
* [Structured Streaming + Kafka Integration Guide](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
* [Spark Standalone Mode](https://spark.apache.org/docs/latest/spark-standalone.html)
* [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html)