Goal: Use Vagrant to create a four Virtual Machine (VM) cluster (three running Kafka, one running Spark) to experiment with Spark Streaming and Kafka.
- Spark Streaming v2.2.0
- Kafka v1.1.0
- Vagrant (tested with 1.9.1)
- VirtualBox (tested with 5.1.12 r112440)
First install VirtualBox and Vagrant.
Clone this repository
git clone https://github.com/imarios/spark-steaming-with-kafka.git
Create and provision the VMs
cd spark-steaming-with-kafka
vagrant up
Note: the above can take several minutes the first time it runs (it needs to download the Ubuntu Vagrant box and install the required packages on each VM). Starting and stopping the VMs afterwards will be significantly faster.
When all VMs are ready
./start-all-in-cluster.sh
This will start a three-node ZooKeeper quorum and a three-broker Kafka cluster on the same three nodes.
vagrant ssh broker1 -c "jps"
> 17235 Jps
> 15605 QuorumPeerMain
> 16134 Kafka
Note: lines that start with ">" are shell output.
We can see that each of the nodes is running a ZooKeeper server and a Kafka broker.
To stop both Kafka and ZooKeeper
./stop-all-in-cluster.sh
To suspend the VirtualBox VMs (so you can resume later)
vagrant suspend
To resume
vagrant resume
# or
vagrant up
To delete the VMs and all their files
vagrant destroy
Create a topic called "t1" with a replication factor of 2 and 3 partitions. You can use the standard scripts that ship with Kafka or the script (create_topic.sh) provided here.
vagrant ssh spark1
/vagrant/scripts/create_topic.sh t1 2 3
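If you prefer to do this programmatically, a minimal sketch using Kafka's AdminClient API would look something like the following (the broker address is an assumption; the provided create_topic.sh script remains the intended path):
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // assumed hostname:port of one broker

val admin = AdminClient.create(props)
// Topic "t1" with 3 partitions and a replication factor of 2, mirroring create_topic.sh t1 2 3
admin.createTopics(Collections.singleton(new NewTopic("t1", 3, 2.toShort))).all().get()
admin.close()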
Verify that the topic is created
# Assumes you are in any of the 4 VMs
/vagrant/scripts/list_topics.sh
> __consumer_offsets
> t1
Add events to the topic (producer.sh)
/vagrant/scripts/producer.sh t1
a
b
c
# ctrl-C to exit
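For comparison, a minimal sketch of sending the same three events programmatically with Kafka's producer API (the broker address and client settings below are assumptions):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // assumed hostname:port
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
// Send the same three keyless events; without a key, the partitioner spreads them across partitions.
Seq("a", "b", "c").foreach(v => producer.send(new ProducerRecord[String, String]("t1", v)))
producer.close()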
Read the events back (consumer.sh)
/vagrant/scripts/consumer.sh t1
> a
> c
> b
# ctrl-C to exit
> Processed a total of 3 messages
Note that events for the entire topic may come out-of-order. Kafka preserves order only within a single partition.
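To see this concretely, here is a minimal consumer sketch that prints the partition and offset alongside each value (broker address and consumer group are assumptions); values read from the same partition always come back in the order they were written:
import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // assumed hostname:port
props.put("group.id", "readme-demo")           // hypothetical consumer group
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)
props.put("auto.offset.reset", "earliest")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("t1"))
// Ordering is only guaranteed among records that share the same partition.
consumer.poll(5000L).asScala.foreach { r =>
  println(s"partition=${r.partition} offset=${r.offset} value=${r.value}")
}
consumer.close()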
Check the offsets of each partition (show_offsets.sh)
/vagrant/scripts/show_offsets.sh t1
> t1:2:1
> t1:1:1
> t1:0:1
Each line has the form topic:partition:offset, so each partition holds exactly one event. Because the events were sent without keys, the producer's partitioner spread them evenly across the three partitions of the "t1" topic.
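The same numbers can be fetched programmatically with the consumer's endOffsets call; a minimal sketch (the broker address is an assumption, and show_offsets.sh itself may use a different mechanism):
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // assumed hostname:port
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
val partitions = consumer.partitionsFor("t1").asScala.map(p => new TopicPartition("t1", p.partition))
// Print topic:partition:offset, matching the show_offsets.sh output above.
consumer.endOffsets(partitions.asJava).asScala.foreach { case (tp, offset) =>
  println(s"${tp.topic}:${tp.partition}:$offset")
}
consumer.close()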
Connect to the Spark VM
vagrant ssh spark1
Start the Spark shell with the proper Kafka dependencies. On start-up, the shell pre-loads the commands from streaming.scala.
/vagrant/scripts/run_spark_streaming.sh t1
When the shell finishes loading, implement the most basic consumer: one that counts the events in each 10-second interval.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
scala> myStream.count.print
scala> ssc.start
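For reference, here is a rough sketch of the kind of setup streaming.scala presumably pre-loads so that myStream and ssc exist in the shell (the broker addresses, consumer group, and 10-second batch interval are assumptions):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(sc, Seconds(10)) // 10-second batches, reusing the shell's SparkContext

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092,broker2:9092,broker3:9092", // assumed broker addresses
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming-demo", // hypothetical consumer group
  "auto.offset.reset" -> "latest"
)

// Direct stream over the topic passed to run_spark_streaming.sh; keep only the message values.
val myStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("t1"), kafkaParams)
).map(_.value)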
From a separate terminal, connect to any of the three broker VMs.
vagrant ssh broker2
Run auto_producer.sh with a rate of 4k characters per second (each event is 2 characters, which produces a stream of 2k events per second).
/vagrant/scripts/auto_producer.sh t1 4k
Going back to your Spark Streaming shell, you should see something similar to this:
-------------------------------------------
Time: 1483680230000 ms
-------------------------------------------
21060
-------------------------------------------
Time: 1483680240000 ms
-------------------------------------------
19656
-------------------------------------------
Time: 1483680250000 ms
-------------------------------------------
21060
...
You can enjoy the Spark Streaming UI at http://localhost:4041/streaming/. Feel free to kill the Spark shell and the producer when you get bored of this :).
Your testbed is now ready for you to experiment. Here are some suggestions:
- Try more complex Spark Streaming statements. Search the docs for inspiration (see the sketch after this list).
- Try starting auto_producer.sh from all three brokers for the same topic. Raise the rates and see how Spark handles the increased load.
- While at least one producer is running, try shutting down one of the other two broker VMs. Observe how Spark Streaming and Kafka handle the change.
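As one example of a more complex statement, here is a sketch of a windowed count per value. It assumes the same myStream and ssc as in the session above and a 10-second batch interval, and it should be run in a fresh shell session, since a StreamingContext cannot be modified after it has been started.
import org.apache.spark.streaming.Seconds

// Count occurrences of each value over a sliding 30-second window, recomputed every 10 seconds.
val windowedCounts = myStream
  .map(v => (v, 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()
ssc.start()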