| services | platforms | author |
|---|---|---|
| hdinsight | java | blackmist |
Learn how to use Azure Event Hubs with Storm on HDInsight. This example uses Java-based components to read and write data in Azure Event Hubs. It also demonstrates how to write data to the default storage for your cluster, and how to send data to Power BI using the Power BI real-time streaming API.
Note: This example was created and tested on HDInsight. It may work on other Hadoop distributions, but you will need to change things such as the URI scheme used to store data in HDFS.
- Apache Storm 1.1.0. This is available through Apache Storm on HDInsight version 3.6. For more information, see Get started with Storm on HDInsight cluster.
- An Azure Event Hub. For more information, see Create an Event Hubs namespace and event hub.
- Oracle Java Developer Kit (JDK) version 8 or equivalent, such as OpenJDK.
- Maven: a project build system for Java projects.
- A text editor or integrated development environment (IDE).
- The `ssh` and `scp` commands. For more information, see Use SSH with HDInsight.
The `resources/writer.yaml` topology writes random data to an event hub. The data is generated by the `DeviceSpout` component, and is a random device ID and device value. So it's simulating some hardware that emits a string ID and a numeric value.
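The real spout is in this repository; as a rough sketch of the idea only (the class and field names below are illustrative, not the sample's actual code), a Storm spout that emits a random device ID and value looks something like this:

```java
import java.util.Map;
import java.util.Random;
import java.util.UUID;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Illustrative spout: emits a random device ID (string) and device value (number),
// simulating hardware. The actual DeviceSpout in this repository may differ.
public class RandomDeviceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Emit one random reading per call; Storm calls nextTuple() in a loop.
        collector.emit(new Values(UUID.randomUUID().toString(), random.nextInt(100)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("deviceid", "devicevalue"));
    }
}
```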
The `resources/reader.yaml` topology reads data from Event Hub (the data written by EventHubWriter) and parses the JSON data. It also emits the values read from Event Hub to the Storm logs.
The `resources/readertofile.yaml` topology is the same as the `reader.yaml` topology, but it uses the HDFS-bolt component to write data to the HDFS-compatible file system used by HDInsight.
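This sample wires the bolt up through Flux YAML. For context, a typical Java configuration of the storm-hdfs `HdfsBolt` looks roughly like the following sketch; the file system URL, output path, delimiter, and rotation size here are illustrative values, not necessarily the sample's settings.

```java
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;

// Illustrative HdfsBolt configuration (storm-hdfs). With HDInsight, the fs URL
// uses the cluster's HDFS-compatible scheme (wasb:/// for Azure Storage).
public class HdfsBoltSketch {
    public static HdfsBolt build() {
        return new HdfsBolt()
                .withFsUrl("wasb:///")
                .withFileNameFormat(new DefaultFileNameFormat().withPath("/stormdata/"))
                .withRecordFormat(new DelimitedRecordFormat().withFieldDelimiter(","))
                .withSyncPolicy(new CountSyncPolicy(1000))                        // sync to storage every 1000 tuples
                .withRotationPolicy(new FileSizeRotationPolicy(5.0f, Units.MB));  // start a new file every ~5 MB
    }
}
```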
The `resources/readertopowerbi.yaml` topology is the same as the `reader.yaml` topology, but it uses a custom bolt component to write data to Microsoft Power BI using the Power BI real-time streaming API.
The data format in Event Hub is a JSON document with the following format:

```
{ "temperature": integer, "humidity": integer, "co2Level": integer }
```
The reason it's stored as JSON is compatibility: I ran into someone who wasn't formatting data sent to Event Hub as JSON (it came from a Java application) and was reading it into another Java app. That worked fine. Then they wanted to replace the reading component with a C# application that expected JSON. Problem! Always store data in a format that is future-proofed in case your components change.
The parser component adds a timestamp value when it processes the JSON data read from Event Hub, to demonstrate that you can add new values and aren't limited to just what you read in.
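As a rough illustration of what the parser does (the class name is hypothetical and it assumes the org.json library is on the classpath; the sample's actual parser may differ), a bolt that parses the JSON document and appends a timestamp looks something like this:

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.json.JSONObject;

// Illustrative parser bolt: reads the raw JSON string from the Event Hub spout,
// extracts the fields, and adds a timestamp that was not in the original data.
public class ParserBoltSketch extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        JSONObject doc = new JSONObject(input.getString(0));
        collector.emit(new Values(
                doc.getInt("temperature"),
                doc.getInt("humidity"),
                doc.getInt("co2Level"),
                System.currentTimeMillis())); // added here, not read from Event Hub
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("temperature", "humidity", "co2level", "timestamp"));
    }
}
```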
- An Azure event hub with two shared access policies:
  - The writer policy must have write permission to the event hub.
  - The reader policy must have listen permission to the event hub.
- To connect to the event hub from Storm, you need the following information:
  - The connection string for the writer policy.
  - The policy key for the reader policy.
  - The name of your event hub.
  - The Service Bus namespace that your event hub was created in.
  - The number of partitions available with your event hub configuration.

For information on creating an event hub, see the Create Event Hubs document.
- Fork and clone the repository so you have a local copy.
- Add the Event Hub configuration to the `dev.properties` file. This is used to configure the spout that reads from Event Hub and the bolt that writes to it.
- Use `mvn package` to build everything. Once the build completes, the `target` directory will contain a file named `EventHubExample-1.0-SNAPSHOT.jar`.
Because these topologies only read from and write to Event Hubs, you can test them locally if you have a Storm development environment. Use the following steps to run them locally in the dev environment:
- Run the writer:

  ```bash
  storm jar EventHubExample-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local -R /writer.yaml --filter dev.properties
  ```

- Run the reader:

  ```bash
  storm jar EventHubExample-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local -R /reader.yaml --filter dev.properties
  ```
Output is logged to the console when running locally. Use Ctrl+C to stop the topology.
- Use SCP to copy the jar package to your HDInsight cluster. Replace USERNAME with the SSH user for your cluster. Replace CLUSTERNAME with the name of your HDInsight cluster:

  ```bash
  scp ./target/EventHubExample-1.0-SNAPSHOT.jar USERNAME@CLUSTERNAME-ssh.azurehdinsight.net:EventHubExample-1.0-SNAPSHOT.jar
  ```

  For more information on using `scp` and `ssh`, see [Use SSH with HDInsight](https://docs.microsoft.com/azure/hdinsight/hdinsight-hadoop-linux-use-ssh-unix).
- Use SCP to copy the dev.properties file to the server:

  ```bash
  scp dev.properties USERNAME@CLUSTERNAME-ssh.azurehdinsight.net:dev.properties
  ```
- Once the file has finished uploading, use SSH to connect to the HDInsight cluster, for example `ssh USERNAME@CLUSTERNAME-ssh.azurehdinsight.net`. Replace USERNAME with the name of your SSH login. Replace CLUSTERNAME with your HDInsight cluster name.
- Use the following commands to start the topologies:

  ```bash
  storm jar EventHubExample-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote -R /writer.yaml --filter dev.properties
  storm jar EventHubExample-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote -R /reader.yaml --filter dev.properties
  ```

  These commands start the topologies and give them friendly names of "writer" and "reader".
- To view the logged data, go to https://CLUSTERNAME.azurehdinsight.net/stormui, where CLUSTERNAME is the name of your HDInsight cluster. Select the topologies and drill down to the components. Select the port entry for an instance of a component to view logged information.
- Use the following command to stop the reader:

  ```bash
  storm kill eventhubreader
  ```
By default, the components needed to write to WASB or ADL (the file schemes used by HDInsight's HDFS-compatible storage) are not in Storm's classpath. To add them and write output to a file, use the following steps:
- From the Azure Portal, select your HDInsight cluster.
- Select the Script actions entry, and then select + Submit new.
- Use the following values to fill in the Submit script action form:
  - Select a script: Select Custom.
  - Name: Enter a name for this script. This is how it will appear in the script history.
  - Bash script URI: https://hdiconfigactions.blob.core.windows.net/linuxstormextlibv01/stormextlib.sh
  - Node type(s): Select the Nimbus and Supervisor node types.
  - Parameters: Leave this field blank.
  - Persist: Check this field.
- Select Create to run the script action.
- IMPORTANT: The `dev.properties` file assumes that your cluster is using Azure Storage as the default storage. If you are using Azure Data Lake Store instead, change the `hdfs.url` value in `dev.properties` to `adl:///`.
- Once the script completes, use the following command to start the topology that writes data to a file:

  ```bash
  storm jar EventHubExample-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote -R /readtofile.yaml --filter dev.properties
  ```
- To view the files generated by the topology, use the following command:

  ```bash
  hdfs dfs -ls /stormdata
  ```

  This command returns results similar to the following:

  ```
  Found 1 items
  -rw-r--r--   1 storm supergroup       5123 2017-10-05 17:25 /stormdata/hdfs-bolt-5-0-1507224331637.txt
  ```

  To view the contents of a file, use the following command, replacing `filename` with the name of one of the files:

  ```bash
  hdfs dfs -text /stormdata/filename
  ```
- To stop the topologies, use `storm kill` with the name of each running topology. For example:

  ```bash
  storm kill eventhubwriter
  ```
There isn't a pre-built Storm bolt for communicating with Power BI. However, Power BI provides a real-time streaming REST API that is easy to use. This project includes a `PowerBIBolt.java` component that demonstrates the basics of using the Power BI real-time streaming API.
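The actual implementation is in `PowerBIBolt.java` in this repository. As a minimal sketch of the core idea (using only standard `java.net` classes; the class and method names here are hypothetical), posting a batch of rows to a Power BI push URL looks roughly like this:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sketch of a Power BI real-time streaming call: POST a JSON array of rows to the
// dataset's push URL. Batching, retries, and error handling are omitted.
public class PowerBiPushSketch {
    public static void push(String pushUrl, String jsonRows) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pushUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(jsonRows.getBytes(StandardCharsets.UTF_8));
        }
        int status = conn.getResponseCode();
        if (status < 200 || status >= 300) {
            throw new RuntimeException("Power BI push failed with HTTP " + status);
        }
        conn.disconnect();
    }
}
```

For a dataset with the fields described in the steps below, a single pushed row would look like `[{"temperature": 22, "humidity": 40, "co2level": 600, "timestamp": "2017-10-05T17:25:00Z"}]`.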
- Use the steps in the https://powerbi.microsoft.com/en-us/documentation/powerbi-service-real-time-streaming/ document to learn how to work with real-time streaming in Power BI.
- Create a new Custom streaming dataset in Power BI. In the Values from stream section, add the following values:
  - temperature with a data type of Number
  - humidity with a data type of Number
  - co2level with a data type of Number
  - timestamp with a data type of Datetime
- Select Create, and then save the Push URL value returned. Select Done to finish configuring the real-time streaming API.
- On the Storm cluster, modify the `dev.properties` file and set `powerbi.push.url` to the push URL you saved earlier. This URL is used by the `PowerBIBolt.java` component when posting to Power BI.
- Use the following command to start the topology:

  ```bash
  storm jar EventHubExample-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --remote -R /readtopowerbi.yaml --filter dev.properties
  ```
- In Power BI, add some tiles to the dashboard and set the real-time streaming API as the source. Note that the values update as data is read by the topology.
- Use the following commands to stop the reader and writer topologies:

  ```bash
  storm kill eventhubreader
  storm kill eventhubwriter
  ```
This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.