Lecture 14: Spark Structured Streaming (continued)#
Learning objectives#
By the end of this lecture, students should be able to:
Understand the key components of a Spark Streaming job
Set up a sample word count streaming application
Set up a sample device reading streaming application
Learning resources#
Please check out the Spark documentation for a comprehensive explanation of Structured Streaming:
https://spark.apache.org/docs/3.4.0/structured-streaming-programming-guide.html
Spark Streaming basics#
At a high level, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
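As a quick refresher on this model, below is a minimal sketch of the classic word count streaming application (assuming a socket source on localhost:9999, e.g. started with `nc -lk 9999`): every micro-batch of incoming lines is split into words, and the running counts are printed to the console.

```python
# Minimal word count sketch over a socket source (assumes `nc -lk 9999` is running)
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("Word count streaming").getOrCreate()

# Each incoming line becomes a row with a single column named "value"
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Split each line into words and keep a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Print the complete, updated counts to the console after every micro-batch
query = (
    word_counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```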
Example 2: Spark Streaming Read from Files#
In today’s example, we will create a Spark Structured Streaming pipeline for a weather recording application. In this context, we have a device that records the temperature and produces a log in JSON format for each reading.
Here’s an example of a reading stored in JSON format:
{
    "eventId": "e3cb26d3-41b2-49a2-84f3-0156ed8d7502",
    "eventOffset": 10001,
    "eventPublisher": "device",
    "customerId": "CI00103",
    "data": {
        "devices": [
            {
                "deviceId": "D001",
                "temperature": 15,
                "measure": "C",
                "status": "ERROR"
            },
            {
                "deviceId": "D002",
                "temperature": 16,
                "measure": "C",
                "status": "SUCCESS"
            }
        ]
    },
    "eventTime": "2023-01-05 11:13:53.643364"
}
Task#
Our goal is to pre-process each JSON file as it arrives: we will flatten the nested device readings and store the result as CSV.
The output should look like this:

| customerId | eventId | eventOffset | eventPublisher | eventTime | deviceId | measure | status | temperature |
|---|---|---|---|---|---|---|---|---|
| CI00108 | aa90011f-3967-496… | 10003 | device | 2023-01-05 11:13:… | D004 | C | SUCCESS | 16 |
Code#
1. Initialize a SparkSession#
# Create the Spark Session
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("Weather streaming")
.config("spark.streaming.stopGracefullyOnShutdown", "true")
.master("local[*]")
.getOrCreate()
)
spark
24/10/29 15:56:00 WARN Utils: Your hostname, Quans-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.50.26.105 instead (on interface en0)
24/10/29 15:56:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/29 15:56:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/10/29 15:56:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/10/29 15:56:01 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
SparkSession - in-memory
- `SparkSession.builder`: a builder pattern used to configure and create a `SparkSession` instance.
- `.appName("Weather streaming")`: sets the name of the Spark application. This name will appear in the Spark web UI and logs.
- `.config("spark.streaming.stopGracefullyOnShutdown", "true")`: configures Spark to stop streaming queries gracefully on shutdown, meaning Spark will try to complete the ongoing tasks before shutting down.
- `.master("local[*]")`: sets the master URL to connect to. In this case, `local[*]` means Spark runs locally with as many worker threads as there are logical cores on your machine.
- `.getOrCreate()`: either retrieves an existing `SparkSession` or creates a new one if none exists.
2. Import data#
# Enable automatic schema inference for streaming data
# This allows Spark to automatically infer the schema of the JSON files being read
spark.conf.set("spark.sql.streaming.schemaInference", True)
streaming_df = (
    spark.readStream
    .option("cleanSource", "archive")             # move files that have been processed...
    .option("sourceArchiveDir", "data/archive/")  # ...into this archive directory
    .option("maxFilesPerTrigger", 1)              # read at most one new file per micro-batch
    .format("json")
    .load("data/input/")
)
streaming_df.printSchema()
# streaming_df.show()
root
|-- _corrupt_record: string (nullable = true)
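If the inferred schema shows only `_corrupt_record`, the JSON source could not parse the files it sampled (for example, pretty-printed multi-line JSON is read one object per line by default). An alternative to schema inference is to declare the schema explicitly. Below is a sketch, not part of the lecture code, with field names and types taken from the sample reading above; the `multiLine` option is included as an assumption for pretty-printed files.

```python
# Sketch: declare the schema explicitly instead of relying on
# spark.sql.streaming.schemaInference. Fields follow the sample reading above.
from pyspark.sql.types import (
    StructType, StructField, ArrayType, StringType, LongType,
)

device_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("measure", StringType()),
    StructField("status", StringType()),
    StructField("temperature", LongType()),
])

event_schema = StructType([
    StructField("customerId", StringType()),
    StructField("data", StructType([StructField("devices", ArrayType(device_schema))])),
    StructField("eventId", StringType()),
    StructField("eventOffset", LongType()),
    StructField("eventPublisher", StringType()),
    StructField("eventTime", StringType()),
])

streaming_df = (
    spark.readStream
    .schema(event_schema)        # explicit schema: no inference needed
    .option("multiLine", True)   # assumption: input files are pretty-printed JSON
    .option("maxFilesPerTrigger", 1)
    .format("json")
    .load("data/input/")
)
```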
3. Explode the data#
# Let's explode the data, since data.devices contains a list/array of device readings
from pyspark.sql.functions import explode
exploded_df = streaming_df.withColumn("data_devices", explode("data.devices"))
# Check the schema of exploded_df; to inspect the rows themselves, place a sample
# json file in the input folder and change readStream to read (see the batch-mode sketch below)
exploded_df.printSchema()
# exploded_df.show(truncate=False)
root
|-- _corrupt_record: string (nullable = true)
|-- customerId: string (nullable = true)
|-- data: struct (nullable = true)
| |-- devices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- deviceId: string (nullable = true)
| | | |-- measure: string (nullable = true)
| | | |-- status: string (nullable = true)
| | | |-- temperature: long (nullable = true)
|-- eventId: string (nullable = true)
|-- eventOffset: long (nullable = true)
|-- eventPublisher: string (nullable = true)
|-- eventTime: string (nullable = true)
|-- data_devices: struct (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- measure: string (nullable = true)
| |-- status: string (nullable = true)
| |-- temperature: long (nullable = true)
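As the comment in the cell above suggests, an easy way to inspect intermediate results is to read the same folder as a static (batch) DataFrame, where `show()` is allowed. A minimal sketch:

```python
# Debugging sketch: run the same transformation on a static read, so show() works
from pyspark.sql.functions import explode

batch_df = (
    spark.read
    .format("json")
    .load("data/input/")
    .withColumn("data_devices", explode("data.devices"))
)
batch_df.show(truncate=False)
```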
4. Flatten the exploded data#
# Flatten the exploded df
from pyspark.sql.functions import col
flattened_df = (
exploded_df
.drop("data")
.withColumn("deviceId", col("data_devices.deviceId"))
.withColumn("measure", col("data_devices.measure"))
.withColumn("status", col("data_devices.status"))
.withColumn("temperature", col("data_devices.temperature"))
.drop("data_devices")
)
# Check the schema of flattened_df; as above, switch readStream to read with a sample file to inspect the rows
flattened_df.printSchema()
# flattened_df.show(truncate=False)
root
|-- _corrupt_record: string (nullable = true)
|-- customerId: string (nullable = true)
|-- eventId: string (nullable = true)
|-- eventOffset: long (nullable = true)
|-- eventPublisher: string (nullable = true)
|-- eventTime: string (nullable = true)
|-- deviceId: string (nullable = true)
|-- measure: string (nullable = true)
|-- status: string (nullable = true)
|-- temperature: long (nullable = true)
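An equivalent way to flatten the struct is a single `select` that expands `data_devices.*`; a sketch (not the lecture's version):

```python
# Alternative flattening sketch: expand the struct columns in one select
flattened_df = exploded_df.select(
    "customerId", "eventId", "eventOffset", "eventPublisher", "eventTime",
    "data_devices.*",   # expands to deviceId, measure, status, temperature
)
```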
5. Write the output to console or csv file#
# Write the output to console sink to check the output
# (flattened_df.writeStream
# .format("console")
# .outputMode("append")
# .start().awaitTermination())
# Write the output to csv sink
(flattened_df.writeStream
    .format("csv")
    .outputMode("append")                                            # the file sink supports append mode only
    .option("path", "data/output/device_output.csv")                 # output directory; Spark writes part files here
    .option("checkpointLocation", "data/checkpoint/device_output")   # required for fault tolerance
    .start().awaitTermination())
24/10/29 15:43:53 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
ERROR:root:KeyboardInterrupt while sending command.
KeyboardInterrupt

The `awaitTermination()` call blocks until the streaming query is stopped, so this cell keeps running until it is interrupted manually; the interrupt surfaces as the `KeyboardInterrupt` above.
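In a notebook it is often more convenient not to block indefinitely. A sketch of an assumed workflow (not from the lecture): keep a handle to the query, wait for a bounded time, then stop it gracefully.

```python
# Sketch: run the streaming query for a bounded time instead of blocking forever
query = (
    flattened_df.writeStream
    .format("csv")
    .outputMode("append")
    .option("path", "data/output/device_output.csv")
    .option("checkpointLocation", "data/checkpoint/device_output")
    .start()
)

query.awaitTermination(60)   # wait up to ~60 seconds while incoming files are processed
query.stop()                 # then stop the streaming query gracefully
```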