Lecture 14: Spark Structured Streaming (continued)#
Learning objectives#
By the end of this lecture, students should be able to:
Understand the key components of a Spark Streaming job
Set up a sample word count streaming application
Set up a sample device reading streaming application
Learning resources#
Please check out the Spark documentation for a comprehensive explanation of Structured Streaming:
https://spark.apache.org/docs/3.4.0/structured-streaming-programming-guide.html
Spark Streaming basics#
At a high level, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
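As a quick refresher on this model, below is a minimal sketch of the classic word count streaming application (assuming a socket source on localhost:9999, e.g. started with `nc -lk 9999`): every micro-batch of incoming lines is split into words, and the running counts are printed to the console.

```python
# Minimal word count sketch over a socket source (assumes `nc -lk 9999` is running)
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("Word count streaming").getOrCreate()

# Each incoming line becomes a row with a single column named "value"
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Split each line into words and keep a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Print the complete, updated counts to the console after every micro-batch
query = (
    word_counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```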
Example 2: Spark Streaming Read from Files#
In today’s example, we will create a Spark Structured Streaming pipeline for a weather recording application. In this context, we have a device that records the temperature and produces a log in JSON format for each reading.
Here’s an example of a reading stored in JSON format:
{
    "eventId": "e3cb26d3-41b2-49a2-84f3-0156ed8d7502",
    "eventOffset": 10001,
    "eventPublisher": "device",
    "customerId": "CI00103",
    "data": {
        "devices": [
            {
                "deviceId": "D001",
                "temperature": 15,
                "measure": "C",
                "status": "ERROR"
            },
            {
                "deviceId": "D002",
                "temperature": 16,
                "measure": "C",
                "status": "SUCCESS"
            }
        ]
    },
    "eventTime": "2023-01-05 11:13:53.643364"
}
Task#
Our goal is to pre-process each JSON file as it arrives: we will flatten the nested device readings and store the result as CSV.
The output should look like this:

| customerId | eventId | eventOffset | eventPublisher | eventTime | deviceId | measure | status | temperature |
|---|---|---|---|---|---|---|---|---|
| CI00108 | aa90011f-3967-496… | 10003 | device | 2023-01-05 11:13:… | D004 | C | SUCCESS | 16 |
Code#
1. Initialize a SparkSession#
# Create the Spark Session
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("Weather streaming")
.config("spark.streaming.stopGracefullyOnShutdown", "true")
.master("local[*]")
.getOrCreate()
)
spark
24/10/29 15:56:00 WARN Utils: Your hostname, Quans-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.50.26.105 instead (on interface en0)
24/10/29 15:56:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/29 15:56:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/10/29 15:56:01 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/10/29 15:56:01 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
SparkSession - in-memory
- `SparkSession.builder`: a builder pattern used to configure and create a `SparkSession` instance.
- `.appName("Weather streaming")`: sets the name of the Spark application. This name will appear in the Spark web UI and logs.
- `.config("spark.streaming.stopGracefullyOnShutdown", "true")`: configures Spark to stop streaming queries gracefully on shutdown, meaning Spark will try to complete the ongoing tasks before shutting down.
- `.master("local[*]")`: sets the master URL to connect to. In this case, `local[*]` means Spark runs locally with as many worker threads as there are logical cores on your machine.
- `.getOrCreate()`: either retrieves an existing `SparkSession` or creates a new one if none exists.
2. Import data#
# Enable automatic schema inference for streaming data
# This allows Spark to automatically infer the schema of the JSON files being read
spark.conf.set("spark.sql.streaming.schemaInference", True)
streaming_df = (
    spark.readStream
    .option("cleanSource", "archive")             # move files that have been processed...
    .option("sourceArchiveDir", "data/archive/")  # ...into this archive directory
    .option("maxFilesPerTrigger", 1)              # read at most one new file per micro-batch
    .format("json")
    .load("data/input/")
)
streaming_df.printSchema()
# streaming_df.show()
root
|-- _corrupt_record: string (nullable = true)
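If the inferred schema shows only `_corrupt_record`, the JSON source could not parse the files it sampled (for example, pretty-printed multi-line JSON is read one object per line by default). An alternative to schema inference is to declare the schema explicitly. Below is a sketch, not part of the lecture code, with field names and types taken from the sample reading above; the `multiLine` option is included as an assumption for pretty-printed files.

```python
# Sketch: declare the schema explicitly instead of relying on
# spark.sql.streaming.schemaInference. Fields follow the sample reading above.
from pyspark.sql.types import (
    StructType, StructField, ArrayType, StringType, LongType,
)

device_schema = StructType([
    StructField("deviceId", StringType()),
    StructField("measure", StringType()),
    StructField("status", StringType()),
    StructField("temperature", LongType()),
])

event_schema = StructType([
    StructField("customerId", StringType()),
    StructField("data", StructType([StructField("devices", ArrayType(device_schema))])),
    StructField("eventId", StringType()),
    StructField("eventOffset", LongType()),
    StructField("eventPublisher", StringType()),
    StructField("eventTime", StringType()),
])

streaming_df = (
    spark.readStream
    .schema(event_schema)        # explicit schema: no inference needed
    .option("multiLine", True)   # assumption: input files are pretty-printed JSON
    .option("maxFilesPerTrigger", 1)
    .format("json")
    .load("data/input/")
)
```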
3. Explode the data#
# Let's explode the data, since data.devices contains a list/array of device readings
from pyspark.sql.functions import explode
exploded_df = streaming_df.withColumn("data_devices", explode("data.devices"))
# Check the schema of exploded_df; to inspect the rows themselves, place a sample
# json file in the input folder and change readStream to read (see the batch-mode sketch below)
exploded_df.printSchema()
# exploded_df.show(truncate=False)
root
|-- _corrupt_record: string (nullable = true)
|-- customerId: string (nullable = true)
|-- data: struct (nullable = true)
| |-- devices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- deviceId: string (nullable = true)
| | | |-- measure: string (nullable = true)
| | | |-- status: string (nullable = true)
| | | |-- temperature: long (nullable = true)
|-- eventId: string (nullable = true)
|-- eventOffset: long (nullable = true)
|-- eventPublisher: string (nullable = true)
|-- eventTime: string (nullable = true)
|-- data_devices: struct (nullable = true)
| |-- deviceId: string (nullable = true)
| |-- measure: string (nullable = true)
| |-- status: string (nullable = true)
| |-- temperature: long (nullable = true)
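As the comment in the cell above suggests, an easy way to inspect intermediate results is to read the same folder as a static (batch) DataFrame, where `show()` is allowed. A minimal sketch:

```python
# Debugging sketch: run the same transformation on a static read, so show() works
from pyspark.sql.functions import explode

batch_df = (
    spark.read
    .format("json")
    .load("data/input/")
    .withColumn("data_devices", explode("data.devices"))
)
batch_df.show(truncate=False)
```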
4. Flatten the exploded data#
# Flatten the exploded df
from pyspark.sql.functions import col
flattened_df = (
exploded_df
.drop("data")
.withColumn("deviceId", col("data_devices.deviceId"))
.withColumn("measure", col("data_devices.measure"))
.withColumn("status", col("data_devices.status"))
.withColumn("temperature", col("data_devices.temperature"))
.drop("data_devices")
)
# Check the schema of flattened_df; as above, switch readStream to read with a sample file to inspect the rows
flattened_df.printSchema()
# flattened_df.show(truncate=False)
root
|-- _corrupt_record: string (nullable = true)
|-- customerId: string (nullable = true)
|-- eventId: string (nullable = true)
|-- eventOffset: long (nullable = true)
|-- eventPublisher: string (nullable = true)
|-- eventTime: string (nullable = true)
|-- deviceId: string (nullable = true)
|-- measure: string (nullable = true)
|-- status: string (nullable = true)
|-- temperature: long (nullable = true)
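An equivalent way to flatten the struct is a single `select` that expands `data_devices.*`; a sketch (not the lecture's version):

```python
# Alternative flattening sketch: expand the struct columns in one select
flattened_df = exploded_df.select(
    "customerId", "eventId", "eventOffset", "eventPublisher", "eventTime",
    "data_devices.*",   # expands to deviceId, measure, status, temperature
)
```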
5. Write the output to console or csv file#
# Write the output to console sink to check the output
# (flattened_df.writeStream
# .format("console")
# .outputMode("append")
# .start().awaitTermination())
# Write the output to csv sink
(flattened_df.writeStream
    .format("csv")
    .outputMode("append")                                            # the file sink supports append mode only
    .option("path", "data/output/device_output.csv")                 # output directory; Spark writes part files here
    .option("checkpointLocation", "data/checkpoint/device_output")   # required for fault tolerance
    .start().awaitTermination())
24/10/29 15:43:53 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
ERROR:root:KeyboardInterrupt while sending command.
KeyboardInterrupt

The `awaitTermination()` call blocks until the streaming query is stopped, so this cell keeps running until it is interrupted manually; the interrupt surfaces as the `KeyboardInterrupt` above.
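In a notebook it is often more convenient not to block indefinitely. A sketch of an assumed workflow (not from the lecture): keep a handle to the query, wait for a bounded time, then stop it gracefully.

```python
# Sketch: run the streaming query for a bounded time instead of blocking forever
query = (
    flattened_df.writeStream
    .format("csv")
    .outputMode("append")
    .option("path", "data/output/device_output.csv")
    .option("checkpointLocation", "data/checkpoint/device_output")
    .start()
)

query.awaitTermination(60)   # wait up to ~60 seconds while incoming files are processed
query.stop()                 # then stop the streaming query gracefully
```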