Webinar: Building Streaming Data Pipelines Using Structured Streaming and Delta Lake
Given the rise of IoT and other real-time sources, and businesses’ desire to draw fast insights, there is a growing imperative for data professionals to build streaming data pipelines. With the plethora of tools and frameworks in the big data community, it is challenging to architect pipelines that achieve the desired performance and data quality. Hence, it is important to properly understand the business and technical requirements of the pipelines and accordingly select the right tools to build them.
Structured Streaming in Apache Spark™ has proven to be an excellent platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. However, this solves only one half of the problem of building end-to-end pipelines. To get insights from the processed data, it is important to ensure that downstream applications can efficiently and reliably query the output of the pipelines. Current data lakes provide scalable storage, but not the data quality and reliability that databases and data warehouses are known for. To fix this, we, the original creators of Apache Spark™, have built Delta Lake, an open-source storage layer that brings ACID transactions and scalable metadata handling to data lakes.
In this webinar, we are going to show how Structured Streaming and Delta Lake together make it super-easy to write end-to-end pipelines. Specifically, we will cover the following.
- How to approach the problem of designing pipelines by critically understanding the requirements
- How to classify the pipelines into common design patterns
- How to solve each pattern the right way using Structured Streaming and Delta Lake
- And finally, what the future roadmap for Delta Lake looks like
Tathagata (TD) Das, Senior Software Engineer, Databricks
Tathagata Das is an Apache Spark committer and a member of the PMC. He’s the lead developer behind Spark Streaming and currently develops Structured Streaming. Previously, he was a grad student in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.