The slides and notebooks for this session are available as attachments within the webinar itself. To download all the materials, start the webinar, hover over it, and click [Attachments].
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
Motivation and most common use cases for Spark Streaming:
- Streaming Data Ingestion & ETL – Building a data highway to ingest real-time data into warehouses, search engines, or data lakes.
- Monitoring & Dashboarding
- Anomaly/Fraud Detection with Online Learning – Making predictions on streams and keeping the model up to date as new data is observed.
- Sessionization – Identifying sessions from streams of user behavior.
Common design patterns that emerge from these use cases, with tips for avoiding pitfalls when implementing them:
- Associative Time-Based Window Aggregations – How and when to use window functions efficiently to do associative aggregations and maintain running statistics from your data.
- Global Aggregations with State Management – Maintaining the current value of a statistic over all time using global state.
- Joining streams efficiently with static and dynamic datasets – Often you want not only to join multiple streams but also to join them with historical datasets, which can be static or dynamically changing. We will walk through best practices for these joins.
- Using SQL operations on streams – How to use Spark SQL on DStreams efficiently.
- Avoiding common pitfalls while doing online model updates
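
To make the window aggregation pattern concrete, here is a minimal sketch in plain Python (not the Spark API). It assumes per-key counts over a window of micro-batches and shows why an associative, invertible aggregate is cheap to maintain: each slide adds the newest batch and subtracts the batch that just left the window, instead of recomputing the whole window. The class name `SlidingWindowCounts` and its parameters are hypothetical, chosen for illustration; in Spark Streaming the comparable operation is `reduceByKeyAndWindow` with an inverse reduce function.

```python
from collections import Counter, deque


class SlidingWindowCounts:
    """Per-key counts over the last `window_batches` micro-batches (illustrative only)."""

    def __init__(self, window_batches):
        self.window_batches = window_batches
        self.batches = deque()   # partial aggregates of batches still inside the window
        self.totals = Counter()  # running aggregate over the whole window

    def add_batch(self, keys):
        partial = Counter(keys)
        self.batches.append(partial)
        self.totals.update(partial)        # "reduce": fold in the new batch

        if len(self.batches) > self.window_batches:
            expired = self.batches.popleft()
            self.totals.subtract(expired)  # "inverse reduce": remove the expired batch
            # Drop keys whose count fell to zero so the aggregate does not grow forever.
            for key in expired:
                if self.totals[key] == 0:
                    del self.totals[key]

        return dict(self.totals)


window = SlidingWindowCounts(window_batches=2)
window.add_batch(["a", "b", "a"])
window.add_batch(["b"])
print(window.add_batch(["c"]))  # the first batch has now slid out of the window
```

The work per slide is proportional to the size of the entering and leaving batches, not to the window length, which is the main reason to prefer the incremental form for long windows.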
Performance optimization techniques:
- How to scale out efficiently to achieve high throughput.
- Better state management with state pruning.
- Fine-tuning the checkpoint interval for optimal performance.
- Efficient ways of writing to data sinks.
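
As an illustration of state pruning, the following plain-Python sketch (again, not the Spark API) keeps a running total per key and evicts any key that has not been seen within a timeout. The class `PrunedState`, its `timeout_seconds` parameter, and the use of numeric batch times are all assumptions made for this example. The underlying idea is the same one a stateful streaming job relies on: state that is never removed grows without bound and makes every checkpoint slower.

```python
class PrunedState:
    """Global per-key running totals with idle-key eviction (illustrative only)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.state = {}  # key -> (running_total, last_seen_time)

    def update(self, batch_time, records):
        # Fold the new batch into the existing state.
        for key, value in records:
            total, _ = self.state.get(key, (0, batch_time))
            self.state[key] = (total + value, batch_time)

        # Prune keys that have been idle for longer than the timeout.
        stale = [
            key
            for key, (_, last_seen) in self.state.items()
            if batch_time - last_seen > self.timeout
        ]
        for key in stale:
            del self.state[key]

        return {key: total for key, (total, _) in self.state.items()}


state = PrunedState(timeout_seconds=10)
state.update(0, [("u1", 1), ("u2", 5)])
state.update(5, [("u1", 2)])
print(state.update(11, [("u1", 1)]))  # "u2" has been idle for 11 seconds and is evicted
```

The same eviction rule doubles as a simple session timeout: a key that is pruned and later reappears starts a fresh total.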