Productionizing your Streaming Jobs

On-Demand Webinar

The slides and notebooks for this session are available as attachments within the webinar itself. To download them, start the webinar, hover over the player, and click [Attachments].

Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:

Motivation and most common use cases for Spark Streaming:

  • Streaming Data Ingestion & ETL – Building a data highway to ingest real-time data into warehouses, search engines, or data lakes.
  • Monitoring & Dashboarding
  • Anomaly/Fraud Detection with Online Learning – Making predictions on streams and keeping the model up to date as new data is observed.
  • Sessionization – Identifying sessions from streams based on user behavior.

Common design patterns that emerge from these use cases, and tips for avoiding common pitfalls when implementing them (minimal sketches of each pattern follow this list):

  • Associative Time-Based Window Aggregations – How and when to use window operations to perform associative aggregations efficiently and maintain running statistics over your data.
  • Global Aggregations with State Management – Maintaining the most current value of a statistic across all time with global state.
  • Joining streams efficiently with static and dynamic datasets – Often you will want to join not only multiple streams but also streams with historical datasets, which may be static or change over time. We will walk through best practices for these joins.
  • Using SQL operations on streams – How to use Spark SQL on DStreams efficiently.
  • Avoiding common pitfalls when doing online model updates.
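
A minimal sketch of a windowed, associative aggregation in Scala, assuming a socket text source on localhost:9999 and a hypothetical checkpoint path. The inverse reduce function lets Spark subtract the batch leaving the window instead of recomputing the whole window on each slide:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingSketches")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10s batch interval
    ssc.checkpoint("/tmp/checkpoint")  // required for windowed/stateful ops

    // Hypothetical source: whitespace-separated words arriving on a socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // 60s window sliding every 10s; the second function "subtracts" the
    // batch falling out of the window, which is what makes the aggregation
    // cheap for associative (and invertible) operations.
    val windowedCounts = pairs.reduceByKeyAndWindow(
      _ + _,        // add values entering the window
      _ - _,        // subtract values leaving the window
      Seconds(60),  // window length
      Seconds(10))  // slide interval

    windowedCounts.print()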
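
For the global-aggregation pattern, a sketch of a running count kept in managed state with mapWithState (Spark 1.6+), continuing from the pairs stream above; checkpointing must already be enabled:

    import org.apache.spark.streaming.{State, StateSpec}

    // Running count per key, maintained across batches for all of time.
    def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount)
      (key, newCount)
    }

    val runningCounts = pairs.mapWithState(StateSpec.function(updateCount _))
    runningCounts.print()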
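
For stream-to-static joins, a sketch using transform, which exposes each batch as an RDD so an ordinary RDD join applies; the reference-data path is hypothetical:

    // Static reference data: load once, cache, and reuse across batches.
    val staticRdd = ssc.sparkContext
      .textFile("/data/reference.csv")   // hypothetical path
      .map { line => val fields = line.split(","); (fields(0), fields(1)) }
      .cache()

    val enriched = pairs.transform(rdd => rdd.join(staticRdd))

The body of transform runs on the driver for every batch, so a slowly changing reference dataset can be refreshed there on whatever schedule suits it.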
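
SQL over a DStream follows the pattern from the Spark Streaming programming guide: inside foreachRDD (which runs on the driver once per batch), get or create a singleton SparkSession, convert the batch RDD to a DataFrame, and query it:

    import org.apache.spark.sql.SparkSession

    lines.flatMap(_.split(" ")).foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      val words = rdd.toDF("word")
      words.createOrReplaceTempView("words")
      spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
    }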
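
For online model updates, MLlib ships streaming learners whose weights are refreshed on every new training batch; a sketch with StreamingLinearRegressionWithSGD, assuming hypothetical training/test directories and three features:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

    // Files dropped into these (hypothetical) directories become new batches.
    val trainingData = ssc.textFileStream("/stream/train").map(LabeledPoint.parse)
    val testData = ssc.textFileStream("/stream/test").map(Vectors.parse)

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(3))   // assuming 3 features

    model.trainOn(trainingData)              // retrains on each new batch
    model.predictOn(testData).print()        // predictions use the latest weights
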
Performance optimization techniques (sketched after this list):

  • How to scale out efficiently to achieve high throughput.
  • Better state management with state pruning.
  • Fine-tuning the checkpoint interval for optimal performance.
  • Efficient ways of writing to data sinks.
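
Each receiver occupies one core, so ingest often scales by running several receivers and unioning their streams; a sketch reusing ssc from above (the duplicated socket source is purely illustrative):

    val numReceivers = 4   // hypothetical; size to your source's parallelism
    val streams = (1 to numReceivers).map(_ => ssc.socketTextStream("localhost", 9999))
    val unioned = ssc.union(streams)

    // Redistribute before heavy work so all cores participate downstream.
    val distributed = unioned.repartition(16)   // hypothetical partition count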
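
Two of the knobs above in one sketch: pruning idle keys with a state timeout (a hypothetical 30-minute threshold) and checkpointing the stateful stream less often than every batch (the Spark docs suggest roughly 5-10x the batch interval):

    import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec}

    def countWithTimeout(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      if (state.isTimingOut()) {
        // Final emission for an expiring key; calling state.update here would throw.
        (key, state.getOption.getOrElse(0L))
      } else {
        val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
        state.update(newCount)
        (key, newCount)
      }
    }

    val pruned = pairs.mapWithState(
      StateSpec.function(countWithTimeout _).timeout(Minutes(30)))

    pruned.checkpoint(Seconds(50))   // 5x the 10s batch interval used above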
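
Writing to sinks is cheapest with one connection per partition rather than per record; a sketch in the shape the Spark docs use, where ConnectionPool is a hypothetical pooled client:

    windowedCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val conn = ConnectionPool.getConnection()   // hypothetical helper
        partition.foreach(record => conn.send(record.toString))
        ConnectionPool.returnConnection(conn)
      }
    }
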
Presenters
  • Prakash Chockalingam

    Solutions Architect - Databricks

    Prakash is currently a Solutions Architect at Databricks, where he focuses on helping customers build their big data infrastructure, drawing on a decade of experience building large-scale distributed systems and machine learning infrastructure at companies including Netflix and Yahoo. Prior to joining Databricks, he was at Netflix designing and building the recommendation infrastructure that serves millions of recommendations to Netflix users every day. His interests broadly span distributed systems and machine learning, and early in his career he co-authored several publications on machine learning and computer vision research.

  • Denny Lee

    Technology Evangelist - Databricks

    Denny Lee is a Technology Evangelist with Databricks. He is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments.