Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read

On-Demand Webinar

Recorded on 03/07/2016 10:00am PT, 1:00pm ET, 6:00pm UTC


The slides and notebooks for this session are available as attachments within the webinar itself. To download them, start the webinar, hover over it, and click [Attachments].


In this webcast, Jason Pohl, Solutions Engineer at Databricks, will cover how to build a Just-in-Time Data Warehouse on Databricks, with a focus on performing Change Data Capture from a relational database and joining that data to a variety of data sources. Not only do Apache Spark and Databricks let you do this more easily and with less code, but the routine will also automatically ingest changes to the source schema.


Highlights of this webinar include:


  • Starting with a Databricks notebook, Jason will build a classic Change Data Capture (CDC) ETL routine to extract data from an RDBMS.
  • A deep dive into selecting a delta of changes from tables in an RDBMS, writing it to Parquet, and querying it using Spark SQL.
  • A demonstration of how to apply a schema at the time of read rather than before write.
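
The delta-selection step in the highlights above can be sketched roughly as follows. This is not the notebook code from the webinar: it is a minimal illustration in plain Python, with the standard-library sqlite3 module standing in for the RDBMS. The customers table, the updated_at watermark column, and the extract_delta function are all hypothetical names chosen for the example, and writing the delta to Parquet and querying it with Spark SQL are left out.

```python
import sqlite3


def extract_delta(conn, table, watermark_column, last_watermark):
    """Return rows changed since last_watermark and the new high-water mark.

    Hypothetical helper: table and column names are trusted constants here,
    because SQL identifiers cannot be bound as query parameters.
    """
    cursor = conn.execute(
        f"SELECT * FROM {table} "
        f"WHERE {watermark_column} > ? ORDER BY {watermark_column}",
        (last_watermark,),
    )
    # Column names are read from the result set at query time, so columns
    # added to the source later show up without any change to this code.
    columns = [col[0] for col in cursor.description]
    rows = [dict(zip(columns, values)) for values in cursor.fetchall()]
    new_watermark = rows[-1][watermark_column] if rows else last_watermark
    return rows, new_watermark


def build_demo_source():
    """Create an in-memory stand-in for the source RDBMS (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Ada", "2016-03-01T09:00:00"), (2, "Grace", "2016-03-02T09:00:00")],
    )
    return conn


conn = build_demo_source()

# First run: an empty watermark selects every row.
first_batch, watermark = extract_delta(conn, "customers", "updated_at", "")

# The source schema evolves and one row changes between runs.
conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")
conn.execute(
    "UPDATE customers SET email = 'grace@example.com', "
    "updated_at = '2016-03-03T09:00:00' WHERE id = 2"
)

# Second run: only the changed row comes back, including the new column.
second_batch, watermark = extract_delta(conn, "customers", "updated_at", watermark)
print(len(first_batch), second_batch)
```

A timestamp watermark like this is only one way to select a delta. It does not capture deleted rows, and the strict greater-than comparison can miss rows that share the last seen timestamp, so a production routine would need to handle both cases.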

Presenters

  • Jason Pohl

    Data Solutions Engineer - Databricks

    Jason Pohl is a solutions engineer with Databricks, focused on helping customers become successful with their data initiatives. Jason has spent his career building data-driven products and solutions.

  • Denny Lee

    Technology Evangelist - Databricks

    Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).