Jump Start into Apache® Spark™ and Databricks

On-Demand Webinar

Recorded on 02/11/2016 10:00am PT, 1:00pm ET, 6:00pm UTC



The slides and notebooks for this session are available as attachments within the webinar itself. To download them, start the webinar, hover over it, and click [Attachments].




Denny Lee, Technology Evangelist with Databricks, will provide a jump start into Apache Spark™ and Databricks. Spark is a fast, easy-to-use, unified engine for solving many Data Sciences and Big Data (and many not-so-Big Data) scenarios. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will use Databricks to quickly demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download. This introductory-level jump start will focus on the following scenarios:



  • Quick Start on Spark: An introduction to Spark using Python and Resilient Distributed Datasets (RDDs). We will review RDD transformations and actions and their impact on your Spark workflow.
  • A Primer on RDDs to DataFrames to Datasets: This will provide a high-level overview of our journey from RDDs (2011) to DataFrames (2013) to the newly introduced (as of Spark 1.6) Datasets (2015).
  • Just-in-Time Data Warehousing with Spark SQL: We will demonstrate a Just-in-Time Data Warehousing (JIT-DW) example using Spark SQL on an AdTech scenario. We will start with weblogs, create an external table with RegEx, make an external web service call via a Mapper, join DataFrames and register a temp table, add columns to DataFrames with UDFs, use Python UDFs with Spark SQL, and visualize the output, all in the same notebook.
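
The transformation/action distinction mentioned in the first scenario can be illustrated without a cluster. The sketch below is a plain-Python analogy, not the Spark API and not taken from the webinar notebooks: LazyPipeline is a hypothetical toy class whose method names mirror the RDD methods map, filter, collect, and count. Transformations only record work; an action is what triggers it.

```python
# Plain-Python stand-in for the transformation/action split.
# LazyPipeline is a hypothetical toy class, not part of Spark.
class LazyPipeline:
    def __init__(self, source, steps=()):
        self._source = source          # re-iterable input, e.g. a list
        self._steps = tuple(steps)     # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: returns a new pipeline, computes nothing yet.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps into one lazy iterator.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: forces evaluation and returns every element.
        return list(self._run())

    def count(self):
        # Action: forces evaluation and returns the number of elements.
        return sum(1 for _ in self._run())


calls = []  # records each time parse() actually runs


def parse(line):
    calls.append(line)
    return int(line)


numbers = LazyPipeline(["1", "2", "3", "4"]).map(parse)
evens = numbers.filter(lambda n: n % 2 == 0)

work_before_action = len(calls)   # no parsing has happened yet
result = evens.collect()          # the action triggers the whole chain
work_after_action = len(calls)    # every input line was parsed once
```

In this toy, each action re-runs the recorded chain from the source, which is one reason the choice of where to place actions affects a workflow.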

Presenters
  • Denny Lee

    Technology Evangelist - Databricks

    Denny Lee is a Technology Evangelist with Databricks and a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).