Not Your Father’s Database: How to Use Apache® Spark™ Properly in Your Big Data Architecture

On-Demand Webinar


The slides and notebooks for this session are available as attachments within the webinar itself. Please start the webinar, hover over the webinar player, and click [Attachments] to download all the materials.

This session will cover a series of use cases where you can store your data cheaply in files and analyze it with Apache Spark™, as well as use cases where you want to store your data in a different data source and access it with Spark DataFrames. Here’s an outline of some of the topics covered in the talk:

Use cases to store your data in file systems for use with Apache Spark:

  • Analyzing a large set of data files (see the sketch after this list).
  • Doing ETL of a large amount of data. 
  • Applying Machine Learning & Data Science to a large dataset. 
  • Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally. 
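
To make the first two items concrete, here is a minimal PySpark sketch of reading a large set of raw files straight from file storage, running a simple aggregation, and writing the result back out as Parquet for downstream ML or BI work. The paths and column names (events, timestamp, country) are hypothetical placeholders rather than materials from the webinar.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("file-analysis").getOrCreate()

# Read a large set of raw files straight from cheap file storage.
# The path and the columns used below are hypothetical placeholders.
events = spark.read.json("/data/events/*.json")

# Analyze: a simple aggregation across the whole dataset.
daily_counts = (events
    .groupBy(F.to_date("timestamp").alias("day"), "country")
    .count())

# ETL: write the summarized result back out as Parquet so later Spark
# jobs (machine learning, BI/visualization tools) can read it efficiently.
daily_counts.write.mode("overwrite").parquet("/data/summaries/daily_counts")
```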

Use cases to store your data in databases for use with Apache Spark:

  • Random access, frequent inserts, and updates of rows of SQL tables. Databases have better performance for these use cases.
  • Supporting incremental updates of databases into Spark. It’s not performant to update Spark SQL tables backed by files row by row. Instead, you can use message queues and Spark Streaming, or perform an incremental select, to keep your Spark SQL tables up to date with your production databases (see the first sketch after this list).
  • External reporting with many concurrent requests. While Spark’s ability to cache file data in memory gives you fast interactive querying, that cache isn’t designed to serve many concurrent requests. If you have many concurrent users to support, it’s better to use Spark to ETL your data into summary tables, or some other format, in a traditional database and serve your reports from there (see the second sketch after this list).
  • Searching content. A Spark job can certainly be written to filter or search for any content in your files, but ElasticSearch is a specialized engine designed to return search results more quickly.
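
As a rough illustration of the incremental-select approach, the first sketch below pushes a filter on a watermark column down to the production database over JDBC and appends only the new rows to a file-backed Spark SQL table. The connection details, the orders table, and the updated_at column are assumptions made for the example, not details from the webinar.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Watermark from the previous run; in practice you would persist this
# somewhere durable. The "updated_at" column is a hypothetical example.
last_load = "2016-01-01 00:00:00"

# Push the filter down to the production database so only new or changed
# rows travel over JDBC. The connection settings are placeholders.
incremental = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/prod")
    .option("dbtable", f"(SELECT * FROM orders WHERE updated_at > '{last_load}') AS t")
    .option("user", "spark_reader")
    .option("password", "...")
    .load())

# Append only the new rows to the file-backed Spark SQL table instead of
# rewriting the whole table.
incremental.write.mode("append").parquet("/data/tables/orders")
```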
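
For the external-reporting case, here is a sketch of using Spark to ETL raw file data into a compact summary table and pushing it into a traditional relational database, which then serves the many concurrent report requests. The table names, columns, and JDBC settings are again illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reporting-etl").getOrCreate()

# Aggregate the raw file data down to a compact summary table.
# Paths and column names are hypothetical.
pageviews = spark.read.parquet("/data/pageviews")
summary = (pageviews
    .groupBy("page", F.to_date("timestamp").alias("day"))
    .agg(F.count("*").alias("views")))

# Overwrite the summary table in a traditional database; the database,
# not Spark, then handles the many concurrent reporting queries.
(summary.write.format("jdbc")
    .option("url", "jdbc:postgresql://reporting-db:5432/reports")
    .option("dbtable", "daily_pageviews")
    .option("user", "etl_writer")
    .option("password", "...")
    .mode("overwrite")
    .save())
```
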
Presenters
  • Vida Ha

    Solution Architect - Databricks

    Vida is currently a Solutions Engineer at Databricks, where her job is to onboard and support customers using Apache Spark on Databricks. Previously, she worked on scaling Square’s Reporting Analytics System. She first began working with distributed computing at Google, where she improved search rankings for mobile-specific web content and built and tuned language models for speech recognition using a year’s worth of Google search queries. She’s passionate about accelerating the adoption of Apache Spark to bring its combination of speed and scale in data processing to the mainstream.

  • Dave Wang

    Product Marketing - Databricks

    Dave Wang is a product marketing manager at Databricks. He has over 10 years of experience in software development for real-time applications. Prior to Databricks, he was a management consultant at McKinsey & Company where he advised CEOs and CIOs on big data technology and its impact on business strategy.