GraphFrames: DataFrame-based Graphs for Apache® Spark™

On-Demand Webinar

Recorded on 04/14/2016 10:00am PT, 1:00pm ET, 5:00pm UTC

The slides and notebooks for this session are available as attachments within the webinar itself. To download the materials, start the webinar, hover over it, and click [Attachments].


GraphFrames bring the power of Apache Spark™ DataFrames to interactive analytics on graphs.



Expressive motif queries simplify pattern search in graphs, and DataFrame integration lets you mix graph queries seamlessly with Spark SQL and ML. By building on the Catalyst optimizer and the Tungsten execution engine, GraphFrames provide scalability and performance. Uniform language APIs expose the full functionality of GraphX to Java and Python users for the first time.
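
To make the idea of a motif query concrete, here is a minimal plain-Python sketch of what a two-hop pattern search computes. It does not use Spark or the GraphFrames package; the flight data and the helper name find_two_hop_paths are hypothetical and exist only for illustration. In GraphFrames itself, the same pattern would be written as a motif string such as "(a)-[ab]->(b); (b)-[bc]->(c)" passed to the graph's find method, with the result returned as a DataFrame that can be filtered further.

```python
from collections import defaultdict

# Hypothetical flight edges: (origin, destination, delay_minutes).
flights = [
    ("SEA", "SFO", 25),
    ("SFO", "JFK", 40),
    ("SEA", "JFK", 0),
    ("JFK", "BOS", 5),
]


def find_two_hop_paths(edges):
    """Return (a, b, c, ab, bc) for every pair of edges a -> b and b -> c."""
    # Index edges by their source vertex so each lookup is a dictionary access.
    out_edges = defaultdict(list)
    for edge in edges:
        out_edges[edge[0]].append(edge)

    matches = []
    for ab in edges:
        # Every edge leaving b extends the first hop into a two-hop path.
        for bc in out_edges[ab[1]]:
            matches.append((ab[0], ab[1], bc[1], ab, bc))
    return matches


paths = find_two_hop_paths(flights)

# Keep only connections where both legs were delayed, much as one would
# filter the DataFrame returned by a motif query.
delayed = [(a, b, c) for a, b, c, ab, bc in paths if ab[2] > 0 and bc[2] > 0]
print(delayed)
```

The sketch separates the structural match (two chained edges) from the filter on edge attributes, which mirrors how a motif result can be combined with ordinary DataFrame operations.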



In this talk, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. The talk will be generally accessible, covering major improvements over GraphX and providing resources for getting started. A running example of analyzing flight delays will illustrate the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms.



For experts, this talk will also include a few technical details on design decisions, the current implementation, and ongoing work on performance optimizations.

Presenters
  • Joseph Bradley

    Software Engineer - Databricks

    Joseph Bradley is a Software Engineer and Apache Spark PMC member working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon University in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.

  • Denny Lee

    Technology Evangelist - Databricks

    Denny Lee is a Technology Evangelist with Databricks. He is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud environments.