October 10th, 2019 - 1:00 PM EST / 10:00 AM PST
Model inference, unlike model training, is usually embarrassingly parallel and hence simple to distribute. However, in practice, complex data scenarios and compute infrastructure often make this "simple" task hard to do from data source to sink. In this webinar, we will provide a reference end-to-end pipeline for distributed deep learning model inference using the latest features from Apache Spark™ and Delta Lake.
While the reference pipeline applies to various deep learning scenarios, we will focus on image applications, and demonstrate specific pain points and proposed solutions.
We start with data ingestion and ETL, using the binary file data source in Apache Spark to load raw image files and store them in a Delta Lake table. A small code change then enables Spark Structured Streaming to continuously discover and import new images, keeping the table up to date. Reading from the Delta Lake table, a pandas UDF wraps single-node code to perform distributed model inference in Spark.
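The three steps above can be sketched roughly as follows. This is a minimal illustration, not the webinar's actual notebook code: it assumes Spark 3.0 or later with Delta Lake configured, and all function names, paths, and the `*.jpg` filter are hypothetical. In particular, `score_image` is a trivial stand-in for a real deep learning model, used only to show where single-node code plugs in.

```python
def score_image(content: bytes) -> float:
    """Single-node stand-in for a real model (assumption for illustration):
    returns the mean byte value of the file, scaled to [0, 1]."""
    if not content:
        return 0.0
    return sum(content) / (255.0 * len(content))


def ingest_images(spark, src_dir, table_path):
    """Batch ETL: load raw image files with the binary file data source
    and append them to a Delta Lake table."""
    images = (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", "*.jpg")  # hypothetical filter
        .load(src_dir)
    )
    images.write.format("delta").mode("append").save(table_path)


def stream_images(spark, src_dir, table_path, checkpoint_path):
    """Streaming variant: read -> readStream and write -> writeStream,
    so new files are discovered and imported continuously."""
    # Streaming file sources need an explicit schema; reuse the batch one.
    schema = spark.read.format("binaryFile").load(src_dir).schema
    return (
        spark.readStream.format("binaryFile")
        .schema(schema)
        .option("pathGlobFilter", "*.jpg")
        .load(src_dir)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )


def predict_images(spark, table_path):
    """Distributed inference: wrap the single-node function in a pandas UDF
    and apply it to the `content` column of the Delta table."""
    # Imported lazily so score_image stays usable without Spark installed.
    import pandas as pd
    from pyspark.sql.functions import col, pandas_udf

    @pandas_udf("double")
    def predict(content: pd.Series) -> pd.Series:
        # Each batch arrives as a pandas Series of raw file bytes.
        return content.map(score_image).astype("float64")

    images = spark.read.format("delta").load(table_path)
    return images.select(col("path"), predict(col("content")).alias("score"))
```

Keeping `score_image` free of Spark dependencies means it can be checked on a single machine before it is distributed; a real pipeline would replace it with image decoding and a model forward pass.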
We will provide some performance tuning tips and show how to monitor resource utilization, then briefly discuss CPU vs. GPU cost-effectiveness.
Sample Databricks notebooks will be provided to all registrants.