Scalable Machine Learning with Apache Spark™

Summary

In this course, you will experience the full data science workflow, including data exploration, feature engineering, model building, and hyperparameter tuning. By the end of this course, you will have built an end-to-end distributed machine learning pipeline ready for production.

Description

This course guides students through the process of building machine learning pipelines using Apache Spark. You will build solutions to parallelize model training, hyperparameter tuning, and inference.

This course starts by introducing the core components of SparkML: transformers, estimators, and pipelines. To keep track of your machine learning experiments, you will use the open-source libraries MLflow and Delta Lake to version and reproduce your models. Next, you will perform Bayesian hyperparameter tuning in a distributed setting with Hyperopt, followed by parallelizing the training and inference of single-node models with Pandas UDFs. Lastly, you will learn how to leverage the Databricks Feature Store and glass-box AutoML offering to jumpstart your machine learning projects. There are plenty of hands-on labs, all available in Python.

Duration

2 Days

Objectives

Upon completion of the course, students should be able to:

 

  • Create data processing pipelines with Spark
  • Build and tune machine learning models with SparkML
  • Track, version, and deploy models with MLflow
  • Perform distributed hyperparameter tuning with Hyperopt
  • Use Spark to scale the inference of single-node models 
  • Use Databricks glass-box AutoML to automatically train and tune models 

Audience

  • Data scientists
  • Machine learning engineers

Prerequisites

  • Intermediate experience with Python
  • Beginner-level experience with the PySpark DataFrame API (or completion of the Apache Spark Programming with Databricks class)
  • Working knowledge of machine learning and data science

Outline

Day 1 AM

  • Review (20m): Review of Spark concepts
  • ML Overview (optional, 30m): Types of machine learning; business applications of ML (Note: this class uses Airbnb's San Francisco rental data to predict values such as rental price)
  • Break (10m)
  • Data Cleansing (35m): How to deal with null values, outliers, and data imputation (see the Imputer sketch after this outline)
  • Data Exploration Lab (40m): Exploring your data; the log-normal distribution; determining a baseline metric to beat
  • Break (10m)
  • Linear Regression I (30m): Build a simple univariate linear regression model; SparkML APIs: transformer vs. estimator (see the sketch after this outline)
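A minimal sketch of the imputation approach covered in the Data Cleansing lesson, assuming an already loaded Spark DataFrame named airbnb_df; the column names are illustrative, not the course dataset's exact schema:

    from pyspark.ml.feature import Imputer

    # Replace nulls in the numeric columns with each column's median;
    # imputed values land in new "*_imputed" output columns.
    imputer = Imputer(
        strategy="median",
        inputCols=["bedrooms", "bathrooms"],
        outputCols=["bedrooms_imputed", "bathrooms_imputed"],
    )
    imputed_df = imputer.fit(airbnb_df).transform(airbnb_df)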
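The transformer vs. estimator distinction from Linear Regression I, sketched with the same assumed airbnb_df and illustrative column names: a transformer only has transform(), while an estimator's fit() learns parameters and returns a fitted model that is itself a transformer.

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # Transformer: rule-based, no learning involved.
    assembler = VectorAssembler(inputCols=["bedrooms"], outputCol="features")
    train_df = assembler.transform(airbnb_df)

    # Estimator: fit() learns the coefficients and returns a LinearRegressionModel,
    # which is itself a transformer that adds a "prediction" column.
    lr = LinearRegression(featuresCol="features", labelCol="price")
    lr_model = lr.fit(train_df)
    pred_df = lr_model.transform(train_df)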

 

Day 1 PM

  • Linear Regression I Lab (20m): Build a multivariate linear regression model; evaluate RMSE and R2
  • Linear Regression II (30m): How to one-hot encode in Spark; the Pipeline API; saving and loading models (see the pipeline sketch after this outline)
  • Break (10m)
  • Linear Regression II Lab (40m): Simplify the pipeline using RFormula; build a linear regression model that predicts on the log scale, then exponentiate the prediction and evaluate
  • MLflow Tracking (30m): Use MLflow to track experiments, log metrics, and compare runs (see the tracking sketch after this outline)
  • Break (10m)
  • MLflow Model Registry (30m): Register a model using MLflow; deploy that model into production; update a production model to a new version, including a staging phase for testing; archive and delete models (see the registry sketch after this outline)
  • MLflow Lab (45m): Use MLflow to track models and a Delta table; register the model
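A rough sketch of the Linear Regression II topics (string indexing plus one-hot encoding, the Pipeline API, and saving/loading a fitted pipeline), again with illustrative column names and save path:

    from pyspark.ml import Pipeline, PipelineModel
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # Categorical column: index the strings, then one-hot encode the indices.
    indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index",
                            handleInvalid="skip")
    encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["room_type_ohe"])
    assembler = VectorAssembler(inputCols=["room_type_ohe", "bedrooms"], outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="price")

    # The Pipeline chains the stages; fit() returns a PipelineModel.
    pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
    pipeline_model = pipeline.fit(airbnb_df)

    # Fitted pipelines can be saved and reloaded.
    pipeline_model.write().overwrite().save("/tmp/lr_pipeline_model")
    reloaded_model = PipelineModel.load("/tmp/lr_pipeline_model")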
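A minimal MLflow Tracking sketch, assuming the pipeline defined in the sketch above and the same assumed airbnb_df:

    import mlflow
    import mlflow.spark
    from pyspark.ml.evaluation import RegressionEvaluator

    train_df, test_df = airbnb_df.randomSplit([0.8, 0.2], seed=42)

    with mlflow.start_run(run_name="lr-pipeline") as run:
        mlflow.log_param("label", "price")

        pipeline_model = pipeline.fit(train_df)
        pred_df = pipeline_model.transform(test_df)

        rmse = RegressionEvaluator(labelCol="price", metricName="rmse").evaluate(pred_df)
        mlflow.log_metric("rmse", rmse)

        # Log the fitted SparkML pipeline as a run artifact.
        mlflow.spark.log_model(pipeline_model, "model")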
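A sketch of the Model Registry workflow, continuing from the tracking run above; the registry name "airbnb-price-model" is hypothetical:

    import mlflow
    from mlflow.tracking import MlflowClient

    # Register the model logged in the run above under a registry name.
    model_uri = f"runs:/{run.info.run_id}/model"
    model_details = mlflow.register_model(model_uri, name="airbnb-price-model")

    # Promote the new version through Staging into Production.
    client = MlflowClient()
    client.transition_model_version_stage(name="airbnb-price-model",
                                          version=model_details.version,
                                          stage="Staging")
    client.transition_model_version_stage(name="airbnb-price-model",
                                          version=model_details.version,
                                          stage="Production")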

 

Day 2 AM

  • Review (20m): What did we discover yesterday?
  • Decision Trees (40m): Distributed implementation of decision trees and the maxBins parameter (why you will get different results from scikit-learn); feature importance
  • Break (10m)
  • Random Forests and Hyperparameter Tuning (40m): Random forests; k-fold cross-validation; SparkML's parallelism parameter (introduced in Spark 2.3); speed up pipeline model training by 4x (see the CrossValidator sketch after this outline)
  • Hyperparameter Tuning Lab (30m): Perform grid search on a random forest; get the feature importances across the forest; save the model; identify differences between scikit-learn's random forest and SparkML's
  • Break (10m)
  • Hyperopt (20m): Hyperparameter tuning for MLlib models
  • Hyperopt Lab (20m): Distributed hyperparameter tuning for scikit-learn models with Spark (see the Hyperopt sketch after this outline)
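A sketch of k-fold cross-validation over a random forest with SparkML's parallelism parameter, assuming train_df already contains an assembled "features" column; the grid values are illustrative:

    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    rf = RandomForestRegressor(featuresCol="features", labelCol="price", maxBins=40)

    param_grid = (ParamGridBuilder()
                  .addGrid(rf.maxDepth, [2, 5, 10])
                  .addGrid(rf.numTrees, [10, 50])
                  .build())

    # parallelism controls how many parameter combinations are trained at once.
    cv = CrossValidator(estimator=rf,
                        estimatorParamMaps=param_grid,
                        evaluator=RegressionEvaluator(labelCol="price", metricName="rmse"),
                        numFolds=3,
                        parallelism=4)
    cv_model = cv.fit(train_df)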
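A sketch of the Hyperopt Lab pattern: distributed tuning of a single-node scikit-learn model with SparkTrials. X_train and y_train are assumed in-memory arrays, and the search space is illustrative:

    from hyperopt import fmin, tpe, hp, SparkTrials
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    def objective(params):
        # One trial: train and score a single-node model for one hyperparameter combination.
        model = RandomForestRegressor(n_estimators=int(params["n_estimators"]),
                                      max_depth=int(params["max_depth"]))
        score = cross_val_score(model, X_train, y_train, cv=3,
                                scoring="neg_root_mean_squared_error").mean()
        return -score  # Hyperopt minimizes the returned loss (RMSE)

    search_space = {
        "n_estimators": hp.quniform("n_estimators", 10, 100, 10),
        "max_depth": hp.quniform("max_depth", 2, 10, 1),
    }

    # SparkTrials fans the trials out across the cluster's workers.
    best_params = fmin(fn=objective,
                       space=search_space,
                       algo=tpe.suggest,
                       max_evals=16,
                       trials=SparkTrials(parallelism=4))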

 

Day 2 PM

  • AutoML (25m): Programmatically use Databricks AutoML to automatically train and tune your models
  • AutoML Lab (20m): Use the Databricks AutoML UI to automatically train and tune your models
  • Feature Store (15m): Build, merge, and evolve features with the Databricks Feature Store
  • Break (10m)
  • XGBoost (20m): Using third-party libraries with Spark; discuss gradient-boosted trees and their variants
  • Inference with Pandas UDFs (15m): Build a single-node ML model, then apply it in parallel using mapInPandas (introduced in Spark 3.0) (see the mapInPandas sketch after this outline)
  • Pandas UDFs Lab (20m): Lab to apply a single-node ML model at scale
  • Break (10m)
  • Training with Pandas Function API (15m): Build a single-node ML model for each IoT device using applyInPandas (introduced in Spark 3.0) and track it with MLflow (see the applyInPandas sketch after this outline)
  • Koalas (20m): Use the open-source Koalas library to write pandas code that Spark distributes under the hood (see the Koalas sketch after this outline)
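A sketch of the mapInPandas inference pattern: train a single-node scikit-learn model on the driver, then apply it to a Spark DataFrame in parallel. spark_df, the feature columns, and the output schema are illustrative assumptions:

    from typing import Iterator
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    feature_cols = ["bedrooms", "bathrooms"]  # illustrative feature columns

    # Train a single-node model on the driver (spark_df is an assumed Spark DataFrame).
    train_pdf = spark_df.select("bedrooms", "bathrooms", "price").toPandas()
    sk_model = LinearRegression().fit(train_pdf[feature_cols], train_pdf["price"])

    def predict_batch(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        # Runs on the workers; sk_model is pickled into the function's closure.
        for batch in batches:
            batch["prediction"] = sk_model.predict(batch[feature_cols])
            yield batch

    # mapInPandas (Spark 3.0+) applies the function to each partition in parallel.
    pred_df = (spark_df
               .select("bedrooms", "bathrooms")
               .mapInPandas(predict_batch,
                            schema="bedrooms double, bathrooms double, prediction double"))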
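A sketch of the Pandas Function API pattern: one single-node model per IoT device via applyInPandas. The iot_df DataFrame and its column names are hypothetical; the course lesson additionally tracks each model with MLflow:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def train_per_device(pdf: pd.DataFrame) -> pd.DataFrame:
        # Each call receives every row for one device as a pandas DataFrame.
        model = LinearRegression().fit(pdf[["temperature"]], pdf["power"])
        return pd.DataFrame({
            "device_id": [pdf["device_id"].iloc[0]],
            "r2": [model.score(pdf[["temperature"]], pdf["power"])],
        })

    # applyInPandas (Spark 3.0+) trains the per-device models in parallel.
    results_df = (iot_df
                  .groupBy("device_id")
                  .applyInPandas(train_per_device, schema="device_id string, r2 double"))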
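A small Koalas sketch of pandas-style code that Spark executes under the hood; the file path and column names are illustrative:

    import databricks.koalas as ks

    # Reads into a Koalas DataFrame backed by Spark, not a single-node pandas DataFrame.
    kdf = ks.read_csv("/path/to/listings.csv")  # illustrative path

    # Familiar pandas syntax, distributed execution.
    kdf["price_per_bedroom"] = kdf["price"] / kdf["bedrooms"]
    avg_price = kdf.groupby("neighbourhood")["price"].mean()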

Upcoming Classes

  • Aug 23 - 24, 9:00 AM - 5:00 PM Central European Summer Time, Online - Virtual - EMEA, $1500.00 USD
  • Aug 30 - Sep 2, 9:00 AM - 1:00 PM Pacific Daylight Time, Online - Virtual - Americas (half-day schedule), $1500.00 USD
  • Sep 21 - 24, 8:00 AM - 12:00 PM Australian Eastern Standard Time (Victoria), Online - Virtual - Australia, $1500.00 USD
  • Sep 23 - 24, 9:00 AM - 5:00 PM Central European Summer Time, Online - Virtual - EMEA, $1500.00 USD
  • Sep 27 - 30, 9:00 AM - 1:00 PM Pacific Daylight Time, Online - Virtual - Americas (half-day schedule), $1500.00 USD
  • Oct 25 - 28, 9:00 AM - 1:00 PM Pacific Daylight Time, Online - Virtual - Americas (half-day schedule), $1500.00 USD
  • Oct 28 - 29, 9:00 AM - 5:00 PM Central European Summer Time, Online - Virtual - EMEA, $1500.00 USD
  • Nov 29 - Dec 2, 9:00 AM - 1:00 PM Pacific Standard Time, Online - Virtual - Americas (half-day schedule), $1500.00 USD
  • Nov 29 - 30, 9:00 AM - 5:00 PM Central European Time, Online - Virtual - EMEA, $1500.00 USD
  • Dec 27 - 30, 9:00 AM - 1:00 PM Pacific Standard Time, Online - Virtual - Americas (half-day schedule), $1500.00 USD
  • Dec 28 - 29, 9:00 AM - 5:00 PM Central European Time, Online - Virtual - EMEA, $1500.00 USD
