Scalable Machine Learning with Apache Spark™

Summary

In this course, data analysts and data scientists practice the full data science workflow: exploring data, creating features, building models, tuning hyperparameters, and tracking parameters and managing models with MLflow. By the end of this course, you will have built end-to-end machine learning models ready for production.

Description

This course guides students through the process of building machine learning solutions using Spark. You will build and tune ML models with SparkML using transformers, estimators, and pipelines. This course highlights some of the key differences between SparkML and single-node libraries such as scikit-learn. Furthermore, you will reproduce your experiments and version your models using MLflow.

You will also integrate third-party libraries, such as XGBoost, into Spark workloads. In addition, you will leverage Spark to scale inference of single-node models and parallelize hyperparameter tuning. This course includes hands-on labs and concludes with a collaborative capstone project. All of the notebooks are available in Python, and in Scala where available.

Duration

2 Days

Objectives

Upon completion of the course, students should be able to:

  • Create data processing pipelines with Spark
  • Build and tune machine learning models with SparkML
  • Track, version, and deploy models with MLflow
  • Perform distributed hyperparameter tuning with Hyperopt
  • Use Spark to scale the inference of single-node models 

Audience

  • Data scientist
  • Machine learning engineer

Prerequisites

  • Intermediate experience with Python
  • Beginning experience with the PySpark DataFrame API (or completion of the Apache Spark Programming with Databricks class)
  • Working knowledge of machine learning and data science

Outline

Day #1 AM

1

Introductions & Setup

Registration, Courseware & Q&As

2

Apache Spark Overview

Delta Overview

About Databricks, Spark & Spark Architecture

Delta Lake and how to leverage it for data science/ML applications

3

ML Overview (optional)

Types of Machine Learning, Business applications of ML

(NOTE: this class uses Airbnb's SF rental data to predict things such as price of rental)

4

Data Cleansing

How to deal with null values, outliers, data imputation

5

Data Exploration Lab

Exploring your data, log-normal distribution, determine baseline metric to beat

6

Linear Regression I

Build simple univariate linear regression model

SparkML APIs: transformer vs estimator
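
The transformer vs estimator distinction covered in the Linear Regression I session can be sketched without Spark. The following is a minimal plain-Python illustration, not SparkML's actual classes: `MeanImputer` and `MeanImputerModel` are hypothetical names chosen for this example, and the null handling echoes the Data Cleansing session. The point is only the pattern: an estimator learns from data in `fit()` and returns a transformer, which then applies what was learned in `transform()`.

```python
from statistics import mean


class MeanImputerModel:
    """Transformer (hypothetical): holds a learned fill value and applies it."""

    def __init__(self, fill_value):
        self.fill_value = fill_value

    def transform(self, values):
        # No learning happens here; the stored value is simply applied.
        return [self.fill_value if v is None else v for v in values]


class MeanImputer:
    """Estimator (hypothetical): learns from data and returns a transformer."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        return MeanImputerModel(mean(observed))


# Illustrative data only: a column with one missing value.
bedrooms = [1.0, None, 3.0, 2.0]
model = MeanImputer().fit(bedrooms)   # estimator -> fitted model
imputed = model.transform(bedrooms)   # transformer -> new column
print(imputed)
```

Keeping the learned state in a separate model object is what allows a fitted model to be saved and later applied to new data without refitting.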

Day #1 PM

1

Linear Regression I Lab

Build multivariate linear regression model

Evaluate RMSE and R2

2

Linear Regression II

How to One Hot Encode in Spark, Pipeline API, Saving and loading models

3

Linear Regression II Lab

Simplify pipeline using RFormula

Build linear regression model to predict on log-scale, then exponentiate prediction and evaluate

4

MLflow Tracking

Use MLflow to track experiments, log metrics, and compare runs

5

MLflow Model Registry

Register a model using MLflow

Deploy that model into production

Update a model in production to new version including a staging phase for testing

Archive and delete models

6

MLflow Lab

Use MLflow to track models and Delta table

Register Model
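
The evaluation steps named in this session's linear regression labs (RMSE, R2, and exponentiating log-scale predictions before evaluating) can be sketched in plain Python. This is an illustration under stated assumptions, not the course's notebook code: the prices and predictions are made-up values, and `rmse` and `r2` are helper names introduced here. In SparkML, an evaluator class would typically compute these metrics.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    avg = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - avg) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up prices and log-scale predictions, for illustration only.
prices = [100.0, 150.0, 200.0]
log_predictions = [math.log(110.0), math.log(140.0), math.log(200.0)]

# Exponentiate first so the metrics are in the original price units.
predictions = [math.exp(p) for p in log_predictions]

print(rmse(prices, predictions), r2(prices, predictions))
```

Evaluating on the exponentiated predictions keeps the error comparable with a model trained directly on price.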

Day #2 AM

1

Review

What did we discover yesterday?

2

Decision Trees

Distributed implementation of decision trees and maxBins parameter (why you WILL get different results from sklearn)

Feature importance

3

Hyperparameter Tuning

K-Fold cross-validation

SparkML's parallelism parameter (introduced in Spark 2.3)

Speed up Pipeline model training by 4x

4

Hyperparameter Tuning/RF Lab

Hyperparameter search with cross-validation on Random Forests

5

Hyperopt

Hyperparameter tuning for MLlib models
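
To see why the hyperparameter tuning sessions emphasize parallelism, it helps to count the model fits that k-fold cross-validation over a grid requires. The sketch below is plain Python rather than SparkML: `k_fold_indices` is a helper introduced here, and the grid values are hypothetical. Each (parameter combination, fold) pair is an independent fit, which is the kind of work that can be run concurrently.

```python
from itertools import product


def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs for k roughly equal folds."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical random forest grid; the values are illustrative only.
grid = {"maxDepth": [2, 5], "numTrees": [5, 10]}
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

splits = list(k_fold_indices(n_rows=10, k=3))

# Every parameter combination is fit once per fold.
fits = len(combos) * len(splits)
print(f"{len(combos)} combinations x {len(splits)} folds = {fits} fits")
```

Because the number of fits grows multiplicatively with the grid size and the fold count, even a small grid benefits from evaluating several models at once.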

Day #2 PM

1

Hyperopt Lab

Distributed hyperparameter tuning for scikit-learn models with SparkTrials

2

MLlib Deployment Options

Discuss batch, streaming, and real-time use cases and how MLlib/Spark fit into each of those

3

XGBoost

Using 3rd party libraries with Spark

Discuss gradient boosted trees and their variants

4

Inference with Pandas UDFs

Build a single-node ML model, but apply in parallel using mapInPandas (introduced in Spark 3.0)

5

Training with Pandas UDFs

Build a single-node ML model for each IoT Device using applyInPandas (introduced in Spark 3.0) and track it with MLflow

6

Pandas UDFs Lab

Lab to apply a single-node ML model at scale

7

Koalas

Use the new open-source library to write Pandas code that distributes using Spark under the hood

8

Capstone Project & Course Recap

Load dataset into Databricks and create Delta Table

Build a machine learning model

Track model performance with MLflow

Present models to class (if time)
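
The "one single-node model per IoT device" idea from the Training with Pandas UDFs session can be sketched in plain Python as group, then fit independently. This is an illustration only, not the course's code: the readings are made-up, `fit_line` is a helper introduced here, and a simple least-squares line stands in for whatever single-node model is used. In Spark, the grouping and per-group function call would be handled by `groupBy(...).applyInPandas(...)` instead of the dictionary below.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up readings: (device_id, feature, label).
readings = [
    ("a", 1.0, 2.0), ("a", 2.0, 4.0), ("a", 3.0, 6.0),
    ("b", 1.0, 5.0), ("b", 2.0, 4.0), ("b", 3.0, 3.0),
]

# Group rows by device, as a grouped Spark operation would.
groups = defaultdict(list)
for device_id, x, y in readings:
    groups[device_id].append((x, y))

# Fit one independent model per device.
models = {device_id: fit_line(points) for device_id, points in groups.items()}
print(models)
```

Since each group's fit depends only on that group's rows, the per-device work can be distributed across a cluster without coordination between devices.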



Upcoming Classes

Date | Time | Location | Price
Apr 28 - 29 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
May 3 - 6 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
May 19 - 20 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
Jun 1 - 4 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Jun 23 - 24 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
Jun 28 - Jul 1 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Jul 26 - 29 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Jul 28 - 29 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
