DB 301 - Apache Spark™ for Machine Learning and Data Science

Summary

This 3-day course introduces Spark fundamentals and ML fundamentals, then takes a cursory look at a range of Machine Learning and Data Science topics. Through lecture and hands-on labs, it emphasizes skills development and the specific needs of a Data Science team.

Description

This course is combined with DB 100 - Apache Spark Overview to provide a comprehensive overview of the Apache Spark framework and the Spark-ML libraries for Data Scientists.

After working through the Apache Spark fundamentals on the first day, the following days delve into topics specific to Machine Learning and Data Science. Participants are introduced to the Machine Learning Pipeline, transformers, estimators, and featurizers. Additional topics of particular interest to Data Analysts and Data Scientists, such as data cleansing, imputing missing values, and general data exploration, are also covered.
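
For readers unfamiliar with the terms, the relationship between estimators, transformers, and a pipeline can be sketched in plain Python. This is a conceptual illustration only, not course material and not Spark's API: the class names `MeanImputer`, `MeanImputerModel`, `SquareFeature`, `Pipeline`, and `PipelineModel`, and the column name `x`, are hypothetical and merely echo the general fit/transform pattern. An estimator learns something from data in `fit()` and returns a transformer; a transformer only maps rows to rows.

```python
from statistics import mean


class MeanImputerModel:
    """Transformer: fills missing values with a previously learned mean."""

    def __init__(self, fill_value):
        self.fill_value = fill_value

    def transform(self, rows):
        return [
            {**r, "x": self.fill_value if r["x"] is None else r["x"]}
            for r in rows
        ]


class MeanImputer:
    """Estimator: fit() learns the mean of column "x" and returns a transformer."""

    def fit(self, rows):
        known = [r["x"] for r in rows if r["x"] is not None]
        return MeanImputerModel(mean(known))


class SquareFeature:
    """Stateless transformer (a simple featurizer): adds an "x_squared" column."""

    def transform(self, rows):
        return [{**r, "x_squared": r["x"] ** 2} for r in rows]


class PipelineModel:
    """A fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; estimators are fitted, transformers are used as-is."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> fitted transformer
            rows = stage.transform(rows)  # feed transformed rows to next stage
            fitted.append(stage)
        return PipelineModel(fitted)


rows = [{"x": 1.0}, {"x": None}, {"x": 3.0}]
model = Pipeline([MeanImputer(), SquareFeature()]).fit(rows)
result = model.transform(rows)
```

Here the missing value is imputed from the two known values before the derived feature is computed, which is why stage order matters in a pipeline.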

Depending on the interests of the class, numerous electives are also available, including MLeap, XGBoost & LightGBM, AutoML, spark-sklearn, Koalas, Collaborative Filtering, Isolation Forests, and more.

Duration

3 Days

Objectives

Upon completion, students should be able to:

  • Describe how Apache Spark's distributed design allows for the processing of gigabytes to terabytes of data
  • Apply basic intuition to the minor, albeit common, performance problems that new developers often encounter
  • Use the DataFrame APIs to ingest, alter, and write data
  • Understand the breadth and depth of Apache Spark's capabilities
  • Build a Machine Learning Pipeline using a combination of Transformers and Estimators
  • Save and restore models
  • Apply models to a streaming data source

Optionally:

  • Additional objectives vary depending on the electives selected

Audience

  • This course is ideal for Data Scientists and ML Practitioners who are new to Apache Spark and want to learn how to apply their skills with the Apache Spark framework
  • This course is suitable for SQL Analysts seeking to grow beyond simple SQL queries and into the use of the DataFrame and Spark-ML APIs
  • This course is suitable for Data Analysts and Data Engineers who have a stronger data science background and would like a deeper understanding of Spark-ML capabilities

Prerequisites

Prerequisite Knowledge:

  • Knowledge of SQL is helpful
  • Experience with either Python or Scala is required
  • Some familiarity with Apache Spark or other big-data processing frameworks is helpful but not required
  • Working knowledge of Machine Learning and Data Science principles is recommended but not required

Prerequisite Courses:

Software & Hardware Requirements

  • Web Browser: Chrome
  • An Internet Connection
  • GoToTraining (for remote classes only)
    Please see the GoToMeeting System Check
  • A computer, laptop, or tablet with a keyboard

Additional Notes

  • The appropriate, web-based programming environment will be provided to students
  • This class can be taught concurrently in Python and Scala
  • A two-day version of this course is available with a specific focus on the DataFrame and Spark-ML fundamentals
  • While suitable for participants with a limited data science background, this course does not teach machine learning or data science principles

Outline

  • About Databricks, Spark
  • Types of Machine Learning and Business Applications of ML
  • Data cleansing: dealing with null values, outliers, and imputation
  • Linear Regression: univariate and multivariate models, evaluating measures of fit
  • Advanced Linear Regression: categorical variables, pipelines, saving and loading
  • Use MLflow to track experiments, log metrics, and compare runs
  • ML Algorithms in Spark: Decision trees, Random Forest, XGBoost, LightGBM, Isolation Forest, K-Means
  • Deployment Options
  • Hyperparameter Tuning: Cross-validation and performance tuning
  • Logistic regression

Upcoming Classes

Date         Time                                     Location                       Price
Apr 21 - 23  9:00 AM - 5:00 PM Pacific Daylight Time  Online - Virtual - US Pacific  $2500.00 USD
Jun 9 - 11   9:00 AM - 5:00 PM Eastern Daylight Time  Edison, United States          $2500.00 USD
Jun 9 - 11   9:00 AM - 5:00 PM Eastern Daylight Time  Online - Virtual - US Eastern  $2500.00 USD
Jul 27 - 29  9:00 AM - 5:00 PM Eastern Daylight Time  McLean, United States          $2500.00 USD
Jul 27 - 29  9:00 AM - 5:00 PM Eastern Daylight Time  Online - Virtual - US Eastern  $2500.00 USD
Sep 14 - 16  9:00 AM - 5:00 PM Pacific Daylight Time  San Francisco, United States   $2500.00 USD
Sep 14 - 16  9:00 AM - 5:00 PM Pacific Daylight Time  Online - Virtual - US Pacific  $2500.00 USD
Oct 26 - 28  9:00 AM - 5:00 PM Eastern Daylight Time  McLean, United States          $2500.00 USD
Oct 26 - 28  9:00 AM - 5:00 PM Eastern Daylight Time  Online - Virtual - US Eastern  $2500.00 USD
Dec 14 - 16  9:00 AM - 5:00 PM Eastern Standard Time  McLean, United States          $2500.00 USD
Dec 14 - 16  9:00 AM - 5:00 PM Eastern Standard Time  Online - Virtual - US Eastern  $2500.00 USD

Public Training

Virtual - US Pacific

Edison, NJ

Virtual - US Eastern

McLean, VA

San Francisco, CA
