DB 301 - Apache Spark™ for Machine Learning and Data Science



*This course is to be replaced by Scalable Machine Learning with Apache Spark  


This 3-day course introduces Apache Spark fundamentals and machine learning fundamentals, then takes a cursory look at various Machine Learning and Data Science topics. Through lecture and hands-on labs, it emphasizes skills development and the unique needs of a Data Science team.


This course is combined with DB 100 - Apache Spark Overview to provide a comprehensive overview of the Apache Spark framework and the Spark-ML libraries for Data Scientists.

After working through the Apache Spark fundamentals on the first day, the following days delve into topics specific to Machine Learning and Data Science. Participants are introduced to the Machine Learning Pipeline, transformers, estimators, and featurizers. Additional topics specific to Data Analysts and Data Scientists, such as data cleansing, imputing missing values, and general data exploration, are also covered.

Depending on the interests of the class, numerous electives are also available, including MLeap, XGBoost & LightGBM, AutoML, spark-sklearn, Koalas, Collaborative Filtering, Isolation Forests, and more.


3 Days


Upon completion, students should be able to:

  • Describe how Apache Spark's distributed design allows it to process gigabytes to terabytes of data
  • Apply basic intuition to the minor, albeit common, performance problems that new developers often encounter
  • Use the DataFrame APIs to ingest, transform, and write data
  • Understand the breadth and depth of Apache Spark's capabilities
  • Build a Machine Learning Pipeline using a combination of Transformers and Estimators
  • Save and restore models
  • Apply models to a streaming data source


  • Various depending on selected electives


  • This course is ideal for Data Scientists and ML Practitioners who are new to Apache Spark and want to learn how to apply their skills with the Apache Spark framework
  • This course is suitable for SQL Analysts seeking to grow beyond simple SQL queries and into the use of the DataFrame and Spark-ML APIs
  • This course is suitable for Data Analysts and Data Engineers who have a stronger data science background and would benefit from a deeper understanding of Spark-ML capabilities


Prerequisite Knowledge:

  • Knowledge of SQL is helpful
  • Experience with either Python or Scala is required
  • Some familiarity with Apache Spark or other big-data processing frameworks is helpful but not required
  • Working knowledge of Machine Learning and Data Science principles is strongly recommended but not required

Prerequisite Courses:

Software & Hardware Requirements

  • Web Browser: Chrome
  • An Internet Connection
  • GoToTraining (for remote classes only)
    Please see the GoToMeeting System Check
  • A computer, laptop, or tablet with a keyboard

Additional Notes

  • The appropriate, web-based programming environment will be provided to students
  • This class can be taught concurrently in Python and Scala
  • A two-day version of this course is available with a specific focus on the DataFrame and Spark-ML fundamentals
  • While suitable for participants with a limited data science background, this course does not teach machine learning or data science principles


  • About Databricks, Spark
  • Types of Machine Learning and Business Applications of ML
  • Data cleansing: dealing with null values, outliers, and imputation
  • Linear Regression: univariate and multivariate models, evaluating measures of fit
  • Advanced Linear Regression: categorical variables, pipelines, saving and loading
  • Use MLflow to track experiments, log metrics, and compare runs
  • ML Algorithms in Spark: Decision Trees, Random Forest, XGBoost, LightGBM, Isolation Forest, K-Means
  • Deployment Options
  • Hyperparameter Tuning: Cross-validation and performance tuning
  • Logistic Regression
