Scalable Machine Learning with Apache Spark


Build end-to-end machine learning models ready for production.


This course guides you through the process of building machine learning solutions using Spark. You will build and tune ML models with Spark ML using transformers, estimators, and pipelines. The course highlights some of the key differences between Spark ML and single-node libraries such as scikit-learn. You will also reproduce your experiments and version your models using MLflow, and integrate third-party libraries, such as XGBoost, into Spark workloads. In addition, you will leverage Spark to scale inference of single-node models and parallelize hyperparameter tuning.

Learning objectives

  • Create data processing pipelines with Spark.

  • Build and tune machine learning models with Spark ML.

  • Track, version, and deploy models with MLflow.

  • Perform distributed hyperparameter tuning with Hyperopt.

  • Use Spark to scale the inference of single-node models.


Prerequisites

  • Intermediate experience with Python

  • Beginning experience with the PySpark DataFrame API (or completion of the Apache Spark Programming with Databricks class)

  • Working knowledge of machine learning and data science

Learning path

  • This course is part of the Data Scientist learning path.

Proof of completion

  • After completing 80% of this course, you will receive a proof of completion.

