Scalable Data Science with SparkR/sparklyr


This course offers a thorough, hands-on overview of how to scale R applications with Apache Spark.


In this course, data analysts and data scientists gain hands-on experience scaling their exploratory data analysis and data science workflows with Apache Spark. You will use both SparkR and sparklyr to train distributed models, tune hyperparameters, and perform inference at scale. You will then track these models with MLflow so that your experiments are reproducible. You will also learn best practices for data access and storage with Delta Lake, as well as how to optimize your Spark code.


Duration: 2 days


Upon completion, students should be able to:

  • Explain Spark architecture and how Spark works
  • Read different data sources using SparkR and sparklyr
  • Build machine learning models such as linear regression, decision trees, and random forests
  • Perform hyperparameter tuning at scale
  • Track experiments with MLflow
  • Optimize Spark code and change configurations
  • Store and access data in Delta Lake


Audience

  • Data Analyst
  • Data Scientist


Prerequisites

  • Intermediate experience using R
  • Working knowledge of machine learning and data science

Additional Notes

  • An appropriate web-based programming environment will be provided to students
  • This class is taught in R only

Upcoming Classes

No classes have been scheduled, but you can always Request a Class.