Distributed Machine Learning in SparkR/sparklyr


This course offers a thorough, hands-on overview of how to scale R applications with Apache Spark.


In this course, data analysts and data scientists practice the full data science workflow in R at scale: exploring data, building features, training regression and classification models, and tuning and selecting the best model. Students also learn best practices for data access and storage, and how to optimize Spark code and adjust configurations to improve performance.


Duration: 2 days


Upon completion, students should be able to:

  • Explain Spark architecture and how Spark works
  • Read different data sources using SparkR and sparklyr
  • Build machine learning models such as linear regression, decision trees, and random forests
  • Perform hyperparameter tuning at scale
  • Track experiments with MLflow
  • Optimize Spark code and change configurations
  • Store and access data in Delta Lake


  • This course is ideal for data scientists who are interested in using R at scale
  • This course is also suitable for data analysts and machine learning engineers


Prerequisite Knowledge:

  • Familiarity with R is required
  • Familiarity with Spark is helpful
  • Familiarity with Machine Learning concepts is suggested


Software & Hardware Requirements

  • Web Browser: Chrome
  • An Internet Connection
  • GoToTraining (for remote classes only)
    Please see the GoToMeeting System Check
  • A computer, laptop, or tablet with a keyboard

Additional Notes

  • An appropriate web-based programming environment will be provided to students
  • This class is taught in R only


Course Outline:

  • Get familiar with the Databricks environment
  • Create a Spark DataFrame; analyze the Spark UI; cache data and change Spark default configurations to speed up queries
  • Use single-node R in a Databricks notebook to read in a CSV file; select columns; visualize relationships using ggplot
  • Use SparkR to read in a CSV file in a distributed manner; select columns; visualize relationships using ggplot
  • Use sparklyr to read in a CSV file in a distributed manner; select columns; visualize relationships using ggplot
  • Learn to create temporary views of tables for access between sparklyr, SparkR, and SQL cells; write to and read from DBFS
  • Set up RStudio integration with Databricks
  • Use built-in SparkR functions; deal with null/missing values
  • Examine the Spark UI to better understand what is happening under the hood; change Spark configurations to speed up Spark jobs
  • Learn how to apply R functions in parallel; train a regression model on the Airbnb dataset; apply the regression model to the test dataset in parallel
  • Use the SparkR API to build a linear regression model; evaluate model on test data
  • Identify the differences between single-node and distributed decision tree implementations; get the feature importance; examine common pitfalls of decision trees
  • Use the Pipeline API; tune hyperparameters using grid search; optimize the SparkML pipeline
  • Install MLflow for R on a Databricks cluster; train a model and log metrics, parameters, figures and models; view the results of training in the MLflow tracking UI
  • Use Delta Lake to store versions of a table; use MLflow on a specific version of a Delta Lake table
  • Cache data to speed up query performance; discuss when/where to cache data
  • Compare CSV to Parquet in terms of storage and query performance; analyze the impact of different file compression formats on parallelization of tasks
  • Partition your data for increased query performance; minimize the small file problem
