Scalable Data Science with SparkR/sparklyr

Summary

This course offers a thorough, hands-on overview of how to scale R applications with Apache Spark.

Description

In this course, data analysts and data scientists gain hands-on experience scaling their exploratory data analysis and data science workflows with Apache Spark. You will use both SparkR and sparklyr to train distributed models, tune hyperparameters, and perform inference at scale, and you will track these models with MLflow to make your experiments reproducible. You will also learn data access and storage best practices with Delta Lake, as well as how to optimize your Spark code.

Duration

2 Days

Objectives

Upon completion, students should be able to:

  • Explain the Spark architecture and execution model
  • Read different data sources using SparkR and sparklyr
  • Build machine learning models such as linear regression, decision trees, and random forests
  • Perform hyperparameter tuning at scale
  • Track your experiments with MLflow
  • Optimize Spark code and change configurations
  • Store and access data in Delta Lake

Audience

  • Data Analyst
  • Data Scientist

Prerequisites

  • Intermediate experience using R
  • Working knowledge of machine learning and data science 

Additional Notes

  • A suitable web-based programming environment will be provided to students
  • This class is taught in R only

Outline

  • Get familiar with the Databricks environment
  • Create a Spark DataFrame; analyze the Spark UI; cache data and change Spark default configurations to speed up queries
  • Use single-node R in a Databricks notebook to read in a CSV file; select columns; visualize relationships using ggplot
  • Use SparkR to read in a CSV file in a distributed manner; select columns; visualize relationships using ggplot (see sketch 1 below)
  • Use sparklyr to read in a CSV file in a distributed manner; select columns; visualize relationships using ggplot (see sketch 2 below)
  • Learn to create temporary views of tables for access between sparklyr, SparkR, and SQL cells; write to and read from DBFS (see sketch 3 below)
  • Set up RStudio integration with Databricks
  • Use built-in SparkR functions; handle null/missing values (see sketch 4 below)
  • Examine the Spark UI to better understand what is happening under the hood; change Spark configurations to speed up Spark jobs
  • Learn how to apply R functions in parallel; train a regression model on the Airbnb dataset; apply the regression model to the test dataset in parallel (see sketch 5 below)
  • Use the SparkR API to build a linear regression model; evaluate the model on test data (see sketch 6 below)
  • Identify the differences between single-node and distributed decision tree implementations; extract feature importances; examine common pitfalls of decision trees
  • Use the Pipeline API; tune hyperparameters using grid search; optimize a Spark ML pipeline (see sketch 7 below)
  • Install MLflow for R on a Databricks cluster; train a model and log metrics, parameters, figures, and models; view the results of training in the MLflow tracking UI (see sketch 8 below)
  • Use Delta Lake to store versions of your table; use MLflow on a specific version of a Delta Lake table (see sketch 9 below)
  • Cache data to speed up query performance; discuss when and where to cache data (see sketch 10 below)
  • Compare CSV with Parquet in terms of storage and query performance; analyze the impact of different file compression formats on the parallelization of tasks
  • Partition your data for increased query performance; minimize the small file problem (see sketch 11 below)
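
Code Sketches

The numbered sketches below illustrate techniques referenced in the outline. They are minimal examples under stated assumptions, not course material: dataset paths, table names (listings, train_tbl), and column names (price, bedrooms, bathrooms, city) are placeholders, and the API calls assume recent releases of Spark, SparkR, sparklyr, and mlflow.

Sketch 1 — reading a CSV file as a distributed SparkR DataFrame, selecting columns, and plotting a collected result with ggplot2:

    library(SparkR)
    sparkR.session()

    # Read the CSV as a distributed Spark DataFrame (path is a placeholder)
    df <- read.df("/path/to/listings.csv", source = "csv",
                  header = "true", inferSchema = "true")

    # Column selection runs on the cluster
    prices <- select(df, "bedrooms", "price")

    # collect() pulls the reduced result to the driver as an R data.frame
    local_df <- collect(prices)

    library(ggplot2)
    ggplot(local_df, aes(x = bedrooms, y = price)) + geom_point()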
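
Sketch 2 — the same read/select/plot flow with sparklyr and dplyr verbs:

    library(sparklyr)
    library(dplyr)

    # On Databricks, this attaches to the cluster's Spark session
    sc <- spark_connect(method = "databricks")

    # Read the CSV in a distributed manner (path is a placeholder)
    listings <- spark_read_csv(sc, name = "listings",
                               path = "/path/to/listings.csv")

    # dplyr verbs are translated to Spark SQL and executed on the cluster;
    # collect() brings the reduced result into local R for plotting
    local_df <- listings %>%
      select(bedrooms, price) %>%
      collect()

    library(ggplot2)
    ggplot(local_df, aes(x = bedrooms, y = price)) + geom_point()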
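
Sketch 3 — sharing data between sparklyr, SparkR, and SQL cells through temporary views (view names are placeholders):

    # sparklyr: register a Spark DataFrame as a temporary view
    sdf_register(listings, "listings_view")

    # SparkR: query the same view through SQL
    listings_sr <- SparkR::sql("SELECT * FROM listings_view")

    # SparkR DataFrames can be exposed the same way
    SparkR::createOrReplaceTempView(listings_sr, "listings_view2")

    # A %sql notebook cell can now query either view, e.g.:
    #   SELECT COUNT(*) FROM listings_view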
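
Sketch 4 — built-in SparkR column functions and null handling:

    library(SparkR)

    # Built-in column functions run on the cluster, not in local R
    df <- withColumn(df, "log_price", log(df$price))

    # Drop rows with nulls in the listed columns
    clean_df <- dropna(df, cols = c("bedrooms", "price"))

    # Or impute: replace nulls with a fixed value
    filled_df <- fillna(df, value = 0, cols = "bedrooms")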
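
Sketch 5 — applying an R function to every partition in parallel with sparklyr's spark_apply(); one common pattern, sketched here, is training a single-node model on a sample and scoring the full dataset in parallel:

    library(sparklyr)
    library(dplyr)

    # Train an ordinary R model on a small local sample
    local_sample <- listings %>% sdf_sample(fraction = 0.1) %>% collect()
    model <- lm(price ~ bedrooms, data = local_sample)

    # spark_apply() runs the function against each partition as an
    # R data.frame; the model ships to the workers via `context`
    scored <- spark_apply(
      listings,
      function(partition_df, model) {
        partition_df$prediction <- predict(model, newdata = partition_df)
        partition_df
      },
      context = model
    )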
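
Sketch 6 — a distributed linear regression with the SparkR formula interface:

    library(SparkR)

    # 80/20 train/test split
    splits <- randomSplit(df, weights = c(0.8, 0.2), seed = 42)
    train <- splits[[1]]
    test  <- splits[[2]]

    # Fit a distributed generalized linear model
    model <- spark.glm(train, price ~ bedrooms + bathrooms,
                       family = "gaussian")
    summary(model)

    # Inference on the held-out set also runs on the cluster
    preds <- predict(model, test)
    head(select(preds, "price", "prediction"))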
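
Sketch 7 — a sparklyr ML Pipeline tuned by grid search with cross-validation (train_tbl is a placeholder Spark table of training data):

    library(sparklyr)

    # Pipeline: formula transformer followed by a random forest
    pipeline <- ml_pipeline(sc) %>%
      ft_r_formula(price ~ bedrooms + bathrooms) %>%
      ml_random_forest_regressor()

    # Hyperparameter grid, keyed by pipeline stage name
    grid <- list(
      random_forest_regressor = list(
        num_trees = c(20, 50),
        max_depth = c(5, 10)
      )
    )

    # 3-fold cross-validation over the grid; the evaluator defaults
    # (label/prediction columns) match the ft_r_formula output
    cv <- ml_cross_validator(
      sc,
      estimator = pipeline,
      estimator_param_maps = grid,
      evaluator = ml_regression_evaluator(sc),
      num_folds = 3
    )

    cv_model <- ml_fit(cv, train_tbl)
    ml_validation_metrics(cv_model)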
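
Sketch 8 — logging parameters, metrics, and figures with the mlflow R package (values shown are placeholders):

    library(mlflow)   # must be installed on the cluster

    with(mlflow_start_run(), {
      mlflow_log_param("num_trees", 50)

      # ... train and evaluate a model here ...
      rmse <- 123.4   # placeholder metric

      mlflow_log_metric("rmse", rmse)
      mlflow_log_artifact("residual_plot.png")   # a previously saved figure
    })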
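
Sketch 9 — Delta Lake table versioning from SparkR; this assumes Delta's versionAsOf reader option passes through read.df() as a named option (paths are placeholders):

    library(SparkR)

    # Write the table in Delta format
    write.df(df, path = "/delta/listings", source = "delta",
             mode = "overwrite")

    # Read the current version
    current <- read.df("/delta/listings", source = "delta")

    # Time travel: read an earlier version by number
    v0 <- read.df("/delta/listings", source = "delta", versionAsOf = "0")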
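
Sketch 10 — caching a DataFrame to speed up repeated queries:

    library(SparkR)

    df <- cache(df)   # marks the DataFrame; caching is lazy
    count(df)         # the first action materializes the cache

    # Subsequent queries read from memory instead of re-scanning the source
    head(summarize(groupBy(df, "bedrooms"), avg_price = avg(df$price)))

    unpersist(df)     # release the cached blocks when finished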
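
Sketch 11 — writing Parquet and partitioned Parquet from SparkR (the partitionBy argument to write.df() assumes Spark 2.3 or later; "city" is a placeholder column):

    library(SparkR)

    # Columnar Parquet typically stores smaller and scans faster than CSV
    write.df(df, path = "/tmp/listings_parquet", source = "parquet",
             mode = "overwrite")

    # Partition on a low-cardinality column so filters on it can skip
    # whole directories; high-cardinality partition columns create many
    # tiny files (the "small file problem")
    write.df(df, path = "/tmp/listings_by_city", source = "parquet",
             mode = "overwrite", partitionBy = "city")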
