ETL Part 3: Production (with capstone)


In this course, data engineers optimize and automate Extract, Transform, Load (ETL) workloads using stream processing, job-recovery strategies, and automation techniques such as REST API integration. By the end of the course, you will be able to schedule highly optimized and robust ETL jobs, debugging problems along the way.

NOTE: This course is specific to the Databricks Unified Analytics Platform (based on Apache Spark™). While you might find it helpful for learning how to use Apache Spark in other environments, it does not teach you how to use Apache Spark in those environments.



You will see the following warning:

WARNING: This notebook was tested on DBR 6.2 but we found DBR 7.0. Using an untested DBR may yield unexpected results and/or various errors. Please update your cluster configuration and/or download a newer version of this course before proceeding.

A new version of the courseware is not currently available, but we encourage you to continue the course on a newer runtime.


Duration: 3-6 hours (75% hands-on)

Format: Self-paced

The course is a series of six self-paced lessons available in both Scala and Python. A final capstone project involves refactoring a batch ETL job into a streaming pipeline. In the process, students run the workload as a job and monitor it. Each lesson includes hands-on exercises.


This course is intended to be run in a Databricks workspace. The course contains Databricks notebooks for both Azure Databricks and AWS Databricks; you can run the course on either platform.

Note: Access to a Databricks workspace is not included in the course purchase price. You are responsible for getting access to Databricks. See the FAQ for instructions on how to get access to a Databricks workspace.

Note: This course will not run on Databricks Community Edition.

Learning Objectives

During this course, learners:

  • Perform an ETL job on a streaming data source
  • Parameterize a code base and manage task dependencies
  • Submit and monitor jobs using the REST API or Command Line Interface
  • Design and implement a job failure recovery strategy using the principle of idempotence
  • Optimize ETL queries using compression and caching best practices with optimal hardware choices
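
The idempotence objective above can be illustrated with a minimal Python sketch: a hypothetical load step that writes a per-batch success marker, so a retried job after a failure never double-writes. The file layout, marker naming, and `load_batch` function are illustrative assumptions, not the course's actual code.

```python
import json
import os
import tempfile

def load_batch(batch_id: str, records: list, out_dir: str) -> bool:
    """Idempotently load one batch: if a success marker for this
    batch already exists, skip the write so a retry is a no-op."""
    marker = os.path.join(out_dir, f"_SUCCESS_{batch_id}")
    if os.path.exists(marker):
        return False  # already loaded; safe to retry
    out_path = os.path.join(out_dir, f"{batch_id}.json")
    with open(out_path, "w") as f:
        json.dump(records, f)
    # Write the marker last, only after the data has landed,
    # so a crash mid-write leaves the batch eligible for retry.
    with open(marker, "w") as f:
        f.write("ok")
    return True

out_dir = tempfile.mkdtemp()
first = load_batch("batch-001", [{"id": 1}], out_dir)
second = load_batch("batch-001", [{"id": 1}], out_dir)  # simulated retry
print(first, second)  # True False
```

The same "marker written last" pattern generalizes to cloud storage and to Spark jobs: a recovery strategy is idempotent when re-running the failed job produces the same output as running it once.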


Course Outline

  1. Course Overview and Setup
  2. Streaming ETL
  3. Runnable Notebooks
  4. Scheduling Jobs
  5. Job Failure
  6. ETL Optimizations
  7. Capstone Project
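
As a taste of the job-scheduling lessons, submitting a one-off notebook run through the Databricks Jobs REST API (`runs/submit`) looks roughly like the sketch below. The workspace URL, token, notebook path, and cluster settings are placeholders you would replace with your own; the payload shape follows the public Jobs API 2.1, but treat this as an assumption-laden sketch rather than the course's exact solution.

```python
import json

# Placeholders -- substitute your own workspace URL and access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Payload for a one-time notebook run via the Jobs API 2.1
# POST /api/2.1/jobs/runs/submit endpoint.
payload = {
    "run_name": "etl-part3-demo",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Users/me@example.com/ETL-Job"},
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
}

print(json.dumps(payload, indent=2))

# Uncomment to actually submit (requires the `requests` package and
# a real workspace; the returned run_id is what you then poll to
# monitor the job):
# import requests
# resp = requests.post(
#     f"{HOST}/api/2.1/jobs/runs/submit",
#     headers={"Authorization": f"Bearer {TOKEN}"},
#     json=payload,
# )
# run_id = resp.json()["run_id"]
```

The `databricks` Command Line Interface wraps the same API, so the course's REST and CLI exercises exercise one underlying jobs service.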

Target Audience

  • Primary Audience: Data Engineers


Lab Requirements


6 hours