ETL Part 1: Data Extraction (with capstone)


In this course, data engineers access data where it lives and then apply data extraction best practices, including schemas, corrupt record handling, and parallelized code. By the end of this course, you will extract data from multiple sources, use schema inference, apply user-defined schemas, and navigate Azure Databricks and Apache Spark™ documentation to source solutions.

NOTE: This course is specific to the Databricks Unified Analytics Platform (based on Apache Spark™). While you might find it helpful for learning how to use Apache Spark in other environments, it does not teach you how to use Apache Spark in those environments.



You will see the following warning:

WARNING: This notebook was tested on DBR 6.2 but we found DBR 7.0. Using an untested DBR may yield unexpected results and/or various errors. Please update your cluster configuration and/or download a newer version of this course before proceeding.

A new version of the courseware is not currently available, but we encourage you to continue enjoying the course with a newer runtime.


Duration: 3-6 hours, 75% hands-on

Format: Self-paced

The course is a series of seven self-paced lessons available in both Scala and Python. A final capstone project involves writing an end-to-end ETL job that loads semi-structured JSON data into a relational model. Each lesson includes hands-on exercises.


This course is intended to be run in a Databricks workspace. The course contains Databricks notebooks for both Azure Databricks and AWS Databricks; you can run the course on either platform.

Note: Access to a Databricks workspace is not included in your course purchase price. You are responsible for getting access to Databricks. See the FAQ for instructions on how to get access to a Databricks workspace.

During this course, learners:

  • Write a basic ETL pipeline using the Spark design pattern
  • Ingest data using DBFS mounts in Azure Blob Storage and S3
  • Ingest data using serial and parallel JDBC reads
  • Define and apply a user-defined schema to semi-structured JSON data
  • Handle corrupt records
  • Productionize an ETL pipeline
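
The serial vs. parallel JDBC distinction above comes down to partitioning options: with a numeric, date, or timestamp partition column plus bounds, Spark splits the read across executors. A minimal sketch of the options involved; the connection URL, table, and bounds are hypothetical.

```python
# Hypothetical JDBC connection settings; values are assumptions for illustration.
jdbc_options = {
    "url": "jdbc:postgresql://db.example.com:5432/shop",
    "dbtable": "orders",
    "partitionColumn": "order_id",  # must be numeric, date, or timestamp
    "lowerBound": "1",              # lowest partition boundary
    "upperBound": "1000000",        # highest partition boundary
    "numPartitions": "8",           # concurrent read tasks
}

# With a live SparkSession, the parallel read would look like:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
# Omitting the partitioning keys yields a serial, single-task read.
```

Spark divides the range between lowerBound and upperBound into numPartitions slices, issuing one query per slice.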


Lessons

  1. Course Overview and Setup
  2. ETL Process Overview
  3. Connecting to Azure Blob Storage and S3
  4. Connecting to JDBC
  5. Applying Schemas to JSON Data
  6. Corrupt Record Handling
  7. Loading Data and Productionalizing
  8. Capstone Project: Parsing Nested Data
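
Lesson 6 covers corrupt record handling: in Spark's default PERMISSIVE JSON mode, malformed lines are captured in a `_corrupt_record` column instead of failing the read. The pure-Python sketch below mimics that behavior conceptually; it is not the Spark API, and the function name is an assumption.

```python
import json

def parse_permissive(lines, fields):
    """Conceptual sketch of Spark's PERMISSIVE mode: malformed JSON
    lines land in a _corrupt_record column, valid fields stay null."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            row = {f: record.get(f) for f in fields}
            row["_corrupt_record"] = None
        except json.JSONDecodeError:
            row = {f: None for f in fields}
            row["_corrupt_record"] = line  # keep the raw bad line
        rows.append(row)
    return rows

rows = parse_permissive(['{"id": 1}', '{bad json'], ["id"])
```

In Spark itself the equivalent knobs are `.option("mode", "PERMISSIVE")` and `.option("columnNameOfCorruptRecord", "_corrupt_record")` on the JSON reader.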

Target Audience

  • Primary Audience: Data Engineers


Prerequisites

There are no prerequisites for this course.

Lab Requirements


6 hours