SP820: ETL Part 1: Data Extraction (AWS Databricks)
In this course, data engineers access data where it lives and then apply data extraction best practices, including working with schemas, handling corrupt records, and parallelizing code. By the end of this course, you will be able to extract data from multiple sources, use schema inference, apply user-defined schemas, and navigate the Databricks and Apache Spark™ documentation to source solutions.
NOTE: This course is specific to the Databricks Unified Analytics Platform (based on Apache Spark™). While you might find it helpful for learning how to use Apache Spark in other environments, it does not teach you how to use Apache Spark in those environments.
3-6 hours, 75% hands-on
The course is a series of seven self-paced lessons, available in both Scala and Python, and each lesson includes hands-on exercises. A final capstone project involves writing an end-to-end ETL job that loads semi-structured JSON data into a relational model.
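To give a flavor of what the capstone asks for, here is a minimal sketch of flattening nested JSON into a relational shape with PySpark. The file path, field names, and record structure are illustrative assumptions, not the actual capstone dataset; `spark` is the SparkSession a Databricks notebook provides.

```python
from pyspark.sql.functions import col, explode

# Hypothetical input: each record carries an id, a nested "user" struct,
# and an array of "events" (names are illustrative, not the capstone data).
raw_df = spark.read.json("/mnt/training/events.json")

# Promote struct fields to top-level columns and explode the array so each
# event becomes its own row -- i.e., a relational shape.
flat_df = (raw_df
    .select(
        col("id"),
        col("user.name").alias("user_name"),
        explode(col("events")).alias("event"))
    .select("id", "user_name", "event.type", "event.timestamp"))
```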
This version of the course is intended to be run on AWS Databricks.
Note: Access to a Databricks workspace is not included in the course purchase price; you are responsible for obtaining access yourself. See the FAQ for instructions on how to get access to a Databricks workspace.
During this course, learners:
- Write a basic ETL pipeline using the Spark design pattern
- Ingest data using DBFS mounts in Azure Blob Storage and S3 (see the mount sketch after this list)
- Ingest data using serial and parallel JDBC reads (also sketched after this list)
- Define and apply a user-defined schema to semi-structured JSON data
- Handle corrupt records (the final sketch after this list shows schemas and corrupt-record handling together)
- Productionize an ETL pipeline
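As a rough illustration of the mount-based ingestion pattern, the sketch below mounts an S3 bucket through DBFS in Python. The secret scope, key names, bucket, and mount point are all hypothetical; the course's own notebooks may set this up differently.

```python
from urllib.parse import quote

# Hypothetical credentials pulled from a secret scope -- never hard-code keys.
access_key = dbutils.secrets.get(scope="aws", key="access_key")
secret_key = dbutils.secrets.get(scope="aws", key="secret_key")
encoded_secret = quote(secret_key, safe="")   # escape any "/" in the key
bucket = "my-example-bucket"                  # hypothetical bucket name

# Mount the bucket so it appears under /mnt like local storage.
dbutils.fs.mount(
    source=f"s3a://{access_key}:{encoded_secret}@{bucket}",
    mount_point="/mnt/example")

# Once mounted, reads go through ordinary DBFS paths.
df = spark.read.csv("/mnt/example/data.csv", header=True)
```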
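The serial-versus-parallel JDBC distinction can be shown with `spark.read.jdbc`. The connection URL, table name, credentials, and bounds below are made-up placeholders, not course data.

```python
jdbc_url = "jdbc:postgresql://server.example.com:5432/training"  # hypothetical host
props = {"user": "readonly", "password": "readonly",
         "driver": "org.postgresql.Driver"}

# Serial read: a single task pulls the whole table over one connection.
serial_df = spark.read.jdbc(url=jdbc_url, table="people", properties=props)

# Parallel read: partition on a numeric column so Spark opens several
# concurrent connections, each fetching one slice of the id range.
parallel_df = spark.read.jdbc(
    url=jdbc_url,
    table="people",
    column="id",        # partitioning column
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8,
    properties=props)
```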
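User-defined schemas and corrupt-record handling come together in a read like the following sketch. The schema, field names, and path are assumptions for illustration; declaring a schema up front also avoids a schema-inference pass over the data.

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema for a semi-structured JSON source.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # catches bad rows
])

df = (spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")  # keep bad rows; alternatives: DROPMALFORMED, FAILFAST
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("/mnt/example/records.json"))

# Unparseable lines land in _corrupt_record with the other fields null.
df.cache()  # Spark 2.3+ disallows queries touching only the corrupt column on an uncached read
bad_rows = df.filter(df["_corrupt_record"].isNotNull())
```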
Lessons:
- Course Overview and Setup
- ETL Process Overview
- Connecting to Azure Blob Storage and S3
- Connecting to JDBC
- Applying Schemas to JSON Data
- Corrupt Record Handling
- Loading Data and Productionalizing
- Capstone Project: Parsing Nested Data
- Primary Audience: Data Engineers
- Prerequisite: Getting Started with Apache Spark™ SQL (AWS Databricks) (optional, but recommended)
- Please be sure to use a supported browser.