ETL Part 2: Data Transformation and Loads (with capstone)
In this course, data engineers apply best practices for transforming and writing data, such as user-defined functions, join optimizations, and parallel database writes. By the end of this course, you will transform complex data with custom functions, load it into a target database, and navigate the Databricks and Spark documentation to source solutions.
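As a rough illustration of the kind of user-defined function mentioned above, the sketch below takes two column values and returns a struct. It is not taken from the course notebooks: the names `summarize_name` and `apply_with_spark`, the column names, and the sample row are all hypothetical. The transformation logic is kept as a plain Python function so it can be checked without a cluster; the Spark wrapper assumes `pyspark` is installed.

```python
# Hypothetical sketch: function, column, and field names are illustrative only.

def summarize_name(first, last):
    """Plain Python logic: combine two column values into a (full_name, length) pair."""
    if first is None or last is None:
        return None  # propagate nulls rather than failing on them
    full = f"{first.strip()} {last.strip()}"
    return (full, len(full))


def apply_with_spark():
    """Wrap summarize_name as a Spark UDF that returns a struct (assumes pyspark is available)."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()

    # Return type of the UDF: a struct with two named fields.
    result_schema = StructType([
        StructField("full_name", StringType()),
        StructField("length", IntegerType()),
    ])
    summarize_udf = udf(summarize_name, result_schema)

    # Small illustrative DataFrame with an explicit schema.
    df = spark.createDataFrame([("Ada", "Lovelace")], "first string, last string")
    return df.withColumn("summary", summarize_udf(col("first"), col("last")))
```

In a notebook, `apply_with_spark().show()` would display the added `summary` column. Keeping the logic in an ordinary function is a design choice made here so that it can be tested separately from Spark.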
NOTE: This course is specific to the Databricks Unified Analytics Platform (based on Apache Spark™). While you might find it helpful for learning how to use Apache Spark in other environments, it does not teach you how to use Apache Spark in those environments.
You will see the following warning:
WARNING: This notebook was tested on DBR 6.2 but we found DBR 7.0. Using an untested DBR may yield unexpected results and/or various errors. Please update your cluster configuration and/or download a newer version of this course before proceeding.
A new version of the courseware is not currently available, but we encourage you to continue with the course on a newer runtime.
3-6 hours, 75% hands-on
The course is a series of seven self-paced lessons available in both Scala and Python. A final capstone project involves writing custom, generalizable transformation logic to populate data warehouse summary tables and efficiently writing the tables to a database. Each lesson includes hands-on exercises.
This course is intended to be run in a Databricks workspace. The course contains Databricks notebooks for both Azure Databricks and AWS Databricks; you can run the course on either platform.
Note: Access to a Databricks workspace is not part of your course purchase price. You are responsible for getting access to Databricks. See the FAQ for instructions on how to get access to a Databricks workspace.
During this course, learners:
- Apply built-in functions to manipulate data
- Write UDFs with a single DataFrame column input
- Apply UDFs that take multiple DataFrame column inputs and return complex types
- Employ table join best practices relevant to big data environments
- Repartition DataFrames to optimize table inserts
- Write to managed and unmanaged tables
- Course Overview and Setup
- Common Transformations
- User Defined Functions
- Advanced UDFs
- Joins and Lookup Tables
- Database Writes
- Table Management
- Capstone Project: Custom Transformations, Aggregating and Loading
- Primary Audience: Data Engineers
- ETL Part 1: Data Extraction (strongly encouraged)
- Please be sure to use a supported browser.