ETL Part 2: Transformations and Loads

Summary

In this course, data engineers apply data transformation and load best practices, such as user-defined functions, join optimizations, and parallel database writes. By the end of this course, you will be able to transform complex data with custom functions, load it into a target database, and navigate the Databricks and Spark documentation to source solutions.

Description

Length: 3-6 hours, 75% hands-on

Format: Self-paced

The course is a series of seven self-paced lessons available in both Scala and Python. Each lesson includes hands-on exercises.

The course contains Databricks notebooks for both Azure Databricks and AWS Databricks; you can run the course on either platform.

Learning Objectives

During this course, learners will:

  • Apply built-in functions to manipulate data
  • Write UDFs that take a single DataFrame column as input
  • Apply UDFs that take multiple DataFrame columns as input and return complex types
  • Employ table join best practices relevant to big data environments
  • Repartition DataFrames to optimize table inserts
  • Write to managed and unmanaged tables (see the sketch after this list)
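
For a concrete preview of these techniques, the short PySpark sketch below combines a built-in function with a single-column UDF, defines a multi-column UDF that returns a complex (struct) type, and repartitions the result before writing it to a managed table. It assumes a running SparkSession (provided automatically in Databricks notebooks); the column, function, and table names are illustrative only and do not come from the course notebooks.

  # Illustrative sketch only; names and data are hypothetical, not from the course.
  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F
  from pyspark.sql.types import DoubleType, StringType, StructField, StructType

  spark = SparkSession.builder.appName("etl-part2-sketch").getOrCreate()

  df = spark.createDataFrame(
      [("sensor-1", 72.5, "F"), ("sensor-2", 21.0, "C")],
      ["device", "reading", "unit"],
  )

  # Single-column UDF: convert a Fahrenheit reading to Celsius.
  @F.udf(returnType=DoubleType())
  def fahrenheit_to_celsius(temp):
      return (temp - 32) * 5.0 / 9.0

  # Combine the UDF with a built-in function so only Fahrenheit rows are converted.
  with_celsius = df.withColumn(
      "reading_c",
      F.when(F.col("unit") == "F", fahrenheit_to_celsius("reading"))
       .otherwise(F.col("reading")),
  )

  # Multi-column UDF that returns a complex (struct) type.
  reading_schema = StructType([
      StructField("value", DoubleType()),
      StructField("unit", StringType()),
  ])

  @F.udf(returnType=reading_schema)
  def normalize_reading(reading, unit):
      value = (reading - 32) * 5.0 / 9.0 if unit == "F" else reading
      return (value, "C")

  normalized = with_celsius.withColumn("normalized", normalize_reading("reading", "unit"))

  # Repartition to control the number of parallel write tasks, then
  # save the result as a managed table.
  normalized.repartition(8).write.mode("overwrite").saveAsTable("normalized_readings")

In a sketch like this, the number of partitions set before the write determines how many tasks write output in parallel, which is the lever behind the parallel database writes mentioned in the summary.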

Lessons

  1. Course Overview and Setup
  2. Common Transformations
  3. User Defined Functions
  4. Advanced UDFs
  5. Joins and Lookup Tables
  6. Database Writes
  7. Table Management

Target Audience

  • Primary Audience: Data Engineers

Prerequisites

  • ETL Part 1 (strongly encouraged)

Lab Requirements

Duration: 6 hours