ETL Part 3: Production

Summary

In this course, data engineers optimize and automate Extract, Transform, Load (ETL) workloads using stream processing, job recovery strategies, and automation tooling such as the REST API. By the end of the course you will be able to schedule robust, highly optimized ETL jobs and debug problems along the way.

Description

Length

3-6 hours, 75% hands-on

Format

Self-paced

The course is a series of six self-paced lessons available in both Scala and Python. Each lesson includes hands-on exercises.

The course contains Databricks notebooks for both Azure Databricks and AWS Databricks; you can run the course on either platform. Note: The notebooks will not run on Databricks Community Edition.

Learning Objectives

During this course, learners will (short illustrative sketches follow this list):

  • Perform an ETL job on a streaming data source
  • Parameterize a code base and manage task dependencies
  • Submit and monitor jobs using the REST API or Command Line Interface
  • Design and implement a job failure recovery strategy using the principle of idempotence
  • Optimize ETL queries using compression and caching best practices with optimal hardware choices

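The sketches that follow are illustrative only, written in Python (the course itself offers both Scala and Python); all paths, names, tokens, and IDs are placeholders rather than course code. First, a minimal streaming ETL job of the kind the first objective describes, assuming the spark session that Databricks notebooks provide and a hypothetical JSON source:

    from pyspark.sql.functions import col
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Streams need an explicit schema; inferring one would require scanning the source.
    schema = StructType([
        StructField("device_id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("status", StringType()),
    ])

    raw = spark.readStream.schema(schema).json("/mnt/raw/events/")  # hypothetical source

    cleaned = raw.filter(col("status").isNotNull())  # a trivial transform step

    query = (cleaned.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/events/")  # enables restart and recovery
             .start("/mnt/etl/events/"))  # hypothetical sink
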
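Next, parameterizing a notebook and invoking a dependent one, per the second objective; dbutils is available inside Databricks notebooks, and the notebook path and parameter names here are hypothetical:

    # Declare a widget with a default value, then read it at run time.
    dbutils.widgets.text("input_path", "/mnt/raw/events/")
    input_path = dbutils.widgets.get("input_path")

    # Run a downstream notebook with a timeout (in seconds) and parameters;
    # the call blocks and returns the child notebook's exit value as a string.
    result = dbutils.notebook.run("./Transform-Events", 3600, {"input_path": input_path})
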
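Submitting and monitoring a job through the Databricks Jobs REST API, per the third objective; the workspace URL, access token, and job ID below are placeholders:

    import time
    import requests

    HOST = "https://<workspace>.cloud.databricks.com"              # placeholder
    HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

    # Trigger a run of an existing job (Jobs API 2.1).
    resp = requests.post(f"{HOST}/api/2.1/jobs/run-now",
                         headers=HEADERS, json={"job_id": 123})
    run_id = resp.json()["run_id"]

    # Poll until the run reaches a terminal state.
    while True:
        run = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                           headers=HEADERS, params={"run_id": run_id}).json()
        if run["state"]["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            print(run["state"].get("result_state", "no result state"))
            break
        time.sleep(30)
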
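One common idempotent recovery pattern, in the spirit of the fourth objective, is to merge each batch by key so that replaying a failed run cannot create duplicates; this sketch assumes a Delta target table and a hypothetical unique event_id column:

    from delta.tables import DeltaTable

    updates = spark.read.parquet("/mnt/staging/batch-0042/")   # the batch being (re)played

    target = DeltaTable.forPath(spark, "/mnt/etl/events/")     # existing Delta table
    (target.alias("t")
     .merge(updates.alias("u"), "t.event_id = u.event_id")     # match on a unique key
     .whenMatchedUpdateAll()    # a replayed row overwrites itself instead of duplicating
     .whenNotMatchedInsertAll()
     .execute())
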
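Finally, a small caching-and-compression sketch related to the fifth objective (hardware selection is covered in the course itself); paths are placeholders:

    df = spark.read.json("/mnt/raw/events/")   # hypothetical input

    df.cache()     # keep the parsed data in memory across repeated queries
    df.count()     # an action, to materialize the cache

    (df.write
     .mode("overwrite")
     .option("compression", "snappy")          # a common fast compression codec
     .parquet("/mnt/etl/events-compressed/"))
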
Lessons

  1. Course Overview and Setup
  2. Streaming ETL
  3. Runnable Notebooks
  4. Scheduling Jobs
  5. Job Failure
  6. ETL Optimizations

Target Audience

  • Primary Audience: Data Engineers

Prerequisites

  • ETL Part 1 (strongly encouraged)
  • ETL Part 2 (strongly encouraged)

Lab Requirements

Duration

8 hours