ETL Part 1: Data Extraction

Summary

In this course, data engineers access data where it lives and then apply data extraction best practices, including schemas, corrupt record handling, and parallelized code. By the end of this course, you will be able to extract data from multiple sources, use schema inference, apply user-defined schemas, and navigate the Azure Databricks and Apache Spark™ documentation to source solutions.

Description

Length: 3-6 hours, 75% hands-on

Format: Self-paced

The course is a series of seven self-paced lessons available in both Scala and Python. Each lesson includes hands-on exercises.

The course contains Databricks notebooks for both Azure Databricks and AWS Databricks; you can run the course on either platform.

During this course, learners (illustrative code sketches for several of these objectives follow the list):

  • Write a basic ETL pipeline using the Spark design pattern
  • Ingest data using DBFS mounts to Azure Blob Storage and S3
  • Ingest data using serial and parallel JDBC reads
  • Define and apply a user-defined schema to semi-structured JSON data
  • Handle corrupt records
  • Productionize an ETL pipeline
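
The sketches below are illustrative only: the storage account, container, secret scope, paths, column names, and table names are placeholders, and the code assumes a Databricks notebook where spark and dbutils are already defined. For example, mounting an Azure Blob Storage container through DBFS and reading from the mount might look like this in Python:

    # Mount an Azure Blob Storage container to DBFS (placeholder names throughout).
    dbutils.fs.mount(
        source="wasbs://my-container@mystorageaccount.blob.core.windows.net",
        mount_point="/mnt/training-data",
        extra_configs={
            "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
                dbutils.secrets.get(scope="my-scope", key="storage-account-key")
        }
    )

    # Once mounted, the container is readable like any other DBFS path.
    raw_df = spark.read.csv("/mnt/training-data/raw/", header=True)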
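
A serial JDBC read pulls the whole table over a single connection, while a parallel read partitions the query on a numeric column so multiple executors fetch data concurrently. A sketch, assuming a hypothetical PostgreSQL database with an accounts table and a numeric account_id column:

    jdbc_url = "jdbc:postgresql://db-host:5432/training"
    props = {
        "user": "readonly_user",
        "password": dbutils.secrets.get(scope="my-scope", key="db-password"),
        "driver": "org.postgresql.Driver",
    }

    # Serial read: one connection fetches the entire table.
    serial_df = spark.read.jdbc(url=jdbc_url, table="accounts", properties=props)

    # Parallel read: Spark issues one query per partition over the bounded column.
    parallel_df = spark.read.jdbc(
        url=jdbc_url,
        table="accounts",
        column="account_id",
        lowerBound=1,
        upperBound=1000000,
        numPartitions=8,
        properties=props,
    )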
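
Applying a user-defined schema skips schema inference, and adding a corrupt-record column keeps rows that fail to parse instead of dropping them. A sketch, assuming hypothetical field names and a JSON file on the mount above:

    from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

    user_schema = StructType([
        StructField("user_id", LongType(), True),
        StructField("event", StringType(), True),
        StructField("timestamp", TimestampType(), True),
        StructField("_corrupt_record", StringType(), True),  # holds unparseable lines
    ])

    events_df = (spark.read
        .schema(user_schema)                             # no inference pass over the data
        .option("mode", "PERMISSIVE")                    # keep bad rows rather than failing
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .json("/mnt/training-data/events.json")
        .cache())                                        # cache so the corrupt column can be queried

    bad_records = events_df.filter("_corrupt_record IS NOT NULL")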
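
The load step then writes the cleaned result to durable storage; a minimal sketch, assuming a hypothetical output path under the same mount:

    (events_df
        .filter("_corrupt_record IS NULL")   # keep only rows that parsed cleanly
        .drop("_corrupt_record")
        .write
        .mode("overwrite")
        .partitionBy("event")
        .parquet("/mnt/training-data/events_clean"))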

Lessons

  1. Course Overview and Setup
  2. ETL Process Overview
  3. Connecting to Azure Blob Storage and S3
  4. Connecting to JDBC
  5. Applying Schemas to JSON Data
  6. Corrupt Record Handling
  7. Loading Data and Productionalizing

Target Audience

  • Primary Audience: Data Engineers

Prerequisites

  • Getting Started with Apache Spark™ SQL

Lab Requirements

Duration: 6 hours