Data Engineering with Databricks

Summary

Leverage the power of the Databricks platform to implement the Lakehouse architecture.

Description

This course begins with an overview of data architecture concepts, an introduction to the Lakehouse paradigm, and an in-depth look at Delta Lake features and functionality. Participants will learn to apply software engineering principles with Databricks as they build end-to-end OLAP data pipelines using Delta Lake for batch and streaming data. The course explores considerations around normalization, change data capture, slowly changing dimensions, and regulatory compliance. It also discusses serving data to end users through aggregate tables and Redash. Throughout the course, emphasis is placed on using data engineering best practices with Databricks. Participants who wish to dive deeper into tuning and optimization can take Optimizing Apache Spark on Databricks.

Duration

2 Days

Objectives

Upon completion of this course, students should be able to:

  • Build an end-to-end batch and streaming OLAP data pipeline using the Databricks Workspace.
  • Make data available for consumption by downstream stakeholders using specified design patterns.
  • Apply Databricks’ recommended best practices in engineering a single source of truth Delta architecture.

Audience

Data Engineers and Machine Learning Engineers

Prerequisites

  • Intermediate to advanced programming skills in Python
  • Intermediate to advanced SQL skills
  • Beginning experience using the Spark DataFrames API
  • Beginning knowledge of general data engineering concepts
  • Beginning knowledge of the core features and use cases of Delta Lake

Additional Notes

All participants will need:

  • an internet connection
  • a device with a supported internet browser

NOTE: GoToTraining is our chosen online platform through which the class will be delivered. Prior to attendance, each registrant will receive GoToTraining log-in instructions.

Outline

     

    Day #1 AM: Welcome and Setup

    Duration  Module

    40 min    Introduction
              Introductions, course overview, case study description
    40 min    The Big Picture
              Review of the big data landscape, the role of Databricks, Delta Lake and the Lakehouse architecture
    10 min    Break
    40 min    Software Engineering
              The role of Databricks in the software engineering process, CI/CD considerations, designing for testability
    20 min    Lab: Configuration and Utilities
              Overview of the case study "plus" project structure
    10 min    Break
    40 min    Planning Your Data Pipeline - "Plus" Project
              Designing the data pipelines for the case study "plus" project
    20 min    Engineering a Data Pipeline
              Databricks File System (DBFS), streaming vs batch, bronze and silver table strategy for the "plus" project
    10 min    Wrap-up

    Total: 210 min (excluding breaks)

     

    Day #1 PM: Structured Streaming Pipeline

    Duration  Module

    15 min    Recap
    15 min    Lab: Ingesting Raw Data
              Preparing the raw data for the "plus" pipeline labs
    20 min    Lab: Raw to Bronze
              Landing the raw data plus metadata into the bronze table using Structured Streaming
    20 min    Delta Table Versioning
              Using Delta table history and versioning to explore the commits made to the table
    10 min    Break
    30 min    Lab: Bronze to Silver
              Parsing the bronze table raw data into a structured silver table using Structured Streaming
    40 min    Lab: Silver Update
              Cleaning dirty data in the silver table
    10 min    Break
    10 min    The Query Layer
              Gold table strategy, Redash overview
    30 min    Demo: Silver to Gold
              Aggregating data from the silver table to populate the gold table using Structured Streaming
    20 min    Lab: Silver to Gold
              "Hardening" the silver table to gold table process
    10 min    Wrap-up

    Total: 210 min (excluding breaks)

     

    Day #2 AM: Batch Pipeline

    Duration  Module

    15 min    Recap
    10 min    Planning Your Data Pipeline - "Classic" Project
              Designing the data pipelines for the case study "classic" project
    15 min    Lab: Configuration and Utilities
              Overview of the case study "classic" project structure
    15 min    Lab: Ingesting Raw Data
              Preparing the raw data for the "classic" pipeline labs
    10 min    Break
    20 min    Lab: Raw to Bronze
              Landing the raw data plus metadata into the bronze table using batch processing
    30 min    Lab: Bronze to Silver
              Parsing the "clean" bronze table raw data into a structured silver table using batch processing
    10 min    Break
    40 min    Lab: Silver Update
              Cleaning dirty data in the bronze table and appending it to the silver table using batch processing
    10 min    Demo: The Entire Pipeline
              Reviewing the "hardened" version of the entire end-to-end "classic" pipeline
    10 min    Break
    20 min    Schema Enforcement and Evolution
              Delta Lake support for schema enforcement and evolution
    15 min    Demo: Schema Enforcement
              Example of Delta Lake schema enforcement
    15 min    Demo: Schema Evolution
              Example of Delta Lake schema evolution
    10 min    Wrap-up

    Total: 215 min (excluding breaks)

     

    Day #2 PM: Structured Streaming

    Duration  Module

    20 min    Recap
    15 min    GDPR & CCPA Compliance
              Considering aspects of GDPR & CCPA compliance, compliance strategies, features of Delta Lake supporting compliance
    15 min    Demo: Compliance
              Example of deleting personal data, recovering deleted data
    10 min    Break
    30 min    Normalization
              Comparing normalized vs denormalized data models, star and snowflake schemas
    40 min    Slowly Changing Dimensions & Change Data Capture
              Types of SCDs, stream composability and its restrictions, CDC
    10 min    Break
    40 min    Delta Engine Optimizations
              Databricks Delta Engine optimizations for more efficient queries
    20 min    Demo: Delta Engine Optimizations
              Examples of Databricks Delta Engine optimizations
    10 min    Wrap-up

     

    Upcoming Classes

    Date            Time               Time Zone                                    Location                           Price
    Oct 21 - 22     8:00 AM - 5:00 PM  Australian Eastern Daylight Time (Victoria)  Online - Virtual - Australia       $ 1500.00 USD
    Oct 21 - 22     9:00 AM - 5:00 PM  Central European Summer Time                 Online - Virtual - EMEA            $ 1500.00 USD
    Nov 1 - 2       9:00 AM - 6:00 PM  Eastern Daylight Time                        Online - Virtual - US Eastern      $ 1500.00 USD
    Nov 25 - 26     9:00 AM - 5:00 PM  Central European Time                        Online - Virtual - EMEA            $ 1500.00 USD
    Nov 29 - 30     8:00 AM - 5:00 PM  Australian Eastern Daylight Time (Victoria)  Online - Virtual - Australia       $ 1500.00 USD
    Dec 2 - 3       9:00 AM - 6:00 PM  Pacific Standard Time                        Online - Virtual - US Pacific      $ 1500.00 USD
    Dec 9 - 10      9:00 AM - 6:00 PM  Eastern Standard Time                        Online - Virtual - US Eastern      $ 1500.00 USD
    Dec 13 - 14     9:00 AM - 5:00 PM  Central European Time                        Online - Virtual - EMEA            $ 1500.00 USD
    Jan 13 - 14     9:00 AM - 5:00 PM  Central European Time                        Online - Virtual - EMEA            $ 1500.00 USD
    Jan 24 - 25     9:00 AM - 6:00 PM  Pacific Standard Time                        Online - Virtual - US Pacific      $ 1500.00 USD
    Feb 14 - 15     9:00 AM - 6:00 PM  Eastern Standard Time                        Online - Virtual - US Eastern      $ 1500.00 USD
    Mar 7 - 8       8:00 AM - 5:00 PM  Australian Eastern Daylight Time (Victoria)  Online - Virtual - Australia       $ 1500.00 USD
    Mar 10 - 11     9:00 AM - 6:00 PM  Pacific Standard Time                        Online - Virtual - US Pacific      $ 1500.00 USD
    Mar 10 - 11     9:00 AM - 5:00 PM  Central European Time                        Online - Virtual - EMEA            $ 1500.00 USD
    Apr 4 - 5       9:00 AM - 6:00 PM  Eastern Daylight Time                        Online - Virtual - US Eastern      $ 1500.00 USD
    Apr 21 - 22     9:00 AM - 5:00 PM  Central European Summer Time                 Online - Virtual - EMEA            $ 1500.00 USD
    May 2 - 3       8:00 AM - 5:00 PM  Australian Eastern Standard Time (Victoria)  Online - Virtual - Australia AEST  $ 1500.00 USD
    May 5 - 6       9:00 AM - 6:00 PM  Pacific Daylight Time                        Online - Virtual - US Pacific      $ 1500.00 USD
    May 23 - 24     9:00 AM - 6:00 PM  Eastern Daylight Time                        Online - Virtual - US Eastern      $ 1500.00 USD
    May 26 - 27     9:00 AM - 5:00 PM  Central European Summer Time                 Online - Virtual - EMEA            $ 1500.00 USD
    Jun 23 - 24     9:00 AM - 6:00 PM  Pacific Daylight Time                        Online - Virtual - US Pacific      $ 1500.00 USD
    Jun 30 - Jul 1  9:00 AM - 5:00 PM  Central European Summer Time                 Online - Virtual - EMEA            $ 1500.00 USD

    Classes marked Full have no remaining seats, and no additional registrations are accepted. If you cannot find another class that suits your schedule, feel free to request a class and we will do our best to accommodate your needs.

