Data Engineering with Databricks

Summary

This 2-day course will teach you best practices for using Databricks to build data pipelines, through lectures and hands-on labs. At the end of the course, you will have all the knowledge and skills that a data engineer would need to build an end-to-end Delta Lake pipeline for streaming and batch data, from raw data ingestion to consumption by end users.

Description

This course begins with a review of programming with Spark APIs and an introduction to key terms and definitions of Databricks data engineering tools, followed by an overview of DB Connect, the Spark UI, and writing testable code. Participants will learn about the Cloud Data Platform in terms of data architecture concepts and will build an end-to-end OLAP data pipeline using Delta Lake with batch and streaming data, learning best practices throughout. Participants who wish to dive deeper into tuning and optimization can take the Advanced Data Engineering with Databricks course.

Duration

2 Days

Objectives

Upon completion of this course, students should be able to:

  • Build an end-to-end batch and streaming OLAP data pipeline
  • Make data available for consumption by downstream stakeholders using specified design patterns
  • Apply Databricks' recommended best practices in engineering a single-source-of-truth Delta architecture

Audience

Data Engineers and Machine Learning Engineers

Prerequisites

  • Intermediate to advanced programming skills in Python or Scala
  • Intermediate to advanced SQL skills
  • Beginning experience using the Spark DataFrames API
  • Beginning knowledge of general data engineering concepts
  • Beginning knowledge of the core features and use cases of Delta Lake

Additional Notes

All participants will need:

  • an internet connection
  • a device that is compliant with a supported internet browser
  • to confirm that their device can run GoToTraining

NOTE: GoToTraining is the online platform through which the class will be delivered. Prior to attendance, each registrant will receive GoToTraining log-in instructions.

Outline

Day #1 AM: Welcome and Setup

Duration  Modules
40 min    Introduction
          Introductions, course overview, case study description
40 min    The Big Picture
          Review of the big data landscape, the role of Databricks, Delta Lake and the Lakehouse architecture
10 min    Break
40 min    Software Engineering
          The role of Databricks in the software engineering process, CI/CD considerations, designing for testability
20 min    Lab: Configuration and Utilities
          Overview of the case study "plus" project structure
10 min    Break
40 min    Planning Your Data Pipeline - "Plus" Project
          Designing the data pipelines for the case study "plus" project
20 min    Engineering a Data Pipeline
          Databricks File System (DBFS), streaming vs batch, bronze and silver table strategy for the "plus" project
10 min    Wrap-up
210 min   Total (excluding breaks)

Day #1 PM: Structured Streaming Pipeline

Duration  Modules
15 min    Recap
15 min    Lab: Ingesting Raw Data
          Preparing the raw data for the "plus" pipeline labs
20 min    Lab: Raw to Bronze
          Landing the raw data plus metadata into the bronze table using Structured Streaming
20 min    Delta Table Versioning
          Using Delta table history and versioning to explore the commits made to the table
10 min    Break
30 min    Lab: Bronze to Silver
          Parsing the bronze table raw data into a structured silver table using Structured Streaming
40 min    Lab: Silver Update
          Cleaning dirty data in the silver table
10 min    Break
10 min    The Query Layer
          Gold table strategy, Redash overview
30 min    Demo: Silver to Gold
          Aggregating data from the silver table to populate the gold table using Structured Streaming
20 min    Lab: Silver to Gold
          "Hardening" the silver table to gold table process
10 min    Wrap-up
210 min   Total (excluding breaks)

Day #2 AM: Batch Pipeline

Duration  Modules
15 min    Recap
10 min    Planning Your Data Pipeline - "Classic" Project
          Designing the data pipelines for the case study "classic" project
15 min    Lab: Configuration and Utilities
          Overview of the case study "classic" project structure
15 min    Lab: Ingesting Raw Data
          Preparing the raw data for the "classic" pipeline labs
10 min    Break
20 min    Lab: Raw to Bronze
          Landing the raw data plus metadata into the bronze table using batch processing
30 min    Lab: Bronze to Silver
          Parsing the "clean" bronze table raw data into a structured silver table using batch processing
10 min    Break
40 min    Lab: Silver Update
          Cleaning dirty data in the bronze table and appending it to the silver table using batch processing
10 min    Demo: The Entire Pipeline
          Reviewing the "hardened" version of the entire end-to-end "classic" pipeline
10 min    Break
20 min    Schema Enforcement and Evolution
          Delta Lake support for schema enforcement and evolution
15 min    Demo: Schema Enforcement
          Example of Delta Lake schema enforcement
15 min    Demo: Schema Evolution
          Example of Delta Lake schema evolution
10 min    Wrap-up
215 min   Total (excluding breaks)

Day #2 PM: Structured Streaming

Duration  Modules
20 min    Recap
15 min    GDPR & CCPA Compliance
          Considering aspects of GDPR & CCPA compliance, compliance strategies, features of Delta Lake supporting compliance
15 min    Demo: Compliance
          Example of deleting personal data, recovering deleted data
10 min    Break
30 min    Normalization
          Comparing normalized vs denormalized data models, star and snowflake schemas
40 min    Slowly Changing Dimensions & Change Data Capture
          Types of SCDs, stream composability and its restrictions, CDC
10 min    Break
40 min    Delta Engine Optimizations
          Databricks Delta Engine optimizations for more efficient queries
20 min    Demo: Delta Engine Optimizations
          Examples of Databricks Delta Engine optimizations
10 min    Wrap-up

Upcoming Classes

Date         Time               Time Zone                                    Location                       Price
Jan 25 - 26  9:00 AM - 5:00 PM  Central European Time                        Online - Virtual - EMEA        $1500.00 USD
Feb 8 - 9    9:00 AM - 6:00 PM  Pacific Standard Time                        Online - Virtual - US Pacific  $1500.00 USD
Feb 22 - 23  9:00 AM - 5:00 PM  Central European Time                        Online - Virtual - EMEA        $1500.00 USD
Mar 4 - 5    9:00 AM - 6:00 PM  Eastern Standard Time                        Online - Virtual - US Eastern  $1500.00 USD
Mar 15 - 16  8:00 AM - 5:00 PM  India Standard Time                          Online - Virtual - India       $1500.00 USD
Mar 22 - 23  9:00 AM - 5:00 PM  Central European Time                        Online - Virtual - EMEA        $1500.00 USD
Apr 8 - 9    9:00 AM - 6:00 PM  Pacific Daylight Time                        Online - Virtual - US Pacific  $1500.00 USD
Apr 26 - 27  9:00 AM - 5:00 PM  Central European Summer Time                 Online - Virtual - EMEA        $1500.00 USD
May 3 - 4    9:00 AM - 6:00 PM  Eastern Daylight Time                        Online - Virtual - US Eastern  $1500.00 USD
May 17 - 18  9:00 AM - 5:00 PM  Central European Summer Time                 Online - Virtual - EMEA        $1500.00 USD
Jun 2 - 3    8:00 AM - 5:00 PM  Australian Eastern Standard Time (Victoria)  Online - Virtual - Australia   $1500.00 USD
Jun 14 - 15  9:00 AM - 6:00 PM  Pacific Daylight Time                        Online - Virtual - US Pacific  $1500.00 USD
Jun 21 - 22  9:00 AM - 5:00 PM  Central European Summer Time                 Online - Virtual - EMEA        $1500.00 USD
Jul 12 - 13  9:00 AM - 6:00 PM  Eastern Daylight Time                        Online - Virtual - US Eastern  $1500.00 USD
Jul 19 - 20  8:00 AM - 5:00 PM  India Standard Time                          Online - Virtual - India       $1500.00 USD
Jul 26 - 27  9:00 AM - 5:00 PM  Central European Summer Time                 Online - Virtual - EMEA        $1500.00 USD
