Apache Spark™ Programming with Databricks

Summary

This course uses a case-study-driven approach to explore the fundamentals of Spark programming with Databricks, including Spark architecture, the DataFrame API, Structured Streaming, and query optimization.

Description

This course uses a case-study-driven approach to explore the fundamentals of Spark programming with Databricks, including Spark architecture, the DataFrame API, Structured Streaming, and query optimization. You will start by visualizing and applying Spark architecture concepts in example scenarios. Then, you will explore and preprocess datasets by applying a variety of DataFrame transformations and actions. After ingesting data from various file formats, you will apply these preprocessing steps and write the results to Delta tables. The case study then expands to streaming from Delta tables in an analytics use case that demonstrates core Structured Streaming concepts. Lastly, you will explore the Spark UI and see how query optimization, partitioning, and caching affect performance.

Duration

2 Days

Objectives

Upon completion of the course, students should be able to meet the following objectives:

  • Define the major components of Spark architecture and execution hierarchy
  • Describe how DataFrames are built, transformed, and evaluated in Spark
  • Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark
  • Apply the Structured Streaming API to perform analytics on streaming data
  • Navigate the Spark UI and describe how the Catalyst optimizer, partitioning, and caching affect Spark's execution performance

Audience

  • Data engineer
  • Data scientist
  • Machine learning engineer
  • Data architect

Prerequisites

  • Familiarity with basic SQL concepts (select, filter, group by, join, etc.)
  • Beginner programming experience with Python or Scala (syntax, conditions, loops, functions)

Additional Notes

All participants will need:

  • an internet connection

  • a device with a supported internet browser

  • to confirm that their device can run GoToTraining

NOTE: GoToTraining is the online platform through which the class will be delivered. Prior to attendance, each registrant will receive GoToTraining log-in instructions.

Outline

    2 Days (4 half days)

    Day 1: DataFrames

    1. Introduction: Course overview, Databricks ecosystem, Spark overview, Case study, Knowledge check
    2. Databricks Platform: Databricks concepts, Workspace UI, Notebooks, Lab
    3. Spark SQL: Spark SQL module, Documentation, DataFrame concepts, Lab
    4. Reader & Writer: Data Sources, DataFrameReader & Writer, Schemas, Performance, Lab
    5. DataFrame & Column: Columns and expressions, Transformations, Actions, Rows, Lab

    Day 2: Transformations

    1. Aggregation: Groupby, Grouped data methods, Aggregate functions, Math functions, Lab
    2. Datetimes: Dates & Timestamps, Datetime patterns, Datetime functions, Lab
    3. Complex Types: String functions, Collection functions
    4. Additional Functions: Non-aggregate functions, NaFunctions, Lab
    5. User-Defined Functions: User-defined functions, Vectorized UDFs, Performance, Lab

    Day 3: Spark Optimization

    1. Spark Architecture: Spark Cluster, Spark Execution, Shuffling, Lab
    2. Shuffles & Caching: Lineage, Shuffle files, Caching, Caching recommendations, Spark UI: Storage, Lab
    3. Query Optimization: Catalyst Optimizer, Adaptive Query Execution, Best practices, Lab
    4. Spark UI: Spark UI navigation, Spark UI: Jobs, Stages, SQL
    5. Partitioning: Partitions vs cores, Default shuffle partitions, Repartition, Best practices, AQE, Lab

    Day 4: Structured Streaming

    1. Review: DataFrames and Transformations, Lab
    2. Streaming Query: Streaming concepts, Sources and Sinks, Streaming Query, Transformations, Lab
    3. Processing Streams: Monitoring Streams, Lab
    4. Aggregating Streams: Streaming aggregations, Windows, Watermarking, Lab
    5. Delta Lake: Delta Lake concepts, Batch and streaming, Lab

     

    Upcoming Classes

    Date | Time | Location | Price
    Jan 20 - 21 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1500.00 USD
    Jan 25 - 26 | 9:00 AM - 6:00 PM Eastern Standard Time | Online - Virtual - US Eastern | $1500.00 USD
    Feb 10 - 11 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1500.00 USD
    Feb 11 - 12 | 8:00 AM - 5:00 PM Australian Eastern Daylight Time (Victoria) | Online - Virtual - Australia | $1500.00 USD
    Feb 16 - 17 | 9:00 AM - 6:00 PM Pacific Standard Time | Online - Virtual - US Pacific | $1500.00 USD
    Mar 1 - 2 | 8:00 AM - 5:00 PM India Standard Time | Online - Virtual - India | $1500.00 USD
    Mar 3 - 4 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1500.00 USD
    Mar 18 - 19 | 9:00 AM - 6:00 PM Eastern Daylight Time | Online - Virtual - US Eastern | $1500.00 USD
    Apr 1 - 2 | 9:00 AM - 6:00 PM Eastern Daylight Time | Online - Virtual - US Eastern | $1500.00 USD
    Apr 19 - 20 | 8:00 AM - 5:00 PM India Standard Time | Online - Virtual - India | $1500.00 USD
    Apr 22 - 23 | 9:00 AM - 6:00 PM Pacific Daylight Time | Online - Virtual - US Pacific | $1500.00 USD
    Apr 22 - 23 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
    May 12 - 13 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
    May 13 - 14 | 9:00 AM - 6:00 PM Eastern Daylight Time | Online - Virtual - US Eastern | $1500.00 USD
    Jun 7 - 8 | 9:00 AM - 6:00 PM Eastern Daylight Time | Online - Virtual - US Eastern | $1500.00 USD
    Jun 16 - 17 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
    Jun 28 - 29 | 9:00 AM - 6:00 PM Pacific Daylight Time | Online - Virtual - US Pacific | $1500.00 USD
    Jul 5 - 6 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
    Jul 22 - 23 | 9:00 AM - 6:00 PM Eastern Daylight Time | Online - Virtual - US Eastern | $1500.00 USD
    Jul 27 - 28 | 8:00 AM - 5:00 PM India Standard Time | Online - Virtual - India | $1500.00 USD

    Classes marked "Full" accept no additional registrations. If you cannot find another class that suits your schedule, feel free to request a class and we will do our best to accommodate your needs.

