Apache Spark™ Programming with Databricks

Summary

This course uses a case-study-driven approach to explore the fundamentals of Spark programming with Databricks, including Spark architecture, the DataFrame API, Structured Streaming, and query optimization.

Description

This course uses a case-study-driven approach to explore the fundamentals of Spark programming with Databricks, including Spark architecture, the DataFrame API, Structured Streaming, and query optimization. You will start by visualizing and applying Spark architecture concepts in example scenarios. Then, you will explore and preprocess datasets by applying a variety of DataFrame transformations and actions. After ingesting data from various file formats, you will apply these preprocessing steps and write the results to Delta tables. The case study then expands to streaming from Delta tables in an analytics use case that demonstrates core Structured Streaming concepts. Lastly, you will explore the Spark UI and learn how query optimization, partitioning, and caching affect performance.

Duration

2 Days

Objectives

Upon completion of the course, students should be able to:

  • Define the major components of Spark architecture and execution hierarchy
  • Describe how DataFrames are built, transformed, and evaluated in Spark
  • Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark
  • Apply the Structured Streaming API to perform analytics on streaming data
  • Navigate the Spark UI and describe how the Catalyst optimizer, partitioning, and caching affect Spark's execution performance

Audience

  • Data engineer
  • Data scientist
  • Machine learning engineer
  • Data architect

Prerequisites

  • Familiarity with basic SQL concepts (select, filter, group by, join, etc.)
  • Beginner-level programming experience with Python or Scala (syntax, conditions, loops, functions)

Additional Notes

All participants will need:

  • an internet connection

  • a device that is compliant with the supported internet browsers

  • to confirm that their device can run GoToTraining: Validate

NOTE: GoToTraining is the online platform through which the class will be delivered. Prior to attendance, each registrant will receive GoToTraining log-in instructions.

Outline

    2 Days (4 half days)

    Day 1: DataFrames

    1. Introduction
       Course overview, Databricks ecosystem, Spark overview, Case study, Knowledge check
    2. Databricks Platform
       Databricks concepts, Workspace UI, Notebooks, Lab
    3. Spark SQL
       Spark SQL module, Documentation, DataFrame concepts, Lab
    4. Reader & Writer
       Data Sources, DataFrameReader & Writer, Schemas, Performance, Lab
    5. DataFrame & Column
       Columns and expressions, Transformations, Actions, Rows, Lab

    Day 2: Transformations

    1. Aggregation
       Groupby, Grouped data methods, Aggregate functions, Math functions, Lab
    2. Datetimes
       Dates & Timestamps, Datetime patterns, Datetime functions, Lab
    3. Complex Types
       String functions, Collection functions
    4. Additional Functions
       Non-aggregate functions, NaFunctions, Lab
    5. User-Defined Functions
       User-defined functions, Vectorized UDFs, Performance, Lab

    Day 3: Spark Optimization

    1. Spark Architecture
       Spark Cluster, Spark Execution, Shuffling, Lab
    2. Shuffles & Caching
       Lineage, Shuffle files, Caching, Caching recommendations, Spark UI: Storage, Lab
    3. Query Optimization
       Catalyst Optimizer, Adaptive Query Execution, Best practices, Lab
    4. Spark UI
       Spark UI navigation, Spark UI: Jobs, Stages, SQL
    5. Partitioning
       Partitions vs cores, Default shuffle partitions, Repartition, Best practices, AQE, Lab

    Day 4: Structured Streaming

    1. Review
       DataFrames and Transformations, Lab
    2. Streaming Query
       Streaming concepts, Sources and Sinks, Streaming Query, Transformations, Lab
    3. Processing Streams
       Monitoring Streams, Lab
    4. Aggregating Streams
       Streaming aggregations, Windows, Watermarking, Lab
    5. Delta Lake
       Delta Lake concepts, Batch and streaming, Lab

     

Upcoming Classes

    Date | Time | Location | Price
    Nov 3 - 6 | 8:00 AM - 12:00 PM, Australian Eastern Daylight Time (New South Wales) | Online - Virtual - APJ (half-day schedule) | $1500.00 USD
    Nov 4 - 5 | 9:00 AM - 5:00 PM, Eastern Standard Time | Online - Virtual - US Eastern | $1500.00 USD
    Nov 23 - 24 | 9:00 AM - 5:00 PM, Eastern Standard Time | Online - Virtual - US Eastern | $1500.00 USD
    Nov 25 - 26 | 9:00 AM - 5:00 PM, Central European Time | Online - Virtual - EMEA | $1500.00 USD
    Dec 9 - 10 | 9:00 AM - 5:00 PM, Pacific Standard Time | Online - Virtual - US Pacific | $1500.00 USD


    Classes marked "Full" have no seats remaining and accept no additional registrations. If you cannot find another class that suits your schedule, feel free to request a class and we will do our best to accommodate your needs.

