Optimizing Apache Spark™ on Databricks

Summary

This 2-day course deepens students' knowledge of key “problem” areas in Apache Spark, shows how to mitigate those problems, and explores new features in Spark 3 that push application performance further.

Note: This class is available as a one-day version in Private Only Training.

Description

In this course, students will explore five key problems that account for the vast majority of performance problems in an Apache Spark application: skew, spill, shuffle, storage, and serialization. For each topic, we work through coding examples based on 100 GB to 1+ TB datasets that demonstrate how the problem is introduced and how to diagnose it with tools like the Spark UI, and we conclude by discussing mitigation strategies.

We continue the conversation by looking at a series of key ingestion concepts that support strategies for processing terabytes of data, including managing Spark partition sizes, disk partitioning, bucketing, Z-ordering, and more. For each technique, we explore when and how it should be implemented, the new challenges that productionizing it might introduce, and the corresponding mitigation strategies.
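
As a rough illustration of the partition-size management mentioned above, the sketch below estimates a shuffle partition count from the amount of data being shuffled. The 128 MB target size and the rounding up to a multiple of the core count are assumed rules of thumb, not values taken from the course, and the function name `suggest_shuffle_partitions` is hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MB, total_cores=1):
    """Estimate a shuffle partition count (illustrative heuristic only).

    Assumption: aim for partitions of roughly `target_partition_bytes`,
    then round up to a multiple of `total_cores` so every core has work
    in each wave of tasks.
    """
    if shuffle_bytes <= 0:
        return total_cores
    # Ceiling division using integers to avoid floating-point rounding.
    by_size = -(-shuffle_bytes // target_partition_bytes)
    waves = -(-by_size // total_cores)
    return waves * total_cores


if __name__ == "__main__":
    # 100 GB shuffled on a hypothetical 32-core cluster.
    print(suggest_shuffle_partitions(100 * GB, total_cores=32))
```

A value computed this way could be assigned to the `spark.sql.shuffle.partitions` setting; the right target size depends on the workload and should be checked against spill and task-time metrics in the Spark UI.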

Finally, we introduce several other key topics, such as issues with data locality, IO caching and Spark caching, the pitfalls of broadcast joins, and new features like Spark 3’s Adaptive Query Execution and Dynamic Partition Pruning. We then conclude the course with discussions and exercises on designing and configuring clusters for optimal performance given specific use cases, personas, the divergent needs of various teams, and cross-team security concerns.
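
For orientation, the Spark 3 features named above are controlled by SQL configuration flags. The sketch below lists four such flags and applies them through `spark.conf.set`. The dictionary name and the helper function are hypothetical, and the appropriate values depend on the workload and Spark version (several of these flags are already enabled by default in recent releases).

```python
# Spark SQL flags for Adaptive Query Execution (AQE) and
# Dynamic Partition Pruning (DPP). Values are illustrative.
AQE_AND_DPP_SETTINGS = {
    # Re-optimize query plans at runtime using shuffle statistics.
    "spark.sql.adaptive.enabled": "true",
    # Let AQE merge small shuffle partitions after a stage completes.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Let AQE split oversized partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Prune partitions of one join side using filters from the other side.
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
}


def apply_settings(spark, settings=AQE_AND_DPP_SETTINGS):
    """Apply each flag to an existing session and return what was applied.

    `spark` is assumed to be a SparkSession (or any object exposing
    `conf.set(key, value)`); creating the session is out of scope here.
    """
    for key, value in settings.items():
        spark.conf.set(key, value)
    return dict(settings)
```

In a Databricks notebook, where a session named `spark` already exists, this would be called as `apply_settings(spark)`.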

Duration

2 Days

Audience

  • Data engineer
  • Data architect

Prerequisites

  • Intermediate to advanced programming experience in Python or Scala
  • Hands-on experience developing Apache Spark applications 

Outline

Day 1 Morning: Introduction and the 5 Most Common Performance Problems

  • Setup, Introductions
  • Spark Architecture Review
  • Spark UI Review, Lab
  • Skew, Labs

Day 1 Afternoon: The 5 Most Common Performance Problems Continued

  • Spill, Labs
  • Shuffle
  • Storage, Labs
  • Serialization, Labs

Day 2 Morning: Key Ingestion Concepts

  • Ingestion Basics, Labs
  • Predicate Push Downs, Labs
  • Disk Partitioning, Lab
  • Z-Ordering, Labs
  • Bucketing, Labs

Day 2 Afternoon: Optimizing with AQE and High Performance; Designing Clusters for High Performance

  • Tuning Shuffle Partitions, Lab
  • Join Optimizations, Lab
  • Skew Join Optimizations, Lab
  • Dynamic Partition Pruning, Lab
  • Designing Clusters for High Performance
  • Cluster Configurations Scenarios
  • Designing Clusters Breakout

Upcoming Classes

Date | Time | Location | Price
Oct 26 - 27 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
Nov 1 - 4 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Nov 18 - 19 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1500.00 USD
Dec 6 - 9 | 9:00 AM - 1:00 PM Pacific Standard Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Dec 6 - 9 | 1:00 PM - 5:00 PM Australian Eastern Daylight Time (Victoria) | Online - Virtual - Australia AEST | $1500.00 USD
Jan 17 - 18 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1500.00 USD
Jan 24 - 27 | 9:00 AM - 1:00 PM Pacific Standard Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Mar 14 - 17 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Mar 24 - 25 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1500.00 USD
Apr 25 - 28 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Apr 26 - 27 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
Jun 13 - 16 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1500.00 USD
Jun 21 - 22 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1500.00 USD
