Optimizing Apache Spark™ on Databricks

Summary

This 1-day course deepens students' knowledge of the key problem areas in Apache Spark, shows how to mitigate those problems, and explores new features in Spark 3 that further improve application performance.

Description

In this course, students explore five key problems that account for the vast majority of performance issues in Apache Spark applications: skew, spill, shuffle, storage, and serialization. For each topic, we work through coding examples based on 100 GB to 1+ TB datasets that demonstrate how the problem is introduced and how to diagnose it with tools like the Spark UI, and we conclude by discussing mitigation strategies.
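
To make the skew problem concrete, the following is a minimal sketch in plain Python, not taken from the course materials. It assumes a simple hash partitioner (a CRC32 stand-in, not Spark's actual hash function) and a hypothetical dataset in which one key holds 90% of the records. It illustrates one commonly used mitigation, key salting; the function names and data are made up for illustration.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (not Spark's actual hash)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Return the number of records that land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys, salt_buckets):
    """Append a round-robin salt so one hot key becomes several distinct keys."""
    return [(k, i % salt_buckets) for i, k in enumerate(keys)]


# Hypothetical data: one "hot" key holds 90% of 10,000 records.
keys = ["hot"] * 9000 + ["key-%d" % (i % 100) for i in range(1000)]

before = partition_sizes(keys, num_partitions=8)
after = partition_sizes(salt_keys(keys, salt_buckets=16), num_partitions=8)

print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
```

Without salting, every record for the hot key lands in a single partition, so one task does most of the work. Salting spreads those records over several partitions, at the cost of extra work elsewhere: in a join, for example, the other side has to be replicated once per salt value.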

 

We continue by looking at a series of key ingestion concepts that support strategies for processing terabytes of data, including managing Spark partition sizes, disk partitioning, bucketing, Z-ordering, and more. For each topic, we explore when and how the technique should be implemented, the new challenges that productionizing these solutions might introduce, and the corresponding mitigation strategies.
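
As an illustration of what managing Spark partition sizes can involve, here is a small arithmetic sketch in Python. It assumes a target of roughly 128 MiB per partition (the default value of `spark.sql.files.maxPartitionBytes`) and rounds the partition count up to a multiple of the available cores. Both choices are common rules of thumb rather than the course's prescribed method, and the function name is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def partitions_for(data_bytes, target_partition_bytes=128 * MIB, total_cores=1):
    """Estimate a partition count for a dataset of the given size.

    Rule-of-thumb assumption: aim for about target_partition_bytes per
    partition, then round up to a multiple of total_cores so that the last
    wave of tasks keeps every core busy.
    """
    # Ceiling division on integers avoids floating-point rounding.
    raw = max(1, -(-data_bytes // target_partition_bytes))
    waves = -(-raw // total_cores)
    return waves * total_cores


print(partitions_for(100 * GIB))                   # 100 GiB at 128 MiB each
print(partitions_for(1024 * GIB, total_cores=48))  # 1 TiB on 48 cores
```

For 100 GiB this gives 800 partitions; for 1 TiB on 48 cores it gives 8208 (8192 rounded up to the next multiple of 48). The right target size depends on the workload, so these numbers are a starting point for experimentation, not a recommendation.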

 

Finally, we introduce several other key topics, such as issues with data locality, IO caching and Spark caching, the pitfalls of broadcast joins, and new features like Spark 3’s Adaptive Query Execution and Dynamic Partition Pruning. We then conclude the course with discussions and exercises on designing and configuring clusters for optimal performance given specific use cases, personas, the divergent needs of various teams, and cross-team security concerns.
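
For orientation, the Spark 3 features named above are controlled through ordinary Spark SQL configuration properties. The sketch below collects the relevant property names in a Python dictionary and renders them as `spark-submit --conf` arguments. The property names are real Spark settings, but the helper functions and the chosen values are illustrative assumptions, not settings recommended by the course.

```python
def aqe_dpp_conf(broadcast_threshold_bytes=10 * 1024 * 1024):
    """Return Spark SQL properties related to AQE, DPP, and broadcast joins.

    The values here are illustrative; appropriate settings depend on the
    workload and the Spark version in use.
    """
    return {
        # Adaptive Query Execution and two of its sub-features.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
        # Dynamic Partition Pruning.
        "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
        # Size limit (in bytes) below which a table may be broadcast in a join.
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold_bytes),
    }


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", "%s=%s" % (key, value)]
    return args


print(" ".join(to_submit_args(aqe_dpp_conf())))
```

The same properties can also be set on a running session or in a cluster's Spark configuration; which of them are already enabled by default varies by Spark and Databricks Runtime version.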

Duration

8 hours

Audience

  • Data engineer
  • Data architect

Prerequisites

  • Intermediate to advanced programming experience in Python or Scala
  • Hands-on experience developing Apache Spark applications 

Outline

 

Day #1 AM

Duration | Modules
20 min | Setup, Introductions, etc.
200 min | The 5 Most Common Performance Problems (The 5Ss)
  • Introduction / Benchmarking (1 lab)
  • Skew (3 labs)
  • Spill (2 labs)
  • Shuffle (0 labs)
  • Storage (3 labs)
  • Serialization (2 labs)
220 min | Total

 

Day #1 PM

Duration | Modules
120 min | Key Ingestion Concepts
  • Ingestion Basics (4 labs)
  • Predicate Push Downs (3 labs)
  • Disk Partitioning (1 lab)
  • Z-Ordering (2 labs)
  • Bucketing (2 labs)
  • Review/Wrap Up
10 min | Break
60 min | Optimizing with AQE & DPP
  • Intro
  • Tuning Shuffle Partitions (1 lab)
  • Join Optimizations (1 lab)
  • Skew Join Optimizations (1 lab)
  • Dynamic Partition Pruning (1 lab)
10 min | Break
60 min | Designing Clusters for High Performance
  • Designing Clusters for High Performance
  • Cluster Configurations Scenarios
  • Designing Clusters Breakout
260 min | Total

 

Optional Review Topics

60 min | Introduction to Spark Architecture
45 min | Spark UI Demo

 

Upcoming Classes

Date | Time | Location | Price
Jan 29 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1000.00 USD
Feb 26 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1000.00 USD
Mar 2 - 3 | 8:00 AM - 12:00 PM Australian Eastern Daylight Time (New South Wales) | Online - Virtual - APJ (half-day schedule) | $1000.00 USD
Mar 3 - 4 | 9:00 AM - 1:00 PM Pacific Standard Time | Online - Virtual - Americas (half-day schedule) | $1000.00 USD
Mar 26 | 9:00 AM - 5:00 PM Central European Time | Online - Virtual - EMEA | $1000.00 USD
Apr 1 - 2 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1000.00 USD
Apr 30 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1000.00 USD
May 13 - 14 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1000.00 USD
May 21 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1000.00 USD
Jun 17 - 18 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1000.00 USD
Jun 25 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1000.00 USD
Jul 29 - 30 | 9:00 AM - 1:00 PM Pacific Daylight Time | Online - Virtual - Americas (half-day schedule) | $1000.00 USD
Jul 30 | 9:00 AM - 5:00 PM Central European Summer Time | Online - Virtual - EMEA | $1000.00 USD
