SP870-Az: Tuning, Troubleshooting, and Best Practices Part 1: The Basics (Azure Databricks)

Summary

In this course, data engineers and data scientists will understand the essentials of tuning Apache Spark™ applications.

We’ll review the architectural foundations of Spark, including cluster sizing and caching. We’ll also explore the Spark UI and learn how to interpret DAGs, Event Timelines, and Metrics.

By the end of this course, you will have the core knowledge and experience to begin troubleshooting performance-related problems.

NOTE: This course is delivered on the Databricks Unified Analytics Platform (based on Apache Spark™). While you might find it helpful for learning how to use Apache Spark in other environments, it does not teach you how to use Apache Spark in those environments.

Description

Length

3-6 hours, 30% hands-on

Format: Self-paced

The course is a series of nine self-paced lessons in Databricks notebooks.

Platform

This version of the course is intended to be run on Azure Databricks.

Learning Objectives

During this course, learners will:

  • Review the distributed programming approach of Spark
  • Review the nature of the work done by drivers and executors
  • Review how jobs, stages, and tasks relate
  • Learn how the Spark UI can be used to discover performance-related issues in Spark applications
  • Learn how to interpret DAGs, Stage boundaries, Event Timelines, and Metrics
  • Develop an intuition for cluster sizing
  • Understand how caching works in a shared cluster environment
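
As a rough illustration of the "jobs, stages, and tasks" objective above, the sketch below models the relationship in plain Python rather than with Spark itself. It relies on two general facts about Spark: a job is split into stages at shuffle boundaries, and each stage runs one task per partition. The names `Stage` and `plan_job` and the partition counts are hypothetical and are not taken from the course material. Real Spark plans can differ, for example when stages are skipped or adaptive query execution changes partition counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """Hypothetical stand-in for a Spark stage (not a Spark class)."""
    stage_id: int
    num_partitions: int

    @property
    def num_tasks(self) -> int:
        # Spark schedules one task per partition within a stage.
        return self.num_partitions


def plan_job(input_partitions: int, shuffle_partitions: List[int]) -> List[Stage]:
    """Model one job: each shuffle ends a stage and starts a new one."""
    partition_counts = [input_partitions] + list(shuffle_partitions)
    return [Stage(i, n) for i, n in enumerate(partition_counts)]


# Assumed example: read 8 partitions, then one shuffle (such as a groupBy)
# that produces 4 partitions.
stages = plan_job(input_partitions=8, shuffle_partitions=[4])
total_tasks = sum(stage.num_tasks for stage in stages)
print(f"{len(stages)} stages, {total_tasks} tasks")
```

In this simplified model, a job with no shuffle has a single stage, and every additional shuffle adds one more stage boundary of the kind the Spark UI shows in a DAG.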

Lessons

  1. Course Overview and Setup
  2. Scenario 1 - Understanding Apache Spark
  3. Scenario 2 - Understanding Apache Spark
  4. Scenario 3 - Understanding Apache Spark
  5. Spark Architecture - Key Concepts
  6. The Spark UI
  7. The Spark UI in Action Lab
  8. Cluster Configurations
  9. A Caching Story

Target Audience

  • Primary Audience: Data Engineers

Prerequisites

  • Getting Started with Apache Spark DataFrames (optional, but strongly encouraged)

Lab Requirements

License Limitations

This self-paced training course may be used by 1 user for 12 months from the date of purchase. It may not be transferred or shared with any other user.

Terms

The use of the self-paced training course is subject to the Terms of Service and the Databricks Privacy Policy.

Duration

6 hours