SP870: Tuning, Troubleshooting, and Best Practices Part 1: The Basics (AWS Databricks)
In this course, data engineers and data scientists will learn the essentials of tuning Apache Spark™ applications.
We’ll review the architectural foundations of Spark, including cluster sizing and caching. We’ll also explore the Spark UI, learning how to interpret DAGs, event timelines, and metrics.
By the end of this course, you will have the core knowledge and experience to begin troubleshooting performance-oriented problems.
NOTE: This course is delivered on the Databricks Unified Analytics Platform (based on Apache Spark™). While you might find it helpful for learning how to use Apache Spark in other environments, it does not teach you how to use Apache Spark in those environments.
3-6 hours, 30% hands-on
The course is a series of nine self-paced lessons in Databricks notebooks.
This version of the course is intended to be run on AWS Databricks.
During this course, learners will:
- Review the distributed programming approach of Spark
- Review the nature of the work done by drivers and executors
- Review how jobs, stages, and tasks relate
- Learn how the Spark UI can be used to discover performance issues in Spark applications
- Learn how to interpret DAGs, stage boundaries, event timelines, and metrics
- Develop an intuition for cluster sizing
- Understand how caching works in a shared cluster environment
- Course Overview and Setup
- Scenario 1 - Understanding Apache Spark
- Scenario 2 - Understanding Apache Spark
- Scenario 3 - Understanding Apache Spark
- Spark Architecture - Key Concepts
- The Spark UI
- The Spark UI in Action Lab
- Cluster Configurations
- A Caching Story
- Primary Audience: Data Engineers
- Getting Started with Apache Spark DataFrames (optional, but strongly encouraged)
- Please be sure to use a supported browser.
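As a taste of the cluster-configuration material, the knobs discussed in the course are the kind exposed by `spark-submit`. The values below are assumptions for a hypothetical cluster, shown only to illustrate the flags, not as recommendations:

```shell
# Illustrative only: executor sizing and shuffle parallelism settings.
# The host, sizes, and application file are hypothetical.
spark-submit \
  --master spark://master:7077 \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 8 \
  --conf spark.sql.shuffle.partitions=64 \
  my_app.py
```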
This self-paced training course may be used by 1 user for 12 months from the date of purchase. It may not be transferred or shared with any other user.