DB 110 - Apache Spark™ Tuning and Best Practices


This course offers a deep dive into the processes of tuning Spark applications, developing best practices and avoiding many of the common pitfalls associated with developing Spark applications.


This course is a lab-intensive workshop in which students implement various best practices while inducing, diagnosing, and then fixing various performance problems. The course continues with numerous instructor-led coding challenges in which students refactor existing code, applying the best practices they have learned to improve overall performance. It concludes with a full-day workshop in which students work individually or in teams to complete a typical, full-scale data migration of a poorly maintained dataset.


3 Days


  • Shortcut investigations by developing common-sense intuition as to the root cause of various performance issues.
  • Diagnose & fix various storage-related performance issues, including:
      • The tiny files problem
      • Malformed partitions
      • Unpartitioned data
      • Overpartitioned data
      • Incorrectly typed data
  • Identify when and when not to cache data
  • Articulate the performance ramifications of different caching strategies
  • Diagnose & fix common coding mistakes that lead to de-optimization
  • Optimize joins via broadcasting, pruning, and pre-joining
  • Apply tips and tricks for:
      • Investigating the file system
      • Diagnosing partition skew
      • Developing and distributing utility functions
      • Developing micro-benchmarks & avoiding related pitfalls
      • Rapid ETL development for extremely large datasets
      • Working in shared cluster environments
  • Develop different strategies for testing ETL components
  • Rapidly develop insights on otherwise costly datasets


This course is primarily for software engineers, but it is directly applicable to analysts, architects, and data scientists who want to further their skills by learning how to develop high-performance Spark applications and how to diagnose and troubleshoot common performance problems.


  • Intermediate to advanced programming experience in Python or Scala is required.
  • Practical experience developing Apache Spark applications is recommended.
  • Proficiency with Apache Spark's DataFrames API is desired but not essential.

Additional Notes

All participants will need:

  • an internet connection

  • a device that is compatible with one of the supported internet browsers

  • to confirm that their device can run GoToTraining: Validate

  • NOTE: GoToTraining is our chosen online platform through which the class will be delivered. Prior to attendance, each registrant will receive GoToTraining log-in instructions.

Upcoming Classes

    Aug 27 - Aug 29, 8:00 AM - 4:00 PM Pacific Daylight Time
    Online - Virtual Class - US Pacific Time
    $2500.00 USD

    Oct 22 - Oct 24, 9:00 AM - 5:00 PM Pacific Daylight Time
    Online - Virtual Class - US Pacific Time
    $2500.00 USD

    Dec 3 - Dec 5, 9:00 AM - 5:00 PM Pacific Standard Time
    Online - Virtual Class - US Pacific Time
    $2500.00 USD
