DB 110 - Apache Spark™ Tuning and Best Practices

Summary

This course offers a deep dive into the processes of tuning Spark applications, developing best practices and avoiding many of the common pitfalls associated with developing Spark applications.

Description

This course is a lab-intensive workshop in which students implement various best practices while inducing, diagnosing, and then fixing various performance problems. The course continues with numerous instructor-led coding challenges in which students refactor existing code to improve overall performance by applying the best practices they have learned. It concludes with a full-day workshop in which students work individually or in teams to complete a typical, full-scale data migration of a poorly maintained dataset.

Duration

3 Days

Objectives

  • Shortcut investigations by developing common-sense intuition as to the root cause of various performance issues.
  • Diagnose & fix various storage-related performance issues, including:
      • The tiny files problem
      • Malformed partitions
      • Unpartitioned data
      • Overpartitioned data
      • Incorrectly typed data
  • Identify when and when not to cache data
  • Articulate the performance ramifications of different caching strategies
  • Diagnose & fix common coding mistakes that lead to de-optimization
  • Optimize joins via broadcasting, pruning, and pre-joining
  • Apply tips and tricks for:
      • Investigating the file system
      • Diagnosing partition skew
      • Developing and distributing utility functions
      • Developing micro-benchmarks & avoiding related pitfalls
      • Rapid ETL development for extremely large datasets
      • Working in shared cluster environments
  • Develop different strategies for testing ETL components
  • Rapidly develop insights on otherwise costly datasets

Audience

This course is primarily for software engineers but is directly applicable to analysts, architects, and data scientists who want to further their skills by learning how to develop high-performance Spark applications and how to diagnose and troubleshoot common performance problems.

Prerequisites

  • Intermediate to advanced programming experience in Python or Scala is required.
  • Practical experience developing Apache Spark applications is recommended.
  • Proficiency with Apache Spark's DataFrames API is desired but not essential.

Additional Notes

All participants will need:

  • a laptop with an up-to-date version of Chrome or Firefox (Internet Explorer and Safari are not supported)
  • an internet connection that can support the use of GoToTraining

GoToTraining is the online platform through which the class will be delivered. Prior to attendance, each registrant will receive GoToTraining log-in instructions. For more information, and to confirm that your computer can run GoToTraining, please check here: Validation

    Upcoming Classes

    Date     Time                 Time zone               Location   Price
    Jul 15   7:00 AM - 11:00 AM   Pacific Daylight Time   Online     $2500.00 USD
    Jul 22   9:00 AM - 1:00 PM    Greenwich Mean Time     Online     $2500.00 USD ($2125.00 USD before Jun 28, 2019 6:00 PM GMT); confirmed
    Aug 27   8:00 AM - 4:00 PM    Pacific Daylight Time   Online     $2500.00 USD
    Oct 22   8:00 AM - 4:00 PM    Pacific Daylight Time   Online     $2500.00 USD
    Dec 3    8:00 AM - 4:00 PM    Pacific Standard Time   Online     $2500.00 USD
