Databricks Certified Professional Data Engineer

Summary

Coming October 2021
The Databricks Certified Professional Data Engineer certification exam assesses the understanding of the Databricks platform and developer tools, the ability to build data processing pipelines, the ability to model data for data pipelines, the ability to make data pipelines secure, the ability to monitor and log activity on data pipelines, and an understanding of best practices for managing, testing, and deploying code.

Description

The Databricks Certified Professional Data Engineer certification exam assesses an individual’s ability to use Databricks to perform common data engineering tasks. This includes an understanding of the Databricks platform and developer tools such as Apache Spark, Delta Lake, MLflow, and the Databricks CLI and REST API. It also assesses the ability to build optimized and clean ETL pipelines, and to model data into a Lakehouse using general data modeling concepts. Finally, the exam covers ensuring that data pipelines are secure, reliable, monitored, and tested before deployment. Individuals who pass this certification exam can be expected to complete data engineering tasks using Databricks and its associated tools.

Prerequisites

The minimally qualified candidate should:

  • Understand how to use the Databricks platform and its tools, as well as the benefits of using them, including:
    • Platform (notebooks, clusters, Jobs, Databricks SQL, relational entities, Repos)
    • Apache Spark (PySpark, DataFrame API, basic architecture)
    • Delta Lake (SQL-based Delta APIs, basic architecture, core functions)
    • Databricks CLI (deploying notebook-based workflows)
    • Databricks REST API (configuring and triggering production pipelines; see the REST API sketch after this list)
  • Build data processing pipelines using the Spark and Delta Lake APIs (see the ETL and CDC sketches after this list), including:
    • Building batch-processed ETL pipelines
    • Building incrementally processed ETL pipelines
    • Optimizing workloads
    • Deduplicating data
    • Using Change Data Capture (CDC) to propagate changes
  • Model data management solutions, including:
    • Lakehouse (bronze/silver/gold architecture, databases, tables, views, and the physical layout)
    • General data modeling concepts (keys, constraints, lookup tables, slowly changing dimensions)
  • Build production pipelines using best practices around security and governance (see the governance sketches after this list), including:
    • Managing notebook and jobs permissions with ACLs
    • Creating row- and column-oriented dynamic views to control user/group access
    • Securely storing personally identifiable information (PII)
    • Securely deleting data as requested under GDPR and CCPA
  • Configure alerting and storage to monitor and log production jobs, including:
    • Setting up notifications
    • Configuring SparkListener
    • Recording logged metrics
    • Navigating and interpreting the Spark UI
    • Debugging errors
  • Follow best practices for managing, testing, and deploying code (see the unit-testing sketch after this list), including:
    • Managing dependencies
    • Creating unit tests
    • Creating integration tests
    • Scheduling Jobs
    • Versioning code/notebooks
    • Orchestrating Jobs
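
As a minimal sketch of the REST API skill above, the following Python snippet triggers an existing job through the Jobs 2.1 run-now endpoint; the workspace URL, token environment variables, and job ID are placeholders, not values from this guide:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
    token = os.environ["DATABRICKS_TOKEN"]  # a personal access token, assumed to be set

    response = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": 123},               # hypothetical job ID
    )
    response.raise_for_status()
    print(response.json()["run_id"])        # identifier of the triggered run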
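
The next sketch illustrates a simple batch ETL flow from raw JSON files into bronze and silver Delta tables, including deduplication on a business key; the paths, database names, and columns are assumptions, and the snippet presumes the spark session provided by a Databricks notebook:

    from pyspark.sql import functions as F

    # Bronze: ingest raw files as-is, stamped with an ingestion time
    raw_df = spark.read.json("/mnt/raw/orders/")          # hypothetical source path
    (raw_df
        .withColumn("ingest_time", F.current_timestamp())
        .write.format("delta")
        .mode("append")
        .saveAsTable("bronze.orders"))

    # Silver: cleaned and deduplicated view of the bronze data
    silver_df = (spark.table("bronze.orders")
        .filter(F.col("order_id").isNotNull())
        .dropDuplicates(["order_id"]))
    (silver_df
        .write.format("delta")
        .mode("overwrite")
        .saveAsTable("silver.orders"))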
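
For incremental processing and CDC, one possible pattern combines Auto Loader with a MERGE-based upsert inside foreachBatch; the table names, paths, keys, and the change_time column are assumptions for illustration:

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    def upsert_to_silver(microbatch_df, batch_id):
        # Keep only the latest change per key within the micro-batch (deduplication)
        latest = (microbatch_df
            .withColumn("rn", F.row_number().over(
                Window.partitionBy("order_id").orderBy(F.col("change_time").desc())))
            .filter("rn = 1")
            .drop("rn"))
        # Upsert the remaining changes into the silver table
        (DeltaTable.forName(spark, "silver.orders")
            .alias("t")
            .merge(latest.alias("s"), "t.order_id = s.order_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())

    (spark.readStream
        .format("cloudFiles")                                        # Auto Loader
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders_cdc_schema")
        .load("/mnt/raw/orders_cdc/")
        .writeStream
        .foreachBatch(upsert_to_silver)
        .option("checkpointLocation", "/mnt/checkpoints/orders_cdc")
        .trigger(availableNow=True)                                  # process available data, then stop
        .start())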
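
For column- and row-level access control, a dynamic view can redact PII and filter rows based on group membership; the group names, table, and columns below are assumptions:

    # is_member() evaluates the querying user's group membership at query time
    spark.sql("""
        CREATE OR REPLACE VIEW silver.orders_redacted AS
        SELECT
          order_id,
          CASE WHEN is_member('pii_readers') THEN email ELSE 'REDACTED' END AS email,
          region,
          amount
        FROM silver.orders
        WHERE is_member('admins') OR region = 'US'
    """)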
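
For GDPR/CCPA-style deletion requests against a Delta table, the following sketch removes one customer's records and then vacuums old files so the deleted data is eventually removed from storage; the table, column, and value are placeholders:

    spark.sql("DELETE FROM silver.orders WHERE customer_id = 'customer-123'")
    spark.sql("VACUUM silver.orders RETAIN 168 HOURS")  # the default 7-day retention window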
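
Finally, a minimal unit-testing sketch using pytest and a local SparkSession; the transformation under test (clean_orders) is a hypothetical example:

    import pytest
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    def clean_orders(df):
        """Drop rows without an order_id and deduplicate on it."""
        return df.filter(F.col("order_id").isNotNull()).dropDuplicates(["order_id"])

    @pytest.fixture(scope="session")
    def spark():
        # Local session so the test can run outside a Databricks cluster
        return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

    def test_clean_orders_removes_nulls_and_duplicates(spark):
        df = spark.createDataFrame(
            [("o1", 10.0), ("o1", 10.0), (None, 5.0)], ["order_id", "amount"])
        result = clean_orders(df)
        assert result.count() == 1
        assert result.first()["order_id"] == "o1"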

Preparation

The following Databricks courses should help you prepare for this exam:

  • Apache Spark Programming with Databricks
  • Optimizing Apache Spark on Databricks
  • Data Engineering with Databricks

Exam Details

The exam details are as follows:

  • The exam consists of 60 multiple-choice questions. Candidates will have 120 minutes to complete the exam.
  • The minimum passing score for the exam is 70 percent. This translates to correctly answering a minimum of 42 of the 60 questions.
  • The exam will be conducted via an online proctor.

Other exam details are available via the Certification FAQ.

Technology Requirements

Please find tech requirements and preparation instructions on Kryterion’s Online Proctored Exam Support page.

Release

This exam will be released publicly in October 2021.