Quick Reference: Spark Architecture

Summary

Get an overview of the internal architecture of Apache Spark™.

Description

Apache Spark™ is a unified analytics engine for large-scale data processing, known for its speed, ease and breadth of use, ability to access diverse data sources, and APIs built to support a wide range of use cases. Databricks builds on top of Spark and adds many performance and security enhancements. This course provides an overview of Spark’s internal architecture.

Learning objectives

  • Describe basic Spark architecture and define terminology such as “driver” and “executor”.

  • Explain how parallelization allows Spark to improve the speed and scalability of an application.

  • Describe lazy evaluation and how it relates to pipelining.

  • Identify the high-level events in each stage of the optimization process.

Prerequisites

  • Beginning knowledge of big data and data science concepts.

Learning path

  • This course is part of the SQL Analyst, Data Engineer, and Data Scientist learning paths.

Proof of completion

  • You will receive proof of completion after completing 80% of this course.