Apache Spark MasterClass Chapter 1 – Episode 5
-
Walk through a Spark application's lifecycle from SparkSession creation to completion across Driver, Cluster Manager, and Executors.
A Spark application’s lifecycle begins when a user launches the app and creates a **SparkSession**. This initializes the **SparkContext**, which acts as the main entry point. The **Driver** then builds a logical execution plan (a DAG) from the transformations defined in the code. It communicates with the **Cluster Manager** (like YARN or Kubernetes) to request resources. The Cluster Manager allocates containers or pods on worker nodes and starts the **Executors**. The Driver then divides the DAG into **stages** (separated by wide dependencies like shuffles) and each stage into smaller **tasks** that run on the executors. The executors perform the computations on their assigned data partitions and report their status back to the Driver. Once all tasks complete for an **action**, the results are written to a sink, and the Driver stops the SparkContext, releasing all resources.
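The lifecycle above can be sketched as a minimal PySpark application; the app name, input path, and sink path are illustrative:

```python
from pyspark.sql import SparkSession

# 1. Creating the SparkSession initializes the Driver's SparkContext.
spark = SparkSession.builder.appName("lifecycle-demo").getOrCreate()

# 2. Transformations only build the logical plan (DAG); nothing executes yet.
df = (
    spark.read.parquet("s3://bucket/events/")    # hypothetical input
    .filter("event_type = 'purchase'")
    .groupBy("user_id")                          # wide dependency -> stage boundary
    .count()
)

# 3. An action triggers a job: the Driver schedules stages and tasks on executors.
df.write.mode("overwrite").parquet("s3://bucket/purchase_counts/")  # hypothetical sink

# 4. Stopping the session releases executors and cluster resources.
spark.stop()
```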
-
Client mode vs cluster mode: where does the Driver run and why does it matter?
In **client mode**, the Driver runs on the machine where you launched the Spark application (e.g., your laptop or an edge node). This is useful for interactive development and debugging. However, it’s susceptible to network issues, and if your machine fails, the entire job fails. In **cluster mode**, the Driver runs inside the cluster itself, managed by the Cluster Manager. This is the recommended mode for production because it is more resilient, as the driver is co-located with the executors, and logs are centralized within the cluster.
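Assuming a YARN cluster, the deploy mode is selected with the `--deploy-mode` flag on `spark-submit` (the application file name is illustrative):

```shell
# Client mode: Driver runs on the submitting machine (good for interactive debugging)
spark-submit --master yarn --deploy-mode client my_app.py

# Cluster mode: Driver runs inside the cluster (recommended for production)
spark-submit --master yarn --deploy-mode cluster my_app.py
```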
-
How do you size executors (cores/memory) and how many should you request?
The goal is to size executors to maximize parallel processing and minimize overhead. A good starting point is to allocate **4-8 cores per executor**. This provides enough parallelism without causing excessive garbage collection (GC) or I/O contention. You should give each executor enough memory to handle its tasks and avoid spilling data to disk, but not so much that it leads to long GC pauses. You request enough executors to utilize the total cores of your cluster effectively. For example, if you have 10 nodes with 16 cores each, you might request 20 executors with 8 cores each to keep all 160 cores busy.
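The sizing arithmetic can be sketched as a small helper. The one-core-per-node reservation for the OS and daemons is a common assumption, not a Spark requirement; adjust it for your cluster:

```python
def plan_executors(nodes, cores_per_node, cores_per_executor, reserved_cores_per_node=1):
    """Return (num_executors, total_used_cores) for a simple sizing plan."""
    usable_per_node = cores_per_node - reserved_cores_per_node
    executors_per_node = usable_per_node // cores_per_executor
    num_executors = nodes * executors_per_node
    return num_executors, num_executors * cores_per_executor

# The example from the text: 10 nodes x 16 cores, 8 cores per executor,
# nothing reserved -> 20 executors keeping all 160 cores busy.
print(plan_executors(10, 16, 8, reserved_cores_per_node=0))  # (20, 160)

# With 1 core per node reserved and 5-core executors:
print(plan_executors(10, 16, 5))  # (30, 150)
```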
-
Explain jobs, stages, and tasks. What creates a stage boundary?
A **job** is triggered by a single **action** in Spark (e.g., `count()`, `collect()`, `save()`). The Spark scheduler breaks a job into one or more **stages**, which are sets of parallel tasks that can be executed without data shuffling. A **stage boundary** is created by a **wide dependency**, such as a `groupByKey()` or a `join()`, which requires data to be redistributed across the network (a shuffle). Each stage is then composed of many **tasks**, with one task per data partition, which are the fundamental units of execution that run on the executors.
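A sketch of how one action maps to stages and tasks, assuming a DataFrame `df` already exists:

```python
# Narrow transformations (filter, select) stay within a single stage.
filtered = df.filter("amount > 0").select("user_id", "amount")

# groupBy introduces a wide dependency (a shuffle) and therefore a stage boundary.
totals = filtered.groupBy("user_id").sum("amount")

# The action triggers one job:
#   stage 1 (read + filter + map-side prep) -> shuffle -> stage 2 (aggregate)
# Each stage runs one task per partition.
totals.count()
```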
-
When would you enable dynamic allocation and what prerequisites are needed?
You would enable **dynamic allocation** when your workloads are **bursty or variable in size**, as it automatically scales the number of executors up or down based on demand. This helps reduce costs by not keeping idle resources running. The primary prerequisite is an **external shuffle service** (or, on Kubernetes, shuffle tracking). This ensures that shuffle data is preserved on the worker nodes even after an executor is removed, so the data is not lost and remains accessible to the surviving executors.
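A typical `spark-defaults.conf` fragment for dynamic allocation on YARN; the executor bounds are illustrative starting points:

```properties
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   2
spark.dynamicAllocation.maxExecutors   50
spark.shuffle.service.enabled          true
```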
-
How do you approach fault tolerance when executors frequently preempt (e.g., using spot instances)?
To handle frequent preemption from spot instances, you must design your application for **recomputation**. Ensure that all transformations are **deterministic** and that the source data is stored in durable storage. **Checkpointing** can also be used to truncate long lineage chains, reducing the amount of recomputation required. You should also make your tasks as short as possible by increasing parallelism, so if an executor is preempted, less work needs to be redone.
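Configuration can also be loosened to tolerate repeated task and stage failures; the values below are illustrative starting points, not recommendations for every workload:

```properties
spark.task.maxFailures               8
spark.stage.maxConsecutiveAttempts   10
spark.speculation                    true
```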
-
Compare DataFrames, Datasets, and RDDs. When would you choose each?
**RDDs** are the original low-level API: distributed collections of arbitrary objects with fine-grained control over partitioning, but no Catalyst optimization. **DataFrames** are collections of rows with a named schema; they benefit from Catalyst's query optimization and Tungsten's efficient execution, making them the default choice for most workloads. **Datasets** (available in Scala and Java) add compile-time type safety on top of DataFrames, at some optimization cost when typed lambdas are used. As a rule of thumb: prefer DataFrames, reach for Datasets when type safety matters, and drop down to RDDs only when you need low-level control such as custom partitioning.
-
Describe Structured Streaming's processing guarantees and how to design idempotent sinks.
Structured Streaming provides **at-least-once** processing guarantees by default, meaning a record is guaranteed to be processed, but may be processed more than once. To achieve **exactly-once** end-to-end, you need to use idempotent sinks. You can make sinks idempotent by performing **upsert operations** based on a natural key (e.g., `MERGE` into a Delta table) or by using transactional writes where you can uniquely identify and deduplicate records. The use of checkpointing is crucial for recovery, as it stores the stream’s progress and state.
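The upsert idea can be illustrated with a plain-Python stand-in for a keyed sink (the sink, records, and key name are hypothetical): replaying the same micro-batch after a failure leaves the sink unchanged, which is what makes at-least-once delivery safe.

```python
def upsert_batch(sink, batch, key="txn_id"):
    """Idempotent write: records are merged by natural key, so replays are harmless."""
    for record in batch:
        sink[record[key]] = record  # insert or overwrite by key
    return sink

sink = {}
batch = [{"txn_id": "t1", "amount": 10}, {"txn_id": "t2", "amount": 25}]

upsert_batch(sink, batch)
upsert_batch(sink, batch)  # replay after failure/recovery

print(len(sink))  # 2 -- no duplicates despite at-least-once delivery
```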
-
How do you reduce shuffle cost in Spark jobs at the design and tuning levels?
At the **design level**: filter and project early so less data reaches wide operations, pre-aggregate before joins, use **broadcast joins** so the small side of a join is shipped to executors instead of shuffling the large side, and bucket or co-partition datasets that are repeatedly joined on the same key. At the **tuning level**: set `spark.sql.shuffle.partitions` to match your data volume, enable **Adaptive Query Execution (AQE)** so Spark can coalesce small shuffle partitions and split skewed ones at runtime, and keep shuffle data compact with efficient serialization and compression.
-
Explain Catalyst and Tungsten in one minute.
**Catalyst** is Spark SQL’s optimizer. It builds and transforms query plans, enabling smart optimizations like predicate pushdown, constant folding, and join reordering. **Tungsten** is the execution engine that focuses on improving memory and CPU efficiency by using techniques like off-heap memory management and whole-stage code generation to reduce garbage collection and improve performance.
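You can watch both at work with `explain()`, assuming a DataFrame `df` with `age` and `name` columns:

```python
# extended=True prints the parsed, analyzed, and optimized logical plans
# plus the physical plan. Look for PushedFilters in the scan node
# (predicate pushdown) and WholeStageCodegen markers in the physical plan
# (Tungsten code generation).
df.filter("age > 21").select("name").explain(True)
```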
-
What monitoring tools and artifacts do you rely on in production?
In production, I rely on the **Spark UI** for real-time monitoring of active jobs, stages, and tasks. The **Spark History Server** provides persistent logs and UI access for completed jobs. For comprehensive metrics, I use **metrics sinks** like Prometheus/Grafana to collect and visualize key metrics such as CPU usage, memory consumption, and shuffle I/O. Finally, I use **Driver and Executor logs** to debug errors and identify specific issues.
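The History Server depends on event logging being enabled in the application; a minimal `spark-defaults.conf` fragment (the log directory is illustrative):

```properties
spark.eventLog.enabled          true
spark.eventLog.dir              hdfs:///spark-logs
spark.history.fs.logDirectory   hdfs:///spark-logs
```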
-
How would you implement real-time fraud detection with Structured Streaming?
I would implement real-time fraud detection by ingesting a stream of transactions from a source like **Kafka**. The Structured Streaming job would join this stream with a small, broadcasted table of reference data. I would then use **stateful operations** with **tumbling windows** and watermarks to aggregate transactions by user and time window. The aggregated features would then be used to score against a pre-trained **machine learning model**, and the results would be written to an operational data store with **idempotent upserts**. I would also monitor consumer lag to ensure the pipeline keeps up with the incoming data.
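A condensed sketch of the pipeline. The broker address, topic, checkpoint path, schema `txn_schema`, and the helper `score_batch` (model scoring plus idempotent upsert) are all assumptions for illustration:

```python
from pyspark.sql import functions as F

txns = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # illustrative broker
    .option("subscribe", "transactions")                # illustrative topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*")
)

# Windowed, watermarked aggregation of per-user features.
features = (
    txns.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "user_id")
    .agg(F.count("*").alias("txn_count"), F.sum("amount").alias("total_amount"))
)

# foreachBatch lets each micro-batch be scored and upserted idempotently.
query = (
    features.writeStream
    .foreachBatch(score_batch)   # hypothetical: apply model, MERGE into the store
    .option("checkpointLocation", "s3://bucket/checkpoints/fraud/")  # illustrative
    .start()
)
```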
-
What are common causes of OOM in Spark and how do you mitigate them?
Common causes of OutOfMemory (OOM) errors in Spark include **data skew** (a few partitions get too much data), **excessive caching**, large shuffles, and **unbounded state** in streaming jobs. To mitigate these: **rebalance partitions** to handle skew, set a proper **storage level** for cached data (e.g., `MEMORY_AND_DISK`), increase the number of shuffle partitions, and set state Time-To-Live (TTL) or watermarks in streaming applications.
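Two of these mitigations in code, assuming a DataFrame `df` exists; the partition count and key are illustrative:

```python
from pyspark import StorageLevel

# Cache with an explicit storage level that can spill to disk instead of OOMing.
df.persist(StorageLevel.MEMORY_AND_DISK)

# Rebalance skewed data before a wide operation.
df = df.repartition(400, "customer_id")   # illustrative count and key

# Give shuffles more, smaller partitions.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```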
-
Kubernetes vs YARN for Spark: trade-offs?
**YARN** is a mature, tightly integrated solution for the Hadoop ecosystem, offering strong data locality with HDFS. **Kubernetes** provides better resource isolation and is cloud-native, making it a good choice for multi-tenant environments. The choice depends on your organization’s infrastructure and operational expertise. Both are officially supported by Spark.
-
How do you secure a Spark application end-to-end?
To secure a Spark application, I would enforce **encryption in transit with TLS** for all communication between the Driver and Executors. I would also use **IAM roles** or **Kerberos** for authentication and implement **least privilege** access to data sources. Sensitive data would be **masked or tokenized** at ingestion, and I would enable **audit logs** on the data catalog to track all access. Secrets would be managed using a secrets vault and not hard-coded in the application.
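Representative Spark settings for authentication and in-transit protection; this is a starting point, not a complete hardening guide (SSL additionally requires keystore/truststore configuration):

```properties
spark.authenticate             true
spark.network.crypto.enabled   true
spark.ssl.enabled              true
spark.io.encryption.enabled    true
```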
