Apache Spark MasterClass Chapter 1 – Episode 4
-
MapReduce vs Spark: explain why disk I/O made Hadoop jobs slow and how Spark avoided this bottleneck.
Hadoop MapReduce is slow due to its **disk-heavy execution model**. It writes intermediate results to disk between each Map and Reduce phase, which is a major bottleneck, especially for multi-stage or iterative jobs. Spark avoids this by keeping intermediate data in **memory** whenever possible, only spilling to disk when necessary. This significantly reduces I/O operations and makes Spark much faster for a wide range of analytical workloads.
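The difference can be illustrated with a toy Python sketch (not real Spark or MapReduce code): one pipeline writes every intermediate result to disk and reads it back, the other keeps intermediates in memory, and we count the I/O operations each incurs.

```python
import json, os, tempfile

def mapreduce_style(records, stages):
    """Each stage writes its output to disk and the next stage reads it back,
    mimicking MapReduce's intermediate writes between Map and Reduce phases."""
    io_ops = 0
    for stage in stages:
        records = [stage(r) for r in records]
        path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
        with open(path, "w") as f:          # write intermediate result
            json.dump(records, f)
        io_ops += 1
        with open(path) as f:               # next stage re-reads it from disk
            records = json.load(f)
        io_ops += 1
    return records, io_ops

def spark_style(records, stages):
    """Intermediate results stay in memory; disk is never touched."""
    io_ops = 0
    for stage in stages:
        records = [stage(r) for r in records]
    return records, io_ops

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
out_mr, io_mr = mapreduce_style([1, 2, 3], stages)  # 6 disk operations
out_sp, io_sp = spark_style([1, 2, 3], stages)      # 0 disk operations
```

Both pipelines produce the same answer; only the I/O profile differs, which is exactly the gap that widens on multi-stage and iterative jobs.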
-
Define an RDD and explain how immutability and lineage deliver fault tolerance.
An **RDD** (Resilient Distributed Dataset) is a fundamental data structure in Spark, representing an **immutable, partitioned collection** of data distributed across a cluster. Because RDDs are immutable, any transformation (like `map` or `filter`) creates a *new* RDD without modifying the original. Spark also keeps a **lineage graph**, a directed acyclic graph (DAG) of all transformations needed to create an RDD from its source data. This lineage is the key to fault tolerance: if a partition of an RDD is lost due to an executor failure, Spark can **recompute only the missing partition** by re-executing the transformations in its lineage, without having to re-read the entire dataset or use expensive global checkpoints.
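A minimal Python sketch (the `ToyRDD` class is illustrative, not Spark's implementation) shows the two ideas together: transformations return new objects that remember their parent and function, and that recorded lineage is enough to rebuild a single lost partition.

```python
class ToyRDD:
    """Toy immutable RDD: each transformation records its parent and the
    function applied (the lineage) instead of mutating data in place."""
    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = partitions        # list of lists, one per partition
        self._parent, self._fn = parent, fn  # lineage pointer

    def map(self, fn):
        new_parts = [[fn(x) for x in p] for p in self._partitions]
        return ToyRDD(new_parts, parent=self, fn=fn)  # new RDD; original untouched

    def recompute_partition(self, i):
        """Rebuild one lost partition by replaying the lineage."""
        if self._parent is None:
            raise RuntimeError("source partition lost; would re-read from storage")
        source = self._parent._partitions[i]
        self._partitions[i] = [self._fn(x) for x in source]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled._partitions[1] = None                # simulate losing one partition
doubled.recompute_partition(1)               # only partition 1 is rebuilt
```

Note that only the failed partition is recomputed; the surviving partition and the source RDD are never touched.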
-
Give a practical example of when you would cache/persist an RDD and when you wouldn't.
You would **cache** an RDD when you need to perform multiple actions on the same dataset. For instance, if you load a large dataset of customer orders and want to run three different analyses on it—one for total revenue, one for top-selling products, and one for geographical sales distribution—caching the initial RDD saves Spark from re-reading and re-computing the dataset from the source for each analysis. You **wouldn’t cache** an RDD if you only perform a single action on it. Caching uses valuable memory, and if that memory is only used for a one-time operation, it would be a wasted resource that could have been used by other tasks.
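The orders example can be sketched in plain Python (the `load_orders` function and its data are made up for illustration): an expensive "source read" is counted, and caching means it happens once instead of once per analysis.

```python
load_count = 0

def load_orders():
    """Stand-in for an expensive source read (scan, parse, etc.)."""
    global load_count
    load_count += 1
    return [
        {"product": "widget", "region": "EU", "revenue": 10.0},
        {"product": "gadget", "region": "US", "revenue": 25.0},
        {"product": "widget", "region": "US", "revenue": 10.0},
    ]

# Without caching: each of the three analyses re-triggers the load.
for _ in range(3):
    load_orders()
uncached_reads = load_count                  # 3 reads

# With "caching": load once, reuse the in-memory copy for all three analyses
# (analogous to rdd.cache() followed by the first action in Spark).
load_count = 0
orders = load_orders()
total_revenue = sum(o["revenue"] for o in orders)
top_product = max({o["product"] for o in orders},
                  key=lambda p: sum(o["revenue"] for o in orders if o["product"] == p))
by_region = {r: sum(o["revenue"] for o in orders if o["region"] == r)
             for r in {o["region"] for o in orders}}
cached_reads = load_count                    # 1 read
```

If only one of those analyses were needed, the cached copy would buy nothing, which is the "wouldn't cache" case.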
-
Explain lazy evaluation in Spark and its benefits.
**Lazy evaluation** is a core Spark principle: transformations are not executed immediately. Instead, Spark builds a **logical plan**, a DAG of transformations, and defers execution until an **action** (like `count`, `collect`, or `save`) is called. The benefits of this approach are:
- **Whole-plan optimization:** because Spark sees the complete DAG before running anything, it can pipeline narrow transformations into single stages and apply optimizations such as predicate pushdown.
- **Less wasted work:** only the computations actually required by the action are executed.
- **Fewer materializations:** intermediate results need not be written out between steps.
- **Fault tolerance:** the recorded plan doubles as the lineage used to recompute lost partitions.
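A toy Python class (illustrative only, not Spark's internals) makes the deferral concrete: `map` and `filter` merely append to a recorded plan, and nothing runs until the `collect` action replays it.

```python
class LazyDataset:
    """Toy lazy pipeline: transformations only record themselves;
    nothing executes until an action (here, collect) is called."""
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []              # the recorded "logical plan"

    def map(self, fn):
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self._plan + [("filter", pred)])

    def collect(self):                       # the action: the plan runs now
        out = self._data
        for op, fn in self._plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; the plan just holds two deferred steps.
result = ds.collect()                        # execution happens here
```

Because the whole plan is visible before execution, a real engine can rewrite it, for example fusing the map and filter into a single pass.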
-
What are predicate pushdown and partition pruning? How do they reduce work?
**Predicate pushdown** is an optimization that pushes filters (e.g., `WHERE` clauses) down to the data source. Instead of reading all data and then filtering it in Spark, the data source itself applies the filter, so Spark only receives the required rows and columns. **Partition pruning** is a similar optimization that skips entire data partitions based on filtering conditions on the partition columns. For example, if you query `WHERE date = '2025-08-01'`, Spark can ignore all other date partitions. Both techniques **reduce the amount of data read from disk or network**, which significantly speeds up job execution and reduces I/O.
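Both ideas fit in a small Python sketch (the date-keyed `partitions` dict stands in for a `.../date=2025-08-01/` directory layout; the data is invented): pruning skips whole partitions, and pushdown filters rows as they are read, so we can count exactly how much scanning is avoided.

```python
# Data laid out by partition column (date), one entry per partition.
partitions = {
    "2025-07-31": [("u1", 3), ("u2", 5)],
    "2025-08-01": [("u3", 7), ("u4", 1)],
    "2025-08-02": [("u5", 2)],
}

def query(partitions, date, min_clicks):
    """Partition pruning: only the matching date partition is scanned.
    Predicate pushdown: the min_clicks filter is applied while reading,
    so failing rows never reach the 'engine' at all."""
    scanned = 0
    rows = []
    for part_date, part_rows in partitions.items():
        if part_date != date:                # prune: skip the whole partition
            continue
        for user, clicks in part_rows:
            scanned += 1
            if clicks >= min_clicks:         # pushdown: filter at the source
                rows.append((user, clicks))
    return rows, scanned

rows, scanned = query(partitions, "2025-08-01", min_clicks=5)
# Only 2 of the 5 stored rows are scanned; 1 row survives the filter.
```

Without pruning, all five rows would be read before any filtering happened.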
-
How do partitions influence performance? Give guidance on sizing and count.
Partitions are the fundamental units of parallelism in Spark. The number of partitions directly influences how many tasks can run in parallel. **Too few partitions** can underutilize a large cluster, while **too many partitions** introduce excessive scheduling overhead. The general guidance is to aim for **2-4 partitions per CPU core** in your cluster. This provides enough parallelism to keep cores busy while minimizing task scheduling overhead. The size of each partition should be large enough to amortize overhead but not so large that it leads to long-running tasks or memory issues.
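The rule of thumb can be written as a small helper (a sketch of the heuristic, not an official Spark formula; the 128 MB target is a common default that matches the HDFS block size):

```python
def suggested_partitions(total_cores, partitions_per_core=3,
                         data_size_mb=None, target_mb=128):
    """Rule-of-thumb sketch: 2-4 partitions per core, but never so few
    that individual partitions blow past a target size (~128 MB)."""
    by_cores = total_cores * partitions_per_core
    if data_size_mb is None:
        return by_cores
    by_size = -(-data_size_mb // target_mb)  # ceiling division
    return max(by_cores, by_size)

# 16-core cluster with a 10 GB input: the size constraint dominates.
n = suggested_partitions(16, data_size_mb=10 * 1024)   # 80 partitions
```

Taking the max of the two constraints keeps all cores busy on small inputs while capping per-partition size on large ones.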
-
Describe a workflow that is inefficient in MapReduce but efficient in Spark, and why.
A common workflow that is inefficient in MapReduce is running a **multi-stage analytical pipeline on the same dataset**. For example, if you need to calculate unique users, average session length, and top items from a large clickstream log file, a MapReduce approach would require three separate jobs. Each job would need to re-read the entire input from HDFS, perform its computation, and write its own output. Spark’s approach is much more efficient. It can load the data once, **cache it in memory**, and then run the three different analyses as separate actions on the same cached RDD, eliminating the need for redundant I/O and intermediate disk writes.
-
How does Spark recover from executor failures without losing results?
Spark recovers from executor failures by leveraging its **RDD lineage graph**. When an executor fails, Spark’s driver knows exactly which RDD partitions were lost. Since the RDDs are immutable and the lineage graph contains all the transformations needed to re-create the data, Spark can **re-run only the necessary transformations on a new executor** to re-compute the lost partitions. This approach is highly efficient because it avoids re-computing the entire job and does not require costly, global checkpoints.
-
What operational complexities did Hadoop introduce that Spark reduced?
Hadoop introduced operational complexities by requiring separate, specialized engines and APIs for different types of workloads: MapReduce for batch, Hive for SQL, and various other tools for streaming and machine learning. Spark reduced this complexity by providing a **unified runtime and a single set of APIs** for all these workloads. This consolidation means developers can use a single framework and a common set of tools for a wide range of data tasks, reducing tool sprawl, integration overhead, and the learning curve.
-
When might classic Hadoop MapReduce still be a sensible choice?
Classic Hadoop MapReduce might still be a sensible choice in scenarios where:
- the job is a **simple, single-pass batch transformation** (e.g., a one-shot ETL scan) where in-memory reuse offers little benefit;
- the cluster is **memory-constrained**, making MapReduce's disk-based execution more predictable than Spark under memory pressure;
- a **stable, well-tested legacy pipeline** already exists and the cost of migration outweighs the performance gain;
- the surrounding tooling and operational expertise are built around the classic Hadoop stack.
-
Explain the role of YARN in the Hadoop ecosystem and Spark's relationship to it.
**YARN** (Yet Another Resource Negotiator) is the resource manager and scheduler for the Hadoop ecosystem. It’s responsible for allocating compute resources (containers) and managing the lifecycle of applications submitted to the cluster. Spark’s relationship with YARN is symbiotic. Spark can run as a client on YARN, **leveraging YARN for resource allocation** and cluster management, while Spark itself handles its own internal job scheduling and DAG execution. This allows Spark to coexist with and share resources with other frameworks in a Hadoop environment.
-
How do you avoid unnecessary shuffles in Spark when building RDD pipelines?
A **shuffle** is the data redistribution triggered by wide transformations: records move across the network between executors, which is very expensive. To avoid unnecessary shuffles, you can:
- prefer `reduceByKey` or `aggregateByKey` over `groupByKey`, since they combine values map-side before anything crosses the network;
- **filter and project early**, so less data reaches any wide transformation;
- pre-partition data with `partitionBy` and reuse that partitioner across repeated operations on the same keys;
- use **broadcast joins** when one side is small enough to ship to every executor, instead of shuffling both sides;
- use `coalesce` rather than `repartition` when merely reducing the partition count, since it avoids a full shuffle.
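The map-side-combine point can be demonstrated with a toy Python word-count (a simulation of the idea, not Spark code): we count how many records would cross the network with and without pre-combining values per key within each partition.

```python
from collections import defaultdict

def shuffle_cost_groupbykey(partitions):
    """groupByKey-style: every (key, value) record crosses the network."""
    return sum(len(p) for p in partitions)

def shuffle_cost_reducebykey(partitions):
    """reduceByKey-style: values are pre-combined per key within each
    partition, so at most one record per key leaves each partition."""
    shuffled = 0
    for p in partitions:
        local = defaultdict(int)
        for key, value in p:                 # map-side combine
            local[key] += value
        shuffled += len(local)
    return shuffled

parts = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]
naive = shuffle_cost_groupbykey(parts)       # 7 records shuffled
combined = shuffle_cost_reducebykey(parts)   # 4 records shuffled
```

The gap grows with the number of duplicate keys per partition, which is why `reduceByKey` is the default recommendation for aggregations.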
-
What is the downside of indiscriminate caching and how do you manage memory pressure?
The downside of indiscriminate caching is that it can lead to **memory pressure**, causing Spark to spill cached data to disk or even evict other cached data. This can defeat the purpose of caching and significantly degrade performance. To manage memory pressure, you should **cache only the RDDs that will be reused multiple times**; select an appropriate storage level (`MEMORY_ONLY`, `MEMORY_AND_DISK`, etc.); and explicitly **unpersist** RDDs when they are no longer needed.
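A bounded-cache sketch in plain Python (the `ToyCache` class is invented for illustration; real storage levels are richer) shows the two levers mentioned above: a disk fallback in the spirit of `MEMORY_AND_DISK`, and explicit `unpersist` to release memory.

```python
class ToyCache:
    """Sketch of a bounded cache: persisting past the memory budget either
    spills to 'disk' (like MEMORY_AND_DISK) or is dropped (like MEMORY_ONLY,
    where the data would simply be recomputed on next use)."""
    def __init__(self, memory_slots):
        self.memory, self.disk = {}, {}
        self.memory_slots = memory_slots

    def persist(self, name, data, use_disk=True):
        if len(self.memory) < self.memory_slots:
            self.memory[name] = data
        elif use_disk:
            self.disk[name] = data           # slower to read, but still cached
        # else: not cached at all; recomputed from lineage when needed

    def unpersist(self, name):
        """Free the slot as soon as the dataset is no longer reused."""
        self.memory.pop(name, None)
        self.disk.pop(name, None)

cache = ToyCache(memory_slots=1)
cache.persist("orders", [1, 2, 3])           # fits in memory
cache.persist("clicks", [4, 5, 6])           # memory full -> spills to disk
cache.unpersist("orders")                    # release memory for later work
```

The point of the sketch is the failure mode: once the budget is exceeded, every additional cached dataset is either slow (disk) or useless (dropped), so caching everything helps nothing.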
-
Relate the Daytona GraySort example to Spark's design goals.
The **Daytona GraySort** benchmark, in which Spark sorted 100 TB of data roughly three times faster than the previous Hadoop MapReduce record while using about a tenth of the machines, is a perfect illustration of Spark's design goals. It highlighted Spark's **efficient in-memory processing**, its **optimized DAG execution engine**, and its ability to drastically **reduce disk I/O** compared to Hadoop's rigid MapReduce model. The result demonstrated that Spark could achieve faster, more resource-efficient large-scale processing, which was its primary objective.
-
How do RDDs enable running multiple analytics questions on the same dataset efficiently?
RDDs enable efficient multi-question analytics by providing a **shared, in-memory representation** of the data. Instead of re-reading the source data for each new question, you can load the data once into an RDD, apply transformations, and then **trigger multiple actions** (each representing an analytical question) on that single RDD. By caching the RDD, Spark can reuse the in-memory data, which drastically reduces I/O and improves end-to-end runtime compared to running multiple, independent jobs that each start from scratch.
