Apache Spark MasterClass Chapter 1 – Episode 3

  1. Contrast the execution models of MapReduce and Apache Spark. Where does each excel?

    MapReduce executes in coarse-grained phases (map -> shuffle -> reduce) and persists intermediate results to local disk between phases, which improves durability and simplifies retries but incurs I/O overhead. It excels in very large, batch-oriented workloads, stable nightly ETL, and environments where failure recovery and simplicity are prioritized. Spark represents computations as DAGs of transformations and tries to keep intermediate data in memory, writing to disk only when needed (e.g., spills, checkpoints, wide shuffles). Spark excels at iterative algorithms, interactive analytics, and pipelines with multiple dependent transformations, delivering lower latency and higher developer productivity via higher-level APIs (DataFrames, SQL, RDDs).
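
    The contrast can be sketched in plain Python (an analogy only, not real Spark or Hadoop code): MapReduce-style execution fully materializes each phase's output before the next phase reads it, while Spark-style execution chains lazy transformations and evaluates them in a single pass when an action runs.

```python
# Pure-Python analogy: staged materialization vs. lazy chained transformations.
data = list(range(10))

# MapReduce style: each phase fully materializes its output (think: local disk)
stage1 = [x * 2 for x in data]           # "map" output written out in full
stage2 = [x for x in stage1 if x > 5]    # next phase re-reads stage1

# Spark style: a chain of lazy transformations, nothing computed yet
lazy = (x for x in (x * 2 for x in data) if x > 5)
result = list(lazy)                      # a single "action" triggers execution

assert stage2 == result                  # same answer, different execution model
```

    The intermediate list `stage1` is the analogue of MapReduce's persisted intermediate files; the generator chain is the analogue of Spark's DAG, which only exists as a plan until an action forces evaluation.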

  2. Explain the shuffle phase. What makes it expensive and how can you mitigate that cost?

    Shuffle redistributes data across the network so that all values for the same key end up on the same reducer (or Spark task). It’s expensive due to network I/O, sorting, serialization/deserialization, and disk spills. Mitigations: Use combiners/map-side aggregations to reduce data volume early; choose good keys to balance partitions and avoid skew; compress map outputs; increase memory to limit spills; tune the number of reducers/partitions; leverage map-side joins/broadcast joins when one side is small; use efficient encodings (e.g., Parquet/Avro) and avoid oversized objects.
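
    The map-side aggregation idea can be shown with a small sketch (hypothetical data; `Counter` stands in for a combiner): summing counts locally per map task shrinks the number of records that cross the network, while the final reduced totals are unchanged.

```python
from collections import Counter

# Hypothetical mapper outputs: (word, 1) pairs from two map tasks
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# Without a combiner, every pair crosses the "network"
shuffled_naive = [pair for task in map_outputs for pair in task]

# With a combiner: sum locally per map task before shuffling
combined = []
for task in map_outputs:
    local = Counter()
    for word, count in task:
        local[word] += count
    combined.extend(local.items())

assert len(combined) < len(shuffled_naive)   # fewer records shuffled

# Final reduce yields identical totals either way
reduce_naive, reduce_comb = Counter(), Counter()
for w, c in shuffled_naive:
    reduce_naive[w] += c
for w, c in combined:
    reduce_comb[w] += c
assert reduce_naive == reduce_comb
```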

  3. Design a MapReduce job for word count and extend it to compute the top-K most frequent words.

    Word count: Mapper emits (word, 1) per word; optional combiner sums local counts; reducer aggregates counts per word and writes (word, total). Top-K: After reducers compute (word, total), perform a second stage that groups results into a single reducer (or a small set) keeping a size-K min-heap to track the top-K counts. Alternatively, use map-side partial top-K and merge in the reducer to reduce data volume.
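
    The two stages above can be sketched in pure Python (a single-process simulation, not a distributed job): the mapper emits (word, 1) pairs, the reducer aggregates per key, and `heapq.nlargest` plays the role of the size-K min-heap in the second stage.

```python
import heapq
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input record
    for word in line.split():
        yield (word, 1)

def reduce_counts(pairs):
    # Aggregate counts per word (the reducer's job)
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return totals

lines = ["spark spark hadoop", "hadoop spark mapreduce"]
pairs = [p for line in lines for p in mapper(line)]
totals = reduce_counts(pairs)

# Second stage: top-K selection; nlargest maintains a K-sized heap internally
top2 = heapq.nlargest(2, totals.items(), key=lambda kv: kv[1])
assert top2 == [("spark", 3), ("hadoop", 2)]
```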

  4. What is data skew in MapReduce/Spark and how do you handle it?

    Skew occurs when a few keys receive disproportionately large amounts of data, creating straggler tasks and elongated job times. Handling: custom partitioners to spread hot keys; salting keys (appending a suffix to a hot key so its records spread across several reducers, then stripping the salt and merging the partial aggregates in a second stage); two-stage aggregation; isolating hot keys into dedicated tasks; broadcast/map-side joins when the skewed key appears in a join; and in Spark, Adaptive Query Execution (AQE), which can detect and split skewed shuffle partitions at runtime.
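
    Salting can be illustrated with a small sketch (hypothetical data; the salt here is deterministic for reproducibility, where a real job would typically use a random suffix): the hot key is split across several partial aggregates, then the salt is stripped and the partials merged.

```python
from collections import Counter

SALTS = 3  # hypothetical number of shards for the hot key
records = [("hot", 1)] * 9 + [("cold", 1)] * 2

# Stage 1: salt the hot key so its records spread across SALTS aggregates
partial = Counter()
for i, (key, value) in enumerate(records):
    salted = f"{key}#{i % SALTS}" if key == "hot" else key
    partial[salted] += value

# The hot key's load is now split across multiple "reducers"
assert sum(1 for k in partial if k.startswith("hot#")) == 3

# Stage 2: strip the salt and merge the partial aggregates
final = Counter()
for salted, value in partial.items():
    final[salted.split("#")[0]] += value

assert final == Counter({"hot": 9, "cold": 2})
```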

  5. How does MapReduce achieve fault tolerance during map and reduce phases?

    Mapper outputs are written to local disk. If a map task fails, the master reschedules it on another worker, which recomputes the output from the original input split; because completed map outputs live on the failed node's local disk, losing a node means its completed map tasks must also be re-executed. If a reduce task fails, it is restarted and re-fetches the intermediate data from the mappers' local disks. Persisted intermediates plus deterministic tasks let the job make progress despite failures.

  6. What is speculative execution and when would you enable it?

    Speculative execution launches duplicate attempts of slow (straggling) tasks on other workers; whichever finishes first commits. It’s useful when outliers are due to transient node issues. Avoid it if tasks are non-idempotent or if duplicated side effects are costly.
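
    A minimal sketch of the race, using threads (the delays and task names are illustrative; real frameworks decide when to speculate based on progress statistics): two attempts of the same task run, and whichever finishes first is committed.

```python
import concurrent.futures as cf
import time

def attempt(delay, label):
    # Stand-in for one task attempt; delay models a straggling node
    time.sleep(delay)
    return label

# Launch the original (slow) attempt and a speculative duplicate;
# commit whichever finishes first and ignore the loser.
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(attempt, 0.5, "original"),
               pool.submit(attempt, 0.05, "speculative")]
    first = next(cf.as_completed(futures))
    winner = first.result()

assert winner == "speculative"
```

    In a real system both attempts compute the same idempotent output; the labels here only show which copy won the race, and they hint at why non-idempotent tasks make speculation unsafe.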

  7. What is a combiner function? When is it safe and effective to use?

    A combiner performs local aggregation on mapper outputs before shuffling, reducing data sent over the network. It is safe and effective only for associative and commutative operations (e.g., sum, count, min/max). It may be invoked 0, 1, or multiple times; the algorithm must not depend on invocation count.
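
    A short numeric check makes the safety condition concrete: summing partial sums is correct, but averaging partial averages is not, which is why a mean must be combined as (sum, count) pairs.

```python
# Sum is associative and commutative, so a combiner is safe:
values = [1, 2, 3, 4]
assert sum([sum(values[:2]), sum(values[2:])]) == sum(values)

# Plain averaging is NOT: averaging partial averages gives the wrong answer
def mean(xs):
    return sum(xs) / len(xs)

partial_means = [mean([1, 2, 3]), mean([4])]       # 2.0 and 4.0
assert mean(partial_means) != mean([1, 2, 3, 4])   # 3.0 vs the true 2.5

# Fix: have the combiner emit (sum, count) pairs, which DO combine safely
pairs = [(sum([1, 2, 3]), 3), (sum([4]), 1)]
total, count = map(sum, zip(*pairs))
assert total / count == 2.5
```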

  8. How do partitioners influence MapReduce performance and correctness?

    Partitioners determine which reducer receives a given key (e.g., default hash partitioner). A good partitioner balances load across reducers and preserves correctness by ensuring all values for a key go to the same reducer. Custom partitioners can group related keys or split hot keys to mitigate skew (with post-processing to merge results if needed).
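
    A toy hash partitioner shows the correctness property (the `partition` function and reducer count are illustrative; `zlib.crc32` is used because, unlike Python's built-in `hash()` for strings, it is stable across runs):

```python
import zlib

NUM_REDUCERS = 4

def partition(key, num_reducers=NUM_REDUCERS):
    # Deterministic hash partitioner: same key -> same reducer, every time
    return zlib.crc32(key.encode()) % num_reducers

# Determinism is what guarantees correctness
assert partition("user42") == partition("user42")

# Route records: all values for a key land in exactly one bucket
records = [("a", 1), ("b", 2), ("a", 3)]
buckets = {}
for key, value in records:
    buckets.setdefault(partition(key), []).append((key, value))

owners = {p for p, kvs in buckets.items() if any(k == "a" for k, _ in kvs)}
assert len(owners) == 1   # every "a" record went to the same reducer
```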

  9. How do you choose input split size and what is its effect on performance?

    Split size affects parallelism and overhead. Small splits increase parallelism but add scheduling overhead; very large splits reduce overhead but may cause long tasks and uneven load. A common approach aligns splits with the underlying block size (e.g., HDFS block size) while tuning for workload characteristics (compression, record size).
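
    The parallelism trade-off is simple arithmetic; with illustrative numbers (a 10 GB input on 128 MB blocks, the common HDFS default), the split size directly sets the map task count:

```python
import math

# Hypothetical input: 10 GB of data on 128 MB HDFS blocks
file_size = 10 * 1024**3
block_size = 128 * 1024**2

num_splits = math.ceil(file_size / block_size)   # one map task per split
assert num_splits == 80

# Halving the split size doubles parallelism -- and scheduling overhead
assert math.ceil(file_size / (block_size // 2)) == 160
```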

  10. What serialization formats would you consider for MapReduce pipelines and why?

    For MapReduce, SequenceFile, Avro, or Parquet are common. Avro provides schema evolution and compact binary serialization; SequenceFile is simple and splittable; Parquet is columnar, efficient for analytical reads (more common in Spark). Choose formats that are splittable, compressible, and support your downstream engines and schema evolution needs.

  11. When would you still choose classic MapReduce over Spark?

    When the environment lacks sufficient memory or stable Spark infrastructure; when workloads are massive batch jobs with simple transformations where disk-based robustness is preferred; when existing MapReduce code is hardened and migration risk is high; or when operational simplicity and predictable recovery outweigh latency needs.

  12. How would you migrate a legacy MapReduce ETL job to Spark while controlling risk?

    Start by replicating outputs in a staging environment and validating with data diffs. Replace mappers/reducers with DataFrame/Spark SQL transformations; validate partitioning and ordering semantics, especially around shuffles. Introduce checkpoints and lineage. Run both pipelines in parallel (shadow mode), compare metrics and costs, then cut over gradually with clear rollback plans.

  13. Explain write amplification in MapReduce and tuning techniques to reduce it.

    Write amplification occurs when data is written multiple times (map output, shuffle spills, reduce output). Reduce it by enabling map output compression, using combiners, tuning buffer sizes to limit spills, compacting outputs, and minimizing unnecessary intermediate stages. In Spark, caching and avoiding redundant wide shuffles further mitigate writes.

  14. What is a map-side join and when is it appropriate?

    A map-side join joins a large dataset with a much smaller one by distributing (broadcasting) the small dataset to mappers so each mapper can join locally, avoiding a full shuffle. It is appropriate when the small dataset fits in memory per mapper and can be reliably distributed (e.g., via distributed cache or broadcast variables in Spark).
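
    The mechanism reduces to a local hash lookup; in this sketch (hypothetical data) a small dict plays the broadcast table and each list plays one mapper's partition of the large side:

```python
# The small side is "broadcast" as an in-memory lookup table; each mapper
# joins its own partition locally, so the large side is never shuffled.
small = {"US": "United States", "DE": "Germany"}   # broadcast table

large_partitions = [
    [("US", 10), ("DE", 5)],
    [("US", 7), ("FR", 3)],
]

joined = []
for partition in large_partitions:   # each mapper works independently
    for code, amount in partition:
        if code in small:            # inner join; unmatched "FR" is dropped
            joined.append((code, small[code], amount))

assert ("DE", "Germany", 5) in joined
assert all(code != "FR" for code, _, _ in joined)
```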

  15. Walk through the MapReduce architecture lifecycle from job submission to final output.

    The user program submits a job; a master (e.g., JobTracker/YARN ApplicationMaster) schedules map tasks on workers against input splits. Mappers process records, emit (key, value) pairs, and write local intermediate files (optionally with combiners). Reducers remotely fetch and merge map outputs (shuffle and sort), aggregate by key, and write final results to the distributed filesystem. The master tracks task states and retries failures; job completion is signaled once all reduce tasks finish writing outputs.
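
    The whole lifecycle fits in a toy single-process simulation (all names and the partitioner are illustrative): inputs are split, mappers write partitioned local output, each reducer fetches and sorts its partition from every mapper, then reduces by key.

```python
from collections import defaultdict

splits = [["spark hadoop"], ["spark spark"]]       # input splits
NUM_REDUCERS = 2

def hash_p(key):
    return sum(key.encode()) % NUM_REDUCERS        # toy partitioner

def mapper(records):
    out = defaultdict(list)                        # per-reducer "local files"
    for line in records:
        for word in line.split():
            out[hash_p(word)].append((word, 1))
    return out

# Map phase: each mapper writes partitioned local output
map_files = [mapper(split) for split in splits]

# Shuffle + sort + reduce: each reducer fetches its partition from every mapper
final = {}
for r in range(NUM_REDUCERS):
    fetched = sorted(p for files in map_files for p in files.get(r, []))
    totals = defaultdict(int)
    for word, n in fetched:                        # reduce by key
        totals[word] += n
    final.update(totals)

assert final == {"spark": 3, "hadoop": 1}          # final output "written out"
```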
