Apache Spark MasterClass Chapter 3 – Episode 5

  1. Walk through the lifecycle from a user calling `count()` to tasks finishing on executors.

    The driver receives the action, the query plan is optimized into a DAG, and the DAGScheduler splits it into stages at shuffle boundaries. The TaskScheduler creates tasks per partition and assigns them to executors. Executors run tasks (one per core), spill/shuffle as needed, and return results. The driver aggregates stage results, marks the job complete, and the UI reflects stages/tasks timeline and metrics.

  2. How do narrow vs. wide transformations influence stage construction and performance?

    Narrow transformations (map/filter) can be pipelined within a stage because each output partition depends on one input partition. Wide transformations (groupBy/orderBy/join) require shuffles, forcing stage boundaries and expensive disk/network I/O. Minimizing wide ops, pre-partitioning data, and leveraging broadcast joins mitigate costs.

  3. Given that the M&M dataset includes a numeric `Count` field, when should you use `sum('Count')` vs. `count('Count')`?

    `count('Count')` counts the non-null rows in each (State, Color) group; it answers "how many records". `sum('Count')` adds the numeric `Count` values per group; it answers "how many M&Ms". Choose based on the question you are actually asking; many public examples demonstrate both patterns.

  4. Explain Spark’s lazy evaluation benefits and how to debug a plan before running it.

    Lazy evaluation lets Catalyst reorder/push filters and combine stages for efficiency. Use `explain(true)` to inspect logical/physical plans, check for shuffles (`Exchange`), and ensure predicate pushdown. Trigger with small actions like `show(5)` instead of `collect()` during development.

  5. What causes task failures and how do retries work?

    Common causes: OOM, fetch failures during shuffle, bad UDF logic, corrupted input. Spark retries failed tasks (configurable) possibly on different executors; persistent failures can fail a stage. Examine UI logs, adjust memory, fix code/data, or increase `spark.task.maxFailures` judiciously.

  6. Compare client vs. cluster deploy modes for `spark-submit`. When is each appropriate?

    Client: the driver runs on the submission machine; good for interactive use where logs/UI are local. Cluster: the driver runs on a worker node; preferred for production where the submitter may disconnect. Choice affects network paths to data sources and driver memory sizing.

  7. How would you size executors and partitions for a balanced job?

    Aim for roughly 2–5 cores per executor (very large executors suffer GC pressure and contention), allocate memory to cover data plus shuffle buffers, and set partition counts so that typical tasks run for tens of seconds. Tune `spark.sql.shuffle.partitions` or use `repartition`/`coalesce`, and validate via the UI’s task durations and skew metrics.
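The partition-count arithmetic can be sketched without Spark at all (the 128 MB per-partition target and the core count below are illustrative assumptions, not Spark defaults):

```python
def suggest_partitions(input_bytes, target_partition_bytes=128 * 1024 * 1024,
                       total_cores=None):
    """Ceil-divide the input size by the per-partition target, then round up
    to a multiple of the total core count so every scheduling wave is full."""
    parts = max(1, -(-input_bytes // target_partition_bytes))  # ceiling division
    if total_cores:
        parts = -(-parts // total_cores) * total_cores
    return parts

# 10 GB of input on a cluster with 40 total cores:
print(suggest_partitions(10 * 1024**3, total_cores=40))  # → 80
```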

  8. Which Spark UI tabs do you use to diagnose slowdowns in the MnM job?

    Stages (to inspect shuffle boundaries and time), Storage (to verify caching), SQL tab (physical plans), and Executors (GC time, task failures). Look for skewed tasks, long shuffle reads/writes, and spilled memory.

  9. What is a broadcast join and when does it help?

    Broadcast join sends a small table to all executors to avoid shuffling the large table. It’s effective when one side fits under `spark.sql.autoBroadcastJoinThreshold` (or is explicitly broadcast), reducing network I/O and stages.

  10. Describe the role of the driver vs. executors and failure implications.

    The driver holds the SparkContext, schedules jobs, and aggregates results; executors run tasks and store data/cache. Executor failure is recoverable via task retry; driver failure aborts the application unless supervised (e.g., cluster manager restarts).

  11. How do you make a Spark application’s configuration portable across environments?

    Externalize configuration via command-line arguments, files, or environment variables; avoid hard-coded paths and credentials; and use per-environment profiles. Keep Spark/Scala versions consistent, and package dependencies with Maven Shade or sbt-assembly to avoid classpath conflicts.

  12. What’s the difference between caching and checkpointing? When would you use each?

    Caching keeps computed partitions in memory/disk for reuse; it preserves lineage. Checkpointing truncates lineage by materializing to reliable storage, useful for very long DAGs or iterative algorithms to prevent stack/lineage blowup.

  13. How can you validate that your grouping and aggregation logic is correct for the MnM example?

    Start with a tiny deterministic CSV and hand-compute the expected results; assert equality. Compare `count('Count')` vs. `sum('Count')` to confirm the intended semantics. Use `explain()` to verify that only the necessary shuffles occur, and test filters for a single state such as CA.

  14. What pitfalls exist when relying on `inferSchema` for CSV sources?

    Inference can mis-type columns or be expensive on large files; types may drift between runs. Prefer explicit schemas for stability and performance, especially for numeric aggregations.

  15. Outline how you’d package and submit the Scala MnM app to a cluster.

    Define dependencies in `build.sbt` and run `sbt clean package` (or `sbt assembly` with the sbt-assembly plugin to build a fat JAR). Submit with `spark-submit --class main.scala.chapter2.MnMcount`, set the master and deploy mode, and tune executor resources. Verify logs and the UI on the driver, and configure retries/monitoring.
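A minimal `build.sbt` along those lines might look like this (a sketch with illustrative version numbers, not taken from the book's repo; marking Spark "provided" keeps it out of the assembled JAR because the cluster supplies it):

```scala
// build.sbt -- illustrative sketch
name := "mnmcount"
version := "1.0"
scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1" % "provided"
```

With the sbt-assembly plugin enabled in `project/plugins.sbt`, `sbt assembly` then produces the fat JAR that `spark-submit` takes.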
