Apache Spark MasterClass Chapter 3 – Episode 6
-
Explain lazy evaluation in the context of `sc.textFile` and why a path error may only surface on an action.
`sc.textFile(path)` returns an RDD whose lineage references the file; no I/O occurs yet. Only when you call an action (e.g., `count`, `first`, `collect`) does Spark schedule jobs, contact the filesystem, and materialize partitions. If the path is wrong, the failure surfaces at that time, not at RDD creation.
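A minimal sketch of this behavior, assuming a running `SparkContext` named `sc` (the path is hypothetical):

```scala
// No I/O happens on these two lines, even if the path is wrong.
val lines = sc.textFile("/data/app.log")   // records lineage only
val upper = lines.map(_.toUpperCase)       // still lazy

// The action schedules a job and touches the filesystem; a bad path
// fails here, not at RDD creation above.
val n = upper.count()
```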
-
Compare `cache()`/`persist()` and when to use them during log analysis.
`cache()` is shorthand for `persist(MEMORY_ONLY)`. `persist` lets you select levels (e.g., MEMORY_AND_DISK). Use them when the same RDD/DataFrame will be reused across multiple actions (e.g., filter counts, longest-line, word stats). Avoid caching one-off datasets to save memory and prevent eviction thrash.
-
Contrast `reduceByKey` with `countByKey` in the example.
`reduceByKey` is a distributed transformation that combines values per key using a function (e.g., `_ + _`), staying as an RDD. `countByKey` is an action that returns a driver-side `Map[K,Long]`. For large key spaces prefer `reduceByKey` or `mapValues` + `reduceByKey` to avoid driver OOM.
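A sketch of the contrast, assuming an `RDD[(String, Int)]` named `pairs` (the name is illustrative):

```scala
// Transformation: result stays distributed as an RDD.
val counts: org.apache.spark.rdd.RDD[(String, Int)] =
  pairs.reduceByKey(_ + _)

// Action: the entire map of key -> count lands on the driver.
// Safe only when the number of distinct keys is small.
val local: scala.collection.Map[String, Long] =
  pairs.countByKey()
```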
-
What pitfalls exist in `_.split("\\s+")(1)` when parsing logs, and how would you harden it?
Malformed or short lines cause `ArrayIndexOutOfBoundsException`. Harden by splitting with a limit, validating length, using `Option`, or pattern matching: `val parts = line.split("\\s+", 3); if (parts.length > 1) Some(parts(1)) else None`. Filter out `None` or route bad lines to a dead-letter path for audit.
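The hardened extraction can be factored into a small pure function (the name is illustrative), which is easy to unit-test off-cluster before mapping it over an RDD:

```scala
// Safely extract the second whitespace-separated token from a log line.
// The split limit (3) stops the trailing message text from being over-split.
def secondToken(line: String): Option[String] = {
  val parts = line.split("\\s+", 3)
  if (parts.length > 1 && parts(1).nonEmpty) Some(parts(1)) else None
}
```

On an RDD you would then use `lines.flatMap(secondToken)` to keep only well-formed lines.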
-
Under what circumstances would you prefer `aggregateByKey` over `reduceByKey`?
`aggregateByKey` allows different intra- and inter-partition combining (seqOp/combOp), useful for computing complex metrics like sums and maxes together or building collections with bounded size. `reduceByKey` uses a single commutative/associative function for both.
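A sketch computing a sum and a max per key in one pass, assuming an `RDD[(String, Int)]` named `pairs` (the name is illustrative):

```scala
// Zero value (0 sum, no max yet), then seqOp within each partition
// and combOp across partitions.
val sumAndMax = pairs.aggregateByKey((0, Int.MinValue))(
  (acc, v) => (acc._1 + v, math.max(acc._2, v)),   // seqOp: fold in one value
  (a, b)   => (a._1 + b._1, math.max(a._2, b._2))  // combOp: merge partials
)
```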
-
Why does `saveAsTextFile` create multiple part files and how can you control that?
Each partition writes one part file to maximize parallelism. Control the file count with `repartition(n)`/`coalesce(n)` before saving. To produce a single file, coalesce to 1 (acceptable for small outputs), bearing in mind that a single task then writes all the data, losing write parallelism.
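A sketch, assuming a `counts: RDD[(String, Int)]` and a hypothetical output path:

```scala
counts
  .coalesce(1)                        // single partition -> single part file
  .saveAsTextFile("/out/wordcounts")  // writes part-00000 under this directory
```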
-
Outline how you’d package and submit the WordCount app using sbt and spark-submit.
Create `wordcount.sbt` with the Spark dependency marked `provided`, run `sbt package` to build the jar under `target/scala-<version>/`. Submit with `spark-submit --class WordCount --master local[*]` followed by the jar path. On a cluster, set the appropriate master (e.g., yarn, k8s) and resource flags.
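A sketch of the workflow; the jar name, Scala version, and input path are illustrative and depend on your build settings:

```shell
# Build the application jar (Spark itself is `provided`, so it is not bundled).
sbt package

# Run locally with all cores; jar and input path are hypothetical.
spark-submit \
  --class WordCount \
  --master "local[*]" \
  target/scala-2.12/wordcount_2.12-1.0.jar \
  path/to/input.txt
```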
-
What are the tradeoffs between `first`, `take`, and `collect` when inspecting results?
`first` returns one element and is cheap; `take(n)` fetches a bounded sample; `collect` materializes the entire RDD to the driver and can OOM. Prefer `take` for previews and avoid `collect` on large datasets.
-
How would you compute the longest log line more efficiently than mapping lengths plus `reduce`?
Use a single pass with `reduce` over the original lines, comparing lengths directly, or `reduce` over `(length, line)` tuples to retain both values. Alternatively, use `mapPartitions` to compute a local maximum per partition and then reduce those, cutting cross-partition comparisons.
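The comparison function is pure and has the same shape whether passed to `Seq.reduce` locally or `RDD.reduce` on a cluster; a sketch on local data:

```scala
// Keep the longer of two lines in a single pass. On an RDD, the same
// function runs within each partition, then across partition results.
val lines   = Seq("short", "a much longer log line", "mid-size line")
val longest = lines.reduce((a, b) => if (a.length >= b.length) a else b)
// longest == "a much longer log line"
```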
-
Describe how to store counts as a SequenceFile and why you might choose it over text.
`RDD[(K,V)].saveAsSequenceFile(path)` writes Hadoop SequenceFiles (binary key/value). They are splittable and space-efficient vs. plain text and preserve types, aiding interoperability with Hadoop tools.
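A one-line sketch, assuming a `counts: RDD[(String, Int)]` and a hypothetical path; Spark converts common key/value types like `String` and `Int` to Hadoop Writables automatically:

```scala
counts.saveAsSequenceFile("/out/wordcounts-seq")
```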
-
If your error filter relies on the second token, how would you support varied formats (tabs, multiple spaces)?
Use a regex split on `"\\s+"`, or better, a precompiled `java.util.regex.Pattern` reused via `Pattern.split`. Set a split limit to avoid over-splitting message text and validate token counts before indexing.
-
What is the effect of calling `persist(MEMORY_AND_DISK)` vs `MEMORY_ONLY`, and when would you change levels?
`MEMORY_AND_DISK` spills partitions that don’t fit memory to disk, avoiding recomputation at the cost of I/O; `MEMORY_ONLY` recomputes evicted partitions. Use `MEMORY_AND_DISK` for expensive recompute paths or iterative analysis; drop to `DISK_ONLY` for very large but infrequently accessed RDDs.
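A sketch of selecting a level explicitly, assuming an `RDD[String]` named `logs` (the name is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

logs.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk rather than recompute
// ... several actions reuse `logs` here (counts, filters, longest line) ...
logs.unpersist()                            // release storage when done
```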
-
How does `reduceByKey` minimize shuffle compared to `groupByKey` (if used)?
`reduceByKey` performs map-side combiners to aggregate within partitions before shuffling, sending less data across the network. `groupByKey` ships all values and aggregates post-shuffle, which can be heavier on memory and bandwidth.
-
What kind of monitoring and debugging cues would you read from the Spark UI during the log analysis job?
Check Stages for time spent, shuffle read/write, and skew. Look at Executors for GC time and task failures. Storage tab confirms persistence; the Environment tab verifies configuration and classpath. Use the DAG visualization to spot wide transformations and stage boundaries.
-
Suggest a minimal unit/integration test strategy for the WordCount app.
Unit-test pure functions (tokenization, normalization). For integration, run a local SparkSession, read a tiny deterministic input, and assert expected `(word,count)` results. Use temporary directories for outputs, clean them between tests, and avoid global state.
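A unit test for the pure tokenization step might look like this; the function name and normalization rules are illustrative, not from the original app:

```scala
// Lower-case, replace punctuation with spaces, split on whitespace, drop empties.
def tokenize(line: String): Seq[String] =
  line.toLowerCase
    .replaceAll("[^a-z0-9\\s]", " ")
    .split("\\s+")
    .toSeq
    .filter(_.nonEmpty)

assert(tokenize("Hello, World! Hello.") == Seq("hello", "world", "hello"))
```

Because `tokenize` has no Spark dependency, it runs in plain ScalaTest/JUnit; the integration test then only has to verify the wiring (read, flatMap, reduceByKey, write).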
