Apache Spark MasterClass Chapter 3 – Episode 6
-
Explain lazy evaluation in the context of `sc.textFile` and why a path error may only surface on an action.
`sc.textFile(path)` returns an RDD whose lineage references the file; no I/O occurs yet. Only when you call an action (e.g., `count`, `first`, `collect`) does Spark schedule jobs, contact the filesystem, and materialize partitions. If the path is wrong, the failure surfaces at that time, not at RDD creation.
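A minimal sketch of this behavior, assuming a running `SparkContext` named `sc` (the path is hypothetical):

```scala
// No I/O happens on these two lines, even if the path is wrong.
val lines = sc.textFile("/data/app.log")   // records lineage only
val upper = lines.map(_.toUpperCase)       // still lazy

// The action schedules a job and touches the filesystem; a bad path
// fails here, not at RDD creation above.
val n = upper.count()
```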
-
Compare `cache()`/`persist()` and when to use them during log analysis.
`cache()` is shorthand for `persist(MEMORY_ONLY)`. `persist` lets you select levels (e.g., MEMORY_AND_DISK). Use them when the same RDD/DataFrame will be reused across multiple actions (e.g., filter counts, longest-line, word stats). Avoid caching one-off datasets to save memory and prevent eviction thrash.
-
Contrast `reduceByKey` with `countByKey` in the example.
`reduceByKey` is a distributed transformation that combines values per key using a function (e.g., `_ + _`), staying as an RDD. `countByKey` is an action that returns a driver-side `Map[K,Long]`. For large key spaces prefer `reduceByKey` or `mapValues` + `reduceByKey` to avoid driver OOM.
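A sketch of the contrast, assuming an `RDD[(String, Int)]` named `pairs` (the name is illustrative):

```scala
// Transformation: result stays distributed as an RDD.
val counts: org.apache.spark.rdd.RDD[(String, Int)] =
  pairs.reduceByKey(_ + _)

// Action: the entire map of key -> count lands on the driver.
// Safe only when the number of distinct keys is small.
val local: scala.collection.Map[String, Long] =
  pairs.countByKey()
```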
-
What pitfalls exist in `_.split("\\s+")(1)` when parsing logs, and how would you harden it?
Malformed or short lines cause `ArrayIndexOutOfBoundsException`. Harden by splitting with a limit, validating length, using `Option`, or pattern matching: `val parts = line.split("\\s+", 3); if (parts.length > 1) Some(parts(1)) else None`. Filter out `None` or route bad lines to a dead-letter path for audit.
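The hardened extraction can be factored into a small pure function (the name is illustrative), which is easy to unit-test off-cluster before mapping it over an RDD:

```scala
// Safely extract the second whitespace-separated token from a log line.
// The split limit (3) stops the trailing message text from being over-split.
def secondToken(line: String): Option[String] = {
  val parts = line.split("\\s+", 3)
  if (parts.length > 1 && parts(1).nonEmpty) Some(parts(1)) else None
}
```

On an RDD you would then use `lines.flatMap(secondToken)` to keep only well-formed lines.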
-
Under what circumstances would you prefer `aggregateByKey` over `reduceByKey`?
`aggregateByKey` allows different intra- and inter-partition combining (seqOp/combOp), useful for computing complex metrics like sums and maxes together or building collections with bounded size. `reduceByKey` uses a single commutative/associative function for both.
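A sketch computing a sum and a max per key in one pass, assuming an `RDD[(String, Int)]` named `pairs` (the name is illustrative):

```scala
// Zero value (0 sum, no max yet), then seqOp within each partition
// and combOp across partitions.
val sumAndMax = pairs.aggregateByKey((0, Int.MinValue))(
  (acc, v) => (acc._1 + v, math.max(acc._2, v)),   // seqOp: fold in one value
  (a, b)   => (a._1 + b._1, math.max(a._2, b._2))  // combOp: merge partials
)
```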
-
Why does `saveAsTextFile` create multiple part files and how can you control that?
Each partition writes one part file to maximize parallelism. Control the file count with `repartition(n)`/`coalesce(n)` before saving. To produce a single file, coalesce to 1 (acceptable for small outputs), bearing in mind that a single task then writes all the data, losing write parallelism.
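A sketch, assuming a `counts: RDD[(String, Int)]` and a hypothetical output path:

```scala
counts
  .coalesce(1)                        // single partition -> single part file
  .saveAsTextFile("/out/wordcounts")  // writes part-00000 under this directory
```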
-
Outline how you’d package and submit the WordCount app using sbt and spark-submit.
Create `wordcount.sbt` with the Spark dependency marked `provided`, run `sbt package` to build the jar under `target/scala-<version>/`. Submit with `spark-submit --class WordCount --master local[*]` followed by the jar path. On a cluster, set the appropriate master (e.g., yarn, k8s) and resource flags.
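A sketch of the workflow; the jar name, Scala version, and input path are illustrative and depend on your build settings:

```shell
# Build the application jar (Spark itself is `provided`, so it is not bundled).
sbt package

# Run locally with all cores; jar and input path are hypothetical.
spark-submit \
  --class WordCount \
  --master "local[*]" \
  target/scala-2.12/wordcount_2.12-1.0.jar \
  path/to/input.txt
```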
-
What are the tradeoffs between `first`, `take`, and `collect` when inspecting results?
`first` returns one element and is cheap; `take(n)` fetches a bounded sample; `collect` materializes the entire RDD to the driver and can OOM. Prefer `take` for previews and avoid `collect` on large datasets.
-
How would you compute the longest log line more efficiently than mapping lengths plus `reduce`?
Use a single pass with `reduce` over the original lines, comparing lengths directly, or `reduce` over `(length, line)` tuples to retain both values. Alternatively, use `mapPartitions` to compute a local maximum per partition and then reduce those, cutting cross-partition comparisons.
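The comparison function is pure and has the same shape whether passed to `Seq.reduce` locally or `RDD.reduce` on a cluster; a sketch on local data:

```scala
// Keep the longer of two lines in a single pass. On an RDD, the same
// function runs within each partition, then across partition results.
val lines   = Seq("short", "a much longer log line", "mid-size line")
val longest = lines.reduce((a, b) => if (a.length >= b.length) a else b)
// longest == "a much longer log line"
```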
-
Describe how to store counts as a SequenceFile and why you might choose it over text.
`RDD[(K,V)].saveAsSequenceFile(path)` writes Hadoop SequenceFiles (binary key/value). They are splittable and space-efficient vs. plain text and preserve types, aiding interoperability with Hadoop tools.
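A one-line sketch, assuming a `counts: RDD[(String, Int)]` and a hypothetical path; Spark converts common key/value types like `String` and `Int` to Hadoop Writables automatically:

```scala
counts.saveAsSequenceFile("/out/wordcounts-seq")
```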
-
If your error filter relies on the second token, how would you support varied formats (tabs, multiple spaces)?
Use a regex split on `"\\s+"`, or better, a precompiled `java.util.regex.Pattern` reused via `Pattern.split`. Set a split limit to avoid over-splitting message text and validate token counts before indexing.
-
What is the effect of calling `persist(MEMORY_AND_DISK)` vs `MEMORY_ONLY`, and when would you change levels?
`MEMORY_AND_DISK` spills partitions that don’t fit memory to disk, avoiding recomputation at the cost of I/O; `MEMORY_ONLY` recomputes evicted partitions. Use `MEMORY_AND_DISK` for expensive recompute paths or iterative analysis; drop to `DISK_ONLY` for very large but infrequently accessed RDDs.
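A sketch of selecting a level explicitly, assuming an `RDD[String]` named `logs` (the name is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

logs.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk rather than recompute
// ... several actions reuse `logs` here (counts, filters, longest line) ...
logs.unpersist()                            // release storage when done
```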
-
How does `reduceByKey` minimize shuffle compared to `groupByKey` (if used)?
`reduceByKey` performs map-side combiners to aggregate within partitions before shuffling, sending less data across the network. `groupByKey` ships all values and aggregates post-shuffle, which can be heavier on memory and bandwidth.
-
What kind of monitoring and debugging cues would you read from the Spark UI during the log analysis job?
Check Stages for time spent, shuffle read/write, and skew. Look at Executors for GC time and task failures. Storage tab confirms persistence; the Environment tab verifies configuration and classpath. Use the DAG visualization to spot wide transformations and stage boundaries.
-
Suggest a minimal unit/integration test strategy for the WordCount app.
Unit-test pure functions (tokenization, normalization). For integration, run a local SparkSession, read a tiny deterministic input, and assert expected `(word,count)` results. Use temporary directories for outputs, clean them between tests, and avoid global state.
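A unit test for the pure tokenization step might look like this; the function name and normalization rules are illustrative, not from the original app:

```scala
// Lower-case, replace punctuation with spaces, split on whitespace, drop empties.
def tokenize(line: String): Seq[String] =
  line.toLowerCase
    .replaceAll("[^a-z0-9\\s]", " ")
    .split("\\s+")
    .toSeq
    .filter(_.nonEmpty)

assert(tokenize("Hello, World! Hello.") == Seq("hello", "world", "hello"))
```

Because `tokenize` has no Spark dependency, it runs in plain ScalaTest/JUnit; the integration test then only has to verify the wiring (read, flatMap, reduceByKey, write).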
