Apache Spark MasterClass Chapter 3 – Episode 4
-
Explain the full text pipeline (read → tokenize → clean → explode → count). Where do most bugs sneak in?
Read lines with `spark.read.text` (one row per line). Tokenize via `split` into arrays. Clean with `lower`, `regexp_extract`/`regexp_replace`, and `trim`. Explode arrays so each token becomes its own row, then `groupBy('word').count()` and `orderBy(desc('count'))`. Bugs typically come from tokenization (e.g., hyphenation, punctuation, Unicode), cleaning that removes meaningful tokens, and forgetting to filter blanks/nulls.
-
Lazy evaluation vs. eager eval in PySpark: pros/cons and when to use each.
Lazy evaluation defers execution, allowing Catalyst to optimize and minimize I/O/shuffles—great for large, chained transformations. Eager eval (REPL/notebook display, toggled by `spark.sql.repl.eagerEval.enabled`) is convenient for learning/exploration but risks triggering heavy jobs inadvertently. Use eager only in small demos; prefer lazy in production.
-
Design an efficient word count that handles huge books and many files.
Use `spark.read.text()` to parallelize across files, avoid UDFs, and rely on built-ins (`split`, `lower`, `regexp_replace`). Repartition to match cluster parallelism, consider stopword removal, and use `groupBy('word').count()` followed by `orderBy(desc('count'))`. Cache intermediate cleaned tokens only if reused (e.g., multiple top-N variants).
-
How would you keep apostrophes or hyphenated words while still cleaning punctuation?
Adjust the regex to include characters you want to keep, e.g., `regexp_extract(col('w'), "[a-z']+", 0)` to keep apostrophes, or use Unicode classes like `\p{L}`. Alternatively, use `regexp_replace` to strip disallowed characters while preserving desired ones.
-
Why prefer DataFrame functions over Python UDFs for this workload?
Built-in functions are compiled into Spark’s optimized plans and avoid Python ↔ JVM serialization overhead. UDFs disable many optimizations and are slower; use them only if no native expression covers the need (or consider pandas UDFs with care).
-
What are common pitfalls when using `explode` on very large arrays?
Massive row explosion can create data skew and memory pressure. Mitigations: filter early, tokenize carefully (e.g., ignore blank tokens), limit to necessary columns, and monitor partition sizes; use `explode_outer` for null-safe behavior.
-
How would you unit test this pipeline? Mention at least three concrete assertions.
Create small fixtures and assert: (1) no blank tokens remain after cleaning, (2) token counts for known inputs (e.g., 'a a b' → a:2, b:1), (3) schema expectations (array → explode → string), and (4) top-N results are stable for deterministic inputs.
-
Show two ways to rename columns and when each is preferable.
`alias` within `select` is concise when you create/transform a column. `withColumnRenamed` is handy to rename an existing column after the fact or in multi-step pipelines. Both avoid fragile string parsing of generated column names.
-
How do you robustly remove stopwords like 'is', 'the', 'and' at scale?
Use `isin` with a broadcasted list or join against a small stopword DataFrame and filter. Normalize case first (`lower`), and ensure trimming; optionally keep domain-specific terms by maintaining a curated allowlist.
-
What are best practices to avoid collecting huge datasets to the driver while exploring?
Use `show()` with limits, `limit()` before actions, sample with `sample()` or `take()` small counts, and inspect via the Spark UI. Avoid `collect()` on large frames; prefer `write` to storage if you need full outputs.
-
Describe how you’d parameterize the top-N (10, 20, 50, 100) requirement cleanly.
Compute counts once, store as a DataFrame, then call `.orderBy(desc('count')).limit(n)` where `n` is a function parameter or widget input. This avoids recomputation and keeps the logic DRY.
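A sketch of that pattern (the `top_n` helper and the toy counts DataFrame are invented for illustration; a real pipeline would feed in the cached output of the word-count aggregation):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.master("local[*]").getOrCreate()

counts = spark.createDataFrame(
    [("the", 5), ("fox", 3), ("quick", 1)], ["word", "count"]
).cache()  # computed once, reused for every N

def top_n(df, n):
    """Return the n most frequent words from a precomputed counts DataFrame."""
    return df.orderBy(desc("count")).limit(n)

for n in (10, 20, 50, 100):  # each call reuses the cached counts
    top_n(counts, n).show()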
-
What’s the difference between `regexp_extract` and `regexp_replace` for cleaning? Give examples.
`regexp_extract(col, pattern, group)` returns a captured substring; e.g., `[a-z]+` to keep letters. `regexp_replace(col, pattern, replacement)` substitutes matches; e.g., replace `[^a-z]` with `''` (the empty string) to strip non-letters while preserving the rest of the string.
-
How can you enrich the word count with document/line context for later analysis?
Read multiple files and add `input_file_name()` to keep source context, add `monotonically_increasing_id()` for line IDs, or maintain paragraph indices. Aggregate per file/section, then union into global counts for comparative analytics.
-
What partitioning/tuning knobs matter most for this pipeline?
`spark.sql.shuffle.partitions` for downstream aggregations, `repartition/coalesce` for balancing, broadcast thresholds for joins (if stopword table), and caching policy. Monitor shuffle spill, skew, and executor memory in the Spark UI.
-
If your regex destroys non-ASCII words (e.g., café, naïve), how would you fix it?
Use Unicode-aware patterns like `\p{L}+` instead of `[a-z]+`, or normalize with `unaccent` libraries before cleaning. Validate with multilingual fixtures to ensure tokens aren’t lost.
