Apache Spark MasterClass Chapter 3 – Episode 3
-
How would you design a robust 'daily active users' (DAU) metric in production using Spark?
Ingest event logs partitioned by event date (e.g., `dt=YYYY-MM-DD`), parse event time, and deduplicate by a stable user key. Compute `count(distinct user_id)` per day via DataFrame or SQL; for very large data, consider `approx_count_distinct` with a precision tradeoff. Handle late-arriving data with watermarking/reprocessing windows and write results to a partitioned table for downstream BI.
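A minimal sketch of the batch DAU job, assuming an `events` source with `dt`, `user_id`, and `event_time` columns; the paths and table layout are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("dau").getOrCreate()
import spark.implicits._

// Hypothetical partitioned source with columns (dt, user_id, event_time, ...).
val events = spark.read.parquet("/warehouse/events")

val dau = events
  .filter($"user_id".isNotNull)                 // require a stable user key
  .groupBy($"dt")
  .agg(countDistinct($"user_id").as("dau"))

// At very large scale, trade a small relative error for far less shuffle memory:
val dauApprox = events
  .groupBy($"dt")
  .agg(approx_count_distinct($"user_id", 0.01).as("dau_approx"))

dau.write.mode("overwrite").partitionBy("dt").parquet("/warehouse/reports/dau")
```

Writing the result partitioned by `dt` makes downstream BI scans cheap and makes per-day reprocessing for late data straightforward.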
-
Explain the two-step aggregation used to compute average items per cart and potential pitfalls.
Step 1: `groupBy(cart_id).agg(count(item_id).as("total"))` to get per-cart counts. Step 2: `agg(avg(col("total")))` to average across carts. Pitfalls: duplicated events, empty carts, or null `item_id`; you may need to dedupe on `(cart_id, item_id, event_time)` and filter invalid rows. For daily averages, group by both `dt` and `cart_id` before averaging.
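The two steps with the dedupe and null guards in place, assuming the same hypothetical `events` DataFrame and column names:

```scala
import org.apache.spark.sql.functions._

// Step 1: per-cart item counts, after deduping replayed events
// and dropping rows with no item.
val perCart = events
  .dropDuplicates("cart_id", "item_id", "event_time")
  .filter($"item_id".isNotNull)
  .groupBy($"cart_id")
  .agg(count($"item_id").as("total"))

// Step 2: average across carts.
val avgItems = perCart.agg(avg($"total").as("avg_items_per_cart"))

// Daily variant: group by (dt, cart_id) first, then average within each day.
val avgDailyItems = events
  .groupBy($"dt", $"cart_id")
  .agg(count($"item_id").as("total"))
  .groupBy($"dt")
  .agg(avg($"total").as("avg_items_per_cart"))
```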
-
What are practical ways to compute 'top N items per day' at scale?
Aggregate counts per `(dt,item_id)`, then use window functions like `row_number() over (partition by dt order by total desc)` and filter `<= N`. Alternatively, use `sortWithinPartitions` followed by `mapPartitions` for streaming/large cardinality, and guard against skew by salting hot keys.
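The window-function route, sketched against the same hypothetical `events` DataFrame:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val n = 10
val counts = events
  .groupBy($"dt", $"item_id")
  .agg(count(lit(1)).as("total"))

// Rank items within each day by count, then keep the top N.
val byDay = Window.partitionBy($"dt").orderBy($"total".desc)
val topN = counts
  .withColumn("rn", row_number().over(byDay))
  .filter($"rn" <= n)
  .drop("rn")
```

Note that `row_number()` breaks ties arbitrarily; use `rank()` or `dense_rank()` if tied items should all survive the cut.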
-
How do you use `explain()` to debug and optimize a slow query?
`explain()` reveals scans, shuffles (`Exchange`), and aggregation types. Look for unnecessary shuffles or wide stages; reduce them via `repartition`/`coalesce`, broadcast joins, and predicate pushdown. Confirm `spark.sql.shuffle.partitions` is tuned and caching is used only when reused.
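A short sketch of reading the plan and one common remediation; `dimUsers` is a hypothetical small dimension table:

```scala
// Print the full physical plan; look for Exchange (shuffle) operators
// and for HashAggregate vs SortAggregate choices.
dau.explain("formatted")   // Spark 3+; use explain(true) on 2.x

// One common remediation: broadcast a small dimension table so the join
// avoids a shuffle entirely.
import org.apache.spark.sql.functions.broadcast
val enriched = events.join(broadcast(dimUsers), Seq("user_id"))
```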
-
Why does Spark use lazy evaluation, and how should you structure jobs because of it?
Lazy evaluation lets Spark build and optimize a full plan before executing, minimizing I/O and shuffles. Structure pipelines as a series of transformations, trigger actions at logical checkpoints, and cache only reused intermediates. Use `checkpoint` to truncate lineage in very long DAGs.
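A sketch of that structure (paths hypothetical): transformations only build the plan, and nothing executes until the action.

```scala
// Transformations only record the plan -- no work happens here.
val cleaned = events.filter($"user_id".isNotNull)
val daily   = cleaned.groupBy($"dt").agg(countDistinct($"user_id").as("dau"))

// The action triggers the whole optimized plan at once.
daily.write.mode("overwrite").parquet("/warehouse/reports/dau")

// In very long pipelines, checkpoint to truncate lineage.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val stable = daily.checkpoint()
```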
-
When would you prefer DataFrames vs. SQL for these BI tasks?
Use DataFrames for programmatic composition, type safety (in Scala via Datasets), and reuse in codebases. Use SQL for analyst familiarity and concise declarative logic. Both compile to the same Catalyst plan; pick the interface that maximizes clarity and team velocity.
-
How do you make these daily reports idempotent and reliable in batch pipelines?
Partition inputs/outputs by date, write with overwrite per partition or MERGE semantics, and include job metadata (run ID, source offsets). Validate counts and distincts against expectations, and use checkpointing or atomic writes to avoid partial results.
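A sketch of per-partition overwrite with job metadata, assuming a `runId` supplied by the scheduler (hypothetical) and dynamic partition overwrite (available since Spark 2.3):

```scala
import org.apache.spark.sql.functions._

// Overwrite only the date partitions this run actually produced,
// so reruns replace a day's data instead of the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

dau
  .withColumn("run_id", lit(runId))   // runId: hypothetical job metadata
  .write
  .mode("overwrite")
  .partitionBy("dt")
  .parquet("/warehouse/reports/dau")
```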
-
What’s your approach to schema management and evolution for the activity dataset?
Define an explicit schema (or case class) and enforce it at read time. For evolution, use formats supporting schema evolution (e.g., Parquet/Delta/Iceberg) and apply additive changes; monitor for nullability/compatibility, and document versions.
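A minimal explicit-schema read, with assumed column names and a hypothetical landing path:

```scala
import org.apache.spark.sql.types._

// Explicit schema, enforced at read time instead of inferred.
val activitySchema = StructType(Seq(
  StructField("user_id",    StringType,    nullable = false),
  StructField("cart_id",    StringType,    nullable = true),
  StructField("item_id",    StringType,    nullable = true),
  StructField("event_time", TimestampType, nullable = false),
  StructField("dt",         StringType,    nullable = false)
))

val events = spark.read
  .schema(activitySchema)
  .json("/data/activity")   // hypothetical landing path
```

Pinning the schema keeps an upstream change from silently altering inferred types between runs.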
-
How would you handle heavy skew on a popular `item_id` when computing top items?
Apply key salting (add random suffixes then recombine), enable adaptive query execution (AQE) to coalesce skewed partitions, or broadcast the smaller side where applicable. You can also sample heavy hitters separately or cap per-cart contributions.
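A salting sketch for the aggregation case, with a hypothetical salt count of 16:

```scala
import org.apache.spark.sql.functions._

val salts = 16
// Pre-aggregate under (item_id, salt) so no single task owns a hot key...
val partial = events
  .withColumn("salt", (rand() * salts).cast("int"))
  .groupBy($"dt", $"item_id", $"salt")
  .agg(count(lit(1)).as("partial_total"))

// ...then recombine the salted partials into true totals.
val totals = partial
  .groupBy($"dt", $"item_id")
  .agg(sum($"partial_total").as("total"))

// Alternatively, let AQE handle skewed joins (Spark 3+):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```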
-
What trade-offs exist between `count(distinct …)` and `approx_count_distinct(…)`?
`count(distinct)` is exact but can be expensive (memory/shuffle). `approx_count_distinct` (HyperLogLog++) is sublinear memory with configurable relative error—great for large-scale DAU, at the cost of small inaccuracies.
-
Demonstrate how to convert the DataFrame pipeline into a SQL view layer for BI tools.
Register temp or permanent views (`createOrReplaceTempView` / save as table) and define SQL queries for DAU, average cart size, and top-N. Expose via Spark Thrift Server or a lakehouse engine; manage permissions and column masks at the catalog level.
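A sketch of the view layer; the `reports` database name is an assumption:

```scala
// Expose the DataFrame to SQL within this session...
events.createOrReplaceTempView("events")

val dauView = spark.sql("""
  SELECT dt, COUNT(DISTINCT user_id) AS dau
  FROM events
  GROUP BY dt
""")

// ...and persist as a catalog table so BI tools can reach it
// via the Thrift Server or a lakehouse engine.
dauView.write.mode("overwrite").saveAsTable("reports.daily_active_users")
```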
-
What testing strategy would you apply for these transformations?
Create small, deterministic fixtures covering duplicates, nulls, and edge cases; run tests with a local SparkSession. Assert row counts, distincts, and top-N outputs; use golden files for expected results and property-based tests for invariants (e.g., totals match inputs).
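A local-mode fixture sketch; in a real suite the bare `assert` would be a ScalaTest (or similar) matcher:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[2]").appName("dau-test").getOrCreate()
import spark.implicits._

// Deterministic fixture covering a duplicate event.
val fixture = Seq(
  ("2024-01-01", "u1"),
  ("2024-01-01", "u1"),   // duplicate event for the same user
  ("2024-01-01", "u2"),
  ("2024-01-02", "u1")
).toDF("dt", "user_id")

val dau = fixture.groupBy($"dt").agg(countDistinct($"user_id").as("dau"))
assert(dau.where($"dt" === "2024-01-01").head.getLong(1) == 2L)
```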
-
How should `spark.sql.shuffle.partitions` be tuned for these jobs?
The default (200) may be too high or too low depending on cluster and data size. Profile tasks in the Spark UI and set partitions to keep task times balanced (tens of seconds). Use AQE to coalesce or split partitions dynamically in Spark 3+.
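Both options as config sketches (the value 64 is illustrative, not a recommendation):

```scala
// Static tuning based on observed task times in the Spark UI...
spark.conf.set("spark.sql.shuffle.partitions", "64")

// ...or let AQE coalesce small post-shuffle partitions automatically (Spark 3+).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```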
-
When is it appropriate to cache, and what are the risks?
Cache when the same DataFrame is reused multiple times (e.g., as both a DAU base and top-items base). Risks include memory pressure and eviction; always unpersist when done and monitor storage memory.
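The DAU/top-items reuse case as a sketch (output paths hypothetical):

```scala
import org.apache.spark.sql.functions._

// Cache once, reuse for both reports, release when finished.
val base = events.filter($"user_id".isNotNull).cache()

val dau      = base.groupBy($"dt").agg(countDistinct($"user_id").as("dau"))
val topItems = base.groupBy($"dt", $"item_id").agg(count(lit(1)).as("total"))

dau.write.mode("overwrite").parquet("/tmp/reports/dau")
topItems.write.mode("overwrite").parquet("/tmp/reports/top_items")

base.unpersist()   // free storage memory once both actions have run
```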
-
What operational signals would you monitor for these daily pipelines?
Job duration, input/output row counts, distinct user counts, skew warnings, failed/retried tasks, and shuffle spill. Alert on anomalies relative to historical baselines and on missing partitions to catch upstream delays.
