Apache Spark MasterClass Chapter 1 – Episode 2

  1. How do you design an Airflow (or similar) DAG to ensure idempotent tasks, safe retries, and correct backfills without duplicating data in downstream sinks?

    Use **immutable inputs** and **deterministic outputs**, and make sink writes **idempotent** (upsert/merge on a natural key or a (run_id, partition) composite). Write into a staging location/table and **atomically swap/rename** on success. Keep **checkpoint markers** (e.g., partition manifests) to prevent reprocessing the same slice twice. In Airflow, leverage templated params (execution_date/logical_date) to bind each run to a unique partition. Limit concurrency with `max_active_runs` per DAG and per-task pools. For backfills, enable `catchup=True` but protect sinks with idempotent merges and data-quality gates; use `depends_on_past` only when necessary and prefer explicit data dependencies. Record a run ledger (run_id → output partitions) for observability and to support safe replays.
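
As an illustration (plain Python with hypothetical names, not an Airflow API), an idempotent merge keyed by a natural key plus a (run_id, partition) ledger makes a replayed run converge to the same end state instead of duplicating rows:

```python
# Sketch: idempotent upsert by natural key, plus a run ledger so a replayed
# run overwrites its own slice rather than appending duplicates.

def merge_partition(sink: dict, run_id: str, partition: str,
                    rows: list[dict], ledger: dict) -> None:
    """Upsert rows into `sink` by natural key; record the run in a ledger."""
    for row in rows:
        sink[row["key"]] = row               # upsert: replays overwrite, never duplicate
    ledger[(run_id, partition)] = len(rows)  # observability: what each run wrote

sink, ledger = {}, {}
rows = [{"key": "a", "v": 1}, {"key": "b", "v": 2}]
merge_partition(sink, "run-1", "2024-01-01", rows, ledger)
merge_partition(sink, "run-1", "2024-01-01", rows, ledger)  # safe retry: same end state
```

A real sink would implement the same idea with a MERGE/upsert statement or a partition overwrite.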

  2. When would you prefer time-based schedules (cron) over event-driven triggers (file arrival, message on a topic), and how do you implement both cleanly?

    Prefer **cron** when upstream timing is predictable (daily closes, hourly exports) or SLAs are time-based. Prefer **event-driven** when upstream emits signals (object-created events on S3/GCS, a Kafka message) and you want low-latency starts or to avoid idle polling. Implement cron with native schedulers (Airflow `schedule_interval` or Timetables). Implement events with deferrable sensors, cloud notifications (S3 event notifications, GCS Pub/Sub), or external triggers; avoid long-running busy sensors. Combine both for guardrails (e.g., a cron kickoff with an event sensor to confirm data arrival).

  3. How do you enforce schema and quality contracts between upstream sources and downstream sinks across a multi-step pipeline?

    Define contracts in **version-controlled schemas** (Avro/Protobuf/JSON Schema) and register them in a **schema registry/catalog**. Validate at ingress (bronze) and again at promotion (silver/gold) with tools like Great Expectations/dbt tests: types, ranges, nullability, uniqueness, referential integrity. Fail fast and quarantine bad data. Use **semver** for schema evolution, require review for breaking changes, and provide consumer-facing views to maintain backward compatibility. Automate checks in **CI/CD** before deploying DAG changes.
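
A minimal sketch of the validate-and-quarantine step (hand-rolled rules standing in for JSON Schema or Great Expectations; field names are hypothetical):

```python
# Toy ingress contract: type, nullability, and range checks; rows that fail
# are quarantined rather than propagated downstream.

CONTRACT = {
    "user_id": {"type": int, "nullable": False},
    "amount":  {"type": float, "nullable": False, "min": 0.0},
}

def validate(row: dict) -> list[str]:
    errors = []
    for field, rule in CONTRACT.items():
        value = row.get(field)
        if value is None:
            if not rule["nullable"]:
                errors.append(f"{field}: null not allowed")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
        elif "min" in rule and value < rule["min"]:
            errors.append(f"{field}: below minimum {rule['min']}")
    return errors

good, quarantine = [], []
for row in [{"user_id": 1, "amount": 9.5}, {"user_id": None, "amount": -2.0}]:
    (good if not validate(row) else quarantine).append(row)
```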

  4. A mid-pipeline task fails after partial writes. Explain your recovery strategy (transaction boundaries, checkpoints, compensation, or replay).

    Design tasks to be **atomic** or **idempotent**. For file-based sinks: write to temp paths and perform an **atomic rename** on completion; on failure, clean temp and retry. For databases: use transactions or upserts with **idempotency keys** to avoid duplicates. Maintain **checkpoints** (e.g., processed offsets/partitions) to resume. If partial side effects occurred, execute compensating deletes/merges based on a run ledger. As a last resort, **replay** from a durable source of truth (Kafka log, bronze table) to rebuild downstream tables deterministically.
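
For the file-based case, the temp-write-then-atomic-rename pattern can be sketched with the standard library (filename is illustrative):

```python
# Write to a temp file, fsync, then atomically publish via os.replace: readers
# never observe a partial file, and a failed attempt leaves only temp debris
# that the except branch compensates for.
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make the temp file durable before publishing
        os.replace(tmp, path)      # atomic on POSIX: all-or-nothing publish
    except BaseException:
        os.unlink(tmp)             # compensation: remove the partial write
        raise

atomic_write("out.dat", b"complete-result")
```

Object stores lack a true rename, so there the equivalent is writing to a staging prefix and committing via a manifest or table-format transaction.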

  5. How do you handle late-arriving or out-of-order events in streaming jobs? Discuss watermarking, allowed lateness, and retractions.

    Use **event time** (not processing time) and assign **watermarks** that lag behind the max observed event time by a tolerance window. Configure **allowed lateness** to accept stragglers within that window and update aggregates. For updates after output, support **retractions** or upsert sinks keyed by (entity, window) to correct prior results. Keep state TTL aligned with lateness SLAs, and monitor skew to tune watermark delays. Optionally route very late data to a side output for offline reprocessing/backfills.
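
A toy event-time aggregator (plain Python, constants illustrative) shows the watermark trailing the max observed event time, with very late events routed to a side output:

```python
# Watermark = max event time - DELAY. Events older than
# (watermark - ALLOWED_LATENESS) go to a side output for offline backfill;
# everything else updates its tumbling window's aggregate.

DELAY, ALLOWED_LATENESS, WINDOW = 5, 10, 60

state, side_output, max_event_time = {}, [], 0

def process(event_time: int, key: str, value: int) -> None:
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - DELAY
    if event_time < watermark - ALLOWED_LATENESS:
        side_output.append((key, event_time, value))  # too late: reprocess offline
        return
    window = event_time - event_time % WINDOW         # tumbling-window start
    state[(key, window)] = state.get((key, window), 0) + value

process(100, "u1", 1)
process(160, "u1", 2)   # advances the watermark to 155
process(30, "u1", 5)    # 30 < 155 - 10, so it lands in the side output
```

Engines like Flink or Spark Structured Streaming provide this machinery natively; the sketch only illustrates the bookkeeping.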

  6. Compare at-least-once vs exactly-once semantics for streaming pipelines. When is at-least-once sufficient, and how do you approach exactly-once in practice?

    **At-least-once** may emit duplicates on retries but is simpler and often acceptable when sinks are idempotent (e.g., MERGE by key) or metrics tolerate tiny overcounts. **Exactly-once** typically combines idempotent producers, transactional writes, and atomic offset commits (e.g., Kafka transactions, Streams API, Flink two-phase commit). In practice, aim for **effectively-once**: at-least-once delivery plus idempotent/materialized sinks with dedupe keys to ensure correct end state.
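
The effectively-once idea reduces to a dedupe key at the sink, sketched here with hypothetical message ids:

```python
# At-least-once delivery plus a dedupe key: redelivered messages converge to
# the same end state, so retries never overcount.

seen, totals = set(), {}

def apply(msg_id: str, account: str, amount: int) -> None:
    if msg_id in seen:          # duplicate delivery: ignore
        return
    seen.add(msg_id)
    totals[account] = totals.get(account, 0) + amount

apply("m1", "acct-1", 100)
apply("m1", "acct-1", 100)     # retry/redelivery: no double-count
apply("m2", "acct-1", 50)
```

In a real sink the `seen` set would be a unique constraint, a MERGE condition, or state kept alongside the data in the same transaction.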

  7. How do you choose Kafka partition keys to balance load while preserving necessary ordering guarantees? Provide examples.

    Pick keys that maintain per-entity ordering where required (e.g., **user_id**, **account_id**) while distributing load across many partitions. Avoid low-cardinality keys (e.g., country="US") that create hotspots. For globally ordered streams, accept a single partition (reduced throughput). For mixed needs, use **composite keys** (e.g., user_id combined with a session or region component), accepting that ordering then holds only within each composite key.
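
A sketch of the key-to-partition mapping (not Kafka's exact murmur2 partitioner, but the same principle):

```python
# Hash the key to a partition: every event for one key lands on one partition
# (per-key ordering), while high-cardinality keys spread across partitions.
import hashlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()   # stable across runs, unlike hash()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

p = partition_for("user-42")
assert partition_for("user-42") == p              # deterministic: ordering preserved
spread = {partition_for(f"user-{i}") for i in range(1000)}  # many keys → many partitions
```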

  8. What metrics do you monitor for Kafka consumers? How do you detect and remediate lag?

    Monitor consumer lag (per partition), records/bytes consumed per second, processing latency, rebalance rate, commit frequency, error rate, and pause/resume time. Detect lag via group-coordinator metrics and alert on thresholds. Remediate by (1) **scaling out consumers** in the group, (2) increasing partitions (future-proofing), (3) tuning `fetch.min.bytes`, `max.poll.records`, and session/poll intervals, (4) optimizing downstream processing (batching, vectorization), and (5) ensuring sufficient broker/network I/O and avoiding slow sinks.
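
The lag calculation itself is simple arithmetic per partition (offsets below are illustrative numbers):

```python
# Lag per partition = end offset (log head) - committed offset; alert when the
# total exceeds a threshold.

end_offsets       = {0: 1_000, 1: 2_500, 2: 900}
committed_offsets = {0:   990, 1: 1_200, 2: 900}

lag = {p: end_offsets[p] - committed_offsets[p] for p in end_offsets}
total_lag = sum(lag.values())
alert = total_lag > 1_000
```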

  9. Contrast automatic vs manual offset commits. When would you use each, and how do you avoid message loss or duplication during failures?

    **Auto-commit** is simple but risks committing before successful processing. **Manual (sync) commits** after sink success provide stronger guarantees. Use read → process → write → commit ordering: process the batch, write to an idempotent sink, then commit offsets. For stronger guarantees, use transactional producers/consumers to atomically write results and commit offsets (EOS). Always handle retries idempotently, and avoid committing offsets in `finally` blocks that run even after failures.
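
The commit ordering can be sketched with an in-memory stand-in for a consumer (no Kafka client involved):

```python
# Read → process → write → commit: offsets advance only after the batch is in
# the sink, so a crash mid-batch replays records (duplicates, handled by an
# idempotent sink) rather than losing them.

log = ["a", "b", "c", "d"]   # stand-in for a partition's records
committed = 0                # stand-in for the group's committed offset
sink = []

def poll(batch_size: int) -> list[str]:
    return log[committed:committed + batch_size]

def process_and_commit(batch_size: int) -> None:
    global committed
    batch = poll(batch_size)
    if not batch:
        return
    sink.extend(r.upper() for r in batch)  # write to the (idempotent) sink first...
    committed += len(batch)                # ...then commit; a crash before this
                                           # line replays the batch, never drops it

process_and_commit(3)
process_and_commit(3)
```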

  10. How do Kafka retention policies and compaction influence your ability to reprocess data? When would you use a dead-letter topic?

    Time/size-based retention preserves history for replay until expiration. Log **compaction** keeps the latest record per key for changelog topics: good for rebuilding state, but not full history. For full reprocessing, maintain long retention on raw event topics or archive to object storage. Use a **dead-letter topic** for messages that repeatedly fail validation/transformation; include error metadata and implement periodic triage/replay after fixes.
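
What compaction preserves (and discards) can be simulated in a few lines:

```python
# Compaction keeps only the latest record per key; a null value (tombstone)
# deletes the key. Enough to rebuild current state, but intermediate history
# ("u1" was once "v1") is gone.

def compact(log: list[tuple[str, object]]) -> dict:
    state = {}
    for key, value in log:
        if value is None:        # tombstone: key removed after compaction
            state.pop(key, None)
        else:
            state[key] = value
    return state

log = [("u1", "v1"), ("u2", "v1"), ("u1", "v2"), ("u2", None)]
state = compact(log)
```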

  11. Describe how you identify and handle backpressure in streaming pipelines.

    Identify via rising consumer lag, growing processing latency, and queue depths. Handle by throttling intake (pause partitions), batching records, resizing compute, increasing parallelism, tuning poll/fetch sizes, and applying backoff with jitter. Implement rate limits at producers and use buffering with bounded queues. Profile hotspots in downstream sinks (e.g., database upserts) and consider bulk/async writes or a buffering/compaction layer.
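
Of the remedies above, backoff with jitter is the easiest to get subtly wrong; one common form (full jitter, parameters illustrative) is:

```python
# Exponential backoff with full jitter: the delay grows with each attempt but
# is drawn uniformly from [0, cap], spreading retries out so a fleet of
# consumers does not hammer an overloaded sink in lockstep.
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    return random.uniform(0, min(cap, base * 2 ** attempt))

delays = [backoff_with_jitter(a) for a in range(5)]
```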

  12. How do you secure Kafka and pipelines end-to-end and align with data governance?

    Encrypt in transit with **TLS**; use mTLS or SASL (SCRAM/OAUTHBEARER) for auth. Authorize with **ACLs** and least-privilege principals per topic/group. Isolate networks (VPC peering/PrivateLink), rotate secrets, and store them in a vault. Enable audit logs, schema registry ACLs, and topic-level data classification. Integrate with a catalog/governance layer (e.g., Lake Formation/Unity Catalog/DataHub) for ownership, tags, and policies, and enforce **PII masking** at sinks/views.

  13. Which metadata/lineage patterns or tools would you use to capture dataset origins and transformations across batch and streaming?

    Adopt **OpenLineage/Marquez** or DataHub/Amundsen for end-to-end lineage, instrumenting DAG tasks and stream processors to emit run metadata (inputs, outputs, code version, run ids). Capture column-level lineage where possible. Store lineage in a central service, surface it in the catalog, and tie it to CI/CD artifacts and data contracts to support impact analysis and audits.

  14. What's your approach to unit tests, integration tests, contract tests, and data quality tests for DAGs and stream processors?

    **Unit-test** pure transforms with small fixtures. **Integration-test** end-to-end flows using ephemeral envs (Testcontainers/LocalStack) and sample topics/buckets. **Contract-test** schemas with a registry and backward-compat checks. Add **data quality tests** (Great Expectations/dbt) for nulls, ranges, and referential integrity. Validate DAG structure (no cycles, dependency completeness) and simulate retries/backfills. Gate deployments with CI and promote via environments.
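
The unit-test layer is the cheapest: a pure transform plus a small in-memory fixture (the transform below is hypothetical):

```python
# Unit-testing a pure transform with a tiny fixture: no cluster, no broker,
# just input rows in and expected rows out.

def normalize(rows: list[dict]) -> list[dict]:
    """Trim and lowercase names; drop rows missing an id."""
    return [
        {"id": r["id"], "name": r["name"].strip().lower()}
        for r in rows
        if r.get("id") is not None
    ]

fixture = [{"id": 1, "name": "  Ada "}, {"id": None, "name": "ghost"}]
result = normalize(fixture)
```

In a test suite this would live in pytest, with the integration and contract layers exercising the same transform against real topics and schemas.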

  15. Given a near real-time dashboard requirement, how do you decide between micro-batch and continuous streaming? Include latency SLOs, cost, and ops complexity.

    Define a latency SLO first (e.g., P95 < 60s). If SLO is minutes-level, **micro-batch** (e.g., Spark Structured Streaming micro-batches or scheduled mini-jobs) is often simpler and cheaper, with strong recovery semantics. If sub-5s latency or strict per-event SLAs are required, choose **continuous streaming** (Flink/Kafka Streams), accepting higher ops complexity and always-on compute. Consider data volume, cost-per-query, stateful needs, and exactly-once requirements. Start with micro-batch and evolve when product value justifies lower latency.
