Apache Spark MasterClass Chapter 1 – Episode 1
-
Explain the CAP theorem in your own words and give a concrete example of when you would prioritize Availability over Consistency in a data platform. What compensating controls would you add?
CAP states that in the presence of a network partition (P), a distributed system can provide either strong Consistency (all clients see the same data at the same time) or Availability (the system continues to serve requests), but not both simultaneously. Example prioritizing Availability: A social media feed service during regional link failures should still allow users to post and read most recent content, even if some replicas are stale (eventual consistency). Compensating controls: (1) Use per-item versioning and vector clocks to detect conflicts, (2) implement read-repair/merkle-tree anti-entropy to converge replicas, (3) expose write conflict resolution policies (LWW or app-specific merge), (4) surface data freshness SLAs to clients, and (5) audit logs to reconcile anomalies post-partition.
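One of the compensating controls above, last-write-wins (LWW) conflict resolution, can be sketched in a few lines. This is a toy model, not a real database API: `lww_merge` and the replica dicts are illustrative names, and real systems would use vector clocks or hybrid logical clocks rather than bare timestamps.

```python
def lww_merge(replica_a, replica_b):
    """Merge two replica states of {key: (value, timestamp)} after a
    partition heals, keeping the newest write per key (LWW policy)."""
    merged = dict(replica_a)
    for key, (value, ts) in replica_b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

# Two replicas diverged during a partition; post #2 was edited on replica B.
a = {"post:1": ("hello", 100), "post:2": ("old", 90)}
b = {"post:2": ("edited", 120), "post:3": ("new", 110)}
merged = lww_merge(a, b)
print(merged["post:2"])  # ('edited', 120) - the newer write wins
```

Note that LWW silently discards the losing write, which is why the answer also lists audit logs and app-specific merges as complementary controls.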
-
Given a flight-booking OLTP schema, how would you design the downstream OLAP model for on-time performance analytics? Walk through grain, dimensions, facts, and partitioning.
Grain: One row per flight leg per scheduled departure date. Dimensions: Date (calendar, fiscal), Airport, Carrier, Aircraft, Route, Weather bucket, Delay reason codes, Time-of-day. Facts: Actual departure/arrival times, delay minutes (off-block, taxi, airborne), cancellations, diversions, on-time flag, load factor. Partitioning & clustering: Partition by flight_date; cluster by carrier, origin, destination. Precompute aggregates (by route/daypart/season) and create materialized views for common rollups. Use surrogate keys and conformed dimensions to join across subject areas.
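The delay-minutes and on-time-flag facts can be derived as below. The 15-minute threshold mirrors the common US DOT on-time definition, but treat it as an assumption for this sketch; the function names are illustrative.

```python
from datetime import datetime

def delay_minutes(scheduled: datetime, actual: datetime) -> float:
    """Delay fact at the flight-leg grain, in minutes (negative = early)."""
    return (actual - scheduled).total_seconds() / 60

def on_time(scheduled: datetime, actual: datetime, threshold_min: float = 15) -> bool:
    """On-time flag: within 15 minutes of schedule (assumed DOT-style cutoff)."""
    return delay_minutes(scheduled, actual) <= threshold_min

sched = datetime(2024, 3, 1, 9, 0)
act = datetime(2024, 3, 1, 9, 12)
print(delay_minutes(sched, act), on_time(sched, act))  # 12.0 True
```

Computing these once in the fact table, rather than in every dashboard query, is what keeps the route/daypart rollups cheap.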
-
When would you land data in a lake first versus loading straight into a warehouse? Discuss schema evolution, cost, governance, and performance trade-offs.
Land to lake first when: (1) sources are semi/unstructured (logs, JSON, media), (2) schema evolves frequently, (3) you want cheap immutable storage and reprocessing, (4) multiple downstream consumers need raw history. Load straight to warehouse when: (1) inputs are well-structured, (2) BI latency is primary, (3) governed models already exist. Trade-offs: Lakes are cheaper and flexible but need cataloging, quality, and performance tuning (file sizes, partitions). Warehouses offer managed performance/governance but can be costlier at scale and less friendly to schema-on-read patterns. Many teams adopt a lakehouse: land raw in lake, curate gold models for BI in warehouse/SQL engine.
-
Describe the bronze-silver-gold medallion architecture. How do you define SLOs/SLAs and data contracts at each layer?
Bronze: Immutable raw ingests, minimal validation. SLOs: landing latency, completeness vs source. Contracts: file formats, delivery frequency. Silver: Cleaned/enriched, conformed schemas, deduped. SLOs: data quality thresholds (null %, referential integrity), freshness windows. Contracts: stable schemas with documented evolution policies. Gold: Curated, business-ready marts/aggregates. SLAs: availability to dashboards by X time, query performance, semantic consistency. Contracts: versioned metrics, governed dimensions, breaking-change deprecation policy.
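A bronze landing-latency SLO reduces to a simple freshness check like the one below. The 1-hour window and the function name are assumptions for illustration; in practice the window comes from the data contract and the alert goes to your monitoring system.

```python
from datetime import datetime, timedelta

def freshness_breach(last_landed: datetime, now: datetime,
                     slo: timedelta = timedelta(hours=1)) -> bool:
    """Bronze SLO monitor sketch: breach if the newest landed file is
    older than the agreed freshness window (1h here is an assumption)."""
    return now - last_landed > slo

now = datetime(2024, 3, 1, 12, 0)
print(freshness_breach(datetime(2024, 3, 1, 10, 30), now))  # True: 90 min old
print(freshness_breach(datetime(2024, 3, 1, 11, 45), now))  # False: 15 min old
```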
-
You have daily Parquet files on S3 for clickstream events. Queries are slow. What actions would you take (partitioning, compaction, file size targets, predicate pushdown, Z-ordering/clustering) and why?
Actions: (1) Partition by event_date (and possibly device/country if selective) to reduce scanned data; (2) Compact small files into target sizes of ~128–512 MB to balance parallelism against per-file overhead; (3) Ensure column stats and pruning via proper Parquet metadata; (4) Use clustering/Z-ordering on high-cardinality filter columns to improve locality; (5) Enforce predicate-pushdown-friendly types (avoid filtering on nested strings); (6) Materialize aggregates for hot queries; (7) Consider caching or result reuse; (8) Optimize catalog partitions and avoid over-partitioning (many tiny directories).
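The compaction step is just ceiling division over the dataset size: pick a target file size and coalesce into that many output files. The 256 MB target is one assumed point inside the range above; `target_file_count` is an illustrative helper, not a Spark API.

```python
def target_file_count(total_bytes: int, target_file_bytes: int = 256 * 1024**2) -> int:
    """How many output files to compact a partition into, given a
    ~256 MB-per-file target (assumed; tune for your engine and workload)."""
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

# 10 GB of small files -> 40 files at ~256 MB each
print(target_file_count(10 * 1024**3))  # 40
```

In Spark this number typically feeds a `repartition(n)` (or the table format's built-in compaction) before the write.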
-
A stakeholder wants real-time dashboards. How do you decide between micro-batch and true streaming? Discuss latency, cost, fault tolerance, exactly-once semantics, and state management.
Decision hinges on SLA: sub-second to a few seconds may need true streaming; ~tens of seconds to minutes often fits micro-batch. Micro-batch is simpler, cheaper, and leverages batch semantics with incremental scheduling. True streaming provides lower latency but demands careful state management, watermarking, and idempotent sinks for exactly-once. Fault tolerance: checkpoints and replay in both; streaming requires robust backpressure handling. Cost: streaming jobs run 24/7; micro-batch can scale to schedule windows. Start with micro-batch unless product value requires ultra-low latency.
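The "idempotent sinks for exactly-once" point can be made concrete with a toy sink: if writes are keyed by batch id, a replayed batch after a failure has no effect, so at-least-once delivery yields exactly-once results. `IdempotentSink` is a teaching model, not a real connector; production sinks track committed batch ids transactionally (e.g., in the sink itself).

```python
class IdempotentSink:
    """Toy idempotent sink: replayed batches are skipped by batch_id,
    turning at-least-once delivery into exactly-once effects."""

    def __init__(self):
        self.committed = set()
        self.rows = []

    def write(self, batch_id: int, batch: list) -> None:
        if batch_id in self.committed:  # replay after a failure: no-op
            return
        self.rows.extend(batch)
        self.committed.add(batch_id)

sink = IdempotentSink()
sink.write(0, ["a", "b"])
sink.write(0, ["a", "b"])  # checkpoint replay - ignored
sink.write(1, ["c"])
print(sink.rows)  # ['a', 'b', 'c']
```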
-
Your team is moving from ETL to ELT. What changes in tooling, costs, observability, and governance should they expect? When is ETL still the better choice?
ELT shifts transforms into the warehouse/lake engine (SQL/Spark). Tooling: dbt/SQL-based transforms, orchestration with Airflow/Workflows, versioned models. Costs: compute shifts to warehouse/lake; storage of raw + modeled layers increases; better elasticity. Observability: model tests (dbt), data quality checks, lineage in catalog. Governance: stronger data contracts, environment promotion, CI/CD for SQL. ETL remains better when transforms are complex non-SQL, require specialized compute (ML/feature extraction), strict PII isolation before landing, or when upstream egress and pre-cleaning constraints apply.
-
A bad transform corrupted yesterday's silver tables. How do you detect, triage, and recover? Include lineage, checkpoints, time travel/versioning, and replay strategies.
Detect: quality monitors (volume, null %, uniqueness), anomaly alerts, schema drift detectors. Triage: freeze downstream loads, identify affected datasets via lineage graph. Recover: use table/version time travel (e.g., Delta/Iceberg) to revert to a known-good snapshot; if unavailable, drop/rebuild silver from bronze using deterministic replay and checkpoints. Patch: fix code, add unit/contract tests, backfill gold. Postmortem: root cause, guardrails, rollback runbook, and SLAs for recovery time (RTO/RPO).
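The time-travel recovery path can be modeled with a toy versioned table: every write appends an immutable snapshot, so "recover" means re-publishing a known-good version. Delta and Iceberg expose this natively (version/snapshot reads and restore); the class below is a sketch of the idea, not their API.

```python
class VersionedTable:
    """Toy time-travel store: writes append immutable snapshots, so a bad
    transform is undone by restoring the last known-good version."""

    def __init__(self):
        self.versions = []

    def write(self, rows: list) -> int:
        self.versions.append(list(rows))
        return len(self.versions) - 1  # version id of this commit

    def read(self, version: int = -1) -> list:
        return self.versions[version]  # default: latest

    def restore(self, version: int) -> int:
        # Restore is itself a new commit, preserving full history for audit.
        return self.write(self.versions[version])

t = VersionedTable()
good = t.write([1, 2, 3])   # known-good load
t.write([1, 2, 999999])     # corrupted transform lands
t.restore(good)             # revert: latest read is clean again
print(t.read())  # [1, 2, 3]
```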
-
Compare Hive Metastore and a cloud data catalog (e.g., AWS Glue Data Catalog) for table management in lakes. How do you manage schemas, partitions, and permissions at scale?
Hive Metastore: self-managed, widely supported, good for on-prem or IaaS clusters. Glue Data Catalog: managed, serverless, integrates with AWS IAM/Lake Formation, crawlers, and multi-service sharing. At scale: enforce schema registries and versioning, automate partition registration (MSCK REPAIR, auto-sync), use table formats supporting metadata pruning (Iceberg/Delta), and centralize permissions via fine-grained access controls (Lake Formation/Unity Catalog). Maintain naming conventions, stewardship ownership, and CI/CD for catalog changes.
-
A JSON field adds new nested attributes. How do you evolve schemas safely across bronze → silver → gold without breaking consumers? Discuss contract tests and backward compatibility.
Bronze: store raw JSON and capture schema snapshots. Silver: apply permissive parsing with default values for new fields; avoid dropping unknowns. Gold: version the semantic model; add columns as nullable; communicate deprecation timeline for breaking changes. Use contract tests to validate required fields, types, and constraints; run in CI before promotion. Maintain consumer-facing views to preserve backward compatibility while rolling out model v2.
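The "permissive parsing with default values" step at silver can be sketched as a defaults-then-overlay merge: missing fields get contract defaults, unknown new fields pass through untouched. `parse_event` and the field names are illustrative.

```python
import copy

def parse_event(raw: dict, schema_defaults: dict) -> dict:
    """Permissive silver-layer parse: start from contract defaults, then
    overlay whatever the producer sent. New fields are kept, missing
    fields are defaulted, so old consumers and new producers coexist."""
    event = copy.deepcopy(schema_defaults)  # don't mutate the shared contract
    event.update(raw)
    return event

defaults = {"user_id": None, "device": {"os": "unknown"}}

old = parse_event({"user_id": 1}, defaults)                     # pre-evolution event
new = parse_event({"user_id": 2,
                   "device": {"os": "ios", "model": "x"}},      # new nested attribute
                  defaults)
print(old["device"]["os"], new["device"]["model"])  # unknown x
```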
-
Your monthly lakehouse bill spiked. What's your checklist? (e.g., storage classes, lifecycle policies, compaction, caching, pruning, query rewrite, job scheduling, and spot instances.)
Checklist: (1) Storage: enable lifecycle/ILM policies (transition older data to infrequent-access/archival tiers), compress, dedupe, compact small files; (2) Compute: right-size clusters/warehouses, enable autoscaling/auto-stop, use spot/preemptible instances where safe; (3) Query: ensure partition pruning and predicate pushdown, limit SELECT *, add filters, materialize aggregates, cache hot datasets; (4) Scheduling: consolidate jobs, avoid overlapping heavy windows; (5) Catalog: drop orphaned snapshots/checkpoints; (6) Monitoring: track unit costs per workload, set budgets and alerts.
-
How do you enforce quality in pipelines (null checks, uniqueness, referential integrity, threshold monitors)? Where do data contracts live and how are they validated?
Quality: add assertions at bronze (schema/format), silver (dedupe, type, ranges), and gold (metric reasonability). Use frameworks like Deequ/Great Expectations or dbt tests. Monitor thresholds (e.g., % nulls, row counts) with alerting. Contracts: live alongside code in repo (YAML/JSON schemas), versioned, and validated in CI. At runtime, enforce via schema-on-read/write, with quarantine paths for violations and clear SLAs for remediation.
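A threshold monitor of the kind described reduces to a few assertions over a batch; frameworks like Deequ or Great Expectations generalize this. The 5% null threshold, the `id` key, and `check_quality` itself are illustrative assumptions.

```python
def check_quality(rows: list, max_null_pct: float = 0.05, key: str = "id") -> list:
    """Threshold-monitor sketch: return contract violations for a batch.
    Thresholds are illustrative; real ones come from the data contract."""
    n = len(rows)
    nulls = sum(1 for r in rows if r.get(key) is None)
    keys = [r[key] for r in rows if r.get(key) is not None]
    failures = []
    if n and nulls / n > max_null_pct:
        failures.append(f"null_pct {nulls / n:.2%} > {max_null_pct:.0%}")
    if len(keys) != len(set(keys)):
        failures.append("duplicate keys")
    return failures

bad = [{"id": 1}, {"id": 1}, {"id": None}]
print(check_quality(bad))  # ['null_pct 33.33% > 5%', 'duplicate keys']
```

An empty result means the batch passes and may be promoted; any failure routes the batch to a quarantine path, as the answer above describes.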
-
Given a high-write, geo-distributed user session store, which database family would you choose (NoSQL, NewSQL, OLTP RDBMS) and why? Discuss consistency models and failure scenarios.
Likely NoSQL (e.g., DynamoDB/Cassandra) for high write throughput and multi-region replication with tunable consistency. If strict global consistency and SQL transactions are mandatory, consider NewSQL (e.g., Spanner/CockroachDB) with higher latency/cost. Choose consistency per operation: eventual for writes to keep availability, strong/transactional for critical read-modify-write flows. Plan for regional failover, hinted handoff, and conflict resolution policies.
-
Executives want sub-second dashboard queries on gold datasets. What design patterns (star schema, materialized views, aggregates, cube/semantic layers, caching) would you apply?
Use star/snowflake schemas to simplify joins; pre-aggregate with materialized views or aggregate tables; add semantic layer/cubes for metrics governance; enable caching at query engine and BI layer; cluster/partition tables by filter columns; consider columnar storage and vectorized execution. For very hot metrics, precompute daily/hourly rollups and use incremental refresh.
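The precomputed-rollup pattern can be sketched as a grouped count/sum keyed by the dashboard's filter grain; dashboards then scan this tiny table instead of raw events. `hourly_rollup` and the event shape are assumptions for illustration; in a warehouse this would be a materialized view with incremental refresh.

```python
from collections import defaultdict

def hourly_rollup(events: list) -> dict:
    """Pre-aggregate (hour, metric) -> (count, sum) so sub-second
    dashboards read the rollup, not the raw event stream."""
    agg = defaultdict(lambda: [0, 0.0])
    for hour, metric, value in events:
        cell = agg[(hour, metric)]
        cell[0] += 1
        cell[1] += value
    return {k: (count, total) for k, (count, total) in agg.items()}

events = [(9, "latency_ms", 120.0), (9, "latency_ms", 80.0), (10, "latency_ms", 50.0)]
rollup = hourly_rollup(events)
print(rollup[(9, "latency_ms")])  # (2, 200.0) -> avg derivable as 100.0
```

Storing count and sum (rather than the average) keeps the rollup mergeable across incremental refreshes.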
-
You must land PII into a lake. Outline your approach to encryption (at rest/in transit), IAM, fine-grained access controls, tokenization/masking, auditability, and data retention.
Encryption: TLS in transit; KMS-managed keys at rest with bucket/table-level policies. IAM: least privilege, role-based access, scoped temp credentials. Fine-grained controls: column/row-level policies (e.g., Lake Formation/Unity Catalog), views for masked data. Data handling: tokenize or pseudonymize at ingestion; separate secure zones for raw PII; minimize propagation to silver/gold. Audit: enable object access logs and query audit logs; regular reviews. Retention: legal holds, automated deletion per policy, and documented DPIA/compliance mapping (GDPR/CCPA).
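The tokenize/mask step can be sketched with a keyed hash: deterministic, so tokens remain joinable across tables, but not reversible without the key. The hardcoded `SECRET` stands in for a KMS-managed key and must never be a literal in real code; `tokenize` and `mask_email` are illustrative names.

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # assumption: in production this comes from KMS, never source code

def tokenize(pii_value: str) -> str:
    """Deterministic pseudonymization via HMAC-SHA256: same input yields
    the same token (joinable), irreversible without the key."""
    return hmac.new(SECRET, pii_value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_email(email: str) -> str:
    """Display masking for gold-layer views: keep first char and domain."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

print(tokenize("alice@example.com") == tokenize("alice@example.com"))  # True: joinable
print(mask_email("alice@example.com"))  # a***@example.com
```

Tokenization at ingestion keeps raw PII confined to the secure bronze zone, while silver/gold carry only tokens and masked views.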
