Apache Spark MasterClass Chapter 3 – Episode 2
-
Walk through the Monte Carlo π estimation with PySpark and discuss pitfalls at scale.
Create a SparkContext, define `inside()` that samples a random (x, y) pair and tests `x**2 + y**2 < 1`, then run `sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()` and compute `π ≈ 4 * count / NUM_SAMPLES`. Pitfalls: calling Python `random` once per element in a tight loop is slow; prefer per-partition generators via `mapPartitions` (see the sketch below). Skewed partition sizes or too few partitions reduce parallelism. Consider the RDD API with explicit partitioning, or switch to Scala for lower per-element overhead.
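A minimal sketch of the `mapPartitions` variant; `NUM_SAMPLES` and the partition count are illustrative values, not settings from the course:

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi-demo").getOrCreate()
sc = spark.sparkContext

NUM_SAMPLES = 10_000_000   # illustrative size
PARTITIONS = 8             # illustrative partition count

def count_inside(indices):
    # One RNG per partition instead of a module-level random() call per element.
    rng = random.Random()
    inside = 0
    for _ in indices:
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            inside += 1
    yield inside

count = sc.parallelize(range(NUM_SAMPLES), PARTITIONS).mapPartitions(count_inside).sum()
print(f"pi ~= {4 * count / NUM_SAMPLES}")
```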
-
How do you ingest a simple CSV into PySpark and what options matter for correctness and performance?
`spark.read.csv(path, header=True, sep=',', inferSchema=True)` loads the file; specify the schema explicitly for reliability. Use `multiLine`, `quote`, `escape`, and `mode` to handle messy data. For performance, compact many small input files, skip `inferSchema` on large inputs in favor of an explicit schema, and cache downstream only if the result is reused.
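A hedged example of reading a messy CSV; the path and option values are placeholders, not settings from the course:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", True)
      .option("sep", ",")
      .option("quote", '"')
      .option("escape", "\\")
      .option("multiLine", True)       # records that span lines
      .option("mode", "PERMISSIVE")    # or DROPMALFORMED / FAILFAST
      .option("inferSchema", True)     # convenient, but adds an extra pass over the file
      .csv("/data/raw/events.csv"))    # placeholder path

df.cache()  # only worthwhile if the DataFrame is reused downstream
```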
-
Contrast SparkSession and SparkContext across PySpark and Scala.
SparkSession is the unified entry point for the SQL, DataFrame, and Catalog APIs; it wraps a SparkContext. In Scala: `SparkSession.builder().getOrCreate()`; in PySpark: `SparkSession.builder.getOrCreate()`. There is one active SparkContext per JVM; access it via `spark.sparkContext`.
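A small sketch showing both entry points side by side (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()
sc = spark.sparkContext            # the single underlying SparkContext

rdd = sc.parallelize([1, 2, 3])    # low-level RDD API via SparkContext
df = spark.range(3)                # DataFrame/SQL API via SparkSession
print(rdd.count(), df.count())
```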
-
Explain jobs, stages, and tasks using `df.count` as the example.
`df.count` is an action, so it triggers a job. Spark splits the job into stages at shuffle boundaries—typically per-partition partial counts followed by a final aggregation stage. Each stage comprises tasks (one per partition), and all tasks in a stage must finish before the next stage starts.
-
What causes shuffles and how can you mitigate their cost?
Shuffles occur on wide transformations (join, groupBy, distinct, repartition). Mitigate with partition pruning, broadcast joins (`spark.sql.autoBroadcastJoinThreshold`), salting keys, using `reduceByKey`/map-side aggregations, and tuning `spark.sql.shuffle.partitions` to balance parallelism vs. overhead.
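A sketch of two of these mitigations, a broadcast join plus a shuffle-partition setting; the paths, join key, and the value 200 are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")  # tune to data volume and cluster size

orders = spark.read.parquet("/data/orders")            # large fact table (placeholder path)
countries = spark.read.parquet("/data/countries")      # small dimension table (placeholder path)

# Broadcasting the small side ships it to every executor, so the large table is not shuffled.
joined = orders.join(broadcast(countries), "country_code")
```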
-
How would you run the same notebook on local, Docker, Databricks, and Codespaces with minimal changes?
Abstract file paths via environment variables or widgets; keep data in mounted volumes (Docker) or DBFS (Databricks). Install matching Spark versions and drivers; avoid hard-coded absolute paths. Package dependencies via requirements/conda or cluster libraries.
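One way to abstract paths, assuming an environment variable such as `DATA_ROOT` is defined per environment (the variable name is an assumption, not from the course):

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# e.g. /mnt/data locally, a mounted volume in Docker, or dbfs:/mnt/data on Databricks
data_root = os.environ.get("DATA_ROOT", "/mnt/data")
df = spark.read.parquet(f"{data_root}/events")
```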
-
When is it better to define a schema instead of relying on `inferSchema` for CSVs?
When types are known, for consistency and performance (no full-file sampling). Explicit schemas prevent type drift, truncation, or costly inference on large datasets, and enable predicate pushdown where applicable.
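A hedged sketch of an explicit schema; the columns, types, and path are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# No sampling pass over the file: the schema is applied as-is while parsing.
df = spark.read.option("header", True).schema(schema).csv("/data/raw/sales.csv")  # placeholder path
```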
-
Demonstrate a minimal Scala Spark app skeleton and explain each part.
`object App extends App { val spark = SparkSession.builder.master("local[1]").appName("Demo").getOrCreate(); import spark.implicits._ }` (plus `import org.apache.spark.sql.SparkSession` at the top). `object … extends App` sets the entry point, `builder` configures the master URL and app name, and `spark.implicits._` provides conversions and encoders.
-
How do you diagnose performance issues when `NUM_SAMPLES` is very large in the π demo?
Inspect the Spark UI for stage/task times and skew; increase partitions (`parallelize(…, numSlices)`), avoid Python UDF overhead by batching work in `mapPartitions`, and verify the driver/executor memory and cores. Benchmark smaller samples and scale up; use `persist` only when reusing RDDs.
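A sketch of batching the sampling inside each partition with NumPy to cut per-element Python overhead; the sample size, partition multiplier, and the use of NumPy are assumptions, not part of the original demo:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

NUM_SAMPLES = 100_000_000
PARTITIONS = sc.defaultParallelism * 4   # smaller tasks make skew easier to spot in the UI

def count_inside(indices):
    n = sum(1 for _ in indices)          # size of this partition
    rng = np.random.default_rng()
    x, y = rng.random(n), rng.random(n)  # vectorized sampling instead of a per-element loop
    yield int(np.count_nonzero(x * x + y * y < 1.0))

count = sc.parallelize(range(NUM_SAMPLES), PARTITIONS).mapPartitions(count_inside).sum()
print(4 * count / NUM_SAMPLES)
```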
-
What are best practices for organizing paths and secrets when reading JDBC sources?
Externalize credentials via environment variables or secrets managers; avoid plaintext in notebooks. Parameterize JDBC URLs and tables; use `partitionColumn/lowerBound/upperBound/numPartitions` for parallel reads; enable SSL and least-privilege DB users.
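A hedged JDBC example; the environment variable names, table, and bounds are placeholders:

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Credentials come from the environment or a secrets manager, never from the notebook text.
jdbc_url = os.environ["JDBC_URL"]        # e.g. jdbc:postgresql://host:5432/shop
user = os.environ["DB_USER"]
password = os.environ["DB_PASSWORD"]

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "public.orders")      # placeholder table
      .option("user", user)
      .option("password", password)
      .option("partitionColumn", "order_id")   # numeric column used to split the read
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())
```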
-
Give examples of narrow vs. wide transformations and why it matters.
Narrow: `map`, `flatMap`, `filter` (no shuffle). Wide: `join`, `groupByKey`, `repartition` (shuffle). Wide ops incur network and disk I/O; designing pipelines to minimize them is crucial for scalability.
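A small RDD illustration (the values are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), 4)
narrow = rdd.map(lambda x: (x % 10, x)).filter(lambda kv: kv[1] > 5)  # narrow: stays within partitions
wide = narrow.reduceByKey(lambda a, b: a + b)                         # wide: shuffles rows by key
print(wide.count())
```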
-
How does `spark.read.format('parquet').load(path)` differ from reading CSV?
Parquet is columnar, typed, and supports predicate pushdown and compression out-of-the-box—no schema guessing needed. CSV is row-oriented text; requires parsing and type handling. Parquet is preferred for performance and storage efficiency.
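A sketch of converting a CSV to Parquet once and reading it back; the paths and the filter column are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

csv_df = (spark.read.option("header", True).option("inferSchema", True)
          .csv("/data/raw/trips.csv"))                  # text: parsing plus type inference

csv_df.write.mode("overwrite").parquet("/data/curated/trips")

# Columnar, typed, compressed; the filter can be pushed down to the scan.
pq_df = spark.read.parquet("/data/curated/trips").where("trip_distance > 10")
```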
-
What are common mistakes beginners make when moving from REPL to packaged apps?
Hard-coding paths/credentials, mismatched Spark/Scala versions, missing shaded dependencies, relying on driver-local files, and not testing with representative data sizes. Use config files, CI builds, and environment-specific submit scripts.
-
How can you keep notebooks reproducible across environments?
Pin library versions, check notebooks into version control, keep data in deterministic locations (e.g., /mnt/data, DBFS paths), use widgets/parameters, and export notebooks to scripts/JARs for scheduled runs.
-
Describe how Spark’s DAG and lazy evaluation improve efficiency for the CSV → count pipeline.
Spark records transformations into a logical plan and only optimizes and executes it when an action (`count`) is called. The optimizer can pipeline narrow operations into a single stage, push filters down toward the scan, and prune unneeded columns, so the data is read in as few passes as possible with fewer shuffles.
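A quick way to see this on the CSV → count pipeline; the path and filter are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", True).csv("/data/raw/events.csv")  # placeholder path
filtered = df.where("country = 'US'")   # transformation only: nothing executes yet
filtered.explain(True)                  # inspect the logical, optimized, and physical plans
print(filtered.count())                 # action: triggers the single optimized job
```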
