Apache Spark MasterClass Chapter 3 – Episode 2

  1. Walk through the Monte Carlo π estimation with PySpark and discuss pitfalls at scale.

    Create a SparkContext, define `inside()` that samples random (x,y) and tests `x^2+y^2<1`, then `sc.parallelize(range(NUM_SAMPLES)).filter(inside).count()` and compute `π ≈ 4*count/NUM_SAMPLES`. Pitfalls: Python `random` in tight loops can be slow; prefer per-partition generators via `mapPartitions`. Skewed partition sizes or too few partitions reduce parallelism. Consider using RDD APIs with proper partitioning or switch to Scala for lower overhead.
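    A runnable sketch of the per-partition-generator approach, assuming an existing SparkSession; `count_inside` and `estimate_pi` are illustrative names, not Spark APIs:

    ```python
    import random


    def count_inside(seed, n):
        """Count n samples of (x, y) in [0, 1)^2 inside the unit quarter-circle."""
        rng = random.Random(seed)  # one generator per partition, not per sample
        hits = 0
        for _ in range(n):
            x, y = rng.random(), rng.random()
            if x * x + y * y < 1.0:
                hits += 1
        return hits


    def estimate_pi(sc, num_samples, num_partitions=8):
        """Run one seeded counting task per partition instead of one Python call per sample."""
        per_partition = num_samples // num_partitions

        def part_counts(index, _iterator):
            # the partition index doubles as the RNG seed
            yield count_inside(index, per_partition)

        total = (sc.parallelize(range(num_partitions), num_partitions)
                   .mapPartitionsWithIndex(part_counts)
                   .sum())
        return 4.0 * total / (per_partition * num_partitions)

    # Usage (driver side):
    # spark = SparkSession.builder.master("local[*]").appName("pi").getOrCreate()
    # print(estimate_pi(spark.sparkContext, 10_000_000))
    ```

    Moving the sampling loop inside a single function per partition avoids the per-element Python call overhead of `filter(inside)`.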

  2. How do you ingest a simple CSV into PySpark and what options matter for correctness and performance?

    `spark.read.csv(path, header=True, sep=',', inferSchema=True)` loads the file; specify the schema explicitly for reliability. Use `multiLine`, `quote`, `escape`, and `mode` for messy data. For performance, coalesce many small files and cache downstream if the result is reused.
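    A hedged sketch of a reusable CSV reader; the column names, path, and function name are illustrative. A DDL schema string doubles as documentation and skips the inference pass over the file:

    ```python
    def read_events_csv(spark, path):
        # Explicit types via a DDL string: no sampling pass, no type drift.
        schema = "id INT, name STRING, amount DOUBLE, created_at TIMESTAMP"
        return (spark.read
                     .schema(schema)
                     .option("header", True)
                     .option("sep", ",")
                     .option("multiLine", True)      # quoted values containing newlines
                     .option("quote", '"')
                     .option("escape", '"')
                     .option("mode", "PERMISSIVE")   # or DROPMALFORMED / FAILFAST
                     .csv(path))

    # Usage: df = read_events_csv(spark, "/mnt/data/events.csv")
    ```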

  3. Contrast SparkSession and SparkContext across PySpark and Scala.

    SparkSession is the unified entry for SQL/DataFrame/Catalog; it wraps SparkContext. In Scala: `SparkSession.builder().getOrCreate()`; in PySpark: `SparkSession.builder.getOrCreate()`. There is one active SparkContext per JVM; access it via `spark.sparkContext`.

  4. Explain jobs, stages, and tasks using `df.count` as the example.

    `df.count` is an action that creates a job. Spark splits it into stages at shuffle boundaries—typically map-side counts then a reduce/aggregate stage. Each stage comprises tasks (one per partition). All tasks in a stage must finish before moving to the next stage.

  5. What causes shuffles and how can you mitigate their cost?

    Shuffles occur on wide transformations (join, groupBy, distinct, repartition). Mitigate with partition pruning, broadcast joins (`spark.sql.autoBroadcastJoinThreshold`), salting keys, using `reduceByKey`/map-side aggregations, and tuning `spark.sql.shuffle.partitions` to balance parallelism vs. overhead.
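    A minimal sketch of key salting for a skewed join; all names here are illustrative. The large side tags each key with a random salt, and the small side is replicated once per salt value so every salted key still finds its match:

    ```python
    import random


    def salt_key(key, num_salts, rng=random):
        """Tag a key on the large (skewed) side with a random salt bucket."""
        return (key, rng.randrange(num_salts))


    def replicate_key(key, num_salts):
        """Emit every (key, salt) pair for the small side so the join still matches."""
        return [(key, s) for s in range(num_salts)]

    # Usage with RDDs (hypothetical):
    # big = big_rdd.map(lambda kv: (salt_key(kv[0], 8), kv[1]))
    # small = small_rdd.flatMap(lambda kv: [(sk, kv[1]) for sk in replicate_key(kv[0], 8)])
    # joined = big.join(small)
    ```

    Salting trades a small blow-up of the small side for tasks of roughly equal size on the large side.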

  6. How would you run the same notebook on local, Docker, Databricks, and Codespaces with minimal changes?

    Abstract file paths via environment variables or widgets; keep data in mounted volumes (Docker) or DBFS (Databricks). Install matching Spark versions and drivers; avoid hard-coded absolute paths. Package dependencies via requirements/conda or cluster libraries.
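    A small sketch of the path abstraction; the `DATA_ROOT` variable name and the default mount point are illustrative choices:

    ```python
    import os


    def data_path(name, default_root="/mnt/data"):
        """Resolve a dataset path from DATA_ROOT so the same notebook works
        locally, in Docker (mounted volume), or on Databricks (DBFS mount)."""
        root = os.environ.get("DATA_ROOT", default_root)
        return f"{root.rstrip('/')}/{name}"

    # Usage: df = spark.read.parquet(data_path("events.parquet"))
    ```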

  7. When is it better to define a schema instead of relying on `inferSchema` for CSVs?

    When types are known, for consistency and performance (no full-file sampling). Explicit schemas prevent type drift, truncation, or costly inference on large datasets, and enable predicate pushdown where applicable.

  8. Demonstrate a minimal Scala Spark app skeleton and explain each part.

    `object App extends App { val spark = SparkSession.builder.master("local[1]").appName("Demo").getOrCreate(); import spark.implicits._ }`. `object … extends App` sets the entry point, `builder` configures master/app name, and `implicits` provides conversions/encoders.

  9. How do you diagnose performance issues when `NUM_SAMPLES` is very large in the π demo?

    Inspect the Spark UI for stage/task times and skew; increase partitions (`parallelize(…, numSlices)`), avoid Python UDF overhead by batching work in `mapPartitions`, and verify the driver/executor memory and cores. Benchmark smaller samples and scale up; use `persist` only when reusing RDDs.
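    One way to quantify skew from the driver is a max-to-mean ratio over per-partition record counts; `skew_ratio` is a hypothetical helper, not a Spark API:

    ```python
    def skew_ratio(partition_counts):
        """Max-to-mean ratio of records per partition; values well above 1.0
        point at straggler tasks worth checking in the Spark UI."""
        mean = sum(partition_counts) / len(partition_counts)
        return max(partition_counts) / mean if mean else 0.0

    # Usage (glom() returns each partition as a list, so map(len) gives sizes):
    # sizes = rdd.glom().map(len).collect()
    # print(skew_ratio(sizes))
    ```

    Note that `glom().map(len).collect()` runs a full job, so use it on a sampled or scaled-down run.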

  10. What are best practices for organizing paths and secrets when reading JDBC sources?

    Externalize credentials via environment variables or secrets managers; avoid plaintext in notebooks. Parameterize JDBC URLs and tables; use `partitionColumn/lowerBound/upperBound/numPartitions` for parallel reads; enable SSL and least-privilege DB users.
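    A hedged sketch of a parallel JDBC read with credentials pulled from the environment; the env var names, table, column, and bounds are illustrative:

    ```python
    import os


    def read_orders(spark):
        """Parallel range-partitioned JDBC read; nothing sensitive in the notebook."""
        return (spark.read.format("jdbc")
                     .option("url", os.environ["ORDERS_JDBC_URL"])
                     .option("dbtable", "orders")
                     .option("user", os.environ["ORDERS_DB_USER"])
                     .option("password", os.environ["ORDERS_DB_PASSWORD"])
                     .option("partitionColumn", "order_id")  # numeric, ideally indexed
                     .option("lowerBound", "1")
                     .option("upperBound", "1000000")
                     .option("numPartitions", "8")           # 8 parallel range scans
                     .load())
    ```

    Each of the 8 tasks issues its own range query between the bounds, so the bounds should roughly bracket the actual key range to keep partitions balanced.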

  11. Give examples of narrow vs. wide transformations and why it matters.

    Narrow: `map`, `flatMap`, `filter` (no shuffle). Wide: `join`, `groupByKey`, `repartition` (shuffle). Wide ops incur network and disk I/O; designing pipelines to minimize them is crucial for scalability.

  12. How does `spark.read.format('parquet').load(path)` differ from reading CSV?

    Parquet is columnar, typed, and supports predicate pushdown and compression out-of-the-box—no schema guessing needed. CSV is row-oriented text; requires parsing and type handling. Parquet is preferred for performance and storage efficiency.

  13. What are common mistakes beginners make when moving from REPL to packaged apps?

    Hard-coding paths/credentials, mismatched Spark/Scala versions, missing shaded dependencies, relying on driver-local files, and not testing with representative data sizes. Use config files, CI builds, and environment-specific submit scripts.

  14. How can you keep notebooks reproducible across environments?

    Pin library versions, check notebooks into version control, keep data in deterministic locations (e.g., /mnt/data, DBFS paths), use widgets/parameters, and export notebooks to scripts/JARs for scheduled runs.

  15. Describe how Spark's DAG and lazy evaluation improve efficiency for the CSV → count pipeline.

    Spark builds a logical plan from transformations, then optimizes it prior to execution upon an action (`count`). It can collapse stages, push filters, and coalesce scans, executing with minimal passes over data and reduced shuffles.
