Apache Spark MasterClass Chapter 2 – Episode 2

  1. When should you prefer the pyspark shell vs. creating a SparkSession in a plain Python/IPython REPL?

    Use the pyspark shell when you want zero setup convenience: it pre-wires sc and spark, sets classpaths, and exposes the Spark UI immediately. Use a plain Python/IPython REPL when integrating with existing tooling (e.g., data science notebooks, IDE debuggers) or when you need explicit control over SparkSession creation (e.g., custom configs, packages, shuffle settings) via the builder pattern.
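    The builder pattern mentioned above can be sketched as follows; the app name and shuffle setting are illustrative, not prescribed by the text.

    ```python
    # Minimal sketch: building a SparkSession explicitly in a plain
    # Python/IPython REPL (app name and configs are illustrative).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("repl-session")                      # appears in the Spark UI
        .master("local[*]")                           # all local cores
        .config("spark.sql.shuffle.partitions", "8")  # custom config, set before first use
        .getOrCreate()
    )

    sc = spark.sparkContext  # the pyspark shell pre-wires this name for you
    ```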

  2. Explain what getOrCreate() does and why your config changes sometimes appear to be ignored in interactive sessions.

    getOrCreate() returns the current SparkSession if one exists; otherwise it creates a new one. Because the JVM is already started for an existing session, late changes to fundamental configs (e.g., executor memory, driver JVM options, some Spark SQL settings) won’t take effect. To apply them reliably, stop the active session and restart the REPL or your process, then construct SparkSession with the desired configs before first use.
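    A sketch of this behavior; the session names ("first", "fresh") are illustrative.

    ```python
    # getOrCreate() pitfall: a second builder returns the SAME session.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("first").master("local[*]").getOrCreate()

    # Late JVM/core-level options (app name, driver memory, ...) are ignored
    # here because the JVM is already running; Spark logs a warning instead.
    spark2 = SparkSession.builder.appName("second").getOrCreate()
    print(spark2 is spark)                 # True: same underlying session
    print(spark2.sparkContext.appName)     # still "first"

    # Reliable pattern: stop, then rebuild with configs set before first use.
    # (Driver JVM options may still require restarting the REPL/process itself.)
    spark.stop()
    spark = SparkSession.builder.appName("fresh").master("local[*]").getOrCreate()
    ```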

  3. How do you tame noisy logs while still preserving useful diagnostics?

    At runtime, call spark.sparkContext.setLogLevel("ERROR") (or "WARN") to silence INFO chatter for the current application. For persistent control, copy conf/log4j2.properties.template (log4j.properties.template on older Spark versions) to the live file and set the root logger to WARN while keeping the loggers you care about at INFO. Pair this with event logging so the history server retains full diagnostics even when console output is quiet.

  4. You see 'hostname resolves to a loopback address' and the UI shows 127.0.0.1:4040. You’re on Wi-Fi and peers can’t access your UI. What do you do?

    Bind the driver to a non-loopback interface by setting SPARK_LOCAL_IP to your machine’s LAN IP before starting Spark, or configure spark.driver.host. Verify with ifconfig/ipconfig. Restart the session so the binding takes effect. Remember that exposing UIs on a laptop has security implications; prefer SSH tunnels on shared networks.
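    A sketch of the environment-variable approach; 192.168.1.50 is a placeholder for your own LAN address.

    ```shell
    # Bind the driver to a LAN address instead of loopback.
    # 192.168.1.50 is a placeholder; find yours with ifconfig / ip addr / ipconfig.
    export SPARK_LOCAL_IP=192.168.1.50
    # then restart the session so the binding takes effect:
    #   pyspark
    # or set it per application instead of via the environment:
    #   pyspark --conf spark.driver.host=192.168.1.50
    ```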

  5. What’s the difference between the UI at :4040 and other Spark UIs such as the standalone master UI?

    Port 4040 is a per-application UI (Driver UI) showing Jobs, Stages, Storage, SQL, and Environment for that specific run. The standalone master UI (default 8080) shows cluster-level resources (workers, apps). The history server (often 18080) displays finished app UIs from event logs. Knowing which UI to consult speeds up debugging.

  6. How would you reproduce a long-running failure quickly using the REPL before writing a full job?

    Minimize the dataset size (sample/limit), run the same transformations interactively, and call actions (e.g., count, show) to force execution. Inspect the UI DAG and stage metrics. Once logic is correct and stable, scale up gradually, cache reusable intermediates, and add checkpoints if necessary.
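    A sketch of that shrink-then-execute loop; the pipeline is synthetic (spark.range stands in for your real input).

    ```python
    # Shrink the input, run the same transformations, force execution with
    # an action, then inspect the UI at :4040.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("repro").master("local[*]").getOrCreate()

    full = spark.range(1_000_000).withColumn("user_id", F.col("id") % 100)

    small = full.sample(fraction=0.01, seed=42)   # or full.limit(10_000)

    result = small.groupBy("user_id").agg(F.count("*").alias("n"))

    result.count()   # action: forces the whole pipeline to execute now
    result.show(5)   # eyeball a few rows; then check the DAG and stage metrics
    ```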

  7. Describe the major directories in a Spark binary and when you use each.

    bin contains user CLIs (pyspark, spark-sql, spark-submit). sbin has admin scripts (start-master.sh, start-worker.sh). examples hosts sample code; data has toy inputs; jars holds Spark and dependency JARs; conf has templates (spark-defaults.conf, log4j). Understanding this layout helps when customizing configs or launching in different modes.

  8. How do you run multiple pyspark sessions without UI port conflicts?

    Spark assigns 4040 to the first app, then 4041, 4042, etc. If a port is stuck (zombie process), kill the process or set spark.ui.port to a free port before starting the session. In scripts: spark-submit --conf spark.ui.port=4050. In notebooks, set the conf prior to session creation.
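    Setting the port in a notebook can be sketched as follows; 4050 is an arbitrary free port.

    ```python
    # Pin the UI port before the session exists (4050 is arbitrary).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("second-session")
             .master("local[*]")
             .config("spark.ui.port", "4050")   # must be set before creation
             .getOrCreate())
    ```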

  9. What practical benefits does IPython provide over the default Python REPL for PySpark work?

    Multiline paste without syntax mangling, syntax highlighting, better history and search, magics for timing/profiling, richer tracebacks, and integration with Jupyter. These increase productivity during exploratory development.

  10. How do you safely change Spark configuration mid-session?

    Some runtime SQL configs can be changed via spark.conf.set, but JVM-level and many core Spark configs are fixed after the context starts. To avoid undefined behavior, stop the active session (spark.stop()), ensure the JVM exits if necessary, then rebuild SparkSession with the new configuration.
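    A sketch of that pattern; the config values are illustrative.

    ```python
    # Safe mid-session configuration pattern.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("conf-demo").master("local[*]").getOrCreate()

    # Runtime SQL configs are fine to change live:
    spark.conf.set("spark.sql.shuffle.partitions", "16")

    # Core/JVM configs are fixed once the context starts; e.g. recent Spark
    # rejects this rather than silently ignoring it:
    #   spark.conf.set("spark.executor.memory", "8g")

    # For fixed configs: stop, then rebuild with them set up front.
    spark.stop()
    spark = (SparkSession.builder.appName("conf-demo-2").master("local[*]")
             .config("spark.sql.shuffle.partitions", "32")
             .getOrCreate())
    ```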

  11. You launched pyspark and got 'Unable to load native-hadoop library'. Should you fix it for local learning?

    Usually not required. It’s a warning that native Hadoop bindings aren’t present; Spark will use pure Java fallbacks. Only address it if you need native features (e.g., performance on certain filesystems) or you’re preparing a production environment.

  12. What does master=local[*] imply for parallelism, and when would you change it?

    local[*] uses as many threads as available CPU cores on the driver machine. For deterministic tests you might set local[1]. When targeting a real cluster, set master to spark://…, yarn, or k8s and submit via spark-submit or configure in the builder before session creation.
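    A sketch of the deterministic-test setup, with the cluster masters shown as comments; hostnames are placeholders.

    ```python
    # Single-threaded master for deterministic tests.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("deterministic-test")
             .master("local[1]")    # one task at a time
             .getOrCreate())

    print(spark.sparkContext.defaultParallelism)   # 1

    # For a real cluster, set the master before session creation, e.g.:
    #   .master("spark://master-host:7077")   # standalone
    #   .master("yarn")                       # YARN (HADOOP_CONF_DIR must be set)
    ```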

  13. How would you instrument a PySpark job for easier post-mortem analysis?

    Give a unique, descriptive appName; enable event logging to a durable path; attach custom log4j configuration; emit metrics via Dropwizard/Spark metrics system; and persist intermediate checkpoints. This ensures the history server can display a full UI after completion or failure.
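    A sketch of the event-logging piece in conf/spark-defaults.conf; the app name and directory are placeholders, and the directory must exist before launch.

    ```
    # Give each run a unique, descriptive name
    spark.app.name                 nightly-etl
    # Persist event logs so the history server (default :18080) can replay the UI
    spark.eventLog.enabled         true
    spark.eventLog.dir             file:///var/log/spark-events
    spark.history.fs.logDirectory  file:///var/log/spark-events
    ```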

  14. What are common causes of 'job died near the end' and how do you protect against them?

    Skewed keys causing single-task hotspots; insufficient executor/driver memory; unpersisted cached data leading to recomputation; stage retries exhausted due to flaky data sources. Mitigate with salting or skew hints, memory tuning, broadcast joins where the small side fits in memory, cache/persist with correct storage levels, checkpointing long lineages, and validating input data early.
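    Two of the mitigations above, broadcast joins and key salting, can be sketched as follows; all data here is synthetic.

    ```python
    # Broadcast join for a small dimension table, plus salting a hot key.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("skew-demo").master("local[*]").getOrCreate()

    facts = spark.range(10_000).withColumn("key", F.col("id") % 3)   # few hot keys
    dim = spark.createDataFrame([(0, "a"), (1, "b"), (2, "c")], ["key", "label"])

    # 1) Broadcast the small side: the big table is not shuffled for the join.
    joined = facts.join(F.broadcast(dim), "key")

    # 2) Salt the key: spread one hot key across N partial aggregates,
    #    then strip the salt and combine.
    N = 8
    salted = facts.withColumn(
        "salted_key",
        F.concat_ws("_", F.col("key"), (F.rand(seed=1) * N).cast("int")))
    partial = salted.groupBy("salted_key").count()
    final = (partial
             .withColumn("key", F.split("salted_key", "_").getItem(0))
             .groupBy("key").agg(F.sum("count").alias("count")))
    ```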

  15. How do you map older RDD-centric code to modern DataFrame/Spark SQL patterns without losing control?

    Start with SparkSession for DataFrame APIs; many RDD transforms map to DataFrame operations (map → withColumn/transform, reduceByKey → groupBy+agg). Keep RDDs for low-level or irregular operations, but prefer DataFrames for optimizer benefits (Catalyst, Tungsten). Use spark.sparkContext for interoperability when needed.
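    The mapping above can be sketched with a classic word count; the sample lines are illustrative.

    ```python
    # RDD-style word count vs. the DataFrame equivalent.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("rdd-to-df").master("local[*]").getOrCreate()
    sc = spark.sparkContext   # interop point for RDD-level work

    lines = ["spark makes clusters easy", "spark scales out"]

    # RDD style: map + reduceByKey
    rdd_counts = (sc.parallelize(lines)
                  .flatMap(str.split)
                  .map(lambda w: (w, 1))
                  .reduceByKey(lambda a, b: a + b))

    # DataFrame style: split/explode + groupBy/agg (Catalyst/Tungsten optimized)
    df_counts = (spark.createDataFrame([(l,) for l in lines], ["line"])
                 .select(F.explode(F.split("line", " ")).alias("word"))
                 .groupBy("word").agg(F.count("*").alias("n")))
    ```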
