Apache Spark MasterClass Chapter 2 – Episode 3

  1. Compare the Spark shells: spark-shell, pyspark, and spark-sql. When would you choose each?

    spark-shell (Scala) offers first-class access to all APIs and is excellent for learning Scala and Spark together. pyspark is ideal for Python users who want interactive DataFrame work or to prototype ETL quickly. spark-sql provides a SQL-first interface to run ad hoc queries against structured data and inspect plans without writing Scala/Python code. Choose based on your team’s primary language and the task: exploration, SQL ad hoc analysis, or building snippets you’ll later productionize.

  2. Why is a REPL helpful for Spark and what are its limitations?

    A REPL shortens feedback loops, making it safer to test transformations on subsets and to iterate rapidly. You can inspect DAGs in the UI, confirm schemas, and validate logic before scaling. Limitations: session state can become messy, reproducibility may suffer if you don’t save code, and JVM-level config changes typically require restarting the session.

  3. Walk me through verifying a fresh install using the shells.

    Launch pyspark or spark-shell. Confirm the startup banner and the web UI link on port 4040. In Scala or Python, run spark.version. Load a small file with spark.read.text(…), call show(10, truncate=False) (or show(10, false) in Scala), and count(). If anything fails, check JAVA_HOME/SPARK_HOME on macOS/Linux or winutils.exe and PATH on Windows; consult the UI’s Environment tab.

  4. How do you keep interactive work reproducible so it can move to production later?

    Capture the shell commands in scripts or notebooks; prefer version-controlled .scala/.py files and use :load to replay. Pin Spark/Java versions, document configs, and encode small datasets or sampling logic. Write unit-testable functions and use spark-submit for repeatability once stabilized.
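
    One lightweight way to pin and document configuration is a version-controlled properties file; the keys below are real Spark settings, but the values are purely illustrative:

```properties
# conf/spark-defaults.conf (illustrative values, not recommendations)
spark.master                  local[*]
spark.sql.shuffle.partitions  8
spark.serializer              org.apache.spark.serializer.KryoSerializer
```

    Once the interactive code stabilizes, spark-submit --properties-file conf/spark-defaults.conf replays the same settings non-interactively.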

  5. Explain the DataFrame 'read → show → count' loop for quick EDA.

    read brings data into a logical table (DataFrame); show previews rows to validate parsing and content; count is a fast sanity check that triggers execution and reveals performance characteristics. Together they validate sources, schemas, and pipeline assumptions quickly.

  6. What does local[*] mean and when might you change it?

    local[*] uses as many threads as local cores. Change to local[1] for deterministic single-thread testing, or point to spark://…, yarn, or k8s when you want a real cluster. Always set the master before the session starts.

  7. How can you use :history and :load effectively during an interview take-home or live coding?

    :history helps recall earlier experiments; you can copy snippets into a file. :load runs that file in the shell, making demos consistent and avoiding typos. It’s a lightweight path toward a script you’ll later submit via spark-submit.

  8. What are key tabs in the Spark UI and what problems do they help you solve?

    Jobs/Stages/SQL visualize DAGs, task counts, and query plans to diagnose skew or shuffles. Storage shows cached/persisted DataFrames. Environment lists all Spark/Java/Hadoop settings and classpath details. Executors displays per-executor CPU/memory and failure stats. Together they accelerate root cause analysis.

  9. Demonstrate parity between Scala and PySpark for a simple task.

    Scala: val strings = spark.read.text("../README.md"); strings.show(10, false); strings.count(). Python: strings = spark.read.text("../README.md"); strings.show(10, truncate=False); strings.count(). The calls mirror each other, enabling easy cross-language learning.

  10. How would you teach a teammate to discover APIs quickly without docs?

    Use tab completion extensively: type spark. then Tab to explore members; in Scala, :type clarifies inferred types. For DataFrames, call printSchema(), explain(), and .columns to self-discover structure and behavior.

  11. When would you use the Structured APIs over direct RDDs in the shells?

    Almost always for analytics/ETL: DataFrames/SQL are higher level, more concise, and benefit from Catalyst/Tungsten optimizations. Drop to RDDs for fine-grained control, custom partitioning, or irregular transformations not expressible with the tabular model.

  12. What’s your approach to preventing 'state pollution' in long shell sessions?

    Reset periodically with :reset; modularize code into files and :load them; avoid redefining symbols with conflicting types; and restart the shell when changing low-level configs. Keep notes or a notebook to track the canonical version of working code.

  13. How do you preview without truncation and why does it matter?

    Use show(n, truncate=False) in PySpark or show(n, false) in Scala to display full values. It matters when inspecting parsed JSON/CSV fields or long messages—truncation can hide parsing errors or unexpected characters.

  14. Explain a quick Scala warm-up you’d give to a new Spark user.

    Start with println, define vals/vars, create an Array or Seq, and print each element with foreach. Show a simple function with an inferred return type and a Boolean predicate (e.g., isOddAge). Demonstrate chaining with filter(…).foreach(println). Wrap up with :type to reinforce the static-typing concepts relevant to Spark’s APIs.

  15. If a candidate claims 'the shells are only for demos', how do you respond?

    They’re excellent for real workflows: debugging schemas, profiling transformations on samples, validating joins and aggregations, and reproducing bugs by narrowing inputs. They shorten iteration cycles before you scale to full datasets or refactor into production jobs.
