Apache Spark MasterClass Chapter 2 – Episode 4
-
What do the common WARN messages mean when launching PySpark on a laptop? How should you respond? Show answer
Two frequent WARN lines are: (1) the hostname resolves to a loopback address (127.0.0.1), with a suggestion to set SPARK_LOCAL_IP; and (2) the native Hadoop library could not be loaded, so Spark falls back to built-in Java classes. Both are normal in local development. Unless you rely on Hadoop-native codecs or specific networking, continue. If binding is an issue (e.g., multiple NICs), set SPARK_LOCAL_IP to the correct interface address.
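If binding is the problem, the fix suggested by the WARN line can be sketched as follows; the IP address below is a placeholder, not a value from the course:

```shell
# Pin the Spark driver to a specific interface before launching PySpark.
# 192.168.1.50 is a hypothetical address; substitute your own NIC's IP.
export SPARK_LOCAL_IP=192.168.1.50
pyspark
```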
-
Compare launching PySpark in a terminal vs. launching it into Jupyter. When would you prefer each? Show answer
Terminal: quickest feedback for ad hoc experiments and REPL-friendly tasks; minimal overhead. Jupyter: when you need notebooks, plots, rich text, and a preserved execution history. For teaching, demos, and collaborative reviews, Jupyter is ideal; for quick one-offs and performance checks, the terminal is simpler.
-
How do PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS change PySpark’s behavior? Show answer
PYSPARK_DRIVER_PYTHON sets the executable used to run the PySpark driver (e.g., jupyter). PYSPARK_DRIVER_PYTHON_OPTS passes arguments to that executable (e.g., 'notebook --no-browser --port=8888'). With both set, invoking pyspark launches Jupyter with Spark prewired.
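A minimal sketch of the two variables together, assuming Jupyter is installed in the same environment as PySpark (8888 is simply the Jupyter default port):

```shell
# Make `pyspark` start a Jupyter Notebook server instead of the plain REPL.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8888'
pyspark   # now opens Jupyter with Spark prewired
```

Unset both variables (or open a new shell) to return pyspark to its normal REPL behavior.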
-
On Windows, why might you add findspark to a notebook and call findspark.init()? Show answer
Windows paths and environment variables vary across installs. findspark.init() programmatically locates SPARK_HOME and amends sys.path so import pyspark works consistently inside notebooks and IDEs without manual PATH edits.
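A hedged sketch of the notebook boilerplate, assuming findspark is pip-installed; the guard records failure (e.g., SPARK_HOME unset) instead of crashing the cell:

```python
# Locate Spark and make `import pyspark` work inside a notebook kernel.
spark_ready = False
try:
    import findspark
    findspark.init()   # reads SPARK_HOME and patches sys.path
    import pyspark     # should now resolve without manual PATH edits
    spark_ready = True
except Exception:      # e.g., findspark not installed, or SPARK_HOME unset
    pass

print("pyspark importable:", spark_ready)
```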
-
Explain host:container port mapping (-p host:container) for the jupyter/pyspark-notebook image, and a common mistake. Show answer
Inside the container, Jupyter listens on 8888. Mapping -p 8888:8888 exposes it as http://localhost:8888. Using -p 7777:8888 exposes it at http://localhost:7777. A common mistake is copying the printed 8888 link into the browser after starting with -p 7777:8888; you must use the host port (7777).
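The mapping can be sketched as a CLI fragment (the host port 7777 is illustrative):

```shell
# Host port 7777 -> container port 8888 (where Jupyter listens).
docker run -p 7777:8888 jupyter/pyspark-notebook
# Browse to http://localhost:7777 — NOT the :8888 URL the container logs print.
```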
-
What does -v hostPath:containerPath do in docker run and why is it critical for data science workflows? Show answer
It mounts a local folder into the container so code and data persist outside the container. Without a bind mount, files created in the container may be ephemeral; mounting lets you version datasets/notebooks on the host and share them across containers and teammates.
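A typical invocation, assuming a ./work folder on the host; /home/jovyan/work is the default notebook directory in the Jupyter Docker stack images:

```shell
# Mount the host's ./work folder into the container so notebooks persist.
docker run -p 8888:8888 \
  -v "$(pwd)/work:/home/jovyan/work" \
  jupyter/pyspark-notebook
```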
-
When would you run docker with -it and bash instead of directly launching the notebook? Show answer
Use -it … bash to drop into a shell for debugging, installing OS packages, running headless jobs, inspecting environment variables, or starting Jupyter manually. It’s helpful when the notebook server won’t start or you need to customize the container at runtime.
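A sketch of the debugging entry point; note that when starting Jupyter by hand inside a container you typically need --ip=0.0.0.0 so the server is reachable from the host:

```shell
# Open an interactive shell in the container instead of the notebook server.
docker run -it jupyter/pyspark-notebook bash
# Inside the container, inspect the environment or start Jupyter manually:
#   env | grep -i spark
#   jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
```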
-
Why might you add --rm, --shm-size, and --ulimit memlock=-1 for notebook containers? Show answer
--rm removes the container automatically after exit, saving disk. --shm-size enlarges /dev/shm, the shared-memory filesystem some libraries (e.g., pandas/Spark drivers) rely on. --ulimit memlock=-1 lifts the locked-memory cap, reducing failures for memory-intensive tasks in some environments.
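Putting the three flags together in one command; the 2g shared-memory size is an illustrative value, not a recommendation from the course:

```shell
# Ephemeral notebook container with enlarged /dev/shm and no memlock cap.
docker run --rm --shm-size=2g --ulimit memlock=-1 \
  -p 8888:8888 jupyter/pyspark-notebook
```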
-
What’s the quickest way to validate that Spark SQL is functioning in a fresh session? Show answer
Construct a minimal SparkSession and run a tiny SQL query that doesn't require external data, e.g., spark.sql("select 'spark' as hello").show(). It confirms the session, SQL parser, and display pipeline are working.
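The smoke test as a runnable cell, guarded so it degrades gracefully where pyspark is not installed; the master URL and app name are illustrative choices:

```python
# Minimal end-to-end check: session -> SQL parser -> display pipeline.
sql_ok = False
try:
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .master("local[1]")       # single local thread is enough
             .appName("smoke-test")
             .getOrCreate())
    spark.sql("select 'spark' as hello").show()  # prints a one-row table
    sql_ok = True
    spark.stop()
except Exception:   # pyspark missing, or no usable Java runtime
    pass

print("Spark SQL ok:", sql_ok)
```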
-
Describe a minimal Databricks CE workflow to ingest a CSV and query it. Show answer
Create a free CE account → Create a single-node cluster → Upload CSV via Create → Table, checking 'First row is header' and 'Infer Schema' → Name the table (e.g., movies) → In a notebook attached to the cluster, run df = spark.table('movies'); df.show().
-
If a Databricks SQL cell fails with 'table not found', what are the first things you check? Show answer
Check the database and table names (singular vs. plural typos), run SHOW DATABASES and SHOW TABLES in the current database, confirm the notebook is attached to an active cluster, and verify the workspace path if using Delta files instead of managed tables.
-
What are trade-offs between local Jupyter, Dockerized notebooks, and Databricks CE for PySpark work? Show answer
Local Jupyter: fastest startup, full control; but dependency drift and OS issues are common. Docker: reproducible and portable; initial learning curve and volume/port setup needed. Databricks CE: managed Spark and notebooks; internet required, resource limits, and workspace constraints.
-
How do you tame logging noise during development without hiding important failures? Show answer
Use sc.setLogLevel('ERROR') during exploration to suppress INFO/WARN noise; switch back to WARN/INFO when diagnosing behavior. Avoid OFF in teams because it hides early signals. Persist full logs when running batch jobs or CI pipelines.
-
What does the Spark UI at :4040 provide during interactive sessions? Show answer
Per-application dashboards: Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL tabs. It’s essential for debugging skew, watching shuffle activity, monitoring caching, and understanding physical execution plans.
-
You launched PySpark fine, but a notebook can’t import pyspark. Root causes and fixes? Show answer
Root causes: the notebook kernel uses a different Python than your Spark install; missing SPARK_HOME on PATH; on Windows, missing findspark.init(); or the conda env wasn’t activated. Fixes: select the correct kernel, activate the env, set SPARK_HOME and PATH, or call findspark.init() in the notebook.
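A pure-stdlib diagnostic cell that pinpoints which root cause applies; run it in the failing notebook kernel:

```python
# Which Python is this kernel using, and can it see pyspark / SPARK_HOME?
import importlib.util
import os
import sys

print("kernel python :", sys.executable)
print("SPARK_HOME    :", os.environ.get("SPARK_HOME", "<not set>"))
has_pyspark = importlib.util.find_spec("pyspark") is not None
print("pyspark found :", has_pyspark)
```

If the kernel path is not the environment where Spark lives, switching kernels (or activating the right env) is usually the fix.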
