Apache Spark MasterClass Chapter 3 – Episode 1

  1. When should you use spark-shell vs. spark-submit? Provide concrete examples.

    `spark-shell` is ideal for interactive exploration and quick trials—e.g., trying a DataFrame transformation on a sample file or validating a join. `spark-submit` is for production-like, repeatable runs of packaged apps/jars or Python scripts (e.g., nightly ETL, scheduled batch jobs). Typical workflow: prove logic in the shell → move code into a project → package → submit with `spark-submit`.
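The two modes can be sketched as a pair of command lines (the jar path, class name, and cluster manager below are hypothetical):

```
# Interactive exploration: opens a REPL with `spark` and `sc` already created.
spark-shell

# Repeatable, production-like run of a packaged app (hypothetical jar and class).
spark-submit \
  --class com.example.NightlyEtl \
  --master yarn \
  --deploy-mode cluster \
  target/nightly-etl-assembly-1.0.jar
```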

  2. Explain `spark` and `sc` in the shell. How do they relate to each other?

    `spark` is the `SparkSession`, the unified entry point for SQL/DataFrame APIs and catalog features. `sc` is the underlying `SparkContext`, which manages the connection to the cluster and low-level RDD APIs. In Spark 2+, `SparkSession` wraps and exposes the `SparkContext` via `spark.sparkContext`.

  3. Define transformations and actions. How does lazy evaluation help performance?

    Transformations (e.g., `map`, `filter`, `select`) create a new logical plan without executing immediately—Spark records lineage as a DAG. Actions (e.g., `collect`, `count`, `write`) trigger execution. Lazy evaluation lets Spark analyze and optimize the entire plan (e.g., predicate pushdown, pipelining) before executing, reducing scans and shuffle.

  4. Contrast narrow and wide transformations. Why do wide transformations often dominate cost?

    Narrow transformations (e.g., `map`, `filter`) don’t require data movement between partitions; they’re pipelined. Wide transformations (e.g., `join`, `groupByKey`, `repartition`) require shuffles across the network. Shuffles incur disk I/O, network transfer, and synchronization, often becoming the bottleneck at scale.

  5. What is lineage and how does it enable fault tolerance in Spark?

    Lineage is the record of transformations that created an RDD/DataFrame. If a partition is lost, Spark recomputes it from lineage rather than replicating all data. This approach reduces storage overhead while preserving resilience; cached/persisted data can speed up recomputation.

  6. Give practical guidance for resource tuning flags like `--executor-memory`, `--executor-cores`, and `--num-executors`.

    Start with cluster constraints (total cores/RAM), then size executors to balance parallelism against GC pressure. Keep cores per executor modest (e.g., 3–5) to limit GC pauses and I/O client contention; set memory to fit the working set plus shuffle buffers. Use `--num-executors` to reach the desired total core count, and enable dynamic allocation where supported so the job scales with load.

  7. Why and when would you switch to Kryo serialization? How do you enable it?

    Kryo is faster and more compact than Java serialization for complex objects, reducing network and storage overhead. Enable with `--conf spark.serializer=org.apache.spark.serializer.KryoSerializer` (and register classes if needed). It’s helpful for heavy shuffles, caching, or custom object graphs.

  8. The logs drown your console during development. How can you reduce the noise?

    Edit `$SPARK_HOME/conf/log4j2.properties` (Spark 3.3+; earlier releases use `log4j.properties` with log4j 1.x syntax) and set `rootLogger.level = ERROR` (or `WARN`). You can also lower specific loggers (e.g., `org.apache.spark`) to keep driver output readable while preserving error visibility.
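A minimal fragment for `conf/log4j2.properties` along these lines (the appender reference matches Spark's shipped template; the extra logger entry is illustrative):

```
# Quiet the console during development
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = console

# Silence a particularly chatty logger further
logger.spark.name = org.apache.spark
logger.spark.level = error
```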

  9. Show two ways to create a small DataFrame in Scala and PySpark.

    Scala: `import spark.implicits._; val df = Seq((1,"a"),(2,"b")).toDF("id","val")`. PySpark: `rows=[(1,"a"),(2,"b")]; df=spark.createDataFrame(rows,["id","val"])`. Both materialize tiny in-memory datasets for quick prototyping.

  10. What’s a safe way to read from PostgreSQL via JDBC and avoid driver/classpath issues?

    Use `spark.read.format("jdbc")` with `url`, `dbtable`, `user`, and `password`, and include the JDBC driver via `--packages org.postgresql:postgresql:<version>` or a provided JAR. For parallelism, set `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` to split large table reads across tasks.

  11. How do you obtain or create the current SparkSession in Scala and PySpark?

    Scala: `val spark = SparkSession.builder().getOrCreate()`. PySpark: `from pyspark.sql import SparkSession; spark = SparkSession.builder.getOrCreate()`.

  12. How are jobs, stages, and tasks related, and where do you inspect them?

    An action creates a job. Spark splits the job into stages at shuffle boundaries, and each stage runs one task per partition. Use the Spark Web UI (e.g., `http://driver-host:4040`) to view DAGs, stages, tasks, and executor metrics.

  13. What best practices help when packaging and submitting your own Scala app?

    Ensure consistent Scala/Spark versions, shade conflicting deps if needed (e.g., with sbt-assembly/Maven Shade), and keep configs external. Test locally with small data, then submit with `spark-submit` specifying `--class`, master/deploy mode, and resource flags.
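A typical submission along these lines (the class, jar path, and resource values are hypothetical; the Scala version in the path must match the cluster's Spark build):

```
spark-submit \
  --class com.example.BatchJob \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  target/scala-2.12/batch-job-assembly-1.0.jar /data/in /data/out
```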

  14. How do you explore catalog metadata with SparkSession? Give examples.

    `spark.catalog.listDatabases.show(false)` and `spark.catalog.listTables.show(false)` enumerate available databases/tables. You can also `spark.sql("SHOW DATABASES")` or `spark.sql("SHOW TABLES")` if using SQL syntax.

  15. What’s the recommended way to manage credentials for JDBC connections in Spark?

    Avoid hard-coding. Use secrets managers (KMS/Vault), environment variables, or Spark configs passed at submit time. Restrict notebook visibility, don’t commit secrets to VCS, and prefer network policies and SSL for transport security.
