Apache Spark MasterClass Chapter 2 – Episode 8
-
What is SparkR and when would you use it?
SparkR is an R package that lets R users work with Spark DataFrames and distributed computing. It’s used when data scientists prefer R for modeling but need Spark’s scalability.
-
Difference between SparkR and sparklyr?
SparkR ships as part of Apache Spark, while sparklyr (by RStudio) integrates more deeply with the R ecosystem, including dplyr verbs. SparkR is the native API; sparklyr is generally considered more idiomatic for R users.
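The contrast is easiest to see in code. A minimal sketch, assuming a local Spark installation reachable from R (the data is R's built-in mtcars set; in practice the two packages mask each other's verbs, so you would rarely load both in one session):

```
# SparkR: the API that ships with Apache Spark
library(SparkR)
sparkR.session(master = "local[*]")
df <- createDataFrame(mtcars)
head(filter(df, df$mpg > 25))

# sparklyr: dplyr-style verbs against the same data
library(sparklyr)
library(dplyr)
sc  <- spark_connect(master = "local")
tbl <- copy_to(sc, mtcars)
tbl %>% filter(mpg > 25) %>% head()
```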
-
Why is sbt preferred for Scala Spark projects?
sbt handles dependency management, incremental builds, and integrates with IDEs like IntelliJ. It’s the de facto build tool for Scala.
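A minimal build.sbt for such a project might look like this (project name and version numbers are illustrative, not from the course):

```
ThisBuild / scalaVersion := "2.13.12"

lazy val root = (project in file("."))
  .settings(
    name := "spark-demo",
    // %% appends the Scala binary suffix, e.g. spark-sql_2.13
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "3.5.1" % Provided,
      "org.apache.spark" %% "spark-sql"  % "3.5.1" % Provided
    )
  )
```

Marking Spark as Provided keeps it out of the packaged JAR, since the cluster supplies Spark at runtime.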
-
How does IntelliJ help in Spark development?
IntelliJ IDEA offers code completion, debugging, integrated sbt support, and easier refactoring for Scala Spark projects.
-
Why do we use the --add-exports option in VM options?
On Java 9 and later, the module system hides internal JDK packages (e.g., sun.nio.ch in java.base) that Spark relies on; --add-exports (and --add-opens) makes them accessible again.
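In IntelliJ's run-configuration VM options field this looks roughly like the following (the exact set of flags depends on the Spark and JDK versions in use; this is a sketch, not an exhaustive list):

```
--add-exports java.base/sun.nio.ch=ALL-UNNAMED
--add-opens   java.base/java.nio=ALL-UNNAMED
--add-opens   java.base/java.lang=ALL-UNNAMED
```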
-
Explain the role of Zeppelin in the Spark ecosystem.
Zeppelin is a web-based notebook supporting multiple interpreters (Spark, SQL, Python, R). It allows interactive data exploration and visualization.
-
Why do we mount Spark inside Zeppelin Docker container?
To provide Zeppelin access to Spark binaries and libraries, enabling Spark interpreter execution.
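A hedged sketch of such a mount (the image tag and host path are placeholders, not the course's exact command):

```
docker run -p 8080:8080 \
  -v /opt/spark:/opt/spark \
  -e SPARK_HOME=/opt/spark \
  --name zeppelin apache/zeppelin:0.11.1
```

The -v flag maps the host's Spark installation into the container, and SPARK_HOME tells Zeppelin's Spark interpreter where to find it.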
-
What are pros/cons of using Docker for Zeppelin on Windows?
Pros: isolation, easy setup, reproducibility. Cons: requires Docker Desktop, potential performance overhead.
-
How is %pyspark different from %spark in Zeppelin?
%pyspark runs Python (PySpark) code, while %spark runs Scala; %sql (or %spark.sql) runs Spark SQL. The interpreter prefix on each paragraph determines which language executes.
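Inside one note this looks like separate paragraphs, each starting with its interpreter prefix, e.g.:

```
%spark
// Scala paragraph
val df = spark.range(5)
df.count()

%pyspark
# Python paragraph
df = spark.range(5)
print(df.count())

%sql
-- Spark SQL paragraph
SELECT 1
```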
-
Can Zeppelin connect to external clusters?
Yes, Zeppelin can connect to remote Spark clusters by configuring interpreter settings (master=yarn or master=spark://…).
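In the Spark interpreter settings this is essentially one property change (host and port below are placeholders):

```
spark.master             spark://spark-master:7077
# or, for YARN:
spark.master             yarn
spark.submit.deployMode  client
```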
-
What’s the difference between sbt package and sbt run?
sbt run executes locally within sbt; sbt package builds a deployable JAR for cluster submission.
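The typical flow, sketched (the JAR name, main class, and master URL are placeholders that depend on your build.sbt and cluster):

```
sbt run        # runs the main class inside sbt, usually against local[*]
sbt package    # builds e.g. target/scala-2.13/spark-demo_2.13-0.1.jar
spark-submit \
  --class com.example.Main \
  --master spark://host:7077 \
  target/scala-2.13/spark-demo_2.13-0.1.jar
```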
-
Why is scalaVersion important in build.sbt?
Spark artifacts are compiled per Scala binary version (2.12 or 2.13), and the artifact name carries that suffix (e.g., spark-sql_2.13). A mismatch between scalaVersion and the Spark artifacts causes runtime errors such as NoSuchMethodError.
-
How does Zeppelin store notebooks?
Notebooks are stored as JSON files, either locally or in mounted volumes (/zeppelin/notebook).
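A note file is plain JSON; a heavily trimmed sketch (values are illustrative, and a real file carries more fields, such as paragraph config, results, and timestamps):

```
{
  "id": "2ABC123XY",
  "name": "My Spark Note",
  "paragraphs": [
    { "text": "%spark\nspark.range(5).count()" }
  ]
}
```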
-
What are common IntelliJ errors in Spark setup?
Missing Scala plugin, incorrect JDK version, misconfigured classpath, sbt sync issues.
-
How do you integrate Zeppelin with external data sources?
Configure interpreters with JDBC drivers or connectors (e.g., %jdbc, %elasticsearch) to query external systems.
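For example, the %jdbc interpreter is configured with properties like these (the URL, database, and credentials are placeholders; the matching driver JAR must also be added as an interpreter dependency):

```
default.driver    org.postgresql.Driver
default.url       jdbc:postgresql://db-host:5432/mydb
default.user      analyst
default.password  ********
```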
