Apache Spark MasterClass Chapter 2 – Episode 8

  1. What is SparkR and when would you use it?

    SparkR is an R package that lets R users work with Spark DataFrames and distributed computing. It’s used when data scientists prefer R for modeling but need Spark’s scalability.

  2. What is the difference between SparkR and sparklyr?

    SparkR ships as part of Apache Spark, while sparklyr (by RStudio) provides deeper integration with the R ecosystem, including dplyr verbs. SparkR is more native; sparklyr is more user-friendly.

  3. Why is sbt preferred for Scala Spark projects?

    sbt handles dependency management, incremental builds, and integrates with IDEs like IntelliJ. It’s the de facto build tool for Scala.

  4. How does IntelliJ help in Spark development?

    IntelliJ IDEA offers code completion, debugging, integrated sbt support, and easier refactoring for Scala Spark projects.

  5. Why do we use the --add-exports option in VM options?

    To export internal JDK packages (e.g., sun.nio.ch in the java.base module) that Spark depends on. On Java 9+ the module system blocks access to these internals by default, so such flags are required when running Spark on newer Java versions.
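For example, a local IntelliJ run configuration might carry VM options like the following. This is a sketch; the exact set of flags you need depends on your Spark and JDK versions:

```
--add-exports java.base/sun.nio.ch=ALL-UNNAMED
--add-opens java.base/java.nio=ALL-UNNAMED
--add-opens java.base/java.lang=ALL-UNNAMED
```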

  6. Explain the role of Zeppelin in the Spark ecosystem.

    Zeppelin is a web-based notebook supporting multiple interpreters (Spark, SQL, Python, R). It allows interactive data exploration and visualization.

  7. Why do we mount Spark inside the Zeppelin Docker container?

    To provide Zeppelin access to Spark binaries and libraries, enabling Spark interpreter execution.
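A minimal sketch of such a mount, assuming a Spark distribution unpacked at /opt/spark on the host and the official apache/zeppelin image (the tag and paths are assumptions — match them to your setup and Spark version):

```
docker run --name zeppelin -p 8080:8080 \
  -v /opt/spark:/opt/spark \
  -v "$(pwd)/notebook:/zeppelin/notebook" \
  -e SPARK_HOME=/opt/spark \
  apache/zeppelin:0.11.1
```

The second volume persists notebooks on the host so they survive container restarts.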

  8. What are the pros/cons of using Docker for Zeppelin on Windows?

    Pros: isolation, easy setup, reproducibility. Cons: requires Docker Desktop, potential performance overhead.

  9. How is %pyspark different from %spark in Zeppelin?

    %pyspark runs Python (PySpark) code, while %spark runs Scala. Both belong to the same Spark interpreter group and share one SparkContext; other bindings such as %spark.sql handle Spark SQL.
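For instance, two paragraphs in the same note might look like this (a sketch — both bindings operate on the same shared SparkSession):

```
%spark
// Scala paragraph
spark.range(5).count()

%pyspark
# Python paragraph in the same note
spark.range(5).count()
```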

  10. Can Zeppelin connect to external clusters?

    Yes, Zeppelin can connect to remote Spark clusters by configuring interpreter settings (master=yarn or master=spark://…).
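In the interpreter settings UI this amounts to overriding a few Spark properties, for example (host and values are placeholders):

```
spark.master             spark://spark-master.example.com:7077
spark.submit.deployMode  client
spark.executor.memory    2g
```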

  11. What’s the difference between sbt package and sbt run?

    sbt run compiles the project and executes its main class locally inside the sbt JVM; sbt package compiles and bundles your classes (without dependencies) into a JAR suitable for cluster submission via spark-submit.
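A typical workflow sketch (the class name, JAR name, and master URL are hypothetical — substitute your own):

```
# Quick local iteration inside sbt's JVM:
sbt run

# Build the JAR, then submit it to a cluster:
sbt package
spark-submit --class com.example.Main --master spark://host:7077 \
  target/scala-2.13/myapp_2.13-0.1.jar
```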

  12. Why is scalaVersion important in build.sbt?

    Spark artifacts are Scala-version specific (2.12/2.13). Mismatches cause runtime errors.
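A minimal build.sbt sketch showing where the Scala version comes in (the version numbers and project name are assumptions — match them to your cluster):

```scala
// Spark artifact IDs carry the Scala binary suffix (e.g. spark-sql_2.13);
// the %% operator appends it automatically based on scalaVersion.
ThisBuild / scalaVersion := "2.13.12"

lazy val root = (project in file("."))
  .settings(
    name := "myapp",
    // "provided" keeps Spark out of the packaged JAR, since the cluster supplies it.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"
  )
```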

  13. How does Zeppelin store notebooks?

    Notebooks are stored as JSON files, either locally or in mounted volumes (/zeppelin/notebook).

  14. What are common IntelliJ errors in Spark setup?

    Missing Scala plugin, incorrect JDK version, misconfigured classpath, sbt sync issues.

  15. How do you integrate Zeppelin with external data sources?

    Configure interpreters with JDBC drivers or connectors (e.g., %jdbc, %elasticsearch) to query external systems.
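For example, once a JDBC interpreter is configured (connection URL, credentials, and the vendor driver JAR as an interpreter dependency), a paragraph can query the external system directly. The table name here is hypothetical:

```
%jdbc
SELECT COUNT(*) FROM orders
```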
