Apache Spark MasterClass Chapter 2 – Episode 8

  1. What is SparkR and when would you use it?

    SparkR is an R package that lets R users work with Spark DataFrames and distributed computing. It’s used when data scientists prefer R for modeling but need Spark’s scalability.

  2. What is the difference between SparkR and sparklyr?

    SparkR ships as part of Apache Spark, while sparklyr (by RStudio) provides deeper integration with the R ecosystem, including dplyr verbs. SparkR is more native; sparklyr is more user-friendly.

  3. Why is sbt preferred for Scala Spark projects?

    sbt handles dependency management, incremental builds, and integrates with IDEs like IntelliJ. It’s the de facto build tool for Scala.

  4. How does IntelliJ help in Spark development?

    IntelliJ IDEA offers code completion, debugging, integrated sbt support, and easier refactoring for Scala Spark projects.

  5. Why do we use the --add-exports option in VM options?

    To export internal JDK packages (e.g., sun.nio.ch in the java.base module) that Spark depends on. On Java 9+ the module system blocks access to these internals by default, so such flags are required when running Spark on newer Java versions.
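For example, a local IntelliJ run configuration might carry VM options like the following. This is a sketch; the exact set of flags you need depends on your Spark and JDK versions:

```
--add-exports java.base/sun.nio.ch=ALL-UNNAMED
--add-opens java.base/java.nio=ALL-UNNAMED
--add-opens java.base/java.lang=ALL-UNNAMED
```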

  6. Explain the role of Zeppelin in the Spark ecosystem.

    Zeppelin is a web-based notebook supporting multiple interpreters (Spark, SQL, Python, R). It allows interactive data exploration and visualization.

  7. Why do we mount Spark inside the Zeppelin Docker container?

    To provide Zeppelin access to Spark binaries and libraries, enabling Spark interpreter execution.
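A minimal sketch of such a mount, assuming a Spark distribution unpacked at /opt/spark on the host and the official apache/zeppelin image (the tag and paths are assumptions — match them to your setup and Spark version):

```
docker run --name zeppelin -p 8080:8080 \
  -v /opt/spark:/opt/spark \
  -v "$(pwd)/notebook:/zeppelin/notebook" \
  -e SPARK_HOME=/opt/spark \
  apache/zeppelin:0.11.1
```

The second volume persists notebooks on the host so they survive container restarts.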

  8. What are the pros/cons of using Docker for Zeppelin on Windows?

    Pros: isolation, easy setup, reproducibility. Cons: requires Docker Desktop, potential performance overhead.

  9. How is %pyspark different from %spark in Zeppelin?

    %pyspark runs Python (PySpark) code, while %spark runs Scala. Both belong to the same Spark interpreter group and share one SparkContext; other bindings such as %spark.sql handle Spark SQL.
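For instance, two paragraphs in the same note might look like this (a sketch — both bindings operate on the same shared SparkSession):

```
%spark
// Scala paragraph
spark.range(5).count()

%pyspark
# Python paragraph in the same note
spark.range(5).count()
```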

  10. Can Zeppelin connect to external clusters?

    Yes, Zeppelin can connect to remote Spark clusters by configuring interpreter settings (master=yarn or master=spark://…).
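In the interpreter settings UI this amounts to overriding a few Spark properties, for example (host and values are placeholders):

```
spark.master             spark://spark-master.example.com:7077
spark.submit.deployMode  client
spark.executor.memory    2g
```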

  11. What’s the difference between sbt package and sbt run?

    sbt run compiles the project and executes its main class locally inside the sbt JVM; sbt package compiles and bundles your classes (without dependencies) into a JAR suitable for cluster submission via spark-submit.
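A typical workflow sketch (the class name, JAR name, and master URL are hypothetical — substitute your own):

```
# Quick local iteration inside sbt's JVM:
sbt run

# Build the JAR, then submit it to a cluster:
sbt package
spark-submit --class com.example.Main --master spark://host:7077 \
  target/scala-2.13/myapp_2.13-0.1.jar
```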

  12. Why is scalaVersion important in build.sbt?

    Spark artifacts are Scala-version specific (2.12/2.13). Mismatches cause runtime errors.
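A minimal build.sbt sketch showing where the Scala version comes in (the version numbers and project name are assumptions — match them to your cluster):

```scala
// Spark artifact IDs carry the Scala binary suffix (e.g. spark-sql_2.13);
// the %% operator appends it automatically based on scalaVersion.
ThisBuild / scalaVersion := "2.13.12"

lazy val root = (project in file("."))
  .settings(
    name := "myapp",
    // "provided" keeps Spark out of the packaged JAR, since the cluster supplies it.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"
  )
```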

  13. How does Zeppelin store notebooks?

    Notebooks are stored as JSON files, either locally or in mounted volumes (/zeppelin/notebook).

  14. What are common IntelliJ errors in Spark setup?

    Missing Scala plugin, incorrect JDK version, misconfigured classpath, sbt sync issues.

  15. How do you integrate Zeppelin with external data sources?

    Configure interpreters with JDBC drivers or connectors (e.g., %jdbc, %elasticsearch) to query external systems.
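For example, once a JDBC interpreter is configured (connection URL, credentials, and the vendor driver JAR as an interpreter dependency), a paragraph can query the external system directly. The table name here is hypothetical:

```
%jdbc
SELECT COUNT(*) FROM orders
```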
