Apache Spark MasterClass Chapter 2 – Episode 6

  1. Why do we use PostgreSQL instead of Derby for Hive Metastore in production?

    Embedded Derby accepts connections from only one process at a time, so it cannot serve concurrent Hive sessions. PostgreSQL (or MySQL) supports multiple simultaneous connections, transactions, and scaling to production workloads.

  2. What happens if you forget to format the NameNode before starting Hadoop?

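    The NameNode will fail to start because its metadata directory holds no filesystem image; the log typically reports that the NameNode is not formatted. Formatting (hdfs namenode -format) initializes the NameNode metadata and directory structure.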

  3. Explain the role of hdfs-site.xml.

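    Defines the NameNode and DataNode storage directories (dfs.namenode.name.dir and dfs.datanode.data.dir, formerly dfs.name.dir and dfs.data.dir) and the replication factor (dfs.replication), which are critical for data storage and redundancy.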
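
    As an illustration of the properties named above, the following is a minimal sketch that renders a single-node hdfs-site.xml from a Python dict. The helper name build_hdfs_site and the /opt/hadoop/data paths are assumptions made for this example, not part of the course material.

```python
import xml.etree.ElementTree as ET


def build_hdfs_site(properties):
    """Render a dict of property names and values as Hadoop configuration XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Hypothetical local paths; adjust them to your own layout.
hdfs_site_xml = build_hdfs_site({
    "dfs.replication": 1,
    "dfs.namenode.name.dir": "file:///opt/hadoop/data/namenode",
    "dfs.datanode.data.dir": "file:///opt/hadoop/data/datanode",
})
print(hdfs_site_xml)
```

    The printed XML could be saved as hdfs-site.xml in the Hadoop configuration directory; in practice the file is usually edited by hand.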

  4. What’s the difference between Hive Metastore and HiveServer2?

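    The Metastore is the service that stores table and schema metadata in a relational database. HiveServer2 is the server that accepts client connections (JDBC, ODBC, Thrift), compiles HiveQL queries, and submits them to an execution engine such as MapReduce or Tez.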

  5. Why is ssh localhost important in Hadoop setup?

    The Hadoop start scripts (start-dfs.sh, start-yarn.sh) use SSH to launch daemons on each host, so passwordless SSH must work. Even on a single-node cluster the scripts connect to localhost over SSH.
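
    One way to verify this before running the start scripts is sketched below. The function name passwordless_ssh_works and its injectable run parameter are hypothetical; the sketch assumes an ssh client is on the PATH.

```python
import subprocess


def passwordless_ssh_works(host="localhost", run=subprocess.run):
    """Return True if `ssh host true` succeeds without prompting for a password."""
    # BatchMode=yes makes ssh fail instead of asking for a password interactively.
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]
    result = run(cmd, capture_output=True)
    return result.returncode == 0
```

    Calling passwordless_ssh_works() returns False when key-based login is not set up. Because batch mode also disables the host-key prompt, run ssh localhost once by hand first to accept the host key.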

  6. How would you troubleshoot Hive Metastore startup failure?

    Check the logs (for example metastore.log), validate that the JDBC driver is on the classpath, verify the database connection, confirm that the schema has been initialized, and check the hive-site.xml configuration.
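
    The configuration check in that list can be partly automated. The sketch below scans a hive-site.xml string for the four standard javax.jdo.option connection properties and reports any that are missing or empty. The function name and the sample values (database name, user) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

REQUIRED_KEYS = (
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
)


def missing_metastore_settings(hive_site_xml):
    """Return the required JDBC properties that are absent or empty."""
    root = ET.fromstring(hive_site_xml)
    found = {
        prop.findtext("name", "").strip(): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
    }
    return [key for key in REQUIRED_KEYS if not found.get(key)]


# Sample configuration with hypothetical values and the password left out.
sample = """
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
</configuration>
"""
print(missing_metastore_settings(sample))
```

    For the sample above the only key reported is javax.jdo.option.ConnectionPassword. A clean result does not prove that the database is reachable; it only rules out one common misconfiguration.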

  7. What is the difference between core-site.xml and yarn-site.xml?

    core-site.xml defines global Hadoop settings such as the default filesystem (fs.defaultFS), while yarn-site.xml configures YARN-specific services such as the ResourceManager and NodeManager.

  8. What are common issues on Windows when running Hadoop?

    Missing winutils.exe, incorrect JAVA_HOME, and permission errors due to lack of POSIX emulation.

  9. Why is replication factor set to 1 in single-node setups?

    With a single DataNode there is nowhere to place additional replicas, so a higher setting would only leave blocks under-replicated. In production clusters the replication factor is typically 3, the HDFS default.

  10. Explain Beeline vs Hive CLI.

    Hive CLI is deprecated; it ran as a thick client that accessed the Metastore and HDFS directly. Beeline is a JDBC client that connects to HiveServer2, which allows authentication and remote connections.

  11. What does schematool -dbType postgres -initSchema do?

    Creates the required Hive Metastore tables in PostgreSQL, ensuring Hive can store metadata properly.
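
    A small wrapper for this command might look like the sketch below. The helper names are hypothetical, and the code assumes schematool (normally under $HIVE_HOME/bin) is on the PATH. The -info action, which prints the schema version, is included as a way to confirm that initialization worked.

```python
import subprocess


def schematool_command(action, db_type="postgres"):
    """Build the argv for schematool; action is 'initSchema' or 'info'."""
    if action not in ("initSchema", "info"):
        raise ValueError(f"unsupported action: {action}")
    return ["schematool", "-dbType", db_type, f"-{action}"]


def run_schematool(action, db_type="postgres"):
    """Run schematool and return its exit code (0 means success)."""
    return subprocess.run(schematool_command(action, db_type)).returncode
```

    A typical sequence would be run_schematool("initSchema") once on a fresh database, followed by run_schematool("info") to check the result.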

  12. How do you check if Hadoop services started successfully?

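    Run jps and confirm that processes such as NameNode, DataNode, ResourceManager, and NodeManager are listed, and check the web UIs (the NameNode UI at http://localhost:9870 in Hadoop 3.x, the ResourceManager UI at http://localhost:8088).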
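
    The jps check can be scripted. The sketch below parses jps output and lists the expected daemons that are missing. The set of expected daemons, the function names, and the sample PIDs are assumptions for a single-node setup.

```python
import subprocess

EXPECTED_DAEMONS = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # each jps line is "<pid> <main class name>"
            running.add(parts[1])
    return sorted(set(expected) - running)


def check_local_cluster():
    """Run jps (assumed to be on PATH) and report missing daemons."""
    output = subprocess.run(["jps"], capture_output=True, text=True).stdout
    return missing_daemons(output)


# Sample output with hypothetical PIDs; NodeManager is not running.
sample = """12001 NameNode
12150 DataNode
12633 ResourceManager
13010 Jps
"""
print(missing_daemons(sample))  # ['NodeManager']
```

    An empty list from check_local_cluster() means every expected daemon is running; add SecondaryNameNode to the expected set if your setup starts one.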

  13. Why do we create /user/hive/warehouse in HDFS?

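    Hive stores managed table data here by default (hive.metastore.warehouse.dir). The directory needs group write permission so that Hive can create and write table data files; the metadata itself lives in the Metastore database.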
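
    The directory setup can be sketched as two hdfs dfs commands driven from Python. The helper names are hypothetical, and the code assumes the hdfs command is on the PATH and HDFS is already running.

```python
import subprocess

WAREHOUSE_DIR = "/user/hive/warehouse"


def warehouse_setup_commands(path=WAREHOUSE_DIR):
    """Return the hdfs commands that create the directory and grant group write."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", path],
        ["hdfs", "dfs", "-chmod", "g+w", path],
    ]


def create_warehouse(path=WAREHOUSE_DIR):
    """Run each command in order, stopping at the first failure."""
    for cmd in warehouse_setup_commands(path):
        subprocess.run(cmd, check=True)
```

    Calling create_warehouse() is safe to repeat, since -mkdir -p does not fail when the directory already exists.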

  14. How do you secure Hive Metastore database in production?

    Use strong passwords, role-based access control, restrict network access, and enable SSL between Hive and the database.

  15. How would you integrate Hive with Spark SQL?

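    Point Spark at the Hive Metastore by setting the Metastore URI (hive.metastore.uris, for example via a hive-site.xml in Spark's conf directory) and spark.sql.warehouse.dir, then create the SparkSession with Hive support enabled so that Spark can query Hive tables.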
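
    A minimal PySpark sketch of this configuration follows. The function names, the application name, and the localhost endpoints are assumptions; 9083 is the default Metastore Thrift port, and the HDFS address should match your fs.defaultFS.

```python
def hive_session_conf(metastore_uri, warehouse_dir):
    """Collect the settings that point Spark at an existing Hive Metastore."""
    return {
        "hive.metastore.uris": metastore_uri,
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_hive_session(conf, app_name="hive-integration-sketch"):
    """Create a SparkSession with Hive support (requires pyspark to be installed)."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above stays standalone

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Assumed endpoints for a single-node setup.
conf = hive_session_conf(
    "thrift://localhost:9083",
    "hdfs://localhost:9000/user/hive/warehouse",
)
```

    With the Metastore service running, spark = build_hive_session(conf) followed by spark.sql("SHOW DATABASES").show() should list the databases defined in Hive.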
