Apache Spark MasterClass Chapter 2 – Episode 6
-
Why do we use PostgreSQL instead of Derby for Hive Metastore in production?
In its default embedded mode, Derby accepts only one connection at a time, so it cannot serve concurrent Hive sessions. PostgreSQL (or MySQL) runs as a separate database server that supports multiple connections, transactions, and scaling.
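As an illustration, the sketch below generates a hive-site.xml that points the Metastore at PostgreSQL through the standard javax.jdo.option.* JDBC properties. The database name (metastore), user, and password are hypothetical placeholders, not values from this episode.

```python
import xml.etree.ElementTree as ET


def build_hive_site(jdbc_url: str, user: str, password: str) -> str:
    """Return a hive-site.xml document pointing the Metastore at PostgreSQL."""
    properties = {
        "javax.jdo.option.ConnectionURL": jdbc_url,
        "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical database name and credentials, for illustration only.
xml_text = build_hive_site(
    "jdbc:postgresql://localhost:5432/metastore", "hive", "change-me"
)
print(xml_text)
```

In a real deployment the password would normally come from a secret store rather than being written in plain text.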
-
What happens if you forget to format the NameNode before starting Hadoop?
The NameNode fails to start because its storage directory contains no metadata, typically reporting that the NameNode is not formatted. Formatting (hdfs namenode -format) initializes the NameNode metadata and directory structure.
-
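Explain the role of hdfs-site.xml.
It configures HDFS itself: the storage directories for the NameNode and DataNode (dfs.namenode.name.dir and dfs.datanode.data.dir, formerly dfs.name.dir and dfs.data.dir) and the replication factor (dfs.replication). These settings control where data is stored and how redundantly.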
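A minimal sketch of what such a file can look like on a single node, with a small helper that reads its properties. It uses the current property names (dfs.namenode.name.dir and dfs.datanode.data.dir, which older releases called dfs.name.dir and dfs.data.dir); the /opt/hadoop/data paths are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Sample single-node hdfs-site.xml; the /opt/hadoop/data paths are hypothetical.
HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop/data/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
"""


def read_properties(xml_text: str) -> dict:
    """Map each <name> to its <value> in a Hadoop-style configuration file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


props = read_properties(HDFS_SITE)
print(props["dfs.namenode.name.dir"])
print(int(props["dfs.replication"]))
```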
-
What’s the difference between Hive Metastore and HiveServer2?
The Metastore stores schema metadata (databases, tables, partitions) in a relational database. HiveServer2 is the service that accepts client connections over JDBC/ODBC, compiles HiveQL queries, and submits them for execution.
-
Why is ssh localhost important in Hadoop setup?
It verifies that passwordless SSH works. Hadoop's start and stop scripts (such as start-dfs.sh) use SSH to launch daemons on each node, and on a single-node cluster that node is localhost.
-
How would you troubleshoot Hive Metastore startup failure?
Check the Metastore logs (metastore.log), confirm the JDBC driver is on the classpath, verify the database connection, confirm the schema was initialized, and review the hive-site.xml configuration.
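Two of these checks can be scripted. The sketch below assumes hive-site.xml has already been parsed into a dict; it reports missing JDBC settings and tests whether the database port accepts TCP connections. The helper names and sample values are hypothetical.

```python
import socket

# Metastore JDBC settings that hive-site.xml needs for an external database.
REQUIRED_KEYS = (
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
)


def missing_settings(props: dict) -> list:
    """Return the required keys that are absent or empty in the parsed config."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the database port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example: a config that lacks the driver and password entries.
props = {
    "javax.jdo.option.ConnectionURL": "jdbc:postgresql://localhost:5432/metastore",
    "javax.jdo.option.ConnectionUserName": "hive",
}
print(missing_settings(props))
```

A call such as can_connect("localhost", 5432) only shows that the port is reachable; a wrong password or an uninitialized schema would still surface in the logs.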
-
Difference between core-site.xml and yarn-site.xml.
core-site.xml defines global Hadoop settings like the default filesystem, while yarn-site.xml configures YARN-specific services such as ResourceManager and NodeManager.
-
What are common issues on Windows when running Hadoop?
Missing winutils.exe, incorrect JAVA_HOME, and permission errors due to lack of POSIX emulation.
-
Why is replication factor set to 1 in single-node setups?
A single node has only one DataNode, so each block can have only one replica; a higher setting would leave every block under-replicated. In production clusters, the replication factor is typically 3, the HDFS default.
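The arithmetic behind this can be sketched in a few lines. HDFS keeps at most one replica of a given block per DataNode, so the number of replicas that can actually be placed is capped by the number of DataNodes. The function name is hypothetical.

```python
def replica_placement(replication: int, datanodes: int) -> tuple:
    """Return (replicas_placed, replicas_missing) for a single block.

    At most one replica of a block lives on each DataNode, so the number
    of DataNodes caps how many replicas can be placed.
    """
    placed = min(replication, datanodes)
    return placed, replication - placed


print(replica_placement(3, 1))  # (1, 2): two replicas can never be placed
print(replica_placement(1, 1))  # (1, 0): fully replicated on a single node
print(replica_placement(3, 5))  # (3, 0): a multi-node cluster satisfies 3
```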
-
Explain Beeline vs Hive CLI.
Hive CLI is deprecated in favor of Beeline. Beeline is a JDBC client that connects to HiveServer2, locally or remotely, so queries go through HiveServer2's authentication and authorization.
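For illustration, the sketch below builds a HiveServer2 JDBC URL and the matching Beeline command line. Port 10000 is the usual HiveServer2 default; the host and user name are assumptions.

```python
def hiveserver2_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a JDBC URL for HiveServer2 (10000 is its usual default port)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(url: str, user: str) -> list:
    """Return the argument list for a Beeline connection (-u URL, -n user)."""
    return ["beeline", "-u", url, "-n", user]


# Hypothetical host and user name.
url = hiveserver2_url("localhost")
print(" ".join(beeline_command(url, "hive")))
```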
-
What does schematool -dbType postgres -initSchema do?
Creates the required Hive Metastore tables in PostgreSQL, ensuring Hive can store metadata properly.
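A small wrapper, sketched under the assumption that schematool is on the PATH, can run the initialization and then check the result with schematool's -info option. The function names are hypothetical, and nothing is executed until run_schematool is called.

```python
import subprocess

ALLOWED_ACTIONS = ("-initSchema", "-info")


def schematool_command(action: str, db_type: str = "postgres") -> list:
    """Build the schematool argument list for the given action."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["schematool", "-dbType", db_type, action]


def run_schematool(action: str) -> int:
    """Run schematool and return its exit code (0 indicates success)."""
    return subprocess.run(schematool_command(action), check=False).returncode


print(" ".join(schematool_command("-initSchema")))
print(" ".join(schematool_command("-info")))
```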
-
How do you check if Hadoop services started successfully?
Run jps and confirm that processes such as NameNode, DataNode, and ResourceManager are listed, and check the web UIs (for example, the NameNode UI at http://localhost:9870).
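The jps check can be automated by parsing its output. In the sketch below, the expected daemon set assumes a single-node setup with both HDFS and YARN started (SecondaryNameNode and NodeManager are additions to the processes named above), and the sample output with its PIDs is invented for illustration.

```python
# Daemons assumed for a single-node setup with HDFS and YARN both started.
EXPECTED = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def running_daemons(jps_output: str) -> set:
    """Extract process names from jps output lines of the form '<pid> <name>'."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1])
    return names


def missing_daemons(jps_output: str, expected=EXPECTED) -> list:
    """Return the expected daemons that jps did not list, sorted by name."""
    return sorted(set(expected) - running_daemons(jps_output))


# Invented sample: HDFS daemons are up, YARN daemons are not.
SAMPLE = """\
4321 NameNode
4467 DataNode
4702 SecondaryNameNode
5120 Jps
"""
print(missing_daemons(SAMPLE))  # ['NodeManager', 'ResourceManager']
```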
-
Why do we create /user/hive/warehouse in HDFS?
Hive stores managed table data here by default. The directory's permissions must allow Hive to read and write table data files; the metadata itself lives in the Metastore database.
-
How do you secure Hive Metastore database in production?
Use strong passwords, role-based access control, restrict network access, and enable SSL between Hive and the database.
-
How would you integrate Hive with Spark SQL?
Point Spark at the Hive Metastore by setting the Metastore URI (hive.metastore.uris) and spark.sql.warehouse.dir, and enable Hive support on the SparkSession so that Spark can query Hive tables.
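A minimal PySpark sketch of this configuration follows. The Metastore URI (thrift://localhost:9083, the usual default port) and the HDFS address (hdfs://localhost:9000) are assumptions about a local setup; the helper names are hypothetical, and build_session requires pyspark to be installed.

```python
def hive_conf(metastore_uri: str, warehouse_dir: str) -> dict:
    """Collect the settings Spark needs to reach the Hive Metastore."""
    return {
        "hive.metastore.uris": metastore_uri,
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_session(conf: dict, app_name: str = "hive-integration"):
    """Create a SparkSession with Hive support using the given settings."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Assumed local endpoints; adjust to the actual Metastore and NameNode.
conf = hive_conf(
    "thrift://localhost:9083",
    "hdfs://localhost:9000/user/hive/warehouse",
)
print(conf)
```

With a running Metastore, build_session(conf).sql("SHOW DATABASES").show() should then list the Hive databases.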
