Tutorial: Hadoop & Hive Setup on Ubuntu/Windows with PostgreSQL Metastore
This guide covers installation and setup of Hadoop, Hive, and PostgreSQL as a Hive Metastore on Ubuntu.
1. Install and Configure Hadoop
Download Hadoop:
https://hadoop.apache.org/releases.html
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
tar -xvzf hadoop-3.4.0.tar.gz
mv hadoop-3.4.0 ~/hadoop
Update Hadoop Configuration
- Edit hadoop-env.sh → set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64/
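If you are unsure which JDK path to use on your machine, it can be derived from the java binary on PATH (a sketch; prints a hint instead if no Java is installed):

```shell
# Resolve the real JDK home by following symlinks from the java on PATH.
java_bin=$(command -v java || true)
if [ -n "$java_bin" ]; then
  readlink -f "$java_bin" | sed 's:/bin/java$::'
else
  echo "java not found - install a JDK first"
fi
```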
- Edit core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9745</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/sthithapragna/hadoop/tmp</value>
</property>
</configuration>
- Edit hdfs-site.xml (NameNode + DataNode paths):
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/sthithapragna/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/sthithapragna/hdfs/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
- Edit mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- Edit yarn-site.xml (yarn.nodemanager.aux-services enables the shuffle service that MapReduce jobs on YARN need):
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.web-proxy.address</name>
<value>localhost:9099</value>
</property>
</configuration>
Update .bashrc
export HADOOP_HOME=/home/sthithapragna/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
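After running `source ~/.bashrc`, a quick sanity check that the directories behind the exports actually exist (a sketch; the paths are the ones assumed in this guide, so adjust for your home directory):

```shell
# Print ok/missing for each directory the exports point at.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop}"
JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/java-17-openjdk-amd64}"
for d in "$HADOOP_HOME/bin" "$HADOOP_HOME/sbin" "$JAVA_HOME"; do
  if [ -d "$d" ]; then echo "ok: $d"; else echo "missing: $d"; fi
done
```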
Enable SSH for Hadoop
sudo apt update
sudo apt install openssh-server -y
sudo systemctl enable ssh
sudo systemctl start ssh
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost
Format & Start Hadoop
hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
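Once both scripts finish, `jps` should list the Hadoop daemons; a quick check with a fallback message when nothing is running:

```shell
# NameNode, DataNode, ResourceManager and NodeManager should all appear
# once DFS and YARN are up; otherwise print a hint.
jps 2>/dev/null | grep -E 'NameNode|DataNode|ResourceManager|NodeManager' \
  || echo "no Hadoop daemons found - check \$HADOOP_HOME/logs"
```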
2. Install and Configure Hive
Download and extract Hive:
wget https://dlcdn.apache.org/hive/hive-4.1.0/apache-hive-4.1.0-bin.tar.gz
tar -zxvf apache-hive-4.1.0-bin.tar.gz
mv apache-hive-4.1.0-bin ~/hive
Update Hive configuration:
cd ~/hive/conf
mv hive-default.xml.template hive-site.xml
mv hive-env.sh.template hive-env.sh
Edit hive-site.xml (temporary Derby DB config):
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/sthithapragna/hive/metastore/metastore_db;create=true</value>
Update .bashrc
export HIVE_HOME=/home/sthithapragna/hive
export PATH=$PATH:$HIVE_HOME/bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
Prepare HDFS directories
hdfs dfs -mkdir -p /user/hive/warehouse /tmp/hive
hdfs dfs -chmod -R 777 /user/hive /tmp/hive
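To confirm the warehouse and scratch directories landed with the expected permissions (a sketch; it needs the HDFS daemons from step 1 to be running and falls back to a hint otherwise):

```shell
# Recursively list both directories; fall back to a hint if HDFS is down.
hdfs dfs -ls -R /user/hive /tmp/hive 2>/dev/null \
  || echo "HDFS not reachable - is the NameNode running?"
```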
3. Configure PostgreSQL as Hive Metastore
Install PostgreSQL:
sudo apt install postgresql -y
sudo systemctl status postgresql
Create Hive Metastore user + DB:
sudo -i -u postgres
psql
CREATE USER psqluser WITH PASSWORD 'sthithapragna';
CREATE DATABASE pymetastore OWNER psqluser;
\q
Edit pg_hba.conf to allow local connections (change auth to trust):
local all all trust
host all all 127.0.0.1/32 trust
host all all ::1/128 trust
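The location of pg_hba.conf varies by PostgreSQL version (e.g. /etc/postgresql/16/main/pg_hba.conf); the server itself can report it, and it must be reloaded for the edit to take effect. A sketch, with a fallback when the server is not reachable:

```shell
# Ask the server where pg_hba.conf lives, then reload the configuration.
psql -U postgres -t -w -c "SHOW hba_file;" 2>/dev/null \
  || echo "psql not reachable - run this after PostgreSQL is installed"
sudo -n systemctl reload postgresql 2>/dev/null \
  || echo "reload skipped - run: sudo systemctl reload postgresql"
```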
Update Hive to use PostgreSQL
Edit hive-site.xml:
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:5432/pymetastore</value>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
<name>javax.jdo.option.ConnectionUserName</name>
<value>psqluser</value>
<name>javax.jdo.option.ConnectionPassword</name>
<value>sthithapragna</value>
Initialize Hive Metastore
- Download the PostgreSQL JDBC driver from https://jdbc.postgresql.org/download/
- Copy the driver jar into Hive's lib directory:
cp postgresql-42.7.2.jar ~/hive/lib/
schematool -dbType postgres -initSchema --verbose
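If initialization succeeded, the metastore tables (DBS, TBLS, COLUMNS_V2, and several dozen others) now exist in pymetastore. A quick count, as a sketch with a fallback when the database is not reachable; with the trust auth configured earlier, no password prompt is expected:

```shell
# Count metastore tables created by schematool in the public schema.
psql -h localhost -U psqluser -d pymetastore -t -w \
  -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';" \
  2>/dev/null || echo "pymetastore not reachable - check pg_hba.conf and the service"
```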
Start Hive Services
mkdir -p $HIVE_HOME/logs
$HIVE_HOME/bin/hive --service metastore 1>$HIVE_HOME/logs/metastore.out 2>$HIVE_HOME/logs/metastore.log &
$HIVE_HOME/bin/hiveserver2 1>$HIVE_HOME/logs/hive.out 2>$HIVE_HOME/logs/hive.log &
Connect with Beeline
beeline --verbose=true -u "jdbc:hive2://127.0.0.1:10000/default"
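A minimal end-to-end smoke test can be run through Beeline non-interactively (a sketch; smoke_test_db matches the example database queried in section 4, and the fallback fires when HiveServer2 is not up):

```shell
# Create a database and table, insert a row, and read it back through HiveServer2.
beeline -u "jdbc:hive2://127.0.0.1:10000/default" -e "
  CREATE DATABASE IF NOT EXISTS smoke_test_db;
  CREATE TABLE IF NOT EXISTS smoke_test_db.t (id INT, name STRING);
  INSERT INTO smoke_test_db.t VALUES (1, 'hello');
  SELECT * FROM smoke_test_db.t;
" 2>/dev/null || echo "HiveServer2 not reachable on port 10000"
```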
4. Handy Queries in PostgreSQL
- List Hive databases:
SELECT "NAME", "DB_LOCATION_URI" FROM "DBS";
- Show all tables in a database (example: smoke_test_db):
SELECT "TBL_NAME", "TBL_TYPE"
FROM "TBLS" t
JOIN "DBS" d ON t."DB_ID"=d."DB_ID"
WHERE d."NAME"='smoke_test_db';
- Show columns of a table:
SELECT c."COLUMN_NAME", c."TYPE_NAME"
FROM "COLUMNS_V2" c
JOIN "SDS" s ON c."CD_ID"=s."CD_ID"
JOIN "TBLS" t ON s."SD_ID"=t."SD_ID"
JOIN "DBS" d ON t."DB_ID"=d."DB_ID"
WHERE d."NAME"='smoke_test_db' AND t."TBL_NAME"='your_table';
✅ You now have Hadoop, Hive, and PostgreSQL Metastore configured on Ubuntu 🚀
This guide covers installation and setup of Hadoop and Hive on Windows, using PostgreSQL as the Hive Metastore database.
1. Install and Configure Hadoop
Verify Java installation:
java -version
Download Hadoop binaries:
https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Download hadoop.dll and winutils.exe from:
https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.6/bin
Place them in C:\hadoop-3.3.6\bin\.
Edit Configuration Files
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/C:/Users/sthithapragna/hadoop-3.3.6/tmp</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/C:/Users/sthithapragna/hadoop-3.3.6/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/C:/Users/sthithapragna/hadoop-3.3.6/data/datanode</value>
</property>
</configuration>
Set Java Path in hadoop-env.cmd (no quotes: cmd's set command includes them in the value, and the 8.3 short path Progra~1 avoids the space in "Program Files"):
set JAVA_HOME=C:\Progra~1\Java\jre1.8.0_451
Set Environment Variables (quote values containing spaces; note that setx truncates values longer than 1024 characters, so a long PATH is safer to edit via the System Properties dialog):
setx JAVA_HOME "C:\Program Files\Java\jdk-1.8"
setx HADOOP_HOME C:\hadoop-3.3.6
setx HADOOP_OPTS "-Djava.library.path=%HADOOP_HOME%\bin"
setx PATH "%PATH%;%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin"
Check Diagnostics
winutils
hadoop version
Create Directories and Set Permissions
New-Item -ItemType Directory -Force C:\hadoop-3.3.6\data\namenode
New-Item -ItemType Directory -Force C:\hadoop-3.3.6\data\datanode
New-Item -ItemType Directory -Force C:\hadoop-3.3.6\tmp
winutils chmod -R 777 C:\hadoop-3.3.6\data
winutils chmod -R 777 C:\hadoop-3.3.6\tmp
Format HDFS and Start Hadoop
hdfs namenode -format
start-dfs.cmd
start-yarn.cmd
Check if services are running:
netstat -ano | findstr ":9000"
netstat -ano | findstr ":9870"
Cluster health:
hdfs dfsadmin -report
yarn node -list
2. Install and Configure Hive
Ensure prerequisites:
- Hadoop HDFS and YARN are running (start-dfs.cmd / start-yarn.cmd)
- PostgreSQL is installed (metastore user & DB created)
- Java JDK (not just a JRE) is installed
- winutils.exe and hadoop.dll are in %HADOOP_HOME%\bin
Download Hive
https://dlcdn.apache.org/hive/
Extract Hive to C:\hive-4.0.1.
Set Environment Variables
setx HIVE_HOME C:\hive-4.0.1
setx PATH "%PATH%;%HIVE_HOME%\bin"
Configure PostgreSQL Metastore
In PostgreSQL:
CREATE USER psqluser WITH PASSWORD 'sthithapragna';
CREATE DATABASE pymetastore OWNER psqluser;
\c pymetastore
Copy JDBC driver to:
C:\hive-4.0.1\lib
Create hive-site.xml
Save under C:\hive-4.0.1\conf:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:5432/pymetastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>psqluser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>sthithapragna</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
</configuration>
Initialize Schema
schematool -dbType postgres -initSchema
Start Hive Services
hive --service metastore
hive --service hiveserver2
Check services:
netstat -ano | findstr "9083"
netstat -ano | findstr "10000"
Configure Beeline
Edit beeline.cmd and replace with:
@echo off
set JAVA_OPTS=-Djline.terminal=none
set HIVE_HOME=C:\hive-4.0.1
set HADOOP_HOME=C:\hadoop-3.3.6
set CLASSPATH=%HIVE_HOME%\conf;%HIVE_HOME%\lib\*;%HADOOP_HOME%\share\hadoop\common\*;%HADOOP_HOME%\share\hadoop\common\lib\*;%HADOOP_HOME%\share\hadoop\hdfs\*;%HADOOP_HOME%\share\hadoop\hdfs\lib\*;%HADOOP_HOME%\share\hadoop\mapreduce\*;%HADOOP_HOME%\share\hadoop\mapreduce\lib\*;%HADOOP_HOME%\share\hadoop\yarn\*;%HADOOP_HOME%\share\hadoop\yarn\lib\*
java %JAVA_OPTS% -cp "%CLASSPATH%" org.apache.hive.beeline.BeeLine %*
Connect with Beeline
beeline -u jdbc:hive2://localhost:10000
Expected prompt:
0: jdbc:hive2://localhost:10000>
✅ Hadoop and Hive are now configured on Windows 🚀
