Tutorial: Hadoop & Hive Setup on Ubuntu/Windows with PostgreSQL Metastore

This guide covers installation and setup of Hadoop, Hive, and PostgreSQL as a Hive Metastore on Ubuntu.

1. Install and Configure Hadoop

Download Hadoop:

https://hadoop.apache.org/releases.html
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz
tar -xvzf hadoop-3.4.0.tar.gz
mv hadoop-3.4.0 ~/hadoop

Update Hadoop Configuration

  • Edit hadoop-env.sh → set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64/
  • Edit core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9745</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/sthithapragna/hadoop/tmp</value>
  </property>
</configuration>
  • Edit hdfs-site.xml (NameNode + DataNode paths):
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/sthithapragna/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/sthithapragna/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
  • Edit mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
  • Edit yarn-site.xml:
<configuration>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>localhost:9099</value>
  </property>
</configuration>
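Before moving on, the HDFS URI set in core-site.xml above can be sanity-checked with a few lines of Python (an illustration, not part of the Hadoop setup itself; the URI value is taken from this guide's configuration):

```python
from urllib.parse import urlparse

# Value from core-site.xml in this guide; adjust to your own configuration.
fs_default = "hdfs://localhost:9745"

parsed = urlparse(fs_default)
print(parsed.scheme)    # must be "hdfs" for HDFS clients
print(parsed.hostname)  # host the NameNode listens on
print(parsed.port)      # RPC port clients will connect to
```

This is the same host:port that `hdfs dfs` commands and Hive will use later, so catching a typo here saves a failed format/start cycle.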

Update .bashrc

export HADOOP_HOME=/home/sthithapragna/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME

Enable SSH for Hadoop

sudo apt update
sudo apt install openssh-server -y
sudo systemctl enable ssh
sudo systemctl start ssh
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost

Format & Start Hadoop

hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
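After starting the daemons, `jps` should list NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. A small Python sketch (illustrative only, not shipped with Hadoop) can also probe whether a service port is accepting connections; the ports below are assumed from this guide's configuration (9745 for fs.defaultFS) and the Hadoop 3.x default NameNode web UI (9870):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 9745: fs.defaultFS RPC port from this guide; 9870: NameNode web UI default.
for port in (9745, 9870):
    print(port, "open" if port_open("localhost", port) else "closed")
```

If either port reports closed, check the NameNode/DataNode logs under `$HADOOP_HOME/logs` before continuing to the Hive setup.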

2. Install and Configure Hive

Download and extract Hive:

wget https://dlcdn.apache.org/hive/hive-4.1.0/apache-hive-4.1.0-bin.tar.gz
tar -zxvf apache-hive-4.1.0-bin.tar.gz
mv apache-hive-4.1.0-bin ~/hive

Update Hive configuration:

cd ~/hive/conf
mv hive-default.xml.template hive-site.xml
mv hive-env.sh.template hive-env.sh

Edit hive-site.xml (temporary Derby DB config):

<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/home/sthithapragna/hive/metastore/metastore_db;create=true</value>

Update .bashrc

export HIVE_HOME=/home/sthithapragna/hive
export PATH=$PATH:$HIVE_HOME/bin
export HIVE_CONF_DIR=$HIVE_HOME/conf

Prepare HDFS directories

hdfs dfs -mkdir -p /user/hive/warehouse /tmp/hive
hdfs dfs -chmod -R 777 /user/hive /tmp/hive

3. Configure PostgreSQL as Hive Metastore

Install PostgreSQL:

sudo apt install postgresql -y
sudo systemctl status postgresql

Create Hive Metastore user + DB:

sudo -i -u postgres
psql
CREATE USER psqluser WITH PASSWORD 'sthithapragna';
CREATE DATABASE pymetastore OWNER psqluser;
\q

Edit pg_hba.conf to allow local connections (change the auth method to trust; acceptable for a local development setup, not for production):

local all all trust
host all all 127.0.0.1/32 trust
host all all ::1/128 trust

Update Hive to use PostgreSQL

Edit hive-site.xml:

<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:5432/pymetastore</value>

<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>

<name>javax.jdo.option.ConnectionUserName</name>
<value>psqluser</value>

<name>javax.jdo.option.ConnectionPassword</name>
<value>sthithapragna</value>
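The four properties above follow a fixed name/value pattern, so a short Python sketch (illustrative only; the connection values are the ones assumed throughout this guide) can generate the XML and round-trip it to catch typos before restarting Hive:

```python
import xml.etree.ElementTree as ET

# Metastore connection values assumed from this guide.
props = {
    "javax.jdo.option.ConnectionURL": "jdbc:postgresql://localhost:5432/pymetastore",
    "javax.jdo.option.ConnectionDriverName": "org.postgresql.Driver",
    "javax.jdo.option.ConnectionUserName": "psqluser",
    "javax.jdo.option.ConnectionPassword": "sthithapragna",
}

# Build the <configuration><property>... structure hive-site.xml expects.
config = ET.Element("configuration")
for name, value in props.items():
    prop = ET.SubElement(config, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value

xml_text = ET.tostring(config, encoding="unicode")

# Round-trip: parse the generated XML back into a dict and compare.
parsed = {
    p.findtext("name"): p.findtext("value")
    for p in ET.fromstring(xml_text).findall("property")
}
assert parsed == props
print(xml_text)
```

A malformed hive-site.xml fails at Hive startup with an opaque parser error, so validating the fragment up front is cheaper than restarting the metastore repeatedly.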

Initialize Hive Metastore

  • Download the PostgreSQL JDBC driver (postgresql-42.7.2.jar) from https://jdbc.postgresql.org/
  • Copy driver to Hive lib:
cp postgresql-42.7.2.jar ~/hive/lib/
schematool -dbType postgres -initSchema -verbose

Start Hive Services

mkdir -p $HIVE_HOME/logs
$HIVE_HOME/bin/hive --service metastore 1>$HIVE_HOME/logs/metastore.out 2>$HIVE_HOME/logs/metastore.log &
$HIVE_HOME/bin/hiveserver2 1>$HIVE_HOME/logs/hive.out 2>$HIVE_HOME/logs/hive.log &

Connect with Beeline

beeline --verbose=true -u "jdbc:hive2://127.0.0.1:10000/default"

4. Handy Queries in PostgreSQL

  • List Hive databases:
SELECT "NAME", "DB_LOCATION_URI" FROM "DBS";
  • Show all tables in a database (example: smoke_test_db):
SELECT "TBL_NAME", "TBL_TYPE"
FROM "TBLS" t
JOIN "DBS" d ON t."DB_ID"=d."DB_ID"
WHERE d."NAME"='smoke_test_db';
  • Show columns of a table:
SELECT c."COLUMN_NAME", c."TYPE_NAME"
FROM "COLUMNS_V2" c
JOIN "SDS" s ON c."CD_ID"=s."CD_ID"
JOIN "TBLS" t ON s."SD_ID"=t."SD_ID"
JOIN "DBS" d ON t."DB_ID"=d."DB_ID"
WHERE d."NAME"='smoke_test_db' AND t."TBL_NAME"='your_table';
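To see what the DBS/TBLS join above returns, here is a self-contained sketch that mocks a tiny slice of the metastore schema in sqlite3. The table and column names mirror the real PostgreSQL metastore; the row data is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal mock of the two metastore tables used by the query above.
cur.executescript("""
CREATE TABLE DBS (DB_ID INTEGER, NAME TEXT, DB_LOCATION_URI TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER, DB_ID INTEGER, TBL_NAME TEXT, TBL_TYPE TEXT);
INSERT INTO DBS VALUES (1, 'smoke_test_db',
    'hdfs://localhost:9745/user/hive/warehouse/smoke_test_db.db');
INSERT INTO TBLS VALUES (10, 1, 'events', 'MANAGED_TABLE');
""")

# Same join shape as the PostgreSQL query above.
rows = cur.execute("""
SELECT t.TBL_NAME, t.TBL_TYPE
FROM TBLS t JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE d.NAME = 'smoke_test_db'
""").fetchall()
print(rows)  # [('events', 'MANAGED_TABLE')]
```

Note that in the real PostgreSQL metastore the identifiers are created uppercase and must be double-quoted, exactly as in the queries above; sqlite3 is case-insensitive, so the mock omits the quoting.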

✅ You now have Hadoop, Hive, and PostgreSQL Metastore configured on Ubuntu 🚀

This guide covers installation and setup of Hadoop and Hive on Windows, using PostgreSQL as the Hive Metastore database.

1. Install and Configure Hadoop

Verify Java installation:

java -version

Download Hadoop binaries:

https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

Download hadoop.dll and winutils.exe from:

https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.6/bin

Place them in C:\hadoop-3.3.6\bin\.

Edit Configuration Files

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/C:/Users/sthithapragna/hadoop-3.3.6/tmp</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/C:/Users/sthithapragna/hadoop-3.3.6/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/C:/Users/sthithapragna/hadoop-3.3.6/data/datanode</value>
  </property>
</configuration>

Set Java Path in hadoop-env.cmd

set JAVA_HOME=C:\Progra~1\Java\jre1.8.0_451

Set Environment Variables

setx JAVA_HOME "C:\Program Files\Java\jdk-1.8"
setx HADOOP_HOME C:\hadoop-3.3.6
setx HADOOP_OPTS "-Djava.library.path=%HADOOP_HOME%\bin"
setx PATH "%PATH%;%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin"

Check Diagnostics

winutils
hadoop version

Create Directories and Set Permissions

New-Item -ItemType Directory -Force C:\hadoop-3.3.6\data\namenode
New-Item -ItemType Directory -Force C:\hadoop-3.3.6\data\datanode
New-Item -ItemType Directory -Force C:\hadoop-3.3.6\tmp

winutils chmod -R 777 C:\hadoop-3.3.6\data
winutils chmod -R 777 C:\hadoop-3.3.6\tmp

Format HDFS and Start Hadoop

hdfs namenode -format
start-dfs.cmd
start-yarn.cmd

Check if services are running:

netstat -ano | findstr ":9000"
netstat -ano | findstr ":9870"

Cluster health:

hdfs dfsadmin -report
yarn node -list

2. Install and Configure Hive

Ensure prerequisites:

  • Hadoop HDFS is running (start-dfs.cmd / start-yarn.cmd)
  • PostgreSQL is installed (metastore user & DB created)
  • Java JDK (not just JRE) is installed
  • winutils.exe and hadoop.dll are in %HADOOP_HOME%\bin
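The native-files item in the checklist above can be automated. The sketch below is a hypothetical helper (not shipped with Hadoop) that reports which required files are missing under a given HADOOP_HOME:

```python
from pathlib import Path

def missing_hadoop_files(hadoop_home: str) -> list:
    """Return names of required native files missing from <hadoop_home>/bin."""
    bin_dir = Path(hadoop_home) / "bin"
    required = ["winutils.exe", "hadoop.dll"]
    return [name for name in required if not (bin_dir / name).exists()]

# Path assumed from this guide's Windows layout.
print(missing_hadoop_files(r"C:\hadoop-3.3.6"))
```

An empty list means both native files are in place; otherwise HDFS commands fail on Windows with `UnsatisfiedLinkError` or "winutils.exe not found" errors.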

Download Hive

https://dlcdn.apache.org/hive/

Extract Hive to C:\hive-4.0.1.

Set Environment Variables

setx HIVE_HOME C:\hive-4.0.1
setx PATH "%PATH%;%HIVE_HOME%\bin"

Configure PostgreSQL Metastore

In PostgreSQL:

CREATE USER psqluser WITH PASSWORD 'sthithapragna';
CREATE DATABASE pymetastore OWNER psqluser;
\c pymetastore

Copy JDBC driver to:

C:\hive-4.0.1\lib

Create hive-site.xml

Save under C:\hive-4.0.1\conf:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:5432/pymetastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>psqluser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>sthithapragna</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
  </property>
</configuration>

Initialize Schema

schematool -dbType postgres -initSchema

Start Hive Services

hive --service metastore
hive --service hiveserver2

Check services:

netstat -ano | findstr "9083"
netstat -ano | findstr "10000"

Configure Beeline

Edit beeline.cmd and replace with:

@echo off
set JAVA_OPTS=-Djline.terminal=none
set HIVE_HOME=C:\hive-4.0.1
set HADOOP_HOME=C:\hadoop-3.3.6
set CLASSPATH=%HIVE_HOME%\conf;%HIVE_HOME%\lib\*;%HADOOP_HOME%\share\hadoop\common\*;%HADOOP_HOME%\share\hadoop\common\lib\*;%HADOOP_HOME%\share\hadoop\hdfs\*;%HADOOP_HOME%\share\hadoop\hdfs\lib\*;%HADOOP_HOME%\share\hadoop\mapreduce\*;%HADOOP_HOME%\share\hadoop\mapreduce\lib\*;%HADOOP_HOME%\share\hadoop\yarn\*;%HADOOP_HOME%\share\hadoop\yarn\lib\*
java %JAVA_OPTS% -cp "%CLASSPATH%" org.apache.hive.beeline.BeeLine %*

Connect with Beeline

beeline -u jdbc:hive2://localhost:10000

Expected prompt:

0: jdbc:hive2://localhost:10000>

✅ Hadoop and Hive are now configured on Windows 🚀
