Tutorial: Setting up Java, Anaconda, and Apache Spark with PySpark Notebook on Ubuntu & Windows
This guide walks you through installing the tools required for data engineering and machine learning. The first part covers Ubuntu; a Windows walkthrough follows. By the end, you’ll have a working Jupyter Notebook connected to PySpark.
1. Install Ubuntu Desktop and VirtualBox
- Download the Ubuntu Desktop ISO
- Download and install Oracle VirtualBox
2. Verify or Install Java (JDK 17)
java -version
sudo apt-get install software-properties-common
sudo apt update
sudo apt install openjdk-17-jdk
java -version
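If several JDKs are installed, it can help to confirm which java binary is actually on your PATH and derive a JAVA_HOME from it. A minimal Python sketch (the `find_java_home` helper is illustrative, not part of any tool):

```python
import os
import shutil

def find_java_home():
    """Resolve the java binary on PATH and derive a JAVA_HOME from it.
    Returns None when no java executable is found."""
    java = shutil.which("java")
    if java is None:
        return None
    # Follow symlinks such as /usr/bin/java -> /usr/lib/jvm/.../bin/java
    real = os.path.realpath(java)
    # JAVA_HOME is two directory levels above .../bin/java
    return os.path.dirname(os.path.dirname(real))

if __name__ == "__main__":
    print(find_java_home())
```

The value it prints is what you would later export as JAVA_HOME.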
3. Install Python & Anaconda
python --version
# Download installer from https://www.anaconda.com/download/success
bash Anaconda3-<version>-Linux-x86_64.sh
source ~/.bashrc
4. Install Apache Spark
# Download Spark from https://spark.apache.org/downloads.html
tar -zxvf spark-3.5.6-bin-hadoop3.tgz
mv spark-3.5.6-bin-hadoop3 ~/
cd ~/spark-3.5.6-bin-hadoop3/
./bin/pyspark
5. Configure Environment Variables
Edit your ~/.bashrc file and add:
export PATH="/home/your-username/anaconda3/bin:$PATH"

function sparknotebook() {
  export SPARK_HOME=/home/your-username/spark-3.5.6-bin-hadoop3
  export PYSPARK_PYTHON=python3
  export PYSPARK_DRIVER_PYTHON=jupyter
  export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
  $SPARK_HOME/bin/pyspark
}
Then reload your shell configuration:
source ~/.bashrc
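The sparknotebook function simply exports variables that Spark's launcher script reads before starting the driver. The same settings can be expressed from Python, which is handy for checking what the launcher will see; the paths below are placeholders for your machine:

```python
import os

# The same settings the sparknotebook() shell function exports,
# expressed in Python. The SPARK_HOME path is an example only.
env = {
    "SPARK_HOME": os.path.expanduser("~/spark-3.5.6-bin-hadoop3"),
    "PYSPARK_PYTHON": "python3",
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
}
os.environ.update(env)

# bin/pyspark consults these variables when it launches the driver
print(os.environ["PYSPARK_DRIVER_PYTHON"])
```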
6. Launch PySpark in Jupyter Notebook
Run the following command to start a Jupyter Notebook connected to Spark:
sparknotebook
⚠️ Troubleshooting Guide for Spark + PySpark Setup
1. Java Not Found or Wrong Version
sudo apt install openjdk-17-jdk -y
# Add these lines to ~/.bashrc, then reload it:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
source ~/.bashrc
2. PySpark Doesn’t Start (JAVA_HOME not set)
# Add this line to ~/.bashrc, then reload it:
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
source ~/.bashrc
3. Jupyter Notebook Doesn’t Open
jupyter notebook
# Copy the token URL (http://127.0.0.1:8888/?token=...)
4. sparknotebook Command Not Found
source ~/.bashrc
sparknotebook
5. Python Not Found or Wrong Version
conda --version
# Ensure Anaconda is installed and PATH is correct
export PATH="/home/your-username/anaconda3/bin:$PATH"
6. Spark Version Issues
If you get Hadoop-related errors, download Spark pre-built for Hadoop 3.x from Spark Downloads.
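Most of the failures above boil down to a missing environment variable. A small diagnostic sketch (the `missing_vars` helper is hypothetical, not a standard tool) that reports which of them are unset:

```python
import os

# Variables the Ubuntu setup in this guide relies on.
REQUIRED = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON")

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Unset variables:", ", ".join(missing))
    else:
        print("Environment looks good.")
```

Run it from the same shell you start pyspark in, since each terminal has its own environment.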
Windows Setup
This section walks you through setting up PySpark on Windows using Anaconda, Apache Spark, Hadoop winutils, and Java. By the end, you’ll have a working PySpark environment inside your Conda virtual environment.
1. Install Anaconda (Python Distribution)
- Download Anaconda from the official page: Anaconda Download.
- Run the installer (Anaconda3-<version>-Windows-x86_64.exe) and follow the setup wizard.
- After installation, open Anaconda Prompt from the Start menu.
2. Verify Anaconda Installation
conda info
conda env list
3. Create and Activate a Conda Environment
Create a new Python 3.13 environment for PySpark:
conda create --name pyspark_env python=3.13
conda activate pyspark_env
4. Download and Extract Apache Spark
- Download Spark from Apache Spark Downloads (choose Pre-built for Apache Hadoop 3.x).
- Extract the .tgz file using 7-Zip or WinRAR.
- Move the extracted Spark folder to your home directory:
move spark-3.5.6-bin-hadoop3 C:\Users\<username>\spark-3.5.6
5. Install Java JDK
- Download and install the latest JDK from Oracle JDK Downloads.
- Take note of the installation path (e.g., C:\Program Files\Java\jdk-24).
6. Add Hadoop Winutils
- Create the required Hadoop directory inside Spark:
mkdir C:\Users\<username>\spark-3.5.6\hadoop\bin
- Download winutils.exe from GitHub.
- Move it into the Hadoop bin directory:
move C:\Users\<username>\Downloads\winutils.exe C:\Users\<username>\spark-3.5.6\hadoop\bin
7. Configure Environment Variables
Open System Properties → Advanced → Environment Variables and add:
- SPARK_HOME = C:\Users\<username>\spark-3.5.6
- HADOOP_HOME = C:\Users\<username>\spark-3.5.6\hadoop
- Add to PATH: C:\Users\<username>\spark-3.5.6\bin
- SPARK_LOCAL_HOSTNAME = localhost
- JAVA_HOME = C:\Program Files\Java\<your JDK version directory>
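After setting the variables, you can verify them from a fresh Anaconda Prompt before going further. A stdlib-only sketch (the `report` helper is illustrative; the variable names are the ones configured above):

```python
import os

# The variables configured in step 7.
EXPECTED = ("SPARK_HOME", "HADOOP_HOME", "SPARK_LOCAL_HOSTNAME", "JAVA_HOME")

def report(env=os.environ):
    """Map each expected variable to its value, or a marker when unset."""
    return {name: env.get(name, "<not set>") for name in EXPECTED}

for name, value in report().items():
    print(f"{name} = {value}")
```

Any "<not set>" entries mean the prompt was opened before the variables were saved, or the variable name was mistyped.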
8. Restart and Reactivate Conda Environment
Restart Anaconda Prompt (so new environment variables are applied), then activate your environment again:
conda activate pyspark_env
9. Launch PySpark
Test PySpark from the terminal:
pyspark
10. (Optional) Enable PySpark in Jupyter Notebook
Install Findspark to help integrate PySpark with Jupyter:
python -m pip install findspark
Inside a Jupyter notebook, add:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TestApp").getOrCreate()
print(spark.version)
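Under the hood, findspark.init() locates SPARK_HOME and puts Spark's bundled Python sources on sys.path so that `import pyspark` works. A rough stdlib sketch of that mechanism, assuming the standard Spark directory layout (the `add_spark_to_path` helper is illustrative, not findspark's actual API):

```python
import glob
import os
import sys

def add_spark_to_path(spark_home):
    """Roughly what findspark.init() does: prepend Spark's bundled
    Python sources (the pyspark package and the py4j zip) to sys.path.
    spark_home is whatever SPARK_HOME points at."""
    python_dir = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip under python/lib, e.g. py4j-*-src.zip
    py4j = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
    paths = [python_dir] + py4j
    for p in paths:
        if p not in sys.path:
            sys.path.insert(0, p)
    return paths
```

This is why findspark must run before `import pyspark`: the import only succeeds once these paths are in place.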
You now have a fully working PySpark environment inside Anaconda on Windows 🎉