E01 – Hands-On Exercises: Getting Started with Apache Spark Shell

πŸš€ E01 – Hands-On Exercises: Getting Started with Apache Spark Shell

Exercise 1: Run the Built-in SparkPi Example

Objective: Verify your Spark installation by running the sample Pi calculation.

# Scala/Java
$ $SPARK_HOME/bin/run-example SparkPi

# Python
$ $SPARK_HOME/bin/spark-submit examples/src/main/python/pi.py

# R
$ $SPARK_HOME/bin/spark-submit examples/src/main/r/dataframe.R
Expected Outcome: Output includes a line like "Pi is roughly 3.1425" (the exact digits vary from run to run).

Exercise 2: Start the Spark Shell (Scala)

Objective: Launch the interactive Spark Scala shell.

$SPARK_HOME/bin/spark-shell
Expected Outcome: Prompt scala> appears.

Exercise 3: Explore Spark Predefined Objects

Objective: Inspect the automatically created spark (SparkSession) and sc (SparkContext) objects.

:type spark
:type sc
System.getenv("PWD")
:help
Expected Outcome: Prints types of SparkSession and SparkContext, current directory, and shell commands.

Exercise 4: Launch Spark with Custom Configurations

Objective: Start Spark shell with YARN, memory configs, and Hudi package.

$SPARK_HOME/bin/spark-shell \
 --master yarn \
 --deploy-mode client \
 --driver-memory 16g \
 --executor-memory 32g \
 --executor-cores 4 \
 --conf "spark.sql.shuffle.partitions=1000" \
 --conf "spark.executor.memoryOverhead=4024" \
 --conf "spark.memory.fraction=0.7" \
 --conf "spark.memory.storageFraction=0.3" \
 --packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0 \
 --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
 --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
 --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
Expected Outcome: Spark shell launches with advanced configs and Hudi support.
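As a back-of-the-envelope check on the flags above: Spark's unified memory model sets aside a fixed reserved slice of the executor heap (300 MB), sizes the unified execution/storage pool with spark.memory.fraction, and protects a storage share inside that pool with spark.memory.storageFraction; spark.executor.memoryOverhead is requested from YARN on top of the heap. A rough sketch of the arithmetic (approximate, and subject to change across Spark versions):

```python
# Back-of-the-envelope sizing for the spark-shell flags above (values in MB).
executor_heap_mb = 32 * 1024   # --executor-memory 32g
overhead_mb = 4024             # spark.executor.memoryOverhead
reserved_mb = 300              # fixed reserved memory inside the JVM heap
memory_fraction = 0.7          # spark.memory.fraction
storage_fraction = 0.3         # spark.memory.storageFraction

# Unified pool shared by execution and storage
unified_mb = (executor_heap_mb - reserved_mb) * memory_fraction
# Portion of the unified pool protected for cached data
storage_mb = unified_mb * storage_fraction
# Total memory YARN must allocate per executor container
container_mb = executor_heap_mb + overhead_mb

print(f"unified pool  : {unified_mb:.0f} MB")
print(f"storage share : {storage_mb:.0f} MB")
print(f"YARN container: {container_mb} MB")
```

This kind of arithmetic helps explain YARN "container killed" errors: the container request is heap plus overhead, not heap alone.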

Exercise 5: Create a DataFrame in Scala

Objective: Learn basic DataFrame creation.

import spark.implicits._

val cars = Seq(
  ("USA","Chrysler","Dodge","Jeep"),
  ("Germany","BMW","VW","Mercedes"),
  ("Spain", "GTA Spano","SEAT","Hispano Suiza")
)

val cars_df = cars.toDF()   // columns default to _1, _2, _3, _4; pass names to toDF(...) to rename them
cars_df.show()
Expected Outcome: Table with countries and cars prints to console.

Exercise 6: Connect to PostgreSQL via JDBC

Objective: Read a PostgreSQL table into Spark.

val df_postgresql = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://<host>:5432/db")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable","schema.table")
  .option("user","user_name")
  .option("password","your_password")
  .load()

df_postgresql.show()
Expected Outcome: PostgreSQL table data loads into a Spark DataFrame.

Exercise 7: Use PySpark Shell

Objective: Start Spark shell with Python.

$SPARK_HOME/bin/pyspark

cars = [
    ("USA","Chrysler","Dodge","Jeep"),
    ("Germany","BMW","VW","Mercedes"),
    ("Spain", "GTA Spano","SEAT","Hispano Suiza")
]

cars_df = spark.createDataFrame(cars)
cars_df.show()
Expected Outcome: Cars DataFrame prints in PySpark.

Exercise 8: Run Applications with spark-submit

Objective: Submit an example Spark job with arguments.

$ $SPARK_HOME/bin/spark-submit \
 --deploy-mode client \
 --master local \
 --class org.apache.spark.examples.SparkPi \
 $SPARK_HOME/examples/jars/spark-examples_2.12-3.3.0.jar 80
Expected Outcome: Job runs and prints a Pi approximation; the trailing argument (80) sets the number of partitions used for sampling.
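SparkPi estimates Pi by Monte Carlo sampling: random points are thrown into the unit square, and the fraction landing inside the quarter circle approximates Pi/4. A minimal single-machine sketch of the same idea in plain Python (no Spark needed, illustration only):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo Pi estimate, the same idea as SparkPi but single-threaded."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:   # point falls inside the quarter circle
            inside += 1
    return 4.0 * inside / num_samples

pi_est = estimate_pi(100_000)
print("Pi is roughly", pi_est)
```

SparkPi distributes exactly this sampling loop across partitions, which is why more partitions (the 80 above) means more samples and a tighter estimate.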

Exercise 9: Write and Run Your Own Scala Application

Objective: Compile and run a custom Scala app on Spark.

// Functions.scala
object Functions {
  // Function value that prints the sum of its two arguments
  val agregar = (x: Int, y: Int) => println(x + y)

  def main(args: Array[String]): Unit = {
    agregar(1, 2)
  }
}

# Compile and run
scalac ./Functions.scala -d Functions.jar
spark-submit --class Functions ./Functions.jar
Expected Outcome: Prints 3.

Exercise 10: Create a SparkSession in Scala

Objective: Programmatically create a SparkSession.

import org.apache.spark.sql.SparkSession

object NewSparkSession extends App {
  val spark = SparkSession.builder()
    .master("local[4]")
    .appName("Hands-On Spark 3")
    .getOrCreate()

  println(spark)
  println("The Spark Version is : " + spark.version)
}
Expected Outcome: SparkSession details and version print.

Exercise 11: Create a SparkSession in PySpark

Objective: Programmatically create a SparkSession in Python.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[4]") \
    .appName("Hands-On Spark 3") \
    .getOrCreate()

print(spark)
print("Spark Version : " + spark.version)
Expected Outcome: SparkSession details and version print.

Exercise 12: Explore Transformations vs Actions

Objective: Understand lazy evaluation in Spark RDDs.

rdd1 = spark.sparkContext.parallelize([1, 2, 3, 6, 7, 10])

# Transformations (lazy: nothing is computed yet)
rdd2 = rdd1.filter(lambda x: x > 5)
rdd3 = rdd2.map(lambda x: x * 2)

# Action: triggers the actual computation
result = rdd3.collect()
print(result)
Expected Outcome: [12, 14, 20]
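The lazy-then-eager behaviour can be imitated with plain Python generators (an analogy only, not Spark itself): the "transformations" merely build a pipeline, and nothing executes until an "action" consumes it.

```python
calls = []  # records every element that actually gets processed

def double(x):
    calls.append(x)
    return x * 2

data = [1, 2, 3, 6, 7, 10]
filtered = (x for x in data if x > 5)    # "transformation": nothing runs yet
mapped = (double(x) for x in filtered)   # still nothing has executed
assert calls == []                       # proof that the pipeline is lazy

result = list(mapped)                    # the "action" forces evaluation
print(result)                            # [12, 14, 20]
```

Only when list() consumes the pipeline does double() run, and only on the three elements that survived the filter, mirroring how collect() triggers the filter and map in the RDD version.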
