E01 - Hands-On Exercises: Getting Started with Apache Spark Shell
Exercise 1: Run the Built-in SparkPi Example
Objective: Verify your Spark installation by running the sample Pi calculation.
# Scala/Java
$ $SPARK_HOME/bin/run-example SparkPi
# Python
$ $SPARK_HOME/bin/spark-submit $SPARK_HOME/examples/src/main/python/pi.py
# R
$ $SPARK_HOME/bin/spark-submit $SPARK_HOME/examples/src/main/r/dataframe.R
Expected Outcome: Output should print something like Pi is roughly 3.1425.
Exercise 2: Start the Spark Shell (Scala)
Objective: Launch the interactive Spark Scala shell.
$SPARK_HOME/bin/spark-shell
Expected Outcome: The scala> prompt appears.
Exercise 3: Explore Spark Predefined Objects
Objective: Inspect the automatically created spark (SparkSession) and sc (SparkContext) objects.
:type spark
:type sc
System.getenv("PWD")
:help
Expected Outcome: Prints types of SparkSession and SparkContext, current directory, and shell commands.
Exercise 4: Launch Spark with Custom Configurations
Objective: Start the Spark shell on YARN with memory configs and the Hudi package. Note that spark-shell is interactive and supports only client deploy mode.
$SPARK_HOME/bin/spark-shell \
--master yarn \
--deploy-mode client \
--driver-memory 16g \
--executor-memory 32g \
--executor-cores 4 \
--conf "spark.sql.shuffle.partitions=1000" \
--conf "spark.executor.memoryOverhead=4024" \
--conf "spark.memory.fraction=0.7" \
--conf "spark.memory.storageFraction=0.3" \
--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0 \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
--conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
Expected Outcome: Spark shell launches with advanced configs and Hudi support.
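The memory flags above interact: under Spark's unified memory model, the pool shared by execution and storage is roughly (heap minus a fixed ~300 MB reserved) times spark.memory.fraction, and spark.memory.storageFraction of that pool is protected for cached data. A back-of-the-envelope sketch with the numbers from the command (the formulas are an approximation of Spark's memory-tuning docs, not an exact accounting):

```python
# Rough sizing for the flags used above (approximate unified-memory formulas).
heap_mb = 32 * 1024        # --executor-memory 32g
reserved_mb = 300          # memory Spark reserves internally (approximate)
fraction = 0.7             # spark.memory.fraction
storage_fraction = 0.3     # spark.memory.storageFraction

unified_mb = (heap_mb - reserved_mb) * fraction   # execution + storage pool
storage_mb = unified_mb * storage_fraction        # portion protected for caching
print(f"unified: {unified_mb:.0f} MB, storage: {storage_mb:.0f} MB")
```

Remember that spark.executor.memoryOverhead is allocated on top of the heap, so the container request to YARN is executor memory plus overhead.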
Exercise 5: Create a DataFrame in Scala
Objective: Learn basic DataFrame creation.
import spark.implicits._
val cars = Seq(
  ("USA", "Chrysler", "Dodge", "Jeep"),
  ("Germany", "BMW", "VW", "Mercedes"),
  ("Spain", "GTA Spano", "SEAT", "Hispano Suiza")
)
val cars_df = cars.toDF()
cars_df.show()
Expected Outcome: Table with countries and cars prints to console.
Exercise 6: Connect to PostgreSQL via JDBC
Objective: Read a PostgreSQL table into Spark.
val df_postgresql = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://<host>:5432/db")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "schema.table")
  .option("user", "user_name")
  .option("password", "your_password")
  .load()
df_postgresql.show()
Expected Outcome: PostgreSQL table data loads into a Spark DataFrame.
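The url option follows the standard PostgreSQL JDBC URL pattern jdbc:postgresql://host:port/database. A tiny pure-Python sketch of how such a URL is assembled (the helper name and the host/database values are hypothetical, for illustration only):

```python
def postgres_jdbc_url(host: str, port: int = 5432, database: str = "db") -> str:
    """Build a PostgreSQL JDBC URL of the form used in the .option("url", ...) call."""
    return f"jdbc:postgresql://{host}:{port}/{database}"

# Hypothetical host and database names:
print(postgres_jdbc_url("db.example.com", 5432, "sales"))
# jdbc:postgresql://db.example.com:5432/sales
```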
Exercise 7: Use PySpark Shell
Objective: Start Spark shell with Python.
$SPARK_HOME/bin/pyspark
cars = [
    ("USA", "Chrysler", "Dodge", "Jeep"),
    ("Germany", "BMW", "VW", "Mercedes"),
    ("Spain", "GTA Spano", "SEAT", "Hispano Suiza")
]
cars_df = spark.createDataFrame(cars)
cars_df.show()
Expected Outcome: Cars DataFrame prints in PySpark.
Exercise 8: Run Applications with spark-submit
Objective: Submit an example Spark job with arguments.
$ $SPARK_HOME/bin/spark-submit \
--deploy-mode client \
--master local \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.3.0.jar 80
Expected Outcome: Job runs and prints Pi approximation.
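SparkPi approximates Pi by Monte Carlo sampling: it throws random points into the unit square and counts how many land inside the quarter circle. A minimal single-machine sketch of the same idea in plain Python (not the Spark implementation itself); the trailing 80 passed to spark-submit is the number of partitions SparkPi spreads this sampling over:

```python
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of Pi: the fraction of random points in the
    unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

print("Pi is roughly", estimate_pi(100_000))
```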
Exercise 9: Write and Run Your Own Scala Application
Objective: Compile and run a custom Scala app on Spark.
// Functions.scala
object Functions {
  // A function value: takes two Ints and prints their sum
  val agregar = (x: Int, y: Int) => println(x + y)

  def main(args: Array[String]): Unit = {
    agregar(1, 2)
  }
}
# Compile and run
scalac ./Functions.scala -d Functions.jar
spark-submit --class Functions ./Functions.jar
Expected Outcome: Prints 3.
Exercise 10: Create a SparkSession in Scala
Objective: Programmatically create a SparkSession.
import org.apache.spark.sql.SparkSession
object NewSparkSession extends App {
  val spark = SparkSession.builder()
    .master("local[4]")
    .appName("Hands-On Spark 3")
    .getOrCreate()
  println(spark)
  println("The Spark version is: " + spark.version)
}
Expected Outcome: SparkSession details and version print.
Exercise 11: Create a SparkSession in PySpark
Objective: Programmatically create a SparkSession in Python.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("Hands-On Spark 3") \
    .getOrCreate()
print(spark)
print("Spark version: " + spark.version)
Expected Outcome: SparkSession details and version print.
Exercise 12: Explore Transformations vs Actions
Objective: Understand lazy evaluation in Spark RDDs.
rdd1 = spark.sparkContext.parallelize([1,2,3,6,7,10])
rdd2 = rdd1.filter(lambda x: x > 5)
rdd3 = rdd2.map(lambda x: x * 2)
# Action
result = rdd3.collect()
print(result)
Expected Outcome:
[12, 14, 20]
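The same lazy pipeline can be mimicked with Python's built-in filter and map, which are also lazy iterators: nothing runs until something consumes them, just as Spark defers all work until an action such as collect() is called:

```python
data = [1, 2, 3, 6, 7, 10]

step1 = filter(lambda x: x > 5, data)   # "transformation": nothing computed yet
step2 = map(lambda x: x * 2, step1)     # still nothing computed
result = list(step2)                    # "action": forces evaluation
print(result)  # [12, 14, 20]
```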