E02 - Hands-On Exercises: PySpark Fast Start + Scala App Basics
Exercise 1: Create a CSV File
Objective: Prepare a sample dataset to use in PySpark.
movie,year
The Godfather,1972
The Shawshank Redemption,1994
Pulp Fiction,1994
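The file above can also be created programmatically. A minimal sketch using Python's standard csv module (the filename movies.csv matches the one used in the next exercise):

```python
import csv

# Header row plus the three sample movies shown above
rows = [
    ("movie", "year"),
    ("The Godfather", 1972),
    ("The Shawshank Redemption", 1994),
    ("Pulp Fiction", 1994),
]

# newline="" prevents the csv module from adding extra blank lines on Windows
with open("movies.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```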
Expected Outcome: A file movies.csv is created with 3 rows of movie data.
Exercise 2: Read CSV into PySpark DataFrame
Objective: Load and display the CSV file in PySpark.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("movies.csv", header=True, sep=',')
df.show()
Expected Outcome: A DataFrame with 2 columns (movie, year) and 3 rows prints to the console.
Exercise 3: Estimate Pi with PySpark
Objective: Use distributed computation in PySpark to approximate Pi.
from pyspark import SparkContext
import random

# Create or reuse a SparkContext
sc = SparkContext.getOrCreate()

NUM_SAMPLES = 1000000  # start smaller, e.g., 1M

def inside(p):
    # p is the sample index and is intentionally unused: each call
    # draws a fresh random point in the unit square
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Count the points that land inside the quarter circle
count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
pi = 4 * count / NUM_SAMPLES
print("Pi is roughly", pi)
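Before running the distributed version, the same estimator can be sanity-checked locally with plain Python. This is only a local sketch of the Monte Carlo idea, not part of the Spark exercise (the function name and seed are illustrative):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    # Draw points in the unit square; the fraction landing inside
    # the quarter circle approximates pi/4
    rng = random.Random(seed)
    inside = sum(
        1
        for _ in range(num_samples)
        if rng.random() ** 2 + rng.random() ** 2 < 1
    )
    return 4 * inside / num_samples

print(estimate_pi(100_000))
```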
Expected Outcome: Prints an approximation of Pi, e.g. Pi is roughly 3.1416.
Exercise 4: Create Your First Scala Spark Application
Objective: Build a minimal Spark application in Scala.
package com.packt.descala.scalaplayground
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.{avg, sum}
object FirstSparkApplication extends App {
  val spark: SparkSession = SparkSession
    .builder()
    .master("local[1]")
    .appName("SparkPlayground")
    .getOrCreate()

  import spark.implicits._
}
Expected Outcome: Compiles successfully and starts a Spark application named SparkPlayground.
Exercise 5: Read Parquet Data in Scala
Objective: Load Parquet files into a DataFrame.
val source_storage_location = "my location"
val df: DataFrame = spark
.read
.format("parquet")
.load(source_storage_location)
Expected Outcome: A DataFrame is created from the Parquet files at the given path.
Exercise 6: Count Records in Scala DataFrame
Objective: Trigger a Spark job and observe stages in the Spark UI.
println(df.count)
Expected Outcome: Prints the number of rows in the DataFrame. In the Spark UI, you should see multiple stages created for the count job.
Exercise 7: Explore Shuffling
Objective: Trigger a shuffle operation in Spark and observe the performance impact.
val df2 = df.repartition(50, $"Groups") // Example: repartition by column 'Groups'
println(df2.count())
Expected Outcome: Data is repartitioned by the column Groups. Spark performs a shuffle, visible in the Spark UI as an expensive operation.