E02 – PySpark Fast Start + Scala App Basics

Exercise 1: Create a CSV File

Objective: Prepare a sample dataset to use in PySpark.

Save the following as movies.csv:
movie,year
The Godfather,1972
The Shawshank Redemption,1994
Pulp Fiction,1994
Expected Outcome: A file movies.csv is created with 3 rows of movie data.
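If you prefer to create the file programmatically rather than in an editor, a minimal sketch in plain Python (no Spark needed) could look like this:

```python
# Write the sample dataset to movies.csv (plain Python, no Spark required)
rows = [
    ("movie", "year"),
    ("The Godfather", "1972"),
    ("The Shawshank Redemption", "1994"),
    ("Pulp Fiction", "1994"),
]

with open("movies.csv", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(",".join(row) + "\n")
```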

Exercise 2: Read CSV into PySpark DataFrame

Objective: Load and display the CSV file in PySpark.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("movies.csv", header=True, sep=',')
df.show()
Expected Outcome: A DataFrame with 2 columns (movie, year) and 3 rows prints to the console.

Exercise 3: Estimate Pi with PySpark

Objective: Use distributed computation in PySpark to approximate Pi.

from pyspark import SparkContext
import random

# Create SparkContext
sc = SparkContext.getOrCreate()

NUM_SAMPLES = 1000000  # start smaller, e.g., 1M

def inside(p):
    # p is just the sample index; draw a random point in the unit square
    # and test whether it falls inside the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
pi = 4 * count / NUM_SAMPLES
print("Pi is roughly", pi)
Expected Outcome: Prints an approximation of Pi, e.g. Pi is roughly 3.1416.
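The Monte Carlo logic can also be sanity-checked without a cluster; a quick local version with a fixed seed (plain Python, no Spark):

```python
import random

random.seed(42)  # fixed seed so the run is repeatable
N = 100_000

# Count points in the unit square that land inside the quarter circle
inside = sum(
    1 for _ in range(N)
    if random.random() ** 2 + random.random() ** 2 < 1
)
pi = 4 * inside / N
print("Local estimate:", pi)  # roughly 3.14
```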

Exercise 4: Create Your First Scala Spark Application

Objective: Build a minimal Spark application in Scala.

package com.packt.descala.scalaplayground

import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.{avg, sum}

object FirstSparkApplication extends App {
  val spark: SparkSession = SparkSession
    .builder()
    .master("local[1]")
    .appName("SparkPlayground")
    .getOrCreate()

  import spark.implicits._
}
Expected Outcome: Compiles successfully and starts a Spark application named SparkPlayground.

Exercise 5: Read Parquet Data in Scala

Objective: Load Parquet files into a DataFrame.

val source_storage_location = "my location"  // replace with the path to your Parquet data

val df: DataFrame = spark
  .read
  .format("parquet")
  .load(source_storage_location)
Expected Outcome: A DataFrame is created from the Parquet files at the given path.

Exercise 6: Count Records in Scala DataFrame

Objective: Trigger a Spark job and observe stages in the Spark UI.

println(df.count())
Expected Outcome: Prints the number of rows in the DataFrame. In the Spark UI, you should see multiple stages created for the count job.

Exercise 7: Explore Shuffling

Objective: Trigger a shuffle operation in Spark and observe the performance impact.

val df2 = df.repartition(50, $"Groups")  // Example: repartition by column 'Groups'
println(df2.count())
Expected Outcome: Data is repartitioned by the column Groups. Spark performs a shuffle, visible in the Spark UI as an expensive operation.
