E03 – Spark for BI: DAU, Avg Cart Size, Top 10 Items

🚀 Hands-On Exercises

Exercise 1: Create Sample Activity Data

Objective: Set up a Spark DataFrame simulating ecommerce activity.

import spark.implicits._
import org.apache.spark.sql.functions._

case class Activity(userId: String, cartId: String, itemId: String)

val activities = Seq(
  Activity("u1", "c1", "i1"),
  Activity("u1", "c1", "i2"),
  Activity("u2", "c2", "i1"),
  Activity("u3", "c3", "i3"),
  Activity("u4", "c4", "i3")
)

val df = activities.toDF("user_id", "cart_id", "item_id")

df.show()
Expected Outcome:
+-------+-------+-------+
|user_id|cart_id|item_id|
+-------+-------+-------+
|     u1|     c1|     i1|
|     u1|     c1|     i2|
|     u2|     c2|     i1|
|     u3|     c3|     i3|
|     u4|     c4|     i3|
+-------+-------+-------+
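Since Activity is a case class, the same data can also be built as a typed Dataset first and renamed afterwards. A minor variation on the code above, assuming the `activities` Seq and the `spark.implicits._` import from this exercise:

```scala
// Alternative: build a typed Dataset[Activity] first (requires spark.implicits._),
// then rename the columns to snake_case for BI-friendly output.
val ds = activities.toDS()
val df2 = ds.toDF("user_id", "cart_id", "item_id")

df2.printSchema() // same schema and rows as df
```

The typed Dataset keeps compile-time field checking until the rename; for the column-based aggregations in the next exercises the untyped DataFrame is sufficient.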

Exercise 2: Daily Active Users (DAU)

Objective: Count the number of unique users in the activity data.

df.select("user_id").distinct().count()
Expected Outcome: 4 (there are 4 unique users in the sample data).
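On large activity tables an exact distinct count forces a full shuffle of user IDs; a common BI trade-off is approx_count_distinct, which uses a HyperLogLog++ sketch. A sketch against the same df (on this tiny sample the approximate count matches the exact one):

```scala
import org.apache.spark.sql.functions.approx_count_distinct

// Approximate DAU: accepts a relative error bound (rsd, default 0.05)
// in exchange for a much cheaper aggregation on large data.
df.agg(approx_count_distinct("user_id").as("approx_dau")).show()
```

A second argument tunes the accuracy, e.g. `approx_count_distinct("user_id", 0.01)` for a tighter (and more memory-hungry) estimate.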

Exercise 3: Average Cart Size

Objective: Calculate the average number of items per cart.

val avg_cart_items = df
  .groupBy("cart_id")
  .agg(count("item_id").as("total"))
  .agg(avg("total").as("avg_cart_items"))

avg_cart_items.show()
Expected Outcome:
+--------------+
|avg_cart_items|
+--------------+
|          1.25|
+--------------+
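The same figure can be computed without the two-step aggregation: the average cart size is simply total item rows divided by distinct carts. A single-pass sketch, assuming the df from Exercise 1:

```scala
import org.apache.spark.sql.functions.{count, countDistinct}

// avg items per cart = total item rows / number of distinct carts
// (5 rows / 4 carts = 1.25 on the sample data)
df.agg((count("item_id") / countDistinct("cart_id")).as("avg_cart_items"))
  .show()
```

The grouped version is still useful when you also want the per-cart totals; the single aggregation is cheaper when only the average is needed.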

Exercise 4: Inspect Execution Plan

Objective: View the query plan for average cart size computation.

avg_cart_items.explain()
Expected Outcome: Spark prints the physical plan, showing the hash aggregations and the exchange (shuffle) step between them.
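On Spark 3.x, explain also accepts a mode argument that controls how much of the plan is printed. A sketch, assuming the avg_cart_items DataFrame from Exercise 3:

```scala
// "extended" prints the parsed, analyzed, optimized logical plans plus the physical plan;
// "formatted" (Spark 3.0+) prints a numbered, easier-to-read physical plan layout.
avg_cart_items.explain("extended")
avg_cart_items.explain("formatted")
```

The "formatted" output is usually the easiest place to spot the Exchange (shuffle) nodes introduced by the groupBy.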

Exercise 5: Top 10 Most Added Items

Objective: Identify the most frequently added items across all carts.

df.groupBy("item_id")
  .agg(count(col("item_id")) as "total")
  .sort(desc("total"))
  .limit(10)
  .show()
Expected Outcome (rows with tied counts, here i1 and i3, may appear in either order):
+-------+-----+
|item_id|total|
+-------+-----+
|     i3|    2|
|     i1|    2|
|     i2|    1|
+-------+-----+
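The same top-10 query can be written in Spark SQL; adding item_id as a secondary sort key makes the tie order deterministic. A sketch, assuming the df from Exercise 1 (the view registration here mirrors what Exercise 6 does):

```scala
// Register df as a temp view so it can be queried with SQL.
df.createOrReplaceTempView("activities")

spark.sql("""
  select item_id, count(*) as total
  from activities
  group by item_id
  order by total desc, item_id  -- secondary key makes tie order deterministic
  limit 10
""").show()
```

With the secondary key, i1 always sorts before i3 on the sample data.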

Exercise 6: Daily Active Users with Spark SQL

Objective: Use Spark SQL to count distinct users.

df.createOrReplaceTempView("activities")

spark.sql("""
  select count(distinct user_id) as unique_users
  from activities
""").show()
Expected Outcome:
+------------+
|unique_users|
+------------+
|           4|
+------------+
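The sample schema has no date column, so "daily" here means the whole dataset. With a date column the DAU query would group by day; the following sketch uses a hypothetical dated sample (user_id/event_date pairs and the dates themselves are illustrative, not part of the original schema):

```scala
// Hypothetical dated sample: event_date is NOT in the original Activity schema.
val dated = Seq(
  ("u1", "2024-01-01"), ("u2", "2024-01-01"),
  ("u1", "2024-01-02"), ("u3", "2024-01-02"), ("u4", "2024-01-02")
).toDF("user_id", "event_date")

dated.createOrReplaceTempView("dated_activities")

// DAU per day: a grouped distinct count instead of a global one.
spark.sql("""
  select event_date, count(distinct user_id) as dau
  from dated_activities
  group by event_date
  order by event_date
""").show()
```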
