Cube and Rollup for GroupBy

Many popular SQL engines, such as Apache Spark and PostgresSQL, have convenient functions for multiple groupby manipulations, namely cube and rollup. I will use cube in PySpark as an illustration.

Suppose we have the following course enrolment data across different years for different courses. Note CS is not available for year 2021.

+----+---------+---------+
|year|   course|enrolment|
+----+---------+---------+
|2021|     Math|      100|
|2021|  Physics|      200|
|2021|Chemistry|      300|
|2022|     Math|      150|
|2022|  Physics|      250|
|2022|Chemistry|      350|
|2022|       CS|      300|
|2023|     Math|      200|
|2023|  Physics|      300|
|2023|Chemistry|      310|
|2023|       CS|      400|
+----+---------+---------+

Assume we want to answer the following two questions.

  • Q1: What is the total enrolment per year, summing over all courses?
  • Q2: What is the average enrolment for each course, averaging over different years?

Obviously, we can answer these questions separately with different groupby operations, but the cube function provides a convenient shortcut. We can use the following statement to answer these questions simultaneously.

df_agg = (
    df.cube('year', 'course')
    .agg(
        F.sum('enrolment').alias('total_enrolment'),
        F.avg('enrolment').alias('average_enrolment'),
    )
    .orderBy('year', 'course')
)

cube basically does groupby+aggregate for all possible combinations of the year and course columns. The following output shows df_agg generated from above. Note the NULL values indicate the "grand total" over all possible values of non-null columns.

For example:

  • Row 1 indicates the total/average enrolment over all years and all courses.
  • Row 2 indicates the total/average for CS over its available years (only 2022 and 2023).
  • Row 3 indicates the total/average for Chemistry over its available years (2021 through 2023).
  • Row 6 indicates the total/average for all available courses in 2021 (so CS is excluded).
  • Row 10 indicates the total/average for all available courses in 2022 (CS is included).

+----+---------+---------------+-----------------+
|year|   course|total_enrolment|average_enrolment|
+----+---------+---------------+-----------------+
|NULL|     NULL|           2860|            260.0|
|NULL|       CS|            700|            350.0|
|NULL|Chemistry|            960|            320.0|
|NULL|     Math|            450|            150.0|
|NULL|  Physics|            750|            250.0|
|2021|     NULL|            600|            200.0|
|2021|Chemistry|            300|            300.0|
|2021|     Math|            100|            100.0|
|2021|  Physics|            200|            200.0|
|2022|     NULL|           1050|            262.5|
|2022|       CS|            300|            300.0|
|2022|Chemistry|            350|            350.0|
|2022|     Math|            150|            150.0|
|2022|  Physics|            250|            250.0|
|2023|     NULL|           1210|            302.5|
|2023|       CS|            400|            400.0|
|2023|Chemistry|            310|            310.0|
|2023|     Math|            200|            200.0|
|2023|  Physics|            300|            300.0|
+----+---------+---------------+-----------------+

To answer our original questions:

  • Q1 - We use the following filter to get the total enrolment per year, summing over all courses.
df_agg.filter((col('year').isNotNull()) & (col('course').isNull())).select(
    'year', 'total_enrolment'
)
+----+---------------+
|year|total_enrolment|
+----+---------------+
|2021|            600|
|2022|           1050|
|2023|           1210|
+----+---------------+
  • Q2 - We use the following filter to get the average enrolment for each course, averaging over different years
df_agg.filter((col('year').isNull()) & (col('course').isNotNull())).select(
    'course', 'average_enrolment'
)
+---------+-----------------+
|   course|average_enrolment|
+---------+-----------------+
|       CS|            350.0|
|Chemistry|            320.0|
|     Math|            150.0|
|  Physics|            250.0|
+---------+-----------------+

The rollup operation is basically a special case of cube, which is best suitable for hierarchical columns such as Province + City, or School + Classroom scenarios.

Unfortunately, these operations are currently not available in pandas (see discussion here and a solution here) or polars (see discussion here).


You'll only receive email when they publish something new.

More from Spark Tseung
All posts