Cube and Rollup for GroupBy
February 8, 2025
Many popular SQL engines, such as Apache Spark and PostgreSQL, have convenient functions for multiple groupby manipulations, namely cube and rollup. I will use cube in PySpark as an illustration.
Suppose we have the following course enrolment data for different courses across several years. Note that CS is not available for year 2021.
+----+---------+---------+
|year| course|enrolment|
+----+---------+---------+
|2021| Math| 100|
|2021| Physics| 200|
|2021|Chemistry| 300|
|2022| Math| 150|
|2022| Physics| 250|
|2022|Chemistry| 350|
|2022| CS| 300|
|2023| Math| 200|
|2023| Physics| 300|
|2023|Chemistry| 310|
|2023| CS| 400|
+----+---------+---------+
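For reference, here is one way this example DataFrame could be constructed (a minimal sketch; the variable names spark and df are assumptions used in the snippets throughout this post).
from pyspark.sql import SparkSession

# Build the example enrolment DataFrame shown above.
spark = SparkSession.builder.getOrCreate()
data = [
    (2021, 'Math', 100), (2021, 'Physics', 200), (2021, 'Chemistry', 300),
    (2022, 'Math', 150), (2022, 'Physics', 250), (2022, 'Chemistry', 350),
    (2022, 'CS', 300),
    (2023, 'Math', 200), (2023, 'Physics', 300), (2023, 'Chemistry', 310),
    (2023, 'CS', 400),
]
df = spark.createDataFrame(data, schema=['year', 'course', 'enrolment'])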
Assume we want to answer the following two questions.
- Q1: What is the total enrolment per year, summing over all courses?
- Q2: What is the average enrolment for each course, averaging over different years?
Obviously, we can answer these questions separately with two different groupby operations (a comparison sketch is shown after the answers below), but the cube function provides a convenient shortcut. We can use the following statement to answer both questions simultaneously.
from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Aggregate over every combination of year and course, including subtotals.
df_agg = (
    df.cube('year', 'course')
    .agg(
        F.sum('enrolment').alias('total_enrolment'),
        F.avg('enrolment').alias('average_enrolment'),
    )
    .orderBy('year', 'course')
)
cube essentially performs a groupby + aggregate for every possible combination of the year and course columns. The following output shows the df_agg generated above. Note that a NULL in a column indicates the "grand total" over all possible values of that column; such a row is grouped only by the columns that remain non-null. (A note after the output shows how to distinguish these marker NULLs from genuine NULLs in the data.)
For example:
- Row 1 indicates the total/average enrolment over all years and all courses.
- Row 2 indicates the total/average for CS over its available years (only 2022 and 2023).
- Row 3 indicates the total/average for Chemistry over its available years (2021 through 2023).
- Row 6 indicates the total/average for all available courses in 2021 (so CS is excluded).
- Row 10 indicates the total/average for all available courses in 2022 (CS is included).
+----+---------+---------------+-----------------+
|year| course|total_enrolment|average_enrolment|
+----+---------+---------------+-----------------+
|NULL| NULL| 2860| 260.0|
|NULL| CS| 700| 350.0|
|NULL|Chemistry| 960| 320.0|
|NULL| Math| 450| 150.0|
|NULL| Physics| 750| 250.0|
|2021| NULL| 600| 200.0|
|2021|Chemistry| 300| 300.0|
|2021| Math| 100| 100.0|
|2021| Physics| 200| 200.0|
|2022| NULL| 1050| 262.5|
|2022| CS| 300| 300.0|
|2022|Chemistry| 350| 350.0|
|2022| Math| 150| 150.0|
|2022| Physics| 250| 250.0|
|2023| NULL| 1210| 302.5|
|2023| CS| 400| 400.0|
|2023|Chemistry| 310| 310.0|
|2023| Math| 200| 200.0|
|2023| Physics| 300| 300.0|
+----+---------+---------------+-----------------+
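As an aside, if the raw data itself could contain NULL values in the grouping columns, these marker NULLs would be ambiguous. Spark's grouping (and grouping_id) functions flag subtotal rows explicitly; a minimal sketch, reusing the same df (the alias names are illustrative):
# grouping('course') is 1 when the row is a subtotal over all courses,
# and 0 when the row carries an actual course value, so subtotal rows
# can be identified even if the data contains genuine NULLs.
df_flagged = (
    df.cube('year', 'course')
    .agg(
        F.sum('enrolment').alias('total_enrolment'),
        F.grouping('year').alias('year_is_subtotal'),
        F.grouping('course').alias('course_is_subtotal'),
    )
)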
To answer our original questions:
- Q1 - We use the following filter to get the total enrolment per year, summing over all courses.
df_agg.filter((col('year').isNotNull()) & (col('course').isNull())).select(
    'year', 'total_enrolment'
)
+----+---------------+
|year|total_enrolment|
+----+---------------+
|2021| 600|
|2022| 1050|
|2023| 1210|
+----+---------------+
- Q2 - We use the following filter to get the average enrolment for each course, averaging over different years.
df_agg.filter((col('year').isNull()) & (col('course').isNotNull())).select(
    'course', 'average_enrolment'
)
+---------+-----------------+
| course|average_enrolment|
+---------+-----------------+
| CS| 350.0|
|Chemistry| 320.0|
| Math| 150.0|
| Physics| 250.0|
+---------+-----------------+
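For comparison, here is what the two separate groupby operations mentioned earlier would look like (a sketch that produces the same results as the two filters above):
# Q1 without cube: total enrolment per year.
q1 = df.groupBy('year').agg(F.sum('enrolment').alias('total_enrolment'))

# Q2 without cube: average enrolment per course.
q2 = df.groupBy('course').agg(F.avg('enrolment').alias('average_enrolment'))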
The rollup operation is essentially a special case of cube, best suited for hierarchical columns such as Province + City or School + Classroom.
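To illustrate the difference on the same data (a sketch, even though year and course are not truly hierarchical): rollup('year', 'course') produces the per-(year, course) rows, the per-year subtotals, and the grand total, but unlike cube it does not produce the per-course totals across all years.
# rollup keeps only the "hierarchical" grouping sets:
# (year, course), (year), and the grand total ().
df_rollup = (
    df.rollup('year', 'course')
    .agg(F.sum('enrolment').alias('total_enrolment'))
    .orderBy('year', 'course')
)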
Unfortunately, these operations are currently not available in pandas (see discussion here and a solution here) or polars (see discussion here).
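A common workaround in pandas is to concatenate the individual groupby results yourself; a minimal sketch, assuming pdf is a pandas DataFrame holding the same three columns:
import pandas as pd

# Emulate cube by stacking the four groupby results; missing key columns
# become NaN, playing the role of Spark's NULL subtotal markers.
parts = [
    pdf.groupby(['year', 'course'], as_index=False)['enrolment'].sum(),
    pdf.groupby(['year'], as_index=False)['enrolment'].sum(),
    pdf.groupby(['course'], as_index=False)['enrolment'].sum(),
    pdf[['enrolment']].sum().to_frame().T,  # grand total
]
cube_like = pd.concat(parts, ignore_index=True)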