Calculating averages and percentiles
Use case
We want to understand the distribution of values for a certain numeric property within a dataset. We're used to average values and intuitively understand how to calculate them. However, we also know that average values can be misleading for skewed (opens in a new tab) distributions which are common in the real world: for example, 2.5 is the average value for both (1, 2, 3, 4)
and (0, 0, 0, 10)
.
So, it's usually better to use percentiles (opens in a new tab). Parameterized by a fractional number n = 0..1
, where the n-th percentile is equal to a value that exceeds a specified ratio of values in the distribution. The median (opens in a new tab) is a special case: it's defined as the 50th percentile (n = 0.5
), and it can be casually thought of as "the middle" value. 2.5 and 0 are the medians of (1, 2, 3, 4)
and (0, 0, 0, 10)
, respectively.
Data modeling
Let's explore the data in the users
cube that contains various demographic information about users, including their age:
[
{
"users.name": "Abbott, Breanne",
"users.age": 52,
},
{
"users.name": "Abbott, Dallas",
"users.age": 43,
},
{
"users.name": "Abbott, Gia",
"users.age": 36,
},
{
"users.name": "Abbott, Tom",
"users.age": 39,
},
{
"users.name": "Abbott, Ward",
"users.age": 67,
},
];
Calculating the average age is as simple as defining a measure with the built-in avg
type.
Calculating the percentiles would require using database-specific functions. However, almost every database has them under names of PERCENTILE_CONT
and PERCENTILE_DISC
, Postgres (opens in a new tab) and Snowflake (opens in a new tab) included. For BigQuery (opens in a new tab), you'd need to use the APPROX_QUANTILES
function.
cubes:
- name: users
# ...
measures:
- name: avg_age
type: avg
sql: age
- name: median_age
type: number
sql: PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY age)
- name: p95_age
type: number
sql: PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY age)
Result
Using the measures defined above, we can explore statistics about the age of our users.
[
{
"users.avg_age": "52.3100000000000000",
"users.median_age": 53,
"users.p95_age": 82
}
]
For this particular dataset, the average age closely matches the median age, and 95% of all users are younger than 82 years.
Source code
Please feel free to check out the full source code (opens in a new tab) or run it with the docker-compose up
command. You'll see the result, including queried data, in the console.