How to get the average, median, and mode in BigQuery

How can we get the average, median, and mode in BigQuery?

BigQuery (BQ) is very useful for data analysis and processing.

It handles huge datasets well and returns summary results in a short time.

When we analyze data, we often want the average, median, and mode.

How can we get these values in BigQuery?

So today I will explain how to get the average, median, and mode in BigQuery.


Prepare data

First, we need to prepare some data.

Load this CSV file into a table named avg_median_sample (the queries below assume it is in a dataset named test).

text_col,num_col1,num_col2,num_col3
a,10,1,100
a,10,2,100
a,10,3,100
a,10,4,100
a,10,5,100
a,10,6,100
a,10,7,100
a,10,8,100
a,10,9,100
a,10,10,100
b,12,11,200
b,12,12,200
b,12,13,200
b,12,14,200
c,15,15,200
c,15,16,200
c,15,17,300
c,15,18,300
c,15,19,10000

Then the table looks like this.

Row  text_col  num_col1  num_col2  num_col3
1    a         10        1         100
2    a         10        2         100
3    a         10        3         100
4    a         10        4         100
5    a         10        5         100
6    a         10        6         100
7    a         10        7         100
8    a         10        8         100
9    a         10        9         100
10   a         10        10        100
11   b         12        11        200
12   b         12        12        200
13   b         12        13        200
14   b         12        14        200
15   c         15        15        200
16   c         15        16        200
17   c         15        17        300
18   c         15        18        300
19   c         15        19        10000

How to get average value

Once the data is prepared, let's get the average.

To get the average, use the AVG() function.

SELECT
  text_col,
  AVG(num_col1) AS avg_1,
  AVG(num_col2) AS avg_2,
  AVG(num_col3) AS avg_3
FROM test.avg_median_sample
GROUP BY text_col

The result looks like this.

Row  text_col  avg_1  avg_2  avg_3
1    a         10     5.5    100
2    b         12     12.5   200
3    c         15     17     2200
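If you want to double-check these numbers outside BigQuery, a small local sketch like the one below may help. It is not BigQuery code: it rebuilds the sample rows in Python and groups them by hand, and the names ROWS and group_avg are my own for illustration.

```python
from statistics import mean

# Sample rows from the CSV above: (text_col, num_col1, num_col2, num_col3).
ROWS = (
    [("a", 10, n, 100) for n in range(1, 11)]
    + [("b", 12, n, 200) for n in range(11, 15)]
    + [("c", 15, 15, 200), ("c", 15, 16, 200), ("c", 15, 17, 300),
       ("c", 15, 18, 300), ("c", 15, 19, 10000)]
)


def group_avg(rows):
    """Return {text_col: (avg_1, avg_2, avg_3)}, similar to AVG() with GROUP BY."""
    groups = {}
    for text_col, *nums in rows:
        groups.setdefault(text_col, []).append(nums)
    # zip(*vals) turns the list of rows into one tuple per numeric column.
    return {
        key: tuple(mean(col) for col in zip(*vals))
        for key, vals in groups.items()
    }


averages = group_avg(ROWS)
for key, avgs in sorted(averages.items()):
    print(key, *avgs)
```

Note how the single 10000 in group c pulls avg_3 up to 2200, although four of the five values are 200 or 300.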

Expressions, functions, and operators in Standard SQL  |  BigQuery

How to get median value

Next, let's get the median.

The median is not computed from all the values the way the average is.

It is the middle value of the sorted list. When the number of values is even, PERCENTILE_CONT() interpolates between the two middle values.

What is the difference between the average and the median?

The average is the center of gravity of the data. If all values are on a similar scale, it is a good summary.

But if the list contains a few values on a much larger scale, the average is pulled toward them. In that case, it is not a good summary.

Hmm, any example?

For example, suppose there are 100 people and Optimus Prime. What does the average weight tell us?

Optimus Prime is far too heavy, so the average describes neither the people nor him. The median is still the weight of a typical person.
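To put numbers on this example, here is a minimal Python sketch. The weights are made up for illustration (70 kg per person and 4,000 kg for the robot are assumptions, not real figures).

```python
from statistics import mean, median

# Hypothetical weights in kg: 100 people at 70 kg each, plus one
# very heavy robot at 4,000 kg (a made-up figure).
weights = [70] * 100 + [4000]

avg_weight = mean(weights)    # pulled upward by the single outlier
mid_weight = median(weights)  # middle of the sorted list, unaffected

print("average:", round(avg_weight, 1))
print("median:", mid_weight)
```

With these assumed numbers the average lands near 109 kg, which matches nobody in the list, while the median stays at 70 kg.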

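We can get the median with the PERCENTILE_CONT() function.

In BigQuery it is an analytic (window) function, so it needs an OVER clause and returns a value on every row. SELECT DISTINCT collapses the result to one row per group.

Use it like this.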

SELECT DISTINCT
  text_col,
  PERCENTILE_CONT(num_col1, 0.5) OVER (PARTITION BY text_col) AS median_1,
  PERCENTILE_CONT(num_col2, 0.5) OVER (PARTITION BY text_col) AS median_2,
  PERCENTILE_CONT(num_col3, 0.5) OVER (PARTITION BY text_col) AS median_3
FROM test.avg_median_sample
ORDER BY text_col

Then we get this result.

Row  text_col  median_1  median_2  median_3
1    a         10        5.5       100
2    b         12        12.5      200
3    c         15        17        300

If you change the second parameter of PERCENTILE_CONT() from 0.5 to 0.1, you get the value at the 10% position (the 10th percentile). With 0.75, you get the 75th percentile.
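To see what "position" means here, the sketch below reimplements the idea in plain Python. It assumes the usual linear-interpolation definition, where the position in the sorted list is fraction × (n − 1). The function name percentile_cont is my own, and this is an illustration rather than BigQuery's actual implementation.

```python
import math


def percentile_cont(values, fraction):
    """Percentile with linear interpolation between neighbouring sorted values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    # Fractional index into the sorted list (assumed definition).
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# num_col2 for group a and num_col3 for group c, from the sample data.
num_col2_a = list(range(1, 11))
num_col3_c = [200, 200, 300, 300, 10000]

print(percentile_cont(num_col2_a, 0.5))   # median: halfway between 5 and 6
print(percentile_cont(num_col2_a, 0.1))   # about 1.9
print(percentile_cont(num_col2_a, 0.75))  # 7.75
print(percentile_cont(num_col3_c, 0.5))   # middle value of five: 300
```

With 0.5, the sketch gives the same medians as the result table above for these two columns.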

Expressions, functions, and operators in Standard SQL  |  BigQuery

How to get mode value

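Then let's get the mode.

The mode is the value that appears most frequently in a data set.

Frequency is what matters, and values must match exactly to be counted together. 100 and 100.1 are different values.

So the mode is not a good fit for continuous (real-number) data, where exact repeats are rare.

It works well for checking the frequency of categories or groups.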
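The sketch below shows the problem with real numbers in plain Python. The readings are made up for illustration, and rounding to the nearest integer is only one possible way to group values; choose a bucket size that suits your data.

```python
from collections import Counter

# Hypothetical measurements; every raw value is distinct.
readings = [100.0, 100.1, 99.9, 100.2, 99.8, 100.05]

# Mode of the raw values: each one appears once, so the result says little.
raw_mode = Counter(readings).most_common(1)[0]

# Group the values first (here: round to the nearest integer), then count.
bucket_mode = Counter(round(r) for r in readings).most_common(1)[0]

print("raw mode (value, count):", raw_mode)
print("bucketed mode (value, count):", bucket_mode)
```

On the raw values the top count is only 1, but after rounding all six readings fall into the same bucket.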

To get the mode, you can use APPROX_TOP_COUNT(). As the name says, it returns approximate results.

SELECT
  text_col,
  APPROX_TOP_COUNT(num_col1, 1) AS top1,
  APPROX_TOP_COUNT(num_col2, 1) AS top2,
  APPROX_TOP_COUNT(num_col3, 1) AS top3
FROM test.avg_median_sample
GROUP BY text_col
ORDER BY text_col

The result is below.

Row  text_col  top1.value  top1.count  top2.value  top2.count  top3.value  top3.count
1    a         10          10          1           1           100         10
2    b         12          4           11          1           200         4
3    c         15          5           15          1           200         2

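As you can see, .value is the most frequent value and .count is the number of times it appears. When several values tie, as in num_col2 where every value appears once, only one of them is returned.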
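For a local cross-check of the mode, the following Python sketch counts exact occurrences per group with collections.Counter. It is an exact count, not the approximate algorithm BigQuery uses, and the names ROWS and group_mode are my own for illustration.

```python
from collections import Counter

# Sample rows from the CSV above: (text_col, num_col1, num_col2, num_col3).
ROWS = (
    [("a", 10, n, 100) for n in range(1, 11)]
    + [("b", 12, n, 200) for n in range(11, 15)]
    + [("c", 15, 15, 200), ("c", 15, 16, 200), ("c", 15, 17, 300),
       ("c", 15, 18, 300), ("c", 15, 19, 10000)]
)


def group_mode(rows):
    """Return {text_col: ((value, count), ...)} with one pair per numeric column."""
    groups = {}
    for text_col, *nums in rows:
        groups.setdefault(text_col, []).append(nums)
    # most_common(1) yields the top (value, count) pair; on ties,
    # Counter keeps the value that was seen first.
    return {
        key: tuple(Counter(col).most_common(1)[0] for col in zip(*vals))
        for key, vals in groups.items()
    }


modes = group_mode(ROWS)
for key, pairs in sorted(modes.items()):
    print(key, pairs)
```

In group c, 200 and 300 both appear twice in num_col3. Counter returns the one it saw first, but I would not rely on which of the tied values BigQuery returns.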

Expressions, functions, and operators in Standard SQL  |  BigQuery

Conclusion

Today I explained how to get the average, median, and mode in BigQuery.

To get these values, we can use the functions below.

  • Average: AVG()
  • Median: PERCENTILE_CONT()
  • Mode: APPROX_TOP_COUNT()

AVG() is easy to use, but the others are a little more complicated.

You will soon get used to them.


Author

karasan, System engineer

Mid-career engineer (AI, Data science, Salesforce, etc.).
Good at Python and SQL.
