How to rank data considering duplication in BigQuery

2024-02-23

How can we rank data in BigQuery ?

BigQuery(BQ) is very useful for data analysis or processing.
It is good at handling huge data. It returns summary result in short time.

When we handle data, sometimes we would like to do numbering.
For example, extract from first to third record in each category.
In order to get ranking in whole data, we should use order by clause.

But how can we get ranking number ?

And also if 4 records have same score, it is difficult to choose 3 records.

How can we number records without duplication ?

So today I will introduce about "How to rank data considering duplication in BigQuery".

How to rank in BigQuery

There are some methods to rank records in BigQuery.

RANK
DENSE_RANK
ROW_NUMBER

They are different functions.

So introduce each function.

Prepare Data

Anyway, we should prepare data.

Import this CSV file as rank_sample table.

With using this table, check each numbering functions.

RANK

As its name, RANK is ranking function.

We can use it like RANK() OVER(ORDER BY <column name that you want to rank>).

SQL

SELECT
*,
RANK() OVER(ORDER BY val) as rank
FROM test.rank_sample
order by id

Result

Row	id	val	rank
1	1	100	1
2	2	120	5
3	3	130	7
4	4	100	1
5	5	110	4
6	6	160	8
7	7	170	9
8	8	100	1
9	9	120	5
10	10	200	10

Rank result

According to result, we can see 1st to 10th in rank column.

The val=100 records are all 1st, and next val=110 record is 4th.

It means that rank function considers same score as same rank and skip next number.

Numbering functions | BigQuery

DENSE_RANK

DENSE_RANK is numbering function that doesn't skip duplicated rank.

We can use it like DENSE_RANK() OVER(ORDER BY <column name>).

SQL

SELECT
*,
DENSE_RANK() OVER(ORDER BY val) as dense_rank
FROM test.rank_sample
order by id

Result

Row	id	val	dense_rank
1	1	100	1
2	2	120	3
3	3	130	4
4	4	100	1
5	5	110	2
6	6	160	5
7	7	170	6
8	8	100	1
9	9	120	3
10	10	200	7

DENSE_RANK result

We can see that rank is from 1st to 7th.

The difference from RANK() is that val=110 record is not 4th but 2nd.

So DENSE_RANK does not skip duplicated rank.

Numbering functions | DENSE_RANK

ROW_NUMBER

ROW_NUMBER is not for ranking, but for setting row number.

But we can use it like ROW_NUMBER() OVER(ORDER BY <column name>) as same as RANK() function.

SQL

SELECT
*,
ROW_NUMBER() OVER(ORDER BY val) as dense_rank
FROM test.rank_sample
order by id

Result

Row	id	val	row_number
1	1	100	1
2	2	120	5
3	3	130	7
4	4	100	2
5	5	110	4
6	6	160	8
7	7	170	9
8	8	100	3
9	9	120	6
10	10	200	10

ROW_NUMBER result

According to the result, we can see from 1st to 10th.

Unique point is that there are 2nd and 3rd records in val=100 data.

ROW_NUMBER just puts row number. So they have different number even in same score records.

Numbering functions | ROW_NUMBER

Comparison

We classify RANK, DENSE_RANK, ROW_NUMBER like below.

Function	Description
RANK	Same rank in same score. If there are three 1st records,next is 4th.
DENSE_RANK	Same rank in same score. If there are three 1st records,next is 2nd.
ROW_NUMBER	Even in same score, it numbers different number.

SQL

SELECT
*,
RANK() OVER(ORDER BY val) as rank,
DENSE_RANK() OVER(ORDER BY val) as dense_rank,
ROW_NUMBER() OVER(ORDER BY val) as row_number
FROM test.rank_sample
order by id

Result

Row	id	val	rank	dense_rank	row_number
1	1	100	1	1	1
2	2	120	5	3	5
3	3	130	7	4	7
4	4	100	1	1	2
5	5	110	4	2	4
6	6	160	8	5	8
7	7	170	9	6	9
8	8	100	1	1	3
9	9	120	5	3	6
10	10	200	10	7	10

Conclusion

Today I described about "How to rank data considering duplication in BigQuery".

There are some ranking functions below.

RANK
DENSE_RANK
ROW_NUMBER

Each feature is following.

Function	Description
RANK	Same rank in same score. If there are three 1st records,next is 4th.
DENSE_RANK	Same rank in same score. If there are three 1st records,next is 2nd.
ROW_NUMBER	Even in same score, it numbers different number.