DB

How to rank data considering duplication in BigQuery

Share this for your friends.

How to rank data considering duplication in BigQuery

How can we rank data in BigQuery?


BigQuery(BQ) is very useful for data analysis or processing.

It is good at handling huge data. It returns summary result in short time.


Through data handling, sometimes we would like to do numbering.

For example, extract from first to third record in each category.

In order to get ranking in whole data, we should use order by clause.

But how can we get rank number ?

And also if 4 records have same score, it is difficult to choose 3 records.

How can we number records without duplication ?

So today I introduce about "How to rank data considering duplication in BigQuery".

Author


Mid-carieer engineer (AI, system). Good at Python and SQL.

Advantage to read

You can understand "How to rank data considering duplication in BigQuery". Then you don't have to concern about numbering.


How to rank in BigQuery

There are some methods to rank in BigQuery.

Numbering in BQ

  • RANK
  • DENSE_RANK
  • ROW_NUMBER


They are different functions.

So introduce each function.



Data

Anyway, we should prepare data.

Import this CSV file as rank_sample table.

Data

id,val
001,100
002,120
003,130
004,100
005,110
006,160
007,170
008,100
009,120
010,200


With using this table, check each numbering functions.



RANK

As its name, RANK is ranking function.

We can use it like RANK() OVER(ORDER BY <column name that you want to rank>).

SQL

SELECT
*,
RANK() OVER(ORDER BY val) as rank
FROM test.rank_sample
order by id


Result

Row id val rank
1 1 100 1
2 2 120 5
3 3 130 7
4 4 100 1
5 5 110 4
6 6 160 8
7 7 170 9
8 8 100 1
9 9 120 5
10 10 200 10


According to result, we can see 1st to 10th in rank column.

The val=100 records are all 1st, and next val=110 record is 4th.

It means that rank function considers same score as same rank and skip next number.




DENSE_RANK

DENSE_RANK is numbering function that doesn't skip duplicated rank.

We can use it like DENSE_RANK() OVER(ORDER BY <column name>).

SQL

SELECT
*,
DENSE_RANK() OVER(ORDER BY val) as dense_rank
FROM test.rank_sample
order by id


Result

Row id val dense_rank
1 1 100 1
2 2 120 3
3 3 130 4
4 4 100 1
5 5 110 2
6 6 160 5
7 7 170 6
8 8 100 1
9 9 120 3
10 10 200 7


We can see that rank is from ast to 7th.

The difference from RANK() is that val=110 record is not 4th but 2nd.

So DENSE_RANK does not skip duplicated rank.




ROW_NUMBER

ROW_NUMBER is not for ranking, but for setting row number.

But we can use it like ROW_NUMBER() OVER(ORDER BY <column name>) as same as RANK() function.

SQL

SELECT
*,
ROW_NUMBER() OVER(ORDER BY val) as dense_rank
FROM test.rank_sample
order by id


Result

Row id val row_number
1 1 100 1
2 2 120 5
3 3 130 7
4 4 100 2
5 5 110 4
6 6 160 8
7 7 170 9
8 8 100 3
9 9 120 6
10 10 200 10


According to the result, we can see from 1st to 10th.

Unique point is that there are 2nd and 3rd records in val=100 data.

ROW_NUMBER just puts row number. So they have different number even in same score records.




Comparison

We classify RANK, DENSE_RANK, ROW_NUMBER like below.

Function Description
RANK Same rank in same score. If there are three 1st records,next is 4th.
DENSE_RANK Same rank in same score. If there are three 1st records,next is 2nd.
ROW_NUMBER Even in same score, it numbers different number.


SQL

SELECT
*,
RANK() OVER(ORDER BY val) as rank,
DENSE_RANK() OVER(ORDER BY val) as dense_rank,
ROW_NUMBER() OVER(ORDER BY val) as row_number
FROM test.rank_sample
order by id


Result

Row id val rank dense_rank row_number
1 1 100 1 1 1
2 2 120 5 3 5
3 3 130 7 4 7
4 4 100 1 1 2
5 5 110 4 2 4
6 6 160 8 5 8
7 7 170 9 6 9
8 8 100 1 1 3
9 9 120 5 3 6
10 10 200 10 7 10

Conclusion

Today I described about "How to rank data considering duplication in BigQuery".

There are some ranking functions below.

Numbering function in BQ

  • RANK
  • DENSE_RANK
  • ROW_NUMBER

  • Each feature is following.

    Function Description
    RANK Same rank in same score. If there are three 1st records,next is 4th.
    DENSE_RANK Same rank in same score. If there are three 1st records,next is 2nd.
    ROW_NUMBER Even in same score, it numbers different number.


    It is impressive to use ROW_NUMBER for avoiding duplication.



    There are some other articles about BigQuey.

    If you interested in them, please read them.


    Share this for your friends.

    If you felt this article is useful, please share.

    にほんブログ村 IT技術ブログへ

    -DB
    -

    © 2023 ITips