How to calculate cosine similarity of array in BigQuery

How to calculate cosine similarity of array in BigQuery

How can we calculate cosine similarity in BigQuery ?

BigQuery(BQ) is very useful for data analysis or processing.
It is good for handling huge data.
It returns summary result in short time.

BQ has one useful data format "array".

When we consider array as vector, we may want cosine similarity of vectors.

So how can we get cosine similarity ?

So today I introduce about "How to calculate cosine similarity of array in BigQuery".

目次

What is cosine similarity

Cosine similarity is a measure of similarity between two vectors.

Cosine cos is one of the trigonometric functions.

It takes 1 in case of 0°, 0 in case of 90° and -1 in case of 180°.

So it means that we can get an angle if we know value of cosine.

And if an angle between 2 vectors is close to zero, it means 2 vectors are similar.

So the cosine is used as similarity.

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space.

Cosine similarity – Wikipedia

The formula of cosine similarity is below.

cosine similarity formula

I remember that I used to learn it.

Prepare data

Before calculate it, we should prepare data.

In order to make array data, we can use the table which we create in previous topic.

You can add CREATE TABLE to SQL, and create table.

CREATE TABLE test.array_sample2
AS
SELECT
key,
ARRAY_AGG(val) as val_array
FROM test.array_sample
GROUP BY key

Result table

Rowkeyval_array
1a1
2
3
2b2
4
5
3c3
2
-1

We use this table to calculate cosine similarity.

How to calculate cosine similarity of array in BigQuery

In order to calculate cosine similarity of array in BigQuery, we should calculate product between elements of vectors.

SQL is below.

SELECT
  t1.key AS key1,
  t2.key AS key2,
  (
  SELECT
    SUM(value1 * value2)/ SQRT(SUM(value1 * value1))/ SQRT(SUM(value2 * value2))
  FROM
    UNNEST(t1.val_array) AS value1
  WITH
  OFFSET
    pos1
  JOIN
    UNNEST(t2.val_array) AS value2
  WITH
  OFFSET
    pos2
  ON
    pos1 = pos2
  ) AS cosine_similarity
FROM
  test.array_sample2 AS t1,
  test.array_sample2 AS t2
ORDER BY
  key1,
  key2,
  cosine_similarity

It decomposed array to each element bt UNNEST(array), and added order number by WITH OFFSET pos.

Then use pos as join key for multiplication of each element.

Result is below.

Result

Rowkey1key2cosine_similarity
1aa1
2ab0.9960238411
3ac0.2857142857
4ba0.9960238411
5bb1
6bc0.3585685828
7ca0.2857142857
8cb0.3585685828
9cc1

Vector a and b are similar. So similarity is high.

Vector c faces different direction. So similarity is low.

Cosine similarity between pair of arrays in Bigquery – Stack Overflow

Conclusion

Today I explained about "How to calculate cosine similarity of array in BigQuery".

In order to calculate cosine similarity of array, we can check these following important pints.

  • Decompose array by "UNNEST(array)"
  • Add number to array element by "WITH OFFSET pos"
  • Calculate cosine similarity by multiplication between each array element

Using UNNEST(array) is little bit complicated.

How to calculate cosine similarity of array in BigQuery

この記事が気に入ったら
いいね または フォローしてね!

If you like this article, please share !
  • URLをコピーしました!
  • URLをコピーしました!

Author

karasanのアバター karasan System engineer

Mid-career engineer (AI, Data science, Salesforce, etc.).
Good at Python and SQL.

目次