
BigQuery(BQ) is very useful for data analysis or processing.
It is good at handling huge data. It returns summary result in short time.
BQ has one useful data format array.
When we consider array as vector, we may want cosine similarity of vectors.
So how can we get cosine similarity ?
So today I introduce about "How to calculate cosine similarity of array in BigQuery".
Author

Advantage to read
You can understand "How to calculate cosine similarity of array in BigQuery". Then you don't have to concern about similarity.
What is cosine similarity
Cosine similarity is a measure of similarity between two vectors.
Cosine cos
is one of the trigonometric functions.
It takes 1 in case of 0°, 0 in case of 90° and -1 in case of 180°.
So it means that we can get an angle if we know value of cosine.
And if an angle between 2 vectors is close to zero, it means 2 vectors are similar.
So the cosine is used as similarity.
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space.
Reference: Cosine similarity - Wikipedia
The formula of cosine similarity is below.
![]() |
---|

Data
Before calculate it, we should prepare data.
In order to make array data, we can use the table which we create in previous topic.
Read more
You can add CREATE TABLE
to SQL, and create table.
SQL
CREATE TABLE test.array_sample2 AS SELECT key, ARRAY_AGG(val) as val_array FROM test.array_sample GROUP BY key
Result table
Row | key | val_array |
1 | a | 1 |
2 | ||
3 | ||
2 | b | 2 |
4 | ||
5 | ||
3 | c | 3 |
2 | ||
-1 |
How to calculate cosine similarity of array in BigQuery
In order to calculate cosine similarity of array in BigQuery, we should calculate product between elements of vectors.
SQL is below.
SQL
SELECT t1.key AS key1, t2.key AS key2, ( SELECT SUM(value1 * value2)/ SQRT(SUM(value1 * value1))/ SQRT(SUM(value2 * value2)) FROM UNNEST(t1.val_array) AS value1 WITH OFFSET pos1 JOIN UNNEST(t2.val_array) AS value2 WITH OFFSET pos2 ON pos1 = pos2 ) AS cosine_similarity FROM test.array_sample2 AS t1, test.array_sample2 AS t2 ORDER BY key1, key2, cosine_similarity
It decomposed array to each element bt UNNEST(array)
, and added order number by WITH OFFSET pos
.
Then use pos
as join key for multiplication of each element.
Result is below.
Result of SQL
Row | key1 | key2 | cosine_similarity |
---|---|---|---|
1 | a | a | 1 |
2 | a | b | 0.9960238411 |
3 | a | c | 0.2857142857 |
4 | b | a | 0.9960238411 |
5 | b | b | 1 |
6 | b | c | 0.3585685828 |
7 | c | a | 0.2857142857 |
8 | c | b | 0.3585685828 |
9 | c | c | 1 |
Vector a and b are similar. So similarity is high.
Vector c faces different direction. So similarity is low.
Conclusion
Today I explained about "How to calculate cosine similarity of array in BigQuery".
In order to calculate cosine similarity of array, we can take solution below.
Point

UNNEST(array)
is little difficult.There are some other articles about BigQuey.
If you interested in them, please read them.