Cosmos DB Performance with Geospatial Data
Solution ·Time for some performance tests on Cosmos DB around Geospatial coordinates!
Let’s hurt the service and see where it shines, where it has a hard time and how scaling it (i.e. increasing Request Units or RUs) helps.
We’ll publish an how-to create the same setup in a future article so tests can be reproduced. In the meantime, the C# code for the performance test is on GitHub.
Setup
We are testing Cosmos DB geospatial capabilities. Those rely on GeoJSON specification.
We populated a partitioned collection with random data.
We wanted some volume so we did put of lot of non-geospatial data in there. Documents have a geospatial “point” but also a bunch of long random strings and numbers. Here is a sample document:
{
"part": "2802",
"name": "s2%R/@qcP@T<W?n_]\"f#2]QF3QHHC]PvzCDWP;`aE]WDCC`>Fnw?w9x9/+a(j^%^",
"profile": {
"age": 89,
"salary": 23967.67,
"project": "A$zGuVg/a8\\r[EComB\"zF!'82lR]M((1Z?Omt?Gm%,OE&QlO%w4Ti;NO+w5F?umRJdxwOk-b^bMkL/s)3qV_+Ph!VEvm//cO\\!,Itebb\\gn_>dK4tbyG.Rgh/tPW@DwG"
},
"alias": {
"name": "m#L]wu6.vw/Sj6LwI1=Ph&!cM!782m!knm&u5b![@1',g-uo'_7k+mZgMbg<!22\"X'*1Nf8&8?Szu#WbeWB5[VwWonJC-S,./9SUudjypf<Xf2e\\>06jDFwwXi]@@\\+2/X>9d/1KE]F*Z!,7[1\"]`g&]#_<O\\%fC\\Z`Oc55e_dP#q]\\bzW%:^0=[d3U*x%k-r,H0s_<P=n_5_ks`#$3;D:L]Ko`8/6y#'BTcxU@-*1LP'_j#)BxC!A5V9yu9^M@.",
"reference": "m@,%ms)`75so_7lLsM<y++b&BTdw('DPZ3C/j?RuHfzMF!lT$#2XH`zn&HEy_Z\"[;*b9?K_!jWG$.G^X3-&\\6ts)R'<Yu.-xWg51#.e+Rz#*1jdQxPk:gj%yw/c.X`f6F0ya#l^v!Bm/4#<Ljmp8\\=o@q$1=yck$RE(&)onVr*Op.Fev(j664oZD&\\n+(9wFd`&uGEYoI&Bg_CXl5i27UWsQ@]\\KL9zAo1FF-#OhB\\`Dk_-6PTA8Hp0*\\7b^fCFO"
},
"weapon": "eew,/:0GQxEOp0O8&m4=v$ousJB=..giIy!(SdW2.0qr@y(+!.&t0&]l#'C;QJW9=,ebaN#!aQ!q2DbW6'Vb!)U-\\gkfouy/&plw1l=yTq0c`m]uS9HWCJu<sN_R%]C)b.PIYEPb?HDn>4@<`SUSa/$$dt7-atQyS%LMgs0HJ#XX0XAXp0,dd0n=Bf\"-bpgFG9;v*^rscbxDnCnBY2Gd?P]-m@F(p\\BY;3ai;<3b_zLc_v(p0@Fk9pFwB-u?+lFEqj=2k@93K+48)4I9jl\\W[`k-VwL`J\\hT*WI]`^\\U=BOI%lKq[VXD^Qc7=+ZhEylN",
"location": {
"type": "Point",
"coordinates": [
-73.69452978624741,
45.521848354422204
]
},
"id": "f06047a6-d5e7-4b10-dc30-7797f7a0145d",
"_rid": "VlN5AL+aJAAIAAAAAAAAAA==",
"_self": "dbs/VlN5AA==/colls/VlN5AL+aJAA=/docs/VlN5AL+aJAAIAAAAAAAAAA==/",
"_etag": "\"0100a6dc-0000-0000-0000-59ea53720000\"",
"_attachments": "attachments/",
"_ts": 1508529010
}
The idea here was not to create a too artificial sample set containing only indexed geospatial data in a very compact form. Instead we have lots of noise. Actually, only %33 of documents contain geospatial data (the other %66 skip the location node).
The geospatial location was taken on a rectangle containing the island of Montreal.
Here are the main attributes of the sample set:
- There are 1 200 000 documents
- Documents are distributed on 4000 logical partitions with 300 documents per logical partition
- %33 of documents (i.e. 400 000 documents) have a location node with a geospatial “point” in there
- Points are scattered uniformly on the geospatial rectangle
- There are no correlation between the partition key and the geospatial point coordinates
We ran the tests with 4 different Request Units (RUs) configurations:
- 2500
- 10000
- 20000
- 100000
First test: within a polygon
For the first test we take a query looking for points within a given polygon:
SELECT VALUE COUNT(1)
FROM record r
WHERE ST_WITHIN(
r.location,
{'type':'Polygon', 'coordinates':[@polyCoordinates]})
The polygon coordinates are passed in parameter here.
The query simply counts the number of point. We found that other aggregates (e.g. computing the average of a coordinate) yield similar performance.
The polygon is built according to two parameters: radius & edge count. We basically “approximate” a circle of a given radius with a polygon of a given number of edges.
Here is the complete data:
RU | Radius | Edge Count | # points | Elapsed |
---|---|---|---|---|
2500 | 0.005 | 50 | 190.8 | 0:00:00.105 |
2500 | 0.005 | 25 | 189.6 | 0:00:00.110 |
2500 | 0.005 | 10 | 180 | 0:00:00.126 |
2500 | 0.005 | 4 | 126.4 | 0:00:00.365 |
2500 | 0.05 | 4 | 12436.6 | 0:00:05.030 |
2500 | 0.05 | 10 | 18114.4 | 0:00:09.086 |
2500 | 0.05 | 25 | 19127.6 | 0:00:12.327 |
2500 | 0.05 | 50 | 19264.4 | 0:00:19.153 |
2500 | 0.1 | 25 | 65645.2 | 0:00:44.628 |
2500 | 0.1 | 10 | 62667.2 | 0:00:46.960 |
2500 | 0.1 | 4 | 45142.8 | 0:00:50.812 |
2500 | 0.1 | 50 | 66089.4 | 0:01:17.700 |
10000 | 0.005 | 25 | 189.6 | 0:00:00.103 |
10000 | 0.005 | 10 | 180 | 0:00:00.136 |
10000 | 0.005 | 50 | 190.8 | 0:00:00.195 |
10000 | 0.005 | 4 | 126.4 | 0:00:00.328 |
10000 | 0.05 | 4 | 12436.6 | 0:00:01.655 |
10000 | 0.05 | 10 | 18114.4 | 0:00:02.004 |
10000 | 0.05 | 50 | 19264.4 | 0:00:02.564 |
10000 | 0.05 | 25 | 19127.6 | 0:00:02.673 |
10000 | 0.1 | 4 | 45142.8 | 0:00:04.889 |
10000 | 0.1 | 25 | 65645.2 | 0:00:06.184 |
10000 | 0.1 | 10 | 62667.2 | 0:00:06.300 |
10000 | 0.1 | 50 | 66089.4 | 0:00:06.674 |
20000 | 0.005 | 50 | 190.8 | 0:00:00.107 |
20000 | 0.005 | 25 | 189.6 | 0:00:00.110 |
20000 | 0.005 | 10 | 180 | 0:00:00.112 |
20000 | 0.005 | 4 | 126.4 | 0:00:00.324 |
20000 | 0.05 | 4 | 12436.6 | 0:00:01.075 |
20000 | 0.05 | 10 | 18114.4 | 0:00:01.306 |
20000 | 0.05 | 25 | 19127.6 | 0:00:01.550 |
20000 | 0.05 | 50 | 19264.4 | 0:00:01.752 |
20000 | 0.1 | 4 | 45142.8 | 0:00:03.073 |
20000 | 0.1 | 10 | 62667.2 | 0:00:03.512 |
20000 | 0.1 | 25 | 65645.2 | 0:00:03.685 |
20000 | 0.1 | 50 | 66089.4 | 0:00:04.129 |
100000 | 0.005 | 25 | 189.6 | 0:00:00.107 |
100000 | 0.005 | 50 | 190.8 | 0:00:00.115 |
100000 | 0.005 | 10 | 180 | 0:00:00.138 |
100000 | 0.005 | 4 | 126.4 | 0:00:00.320 |
100000 | 0.05 | 4 | 12436.6 | 0:00:01.095 |
100000 | 0.05 | 10 | 18114.4 | 0:00:01.294 |
100000 | 0.05 | 25 | 19127.6 | 0:00:01.547 |
100000 | 0.05 | 50 | 19264.4 | 0:00:01.651 |
100000 | 0.1 | 4 | 45142.8 | 0:00:02.888 |
100000 | 0.1 | 10 | 62667.2 | 0:00:03.221 |
100000 | 0.1 | 25 | 65645.2 | 0:00:03.615 |
100000 | 0.1 | 50 | 66089.4 | 0:00:04.156 |
There are a few ways to slice the data. It is easier to look at a slice with constant RU. For instance RU = 2500:
RU | Radius | Edge Count | # points | Elapsed |
---|---|---|---|---|
2500 | 0.005 | 50 | 190.8 | 0:00:00.105 |
2500 | 0.005 | 25 | 189.6 | 0:00:00.110 |
2500 | 0.005 | 10 | 180 | 0:00:00.126 |
2500 | 0.005 | 4 | 126.4 | 0:00:00.365 |
2500 | 0.05 | 4 | 12436.6 | 0:00:05.030 |
2500 | 0.05 | 10 | 18114.4 | 0:00:09.086 |
2500 | 0.05 | 25 | 19127.6 | 0:00:12.327 |
2500 | 0.05 | 50 | 19264.4 | 0:00:19.153 |
2500 | 0.1 | 25 | 65645.2 | 0:00:44.628 |
2500 | 0.1 | 10 | 62667.2 | 0:00:46.960 |
2500 | 0.1 | 4 | 45142.8 | 0:00:50.812 |
2500 | 0.1 | 50 | 66089.4 | 0:01:17.700 |
The data here is sorted by elapsed time, i.e. the time the query took to return.
We can quickly see that the main driver for the performance is the number of documents returned. If we plot one against the other:
Here we transformed the elapsed time in total milliseconds in order to get an integer value which is easier to plot (in Excel anyway).
We observe the relationship is actually linear.
Now let’s look at the effect of scaling the request units, i.e. the amount of compute dedicated to the collection. Let’s take the slowest query and look at it executing with increasing RUs:
RU | Radius | Edge Count | # points | Elapsed |
---|---|---|---|---|
2500 | 0.1 | 50 | 66089.4 | 0:01:18.000 |
10000 | 0.1 | 50 | 66089.4 | 0:00:06.674 |
20000 | 0.1 | 50 | 66089.4 | 0:00:04.129 |
100000 | 0.1 | 50 | 66089.4 | 0:00:04.156 |
We can see that increasing RU from 2500 to 10000 improves the performance by an order of magnitude. Doubling the RUs to 20000 improves only by %50 while going all the way to 100000 doesn’t yield improvement.
This is an important insight: depending on the query, scaling might or might not improve performance notably.
Filtering
Some of the queries performed pretty slowly with low RUs.
As we observed those are the queries returning many points.
This might be necessary for an application but often we want to limit the number of returned value. It is interesting to see that if we use another criteria in the where clause to limit the number of points returned, we drastically improve performance.
Let’s modify the test query to:
SELECT VALUE COUNT(1)
FROM record r
WHERE ST_WITHIN(
r.location,
{'type':'Polygon', 'coordinates':[@polyCoordinates]})
AND r.profile.age<25
The age property is randomly generated to be uniformly distributed between 0 and 99. The filter we just put should shrink the result set by a factor 4.
It does: where the preceding query was returning more than 66000 points and take close to 80 seconds to run, it now returns 16293 and take below 8 seconds to run.
Here we see the power of automatic indexing in Cosmos DB.
Second (and last) test: proximity
For the second test our query is looking for points at proximity to a given poing:
SELECT
VALUE COUNT(1)
FROM record r
WHERE ST_DISTANCE (
r.location,
{'type':'Point', 'coordinates':@center})<@radius
The center and radius are passed in parameters here.
Again the query simply counts the number of point.
Here we simply took increasing radius value in order to compare performance. Here is the raw data:
RU | Radius | # points | Elapsed |
---|---|---|---|
2500 | 1000 | 896 | 0:00:00.533 |
2500 | 100 | 6.6 | 0:00:00.732 |
2500 | 3000 | 8109.8 | 0:00:03.249 |
2500 | 10000 | 68680.8 | 0:00:50.045 |
10000 | 1000 | 896 | 0:00:00.543 |
10000 | 100 | 6.6 | 0:00:00.724 |
10000 | 3000 | 8109.8 | 0:00:01.019 |
10000 | 10000 | 68680.8 | 0:00:05.698 |
20000 | 1000 | 896 | 0:00:00.518 |
20000 | 100 | 6.6 | 0:00:00.792 |
20000 | 3000 | 8109.8 | 0:00:00.964 |
20000 | 10000 | 68680.8 | 0:00:03.546 |
100000 | 100 | 6.6 | 0:00:00.664 |
100000 | 1000 | 896 | 0:00:00.679 |
100000 | 3000 | 8109.8 | 0:00:01.179 |
100000 | 10000 | 68680.8 | 0:00:03.499 |
We notice the same thing than in the first test: the more points returned by the query, the longer it takes.
We also notice the same improvement from 2500 to 10000 RUs and then the marginal improvement by increasing further.
Summary
We did two main performance test for Cosmos DB geospatial queries: polygon inclusion & proximity.
The main conclusions were:
- The more points returned, the longer the query takes
- If we filter further the query, it returns less points and perform better
- Scaling the Request Units (RUs) do improve performance but not in a linear fashion ; it is important to do performance test in order to properly invest in RUs
- Cosmos DB does a great job at using its indexes and can perform well despite having a big volume of documents
We strongly suggest you run similar tests with metrics closer to your workload in order to understand the performance profile of your query.