Cosmos DB Performance with Geospatial Data


Time for some performance tests on Cosmos DB around geospatial coordinates!

Let’s hurt the service and see where it shines, where it has a hard time and how scaling it (i.e. increasing Request Units or RUs) helps.

We’ll publish a how-to on creating the same setup in a future article so the tests can be reproduced.  In the meantime, the C# code for the performance test is on GitHub.

Setup

We are testing Cosmos DB geospatial capabilities.  Those rely on the GeoJSON specification.

We populated a partitioned collection with random data.

We wanted some volume, so we put a lot of non-geospatial data in there.  Documents have a geospatial “point” but also a bunch of long random strings and numbers.  Here is a sample document:


{
 "part": "2802",
 "name": "s2%R/@qcP@T<W?n_]\"f#2]QF3QHHC]PvzCDWP;`aE]WDCC`>Fnw?w9x9/+a(j^%^",
 "profile": {
 "age": 89,
 "salary": 23967.67,
 "project": "A$zGuVg/a8\\r[EComB\"zF!'82lR]M((1Z?Omt?Gm%,OE&QlO%w4Ti;NO+w5F?umRJdxwOk-b^bMkL/s)3qV_+Ph!VEvm//cO\\!,Itebb\\gn_>dK4tbyG.Rgh/tPW@DwG"
 },
 "alias": {
 "name": "m#L]wu6.vw/Sj6LwI1=Ph&!cM!782m!knm&u5b![@1',g-uo'_7k+mZgMbg<!22\"X'*1Nf8&8?Szu#WbeWB5[VwWonJC-S,./9SUudjypf<Xf2e\\>06jDFwwXi]@@\\+2/X>9d/1KE]F*Z!,7[1\"]`g&]#_<O\\%fC\\Z`Oc55e_dP#q]\\bzW%:^0=[d3U*x%k-r,H0s_<P=n_5_ks`#$3;D:L]Ko`8/6y#'BTcxU@-*1LP'_j#)BxC!A5V9yu9^M@.",
 "reference": "m@,%ms)`75so_7lLsM<y++b&BTdw('DPZ3C/j?RuHfzMF!lT$#2XH`zn&HEy_Z\"[;*b9?K_!jWG$.G^X3-&\\6ts)R'<Yu.-xWg51#.e+Rz#*1jdQxPk:gj%yw/c.X`f6F0ya#l^v!Bm/4#<Ljmp8\\=o@q$1=yck$RE(&)onVr*Op.Fev(j664oZD&\\n+(9wFd`&uGEYoI&Bg_CXl5i27UWsQ@]\\KL9zAo1FF-#OhB\\`Dk_-6PTA8Hp0*\\7b^fCFO"
 },
 "weapon": "eew,/:0GQxEOp0O8&m4=v$ousJB=..giIy!(SdW2.0qr@y(+!.&t0&]l#'C;QJW9=,ebaN#!aQ!q2DbW6'Vb!)U-\\gkfouy/&plw1l=yTq0c`m]uS9HWCJu<sN_R%]C)b.PIYEPb?HDn>4@<`SUSa/$$dt7-atQyS%LMgs0HJ#XX0XAXp0,dd0n=Bf\"-bpgFG9;v*^rscbxDnCnBY2Gd?P]-m@F(p\\BY;3ai;<3b_zLc_v(p0@Fk9pFwB-u?+lFEqj=2k@93K+48)4I9jl\\W[`k-VwL`J\\hT*WI]`^\\U=BOI%lKq[VXD^Qc7=+ZhEylN",
 "location": {
 "type": "Point",
 "coordinates": [
 -73.69452978624741,
 45.521848354422204
 ]
 },
 "id": "f06047a6-d5e7-4b10-dc30-7797f7a0145d",
 "_rid": "VlN5AL+aJAAIAAAAAAAAAA==",
 "_self": "dbs/VlN5AA==/colls/VlN5AL+aJAA=/docs/VlN5AL+aJAAIAAAAAAAAAA==/",
 "_etag": "\"0100a6dc-0000-0000-0000-59ea53720000\"",
 "_attachments": "attachments/",
 "_ts": 1508529010
}

The idea here was to avoid an overly artificial sample set containing only indexed geospatial data in a very compact form.  Instead we have lots of noise.  In fact, only 33% of documents contain geospatial data (the other 66% skip the location node).

The geospatial location was taken on a rectangle containing the island of Montreal.
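As an illustration, here is a minimal sketch of how such a random location could be generated in C#.  This is not the actual test code;  in particular, the bounding-box values below are only assumptions (a rough rectangle around the island of Montreal):


using System;

public static class LocationGenerator
{
    //  Assumed bounding rectangle around the island of Montreal
    private const double MinLongitude = -73.98;
    private const double MaxLongitude = -73.47;
    private const double MinLatitude = 45.40;
    private const double MaxLatitude = 45.71;

    private static readonly Random _random = new Random();

    //  Returns a GeoJSON-style point (longitude first, then latitude)
    //  uniformly distributed over the bounding rectangle
    public static object CreateRandomPoint()
    {
        var longitude = MinLongitude + _random.NextDouble() * (MaxLongitude - MinLongitude);
        var latitude = MinLatitude + _random.NextDouble() * (MaxLatitude - MinLatitude);

        return new
        {
            type = "Point",
            coordinates = new[] { longitude, latitude }
        };
    }
}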

Here are the main attributes of the sample set:

  • There are 1 200 000 documents
  • Documents are distributed across 4000 logical partitions, with 300 documents per logical partition
  • 33% of documents (i.e. 400 000 documents) have a location node with a geospatial “point” in there
  • Points are scattered uniformly over the geospatial rectangle
  • There is no correlation between the partition key and the geospatial point coordinates

We ran the tests with 4 different Request Units (RUs) configurations:

  • 2500
  • 10000
  • 20000
  • 100000

First test:  within a polygon

For the first test we take a query looking for points within a given polygon:


SELECT VALUE COUNT(1)
FROM record r
WHERE ST_WITHIN(
    r.location,
    {'type':'Polygon', 'coordinates':[@polyCoordinates]})

The polygon coordinates are passed as a parameter here.

The query simply counts the number of points.  We found that other aggregates (e.g. computing the average of a coordinate) yield similar performance.
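For reference, here is a rough sketch of one way such a parameterized query could be run with the .NET DocumentDB SDK (this isn’t necessarily how the GitHub test code does it);  the database and collection names are placeholders, and the cross-partition flag is needed because the query spans all logical partitions:


using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;

public static class WithinQueryRunner
{
    //  Counts the points falling within the given polygon ring
    public static async Task<long> CountPointsWithinAsync(
        DocumentClient client,
        double[][] polygonCoordinates)
    {
        //  "myDb" / "myCollection" are placeholder names
        var collectionUri = UriFactory.CreateDocumentCollectionUri("myDb", "myCollection");
        var querySpec = new SqlQuerySpec(
            @"SELECT VALUE COUNT(1)
              FROM record r
              WHERE ST_WITHIN(
                  r.location,
                  {'type':'Polygon', 'coordinates':[@polyCoordinates]})",
            new SqlParameterCollection
            {
                new SqlParameter("@polyCoordinates", polygonCoordinates)
            });
        var options = new FeedOptions
        {
            //  The data spans 4000 logical partitions
            EnableCrossPartitionQuery = true
        };
        var query = client
            .CreateDocumentQuery<long>(collectionUri, querySpec, options)
            .AsDocumentQuery();
        long count = 0;

        while (query.HasMoreResults)
        {
            foreach (var value in await query.ExecuteNextAsync<long>())
            {
                count += value;
            }
        }

        return count;
    }
}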

The polygon is built according to two parameters:  radius & edge count.  We basically “approximate” a circle of a given radius with a polygon of a given number of edges.
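Here is a minimal sketch of how such a polygon ring could be built, assuming the radius is expressed in the same units as the coordinates (i.e. degrees):


using System;
using System.Linq;

public static class PolygonFactory
{
    //  Approximates a circle centered on (centerLongitude, centerLatitude)
    //  by a ring of 'edgeCount' vertices
    public static double[][] CreateCircleApproximation(
        double centerLongitude,
        double centerLatitude,
        double radius,
        int edgeCount)
    {
        var vertices = Enumerable.Range(0, edgeCount)
            .Select(i => 2 * Math.PI * i / edgeCount)
            .Select(angle => new[]
            {
                centerLongitude + radius * Math.Cos(angle),
                centerLatitude + radius * Math.Sin(angle)
            })
            .ToList();

        //  GeoJSON requires a closed ring:  repeat the first vertex at the end
        vertices.Add(vertices[0]);

        return vertices.ToArray();
    }
}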

Here is the complete data:

RU Radius Edge Count # points Elapsed
2500 0.005 50 190.8 0:00:00.105
2500 0.005 25 189.6 0:00:00.110
2500 0.005 10 180 0:00:00.126
2500 0.005 4 126.4 0:00:00.365
2500 0.05 4 12436.6 0:00:05.030
2500 0.05 10 18114.4 0:00:09.086
2500 0.05 25 19127.6 0:00:12.327
2500 0.05 50 19264.4 0:00:19.153
2500 0.1 25 65645.2 0:00:44.628
2500 0.1 10 62667.2 0:00:46.960
2500 0.1 4 45142.8 0:00:50.812
2500 0.1 50 66089.4 0:01:17.700
10000 0.005 25 189.6 0:00:00.103
10000 0.005 10 180 0:00:00.136
10000 0.005 50 190.8 0:00:00.195
10000 0.005 4 126.4 0:00:00.328
10000 0.05 4 12436.6 0:00:01.655
10000 0.05 10 18114.4 0:00:02.004
10000 0.05 50 19264.4 0:00:02.564
10000 0.05 25 19127.6 0:00:02.673
10000 0.1 4 45142.8 0:00:04.889
10000 0.1 25 65645.2 0:00:06.184
10000 0.1 10 62667.2 0:00:06.300
10000 0.1 50 66089.4 0:00:06.674
20000 0.005 50 190.8 0:00:00.107
20000 0.005 25 189.6 0:00:00.110
20000 0.005 10 180 0:00:00.112
20000 0.005 4 126.4 0:00:00.324
20000 0.05 4 12436.6 0:00:01.075
20000 0.05 10 18114.4 0:00:01.306
20000 0.05 25 19127.6 0:00:01.550
20000 0.05 50 19264.4 0:00:01.752
20000 0.1 4 45142.8 0:00:03.073
20000 0.1 10 62667.2 0:00:03.512
20000 0.1 25 65645.2 0:00:03.685
20000 0.1 50 66089.4 0:00:04.129
100000 0.005 25 189.6 0:00:00.107
100000 0.005 50 190.8 0:00:00.115
100000 0.005 10 180 0:00:00.138
100000 0.005 4 126.4 0:00:00.320
100000 0.05 4 12436.6 0:00:01.095
100000 0.05 10 18114.4 0:00:01.294
100000 0.05 25 19127.6 0:00:01.547
100000 0.05 50 19264.4 0:00:01.651
100000 0.1 4 45142.8 0:00:02.888
100000 0.1 10 62667.2 0:00:03.221
100000 0.1 25 65645.2 0:00:03.615
100000 0.1 50 66089.4 0:00:04.156

There are a few ways to slice the data.  It is easier to look at a slice with constant RU.  For instance RU = 2500:

RU Radius Edge Count # points Elapsed
2500 0.005 50 190.8 0:00:00.105
2500 0.005 25 189.6 0:00:00.110
2500 0.005 10 180 0:00:00.126
2500 0.005 4 126.4 0:00:00.365
2500 0.05 4 12436.6 0:00:05.030
2500 0.05 10 18114.4 0:00:09.086
2500 0.05 25 19127.6 0:00:12.327
2500 0.05 50 19264.4 0:00:19.153
2500 0.1 25 65645.2 0:00:44.628
2500 0.1 10 62667.2 0:00:46.960
2500 0.1 4 45142.8 0:00:50.812
2500 0.1 50 66089.4 0:01:17.700

The data here is sorted by elapsed time, i.e. the time the query took to return.

We can quickly see that the main driver of performance is the number of documents returned.  If we plot one against the other:

[Chart:  total elapsed time (in milliseconds) plotted against the number of points returned]

Here we converted the elapsed time into total milliseconds in order to get an integer value, which is easier to plot (in Excel anyway).

We observe the relationship is actually linear.

Now let’s look at the effect of scaling the request units, i.e. the amount of compute dedicated to the collection.  Let’s take the slowest query and look at it executing with increasing RUs:

RU Radius Edge Count # points Elapsed
2500 0.1 50 66089.4 0:01:18.000
10000 0.1 50 66089.4 0:00:06.674
20000 0.1 50 66089.4 0:00:04.129
100000 0.1 50 66089.4 0:00:04.156

We can see that increasing the RUs from 2500 to 10000 improves performance by an order of magnitude.  Doubling the RUs to 20000 improves it only by about 50%, while going all the way to 100000 doesn’t yield further improvement.

This is an important insight:  depending on the query, scaling might or might not improve performance notably.
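For completeness, here is a rough sketch of how the provisioned throughput could be changed between test runs with the same .NET SDK (not necessarily how we did it);  the collection self-link would be read from the collection beforehand:


using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class ThroughputScaler
{
    //  Replaces the offer (provisioned throughput) attached to a collection
    public static async Task ScaleAsync(
        DocumentClient client,
        string collectionSelfLink,
        int requestUnits)
    {
        //  Find the offer associated with the collection
        Offer offer = client.CreateOfferQuery()
            .Where(o => o.ResourceLink == collectionSelfLink)
            .AsEnumerable()
            .First();

        //  Replace it with a new offer at the requested throughput
        await client.ReplaceOfferAsync(new OfferV2(offer, requestUnits));
    }
}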

Filtering

Some of the queries performed pretty slowly with low RUs.

As we observed, those are the queries returning many points.

This might be necessary for an application, but often we want to limit the number of returned values.  It is interesting to see that if we use another criterion in the WHERE clause to limit the number of points returned, we drastically improve performance.

Let’s modify the test query to:


SELECT VALUE COUNT(1)
FROM record r
WHERE ST_WITHIN(
    r.location,
    {'type':'Polygon', 'coordinates':[@polyCoordinates]})
    AND r.profile.age < 25

The age property is randomly generated to be uniformly distributed between 0 and 99.  The filter we just added should therefore shrink the result set by a factor of roughly 4.

It does:  where the preceding query returned more than 66000 points and took close to 80 seconds to run, it now returns 16293 points and takes less than 8 seconds to run.

Here we see the power of automatic indexing in Cosmos DB.

Second (and last) test:  proximity

For the second test, our query looks for points in proximity to a given point:


SELECT VALUE COUNT(1)
FROM record r
WHERE ST_DISTANCE(
    r.location,
    {'type':'Point', 'coordinates':@center}) < @radius

The center and radius are passed as parameters here.

Again, the query simply counts the number of points.
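Here is a rough sketch of how the proximity query could be parameterized in C# (it would be executed with the same cross-partition pattern as in the first test);  ST_DISTANCE works in meters, so the radius parameter is assumed to be in meters as well:


using Microsoft.Azure.Documents;

public static class ProximityQueryFactory
{
    //  Builds the parameterized proximity query;  the radius is assumed
    //  to be in meters since ST_DISTANCE returns meters
    public static SqlQuerySpec CreateQuerySpec(double[] center, double radius)
    {
        return new SqlQuerySpec(
            @"SELECT VALUE COUNT(1)
              FROM record r
              WHERE ST_DISTANCE(
                  r.location,
                  {'type':'Point', 'coordinates':@center}) < @radius",
            new SqlParameterCollection
            {
                new SqlParameter("@center", center),   //  e.g. { -73.69, 45.52 }
                new SqlParameter("@radius", radius)
            });
    }
}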

Here we simply took increasing radius values in order to compare performance.  Here is the raw data:

RU Radius # points Elapsed
2500 1000 896 0:00:00.533
2500 100 6.6 0:00:00.732
2500 3000 8109.8 0:00:03.249
2500 10000 68680.8 0:00:50.045
10000 1000 896 0:00:00.543
10000 100 6.6 0:00:00.724
10000 3000 8109.8 0:00:01.019
10000 10000 68680.8 0:00:05.698
20000 1000 896 0:00:00.518
20000 100 6.6 0:00:00.792
20000 3000 8109.8 0:00:00.964
20000 10000 68680.8 0:00:03.546
100000 100 6.6 0:00:00.664
100000 1000 896 0:00:00.679
100000 3000 8109.8 0:00:01.179
100000 10000 68680.8 0:00:03.499

We notice the same thing as in the first test:  the more points returned by the query, the longer it takes.

We also notice the same improvement from 2500 to 10000 RUs, and then only marginal improvement from increasing further.

Summary

We did two main performance tests for Cosmos DB geospatial queries:  polygon inclusion & proximity.

The main conclusions were:

  • The more points returned, the longer the query takes
  • If we filter the query further, it returns fewer points and performs better
  • Scaling the Request Units (RUs) does improve performance, but not in a linear fashion; it is important to do performance tests in order to invest in RUs properly
  • Cosmos DB does a great job of using its indexes and can perform well despite a large volume of documents

We strongly suggest you run similar tests with metrics closer to your workload in order to understand the performance profile of your query.
