Monitoring query performance in Cosmos DB

A common question with Cosmos DB is: how many Request Units (RUs) should I be using?

Although it’s easy to quickly guess what you need to start, it’s not so trivial to determine exactly what is needed.

My general advice (with Cosmos DB & other DB as a service) is to provision something pessimistically (i.e. put more RUs than not enough) and then monitor.

But how to monitor?

Well, the documentation online now gives us a great couple example of queries: Monitoring Azure Cosmos DB.

It is based on Azure Monitor and we can therefore query logs using the Kusto query language. That means that we can slice and dice the logs any way we want to find insights. That makes it very powerful.

A section of the doc builds the queries from a simple one to more sophisticated ones.

Just to give an example of the capability, here is one of the queries ran on a collection:

AzureDiagnostics
| where ResourceProvider=="MICROSOFT.DOCUMENTDB" and Category=="DataPlaneRequests"
| project activityId_g, requestCharge_s
| join kind= inner (
       AzureDiagnostics
       | where ResourceProvider =="MICROSOFT.DOCUMENTDB" and Category == "QueryRuntimeStatistics"
       | project activityId_g, querytext_s
) on $left.activityId_g == $right.activityId_g
| order by requestCharge_s desc
| limit 100

And the result on a collection where we just ran a few queries:

activityId_g requestCharge_s activityId_g1 querytext_s
a3d1… 2.85 a3d1… {“query”:”SELECT {"p1": {"p2": sum(c.p3)}} AS p4\nFROM c”,”parameters”:[]}
eae0 2.44 eae0… {“query”:”SELECT c.id, c._self, c._rid, c._ts, c["p1"] AS p2__18\nFROM c”,”parameters”:[]}
26c0… 2.39 26c0… {“query”:”SELECT *\nFROM c”,”parameters”:[]}
4f44… 2.39 4f44… {“query”:”SELECT *\nFROM c”,”parameters”:[]}
d2a5… 2.39 d2a5… {“query”:”SELECT c.p1, c.p2, c.p3\nFROM c”,”parameters”:[]}
167a… 2.36 167a… {“query”:”SELECT c.name\nFROM c”,”parameters”:[]}

We can see the queries aren’t as we entered them. Parameters have been aliased. Nevertheless, it gives us an idea of the shape of the query.

Looking over a longer period we could determine the minimum and maximum RU used by a workload.

We could then make an inform decision on the amount RU to provision or start using autopilot.


Leave a comment