Setup for populating Cosmos DB with random data using Logic Apps
We recently published an article about Cosmos DB Performance with Geospatial Data.
In this article, we’re going to explain how to set up the environment in order to run those performance tests.
More importantly, we believe this article is interesting on its own as it shows how to use Logic Apps to populate a Cosmos DB collection with random data in a very efficient way.
For this we will use a stored procedure as we explored in a past article.
The ARM Template is available on GitHub.
Azure Resources
We want to create three main Azure Resources:
- Cosmos DB Account
- Cosmos DB Connector (for Logic Apps)
- Logic App
We will also need to create artefacts within the Cosmos DB account. Namely:
- A Collection
- A modified Index Policy on the collection
- A Stored Procedure within the collection
ARM Template Deployment
Let’s create the Azure resources using the ARM template deployment available on GitHub (see deployment buttons at the bottom of the page).
The template has four parameters. The first one is mandatory, the other three have default values:
- Cosmos DB Account Name: Name of the Cosmos DB Account Azure resource; this must be unique across all Cosmos DB Accounts in Azure (not only ours)
- Partition Count: The number of partitions we’re going to seed data into (default is 4000)
- Records per partition: Number of records (documents) we’re going to seed per partition (default is 300)
- Geo Ratio: The ratio of documents which will have a geospatial location in them (default is 0.33, i.e. 33%)
If we leave the defaults as is, we’ll have 4000 x 300 = 1.2 million documents, with about a third of them (i.e. roughly 400 000) having geospatial locations. This corresponds to what we used for the performance tests.
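The arithmetic behind those defaults can be checked with a few lines of Python. This is only a sketch: the variable names are ours, and it assumes the geo ratio is applied to the total document count, which gives an expected figure slightly under the rounded 400 000.

```python
# Default parameter values taken from the ARM template description above.
partition_count = 4000
records_per_partition = 300
geo_ratio = 0.33

# Total number of documents seeded across all partitions.
total_documents = partition_count * records_per_partition

# Expected number of documents carrying a geospatial location.
expected_geo_documents = round(total_documents * geo_ratio)

print(total_documents)         # 1200000
print(expected_geo_documents)  # 396000
```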
Creating a Collection
Unfortunately, the Cosmos DB resource provider doesn’t expose sub components in the ARM model. So we can’t create the collection within the ARM template.
We could use the Command Line Interface (CLI) for Cosmos DB. Here we will use the Portal.
Let’s open the Cosmos DB Account resource created by the ARM template.
Let’s go to the Data Explorer tab.
Let’s then select New Collection and fill the form this way:
There are a few important fields in there:
- The database and collection name are hardcoded in the Logic App when invoking the stored procedure, so it is important to get them right
- Using the Unlimited storage capacity gives us a partitioned collection, which is what we want for performance at scale
- We initialize the throughput at 2500 RUs but we’ll change it for loading the data
- Partition key is “part”
Modifying Index Policy
While still in the Data Explorer, let’s select the Scale & Settings of our collection:
At the bottom of the pane, let’s edit the Indexing Policy:
Geospatial data isn’t indexed by default. We therefore need to add at least the “Point” data type for indexing.
The procedure is explained in the public documentation.
It is important to do this before loading the data so the data is indexed on load instead of asynchronously indexed after a change of policy.
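As an illustration of what the edited policy could look like, the sketch below builds a policy document in Python and prints it as JSON for pasting into the Indexing Policy editor. It assumes the per-path `indexes` layout that the public documentation has described for spatial indexing. The schema may differ depending on the account or API version, so the documentation remains the reference.

```python
import json

# Assumed indexing policy layout: range indexes on numbers and strings,
# plus a spatial index on the "Point" data type for every path.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                {"kind": "Range", "dataType": "Number", "precision": -1},
                {"kind": "Range", "dataType": "String", "precision": -1},
                {"kind": "Spatial", "dataType": "Point"},
            ],
        }
    ],
    "excludedPaths": [],
}

# JSON text to paste into the policy editor.
policy_json = json.dumps(indexing_policy, indent=2)
print(policy_json)
```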
Creating Stored Procedure
While still in the Data Explorer, let’s select New Stored Procedure:
Let’s enter createRecords as Stored Procedure Id.
For the body, let’s copy-paste the content of CreateRecords.js.
Click Save.
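The actual stored procedure lives in CreateRecords.js on GitHub and is not reproduced here. To give an idea of the kind of documents being seeded, here is a rough sketch in Python of generating random records for one partition, where a fraction of them receives a GeoJSON point. Only the `part` partition key comes from this article. The `name` and `location` fields, the value formats and the `make_record` helper are hypothetical.

```python
import random

def make_record(partition_index, record_index, geo_ratio, rng):
    """Build one random document; field names other than 'part' are hypothetical."""
    doc = {
        "part": f"part-{partition_index}",                    # partition key
        "name": f"record-{partition_index}-{record_index}",   # hypothetical field
    }
    # With probability geo_ratio, attach a GeoJSON Point (longitude, latitude).
    if rng.random() < geo_ratio:
        doc["location"] = {
            "type": "Point",
            "coordinates": [rng.uniform(-180, 180), rng.uniform(-90, 90)],
        }
    return doc

rng = random.Random(42)
records = [make_record(0, i, 0.33, rng) for i in range(300)]
geo_count = sum("location" in r for r in records)
print(len(records), geo_count)
```

All 300 records share the same `part` value, which mirrors the fact that one stored procedure call works within a single partition.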
Increase RUs
Before going into the Logic App, let’s beef up the Request Units (RUs) of our collection.
We suggest boosting it to the maximum, i.e. 100 000.
Then click Save.
Executing the Logic Apps
Let’s open the Logic App in the same Azure Resource Group.
The whole point of using Logic Apps here is to have a component that will invoke Cosmos DB stored procedures in parallel in a reliable fashion.
Let’s click on Run Trigger and then manual (in the sub menu).
A run calls the stored procedure 4000 times and takes about 5-6 minutes to do so.
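Conceptually, the Logic App fans out one stored procedure call per partition. The pattern can be sketched in Python with a thread pool. This is not how the Logic App is implemented: `call_stored_procedure` is a hypothetical stand-in that only simulates a call, and the degree of parallelism is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def call_stored_procedure(partition_index):
    """Hypothetical stand-in for one createRecords call on one partition."""
    # A real call would go to Cosmos DB; here we only simulate the result.
    return {"partition": partition_index, "created": 300}

def seed_all(partition_count, max_parallel=50):
    """Invoke the stand-in once per partition, several calls at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(call_stored_procedure, range(partition_count)))

results = seed_all(4000)
print(len(results))                        # 4000
print(sum(r["created"] for r in results))  # 1200000
```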
Reduce RUs
Do not forget to scale the collection back. The scale is the main driver for the cost of a collection.
Summary
Loading random data quickly into Cosmos DB is best done by leveraging stored procedures, as they run close to the data and can create documents very quickly.
Stored procedures run within a single partition, so we also need something to loop over the partitions, and this is what Logic Apps does here.
Logic Apps is also very cost effective since it is a server-less resource which incurs costs only when used.