Setup for populating Cosmos DB with random data using Logic Apps


We recently published an article about Cosmos DB Performance with Geospatial Data.

In this article, we’re going to explain how to set up the environment in order to run those performance tests.

More importantly, we believe this article is interesting on its own as it shows how to use Logic Apps to populate a Cosmos DB collection with random data in a very efficient way.

For this, we will use a stored procedure, as we explored in a past article.

The ARM Template is available on GitHub.

Azure Resources

We want to create three main Azure Resources:

  • Cosmos DB Account
  • Cosmos DB Connector (for Logic Apps)
  • Logic App

We will also need to create artefacts within the Cosmos DB account.  Namely:

  • A Collection
  • A modified Index Policy on the collection
  • A Stored Procedure within the collection

ARM Template Deployment

Let’s create the Azure resources using the ARM template deployment available on GitHub (see deployment buttons at the bottom of the page).

The template has four parameters.  The first one is mandatory, the other three have default values:

  • Cosmos DB Account Name:  Name of the Cosmos DB Account Azure resource; this must be unique across all Cosmos DB accounts in Azure (not only ours)
  • Partition Count:  The number of partitions we’re going to seed data into (default is 4000)
  • Records per partition:  Number of records (documents) we’re going to seed per partition (default is 300)
  • Geo Ratio:  The ratio of documents which will have a geospatial location in them (default is 0.33, i.e. 33%)

If we leave the defaults as is, we’ll have 4000 x 300 = 1.2 million documents, with about a third of them (roughly 400 000) carrying geospatial locations.  This corresponds to what we used for the performance tests.
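
As a quick sanity check on these numbers, the arithmetic can be written out in Python. The variable names below are ours for illustration; they are not the parameter identifiers used in the ARM template.

```python
# Default parameter values as listed above (names are illustrative).
partition_count = 4000
records_per_partition = 300
geo_ratio = 0.33

# Every partition receives the same number of documents.
total_documents = partition_count * records_per_partition

# Expected number of documents that carry a geospatial location.
expected_geo_documents = round(total_documents * geo_ratio)

print(f"Total documents: {total_documents:,}")
print(f"Expected geospatial documents: {expected_geo_documents:,}")
```

With a ratio of 0.33 the expected count is 396 000, which we round to 400 000 in the text. Since the locations are assigned randomly, the actual count will vary slightly from run to run.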

Creating a Collection

Unfortunately, the Cosmos DB resource provider doesn’t expose sub components in the ARM model.  So we can’t create the collection within the ARM template.

We could use the Command Line Interface (CLI) for Cosmos DB.  Here we will use the Portal.

Let’s open the Cosmos DB Account resource created by the ARM template.

Let’s go to the Data Explorer tab.

image

Let’s then select New Collection and fill the form this way:

image

There are a few important fields in there:

  • The database and collection names are hardcoded in the Logic App when invoking the stored procedure, so it is important to get them right
  • Using the Unlimited storage capacity gives us a partitioned collection, which is what we want for performance at scale
  • We initialize the throughput at 2500 RUs but we’ll change it for loading the data
  • Partition key is “part”

Modifying Index Policy

While still in the Data Explorer, let’s select the Scale & Settings of our collection:

image

At the bottom of the pane, let’s edit the Indexing Policy:

image

Geospatial data isn’t indexed by default.  We therefore need to add at least the “Point” data type for indexing.

The procedure is explained in the public documentation.

It is important to do this before loading the data so the data is indexed on load instead of asynchronously indexed after a change of policy.
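
For illustration, here is what a policy indexing points could look like, built as a Python dictionary and printed as JSON so it can be compared with the content of the Indexing Policy editor. This is only a sketch: it assumes the path-based policy format where each included path lists its `indexes`, and the two Range entries are our assumption about what the collection already contains. The public documentation remains the reference for the exact format.

```python
import json

# Sketch of an indexing policy with a spatial index on points.
# Assumption: path-based policy format with per-path "indexes" entries.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {
            "path": "/*",
            "indexes": [
                # Assumed pre-existing entries for numbers and strings.
                {"kind": "Range", "dataType": "Number", "precision": -1},
                {"kind": "Range", "dataType": "String", "precision": -1},
                # The entry that matters here: index GeoJSON points.
                {"kind": "Spatial", "dataType": "Point"},
            ],
        }
    ],
    "excludedPaths": [],
}

print(json.dumps(indexing_policy, indent=2))
```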

Creating Stored Procedure

While still in the Data Explorer, let’s select New Stored Procedure:

image

Let’s enter createRecords as Stored Procedure Id.

For the body, let’s copy-paste the content of CreateRecords.js.

Click Save.
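
CreateRecords.js is a JavaScript file available in the GitHub repository and we do not reproduce it here. To give an idea of the kind of logic such a procedure runs, here is a Python sketch that generates random documents for one partition. Apart from the partition key `part`, everything in it is hypothetical: the function name, the `id` and `value` fields, the `location` field name and the coordinate ranges are ours for illustration.

```python
import random


def make_records(partition, record_count, geo_ratio, rng=None):
    """Build random documents for a single partition (illustrative only)."""
    if rng is None:
        rng = random.Random()
    records = []
    for i in range(record_count):
        doc = {
            "id": f"{partition}-{i}",  # hypothetical id scheme
            "part": partition,         # partition key of the collection
            "value": rng.random(),     # hypothetical random payload
        }
        # Only a fraction of the documents gets a geospatial location.
        if rng.random() < geo_ratio:
            doc["location"] = {
                "type": "Point",
                # GeoJSON order is [longitude, latitude].
                "coordinates": [rng.uniform(-180, 180), rng.uniform(-90, 90)],
            }
        records.append(doc)
    return records


if __name__ == "__main__":
    sample = make_records("partition-0", 300, 0.33, random.Random(42))
    with_location = sum("location" in doc for doc in sample)
    print(f"{len(sample)} documents, {with_location} with a location")
```

All documents created by one call share the same `part` value, which mirrors the fact that a stored procedure execution is scoped to a single partition.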

Increase RUs

Before going into the Logic App, let’s beef up the Request Units (RUs) of our collection.

image

We suggest boosting it to the maximum, i.e. 100 000.

Then click Save.

Executing the Logic App

Let’s open the Logic App in the same Azure Resource Group.

image

The whole point of using Logic Apps here is to have a component that invokes Cosmos DB stored procedures in parallel in a reliable fashion.

Let’s click on Run Trigger and then manual (in the sub menu).

image

A run calls the stored procedure 4000 times and takes about 5-6 minutes to do so.
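
The Logic App itself is defined in the ARM template and we do not reproduce its definition here. Conceptually, it issues one stored procedure call per partition and runs those calls concurrently. The Python sketch below mimics that fan-out pattern with a thread pool. It is not how Logic Apps is implemented: `call_stored_procedure` is a hypothetical stand-in that only simulates a call, the partition naming is made up, and the degree of parallelism is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def call_stored_procedure(partition, record_count):
    """Hypothetical stand-in for invoking createRecords on one partition."""
    # A real implementation would call Cosmos DB here; this stub only
    # reports how many documents would have been created.
    return record_count


def populate(partition_count, records_per_partition, max_workers=20):
    """Run the stand-in once per partition, in parallel, and sum the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(call_stored_procedure, f"partition-{i}", records_per_partition)
            for i in range(partition_count)
        ]
        # result() re-raises the exception of any call that failed.
        return sum(future.result() for future in futures)


if __name__ == "__main__":
    print(populate(4000, 300))
```

What Logic Apps adds on top of this pattern is the reliability: we do not have to host or monitor the process driving the calls.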

Reduce RUs

Do not forget to scale the collection back down once the data is loaded.  The provisioned throughput is the main driver for the cost of a collection.

Summary

Loading random data quickly into Cosmos DB is best done by leveraging stored procedures, as they run close to the data and can create documents very quickly.

Stored procedures run within a partition.  So we also need something to loop over the partitions, and this is what the Logic App does here.

Logic Apps is also very cost effective since it is a serverless resource which incurs costs only when used.
