
Hacking: accessing a graph in Cosmos DB with SQL / DocumentDB API

Azure Cosmos DB is Microsoft’s globally distributed multi-model database service.

At this point in time (August 2017) there are four supported models:  DocumentDB (also named SQL because the query language is similar to T-SQL), MongoDB, Table & Gremlin.

We’ve seen how to use Cosmos DB with Gremlin in a past article.

Now here’s a little secret:  although we choose the “model” (e.g. Gremlin) at the Cosmos DB account level, we can use other models to query the data.

Not all combinations are possible, but many are.  Specifically, we can query a Gremlin graph using the DocumentDB / SQL query language.

The graph is then projected into documents.

We will explore that in this article.

Why is that interesting?  Because there are a lot of tools out there that we might already be familiar with for manipulating DocumentDB (or MongoDB).  Having the possibility to look at a graph with other APIs extends our toolset beyond Gremlin-based ones.

Creating a simple graph in Gremlin

Let’s create a simple graph in Cosmos DB using Gremlin.  In a past article we’ve looked at how to set up Gremlin with Cosmos DB.


gremlin> :remote connect tinkerpop.server conf/remote-secure.yaml

gremlin> :> g.addV('person').property('id', 'Alice').property('age', 42).property('department', 'stereotype')

gremlin> :> g.addV('person').property('id', 'Bob').property('age', 24).property('department', 'support character')

gremlin> :> g.V('Alice').addE('communicatesWith').property('id', 'AliceToBob').property('language', 'English').to(g.V('Bob'))

The first line is there to connect to the remote server we configured in remote-secure.yaml.  For details see the setup article.

We now have a toy graph with two vertices connected with one edge.  Nothing too fancy but that will be enough for our purpose.


We can note the following:

  • We provided the ids of objects ; this isn’t always possible in graph databases but is possible with Cosmos DB (if we don’t provide it, a randomly generated GUID is automatically provisioned)
  • We did provide a custom property (i.e. language) on the edge
  • The graph partition key is department hence we provided it for each vertex

Document Query

The code is available on GitHub, more specifically in the Program.cs file.

Here we build on the code from the Cosmos DB async streaming article.  We simply read all the documents in the graph with DocumentDB API and output them in JSON format:


private async static Task ListAllDocumentsAsync(
    DocumentClient client,
    Uri collectionUri)
{
    var query = client.CreateDocumentQuery(
        collectionUri,
        new FeedOptions
        {
            EnableCrossPartitionQuery = true
        });
    var queryAll = query.AsDocumentQuery();
    var all = await GetAllResultsAsync(queryAll);

    Console.WriteLine($"Collection contains {all.Length} documents:");

    foreach (var d in all)
    {
        var json = GetJson(d);

        //  Clean up the 'CarolToAlice' edge (added later in this article) so the sample can be re-run
        if (d.Id == "CarolToAlice")
        {
            await client.DeleteDocumentAsync(
                d.SelfLink,
                new RequestOptions
                {
                    PartitionKey = new PartitionKey(d.GetPropertyValue<string>("department"))
                });
        }

        Console.WriteLine(json);
    }

    Console.WriteLine();
}

The output should be the following:


{
   "id": "Bob",
   "_rid": "smp9AKyqeQADAAAAAAAABA==",
   "_self": "dbs/smp9AA==/colls/smp9AKyqeQA=/docs/smp9AKyqeQADAAAAAAAABA==/",
   "_ts": 1504096168,
   "_etag": "\"00001c04-0000-0000-0000-59a6afad0000\"",
   "label": "person",
   "age": [
     {
       "_value": 24,
       "id": "88a659bf-84d1-4c13-8450-ee57b426b7b3"
     }
   ],
   "department": "support character"
}
 {
   "id": "Alice",
   "_rid": "smp9AKyqeQAKAAAAAAAABg==",
   "_self": "dbs/smp9AA==/colls/smp9AKyqeQA=/docs/smp9AKyqeQAKAAAAAAAABg==/",
   "_ts": 1504096164,
   "_etag": "\"0000ed09-0000-0000-0000-59a6afa60000\"",
   "label": "person",
   "age": [
     {
       "_value": 42,
       "id": "78109dc8-587f-4d87-9d2e-e4a1731dec2b"
     }
   ],
   "department": "stereotype"
 }
 {
   "id": "AliceToBob",
   "_rid": "smp9AKyqeQALAAAAAAAABg==",
   "_self": "dbs/smp9AA==/colls/smp9AKyqeQA=/docs/smp9AKyqeQALAAAAAAAABg==/",
   "_ts": 1504096178,
   "_etag": "\"0000ee09-0000-0000-0000-59a6afb40000\"",
   "label": "communicatesWith",
   "language": "English",
   "_sink": "Bob",
   "_sinkLabel": "person",
   "_sinkPartition": "support character",
   "_vertexId": "Alice",
   "_vertexLabel": "person",
   "_isEdge": true,
   "department": "stereotype"
 }

We can learn a lot from this projection:

  • Vertices are pretty close to simple DocumentDB documents ; the properties starting with an underscore (_) are our usual DocumentDB metadata (e.g. _self)
  • Vertex properties (e.g. age) are represented as an array of complex sub-structures (a _value and an id) ; this is because in Gremlin a vertex’s (or edge’s) properties can have multiple values
  • Edges are more complex
    • A metadata property _isEdge seems to be the discriminator between a vertex and an edge
    • _vertexId & _vertexLabel identify the “source” of the edge (the starting point)
    • _sink, _sinkLabel & _sinkPartition identify the “target” of the edge (the destination point)
    • The partition of the edge is the same as the “source” vertex, even if we didn’t specify it in Gremlin
    • The custom property language is a flat property, not a complex one with arrays as in the vertices

Given that information, we can easily write queries, for instance, to list only vertices:


private class MinimalDoc
{
    public string id { get; set; }
    public bool? _isEdge { get; set; }
}

private async static Task ListOnlyVerticesAsync(
    DocumentClient client,
    Uri collectionUri)
{
    var query = client.CreateDocumentQuery<MinimalDoc>(
        collectionUri,
        new FeedOptions
        {
            EnableCrossPartitionQuery = true
        });
    var queryVertex = (from d in query
                        where !d._isEdge.HasValue
                        select d).AsDocumentQuery();
    var all = await GetAllResultsAsync(queryVertex);

    Console.WriteLine($"Collection contains {all.Length} documents:");

    foreach (var d in all)
    {
        Console.WriteLine(d.id);
    }

    Console.WriteLine();
}

This should list Alice & Bob but not the edge between them.
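
For reference, here is the same “vertices only” filter expressed directly in the DocumentDB SQL dialect instead of LINQ.  This is a sketch reusing the client, collectionUri and GetAllResultsAsync helpers from above ; the IS_DEFINED built-in function lets us test for the absence of the _isEdge discriminator:


// Sketch:  raw SQL equivalent of the LINQ query above
var queryVertexSql = client.CreateDocumentQuery<MinimalDoc>(
    collectionUri,
    "SELECT c.id FROM c WHERE NOT IS_DEFINED(c._isEdge)",
    new FeedOptions
    {
        EnableCrossPartitionQuery = true
    }).AsDocumentQuery();
var vertices = await GetAllResultsAsync(queryVertexSql);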

Can we write?

Querying is all nice and good, but what about writing?

Let’s try to simply add a document in the graph:


private async static Task AddTrivialVertexAsync(
    DocumentClient client,
    Uri collectionUri)
{
    var response = await client.CreateDocumentAsync(
        collectionUri,
        new
        {
            id = "Carol",
            label = "person",
            department = "support character"
        });
    var json = GetJson(response.Resource);

    Console.WriteLine(json);
}

If we use the Gremlin Console to look at it:


gremlin> :> g.V("Carol")

==>[id:Carol,label:person,type:vertex,properties:[department:[[id:Carol|department,value:support character]]]]

Hence we see the new document as a vertex.  That makes sense since we’ve seen that vertices are projected as simple documents.

If we add other simple properties (i.e. flat values, like we did with label), this will not work:  those properties won’t show up in Gremlin.  That is because, as we’ve seen, Gremlin properties are always collections.  So we have to mimic that shape ourselves:


private async static Task AddVertexWithPropertiesAsync(
    DocumentClient client,
    Uri collectionUri)
{
    var response = await client.CreateDocumentAsync(
        collectionUri,
        new
        {
            id = "David",
            label = "person",
            age = new[] {
                new
                {
                    id = Guid.NewGuid().ToString(),
                    _value = 48
                }
            },
            department = "support character"
        });
    var json = GetJson(response.Resource);

    Console.WriteLine(json);
}

and in Gremlin:


gremlin> :> g.V("David").valueMap()

==>[age:[48],department:[support character]]

So it appears we can successfully write vertices in a graph using the DocumentDB API.

This is obviously useful to mass import graphs since there are a lot of tools out there that can import into DocumentDB.
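
For instance, a minimal import loop could look like the following sketch.  The people source and its fields are hypothetical ; the document shape simply mimics the projection we observed above:


// Hypothetical import loop:  'people' is whatever source we're importing from,
// e.g. an IEnumerable<(string Id, int Age, string Department)>
foreach (var person in people)
{
    await client.CreateDocumentAsync(
        collectionUri,
        new
        {
            id = person.Id,
            label = "person",
            // Gremlin expects vertex properties as collections of { id, _value }
            age = new[]
            {
                new { id = Guid.NewGuid().ToString(), _value = person.Age }
            },
            department = person.Department
        });
}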

Writing an edge

We can write vertices.  That is only half the equation for importing data in a graph.  What about edges?

It turns out we simply have to mimic what we’ve seen with existing edges:


private static async Task AddEdgeAsync(DocumentClient client, Uri collectionUri)
{
    var response = await client.CreateDocumentAsync(
        collectionUri,
        new
        {
            _isEdge = true,
            id = "CarolToAlice",
            label = "eavesdropOn",
            language = "English",
            department = "support character",
            _vertexId = "Carol",
            _vertexLabel = "person",
            _sink = "Alice",
            _sinkLabel = "person",
            _sinkPartition = "stereotype"
        });
    var json = GetJson(response.Resource);

    Console.WriteLine(json);
}

It is important for the edge’s partition to be the same as the source vertex’s, otherwise the edge won’t be seen by Gremlin.

We can validate the edge is now present in Gremlin:


gremlin> :> g.E()

==>[id:CarolToAlice,label:eavesdropOn,type:edge,inVLabel:person,outVLabel:person,inV:Alice,outV:Carol,properties:[language:English]]
 ==>[id:AliceToBob,label:communicatesWith,type:edge,inVLabel:person,outVLabel:person,inV:Bob,outV:Alice,properties:[language:English]]

gremlin> :> g.V("Carol").out("eavesdropOn")

==>[id:Alice,label:person,type:vertex,properties:[age:[[id:78109dc8-587f-4d87-9d2e-e4a1731dec2b,value:42]],department:[[id:Alice|department,value:stereotype]]]]

Summary

We’ve seen it is possible to both read and write to a Cosmos DB graph using the DocumentDB API.

It would also be possible to do so using the MongoDB API.

An obvious use is to leverage DocumentDB (or MongoDB) tools to manipulate a graph, e.g. for an initial load.


Cosmos DB Async Querying & Streaming

I wrote an article back in January 2015 about async querying Azure DocumentDB using the .NET SDK.

The service was still in preview back then.

Since then DocumentDB has been superseded by Azure Cosmos DB and the SDK has changed a bit so I thought I would rewrite that article.  Here it is.

LINQ was built into .NET / C# before async was.  That is probably the #1 reason why running LINQ queries on an asynchronously fetched data source is so awkward today.  This will likely change one day but until then…

Why Async?

Before we dive into the solution, let’s see why we would want to implement asynchrony in querying.

This was true in 2015 and I hope it is less so today:  a lot of people do not understand what asynchrony is for in .NET.  I always think it’s worthwhile to discuss it.

Let’s try the reverse psychology approach. Here is what asynchrony doesn’t bring us:

  • It doesn’t make our client (e.g. browser) asynchronous ; for instance, if we implement it in a service call, it doesn’t make the caller asynchronous (e.g. Ajax)
  • It doesn’t bring us performance per se
  • It doesn’t make our code run on multiple threads at once

Asynchrony allows us to… SCALE our server code.  It allows us to multiplex our server, i.e. to serve more concurrent requests.  If we do not have scaling issues, we might not need asynchrony.

The reason it allows us to scale is that when we async / await on an I/O call (e.g. a Cosmos DB remote call), it frees the current thread to be used by another request until the call comes back, allowing us to serve more requests with fewer threads and less memory.
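
As a tiny illustration (a hypothetical method, not taken from the sample code), every await below hands the current thread back to the pool until Cosmos DB responds:

// Hypothetical:  count documents without holding onto a thread during I/O
private async static Task<int> CountDocumentsAsync(IDocumentQuery<Document> query)
{
    var count = 0;

    while (query.HasMoreResults)
    {
        // The thread is released here and the method resumes (possibly on a
        // different pool thread) once the batch comes back from Cosmos DB
        var batch = await query.ExecuteNextAsync<Document>();

        count += batch.Count;
    }

    return count;
}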

Solution

The code is available on GitHub, more specifically in the Program.cs file.

The important part is to recognize that the query object (IDocumentQuery<T>) from the SDK is an asynchronous interface.  It fetches new results in batches.  So we can write a method to fetch all the results like this one:

private async static Task<T[]> GetAllResultsAsync<T>(IDocumentQuery<T> queryAll)
{
    var list = new List<T>();

    while (queryAll.HasMoreResults)
    {
        var docs = await queryAll.ExecuteNextAsync<T>();

        foreach (var d in docs)
        {
            list.Add(d);
        }
    }

    return list.ToArray();
}

Or one that allows us to process all the items in the query with an action:

private async static Task<int> ProcessAllResultsAsync<T>(
    IDocumentQuery<T> queryAll,
    Action<T> action)
{
    int count = 0;

    while (queryAll.HasMoreResults)
    {
        var docs = await queryAll.ExecuteNextAsync<T>();

        foreach (var d in docs)
        {
            action(d);
            ++count;
        }
    }

    return count;
}

We can create a query object with no fancy LINQ expression, i.e. basically querying the entire collection, like this:

var client = new DocumentClient(new Uri(SERVICE_ENDPOINT), AUTH_KEY);
var collectionUri = UriFactory.CreateDocumentCollectionUri(DATABASE, COLLECTION);
var query = client.CreateDocumentQuery(
    collectionUri,
    new FeedOptions
    {
        EnableCrossPartitionQuery = true
    });
var queryAll = query.AsDocumentQuery();

That code basically builds a query over the entire collection, returning plain Document objects.
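
Draining that query object with the helper above is then a single awaited call:

// Fetch every batch and flatten into an array (see GetAllResultsAsync above)
var all = await GetAllResultsAsync(queryAll);

Console.WriteLine($"Collection contains {all.Length} documents");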

We could also serialize into a custom object and filter the query:

var query = client.CreateDocumentQuery<MinimalDoc>(
    collectionUri,
    new FeedOptions
    {
        EnableCrossPartitionQuery = true
    });
var queryNoDog = (from d in query
                    where d.id != "Dog"
                    select d).AsDocumentQuery();
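
We can then drain or stream the filtered query with either helper ; for instance, a small sketch reusing ProcessAllResultsAsync from above:

// Stream the filtered results, invoking the action once per document
var count = await ProcessAllResultsAsync(
    queryNoDog,
    d => Console.WriteLine(d.id));

Console.WriteLine($"Processed {count} documents");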

In the code sample there are 4 examples using different variations.

Summary

Asynchrony is a powerful way to scale service-side code.

Cosmos DB allows us to do that in an easy way as was demonstrated in this article.

Hyperspheres & the curse of dimensionality

I previously talked about the curse of dimensionality (more than 2 years ago) related to Machine Learning.

Here I wanted to discuss it in more depth and dive into the mathematics of it.

High dimensions might sound like Physics’ string theory where our universe is made of more than 4 dimensions.  This isn’t what we are talking about here.

The curse of dimensionality is related to what happens when a model deals with a data space with dimensions in the hundreds or thousands.

As the title of this article suggests, we’re going to take the angle of the properties of Hyperspheres (spheres in N dimensions) to explore high dimension properties.

This article is inspired by Foundations of Data Science by John Hopcroft & Ravindran Kannan (chapter 2).

Why should I care about High Dimension?

When introducing Machine Learning concepts, we typically use few dimensions in order to help visualization.  For instance, when I introduced linear regression or polynomial regression in past articles, I used datasets in two dimensions and plotted them on a chart.

In the real world, typical data sets have many more dimensions.

A typical case of high dimension is image recognition (or character recognition as a sub-category) where even a low resolution picture will have hundreds of pixels.  The corresponding model would take a gray-scale input vector of dimension 100+.

With fraud detection, transactions do not contain only the value of the transaction, but the time of day, day of week, geo-location, type of commerce, type of products, etc.  This might or might not be a high dimension problem, depending on the available data.

In an e-commerce web site, a Product recommendation algorithm could be as simple as an N x N matrix of 0 to 1 values where N is the number of products.

With IoT, multiple sensors feed a prediction model.

In bioinformatics, DNA sequencing generates a huge amount of data which often is arranged in high dimensional model.

Basically, high dimensions crop up everywhere.

What happens as dimension increases?

For starters, a space with more dimensions simply is…  bigger.  In order to sample a space with 2 dimensions with a resolution of 10 units, we need to have 10^2 = 100 points.  Having the same sampling in a space of dimension 3 would require 10^3 = 1000 points.  Dimension 20?  That would require 10^20 = 100 000 000 000 000 000 000 points.

Right off the bat we can tell that sampling a space of dimension 2 or 3 is realistic, while for a space of dimension 20 it isn’t.  Hence we are likely going to suffer from under-sampling.

Yoshua Bengio has a nice discussion about Curse of Dimensionality here.

Hypersphere in a cube

Beyond sampling problems, metrics & measures change behaviour at high dimensions.  Intuitively it makes sense since a measure takes a vector (or vectors) and squeezes it (them) into a numerical value ; the higher the dimension, the more data we squeeze into one number & hence we should lose information.

We use metrics & measures heavily in Machine Learning.  For instance, a lot of cost (or loss) functions are based on the (squared) Euclidean distance:

dist(x,y) = \displaystyle\sum_{i=1}^N (x_i-y_i)^2

Now if x and / or y are random variables (e.g. samples), the law of large numbers applies when N becomes large.  This implies the sum will trend to its expected value with a narrower relative standard deviation as N increases.  In turn, this means there is less and less information in the distance as the number of dimensions increases.
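
To sketch the argument (my notation, not from the original article):  if the coordinate differences are i.i.d. with \mu = E[(x_i-y_i)^2] and \sigma^2 = Var[(x_i-y_i)^2], then

E\left[\displaystyle\sum_{i=1}^N (x_i-y_i)^2\right] = N\mu \qquad \sigma\left(\displaystyle\sum_{i=1}^N (x_i-y_i)^2\right) = \sqrt{N}\sigma

so the relative spread \frac{\sqrt{N}\sigma}{N\mu} = \frac{\sigma}{\mu\sqrt{N}} shrinks like \frac{1}{\sqrt{N}} :  at high dimension, all pairwise distances start looking alike.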

This brings us to the hypersphere.  A hypersphere’s equation is

\displaystyle\sum_{i=1}^N x_i^2 = R^2

where x is a point of dimension N and R is the radius of the hypersphere.

A hypersphere of dimension 1 is a line (segment), a hypersphere of dimension 2 is a circle, dimension 3 is a sphere, dimension 4 is an…  expanding universe?  And so on.

A theorem I’ll demonstrate in a future article is that the volume of a hypersphere of radius 1 tends to zero as the dimension increases.

UPDATE (12-07-2017):  Demonstration of hypersphere hyper volume is done in this article.
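
For reference, the unit-radius values listed below all come from the standard closed form for the hyper volume of a hypersphere of dimension N (stated here without proof ; the demonstration is in the linked article):

V_N(1) = \displaystyle\frac{\pi^{N/2}}{\Gamma(\frac{N}{2}+1)}

The Gamma function in the denominator eventually grows faster than the power of \pi in the numerator, which is why the volume tends to zero.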

This is fairly unintuitive, so let me give real numerical values:

Dimension Hyper Volume
1 2
2 3.141592654
3 4.188790205
4 4.934802201
5 5.263789014
6 5.16771278
7 4.72476597
8 4.058712126
9 3.298508903
10 2.55016404
11 1.884103879
12 1.335262769
13 0.910628755
14 0.599264529
15 0.381443281
16 0.23533063
17 0.140981107
18 0.082145887
19 0.046621601
20 0.025806891
21 0.01394915
22 0.007370431
23 0.003810656
24 0.001929574
25 0.000957722
26 0.000466303
27 0.000222872
28 0.000104638

If we plot those values:

[Chart:  hyper volume of the unit-radius hypersphere as a function of dimension]

We see the hyper volume increases in the first couple of dimensions.  A circle of radius 1 has an area of pi (3.1416) while a sphere of radius 1 has a volume of 4.19.  It peaks at dimension 5 and then shrinks.

It is unintuitive because in 2 and 3 dimensions (the only dimensions in which we can visualize a hypersphere), the hypersphere pretty much fills its embedding cube.  A way to “visualize” what’s happening in higher dimensions is to consider a “diagonal” into a hypersphere.

For a circle, the diagonal (i.e. at 45°) intersects with the unit circle at

(\frac {1} {\sqrt {2}}, \frac {1} {\sqrt {2}}) since (\frac {1} {\sqrt {2}})^2 + (\frac {1} {\sqrt {2}})^2 = 1^2

In general, at dimension N, the diagonal intersects at

x_i = \frac {1} {\sqrt {N}}

So, although the hypersphere of radius 1 touches the cube of side 2 centered at the origin on each of its walls, the surface of the hypersphere, in general, gets further and further away from the cube’s surface in the diagonal directions as the dimension increases.
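
To quantify that:  the cube’s corner (1, 1, \dots, 1) sits at distance \sqrt{1^2 + \dots + 1^2} = \sqrt{N} from the origin, while the hypersphere’s surface stays at distance 1, so along the diagonal the gap \sqrt{N} - 1 keeps growing with the dimension.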

Consequences of the hypersphere volume

A straightforward consequence of the hypersphere volume is sampling.  Randomly sampling a square of side 2 centered at the origin will land points within the unit circle with probability \frac{\pi}{4} \approx 79\%.  The same process with a hypersphere of dimension 8 would hit the inside of the hypersphere with a probability of only 1.6%.
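
A quick way to convince ourselves is to sample the cube at random and count the hits.  Here is a minimal Monte Carlo sketch (a hypothetical helper, not part of any sample code):

// Estimate the probability that a uniform point in the cube [-1,1]^N
// falls inside the unit hypersphere
private static double EstimateInsideProbability(int dimension, int samples)
{
    var random = new Random(42);
    var inside = 0;

    for (var i = 0; i != samples; ++i)
    {
        var squaredNorm = 0.0;

        for (var d = 0; d != dimension; ++d)
        {
            var coordinate = 2 * random.NextDouble() - 1;   // uniform in [-1, 1]

            squaredNorm += coordinate * coordinate;
        }
        if (squaredNorm <= 1)
        {
            ++inside;
        }
    }

    return (double) inside / samples;
}

// EstimateInsideProbability(2, 1000000)  ~ 0.785  (pi / 4)
// EstimateInsideProbability(8, 1000000)  ~ 0.016
// EstimateInsideProbability(20, 1000000) ~ 0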

A corollary to the hypersphere volume is that at higher dimension, the bulk of the volume of the hypersphere is concentrated in a thin annulus below its surface.  An obvious consequence of that is that optimizing a metric (i.e. a distance) in high dimension is difficult.

What should we do about it?

First step is to be aware of it.

A symptom of high dimensionality is under-sampling:  the space covered is so large that the number of sample points required to learn the underlying model is likely to exceed the size of the actual sample set.

The simplest solution is to avoid high dimensionality with some pre-processing.  For instance, if we have a priori knowledge of the domain, we might be able to combine dimensions together.  For example, in an IoT field with 10 000 sensors, for many reasons including the curse of dimensionality, it wouldn’t be a good idea to consider each sensor’s input as an independent input.  It would be worth trying to aggregate the sensor inputs by analyzing the data.

Summary

Some Machine Learning algorithms will be more sensitive to higher dimensionality than others but the curse of dimensionality affects most algorithms.

It is a problem to be aware of and we should be ready to mitigate it with some good feature engineering.

Extended Outage @ Instapaper – Resiliency example

I use Instapaper extensively to store the continuous flow of internet articles I want to read.  I created a bunch of tools integrating with it (e.g. monitoring atom feeds and sending new articles to Instapaper).

Last week my tools didn’t work for a while so I finally logged in directly to the site.  The site was down, citing an extended outage, with a reference to a blog article to explain the outage.

It got back on its feet after around 48 hours.  This isn’t an article to call out Instapaper’s engineering:  that type of outage happens everywhere all the time.  But let’s learn from it.

The article cites the cause of the outage as being “we hit a system limit for our hosted database that’s preventing new articles from being saved”.  The article also cites that they had to spend time on the phone with their cloud provider (they didn’t mention which one) before diagnosing the problem.

Know the limits

They hit a limit.

What are the limits?

A clear piece of advice when working in Azure:  know where your sandbox ends.  We should always consult http://aka.ms/azurelimits to inform our architecture.

The nice thing about the cloud is that most limits are clearly defined and embedded in SLAs.

This comes as a surprise to a lot of people, especially when they see some of the restrictions.  “How come?  I can only put 2 disks on a Standard D1 v2 VM?”  This is because of the experience we have on premises, where there are few hard-wired limitations.  Instead, we typically have degradations, e.g. sure, you can put a terabyte of storage on your old laptop, but you are likely going to saturate its IOs before you exhaust the storage.  In Azure, it is more clearly defined because Azure is a public cloud, i.e. a multi-tenant environment.  The only way for Azure to guarantee other customers they can have capacity X is to make sure we do not use that capacity.  So there are sandboxes everywhere.

On the flip side, we have fewer surprises.  The reason we do not have so many sandboxes on premises is that everything is shared.  That is nice as long as we are the only one cranking up the resource usage.  But when other systems start grinding that SAN, it isn’t so fast anymore.

So the first lesson we get from that Instapaper outage is to know the limits.  Did the Instapaper team know the limits of their database?  Did they hit some software limit, e.g. did they use all the 32 bits of an identifier column?

Monitor for those limits

Now that we know the limits, we know when our system is going to break.

So we need to monitor it so that we won’t have a bad surprise as the Instapaper team must have had (I’m sure some dinners were cancelled during those 48 hours).

In Azure, we can use Azure Monitor / Log Analytics to monitor different metrics on our resources.  We can setup alerts to be notified when some threshold has been reached.

We can then react.  If we set the threshold low enough, that will give us a window of time to react and to sketch a plan.

In an article on SQL Database sizes in an elastic pool, we saw that we can fix a limit on database size to make sure it doesn’t consume the allowed size of the entire pool.  This is a safeguard mechanism in case our monitoring strategy fails.

Strategy to get beyond the limits

We know the limits of our architecture.  We monitor our system to get some heads up before those limits are reached.

When they are reached, what do we do?

Typically, if we’re aware of the limits, we will have given it some thought, even if it isn’t fully documented.  This is something that goes into handover conversations, actually:  “this system has this limit, but you know, if that ever becomes a problem, consider that alternative”.  Hopefully, the alternative doesn’t consist of re-writing the entire system.

Scale out / partitioning strategies

If we are lucky, the original architect(s) of the system have baked in a way to overcome its limits.

A typical way is to partition the system.  We can partition the system per user / group of users or other data segments.  This way, we can scale out the system to handle different partitions.

We can start with only one set of resources (e.g. Virtual Machines) handling all partitions and, the day that set of resources hits its limits, we can split the partitions into two groups and have another set of resources handling the second group of partitions.  And so on.
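
As a minimal illustration (hypothetical types and names ; none of this comes from Instapaper or Azure), the routing piece of such a partitioning scheme can be as small as a map from partition to resource set, which we rebalance when we split:

// Hypothetical partition router:  users are hashed into 16 logical partitions,
// and partitions are spread over however many resource sets currently exist.
// Splitting the system later only means adding entries to _resourceSets.
public static class PartitionRouter
{
    private static readonly IReadOnlyList<string> _resourceSets =
        new List<string> { "resource-set-1" };  // add "resource-set-2", ... when limits are hit

    public static string GetResourceSetForUser(string userId)
    {
        // A stable hash (not string.GetHashCode, which can vary across processes)
        // would be used in practice ; this is only a sketch
        var partition = Math.Abs(userId.GetHashCode()) % 16;
        var setIndex = partition * _resourceSets.Count / 16;

        return _resourceSets[setIndex];
    }
}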

One day, that partition scheme might also hit its limits.  For instance, maybe we have a set of resources handling only one partition each and most of those resource sets have hit their limits.  In that case, we’re back to figuring out how to go beyond the limits of our system.  Typically, that will consist of repartitioning it in a way that lets us scale out further.

Azure is a partitioned / scaled out system.  That strategy has allowed it to grow to its current size.  Along the way, repartitioning was needed.  For instance, the transition of ASM to ARM was partially that.  In ASM, there were only 2 regions handling the Azure Management APIs while in ARM, each region handles API requests.

Conclusion

Instapaper was victim of its own success.  That’s a good problem to have but a problem nevertheless.

Make sure you at least know the limits of your system and monitor for them.  This way, if success curses you, you’ll be able to react during working hours instead of cancelling your Valentine’s dinner and spending your days justifying yourself to your CEO.

Azure SQL Elastic Pool – Moving databases across pools using PowerShell


I’ve written a bit about Azure SQL Elastic Pool lately:  an overview, about ARM template and about database size.

One of the many great features of Azure SQL Elastic Pool is that like Azure SQL Database (standalone), we can change the eDTU capacity of the pool “on the fly”, without downtime.

Unlike its standalone cousin though, we can’t change the edition of the pool.  The edition is either Basic, Standard or Premium.  It is set at creation and is immutable after that.

If we want to change the edition of a pool, the obvious way is to create another pool, move the databases there, delete the original, recreate it with a different edition and move the databases back.

This article shows how to do that using PowerShell.

You might want to move databases around for other reasons, typically to optimize the density and performance of pools.  You would then use a very similar script.

Look at the pool

Let’s start with the pools we established with the sample ARM template of a previous article.

From there we can look at the pool Pool-A using the following PowerShell command:


$old = Get-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName Pool-A -ServerName pooldemoserver

$old

We can see the pool’s current edition is Standard while its Database Transaction Unit (DTU) count is 200.


Create a temporary pool

We’ll create a temporary pool, aptly named temporary, attached to the same server:


$temp = New-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName Temporary -ServerName pooldemoserver -Edition $old.Edition -Dtu $old.Dtu

$temp

It’s important to create a pool that the databases can be moved into.  The maximum size of a database depends on the edition and the number of DTUs of the elastic pool.  The easiest way is to create a pool with the same edition / DTU and this is what we do here by referencing the $old variable.

Move databases across

First, let’s grab the databases in the original pool:


$dbs = Get-AzureRmSqlDatabase -ResourceGroupName DBs -ServerName pooldemoserver | where {$_.ElasticPoolName -eq $old.ElasticPoolName}

$dbs | select DatabaseName

ElasticPoolName is a property of a database.  We’ll simply change it by setting each database:


$dbs | foreach {Set-AzureRmSqlDatabase -ResourceGroupName DBs -ServerName pooldemoserver -DatabaseName $_.DatabaseName -ElasticPoolName $temp.ElasticPoolName}

That command takes longer to run as the databases have to move from one compute to another.

Delete / Recreate pool

We can now delete the original pool.  It’s important to note that we wouldn’t have been able to delete a pool with databases in it.


Remove-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName $old.ElasticPoolName -ServerName pooldemoserver

$new = New-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName $old.ElasticPoolName -ServerName pooldemoserver -Edition Premium -Dtu 250

The second line recreates it with Premium edition.  We could keep the original DTU, but it’s not always possible since different editions support different DTU values.  In this case, for instance, it wasn’t possible since 200 DTUs isn’t supported for Premium pools.

If you execute those two commands without pausing in between, you will likely receive an error.  It is one of those cases where the Azure REST API returns, the resource you asked to be removed seems to be removed, but you can’t really recreate it yet.  An easy workaround consists of pausing or retrying.

Move databases back

We can then move the databases back to the new pool:


$dbs | foreach {Set-AzureRmSqlDatabase -ResourceGroupName DBs -ServerName pooldemoserver -DatabaseName $_.DatabaseName -ElasticPoolName $new.ElasticPoolName}

Remove-AzureRmSqlElasticPool -ResourceGroupName DBs -ElasticPoolName $temp.ElasticPoolName -ServerName pooldemoserver

In the second line we delete the temporary pool.  Again, this takes a little longer to execute since databases must be moved from one compute to another.

Summary

We showed how to move databases from one pool to another.

The pretext was a change in elastic pool edition but we might want to move databases around for other reasons.

In practice you might not want to move your databases twice to avoid the duration of the operation and might be happy to have a different pool name.  In the demo we did, the move took less than a minute because we had two empty databases.  With many databases totaling a lot of storage it would take much more time to move those.

Azure SQL Elastic Pool – Database Size

I mentioned in a past article, regarding database sizes within an elastic pool:

“No policies limit an individual database to take more storage although a database maximum size can be set on a per-database basis.”

I’m going to focus on that in this article.

An Azure SQL Database resource has a MaxSizeInBytes property.  We can set it either in an ARM template (see this ARM template and the property maxSizeBytes) or in PowerShell.

An interesting aspect of that property is that:

  • It takes only specific values
  • Not all values are permitted, depending on the elastic pool edition (i.e. Basic, Standard or Premium)

Valid values

One way to find the valid values is to navigate to the ARM schema.  That documented schema likely is slightly out of date since, as of December 2016, the largest value is 500GB, which isn’t the largest possible database size (1 TB for a P15).

The online documentation of Set-AzureRmSqlDatabase isn’t faring much better as the documentation for the MaxSizeBytes parameter refers to a parameter MaxSizeGB to know about the acceptable values.  Problem is, the MaxSizeGB parameter doesn’t exist.

But let’s start with the documented schema as it probably only lacks the most recent DB sizes.

Using that schema’s list of possible values and comparing it with the standalone database sizes for given editions, we can conclude (after testing with ARM templates, of course) that a Basic pool can have databases of up to 2 GB, a Standard pool up to 250 GB and, of course, a Premium pool can take all values.

It is important to notice that the pool can have larger storage.  For instance, even the smallest basic pool, with 50 eDTUs, can have a maximum storage of 5 GB.  But each DB within that pool can only grow up to 2 GB.

That gives us the following landscape:

Maximum Size (in bytes) Maximum Size (in GB) Available for (edition)
104857600 0.1 Premium, Standard, Basic
524288000 0.5 Premium, Standard, Basic
1073741824 1 Premium, Standard, Basic
2147483648 2 Premium, Standard, Basic
5368709120 5 Premium, Standard
10737418240 10 Premium, Standard
21474836480 20 Premium, Standard
32212254720 30 Premium, Standard
42949672960 40 Premium, Standard
53687091200 50 Premium, Standard
107374182400 100 Premium, Standard
161061273600 150 Premium, Standard
214748364800 200 Premium, Standard
268435456000 250 Premium, Standard
322122547200 300 Premium
429496729600 400 Premium
536870912000 500 Premium

Storage Policies

We can now use this maximum database size as a storage policy, i.e. a way to make sure a single database doesn’t take all the storage available in a pool.

Now, this isn’t as trivially useful as the eDTU min / max we’ve seen in a pool.  In the eDTU case, that was controlling how much compute was given to a database at all times.  In the case of a database maximum size, once the database reaches that size, it becomes read only.  That will likely break our applications running on top of it unless we planned for it.

A better approach would be to monitor the different databases and react to size changes, by moving the database to another pool for instance.

The maximum size could be a safeguard though.  For instance, let’s imagine we want each database in a pool to stay below 50 GB and we’ll monitor for that and raise alerts in case that threshold is reached (see Azure Monitor for monitoring and alerts).  Now we might still put a maximum size of 100 GB on the databases.  This would act as a safeguard:  if we do not do anything about a database outgrowing its target 50 GB, it still won’t be able to grow indefinitely and reach the pool maximum size, which would make the entire pool read only, affecting ALL the databases in the pool.

In that sense the maximum size still acts as a resource governor, preventing a noisy neighbour effect.

PowerShell example

We can’t change a database maximum size in the portal (as of December 2016).

Using an ARM template, it is easy to change the parameter.  Here, let’s simply show how we would change it for an existing database.

Building on the example we gave in a previous article, we can easily grab the Pool-A-Db0 database in resource group DBs and server pooldemoserver:


Get-AzureRmSqlDatabase -ServerName pooldemoserver -ResourceGroupName DBs -DatabaseName Pool-A-Db0


We can see the size is the one that was specified in the ARM template (ARM parameter DB Max Size default value), i.e. 10 GB.  We can bump it to 50 GB, i.e. 53687091200 bytes:


Set-AzureRmSqlDatabase -ServerName pooldemoserver -ResourceGroupName DBs -DatabaseName Pool-A-Db0 -MaxSizeBytes 53687091200

We can confirm the change in the portal by looking at the properties.


Default Behaviour

If the MaxSizeBytes property is omitted, either in an ARM Template or a New-AzureRmSqlDatabase PowerShell cmdlet, the default behaviour is for the database to have the maximum capacity (e.g. for Standard, 250 GB).

After creation, we can’t set the property value to null to obtain the same effect.  Omitting the parameter simply keeps the previously set value.

Summary

We’ve looked at the maximum size property of a database.

It can be used to control the growth of a database inside a pool and prevent one database’s growth from affecting the others.

Azure SQL Elastic Pool – ARM Templates

In my last article, I covered Azure SQL Elastic Pool.  In this one I cover how to provision it using ARM templates.

As of today (December 2016), the documentation about Azure SQL Elastic Pool provisioning via ARM templates is…  non-existent.

Searching for it, I was able to gather hints via a few colleagues’ GitHub repos, but there are no examples in the ARM quickstart templates nor is the elastic pool resource schema documented.  Also, the Automation Script feature in the portal doesn’t reverse engineer an ARM template for the elastic pool.

So I hope this article fills that gap and is easy to search for & consume.

ARM Template

Here we’re going to provision a Server with two pools, Pool-A & Pool-B (yeah, sounds a bit like Thing 1 & Thing 2), each having a few (configurable number of) databases in them.

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "Server Name": {
      "defaultValue": "pooldemoserver",
      "type": "string",
      "metadata": {
        "description": "Name of the SQL:  needs to be unique among all servers in Azure"
      }
    },
    "Admin Login": {
      "defaultValue": "myadmin",
      "type": "string",
      "metadata": {
        "description": "SQL Server Admin login name"
      }
    },
    "Admin Password": {
      "type": "securestring",
      "metadata": {
        "description": "SQL Server Admin login password"
      }
    },
    "Pool A Edition": {
      "defaultValue": "Standard",
      "type": "string",
      "allowedValues": [
        "Basic",
        "Standard",
        "Premium"
      ],
      "metadata": {
        "description": "Pool A Edition"
      }
    },
    "Pool B Edition": {
      "defaultValue": "Standard",
      "type": "string",
      "allowedValues": [
        "Basic",
        "Standard",
        "Premium"
      ],
      "metadata": {
        "description": "Pool B Edition"
      }
    },
    "DB Max Size": {
      "defaultValue": "10737418240",
      "type": "string",
      "allowedValues": [
        "104857600",
        "524288000",
        "1073741824",
        "2147483648",
        "5368709120",
        "10737418240",
        "21474836480",
        "32212254720",
        "42949672960",
        "53687091200",
        "107374182400",
        "161061273600",
        "214748364800",
        "268435456000",
        "322122547200",
        "429496729600",
        "536870912000"
      ],
      "metadata": {
        "description": "DB Max Size, in bytes"
      }
    }
  },
  "variables": {
    "Pool A": "Pool-A",
    "Pool B": "Pool-B",
    "DB A Prefix": "Pool-A-Db",
    "DB B Prefix": "Pool-B-Db",
    "Count A": 2,
    "Count B": 4
  },
  "resources": [
    {
      "name": "[parameters('Server Name')]",
      "type": "Microsoft.Sql/servers",
      "apiVersion": "2014-04-01-preview",
      "location": "[resourceGroup().location]",
      "dependsOn": [],
      "properties": {
        "administratorLogin": "[parameters('Admin Login')]",
        "administratorLoginPassword": "[parameters('Admin Password')]",
        "version": "12.0"
      },
      "resources": [
        {
          "type": "firewallRules",
          "kind": "v12.0",
          "name": "AllowAllAzureIps",
          "apiVersion": "2014-04-01-preview",
          "location": "[resourceGroup().location]",
          "dependsOn": [
            "[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]"
          ],
          "properties": {
            "startIpAddress": "0.0.0.0",
            "endIpAddress": "0.0.0.0"
          }
        },
        {
          "type": "elasticpools",
          "name": "[variables('Pool A')]",
          "apiVersion": "2014-04-01-preview",
          "location": "[resourceGroup().location]",
          "dependsOn": [
            "[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]"
          ],
          "properties": {
            "edition": "[parameters('Pool A Edition')]",
            "dtu": "200",
            "databaseDtuMin": "10",
            "databaseDtuMax": "50"
          }
        },
        {
          "type": "elasticpools",
          "name": "[variables('Pool B')]",
          "apiVersion": "2014-04-01-preview",
          "location": "[resourceGroup().location]",
          "dependsOn": [
            "[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]"
          ],
          "properties": {
            "edition": "[parameters('Pool B Edition')]",
            "dtu": "400",
            "databaseDtuMin": "0",
            "databaseDtuMax": null
          }
        }
      ]
    },
    {
      "type": "Microsoft.Sql/servers/databases",
      "copy": {
        "name": "DBs-A",
        "count": "[variables('Count A')]"
      },
      "name": "[concat(parameters('Server Name'), '/', variables('DB A Prefix'), copyIndex())]",
      "location": "[resourceGroup().location]",
      "dependsOn": [
        "[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]",
        "[resourceId('Microsoft.Sql/servers/elasticpools', parameters('Server Name'), variables('Pool A'))]"
      ],
      "tags": {
        "displayName": "Pool-A DBs"
      },
      "apiVersion": "2014-04-01-preview",
      "properties": {
        "collation": "SQL_Latin1_General_CP1_CI_AS",
        "maxSizeBytes": "[parameters('DB Max Size')]",
        "requestedServiceObjectiveName": "ElasticPool",
        "elasticPoolName": "[variables('Pool A')]"
      }
    },
    {
      "type": "Microsoft.Sql/servers/databases",
      "copy": {
        "name": "DBs-B",
        "count": "[variables('Count B')]"
      },
      "name": "[concat(parameters('Server Name'), '/', variables('DB B Prefix'), copyIndex())]",
      "location": "[resourceGroup().location]",
      "dependsOn": [
        "[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]",
        "[resourceId('Microsoft.Sql/servers/elasticpools', parameters('Server Name'), variables('Pool B'))]"
      ],
      "tags": {
        "displayName": "Pool-B DBs"
      },
      "apiVersion": "2014-04-01-preview",
      "properties": {
        "edition": "[parameters('Pool B Edition')]",
        "collation": "SQL_Latin1_General_CP1_CI_AS",
        "maxSizeBytes": "[parameters('DB Max Size')]",
        "requestedServiceObjectiveName": "ElasticPool",
        "elasticPoolName": "[variables('Pool B')]"
      }
    }
  ]
}

We can deploy the template as is.  We’ll need to enter at least an Admin password (for the Azure SQL server).

The “Server Name” parameter must be unique throughout Azure (not just your subscription).  So if it happens to be taken when you try to deploy the template (in which case you would receive an error message around Server ‘pooldemoserver’ is busy with another operation), try a new, more original name.

Each parameter is documented in the metadata description.

Results

Let’s look at the result.  Let’s first go to the resource group where we deployed the template.

In the resource list we should see the following:

[Screenshot:  resources created by the template]

We first have our server, with default name pooldemoserver, our two pools, Pool-A & Pool-B, and 6 databases.

Let’s select Pool-A.


We can see the pool is of Standard edition, has 200 eDTUs with a minimum of 10 eDTUs and a maximum of 50 per database, which is faithful to its ARM definition (repeated below).

        {
          "type": "elasticpools",
          "name": "[variables('Pool A')]",
          "apiVersion": "2014-04-01-preview",
          "location": "[resourceGroup().location]",
          "dependsOn": [
            "[resourceId('Microsoft.Sql/servers', parameters('Server Name'))]"
          ],
          "properties": {
            "edition": "[parameters('Pool A Edition')]",
            "dtu": "200",
            "databaseDtuMin": "10",
            "databaseDtuMax": "50"
          }
        }

Similarly, Pool-B has a minimum of 0 and a maximum of 100.  The maximum was set to null in the template and hence is the maximum allowed for a standard pool of 400 DTUs.

Let’s select the databases in Pool-B.  Alternatively, we can select the Configure pool tool bar option.


The following pane shows us the eDTUs consumed in the last 14 days.  It also allows us to change the assigned eDTUs to the pool.

It is in this pane that we can add / remove databases from the pool.


In order to remove databases from the pool, they must first be selected in the lower right corner of the pane.  We will have to choose a standalone pricing tier for each DB and hit save.  As of today (December 2016), there is no way to move databases from one pool to another directly, i.e. they must first be converted to standalone.  It is possible to move databases from one pool to another using PowerShell though, as I’ll demonstrate in a future article.

If we go back to the resource group and select any of the databases, we have a link to its parent pool.


Summary

Despite the current lack (as of December 2016) of documentation around it, it is quite possible to create databases within an elastic pool using ARM templates as we’ve demonstrated here.