Hacking: changing Cosmos DB Portal experience from Graph to SQL

In the last article, we looked at how we could access a graph using the SQL (aka DocumentDB) API.

Here we’ll explore how we can switch the Portal experience from one to the other.

Portal Experience

The Portal Experience refers to the way the portal lets us interact with Cosmos DB Data.  It’s basically the Data Explorer experience.

Here we have the Cosmos DB Graph experience:

[Screenshot: the Graph experience in Data Explorer]

The Data Explorer lets us access the Graph using Gremlin and displays results in a Graph UI experience (i.e. showing vertices & edges).

Let’s compare this to the Cosmos DB SQL (aka DocumentDB) experience:

[Screenshot: the SQL (DocumentDB) experience in Data Explorer]

Here we query collections using SQL queries and results are shown as JSON documents.

Cosmos DB in ARM

The schema for the ARM (JSON) template of a Cosmos DB database account is documented here.

There are two important properties controlling the Cosmos DB model (i.e. SQL, Graph, Table or MongoDB):  kind and defaultExperience (on the fourth and seventh lines of the template below, respectively).

{
  "apiVersion": "2015-04-08",
  "type": "Microsoft.DocumentDB/databaseAccounts",
  "kind": "[parameters('kind')]",
  "name": "[parameters('databaseAccountName')]",
  "tags": {
    "defaultExperience": "[parameters('experience')]"
  },
  "location": "[resourceGroup().location]",
  "properties": {
    "name": "[parameters('databaseAccountName')]",
    "databaseAccountOfferType": "[variables('offerType')]",
    "consistencyPolicy": {
      "defaultConsistencyLevel": "[parameters('consistencyLevel')]",
      "maxStalenessPrefix": "[parameters('maxStalenessPrefix')]",
      "maxIntervalInSeconds": "[parameters('maxIntervalInSeconds')]"
    }
  }
}

Kind takes the following values:  GlobalDocumentDB, MongoDB & Parse.  It defines how the database engine is configured.  This property must be supplied at creation time and can’t be changed afterwards.

DefaultExperience takes the following values:  DocumentDB, MongoDB, Graph & Table.  It influences only how the portal behaves.  This property is optional and can be changed in any update deployment.

When creating a Cosmos DB account in the Portal, here is the mapping of the values.  The left-hand column, API, refers to the drop-down value selected in the portal at account creation.

API | Kind | Default Experience
SQL (DocumentDB) | GlobalDocumentDB | DocumentDB
MongoDB | MongoDB | MongoDB
Gremlin (graph) | GlobalDocumentDB | Graph
Table (key-value) | GlobalDocumentDB | Table

We notice the Kind value Parse isn’t used by any of those four models.  It is used for the Parse Server offering.

Changing the experience

With all that said, we can easily change the default experience from one ARM deployment to another.  A template is available on GitHub.

Also, since the experience is a simple tag, it can be changed using PowerShell or even the Portal.

[Screenshot: editing the defaultExperience tag in the Portal]

Summary

Although the fundamental database engine is set at the creation of the account, the portal experience can be changed.

Therefore, if it is convenient to change the experience in order to execute some tasks, it is possible to do so without impacting the underlying database.


Hacking: accessing a graph in Cosmos DB with SQL / DocumentDB API

Azure Cosmos DB is Microsoft’s globally distributed multi-model database service.

At this point in time (August 2017) there are four supported models:  DocumentDB (also named SQL because the query language is similar to T-SQL), MongoDB, Table & Gremlin.

We’ve seen how to use Cosmos DB with Gremlin in a past article.

Now here’s a little secret:  although we choose the “model” (e.g. Gremlin) at the Cosmos DB account level, we can use other models to query the data.

Not all combinations are possible, but many are.  Specifically, we can query a Gremlin graph using the DocumentDB / SQL query language.

The graph is then projected into documents.

We will explore that in this article.

Why is that interesting?  Because there are a lot of tools out there we might be familiar with to manipulate DocumentDB (or MongoDB).  Having the possibility to look at a graph with other APIs extends our toolset beyond Gremlin-based ones.

Creating a simple graph in Gremlin

Let’s create a simple graph in a Cosmos DB using Gremlin.  In a past article we’ve looked at how to setup Gremlin with Cosmos DB.


gremlin> :remote connect tinkerpop.server conf/remote-secure.yaml

gremlin> :> g.addV('person').property('id', 'Alice').property('age', 42).property('department', 'stereotype')

gremlin> :> g.addV('person').property('id', 'Bob').property('age', 24).property('department', 'support character')

gremlin> :> g.V('Alice').addE('communicatesWith').property('id', 'AliceToBob').property('language', 'English').to(g.V('Bob'))

The first line is there to connect to the remote server we configured in remote-secure.yaml.  For details see the setup article.

We now have a toy graph with two vertices connected with one edge.  Nothing too fancy but that will be enough for our purpose.

[Screenshot: the resulting graph of two vertices connected by one edge]

We can note the following:

  • We provided the ids of objects ; this isn’t always possible in graph databases but is possible with Cosmos DB (if we don’t provide it, a randomly generated GUID is automatically provisioned)
  • We did provide a custom property (i.e. language) on the edge
  • The graph partition key is department hence we provided it for each vertex

Document Query

The code is available on GitHub, more specifically in the Program.cs file.

Here we build on the code from the Cosmos DB async streaming article.  We simply read all the documents in the graph with the DocumentDB API and output them in JSON format (the method also deletes the CarolToAlice edge, created later in this article, so the sample can be re-run from scratch):


private async static Task ListAllDocumentsAsync(
    DocumentClient client,
    Uri collectionUri)
{
    var query = client.CreateDocumentQuery(
        collectionUri,
        new FeedOptions
        {
            EnableCrossPartitionQuery = true
        });
    var queryAll = query.AsDocumentQuery();
    var all = await GetAllResultsAsync(queryAll);

    Console.WriteLine($"Collection contains {all.Length} documents:");

    foreach (var d in all)
    {
        var json = GetJson(d);

        if (d.Id == "CarolToAlice")
        {
            await client.DeleteDocumentAsync(
                d.SelfLink,
                new RequestOptions
                {
                    PartitionKey = new PartitionKey(d.GetPropertyValue<string>("department"))
                });
        }

        Console.WriteLine(json);
    }

    Console.WriteLine();
}
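
The GetJson helper isn’t listed in the article ; it comes from the GitHub sample.  A plausible minimal version (a sketch, relying on the fact that the SDK resources derive from JsonSerializable) could be:


private static string GetJson(Resource resource)
{
    //  JsonSerializable.ToString (the base of Document) serializes the
    //  resource, dynamic properties included, to its JSON representation
    return resource.ToString();
}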

The output should be the following:


{
   "id": "Bob",
   "_rid": "smp9AKyqeQADAAAAAAAABA==",
   "_self": "dbs/smp9AA==/colls/smp9AKyqeQA=/docs/smp9AKyqeQADAAAAAAAABA==/",
   "_ts": 1504096168,
   "_etag": "\"00001c04-0000-0000-0000-59a6afad0000\"",
   "label": "person",
   "age": [
     {
       "_value": 24,
       "id": "88a659bf-84d1-4c13-8450-ee57b426b7b3"
     }
   ],
   "department": "support character"
}
 {
   "id": "Alice",
   "_rid": "smp9AKyqeQAKAAAAAAAABg==",
   "_self": "dbs/smp9AA==/colls/smp9AKyqeQA=/docs/smp9AKyqeQAKAAAAAAAABg==/",
   "_ts": 1504096164,
   "_etag": "\"0000ed09-0000-0000-0000-59a6afa60000\"",
   "label": "person",
   "age": [
     {
       "_value": 42,
       "id": "78109dc8-587f-4d87-9d2e-e4a1731dec2b"
     }
   ],
   "department": "stereotype"
 }
 {
   "id": "AliceToBob",
   "_rid": "smp9AKyqeQALAAAAAAAABg==",
   "_self": "dbs/smp9AA==/colls/smp9AKyqeQA=/docs/smp9AKyqeQALAAAAAAAABg==/",
   "_ts": 1504096178,
   "_etag": "\"0000ee09-0000-0000-0000-59a6afb40000\"",
   "label": "communicatesWith",
   "language": "English",
   "_sink": "Bob",
   "_sinkLabel": "person",
   "_sinkPartition": "support character",
   "_vertexId": "Alice",
   "_vertexLabel": "person",
   "_isEdge": true,
   "department": "stereotype"
 }

We can learn a lot from this projection:

  • Vertices are pretty close to simple DocumentDB documents ; the properties starting with an underscore (_) are our usual DocumentDB metadata (e.g. _self)
  • Vertex properties (e.g. age) are represented as an array of complex sub structures (a _value and an id) ; this is because in Gremlin a vertex’s (or edge’s) properties can have multiple values
  • Edges are more complex
    • A metadata property _isEdge seems to be the discriminator between a vertex and an edge
    • _vertexId & _vertexLabel identify the “source” of the edge (the starting point)
    • _sink, _sinkLabel & _sinkPartition identify the “target” of the edge (the destination point)
    • The partition of the edge is the same as the “source” vertex, even if we didn’t specify it in Gremlin
    • The custom property language is a flat property, not a complex one with arrays as in the vertices

Given that information, we can easily write queries, for instance, to list only vertices:


private class MinimalDoc
{
    public string id { get; set; }
    //  Only edges carry the _isEdge discriminator ; vertices leave it null
    public bool? _isEdge { get; set; }
}

private async static Task ListOnlyVerticesAsync(
    DocumentClient client,
    Uri collectionUri)
{
    var query = client.CreateDocumentQuery<MinimalDoc>(
        collectionUri,
        new FeedOptions
        {
            EnableCrossPartitionQuery = true
        });
    var queryVertex = (from d in query
                        where !d._isEdge.HasValue
                        select d).AsDocumentQuery();
    var all = await GetAllResultsAsync(queryVertex);

    Console.WriteLine($"Collection contains {all.Length} documents:");

    foreach (var d in all)
    {
        Console.WriteLine(d.id);
    }

    Console.WriteLine();
}

This should list Alice & Bob but not the edge between them.
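
Since this is the SQL API, the same filter can also be expressed as a raw SQL query instead of LINQ.  Here is a minimal sketch (ListOnlyVerticesWithSqlAsync is our own hypothetical variation, not part of the original sample):


private async static Task ListOnlyVerticesWithSqlAsync(
    DocumentClient client,
    Uri collectionUri)
{
    //  NOT IS_DEFINED tests for the absence of the _isEdge discriminator,
    //  mirroring the LINQ filter above
    var query = client.CreateDocumentQuery<MinimalDoc>(
        collectionUri,
        "SELECT c.id FROM c WHERE NOT IS_DEFINED(c._isEdge)",
        new FeedOptions
        {
            EnableCrossPartitionQuery = true
        });
    var queryVertex = query.AsDocumentQuery();
    var all = await GetAllResultsAsync(queryVertex);

    foreach (var d in all)
    {
        Console.WriteLine(d.id);
    }
}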

Can we write?

Querying is all nice and good, but what about writing?

Let’s try to simply add a document in the graph:


private async static Task AddTrivialVertexAsync(
    DocumentClient client,
    Uri collectionUri)
{
    var response = await client.CreateDocumentAsync(
        collectionUri,
        new
        {
            id = "Carol",
            label = "person",
            department = "support character"
        });
    var json = GetJson(response.Resource);

    Console.WriteLine(json);
}

If we use the Gremlin Console to look at it:


gremlin> :> g.V("Carol")

==>[id:Carol,label:person,type:vertex,properties:[department:[[id:Carol|department,value:support character]]]]

Hence we see the new document as a vertex.  That makes sense since we’ve seen that vertices are projected as simple documents.

If we add other simple properties the way we did with label, this will not work:  those properties won’t show up in Gremlin.  That is because, as we’ve seen, properties in Gremlin are always collections.  We can do this instead:


private async static Task AddVertexWithPropertiesAsync(
    DocumentClient client,
    Uri collectionUri)
{
    var response = await client.CreateDocumentAsync(
        collectionUri,
        new
        {
            id = "David",
            label = "person",
            //  Vertex properties must be arrays of { id, _value } sub
            //  objects, as observed in the projection earlier
            age = new[] {
                new
                {
                    id = Guid.NewGuid().ToString(),
                    _value = 48
                }
            },
            department = "support character"
        });
    var json = GetJson(response.Resource);

    Console.WriteLine(json);
}

and in Gremlin:


gremlin> :> g.V("David").valueMap()

==>[age:[48],department:[support character]]

So it appears we can successfully write vertices in a graph using the DocumentDB API.

This is obviously useful to mass-import graphs since there are a lot of tools out there that can import into DocumentDB.

Writing an edge

We can write vertices.  That is only half the equation for importing data in a graph.  What about edges?

It turns out we simply have to mimic what we’ve seen with existing edges:


private static async Task AddEdgeAsync(DocumentClient client, Uri collectionUri)
{
    var response = await client.CreateDocumentAsync(
        collectionUri,
        new
        {
            _isEdge = true,
            id = "CarolToAlice",
            label = "eavesdropOn",
            language = "English",
            department = "support character",
            _vertexId = "Carol",
            _vertexLabel = "person",
            _sink = "Alice",
            _sinkLabel = "person",
            _sinkPartition = "stereotype"
        });
    var json = GetJson(response.Resource);

    Console.WriteLine(json);
}

It is important for the edge’s partition to be the same as the source vertex, otherwise the edge won’t be seen by Gremlin.

We can validate the edge is now present in Gremlin:


gremlin> :> g.E()

==>[id:CarolToAlice,label:eavesdropOn,type:edge,inVLabel:person,outVLabel:person,inV:Alice,outV:Carol,properties:[language:English]]
 ==>[id:AliceToBob,label:communicatesWith,type:edge,inVLabel:person,outVLabel:person,inV:Bob,outV:Alice,properties:[language:English]]

gremlin> :> g.V("Carol").out("eavesdropOn")

==>[id:Alice,label:person,type:vertex,properties:[age:[[id:78109dc8-587f-4d87-9d2e-e4a1731dec2b,value:42]],department:[[id:Alice|department,value:stereotype]]]]

Summary

We’ve seen it is possible to both read and write to a Cosmos DB graph using the DocumentDB API.

It would also be possible to do so using the MongoDB API.

An obvious use is to leverage DocumentDB (or MongoDB) tools to manipulate a graph, e.g. for an initial load.

Cosmos DB Async Querying & Streaming

I wrote an article back in January 2015 about async querying Azure DocumentDB using the .NET SDK.

The service was still in preview back then.

Since then DocumentDB has been superseded by Azure Cosmos DB and the SDK has changed a bit so I thought I would rewrite that article.  Here it is.

LINQ was built into .NET / C# before async was.  That is probably the #1 reason why doing LINQ queries on an asynchronously fetched data source is so awkward today.  This will likely change one day but until then…

Why Async?

Before we dive in the solution, let’s see why we would want to implement asynchrony in querying.

This was true in 2015 and I hope it is less so today:  a lot of people do not understand what asynchrony is for in .NET.  I always think it’s worthwhile to discuss it.

Let’s try the reverse psychology approach. Here is what asynchrony doesn’t bring us:

  • It doesn’t make our client (e.g. browser) asynchronous ; for instance, if we implement it in a service call, it doesn’t make the caller asynchronous (e.g. Ajax)
  • It doesn’t bring us performance per se
  • It doesn’t make our code run on multiple threads at once

Asynchrony allows us to…  SCALE our server code.  It allows us to multiplex our server, to serve more concurrent requests at the same time.  If we do not have scaling issues, we might not need asynchrony.

The reason it allows us to scale is that when we async / await on an I/O call (e.g. a Cosmos DB remote call), it frees the current thread to be used by another request until the call comes back, allowing us to serve more requests with fewer threads and less memory.

Solution

The code is available on GitHub, more specifically in the Program.cs file.

The important part is to recognize that the query object (IDocumentQuery<T>) from the SDK is an asynchronous interface.  It fetches new results in batches.  So we can write a method to fetch all the results like this one:

private async static Task<T[]> GetAllResultsAsync<T>(IDocumentQuery<T> queryAll)
{
    var list = new List<T>();

    while (queryAll.HasMoreResults)
    {
        var docs = await queryAll.ExecuteNextAsync<T>();

        foreach (var d in docs)
        {
            list.Add(d);
        }
    }

    return list.ToArray();
}

Or one that allows us to process all the items in the query with an action:

private async static Task<int> ProcessAllResultsAsync<T>(
    IDocumentQuery<T> queryAll,
    Action<T> action)
{
    int count = 0;

    while (queryAll.HasMoreResults)
    {
        var docs = await queryAll.ExecuteNextAsync<T>();

        foreach (var d in docs)
        {
            action(d);
            ++count;
        }
    }

    return count;
}

We can create a query object with no fancy LINQ expression, i.e. basically querying the entire collection, like this:

var client = new DocumentClient(new Uri(SERVICE_ENDPOINT), AUTH_KEY);
var collectionUri = UriFactory.CreateDocumentCollectionUri(DATABASE, COLLECTION);
var query = client.CreateDocumentQuery(
    collectionUri,
    new FeedOptions
    {
        EnableCrossPartitionQuery = true
    });
var queryAll = query.AsDocumentQuery();

That code basically queries the entire collection and returns an array of Document objects.

We could also serialize into a custom object and filter the query:

var query = client.CreateDocumentQuery<MinimalDoc>(
    collectionUri,
    new FeedOptions
    {
        EnableCrossPartitionQuery = true
    });
var queryNoDog = (from d in query
                    where d.id != "Dog"
                    select d).AsDocumentQuery();

In the code sample there are 4 examples using different variations.
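
For instance, here is a minimal end-to-end sketch tying those pieces together (the Main / MainAsync pair is hypothetical and assumes the SERVICE_ENDPOINT, AUTH_KEY, DATABASE & COLLECTION constants above):


private static void Main(string[] args)
{
    MainAsync().Wait();
}

private static async Task MainAsync()
{
    var client = new DocumentClient(new Uri(SERVICE_ENDPOINT), AUTH_KEY);
    var collectionUri = UriFactory.CreateDocumentCollectionUri(DATABASE, COLLECTION);
    var queryAll = client.CreateDocumentQuery(
        collectionUri,
        new FeedOptions
        {
            EnableCrossPartitionQuery = true
        })
        .AsDocumentQuery();
    //  Stream the collection in batches, printing each document id as it comes
    var count = await ProcessAllResultsAsync(
        queryAll,
        (Document d) => Console.WriteLine(d.Id));

    Console.WriteLine($"Processed {count} documents");
}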

Summary

Asynchrony is a powerful way to scale server-side code.

Cosmos DB allows us to do that in an easy way as was demonstrated in this article.

Cosmos DB & Graph with Gremlin – Getting Started

Azure Cosmos DB is Microsoft’s globally distributed multi-model database service.

One of the paradigms it supports is Graph:  Cosmos DB can be used to store and query graphs.

At the time of this writing, it supports one interface, Gremlin, which is part of the Apache TinkerPop project.

This means we can use any Gremlin Console to connect to a Cosmos DB graph.

That is well documented.  I won’t reproduce the steps here.  Instead, I’m going to point to documentation.

Understanding Gremlin

First thing, let’s understand Gremlin.  Gremlin is to graph data what SQL is to relational data ; it is a graph traversal language.  Except the debate hasn’t fully settled in the graph world and Gremlin has meaningful competition (e.g. Cypher).

The TinkerPop project site contains very good documentation for getting started with Gremlin.  Their sales pitch is “learn it in 30 minutes” and it’s pretty accurate.

Once we’ve absorbed that, we can go deeper with the online exhaustive documentation.

Gremlin with Cosmos DB

Azure documentation has a good guide to both create a Cosmos DB graph and connect to it with a Gremlin Console.

We can download the Gremlin Console from TinkerPop’s site.  It contains both Windows & Unix consoles.

Personally, I’ve installed it in the Linux subsystem on Windows 10 (when in Rome…).

The only trick is that it isn’t an apt-get package and we need Java 1.8 to run the files.  See Oracle’s instructions to install it properly.  There seems to have been a split between versions 1.7 and 1.8 and the package for 1.7 doesn’t upgrade to 1.8.

Using Gremlin on Cosmos DB

It is pretty straightforward by following the instructions.

The only counterintuitive aspect is that we need to prefix every Gremlin command with :> in order to access Cosmos DB (or any remote service in general from within the Gremlin Console).

Summary

Cosmos DB supports Gremlin as an interface to command & query its graphs.

This article was meant to simply list the links to quickly get started in that scenario.

Azure Application Gateway Anatomy

Back in May, we talked about Azure Application Gateway.

In this article, we’re going to look at its anatomy, i.e. its internal components as exposed in the Azure Resource Manager (ARM) model.

A lot of Azure resources have an internal structure.  For instance, a Virtual Network has a collection of subnets.

Azure Application Gateway has a very rich internal model.  We will look at this model in order to understand how to configure it.

What is Azure Application Gateway

From the official documentation:

Application Gateway is a layer-7 load balancer.  It provides failover, performance-routing HTTP requests between different servers, whether they are on the cloud or on-premises. Application Gateway provides many Application Delivery Controller (ADC) features including HTTP load balancing, cookie-based session affinity, Secure Sockets Layer (SSL) offload, custom health probes, support for multi-site, and many others.

I like to say that it is at once a Reverse Proxy, a Web Application Firewall (WAF) and a layer 7 load balancer.

Internal Resource Model

Let’s start with a summary diagram.  Each box represents a sub resource (except Application Gateway, which represents the main resource) and each of the bullet points within a box represents a property of that sub resource.

[Diagram: Application Gateway and its sub resources]

We can now look at each sub resource.

Application Gateway

Key properties of the gateway itself are

  • The SKU, i.e. the tier (Small, Medium, Large) & the number of instances (1 to 10)
  • A list of SSL certificates (used by HTTP Listeners)

SSL Certificates are optional if the Gateway exposes only HTTP endpoints but are required for HTTPS endpoints.

The SKU can be anything, although in order to have an SLA, it must be of tier medium or large and have at least 2 instances.

Gateway IP Configuration

The Gateway IP Configuration has a 1:1 relationship with the Application Gateway (trying to register a second configuration results in an error) and can therefore be conceptually considered as properties of the Gateway directly.

It simply defines in which subnet the Application Gateway lives.

Frontend IP Configuration

This sub resource defines how the gateway is exposed.

There can be either one or two of those configurations:  public, private or both.

The same configuration can be used by more than one HTTP listener, using different ports.

Frontend Port

Frontend ports describe which ports on the Application Gateway are exposed.  It simply is a port number.

HTTP Listener

This is a key component.  It combines a frontend IP configuration and a frontend port ; it also includes a protocol (HTTP or HTTPS) and optionally an SSL certificate.

An HTTP listener is what the Application Gateway is listening to, e.g.:

  • Public IP X.Y.Z.W on port 443, HTTPS with a given SSL
  • Private IP 10.0.0.1 on port 80, HTTP

Backend Address Pool

The real compute handling requests.

Typically it’s going to be stand-alone VMs or VMs from a VM Scale Set (VMSS) but technically only the addresses are registered.  It could therefore be some public web site out there.

Backend HTTP Setting

This component describes how to connect to a backend compute:  port #, protocol (HTTP / HTTPS) & cookie-based affinity.

The frontend & backend can be different:  we can have HTTPS on 443 on the frontend while routing it to HTTP on port 12345 in the backend.  This is actually a typical SSL termination scenario.

Probe

A probe, actually a custom probe, probes a backend for health.  It is described by a protocol, a URL, an interval, a timeout, etc.  Basically we can customize how a backend is determined to be healthy or not.

A custom probe is optional.  By default, a default probe is configured probing the backend using the port and protocol specified in the backend http setting.

Rule

A rule binds together an HTTP Listener, a backend address pool & a backend HTTP setting.  It basically binds an endpoint in the frontend with an endpoint in the backend.

There are two types of rules:  basic rules & path rules.  The former simply binds the aforementioned components together while the latter adds the concept of mapping a URL pattern to a given backend.

Summary

We covered the anatomy of Application Gateway and articulated how its different components relate to each other.

In future articles we will build on this anatomy in order to address specific scenarios.

Hypersphere Volume

In our last article we looked at how the dimension of data space impacts Machine Learning algorithms.  This is often referred to as the curse of dimensionality.

At the heart of the article we discussed the fact that an hypersphere’s hyper-volume tends to zero as dimension increases.

Here I want to demonstrate how to find the hyper-volume of an hypersphere of dimension N.

The Math Reference Project gives a short & sweet demonstration.  I personally found it hard to follow.  Foundations of Data Science by John Hopcroft & Ravindran Kannan (chapter 2) starts a demonstration but cuts it short.

I wanted to contribute a complete demonstration because I just love that type of mathematical problem.  It’s one of my many flaws.

Approach

We’ll use Cartesian coordinates and the fact that the volume of an hypersphere of dimension N can be found by integrating the volume of an hypersphere of dimension N-1 with an infinitesimal thickness:

V_N(R) = \displaystyle\int_{-R}^R V_{N-1}(\sqrt{R^2-x^2}) dx

We’ll find the volume for a few dimensions then we’ll generalize the result.

N=1

Well, V_1(R) = 2 R:  it’s a line.

N=2

We already know the result should be V_2(R) = \pi R^2, but let’s demonstrate it.

\begin{array}{lcl} V_2(R) &=& \displaystyle\int_{-R}^R V_1(\sqrt{R^2-x^2}) dx\\ &=& \displaystyle\int_{-R}^R 2 \sqrt{R^2-x^2} dx\\&=& 2 R^2 \displaystyle\int_{-1}^1 \sqrt{1- (\frac {x}{R})^2} \, d \frac{x}{R}\\&=& 2 R^2 \displaystyle\int_{-\pi/2}^{\pi/2} \sqrt{1- \sin^2 \theta} \, d (\sin \theta) \text{ where } \sin \theta = \frac{x}{R}\\&=& 2 R^2 \displaystyle\int_{-\pi/2}^{\pi/2} \cos^2 \theta \, d \theta\\&=& 2 R^2 \cdot \frac{1}{2} [ \theta + \tfrac{1}{2} \sin {2 \theta} ]_{-\pi/2}^{\pi/2}\\ &=& \pi R^2\end{array}

N=3

We know the result should be V_3(R) = \frac{4}{3} \pi R^3, but again, let’s demonstrate it.

\begin{array}{rcl}V_3(R) &=& \displaystyle\int_{-R}^R V_2(\sqrt{R^2-x^2}) dx\\&=& \displaystyle\int_{-R}^R \pi (\sqrt{R^2-x^2})^2 dx\\&=& \pi (2 R^3 - \displaystyle\int_{-R}^R x^2 dx)\\&=& \pi (2 R^3 - \frac{2 R^3}{3})\\&=& \frac{4}{3} \pi R^3\end{array}

N=4

Let’s find the hyper-volume of an hypersphere of dimension 4.

\begin{array}{rcl} V_4(R) &=& \displaystyle\int_{-R}^R V_3(\sqrt{R^2-x^2}) dx\\&=& \displaystyle\int_{-R}^R \frac{4}{3} \pi (\sqrt{R^2-x^2})^3 dx\\&=& \frac{4}{3} \pi R^4 \displaystyle\int_{-1}^1 (1-(\frac{x}{R})^2)^\frac{3}{2} d(\frac{x}{R})\\&=& \frac{4}{3} \pi R^4 \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} (1-\sin^2 \theta)^\frac{3}{2} d(\sin \theta) \text{ where } \sin \theta = \frac{x}{R}\\&=& \frac{4}{3} \pi R^4 \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos^3 \theta \cdot \cos \theta d \theta\\&=& \frac{4}{3} \pi R^4 \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos^4 \theta d \theta\\&=& \frac{4}{3} \pi R^4 ([\frac{\cos^3 \theta \sin \theta}{4}]_{-\frac{\pi}{2}}^{\frac{\pi}{2}} + \frac{3}{4} \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos^2 \theta d \theta)\\&=& \frac{4}{3} \pi R^4 (0 + \frac{3}{4} \cdot \frac{1}{2} [\theta + \tfrac{1}{2} \sin 2 \theta]_{-\frac{\pi}{2}}^{\frac{\pi}{2}})\\&=& \frac{\pi^2}{2} R^4\end{array}

Generalization

Now we have quite some practice.  Let’s try to generalize the hypersphere volume formula.

First let’s assume the volume formula has the following form:

V_N(R) = K_N R^N

Where K_N is a constant (independent of R).  We’ll see that we only need to assume that form for the volumes of dimension N-1 and less.  Since we already know it to be true for N <= 4, it isn’t a strong assumption.

With that, let’s proceed:

\begin{array}{rcl} V_N(R) &=& \displaystyle\int_{-R}^R V_{N-1}(\sqrt{R^2-x^2}) dx\\&=& K_{N-1} \displaystyle\int_{-R}^R (R^2-x^2)^\frac{N-1}{2} dx\\&=& K_{N-1} R^N \displaystyle\int_{-1}^1 (1-(\frac{x}{R})^2)^\frac{N-1}{2} d(\frac{x}{R})\\&=& K_{N-1} R^N \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos^{N-1} \theta \cdot \cos \theta d \theta \text{ where } \sin \theta = \frac{x}{R}\\&=& K_{N-1} R^N \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos^N \theta d \theta\end{array}

We’re dealing with a recursion here, so let’s rewrite this equation in terms of two sets of constants:

\begin{array}{rcl}V_N(R) &=& K_N R^N = C_N K_{N-1} R^N \text{ where } C_N = \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos^N \theta d \theta\\&\implies& K_N = C_N K_{N-1}\\&\implies& K_N = (\displaystyle\prod_{i=2}^N C_i) K_1 = 2 \displaystyle\prod_{i=2}^N C_i \text{ (since }K_1=2 \text{)}\end{array}

Let’s work on the set of constants C.  We know the first couple of values:

\begin{array}{rcl} C_0 &=& \pi \\ C_1 &=& 2 \\ C_2 &=& \frac{\pi}{2} \end{array}

We can also obtain a recursive expression through integration by parts (the boundary term vanishes at \pm\frac{\pi}{2}):

C_N = \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos^N \theta d \theta = \frac{N-1}{N} \displaystyle\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}} \cos^{N-2} \theta d \theta \implies C_N = \frac{N-1}{N} C_{N-2}
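
As a sanity check, this recursion reproduces the value we used in the N=4 derivation:

\begin{array}{rcl} C_3 &=& \frac{2}{3} C_1 = \frac{4}{3}\\ C_4 &=& \frac{3}{4} C_2 = \frac{3 \pi}{8} \end{array}

and indeed \frac{4}{3} \pi R^4 \cdot \frac{3 \pi}{8} = \frac{\pi^2}{2} R^4 = V_4(R).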

If we observe that

\begin{array}{rcl} C_N C_{N-1} &=& \frac{N-1}{N} C_{N-2} \frac{N-2}{N-1} C_{N-3}\\&=& \frac{N-2}{N} C_{N-2} C_{N-3}\\&=& \frac{N-2}{N} \frac{N-4}{N-2} C_{N-4} C_{N-5}\\&=& \frac{N-4}{N} C_{N-4} C_{N-5}\\&=&\begin{cases} \frac{2}{N} C_2 C_1 & \text{if N is even} \\ \frac{1}{N} C_1 C_0 & \text{if N is odd} \end{cases}\\&=&\begin{cases} \frac{2 \pi}{N} & \text{if N is even} \\ \frac{2 \pi}{N} & \text{if N is odd} \end{cases}\\&=&\frac{2 \pi}{N}\end{array}

Then we can write

\begin{array}{lcl} K_N &=& 2 \displaystyle\prod_{i=2}^N C_i \\ &=& \begin{cases} 2 \cdot \frac{2 \pi}{N} \frac{2 \pi}{N-2} \dots \frac{2 \pi}{4} C_2 & \text{if N is even} \\ 2 \cdot \frac{2 \pi}{N} \frac{2 \pi}{N-2} \dots \frac{2 \pi}{3} & \text{if N is odd} \end{cases}\\ &=& \begin{cases} \pi \cdot \frac{2 \pi}{N} \frac{2 \pi}{N-2} \dots \frac{2 \pi}{4} & \text{if N is even} \\ 2 \cdot \frac{2 \pi}{N} \frac{2 \pi}{N-2} \dots \frac{2 \pi}{3} & \text{if N is odd} \end{cases}\end{array}

Therefore we found that

\begin{array}{lcl} V_N (R) &=& \begin{cases} \pi \cdot \frac{2 \pi}{N} \frac{2 \pi}{N-2} \dots \frac{2 \pi}{4} \cdot R^N & \text{if N is even} \\ 2 \cdot \frac{2 \pi}{N} \frac{2 \pi}{N-2} \dots \frac{2 \pi}{3} \cdot R^N & \text{if N is odd} \end{cases}\end{array}

Which gives us an explicit formula for the volume of an hypersphere in N dimensions.

Limit

Given the formula for K_N (and that V_N(R) = K_N R^N), it is easy to see that it is a product of smaller and smaller terms.

As soon as N becomes bigger than 2 \pi (i.e. from N=7 onward), the terms become smaller than 1 and therefore the product starts to shrink.

This is why the hyper volume vanishes as N grows towards infinity.

Values

We can then compute values (for R=1):

Dimension | Formula | Value
1 | 2 | 2
2 | \pi | 3.141592654
3 | \frac{4 \pi}{3} | 4.188790205
4 | \pi \cdot \frac{2 \pi}{4}=\frac{\pi^2}{2} | 4.934802201
5 | 2 \cdot \frac{2 \pi}{5} \frac{2 \pi}{3}=\frac{8 \pi^2}{15} | 5.263789014
6 | \pi \cdot \frac{2 \pi}{6} \frac{2 \pi}{4}= \frac{\pi^3}{6} | 5.16771278

which corresponds to what we gave in our last article.

Summary

We demonstrated how to find the hyper-volume of an hypersphere of dimension N and could rigorously establish that the hyper-volume vanishes as the dimension grows.

That result is counterintuitive and this is why we thought a mathematical proof was warranted.

Hyperspheres & the curse of dimensionality

I previously talked about the curse of dimensionality (more than 2 years ago) related to Machine Learning.

Here I wanted to discuss it in more depth and dive into the mathematics of it.

High dimensions might sound like Physics’ string theory where our universe is made of more than 4 dimensions.  This isn’t what we are talking about here.

The curse of dimensionality is related to what happens when a model deals with a data space with dimensions in the hundreds or thousands.

As the title of this article suggests, we’re going to take the angle of the properties of Hyperspheres (spheres in N dimensions) to explore high dimension properties.

This article is inspired by Foundations of Data Science by John Hopcroft & Ravindran Kannan (chapter 2).

Why should I care about High Dimension?

When introducing Machine Learning concepts, we typically use few dimensions in order to help visualization.  For instance, when I introduced linear regression or polynomial regression in past articles, I used datasets in two dimensions and plot them on a chart.

In the real world, typical data sets have many more dimensions.

A typical case of high dimension is image recognition (or character recognition as a sub category) where even a low-resolution picture will have hundreds of pixels.  The corresponding model would take a gray-scale input vector of dimension 100+.

With fraud detection, transactions do not contain only the value of the transaction, but the time of day, day of week, geo-location, type of commerce, type of products, etc.  This might or might not be a high dimension problem, depending on the available data.

In an e-commerce web site, a Product recommendation algorithm could be as simple as an N x N matrix of 0 to 1 values where N is the number of products.

With IoT, multiple sensors feed a prediction model.

In bioinformatics, DNA sequencing generates a huge amount of data which is often arranged in a high dimensional model.

Basically, high dimensions crop up everywhere.

What happens as dimension increases?

For starters, a space with more dimensions simply is…  bigger.  In order to sample a space of 2 dimensions with a resolution of 10 units, we need 10^2 = 100 points.  The same sampling in a space of dimension 3 requires 10^3 = 1000 points.  Dimension 20?  It would require 10^20 = 100 000 000 000 000 000 000 points.

Right off the bat we can tell that sampling spaces of dimension 2 & 3 is realistic while doing so for a space of dimension 20 is unlikely.  Hence we are likely going to suffer from under-sampling.

Yoshua Bengio has a nice discussion about Curse of Dimensionality here.

Hypersphere in a cube

Beyond sampling problems, metrics & measures change behaviour at high dimensions.  Intuitively this makes sense since a measure takes a vector (or vectors) and squeezes it (or them) into a numerical value ; the higher the dimension, the more data we squeeze into one number & hence the more information we should expect to lose.

We use metrics & measures heavily in Machine Learning.  For instance, a lot of cost (or loss) functions are based on the (squared) Euclidean distance:

dist(x,y) = \displaystyle\sum_{i=1}^N (x_i-y_i)^2

Now if x and / or y are random variables (e.g. samples), the law of large numbers applies when N becomes large.  This implies the sum will tend toward its expected value with a narrower relative standard deviation as N increases.  In turn, this means there is less and less information in the distance as the number of dimensions increases.
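
To make this concrete:  if each term (x_i-y_i)^2 is an i.i.d. random variable with mean \mu and standard deviation \sigma, then

E[dist(x,y)] = N \mu \text{ , } \sigma_{dist} = \sqrt{N} \sigma \implies \frac{\sigma_{dist}}{E[dist(x,y)]} = \frac{\sigma}{\sqrt{N} \mu} \rightarrow 0 \text{ as } N \rightarrow \infty

so the distance concentrates around its mean:  random points end up roughly equidistant from each other.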

This brings us to the hypersphere.  An hypersphere’s equation is

\displaystyle\sum_{i=1}^N x_i^2 = R^2

where x is a point of dimension N and R is the radius of the hypersphere.

An hypersphere of dimension 1 is a line, an hypersphere of dimension 2 is a circle, dimension 3 is a sphere, dimension 4 is an…  expanding universe?  and so on.

A theorem I’ll demonstrate in a future article is that the volume of an hypersphere of radius 1 tends to zero as the dimension increases.

UPDATE (12-07-2017):  Demonstration of hypersphere hyper volume is done in this article.

This is fairly unintuitive, so let me give real numerical values:

Dimension Hyper Volume
1 2
2 3.141592654
3 4.188790205
4 4.934802201
5 5.263789014
6 5.16771278
7 4.72476597
8 4.058712126
9 3.298508903
10 2.55016404
11 1.884103879
12 1.335262769
13 0.910628755
14 0.599264529
15 0.381443281
16 0.23533063
17 0.140981107
18 0.082145887
19 0.046621601
20 0.025806891
21 0.01394915
22 0.007370431
23 0.003810656
24 0.001929574
25 0.000957722
26 0.000466303
27 0.000222872
28 0.000104638

If we plot those values:

[Chart: hypersphere volume as a function of dimension]

We see the hyper volume increases in the first couple of dimensions.  A circle of radius 1 has an area of pi (3.1416) while a sphere of radius 1 has a volume of 4.19.  It peaks at dimension 5 and then shrinks.

It is unintuitive because in 2 and 3 dimensions (the only dimensions in which we can visualize an hypersphere), the hypersphere pretty much fills its embedding cube.  A way to “visualize” what’s happening in higher dimensions is to consider a “diagonal” into an hypersphere.

For a circle, the diagonal (i.e. at 45°) intersects with the unit circle at

(\frac {1} {\sqrt {2}}, \frac {1} {\sqrt {2}}) since (\frac {1} {\sqrt {2}})^2 + (\frac {1} {\sqrt {2}})^2 = 1^2

In general, at dimension N, the diagonal intersects at

x_i = \frac {1} {\sqrt {N}}

So, despite the fact that the hypersphere of radius 1 touches each wall of the cube of side 2 centered at the origin, the surface of the hypersphere, along the diagonals, gets further and further away from the cube’s corners as the dimension increases.
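
For example, at N=100, the diagonal crosses the unit hypersphere at

x_i = \frac{1}{\sqrt{100}} = 0.1

while the cube’s corner along the same diagonal sits at (1, 1, \dots, 1), at a distance \sqrt{100} = 10 from the origin.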

Consequences of the hypersphere volume

A straightforward consequence of the hypersphere volume is sampling.  Randomly sampling a square of side 2 centered at the origin will land points within the unit circle with probability \frac{\pi}{4} \approx 79\%.  The same process with an hypersphere of dimension 8 would hit the inside of the hypersphere with a probability of about 1.6\%.
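
Indeed, the probability is simply the ratio of the hypersphere’s volume to the cube’s volume:

\frac{V_2(1)}{2^2} = \frac{\pi}{4} \approx 79\% \text{ while } \frac{V_8(1)}{2^8} = \frac{4.0587}{256} \approx 1.6\%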

A corollary to the hypersphere volume is that at higher dimension, the bulk of the volume of the hypersphere is concentrated in a thin annulus below its surface.  An obvious consequence of that is that optimizing a metric (i.e. a distance) in high dimension is difficult.

What should we do about it?

First step is to be aware of it.

A symptom of high dimensionality is under-sampling:  the space covered is so large that the number of sample points required to learn the underlying model is likely to exceed the size of the actual sample set.

The simplest solution is to avoid high dimensionality with some pre-processing.  For instance, if we have a priori knowledge of the domain, we might be able to combine dimensions together.  For example, in an IoT field with 10 000 sensors, for many reasons, including the curse of dimensionality, it wouldn’t be a good idea to consider each sensor’s input as an independent input.  It would be worth trying to aggregate sensor inputs by analyzing the data.

Summary

Some Machine Learning algorithms will be more sensitive to higher dimensionality than others but the curse of dimensionality affects most algorithms.

It is a problem to be aware of and we should be ready to mitigate it with some good feature engineering.