Tag Archives: NoSQL

Not Only SQL (NoSQL): Azure DocumentDB, HDInsight HBase & Table Storage

DocumentDB protocol support for MongoDB

Microsoft announced, in the wake of many DocumentDB announcements, that DocumentDB would support the MongoDB protocol.

What does that mean?

It means you can now swap a DocumentDB for a MongoDB and the client (e.g. your web application) will work the same.
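
To make that concrete, here is a minimal sketch (not production code) of what the client side could look like with the official MongoDB .NET driver (the MongoDB.Driver NuGet package); the connection string, database and collection names below are placeholders:

// Sketch: talking to a DocumentDB account over the MongoDB protocol,
// using exactly the same driver calls you would issue against a MongoDB server.
private static async Task UseMongoProtocolAsync()
{
    // Placeholder connection string; the real one comes from the account blade in the portal.
    var client = new MongoClient(
        "mongodb://<account>:<key>@<account>.documents.azure.com:10250/?ssl=true");
    var database = client.GetDatabase("Demo-DB");
    var collection = database.GetCollection<BsonDocument>("demo");

    await collection.InsertOneAsync(new BsonDocument
    {
        { "firstName", "Vincent-Philippe" },
        { "lastName", "Lauzon" },
        { "office", "MTL" }
    });
}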

This is huge.

It is huge because Azure, and the cloud in general, offer few Databases as a Service.

Azure has SQL Database, SQL Data Warehouse, Redis Cache, Search & DocumentDB.  You could argue that Azure Storage (Blob, Table, queues & files) is also one.  HBase under HDInsight could be another.  Data Lake Store & Data Lake Analytics too.

Still, compare that to any list of the main players in NoSQL and fewer than 10 services isn’t much.  For all the other options, you need to build them yourself on VMs.  Since those are database workloads, optimizing their performance can be tricky.

MongoDB is a leader in the document-oriented NoSQL databases space.

With the recent announcement, this means all MongoDB clients can potentially / eventually run on Azure with much less effort.

And this is why this is huge news.

A different account

For the time being, DocumentDB supports MongoDB through a different type of DocumentDB account.

You need to create your DocumentDB account as a "DocumentDB – Protocol Support for MongoDB" account.

You’ll notice the portal interface is different for such accounts.

You can then access those accounts using familiar MongoDB tools such as MongoChef.

But you can still use DocumentDB tools to access your account too.

Summary

In a way you could say that Azure now has MongoDB as a Service.

A big caveat is that the supported protocol surface isn’t 100%.  CRUD operations are supported and the rest is being prioritized and worked on.

Yet, the data story in Azure keeps growing.

UPDATE:  To get started, check out:

 

Azure DocumentDB Demo

On December 1st, 2015, I’m doing a presentation to a Montreal user group, MS DEV MTL.  Here is the script of each demo.  Enjoy!

UPDATE:  You can see the presentation slides here.

 

Account Creation & Adding Documents

For the creation of an Azure DocumentDB account, allow me to refer to myself in Creating an Azure DocumentDB account.

In order to add a database, in your DocumentDB account blade, click “Add Database”, name it Demo-DB.

Select that database; that will open the database blade. Click “Add Collection” and name it demo. Change the pricing tier to S1.

Select the collection you just created. That will open the collection blade. We are going to create two documents. For that, click “Create Document” at the top of the collection blade. First document:

{
  "firstName" : "Vincent-Philippe",
  "lastName" : "Lauzon",
  "office" : "MTL"
}

Second document:

{
  "office" : "MTL",
  "address" :
  {
    "streetNumber" : 2000,
    "streetName" : "McGill College",
    "streetType" : "Avenue",
    "inBuilding" : "Suite 500",
    "postalCode" : "H3A 3H3"
  }
}

Now, let’s look at those documents within the collection. In the collection blade, click “Document Explorer” (at the bottom). You will notice a few things:

  • Both documents had an id property added to them, containing a generated GUID
  • The two documents do not have the same schema
  • JSON types, string and number, were used

Let’s add a third document:

{
  "firstName" : "John",
  "lastName" : "Smith",
  "office" : "Calgary",
  "id" : "emp-john-smith",
  "phoneNumber" : "123-456-7890"
}

You can go ahead and look at the document and observe that:

  • We manually provided the id of the document here; DocumentDB used that id instead of generating one
  • The schema is slightly different than the other employee’s

Simple Querying

For querying, in the collection blade, click “Query Explorer”. Leave the query as is, i.e.

 

SELECT * FROM c

 

Let’s observe a few things:

  • In the query, c stands for the collection. It is a variable name: you can replace c by whatever literal you fancy
  • The result is a JSON array containing each of the original documents
  • The documents have more “metadata”, i.e. properties starting with _, such as _ts, the timestamp

Let’s try something slightly less trivial:

 

SELECT *
FROM c
WHERE c.firstName != null

 

Now we have only the employees, i.e. we skipped the MTL office document.

The following query does a projection or a JSON transformation:

 

SELECT
{"firstName":c.firstName, "lastName":c.lastName} AS name,
c.office
FROM c
WHERE c.firstName!=null

 

This yields the following results:

[
  {
    "name": {
      "firstName": "Vincent-Philippe",
      "lastName": "Lauzon"
    },
    "office": "MTL"
  },
  {
    "name": {
      "firstName": "John",
      "lastName": "Smith"
    },
    "office": "Calgary"
  }
]

 

This demonstrates how DocumentDB seamlessly merges the power of SQL-style querying with JSON / JavaScript semantics.

To explore querying further, go to the querying playground, where you can experiment interactively in a web browser.

Indexing Policy

To look at the current indexing policy of a collection, in the collection blade, click “Indexing Policy”. Typically, you’ll see the following:

 

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*",
      "indexes": [
        {
          "kind": "Range",
          "dataType": "Number",
          "precision": -1
        },
        {
          "kind": "Hash",
          "dataType": "String",
          "precision": 3
        },
        {
          "kind": "Spatial",
          "dataType": "Point"
        }
      ]
    },
    {
      "path": "/\"_ts\"/?",
      "indexes": [
        {
          "kind": "Range",
          "dataType": "Number",
          "precision": -1
        },
        {
          "kind": "Hash",
          "dataType": "String",
          "precision": 3
        }
      ]
    }
  ],
  "excludedPaths": []
}

 

where you can observe

  • Indexing is consistent (done synchronously with changes)
  • Indexing is automatic
  • All properties are included
  • Numbers have range indexes, strings have hash indexes and points have spatial indexes
  • The timestamp (_ts) has both range & hash indexes
  • No paths are excluded
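
The policy can also be customized through the SDK.  As an illustration only (a sketch, assuming a DocumentClient and Database object like the ones used in the SDK demo below, and a hypothetical /payload/* path you would rather not index), a collection could be created like this:

// Sketch (inside an async method): create a collection that indexes everything
// except a hypothetical "/payload/*" path, using the Microsoft.Azure.Documents types.
var collectionSpec = new DocumentCollection { Id = "demo" };

collectionSpec.IndexingPolicy.IndexingMode = IndexingMode.Consistent;
collectionSpec.IndexingPolicy.Automatic = true;
collectionSpec.IndexingPolicy.IncludedPaths.Add(new IncludedPath { Path = "/*" });
collectionSpec.IndexingPolicy.ExcludedPaths.Add(new ExcludedPath { Path = "/payload/*" });

await _docClient.CreateDocumentCollectionAsync(database.SelfLink, collectionSpec);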

Looking at consistency level

Go to your DocumentDB account blade; at the bottom, under “Configuration”, click “Default consistency”.

You can actually see the definitions of each level in the portal.

SDK Demo

Start up a new Console App project. Get the NuGet package Microsoft.Azure.DocumentDB.

Everything orbits around the DocumentClient component. To instantiate one, you need information from your DocumentDB account. In the account blade, click the key icon.

You’ll need:

  • URI (serviceEndPoint in the SDK)
  • Primary key (authKey in the SDK)

In the code, simply instantiate it as:

private static readonly DocumentClient _docClient = new DocumentClient(
    new Uri(ENDPOINT),
    AUTH_KEY,
    ConnectionPolicy.Default,
    ConsistencyLevel.Session);

Here you see that you can override the connection policy (see this post for details) and the consistency level for the connection.

The rest of the code will use the method “QueryAsync” defined in this post.
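
For completeness, here is a minimal sketch of what such a helper could look like (it simply drains all result pages asynchronously; see the linked post for the actual implementation):

// Sketch of an async query helper: turn an IQueryable built by the SDK's LINQ provider
// into a document query and read all result pages asynchronously.
// Requires the Microsoft.Azure.Documents.Linq namespace.
private async static Task<IEnumerable<T>> QueryAsync<T>(IQueryable<T> query)
{
    var documentQuery = query.AsDocumentQuery();
    var results = new List<T>();

    while (documentQuery.HasMoreResults)
    {
        results.AddRange(await documentQuery.ExecuteNextAsync<T>());
    }

    return results;
}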

First, let’s find our collection, in a fully asynchronous (hence scalable) way:

 

private async static Task<DocumentCollection> GetCollectionAsync()
{
    var dbQuery = from db in _docClient.CreateDatabaseQuery()
                  where db.Id == DB_NAME
                  select db;
    var database = (await QueryAsync(dbQuery)).FirstOrDefault();
    var collectionQuery = from col in _docClient.CreateDocumentCollectionQuery(database.AltLink)
                          where col.Id == COLLECTION_NAME
                          select col;
    var collection = (await QueryAsync(collectionQuery)).FirstOrDefault();

    return collection;
}

 

What we do here is basically search for our database among the databases of the account by querying the database list, then do the same thing for the collection.

The interesting point to notice here is that we do everything asynchronously, including querying.  Nothing is blocking.

Let’s define an employee object, a PONO (Plain Old .NET Object):
public class Employee
{
    [JsonProperty("id")]
    public string ID { get; set; }

    [JsonProperty("firstName")]
    public string FirstName { get; set; }

    [JsonProperty("lastName")]
    public string LastName { get; set; }

    [JsonProperty("office")]
    public string Office { get; set; }

    [JsonProperty("phoneNumber")]
    public string PhoneNumber { get; set; }
}
Here we use attributes to map property names, bridging the naming convention gap between JavaScript and C#, i.e. the fact that JavaScript property names typically start with a lowercase letter while C# ones start with an uppercase letter.  Other approaches could have been used.

Let’s define a method to find me:

 

private async static Task<Employee> QueryVinceAsync(DocumentCollection collection)
{
    var employees = from e in _docClient.CreateDocumentQuery<Employee>(collection.SelfLink)
                    where e.FirstName == "Vincent-Philippe"
                    select e;
    var vincent = (await QueryAsync(employees)).FirstOrDefault();

    return vincent;
}

 

Here, we again run a query, this time on the documents within a collection.  We strongly type the query to the Employee type.  That doesn’t filter out non-employee documents though; the filter on the query does that:  it searches for documents having a firstName property equal to Vincent-Philippe.  Documents without such a property obviously fail that filter.

Then we can look at the code of the demo:

 

private static async Task DemoAsync()
{
    var collection = await GetCollectionAsync();
    var vincent = await QueryVinceAsync(collection);
    var newEmployee = new Employee
    {
        FirstName = "Jessica",
        LastName = "Jones",
        Office = "Hell's Kitchen",
        PhoneNumber = "Unknown"
    };
    var newEmployeeResponse =
        await _docClient.CreateDocumentAsync(collection.SelfLink, newEmployee);

    // ID of the created employee document
    Console.WriteLine(newEmployeeResponse.Resource.Id);
}

 

The interesting point here is the return type of the document creation method.  Since the SDK is a thin wrapper around REST calls, the response exposes everything returned by the REST call.  Of particular interest:  newEmployeeResponse.RequestCharge.  Here it is 6.1, expressed in Request Units (RUs).  This helps you figure out which pricing tier you should aim for.
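
For instance, while experimenting you could surface the charge of each operation (a small sketch in the same spirit as the demo):

// Sketch: print the request charge of the write to get a feel for RU consumption.
Console.WriteLine(
    $"Created {newEmployeeResponse.Resource.Id} for {newEmployeeResponse.RequestCharge} RUs");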

 

Major upgrade to Azure DocumentDB LINQ provider

Early September 2015, Microsoft announced a major upgrade to the LINQ provider of the .NET SDK of DocumentDb.

I know it may appear a bit confusing since, when DocumentDb was released (a year ago now), it was said that it supported SQL.  Well, it supported some SQL.

Now the surface area of SQL it supports has increased.  In order for us to take advantage of this within a .NET application, the LINQ Provider must be upgraded to translate more operations into that SQL.

You see, DocumentDb’s SDK works the same way as Entity Framework, LINQ to SQL or LINQ to XML in that respect:  your C# LINQ query gets translated into an expression tree by the compiler, then the LINQ provider (an implementation of IQueryable) translates the expression tree into a SQL string at runtime.


The SQL is what is sent to the DocumentDb service.

Today the LINQ provider allows string manipulation, array manipulation (e.g. concatenation), order by and some hierarchical manipulation too.
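
For example, a query like the following (a sketch reusing the Employee class and collection from the demo above; the exact operator coverage is documented with the SDK) can now be translated to DocumentDB SQL:

// Sketch: a string operation (ToUpper) combined with an OrderBy,
// both handled by the upgraded LINQ provider.
var montrealEmployees = _docClient
    .CreateDocumentQuery<Employee>(collection.SelfLink)
    .Where(e => e.Office.ToUpper() == "MTL")
    .OrderBy(e => e.LastName);

foreach (var employee in await QueryAsync(montrealEmployees))
{
    Console.WriteLine($"{employee.FirstName} {employee.LastName}");
}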

So download the latest NuGet package of DocumentDb client (i.e. 1.4.1) and try to expand your LINQ queries.

 

Enjoy!

Analysing Application Logs with DocumentDb

Azure DocumentDB is Microsoft’s document-centric NoSQL offering in the cloud.

I’ve been working with it since September 2014 and I wanted to share a use case it is really good at:  log analysis.

Now that takes some context.

 

I have been working for a customer using Microsoft Azure as a development platform.  Applications were C# Web Apps and logging was done using the standard .NET tracing, captured by Azure to blob storage.

As with most applications, logging took a backseat while the applications were developed, so developers would sprinkle lines such as:

System.Diagnostics.Trace.TraceError("Service X was not available");

Mind you, that’s better than my contribution, which typically has no logs at all except for a catch all in the outer-most scope so you can be informed that there has been a null exception somewhere!

I’ve architected quite a few systems and often end up troubleshooting issues in production environments, either functional or performance based.  Without good logs, that is impossible.

The problem I always found with text logs is they are so difficult to exploit.  Once you have a few megs of one-liners, they end up being useless.  They lack two things:

  • Standard information
  • Structure

Standard information examples?  Event name, correlation ID, duration, exception, etc.  Because they are standard throughout your logs, they are easier to search.

Structure?  Well, structure enables you to search more easily as well.  When one developer splits information blocks with commas, another one with pipes, etc., it doesn’t simplify the consumption of logs.

 

More robust logging solutions do implement those two aspects.  For instance, Enterprise Library Semantic Logging implements custom EventSources that are semantic, i.e. strongly typed.  This is borrowed from Windows ETW tracing.

Now the constraint I had was to use basic .NET tracing and to log to blob storage.

JSON

What I did was separate the problem of logging from the problem of analysing the logs:  I logged strongly typed JSON events, serialized as strings, into the .NET trace.

.NET only saw strings and was happy.

But I had structured, standard information in JSON objects.  Better yet, I didn’t have to use the same type of JSON object everywhere.  This allowed me, for instance, to log controller interceptions with one information subset, method calls with another and errors with yet another.
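
To give an idea, here is a minimal sketch of the pattern (the event shape and the use of Json.NET are illustrative assumptions, not the exact code from those projects):

// Sketch: a strongly typed event, serialized to JSON and pushed through plain .NET tracing.
public class ServiceCallEvent
{
    public string EventName { get; set; }
    public Guid CorrelationId { get; set; }
    public int DurationInMs { get; set; }
    public string Target { get; set; }
}

// ... somewhere in the application code:
var traceEvent = new ServiceCallEvent
{
    EventName = "ServiceCall",
    CorrelationId = Guid.NewGuid(),
    DurationInMs = 45,
    Target = "Service X"
};

// .NET tracing only sees a string; the structure survives inside the JSON.
System.Diagnostics.Trace.TraceInformation(JsonConvert.SerializeObject(traceEvent));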

You must see me coming by now…  DocumentDB was the perfect tool to analyse that data.  All I had to do was to write some code to load CSV files from the blob storage to a DocumentDB collection.

The way I did that was to take each row of the CSV files and consider it a JSON object, with the ‘message’ column being a complex field of that JSON object (i.e. yet another JSON object).
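
Roughly, each row could be loaded with something like this sketch (the row dictionary and column names are illustrative assumptions; Json.NET’s JObject is assumed):

// Sketch: turn one CSV row (here a simple column-name -> value dictionary) into a document
// where the serialized "message" column becomes a nested JSON object.
private static Task LoadRowAsync(
    DocumentClient docClient,
    DocumentCollection collection,
    IDictionary<string, string> row)
{
    var document = new JObject
    {
        ["timestamp"] = row["timestamp"],
        ["level"] = row["level"],
        ["message"] = JObject.Parse(row["message"])   // the strongly typed event logged earlier
    };

    return docClient.CreateDocumentAsync(collection.SelfLink, document);
}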

Then I could analyse the logs by simply doing DocumentDB-SQL queries.  Since the documents were already fully indexed, the queries were instantaneous!

 

This was a life saver on the different projects where I used that approach.  I could dive into massive amounts of logs easily, get information, compile statistics, detect outliers, etc.

I actually bundled all that into a web solution that was able to import data from a blob container, run some interactive queries and also export query results to CSV files so I could analyse them further in Excel.

 

In a way this borrows a lot of patterns from Big Data:

  • Store your unprocessed data in a data lake, in this case blobs
  • Load your data in structured form only for analysis (schema on read)
  • Extract insights
  • Drop the structured form

 

This allowed me to learn a lot more about DocumentDB.  I hit the ingestion limits of the S1 tier pretty quickly, even when bundling records together using a stored procedure, and had to implement back-offs.  When I got too aggressive creating / deleting collections, I got throttled and the service refused to serve me.

But otherwise, the query speed was really excellent.  Being able to dive into JSON objects in a query is a huge enabler.

SQL Server 2016

Here’s a rundown of my favourite new features in SQL Server 2016, largely inspired by the SQL Server Evolution video.

Impact of Cloud-First on SQL Design

This is a really nice illustration of the consequences of Cloud-First for Microsoft products.

SQL has basically been flipped around as a product.  When SQL Azure was introduced years ago, it was a version of SQL Server running in the cloud.  Nowadays, SQL Azure drives the development of the SQL Server product.

Being cloud-first allows Microsoft to iterate much faster on different product features in preview mode, gather a ton of feedback thanks to the scale of Azure, and deploy them globally very quickly once the development cycles are over.  That changes the entire dynamic of product development.

The nice thing is that it actually improves the SQL Server product:  when a new version of SQL Server comes out, e.g. 2016 right now, the features have already been exercised by a user base an order of magnitude larger than the beta-testers of the old world.

In-memory Columnstore indexes

In-memory OLTP was introduced in SQL Server 2014.  SQL Server 2016 adds another twist:  Columnstore indexes on in-memory tables!

You can now have a high-throughput table that is fully in-memory and also have it optimized for analytics (columnstore indexes).

This unlocks scenarios such as real-time analytics (with no ETL to Data Warehouses)

Always Encrypted

Transparent Data Encryption (TDE, only recently added to Azure SQL) encrypts the data on disk.  This addresses mainly the physical compromising of data:  somebody steals your hard drive.

Not a bad start, as data center hard drives still get recycled and once in a while you’ll see headlines about data leaks caused by hard drives found with sensitive data on them.

Now what Always Encrypted brings to the table is the ability to encrypt data in motion, i.e. while you are reading it…  and it is column based so you can very selectively choose what gets encrypted.

With this, a SQL database administrator will only see rubbish in your encrypted columns.  Same thing for somebody eavesdropping on the wire.

The cryptographic keys aren’t stored in the SQL database either, but on the client side (Azure Key Vault, anyone?), which means that even if your entire database gets stolen, the thief won’t be able to read your encrypted data.

…  it also means you’ll have to be freaking careful about how you manage those keys, otherwise you’ll end up with encrypted data nobody can read (did I mention Azure Key Vault?)
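
From the application side, opting in is mostly a connection string affair.  A sketch, assuming the ADO.NET SqlClient that ships with .NET Framework 4.6 or later:

// Sketch: enabling Always Encrypted from the client; the driver transparently
// encrypts parameters and decrypts results for columns configured as encrypted.
var connectionString =
    "Server=myServer;Database=myDb;Integrated Security=true;" +
    "Column Encryption Setting=Enabled";

using (var connection = new System.Data.SqlClient.SqlConnection(connectionString))
{
    connection.Open();
    // Queries against encrypted columns work as usual from here,
    // provided the client has access to the column master key.
}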

Polybase

Polybase was introduced in SQL Server Parallel Data Warehouse (PDW).

It extends the reach of TSQL queries beyond SQL Server tables to unstructured data sources such as Hadoop.

This will now be part of SQL Server 2016.

Run ‘R’ models in SQL

Microsoft’s recent acquisition of Revolution Analytics didn’t take long to have an impact.

We will be able to run ‘R’ analytics models right inside SQL Server 2016.  This brings the power of R data analysis so much closer to the data!

Advanced Analytics

Power BI is of course at the center of this, along with a complete revamp of Reporting Services.

Stretching SQL Server to Azure

A hybrid-cloud approach to SQL Azure:  put your cold data (typically historical) in the cloud storage but keep your database on-premise.

I’m talking about data within the same table being both on-premise and in the cloud!

Quite easy to set up, this feature has the potential to be a really nice introduction to the cloud for many organizations.  The high-value scenario is dropping the storage cost of on-premise applications having huge databases where most of the data is historical, i.e. cold (rarely accessed).

It is all configuration based, hence not requiring any changes in the consuming applications.

Wrapping up

SQL Server 2016 proves that the SQL Server product line is alive and kicking, with very valuable features for modern scenarios, be it Big Data, advanced analytics or hybrid cloud computing.

You can try SQL Server 2016 using the VM template SQL Server 2016 CTP2 Evaluation on Windows Server 2012 R2 in the Marketplace as of this date (22/07/2015).

If you want more details, the best source so far is the SQL Server Evolution video, which is well done.

Azure DocumentDB – Performance Tips

Azure DocumentDB has been released for a little while now.  Once you get past the usual steps of figuring out how to connect and doing a few hello worlds, you will want to reach for more in-depth literature.  Sooner or later, performance will be on your mind when you have to make architecture decisions on a solution leveraging Azure DocumentDB.

Stephen Baron, Program Manager on Azure DocumentDB, has published a two-part performance tips article (part 1 & part 2).

The tips given there are quite useful and do not require rewriting all your client code.  They cover:

  • Network Optimization
  • How to better use the SDK
  • Indexing
  • Query optimization
  • Consistency settings

It is very well written, straightforward and therefore useful.  It is my reference so far.
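
To give a taste of the SDK-side tips (a sketch only; the articles remain the authoritative guidance), switching the client from the default gateway mode to direct TCP connectivity looks like this:

// Sketch: a DocumentClient configured for direct connectivity over TCP,
// one of the SDK-level tuning knobs discussed in the articles.
// ENDPOINT and AUTH_KEY are the same placeholders as in the SDK demo above.
var connectionPolicy = new ConnectionPolicy
{
    ConnectionMode = ConnectionMode.Direct,
    ConnectionProtocol = Protocol.Tcp
};

var client = new DocumentClient(new Uri(ENDPOINT), AUTH_KEY, connectionPolicy);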

The limits of DocumentDB Preview Release

I was looking for the limits of DocumentDB Standard Tier, the only tier available during the preview release.  It wasn’t all that trivial to find, so here it is:

http://azure.microsoft.com/en-us/documentation/articles/documentdb-limits/

Among the limits that may constrain your solution:

  • Only 3 collections per Capacity Unit (but 100 Databases, the container of collections, per account)
  • 25 stored procs (or UDFs or triggers) per collection
  • Maximum request size for documents & attachments:  256 KB
  • Maximum response size:  1 MB
  • Maximum number of ANDs (and ORs) per query:  5

As you can see, the current limits of DocumentDB are quite aggressive.  The NoSQL database is highly performant and it seems that this performance comes at a cost.

We can expect those limits to be loosened in the near future though; it is typical for a preview release to be sandboxed.  In the same way, we can expect the product to become available in more regions.