HDInsight Hadoop Hive – CSV files analysis

Ok, in a past blog post we set up Azure HDInsight for some Hive fun.

So let’s!

Today I’ll go and analyse the data contained in multiple CSV files.  Those files will be created in Excel but, in a real-world scenario, they could be data dumps on a file server or files exported from a real system.

Creating some flat files

First, let’s create some data to be consumed by Hive.

Open Excel and author the following:

rand1 Widget Price InStock
=RAND() =IF(A2<0.3, "Hammer", IF(A2>0.7, "Screwdriver", "Plier")) =A2*20 =A2*1000

So basically, in the first column we generate a random number.  In the Widget column we generate a label (from the collection {Hammer, Screwdriver, Plier}) based on that random number.  In the Price column we generate a price based on the random column.  Finally, in the InStock column, we generate a number of items in stock, still based on the random column.

So, yes, we are generating random dummy data.  Can’t wait to see the insight we’re going to get out of that!

Now, let’s auto-fill 100 rows using the first data row (i.e. the second row I made you type).  You should end up with something like that (but with totally different values, thanks to randomness):


(I’ve added the first row styling)

Now let’s save this file three times as CSV, creating files HiveSample-1.csv, HiveSample-2.csv and HiveSample-3.csv.  Between each save, enter a value in a blank cell (at the right of the data).  This will recalculate all the random entries.  You can then hit CTRL-Z and save.

Let’s push those files into your Hadoop blob container, in the folder "my-data/sampled".  This is important, as we’ll refer to that path in a Hive command soon.

Hive Editor

We’ll use Hive through the HDInsight Hive Editor, a web console.

Let’s open the HDInsight cluster we’ve created in the portal.


In the Quick Links we find the Cluster Dashboard.  When we click it we are prompted for credentials.  This is where we give the cluster admin credentials we entered during setup.


In the top menu of the dashboard, you’ll find the Hive Editor.  Click that and you should land on a page looking like this:


Creating External Table

We’re going to create an external table.  An external table in Hive is a table where only the table definition is stored in Hive; the data is stored in its original format outside of Hive itself (in the same blob storage container though).

In the query editor, we’re going to type the following (hardware being the name of the table we’ll query later):

CREATE EXTERNAL TABLE hardware (
rand1 double,
Widget string,
Price double,
InStock int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 'wasb:///my-data/sampled/'
TBLPROPERTIES ("skip.header.line.count"="1")

and hit the Submit button and…  wait.  You wait a lot when working with Hive unfortunately.

Hive transforms this HiveQL query into a Hadoop Map Reduce job and schedules it.  Eventually your job will complete.  You can view the details of the job.  In this case the job doesn’t output anything, but you can see the elapsed time in the logs.

What we did here is tell Hive to create an external table with a given schema (schema on read, more on that in a minute), parsing the files using the comma as field delimiter.  We tell Hive to pick up all the files within the folder "my-data/sampled" and to skip the first row of each file (the header).

Sanity queries

Let’s run a few queries for sanity’s sake.  Let’s start with a COUNT:
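The query boils down to a one-liner over the external table we just defined:

SELECT COUNT(*) FROM hardware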


which should return the total number of rows across all the files you put in the folder.


SELECT * FROM hardware LIMIT 10

which should give you the first 10 rows of the first file (yes, SELECT TOP is a T-SQL-only instruction; it isn’t part of ANSI SQL and apparently wasn’t picked up by anyone else, including HiveQL, hence the LIMIT clause).

You can submit both queries back-to-back.  You can give them a name to find them more easily afterwards.

Schema on read

A concept you’ll hear a lot about in Hive (and in Big Data Analysis in general) is Schema on read.  That is opposed to Schema on write that we are used to in a typical SQL database.

Here the schema is used to guide the parser in interpreting the data; it isn’t a schema used to format the data as it is written to the database.

It’s a subtle difference but an important one.  Schema-on-read means you wait for your analysis to impose constraints on the data.  This is quite powerful as you do not need to think in advance about the analysis you are going to do.  Well…  mostly!  You still need the information to be there in some form at least.
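To make that concrete, nothing stops us from laying a different schema over the very same files.  A hypothetical variation (reading every column as a string this time) would simply be another external table pointing at the same folder:

CREATE EXTERNAL TABLE hardware_raw (
rand1 string,
Widget string,
Price string,
InStock string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 'wasb:///my-data/sampled/'
TBLPROPERTIES ("skip.header.line.count"="1")

The files aren’t touched; only the way they are read changes.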

Slightly more complex queries

Ok, let’s try some analysis of the data.  Let’s group the widgets together and look at their average price and quantity in stock:

SELECT Widget,
AVG(Price) AS Price,
AVG(InStock) AS InStock
FROM hardware
GROUP BY Widget

Your mileage will vary given the randomness, but you should have something along the lines of:


Optimization with Tez

The last query took 85 seconds to run on my cluster.  Remember, I configured my cluster to have only one data node, using the cheapest VM available.

Still…  to compute aggregates over a dataset of 300 rows…  A tad slow.

By and large, Hadoop Map Reduce isn’t fast, it is scalable.  It wasn’t built to be fast on small data sets.  It has a file-based job scheduling architecture which incurs a lot of overhead.

Nevertheless, some optimizations are readily available.

For instance, if you type

set hive.execution.engine=tez;

on top of the last query, it runs in 36 seconds on my cluster.  Less than half the time without the optimization.

That is Tez.  Tez is built on top of YARN.  To make a long story short, you can consider them as version 2 of the Hadoop Map Reduce engine.

This works per query, so you always need to prefix your query with the set instruction.
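So the text submitted to the editor ends up looking like this (the same aggregation as above, with the engine switch on top):

set hive.execution.engine=tez;
SELECT Widget,
AVG(Price) AS Price,
AVG(InStock) AS InStock
FROM hardware
GROUP BY Widget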


We’ve created some dummy data in Excel, saved it as CSV files and uploaded them to blob storage.

We created a Hive external table reading those files.

We’ve queried that table to perform some analytics.

We’ve optimized the queries using Tez.

HDInsight / Hadoop Hive really shines when you perform ad hoc analytics:  when you want to explore data.  If you already know that you want your quarterly average, there are better suited technologies for that (e.g. SQL Server Analysis Services).

For a more conceptual tutorial of Hive look here.

Azure Data Lake – Early look

Ok, this is a super early look at the technology.  Azure Data Lake was announced yesterday (September 29th, 2015) at AzureCon (and later blogged about by Scott Gu); it will go into public preview at the end of the year, so there isn’t a ton of documentation about it.

But quite a few characteristics have already been unveiled.


What is a data lake?

I first came across the concept of a data lake reading Gartner’s upcoming-trends reports.  A data lake is your unprocessed / uncleansed (raw) data.  The idea is that instead of having your data stored neatly in a data warehouse, where you’ve cleansed it and probably removed a lot of information by keeping only what’s necessary for your foreseen analytics, a data lake holds your raw data:  not trivial to work with, but it contains all the information.

2 Services

Actually, there are 2 distinct services here:

  • Azure Data Lake Store
  • Azure Data Lake Analytics

The two are sort of loosely coupled although well integrated.

Azure Data Lake Store is a storage service, a sort of alternative to Blob Storage.  It features huge storage scale (there aren’t any advertised capacity limits), low latency for real-time workloads (e.g. IoT) and supports any type of data, i.e. unstructured, semi-structured and structured.  At this point, it isn’t clear if it’s just a massive blob storage storing files only or if you can really store structured data natively (as Azure Tables does).  On the other hand, the store implements HDFS, which is a file system…  so I guess the native format is files.

Azure Data Lake Analytics, on the other hand, is an analytics service.  So far, it seems that its primary interface is U-SQL, an extension of T-SQL supporting C# for imperative programming.  It is built on top of YARN (see my Hadoop ecosystem overview) but seems to be a Microsoft-only implementation, i.e. it isn’t Hive.

On top of that, we have Visual Studio tools that seem to mostly facilitate the authoring / debugging of analytics jobs.

The two services are loosely coupled:  the store implements HDFS and can therefore be queried by anything that understands HDFS.  That doesn’t mean only Azure HDInsight, but also other Hadoop distributions (e.g. Cloudera), Spark and Azure Machine Learning.

Similarly, Azure Data Lake Analytics can query other stores, such as Hadoop, SQL, Azure SQL Data Warehouse, etc.


U-SQL is the query language of Azure Data Lake Analytics.  In a nutshell, it merges the declarative power of T-SQL with the imperative power of C#.

Writing image analysis purely in T-SQL would be quite cumbersome.  It is for that type of scenario that C# is supported.

U-SQL lends itself quite naturally to writing data transformation pipelines in code.  This is very similar to what Hadoop Pig does.

For now I see U-SQL as replacing both Hive & Pig in one stroke.  On top of that, for developers used to .NET & T-SQL, it is very natural and productive.  This is in stark contrast to approaching Hive, which is kind of SQL but where you need to learn a bunch of minimally documented workarounds (for instance, invoking a generalized CSV parser requires you to reference a Java JAR-packaged component in the middle of the schema definition), or Pig, which is its own thing.


Ok, so why would you use Azure Data Lake vs Hadoop with Hive & Pig, or Spark?

First, as I just mentioned, you’ll be productive way faster with Azure Data Lake.  Not just because you already know T-SQL & C#, but because the integration with Visual Studio will beat Hadoop’s odd assortment of tools any day.

Second, Azure Data Lake offers a killer feature for me:  pay per query.  This means you would really pay only for what you use, instead of standing up a Hadoop cluster every time you need to perform queries and dropping it once you’re done.  Again, this is more productive since you don’t have to worry about cluster management, but it can also be way more economical.

Also, hopefully Data Lake Analytics is faster than Hive!  If we’re lucky, it could be comparable to Spark.

In Summary…

A very exciting service!  I think its main strength is the seamless integration of the different pieces rather than drastically new concepts.

Doing Big Data today is quite a chore.  You have a bunch of different tools to learn to use and string together.  Azure Data Lake could drastically change that.

As usual, a big part will be the pricing of those services.  If they are affordable they could help democratize Big Data Analytics.

Before the public preview arrives, have a look at the Service page.

HDInsight Hadoop Hive – Setup

Let’s explore Hadoop Hive, shall we?

Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

It was originally built by Facebook as an abstraction on top of Hadoop Map Reduce and is now an open source (Apache) project.

(See my Hadoop ecosystem overview here)

As with other Hadoop main components, you have a few options to work with it:

  • You can download it and install the cluster on a server farm on premises; good luck with that, I’m sure it’s going to be a nice learning experience.
  • You can use HDInsight on-premises, which is a port to Windows done in collaboration between Hortonworks & Microsoft.
  • You can use Azure HDInsight

Yes, if you’ve read anything in this blog, you know I’m gonna go with the Azure option!

I’ll give two major things to Azure HDInsight:

  1. It performs all the cluster installation / maintenance for you
  2. It (mostly) shields you from the Unixy mix of 3000 open source projects that is Hadoop

And for that I am grateful!  When I want to use Hive, I don’t want to fight with Bash shells; I want to run some SQL over Big Data.


Just before we start…  what is Hive in a nutshell?

Hive allows you to query massive data sets (Big Data) without hand coding Java Map Reduce into Hadoop.

So it’s an abstraction layer and a productivity enhancer.

Setting it up

Alright, Azure Portal (preview) everyone.  Yeah!  Give the preview portal some love!

SQL Database

What?  Yeah…  there’s the thing…

Here I’ll show you how to manage a Hadoop / Hive cluster on the cheap.  One of the nice features of Azure HDInsight is the ability to externalize the Hive metastore (where the schemas of your tables are stored) from the cluster.

This will allow you to tear down the cluster without losing your Hive Metadata.

By default the metastore is persisted to a Derby database (yes, chances are you’ve never heard of it; remember what I told you about Azure HDInsight shielding you from all this open source tech?) which is hosted on the cluster.

The alternative is to host it on an Azure SQL Database, outside the cluster.

So let’s create a DB.


Top left corner, (+ NEW), Data + Storage, SQL Database:  this opens up the New SQL Database blade.

Go for it.  Just be careful of a few things, namely:

  • Make sure it’s in the right subscription
  • Make sure it’s in the region (e.g. East US) you want your cluster to be
  • Make sure you tick “Allow azure services to access server” to let Hadoop access it


Now the HDInsight cluster!


Top left corner, (+ NEW), Data + Analytics, HDInsight:  this opens up the New HDInsight Cluster blade.

Cluster Name:  choose one; it needs to be unique within the universe of HDInsight, hence I won’t give you mine, you would clash with it.

Cluster Type:  Hadoop.

Cluster Operating System:  leave it to Windows Server.

Make sure it’s in the right subscription and the right resource group.

Click Credentials, that’s going to pull the credentials blade.


Those are the credentials to authenticate to the cluster.  We’re just toying around, so don’t go overboard.  You do not need to enable the remote desktop for this.

Data Source

Ok, Data Source.  This will open the…  Data Source blade!

One of the differentiators of Azure HDInsight compared to other Hadoop implementations is that the distributed file system (HDFS) is implemented on top of Azure Blob Storage.  This is commonly known as WASB: Windows Azure Storage Blob.

To make a long story short:

  • Hadoop comes with HDFS, the Hadoop Distributed File System
  • Files on HDFS are spread and replicated across the many data nodes of your cluster
  • This means files in HDFS belong to your cluster and can’t exist without it
  • WASB is an implementation of HDFS that simply forwards to Blob Storage, which is already a distributed / replicated file system
  • WASB allows your files to exist without your cluster

The last point is key.  It means you can tear your cluster down and keep your files.  You can later recreate a cluster and continue where you left off.
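For reference, a fully qualified WASB path spells out both the container and the storage account.  In a Hive LOCATION clause, for instance (with hypothetical container and account names), it looks along these lines:

LOCATION 'wasb://mycontainer@myaccount.blob.core.windows.net/my-data/'

The wasb:/// shorthand simply points at the cluster’s default container.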

So…  I suggest you create a storage account just for your Hive discovery.

Data Source

Again, give it a unique name.  You can choose your default container which will itself default to the name of your cluster.

Node Pricing Tier

Again, we’re in discovery mode.  Let’s go easy on it:  one worker node, and you can drop the type of VM for the worker & head nodes to A3 (the cheapest).

Optional Configuration


In optional configuration, let’s select external metastores and select the SQL Database we created earlier.

Only do that for the Hive metastore (we won’t use Oozie today).  You can give your server admin password for today but DO NOT DO THAT in production.  In production, of course, create a separate user that has rights only on the specific db.


Finally, hit the create button!  This will churn away for a little while and you’ll have your cluster ready soon!

Key takeaways

As key takeaways, there are two pieces of your Azure HDInsight configuration that can survive the deletion of your cluster:

  1. HDFS content (via a storage account)
  2. Hive Metastore (via an Azure SQL Database)

This makes Azure HDInsight a component you can prop up to do heavy computation and tear down until next time.

You could even go a step further, back up your database into a .bacpac and tear down the SQL Database until next use.  This way you would end up paying only for storage, which is dirt cheap compared to both SQL and Hadoop.

In the next blog entry, we’ll actually use the Hadoop cluster to do some Hive!

Major upgrade to Azure DocumentDB LINQ provider

In early September 2015, Microsoft announced a major upgrade to the LINQ provider of the .NET SDK of DocumentDb.

I know it may appear a bit confusing since, when DocumentDb was released (a year ago now), it was said to support SQL.  Well, it supported some SQL.

Now the surface area of SQL it supports has increased.  In order for us to take advantage of this within a .NET application, the LINQ Provider must be upgraded to translate more operations into that SQL.

You see, DocumentDb’s SDK works the same way as Entity Framework, LINQ to SQL or LINQ to XML in that respect:  your C# LINQ query gets translated into an expression tree by the compiler, then the LINQ provider (an implementation of IQueryable) translates the expression tree into a SQL string (at runtime).


The SQL is what is sent to the DocumentDb service.

Today the LINQ provider allows string manipulation, array manipulation (e.g. concatenation), order by and some hierarchical manipulation too.
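As a rough illustration (the collection and property names here are made up), a LINQ query filtering on a string prefix and ordering the results would now get translated into DocumentDb SQL along these lines:

SELECT * FROM books b
WHERE STARTSWITH(b.Title, "A")
ORDER BY b.PublicationYear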

So download the latest NuGet package of DocumentDb client (i.e. 1.4.1) and try to expand your LINQ queries.



Hadoop ecosystem overview

Have you taken a look at Hadoop lately?

People who do not know Hadoop think it’s a big data technology the same way SQL Server is a technology.

But Hadoop is more of an ecosystem of different modules interacting together.  This is its major strength and also the source of its major weakness, i.e. its lack of strong cohesion.

In this blog post, I’ll give an overview of that ecosystem.  If you’re new to Hadoop, this could be an easy introduction.  If you are steeped in some Hadoop technology, this post could give you the opportunity to look around the ecosystem at things you aren’t specialized in.


Hadoop is an open-source Apache project.  It regroups a bunch of different modules which are themselves open-source projects.

In order to use Hadoop, you need to download quite a few of those projects, make sure their versions are compatible and assemble them.

This is what Hadoop distributions do.  The two leading distributions are Cloudera and Hortonworks.  Both distributions are fully supported on Azure Virtual Machines.


HDInsight is a Windows-based Hadoop distribution developed by Hortonworks & Microsoft.

Azure HDInsight

Azure HDInsight is a managed Service on Azure.  It allows you to create a fully managed Hadoop cluster in Azure.  It is using HDInsight for the Windows implementation but it also supports a Linux implementation.

As mentioned above, Microsoft supports Cloudera & Hortonworks in Virtual Machines, that is, if you install them in Azure VMs.  But Azure HDInsight is more than that.  It is a managed service, i.e. you do not need to worry about the VMs, they are managed for you.  Also, you can scale out your cluster in a few mouse clicks, which is quite convenient.

On top of being a managed service, Azure HDInsight is built to have external storage (i.e. blob storage & SQL Azure) in order to make it possible to create clusters temporarily but keep the state between activations.  This creates a very compelling economical model for the service.


Hadoop Distributed File System (HDFS) is a distributed file system.  It is built to make sure data is local to the machine processing it.

Azure HDInsight substitutes Windows Azure Storage Blob (WASB) for HDFS.  This makes sense since WASB is already a distributed / scalable / highly available file system, and it also mitigates a shortcoming of HDFS.

HDFS requires Hadoop to be running to be accessible, but it also requires the cluster to store the data.  This means Hadoop clusters must stay up, or at least keep their VMs around, for the data to exist.

WASB enables economic scenarios where the data is externalized from Hadoop and therefore clusters can be brought up and down always using the same data.


The basic paradigm for distributed computing in Hadoop, Map Reduce consists of splitting a problem’s data into chunks, processing those chunks on different computing units (the map phase) and assembling the results (the reduce phase).

Map Reduce programs are written in Java (packaged in jar files) and scheduled as jobs in Hadoop.

Most Hadoop projects are higher-level abstractions leveraging Map Reduce without the complexity of writing the detailed implementation in Java.  For instance, Hive exposes HiveQL as a language to query data in Hadoop.

Azure Batch borrows a lot of concepts from Map Reduce.


Yet Another Resource Negotiator (YARN) addresses the shortcomings of the original Hadoop scheduling manager.  Map Reduce 2.0 is built on top of YARN.

The original Job Tracker had scalability issues on top of being a single point of failure.


Apache Tez is an application framework allowing complex directed-acyclic-graphs of tasks to process data.  It is built on top of YARN and is a substitute for Map Reduce in some scenarios.

It tends to generate fewer jobs than Map Reduce and is hence more efficient.


Apache Hive is a data warehouse enabling queries and management over large datasets.

As mentioned earlier, it is an abstraction on top of Hadoop lower level components (HDFS, Map Reduce & Tez).


Apache HBase is a column-oriented Big Data NoSQL database based on Google’s Big Table, leveraging HDFS.

In some scenarios HBase performs blazingly fast on queries over massive data sets.


Apache Mahout is a platform to run scalable Machine Learning algorithms leveraging Hadoop distributed computing.

It has quite an overlap with Azure ML.


A high-level scripting language (Pig Latin) and a run-time environment, Apache Pig is another abstraction on top of Map Reduce.

Where Hive took a declarative approach with HiveQL, Pig takes a procedural approach.  A Pig Latin program is actually quite similar to a SQL Server Integration Services (SSIS) package, defining different steps for manipulating data.

Hive & Pig overlap.  You would choose Pig in scenarios where you are importing and transforming data and you would like to be able to see the intermediate steps (much like an SSIS package).


Apache Sqoop is a tool for Hadoop data movement involving both relational and non-relational data sources.

It has functionalities very similar to Azure Data Factory.


Apache Spark is actually complementary to Hadoop.  In Azure it is packaged as an HDInsight variant.

In many ways Spark is a modern take on Hadoop and typically is faster, up to 100 times faster on some benchmarks.

Spark leverages a directed acyclic graph execution engine (a bit like Tez) and leverages in-memory operation aggressively.

On top of its speed, the most compelling aspect of Spark is its consistency.  It is one framework to address scenarios of distributed computing, queries over massive data sets, Complex Event Processing (CEP) and streaming.

Its supported languages are Java, R, Python and Scala.  Will we see a .NET or JavaScript SDK soon?


There are way more Hadoop modules / projects!  I just talked about those I know.

You can find a more complete list over here.

Hadoop is a very rich ecosystem.  Its diversity can sometimes be seen as its weakness: each project, while leveraging common low-level components such as HDFS & YARN, is its own little world with different paradigms.

Nevertheless, it is one of the most popular big data platforms on the market.

Strong AI & Existential Risks

There has been a recrudescence of hysterical talk about Strong Artificial Intelligence (AI) lately.

Strong AI is artificial intelligence matching and eventually going beyond the full human cognitive capacity.  Weak AI, by contrast, is the replication of some facets of human cognition:  face recognition, voice recognition, pattern matching, etc.

The ultimate goal of AI is to get to what researchers call Artificial General Intelligence, i.e. an AI that could duplicate human cognitive capacity in every aspect, including creativity and social behaviour.

Today, what we have is weak AI.  For some tasks, though, weak AI already performs better than humans.  For instance, the Bing Image team has observed that for classifying images into many categories (thousands of them), AI performed better (made fewer errors) than its human counterparts.  For a small number of categories, humans are still better.  IBM Deep Blue was able to beat the world chess champion, IBM Watson is able to win at Jeopardy, etc.

But we are still far from Artificial General Intelligence.  Machines lack social aptitudes, creativity, etc.  Nevertheless, they are evolving at a rapid pace.


You might have read about the triumvirate of Elon Musk, Stephen Hawking and Bill Gates warning the world of the danger of strong AI.

Those declarations made quite some noise, as those three gentlemen are bright and sober people you do not see wearing tin foil hats on national TV.  They also are superstars of their fields.

Although their arguments are valid, I found the entire line of thought a bit too vague and unfocused to be compelling.

I compare this to when Hawking warned us to tone down the volume of our space probe messages in order not to attract malevolent aliens.  There was some merit to the argument, I suppose, but it wasn’t well developed in my opinion.

For strong AI, I found a much more solid argument at TED.

Ah, TED…  What did I do before I could get my 20 minutes fixes with you?  Thank you Chris Anderson!

Anyway, there was Nick Bostrom, a Swedish philosopher who came to discuss What happens when our computers get smarter than we are?

Right off the bat, Bostrom makes you think.  He shows a photo of Milton from Office Space and mentions that "this is the normal way things are"; you can’t help but reflect on a few things.

The current office work landscape is transitional.  There have been office workers since the late 19th / early 20th century, as Kafka depicted so vividly.  But the only constant in office work has been change.

In the time of Kafka, computers meant people carrying out computations!

Early in my father’s banking career, he spent the last Thursday night of every month at the office, along with a bunch of other workers, computing the interest on the bank accounts!  They were using abacuses and paper since they didn’t have calculators yet.  You can imagine the constraints that put on the number of products the banks were able to sell back in those days.

I’ve been in IT for 20 years and about every 3 years my day-to-day job is redefined.  I’m still a solution architect, but the way I do my work and spend my time is different.  IT is of course extreme in that respect, but the point is, office work has been changing since it started.

This is what you think about when you look at the photo Bostrom showed us.  This is what office work looked like in the 1990s, and it is already an obsolete model.

Bostrom’s point was also that we can’t take today’s office reality as an anchor for how things will be in the future.  He then goes on to define existential risk (a risk that, if realised, would cause humanity’s extinction or a drastic drop in the quality of life of human beings) and how Strong AI could pose one.  He spends a little time debunking typical arguments about how we could easily contain a strong AI.  You can tell it is a simplification of his philosophical work for a big audience.

All of this is done in a very structured and compelling way.

Part of his argument resides in the observation that, in the evolution of AI from Weak AI to Artificial General Intelligence to Strong AI, there is no train stop.  Once AI reaches our level, it will be able to evolve on its own, without the help of human researchers.  It will continue on to a Superintelligence, an AI more capable than humans in most ways.

His conclusion isn’t that we should burn computers and go back to computing on paper, but that we should prepare for a strong AI and work hard on how we will instruct it.  One can’t help but think back to Isaac Asimov’s three laws of robotics, which were created, in the fictional world, for the exact same purpose.

He casts Strong AI as an optimization machine, which is how we defined Machine Learning at the end of a previous post.  His argument is that since AI will optimize whatever problem we give it, we should think hard about how we define the problem so that we don’t end up being optimized out!

Nick Bostrom actually joined Elon Musk, Stephen Hawking and others in signing the open letter of the Future of Life Institute.  That institution is doing exactly what he proposed in the Ted Talk:  research on how to mitigate the existential risk posed by Strong AI.

If you are interested by his line of thoughts, you can read his book Superintelligence: Paths, Dangers, Strategies, where those arguments are developed more fully.

As mentioned above, I find this talk gives a much better shape to the "existential risk posed by superintelligence" argument we hear these days.

Personally, I remain optimistic regarding the existential risk posed by Strong AI, probably because I remain pessimistic about its realisation in the near future.  I think our lives are going to be transformed by Artificial Intelligence.  Superintelligence might not occur in my lifetime, but pervasive AI definitely will.  It will assist us in many tasks and will gradually take over others.  For instance, you can see with the likes of Skype Translator that translation will, in time, be done largely by machines.

Projections predict a huge shortage of software engineers.  For me this is an illusion.  What it means is that tasks currently requiring software engineering skills will become more and more common in the near future.  Since there aren’t enough software engineers around to do them, we will adapt the technology so that people with a moderate computer literacy background can perform them.  Look at the likes of Power BI:  it empowers people with moderate computer skills to do analysis previously requiring BI experts.  Cortana Analytics Suite promises to take it even further.

I think we’ll integrate computers and AI more and more into our lives & our work.  They will become invisible and part of us.

I think there are 3 possible outcomes of Strong AI:

  1. We are doomed, a Terminator 2 type of outcome
  2. We follow Nick Bostrom and make sure Strong AI won’t cause our downfall
  3. We integrate with AI in such a way that there won’t be a revolution, just a transition

Intuitively I would favor the third option as the most likely.  It’s a sort of transhumanism, where humanity gradually integrates different technologies to enhance our capacities and condition.  Humanity would transition to a sort of new species.

It all depends on how fast Strong AI comes, I suppose.

Ok, enough science-fiction speculation for today!  AI today is weak AI, getting better than us at very specific tasks; otherwise, we still rule!

Watch the TED talk, it’s worth it!

SQL Server In-Memory value and use cases


Microsoft SQL Server 2014 has made it into the leader category of Forrester’s recent In-Memory Database Platforms, Q3 2015 report.

SQL Server 2014 was behind SAP’s Hana, Oracle TimesTen & IBM DB2 with BLU Acceleration.  It was on par with Teradata Intelligent Memory.  Those were the leaders.  Following were the strong performers, with MemSQL & Kognitio among others.

SQL Server 2016, as I mentioned in my rundown of its new features, has improved on SQL Server 2014 regarding in-memory capacity.  But it isn’t released yet and Forrester’s methodology required the product to be released for at least 9 months with a sizeable customer base.

What is in-memory doing in a database engine?

Many people are confused by the in-memory features of SQL Server, or why a database engine would have such a feature in general.  At first glance, it does appear a bit odd:

  • Don’t RDBMS engines already have caching mechanisms to keep hot data in memory?
  • Shouldn’t the written data be stored on disk at transaction commit time to guarantee durability?
  • By putting tables in-memory ‘all the time’, don’t you unbalance how the rest of the tables can get cached?
  • Scenarios such as "rapid write" don’t make much sense since you can’t write forever in-memory, can you?

Those are valid points and I’ll take time here to explain how this feature can, in the right circumstances, improve performance by an order of magnitude.

So, first, yes, in the right circumstances.  I underlined that because it isn’t a turbo button you press so that everything is faster and you don’t pay the price elsewhere.  For some workloads it performs ok and isn’t worth the memory it consumes, while for some scenarios you reap this multifold improvement.  Your mileage will vary.  It’s a tool in your toolbox.  I hope that the explanations I give here will help you figure out when it should help you.

You can look at SQL Server 2016 Books Online for insights on the In-Memory features.  It relates some key scenarios where they make sense.

But if you are truly considering in-memory for mission critical workloads, go with the In-Memory OLTP – Common Workload Patterns and Migration Considerations white paper.  It goes deep and explains key concepts very well.

Better than cached page

The SQL query engine does cache hot data.  Basically, when you query data, it loads data pages in memory and keeps them there until it needs the memory elsewhere.

But traditional tables are still optimized for disk access, a slow medium.  As such, they have a variety of bottlenecks.

A big issue is contention on the different locking mechanisms.

Each time a transaction reads data, it acquires a read-lock on that data.  When another transaction wants to write on the same data, it must acquire a write-lock and therefore wait for the first transaction to complete since you can’t write while data is being read.

SQL also implements latches and spinlocks at different levels.

All those locking mechanisms take time to manage; moreover, their performance collapses in some scenarios.

A typical scenario is read-write activity at the "end of a table".  Here I’m talking about your typical table where the clustered index, despite the better advice of your DBA, is an auto-incremented integer.  Most of the activity tends to occur in the recent data, hence at the bottom of the table.  Read and write locks interfere, and performance collapses as you increase the number of threads reading and writing.

Another issue is the transaction log.  To guarantee durability, all data created or modified in a transaction must be written to the transaction log at commit time.  This means the transaction log is written to often, with small amounts of data.  This puts a load on the I/O of your system.  Not only must the record you modified be written to the log, but each index update is logged as well.

Let’s compare that with in-memory tables.

In-memory tables are completely loaded in-memory.

They do not implement locks.  Instead, they implement optimistic concurrency using a multi-version scheme.  This eliminates lock contention.

They do write to disk for durability, but only the records.  They do not write indexes to disk:  indexes exist only in memory.  This reduces I/O considerably.

In some scenarios, the data might not even be written to disk (more on that later).  In those cases, this eliminates I/O from the equation.

Natively compiled stored procedures can also dramatically increase CPU utilization (i.e. the amount of useful work the CPU does, as opposed to parsing queries).


With this in mind, we can see that in-memory tables aren’t just ‘cached normal tables’.  They are qualitatively different beasts.  They are optimized for memory and the same algorithms wouldn’t work for disk-based tables (e.g. you would need to persist the indexes).
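To make this concrete, here is a minimal sketch (the table, column and procedure names are made up, and it assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup) of a durable memory-optimized table together with a natively compiled stored procedure inserting into it:

-- Durable memory-optimized table: rows are persisted, the hash index is not
CREATE TABLE dbo.ShoppingCart
(
CartId INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
CustomerId INT NOT NULL,
CreatedUtc DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO

-- Natively compiled procedure: compiled to machine code, no query parsing at execution time
CREATE PROCEDURE dbo.usp_AddCart @CartId INT, @CustomerId INT, @CreatedUtc DATETIME2
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
INSERT INTO dbo.ShoppingCart (CartId, CustomerId, CreatedUtc)
VALUES (@CartId, @CustomerId, @CreatedUtc);
END
GO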

Non-durable tables

I mentioned non-durable tables.  Yes, at the far extreme, you can tell SQL not to bother persisting your table to disk at all, with DURABILITY = SCHEMA_ONLY.

Of course, I wouldn’t recommend that with your transactional data.  Some other in-memory database engines are ok with that because they implement replication of data across different nodes (for a very different example, see Redis).  But some scenarios can live with non durable data if they can recreate it in case of failure.

A typical example is a staging table during an ETL.  You could aggressively ingest data from different sources into an in-memory table, then efficiently compute some transformations before dumping the results into a normal table.
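A minimal sketch of such a staging table (again with hypothetical names); the SCHEMA_ONLY durability keeps the table definition across restarts but not the rows, so its data never touches the log or the data files:

CREATE TABLE dbo.StagingSensorReading
(
ReadingId BIGINT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 2000000),
SensorId INT NOT NULL,
Reading FLOAT NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);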

How to be scalable with limited memory?

Most scenarios mentioned in the documentation are about how you can increase the scalability of your system with in-memory tables.  But how can you ingest data forever into a very bounded amount of RAM?

Well, you don’t, of course.

Either you leverage scenarios that read & write a lot of the same data or you move cold data out.

But this brings us back to the capacity of normal tables to ingest the cold data, doesn’t it?

Yes, but it allows you to sustain a burst for a while.  For scenarios where sources bombard you forever, in-memory tables won’t help you.  But often, those are only peaks.

Take IoT scenarios where sensors send data in batches or upon a physical event (e.g. a door opening).

Think of in-memory tables as a tool helping you get past a threshold where you can’t keep up with a peak:  a burst of subscriptions, a batch of events, etc.



I hope this post gave you a feel for what in-memory tables are for and why they outperform normal tables in many scenarios.

As I mentioned before, they aren’t the mythical silver bullet but a tool in your toolbox.

In this post I focused on OLTP, but in-memory tables are extremely popular in analytics as well.  This is where SQL Server 2016 shines with its in-memory columnstore indexes.