HDInsight Hadoop Hive - Setup

Let’s explore Hadoop Hive, shall we?

Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

It was originally built by Facebook as an abstraction on top of Hadoop MapReduce and is now an open source (Apache) project.

(See my Hadoop ecosystem overview here)

As with the other main Hadoop components, you have a few options for working with it:  install it yourself on-premises, run it in virtual machines, or use a managed cloud service.

Yes, if you read anything in this blog, you know I’m gonna go with the Azure option!

I’ll give Azure HDInsight credit for two major things:

  1. It performs all the cluster installation / maintenance for you
  2. It (mostly) shields you from the Unixy mix of 3000 open source projects that is Hadoop

And for that I am grateful!  When I want to use Hive I don’t want to fight with Bash shells, I want to run some SQL over Big Data.


Just before we start…  what is Hive in a nutshell?

Hive allows you to query massive data sets (Big Data) without hand-coding Java MapReduce jobs on Hadoop.

So it’s an abstraction layer and a productivity enhancer.
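
To make that concrete, here is the kind of query you would write in Hive (HiveQL) instead of hand-rolling a Java MapReduce job.  The table and columns (web_logs, country, event_date) are made up purely for illustration:

    -- Made-up example: top 10 countries by page views for a given day.
    -- Hive compiles this into MapReduce jobs behind the scenes; no Java required.
    SELECT country, COUNT(*) AS page_views
    FROM web_logs
    WHERE event_date = '2015-06-01'
    GROUP BY country
    ORDER BY page_views DESC
    LIMIT 10;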

Setting it up

Alright, Azure Portal (preview) everyone.  Yeah!  Give the preview portal some love!

SQL Database

What?  Yeah…  there’s the thing…

Here I’ll show you how to manage a Hadoop / Hive cluster on the cheap.  One of the nice features of Azure HDInsight is the ability to externalize the Hive metastore (where the schemas of your tables are stored) from the cluster.

This will allow you to tear down the cluster without losing your Hive Metadata.

By default the metastore is persisted in a Derby database (yes, chances are you’ve never heard of it; remember what I told you about Azure HDInsight shielding you from all those open source techs?) which is hosted on the cluster.

The alternative is to host it on an Azure SQL Database, outside the cluster.
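
To see why this matters, keep in mind that the metastore only holds metadata:  table names, column types and data locations, not the data itself.  Here is a minimal HiveQL sketch (the sales table is made up) of what an external metastore buys you:

    -- Define a table on the original cluster; the schema lands in the metastore.
    CREATE TABLE IF NOT EXISTS sales (
      sale_id    BIGINT,
      amount     DOUBLE,
      sale_date  STRING
    );

    -- Later, on a brand new cluster pointing at the same Azure SQL Database metastore:
    SHOW TABLES;       -- 'sales' still shows up
    DESCRIBE sales;    -- the schema comes straight back from the metastore

The data behind sales is a separate story; it lives in the storage account we’ll configure in the Data Source section below.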

So let’s create a DB.


Top left corner, (+ NEW), Data + Storage, SQL Database:  this opens up the New SQL Database blade.

Go for it.  Just be careful of a few things, namely the region (keep it the same as your future cluster) and the admin credentials (you’ll need them later to wire up the metastore).

HDInsight

Now the HDInsight cluster!


Top left corner, (+ NEW), Data + Analytics, HDInsight:  this opens up the New HDInsight Cluster blade.

Cluster Name:  choose one; it needs to be unique within the universe of HDInsight, hence I won’t give you mine, as you would clash with it.

Cluster Type:  Hadoop.

Cluster Operating System:  leave it as Windows Server.

Make sure it’s in the right subscription and the right resource group.

Click Credentials, that’s going to pull the credentials blade.

Credentials

Those are the credentials used to authenticate to the cluster.  We’re just toying around, so don’t go overboard.  You do not need to enable remote desktop for this.

Data Source

Ok, Data Source.  This will open the…  Data Source blade!

One of the differentiators of Azure HDInsight compared to other Hadoop implementations is that the distributed file system (HDFS) is implemented on top of Azure Blob Storage.  This is commonly known as WASB: Windows Azure Storage Blob.

To make a long story short:

  1. Your cluster’s files live in an Azure Storage account (blobs), not on the cluster’s local disks
  2. Hadoop accesses them through the usual HDFS API, so jobs and tools work as if it were plain HDFS
  3. The storage account sits outside the cluster and isn’t deleted with it

The last point is key.  It means you can tear your cluster down and keep your files.  You can later recreate a cluster and continue where you left off.
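
Here is a sketch of what that looks like from the Hive side once the cluster is up.  The table name is hypothetical, and you would substitute your own container and storage account for the placeholders:

    -- Hypothetical external table whose data lives in blob storage (WASB)
    -- rather than on the cluster's local disks.
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
      event_time  STRING,
      payload     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'wasb://<container>@<account>.blob.core.windows.net/raw-events/';

Drop the table, or the whole cluster, and the files under raw-events/ are still sitting in your storage account.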

So…  I suggest you create a storage account just for your Hive discovery.


Again, give it a unique name.  You can choose your default container, whose name will itself default to the name of your cluster.

Node Pricing Tier

Again, we’re in discovery, so let’s go easy on it:  one worker node, and you can drop the VM size for the worker & head nodes to A3 (the cheapest).

Optional Configuration


In the optional configuration, select External Metastores and pick the SQL Database we created earlier.

Only do that for the Hive metastore (we won’t use Oozie today).  You can give your server admin password for today, but DO NOT DO THAT in production.  In production, of course, create a separate user that has rights only on that specific database.

Create

Finally, hit the create button!  This will churn away for a little while and you’ll have your cluster ready soon!

Key takeaways

There are two pieces of your Azure HDInsight configuration that can survive the deletion of your cluster:

  1. HDFS content (via a storage account)
  2. Hive Metastore (via an Azure SQL Database)

This makes Azure HDInsight a component you can spin up to do heavy computation and tear down until next time.

You could even go a step further, back up your database to a .bacpac and tear down the SQL Database until next use.  This way you would end up paying only for storage, which is dirt cheap compared to both SQL and Hadoop.

In the next blog entry, we’ll actually use the Hadoop cluster to do some Hive!

