People who do not know Hadoop think it’s a big data technology the same way SQL Server is a technology.
But Hadoop is more of an ecosystem of different modules interacting together. This is its major strength and also the source of its major weakness, i.e. its lack of strong cohesion.
In this blog post, I’ll give an overview of that ecosystem. If you’re new to Hadoop, this could serve as an easy introduction. If you are steeped in one Hadoop technology, this post could give you the opportunity to look around the ecosystem at things you aren’t specialized in.
Hadoop is an open-source Apache project. It groups together a number of modules that are themselves open-source projects.
In order to use Hadoop, you need to download quite a few of those projects, make sure their versions are compatible and assemble them.
HDInsight is a Windows-based Hadoop distribution developed by Hortonworks & Microsoft.
Azure HDInsight is a managed service on Azure. It allows you to create a fully managed Hadoop cluster in Azure. It uses HDInsight for its Windows implementation, but it also supports a Linux implementation.
As mentioned above, Microsoft supports Cloudera & Hortonworks in virtual machines, that is, if you install them in Azure VMs. But Azure HDInsight is more than that. It is a managed service, i.e. you do not need to worry about the VMs; they are managed for you. Also, you can scale your cluster out in a few mouse clicks, which is quite convenient.
On top of being a managed service, Azure HDInsight is built to use external storage (i.e. blob storage & SQL Azure) in order to make it possible to create clusters temporarily but keep state between activations. This creates a very compelling economic model for the service.
Hadoop Distributed File System (HDFS) is a distributed file system. It is built to make sure data is local to the machine processing it.
Azure HDInsight substitutes Windows Azure Blob Storage (WASB) for HDFS. This makes sense since WASB is already a distributed, scalable, highly available file system, and it also mitigates a shortcoming of HDFS.
HDFS requires Hadoop to be running to be accessible, and it requires the cluster itself to store the data. This means Hadoop clusters must stay up, or at least keep their VMs allocated, for the data to persist.
WASB enables economic scenarios where the data is externalized from Hadoop, so clusters can be brought up and torn down while always using the same data.
The basic paradigm for distributed computing in Hadoop, Map Reduce consists in splitting a problem’s data into chunks, processing those chunks on different compute units (the map phase) and assembling the results (the reduce phase).
Map Reduce jobs are written in Java (packaged in JAR files) and scheduled as jobs in Hadoop.
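To make the paradigm concrete, here is a minimal single-process sketch of the three phases using the canonical word-count example. This is plain Python for illustration only; real Hadoop jobs are written in Java and run distributed across the cluster.

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit a (key, value) pair for each word in a chunk of text.
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(pairs):
    # Group values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the values for each key.
    return {key: sum(values) for key, values in groups.items()}

# Each chunk would be mapped on a different node in a real cluster.
chunks = ["big data is big", "data is data"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

The key property is that the map phase touches each chunk independently, which is what lets Hadoop spread the work over many machines.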
Most Hadoop projects are higher-level abstractions leveraging Map Reduce without the complexity of writing the detailed implementation in Java. For instance, Hive exposes HiveQL as a language to query data in Hadoop.
Azure Batch borrows a lot of concepts from Map Reduce.
Yet Another Resource Negotiator (YARN) addresses the shortcomings of Hadoop’s original scheduling manager. Map Reduce 2.0 is built on top of YARN.
The original Job Tracker had scalability issues on top of being a single point of failure.
Apache Tez is an application framework allowing complex directed acyclic graphs (DAGs) of tasks to process data. It is built on top of YARN and is a substitute for Map Reduce in some scenarios.
It tends to generate fewer jobs than Map Reduce and is hence more efficient.
Apache Hive is a data warehouse enabling queries and management over large datasets.
As mentioned earlier, it is an abstraction on top of Hadoop lower level components (HDFS, Map Reduce & Tez).
Apache HBase is a NoSQL database running on top of HDFS. In some scenarios HBase performs blazingly fast on queries through massive data sets.
Apache Mahout is a platform to run scalable Machine Learning algorithms leveraging Hadoop’s distributed computing.
It has quite an overlap with Azure ML.
A high-level scripting language (Pig Latin) plus a run-time environment, Apache Pig is another abstraction on top of Map Reduce.
Where Hive took a declarative approach with HiveQL, Pig takes a procedural approach. A Pig Latin program is actually quite similar to a SQL Server Integration Services (SSIS) package, defining different steps for manipulating data.
Hive & Pig overlap. You would choose Pig in scenarios where you are importing and transforming data and would like to be able to see the intermediate steps (much like in an SSIS package).
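To illustrate the procedural style, here is a hypothetical Pig-flavoured pipeline sketched in plain Python (not actual Pig Latin; the data and step names are made up). Each step materializes an intermediate result you can inspect, mirroring how a Pig script chains LOAD, FILTER, GROUP and FOREACH ... GENERATE statements.

```python
# LOAD: raw (user, purchase amount) tuples, a stand-in for data on HDFS.
records = [
    ("alice", 30), ("bob", 5), ("alice", 20), ("carol", 50), ("bob", 2),
]

# FILTER: keep purchases of at least 5 (an inspectable intermediate).
filtered = [r for r in records if r[1] >= 5]

# GROUP BY user.
grouped = {}
for user, amount in filtered:
    grouped.setdefault(user, []).append(amount)

# FOREACH ... GENERATE: compute a total per user.
totals = {user: sum(amounts) for user, amounts in grouped.items()}
print(totals)  # {'alice': 50, 'bob': 5, 'carol': 50}
```

In HiveQL the same result would be one declarative statement; in Pig, each named intermediate (`filtered`, `grouped`) is a value you can examine or reuse, which is exactly the SSIS-like quality mentioned above.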
Apache Sqoop is a tool for Hadoop data movement involving both relational and non-relational data sources.
It has functionalities very similar to Azure Data Factory.
Apache Spark is actually complementary to Hadoop. In Azure it is packaged as an HDInsight variant.
In many ways Spark is a modern take on Hadoop, and it is typically faster, up to 100 times faster on some benchmarks.
Spark uses a directed acyclic graph execution engine (a bit like Tez) and leverages in-memory operations aggressively.
On top of its speed, the most compelling aspect of Spark is its consistency. It is one framework to address scenarios of distributed computing, queries over massive data sets, Complex Event Processing (CEP) and streaming.
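A toy sketch of Spark’s execution model may help (this is illustrative Python, not PySpark): transformations such as map and filter only record a plan, a DAG of steps, and nothing runs until an action like collect() forces evaluation. The `LazyDataset` class here is entirely hypothetical.

```python
class LazyDataset:
    """Toy stand-in for a Spark RDD: lazy transformations, eager actions."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns a new dataset with an extended plan.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._plan + (("filter", pred),))

    def collect(self):
        # Action: only now is the whole plan executed, in memory.
        result = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

ds = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(ds.collect())  # [0, 4, 16]
```

Deferring execution this way lets an engine see the whole DAG before running it, which is what allows Spark to fuse steps and keep intermediate results in memory instead of writing them to disk between jobs.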
There are many more Hadoop modules / projects! I have only talked about those I know.
You can find a more complete list over here.
Hadoop is a very rich ecosystem. Its diversity can sometimes be seen as a weakness, as each project, while leveraging common low-level components such as HDFS & YARN, is its own little world with different paradigms.
Nevertheless, it is one of the most popular big data platforms on the market.