Azure Data Explorer (Kusto)

Let’s talk about Azure Data Explorer (ADX), also known as Kusto.

If you ask me, it is the best-kept secret in Azure.

Well, it isn’t exactly a secret but most people do not know about it or if they do, they just think of it as the back-end engine behind Azure Monitor.

ADX is an Azure analytics service. It is great at analyzing large volumes of near real time telemetry such as logs and IoT data.

Isn’t that what Azure Datawarehouse is supposed to do? Or Azure Databricks?

In this article, I’ll go over the characteristics of the service: what its strengths are and where it is complemented by other services.

I started with a huge essay trying to cover every aspect, but I was bored writing it, so I guess it wouldn’t have been very exciting reading material. I went with a much lighter version. I’ll explore it further in future articles.

Update 23-06-2020: To see Kusto in action, I recommend the article Exploring a data set with Kusto.

Scale & Performance

The online documentation says it scales to terabytes of data in minutes.

That is true but it is also true of many distributed data services.

The uniqueness comes in what we can do at that scale.

At heart, Azure Data Explorer (ADX) is about… data exploration. It is a real challenge to explore data at the terabyte scale with little data preparation, i.e. no defined indexes and no pre-computed aggregations.
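
To make "no pre-computed aggregations" concrete, here is a toy sketch in plain Python. It is not ADX or KQL and says nothing about how the engine works internally; the events, the device names and the ad_hoc_count helper are all made up for illustration. It only contrasts an aggregate decided up front with a question asked after the fact over the raw events.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical telemetry events: (timestamp, device, level).
# All names and values are invented for this illustration.
START = datetime(2020, 2, 1, 12, 0, 0)
EVENTS = [
    (START + timedelta(seconds=0), "sensor-a", "info"),
    (START + timedelta(seconds=10), "sensor-b", "error"),
    (START + timedelta(seconds=20), "sensor-a", "error"),
    (START + timedelta(seconds=50), "sensor-a", "info"),
    (START + timedelta(seconds=55), "sensor-b", "error"),
]

# Pre-computed aggregate: chosen up front, answers exactly one question.
errors_per_device = Counter(dev for _, dev, lvl in EVENTS if lvl == "error")


def ad_hoc_count(events, key, since, until, predicate=lambda e: True):
    """Count events grouped by `key` inside [since, until), scanning raw events."""
    counts = Counter()
    for event in events:
        if since <= event[0] < until and predicate(event):
            counts[key(event)] += 1
    return dict(counts)


# A question nobody planned for: events per level over the last 30 seconds.
now = START + timedelta(seconds=60)
last_30s_by_level = ad_hoc_count(
    EVENTS, key=lambda e: e[2], since=now - timedelta(seconds=30), until=now
)

print(errors_per_device)
print(last_30s_by_level)
```

The pre-computed counter is cheap to read but cannot answer the second question. The ad hoc function can answer it, and any other grouping or time window, at the price of scanning the raw events. Doing that kind of scan quickly at the terabyte scale is the hard part.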

Near real time

Integration

ADX has an impressive gallery of integrations for such a young service:

The list is growing and doesn’t contain only Azure technology. ADX can therefore easily be part of a bigger solution.

What ADX isn’t optimal for / stretch scenarios

The public cloud brought a lot of fragmentation to data services. Although part of the reason for that is the youth of public cloud technologies, it is also due to inherent characteristics of big data analytics in the cloud:

Since we do not own the hardware the workloads are running on, we do not have to get married to one technology and run everything on it to amortise the cost of said hardware / licence. We can use the best tool for the job.

This is a balancing act as we need to take the skill set of people into account.

Most of the scenarios we are citing here can be done with ADX but it wouldn’t be the best platform to do so.

Scenario | Why | Azure PaaS Alternatives
Data warehouse | For starters, ADX is mostly an append-only store. It isn’t transactional, doesn’t have log journals, etc. This is part of the reason it is so fast, but also part of the reason it is a poor fit for a data warehouse. Also, although it is very fast, pre-computed aggregations would be better for dashboards. For the sceptics, the rumors of data warehousing’s death have been greatly exaggerated. | Azure Synapse & Power BI Premium
Application back end | As with data warehousing, ADX isn’t built for transactional workloads. | Cosmos DB, Azure SQL DB, Azure PostgreSQL, Azure MySQL, Azure MariaDB
Machine Learning (ML) training | Although ADX supports some built-in ML algorithms (mostly clustering algorithms and statistical tools at the time of this writing, i.e. February 2020), it isn’t an ML training platform. It is excellent for running predictions on a pre-trained model though. | Azure ML, Spark (Azure Databricks or Azure HDInsight), Azure Batch & Data Science Virtual Machine (DSVM)
Sub-second streaming | ADX can go as low as seconds of latency between ingesting data and being able to do analytics on it (i.e. events are still indexed and can be queried). Most “near real time” scenarios fall comfortably within that window. But it isn’t a sub-second streaming platform (e.g. for low-latency trading). | Structured Streaming in Continuous Mode in Spark (Azure Databricks or Azure HDInsight), Kafka Streams on Azure HDInsight, Flink on Azure HDInsight

Concrete scenarios

Here are some scenarios we’ve seen in different industries. This is by no means an exhaustive list, just the popular scenarios.

Quite a few customers are using ADX / Kusto to analyze unified logs, i.e. logs from on-premises systems and different clouds. This is typical log analysis, so it could be for security, reliability engineering, forecasting, etc.

IoT telemetry analysis is quite popular. As customers capture telemetry, they want to mine that data.

We see different businesses using it to analyze transactions (sales) to understand customer behaviours, predict trends or spikes and optimize go-to-market strategy. What if, within days of launching a new product, we could figure out which customer segments are gaining traction and which ones are lagging?

In general, we see customers starting with historical analysis and then moving to more and more real time analysis as the teams get more comfortable with the service.

Summary

We hope we managed to give a good idea of what ADX can do.

It is also important to note that it is the data platform for other Azure Services:


3 responses

  1. Nidhi 2021-08-09 at 03:01

    We can use Azure Data Explorer for near real time calculations on telemetry data. Azure Databricks can also natively stream data from IoT Hubs directly into a Delta table on ADLS and display the input vs. processing rates of the data. I wanted to know in which use cases ADX is significantly better than Databricks streaming ingestion (workload and cost wise). Currently we are using ADX but want to move to Databricks for machine sensor data. How can we convince the client which is better, and why?

  2. Vincent-Philippe Lauzon 2021-08-12 at 17:59

    Hi Nidhi.

    The “when to use which Azure Data Service” question is common, but somehow we apparently fail to give a simple answer. The topic doesn’t lend itself to a two-sentence answer as there are many scenarios / tasks one could consider in the world of data.

    But since you narrow down the question to structured streaming in Spark vs real time ingestion in ADX, let’s try to answer that one.

    The way I like to address it is to first observe that, when you think about it, they aren’t the same scenario. Spark enables us to pre-define aggregates (e.g. count by categories) and monitor those aggregates in near real-time. It does that very well and, on a well-tuned cluster, can be very efficient. ADX ingests the data and allows us to query it ad hoc. So with ADX you have the freedom to query your data in any way, i.e. you do not need to pre-define any aggregates. You can also look at any window of time: the last 30 seconds, the last hour, the last year, etc. You can perform any type of analytics; for instance, you could run a regression on all your data, including the data from the last couple of seconds…

    You could also define a pre-aggregate in ADX in the form of a materialized view. Again, it serves a different purpose as it isn’t limited to a time window. Of course that freedom comes at the cost of performance: ADX is actually ingesting the data, i.e. committing it to disk in a compressed and indexed format.

    So I would say look at the exact scenario.

    You can have a look at a presentation I did about a year ago: https://iterationinsights.com/article/azure-data-explorer-3-scenarios/. More specifically, you can look at the last slide giving a “best of breed” for different scenarios here: https://iterationinsights.com/wp-content/uploads/2020/10/Calgary-Azure-Analytics-ADX-VPL.pdf.

    Hope that was helpful.

  3. Nidhi 2021-08-16 at 03:07

    Thanks Vincent….This is really helpful
