Alerts on Azure Function failures


So, you have a few functions running. Maybe some of those functions are important and you would like to be alerted when they fail.

In this article I’ll cover that exact scenario. We will do that in the following steps:

  • Deploy a recurrent function which fails 50% of the time (by design)
  • Define a query we can use for alerts
  • Define an alert
  • Sit back and wait for the alerts to pop in

This could be useful in many scenarios. Maybe you want to get notified when some serverless compute fails before an end user picks up the phone. Or maybe those are jobs running in the background and it would take a while to notice. You might want to detect both failures and the simple fact that a job didn’t run.

As usual, the code is on GitHub.

Deploying the sample solution

Let’s go ahead and deploy our sample solution:

Deploy button

The ARM Template has two parameters:

  • storageAccountName: Name of the storage account. The storage account is used by the function app.
  • functionAppName: Name of the function app. The name needs to be globally unique as it is mapped to a DNS entry.

It is important to deploy the solution in a region where Azure Application Insights is available. If the resource group already exists, the region is inherited from it.

This template should deploy the following resources:

Resources

Let’s go into the function app, i.e. the App Service resource.

There should be one function deployed there: recurrent-function.

Note: the function was deployed inline via the ARM Template using the trick found in this quickstart template.

If we look at the code of that function:

using System;

public static void Run(TimerInfo myTimer, ILogger log)
{
    // Flip a coin: the function succeeds or fails with equal probability
    var isSuccess = new Random().NextDouble() < .5;

    log.LogInformation($"C# Timer trigger function executed at: {DateTime.Now}");

    if (isSuccess)
    {
        log.LogInformation("Success");
    }
    else
    {
        log.LogInformation("Failure");

        // Throwing marks the invocation as failed in Application Insights
        throw new ApplicationException("Failure");
    }
}

We can see why the function fails 50% of the time: we draw a random number and basically flip a coin on each call.
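The coin-flip behaviour is easy to reproduce outside Azure. Here is a small Python sketch (the names run_once and failure_rate are made up for illustration) simulating many invocations and measuring the failure rate:

```python
import random

def run_once(rng):
    """Simulate one invocation: ~50% of calls raise, like the sample function."""
    if rng.random() < 0.5:
        return "Success"
    raise RuntimeError("Failure")

def failure_rate(runs=10_000, seed=42):
    """Invoke the simulated function many times and measure the failure rate."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(runs):
        try:
            run_once(rng)
        except RuntimeError:
            failures += 1
    return failures / runs
```

Over many runs the measured rate converges near 0.5, which matches what we see in the function’s monitoring profile below.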

We can run the function a few times and see it failing and succeeding.

If we look at the function monitoring (available since we hooked up Azure Application Insights), after a while we should see a profile like this:

Function runs

If we look at the Integrate menu item, we can see the Cron expression 0 */10 * * * * attached as a schedule. This means the function is set to run every 10 minutes.
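The schedule is a six-field NCRONTAB expression (seconds come first, unlike classic five-field cron). A minimal sketch, assuming only the `*` and `*/n` forms, showing which minutes the `*/10` field matches:

```python
def matches_field(field, value):
    """Check a single cron field of the form '*', '*/n' or a literal number."""
    if field == "*":
        return True
    if field.startswith("*/"):
        step = int(field[2:])
        return value % step == 0
    return value == int(field)

# Fields of "0 */10 * * * *": second, minute, hour, day, month, day-of-week.
# The minute field "*/10" fires on every minute divisible by 10:
fire_minutes = [m for m in range(60) if matches_field("*/10", m)]
# fire_minutes == [0, 10, 20, 30, 40, 50]
```

This is why the function runs six times an hour, at the top of every 10-minute interval.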

Queries

Alerts are based on log queries. Let’s go into Application Insights and develop a query that can help us detect failures.

Log pane

The Application Insights tables are displayed on the left. If we double-click on the requests table name it should appear in the query window. Alternatively, we can simply type it:

requests

We can then press SHIFT-ENTER or click the Run button to see all the requests recorded in the App Insights’ instance.

Let’s refine the query to look only at our function. This is redundant in our case since there is only one function, but in a more realistic context it is useful to filter out other workloads:

requests
| where timestamp > ago(30m)
| where name=="recurrent-function"
| sort by timestamp desc 

We could also filter against cloud_RoleName, which should be the name of the function app (globally unique, hence we didn’t include it here).

Here we also sorted by timestamp to see the requests in chronological order, and filtered for the requests of the last 30 minutes. Limiting the time range is recommended as it speeds up the queries.

We could also keep only the successful requests:

requests
| where timestamp > ago(30m)
| where name=="recurrent-function"
| where success==true
| sort by timestamp desc 

This gives us all the successful requests of the last 30 minutes.

This could be the basis for an alert. Since the function is supposed to run every 10 minutes, we should expect to see a successful request every 10 minutes. If we don’t, it means the function either didn’t run or failed.

Azure Application Insights takes a few minutes to make the telemetry available. It is usually done in under 10 minutes.

Setting up alerts

To set up an alert, let’s click on the + New alert rule button.

New Alert rule

We’ll first set up the condition. In the alert logic, we’ll leave Based on at Number of results, set the Operator to Less than and the Threshold value to 1. We’ll then set the Period (in minutes) to 10 and the Frequency (in minutes) to 5.

So basically, every 5 minutes the alert runs the query we wrote over the last 10 minutes (this overrides our where timestamp > ago(30m) filter), and if there is less than one result, an alert is raised.

We then set up an action group to send us an email.

Finally, we can name the alert Function failure.

We can then create the alert.

Awaiting alerts

We can now wait for alerts. Basically, every 10 minutes we have a 50% chance of receiving an email.

We can look at the function Monitor pane to see when the function fails and when it succeeds. Emails should follow the same pattern.

Monitor

Summary

Although this scenario was specific, it is quite easy to generalize it.

An alert is always based on a query over the telemetry. If we can formulate a query revealing bad behaviours, we can set up an alert on those behaviours.

This is part of the automation of a solution. We do not want to wait for our end users to discover that our site is slow or that some operations are failing. We want to be pro-active. For that we need visibility and alerts are a great way to make us aware of what is going on.

Of course, alerts are not the only tool at our disposal. Plots & dashboard can also help. In our case, we could have plotted the success rate of the function.

Customers often ask me, “what should our alerts be?” Although there are a couple of classics such as page load time, request time and exceptions, the truth is every solution is different and you learn what is normal and what is not as you operate it. At first you might be blind to some metrics and oversensitive to others. Only with time and experience will you fine-tune those and have a well-rounded alerting system.

