Recurrent serverless batch job with Azure Batch
Most solutions have recurrent batch jobs, e.g. nightly / end-of-month batch jobs.
There are many services we can leverage in Azure to run those. In this article, we are going to explore a service that has “Batch” in its name: Azure Batch.
Azure Batch is typically positioned for big compute since it makes it easy to schedule jobs on a cluster of VMs. But configured properly, it can also run recurrent jobs in a serverless manner.
Azure Batch supports auto-scaling: the number of nodes in the cluster can vary with the load.
A typical workflow for an Azure Batch workload is:
- Create a job
- Populate the job with a series of tasks, each representing part of the problem resolution
- Tasks get scheduled on different nodes
- Job completes
For instance, let’s consider the 3D image rendering problem. We can break the image down into many pieces and spread the rendering of each piece across different tasks. This lets us leverage many nodes (VMs) in the Batch cluster, hence the Big Compute.
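To make the breakdown concrete, here is a minimal sketch (not part of the sample; the function name and tile size are made up) of how a frame could be cut into tiles, with one rendering task per tile:

```python
# Illustrative only: split an image into fixed-size tiles, one rendering
# task per tile. Edge tiles are clipped to the image boundary.

def tile_image(width, height, tile_size):
    """Return (x, y, w, h) rectangles covering a width x height image."""
    tiles = []
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            tiles.append((x, y,
                          min(tile_size, width - x),
                          min(tile_size, height - y)))
    return tiles

# A 1920x1080 frame cut into 256-pixel tiles yields 8 x 5 = 40 tasks.
print(len(tile_image(1920, 1080, 256)))
```

Each tuple would become one task in the job; Batch then spreads those tasks across the nodes of the pool.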
Often the cluster size is fixed but we could also implement auto scaling, where a formula computing the size of the cluster is periodically evaluated. Typically the formula looks at the number of pending tasks to determine the number of nodes.
In our case, we need to run something much simpler: one job, one task, running once in a while (on a schedule). The workflow is hence much simpler:
- Job & its only task are created by the scheduler
- Task gets scheduled on an empty cluster
- Auto scaling kicks in, increasing the size of the cluster from 0 to 1 node
- Task gets executed
- Job completes
- Auto scaling kicks in again, decreasing the size of the cluster from 1 to 0 nodes
Here we’re going to use a simple auto scaling formula: if there is any task to run, the cluster should have one node; otherwise, the cluster should be empty (and run at no cost). This is what we mean by serverless in this article. The servers are fully managed and ephemeral.
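The decision rule can be sketched in a few lines of Python (an illustration of the intended semantics, not code from the sample):

```python
# Simulate the scaling rule: if any pending-task sample observed over the
# recent window is non-zero, target one low-priority node; otherwise zero.

def target_low_priority_nodes(pending_task_samples):
    """pending_task_samples: pending-task counts sampled over the window."""
    return 1 if any(count > 0 for count in pending_task_samples) else 0

print(target_low_priority_nodes([0, 0, 0]))  # idle cluster -> 0 nodes
print(target_low_priority_nodes([0, 1, 0]))  # a task showed up -> 1 node
```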
With some simple configuration we can therefore turn a service designed to run massive workloads into a service that runs a small workload periodically and economically.
While we’re pinching pennies on this solution, we’re also going to leverage Low-Priority VMs.
The sample we are going to walk through is on GitHub.
We are going to run a job written in Python on a Linux cluster, but we could easily reconfigure the sample to run on Windows or a different platform (.NET, Java, Shell script, PowerShell, etc.). Anything that can be invoked from a command line can be run on Azure Batch.
We are going to install some custom software before running our batch to make sure we cover scenarios where this is required. This is often necessary since we start from a vanilla VM with only the OS installed. Alternatively, we could use a custom image with all our software pre-installed.
Deploy ARM Template
To speed things up, we automated the deployment as much as possible.
Azure Batch can’t be fully configured using ARM templates at the time of this writing (mid December 2017). For this reason, we do a first pass with an ARM template, then another pass with a PowerShell script (which could easily be converted into a CLI script and run on macOS / Linux) and finally we’ll do the final configuration using the Portal.
We can deploy the ARM template by using the following link:
The schema for the Batch Account ARM template can be found here.
This template requires three parameters:
- Storage Account Name: it needs to be globally unique
- Batch Account Name: it also needs to be unique within the Azure region
- VM Size: by default it is set to Standard DS-2 v2
The template will provision:
- A storage account used by the batch account
- A batch account
- An auto scaled pool within the batch account
We can look at the batch account in the Portal once it is provisioned. More specifically, we can look at the provisioned pool, aptly named mainPool. We see it has neither dedicated nor low-priority nodes.
If we select the pool, we can then go to its Scale configuration.
We see the template configured it to be auto-scaling. The formula is evaluated every 5 minutes. It basically evaluates whether there has been any pending task in the last 180 seconds. If so, it mandates one low-priority node; otherwise the target is zero nodes.
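An auto scaling formula implementing that behavior looks roughly like this (reconstructed from the description above; the exact sample-window handling in the template may differ):

```
// Take pending-task samples over the last 180 seconds.
$samples = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
// If too few samples are available, fall back on the most recent one;
// otherwise use the maximum observed over the window.
$tasks = $samples < 70 ? max(0, $PendingTasks.GetSample(1)) : max($PendingTasks.GetSample(180 * TimeInterval_Second));
// One low-priority node when there is pending work, zero otherwise.
$TargetLowPriorityNodes = ($tasks > 0) ? 1 : 0;
$TargetDedicatedNodes = 0;
```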
Auto scaling formulas in Azure Batch are explained in the online documentation.
We can force the evaluation of the formula. Right after the cluster is provisioned, it is likely to evaluate to a one-node cluster. This is a side effect of how the formula is written. After a few minutes, the cluster will scale back to zero nodes.
Now we have a batch account with a pool configured, but we have nothing running on it.
The next step is to run the PowerShell script available here (we can either download the document or click on the Raw button in order to cut and paste the content).
First we need to change the three variables at the beginning of the script:
- $rg: The resource group where we deployed the batch account
- $batchAccount: the name of the batch account
- $storageAccount: the name of the associated storage account
The script performs the following operations:
- Copies resources from GitHub to a temporary local folder
- Packages the resources into 2 zip files
- Creates 2 applications
- Uploads the 2 packages into the applications
- Sets the default version of the applications
- Cleans up the temporary local folder
Applications are the way to package content we want to run in a Batch job.
Now we have two applications, which we can see in the application tab of the batch account.
PythonScript is the application we will schedule to run. It is a trivial “Hello world” Python application. PythonSetup is an application we’ll use to set up the nodes where PythonScript is going to run. It runs an install command on the node (a pip command which installs a Python package).
Applications are typically deployed with tasks in Azure Batch: we specify a list of applications with a task and the service makes sure the applications are deployed to the node before the task executes there.
As we’ll see later, we won’t use “normal” tasks. The tasks we will use, at the time of this writing (mid December 2017), do not support application deployment, at least not through the Portal interface. For that reason, we’ll specify that the applications need to be deployed to every provisioned node.
Let’s select mainPool again and in the pool pane then select the Application packages tab.
From there, let’s select the PythonSetup application and version 1.0, then click Save.
Let’s repeat for the PythonScript application.
Note: the best practice would be to pick Use default version as the application version. This would deploy the default version, so if we later changed the default version we wouldn’t need to think about the ripple effect here. But… at the time of this writing (mid December 2017), there is a bug in the Portal where this configuration crashes the nodes when they start. This bug is likely going to be resolved soon.
What this configuration does is make sure the applications are unpacked (unzipped) and copied to each provisioned node.
Now let’s configure the service to use those applications.
Within the pool, let’s select Start task. Let’s also select True for Wait for success. In the Command line text box, let’s write:
/bin/sh -c "$AZ_BATCH_APP_PACKAGE_pythonsetup/setup-python.sh"
As stated in the online documentation, in order to have access to environment variables, we need to run a shell, hence the /bin/sh.
We access the directory where the application was unzipped on the node using an environment variable. The general format of environment variables related to applications is described in the online documentation. In our case, we do not wish to specify the application version, so we point to the default version.
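We can see why the shell matters with a quick local experiment (the path below is a stand-in for the real per-node directory Batch creates):

```shell
# Simulate the variable Azure Batch defines on a node (path is a stand-in):
export AZ_BATCH_APP_PACKAGE_pythonsetup=/opt/batch/apps/pythonsetup

# Passed straight to the OS, the command line would keep the literal
# '$AZ_BATCH_APP_PACKAGE_pythonsetup' text; wrapping it in a shell expands it:
sh -c 'echo "$AZ_BATCH_APP_PACKAGE_pythonsetup/setup-python.sh"'
```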
The actual start task doesn’t install anything. It simply echoes something in the shell.
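For reference, a setup-python.sh matching that description would look something like this (a hypothetical sketch; the file in the sample may differ):

```shell
#!/bin/sh
# Hypothetical sketch of setup-python.sh; the sample's version may differ.
echo "Configuring node for Python jobs"
# A real setup script would install dependencies here, e.g.:
#   pip install --user <some-package>
echo "Node setup complete"
```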
We can now save the start task configuration.
The next step is to schedule jobs. We’re going to do this by selecting Job schedules tab of the batch account. We then select the Add button.
In Job schedule ID, let’s give it a name, e.g. ScheduledScript.
In Schedule, let’s leave the defaults, i.e. no start nor end time, but let’s select 30 minutes as the Recurrence interval.
For Pool, let’s select the main pool.
Under Job manager, preparation and release tasks, we select Custom and configure the Job manager task. This task runs once the job is created. Typically that task would create more tasks, or we would create them manually outside of the job using the Batch SDK. Here we’ll keep the job simple and run what we need directly within that task.
In task ID we’ll type only-task. In the command line we’ll type:
/bin/sh -c "python $AZ_BATCH_APP_PACKAGE_pythonscript/sample.py"
We can then press Select for the Job Manager and Ok for the Job Schedule.
The job schedule creates a new job at every recurrence interval (30 minutes in our case).
Each job will schedule a Job Manager task, which will eventually kick the auto scaling into creating a node. This will in turn trigger the start task. After the start task completes, the Job Manager task will be able to execute. Eventually the auto scaling will retire the node.
While the executing node is still up we can look at the task execution. We can look at the job schedule’s jobs and select the latest job. From there we can look at its tasks:
Within the task, we can look at the Files on node, more specifically stdout.txt.
This shows what the running task, in our case the Python script, wrote to the console.
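A sample.py fitting this walkthrough could be as simple as the following (a hypothetical sketch; the script in the sample repository may differ):

```python
# Hypothetical sketch of sample.py; all it needs to do is write to stdout,
# which Azure Batch captures in the task's stdout.txt file.
import platform


def main():
    print("Hello world from Azure Batch!")
    print("Running on node: " + platform.node())


if __name__ == "__main__":
    main()
```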
In this article we showed how we can use Azure Batch to run recurrent batch jobs.
There are a few configurations to get right but the general workflow is quite straightforward: schedule a recurrent job and set a job manager running whatever we need to run.