Demystifying Cloud Foundry's Diego

July 18, 2016

This post serves a specific purpose: to take Cloud Foundry’s new runtime environment and break it down into parts that are, hopefully, a bit easier to understand. Diego was, for me, something that seemed a bit obscure until I dug into its vitals. The documentation gave me something of an idea of the runtime environment, but I wanted to see if I could describe it in a plain, easy-to-understand manner. I’m hoping this post finds other people like me, largely beginners to Cloud Foundry, and helps them hit the ground running a bit quicker with what is a pretty cool way to host Cloud Foundry applications.

First off

Diego’s meant to be a replacement for Cloud Foundry’s old runtime environment, the DEA (Droplet Execution Agent). In fact, the original purpose was simply to update the DEA, which is written mostly in Ruby, to a more modern environment written in Go.

DEA-Go, Carl. DEA-go.

Containers

Diego is, first and foremost, an environment for containers. That said, it doesn’t use Docker. Instead it uses Garden, which follows the Open Container Initiative guidelines for hosting containers. Garden’s documentation is mostly Go documentation, which helps a lot if you know the format of godocs. For the most part, you won’t ever touch Garden directly; it’s abstracted behind Diego’s other components. Since Garden follows the open specification, you should be able to import Docker images just as easily as if they were in the native Docker environment.

If you’re used to the DEA, Garden gives you a few advantages out of the box. First off, Garden supports more operating systems than the DEA’s Warden. Diego and Garden also make it a lot easier to SSH into the containers involved in your application.

Management of Containers

From the standpoint of your application, here’s what you need to know: in Diego, you now have the choice to push a one-off job (a Task) or a more traditional application that stays resident (a Long-Running Process, or LRP). A good example of an LRP is a web server that you need always listening for traffic, while a Task may be something like a database migration as part of a release, or a job that examines recent data for something specific. Before, in the DEA, you really only pushed processes that were expected to stay resident. Diego’s Brain and health monitor make sure this work is balanced as well as possible, spreading CPU-intensive or memory-hungry work across virtual machines, and so on. Where some of this used to be done by the Cloud Controller, the Diego environment now handles it itself.
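To make the Task versus LRP distinction concrete, here is a minimal sketch in Python. The class names, fields, and the `should_restart` helper are my own illustration and not Diego’s actual API; the only point is that a finished Task is done, while an exited LRP instance needs a replacement.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """One-off work that runs to completion, e.g. a database migration."""
    name: str
    command: str
    memory_mb: int


@dataclass
class LongRunningProcess:
    """Work that should stay resident, e.g. a web server."""
    name: str
    command: str
    memory_mb: int
    instances: int  # how many copies should be running at all times


def should_restart(work, exited: bool) -> bool:
    """An exited LRP instance needs replacing; a finished Task does not."""
    return exited and isinstance(work, LongRunningProcess)


# Hypothetical examples matching the ones in the text above.
migration = Task("db-migrate", "rake db:migrate", memory_mb=256)
web = LongRunningProcess("web", "bundle exec rackup", memory_mb=512, instances=3)
```

Calling `should_restart(web, exited=True)` returns `True`, while the same call for `migration` returns `False`, which is roughly the difference in how the two kinds of work get treated.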

Getting a bit further into the weeds, pushing an application to Cloud Foundry using Diego does the following:

It contacts the Diego Brain, which immediately has the Auctioneer announce to the Diego Cells that there is a new Task or LRP to be placed, and how many cells it should use.
It lets the Converger know what the application expects to have running at any time, so that if there is a change, it can immediately set up a replacement.
The Diego Cells run the work at hand, constantly updating the bulletin board system (BBS) with necessary information (such as CPU usage) that allows the Auctioneer and the Converger to ensure the app is running according to plan. Diego uses etcd to back the BBS.
What isn’t handled by the BBS is handled by Consul. This mostly means load balancing and locks, which make sure only the right process is handling the right job (for example, in Diego there can be only one Auctioneer at any time, but if that Auctioneer goes away, something else must pick up the lock).
Various other Diego-specific processes (Nsync, TPS, Stager, and so forth) exist as brokers, passing information from the cells to the right consumers so that things are pushed in a safe manner, and so that information gets back through the right channels when things are not so safe.
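The auction step in the list above can be sketched in a few lines of Python. This is a toy model, not Diego’s real algorithm: the `Cell` class, the `auction` function, and the bidding rule (fewest copies of the same app first, then most free memory) are all assumptions I picked to show how instances end up spread across cells.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    free_memory_mb: int
    running: list = field(default_factory=list)  # names of apps placed here


def auction(cells, app_name, memory_mb, instances):
    """Place each instance on the cell with the best 'bid'.

    Bid rule (an assumption for illustration): prefer the cell running the
    fewest copies of this app, then the one with the most free memory.
    """
    placements = []
    for _ in range(instances):
        candidates = [c for c in cells if c.free_memory_mb >= memory_mb]
        if not candidates:
            break  # no capacity left; remaining instances stay unplaced
        winner = min(
            candidates,
            key=lambda c: (c.running.count(app_name), -c.free_memory_mb),
        )
        winner.free_memory_mb -= memory_mb
        winner.running.append(app_name)
        placements.append(winner.name)
    return placements


cells = [Cell("cell-a", 1024), Cell("cell-b", 2048), Cell("cell-c", 512)]
placements = auction(cells, "web", memory_mb=512, instances=3)
print(placements)  # ['cell-b', 'cell-a', 'cell-c']
```

Each of the three instances lands on a different cell, so losing any single cell takes out only one copy of the app.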

So why all the complexity?

A mature deployment in a virtualized environment has several factors that keep it stable and productive. One, the service is highly available, meaning that if one piece of it dies, it is able to recover effectively. Two, there is monitoring in place to ensure that if one piece of it dies, it will recover or be replaced quickly, often without human involvement. Three, capacity issues (CPU, disk space, memory, or another factor) are handled in a proactive manner when at all possible.

The Diego environment, therefore, is set up to have a state it considers good, and to have every piece of software that supports it work towards that good state. Health checking, balancing apps so capacity is spread across the entire available environment, and deploying new applications in a way that preserves stability all keep Diego self-sustaining. The environment seems complex, but each part is a service with a specific goal that serves the greater goal.
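The idea of working towards a good state can be sketched as a comparison between what should be running and what actually is. This is a simplified illustration in Python, assuming the state is just a count of instances per app; the `converge` function and the action tuples are hypothetical and not how Diego’s Converger is actually written.

```python
def converge(desired, actual):
    """Compare desired and actual instance counts and return corrective actions.

    desired / actual: dicts mapping app name -> number of instances.
    """
    actions = []
    for app in sorted(set(desired) | set(actual)):
        want = desired.get(app, 0)
        have = actual.get(app, 0)
        if have < want:
            actions.append(("start", app, want - have))
        elif have > want:
            actions.append(("stop", app, have - want))
    return actions


desired = {"web": 3, "worker": 2}
actual = {"web": 2, "worker": 2, "old-app": 1}
actions = converge(desired, actual)
print(actions)  # [('stop', 'old-app', 1), ('start', 'web', 1)]
```

Run on a loop, a check like this is what lets the environment repair itself: when actual matches desired, the list of actions is empty and nothing happens.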

Written by:
Seth Lindberg

Senior Cloud Engineer
