August 31, 2015
At my first job, there was no code-driven infrastructure at all. A nightmare from today's perspective.
At my next job, there were some Perl scripts to massage systems in an imperative way.
Then, at SoundCloud, I used modern configuration management for the first time. SoundCloud invested heavily in Chef, which was, at some point, used to drive almost every aspect of the infrastructure. A new service? Add a cookbook.
Compared to a world without any proper infrastructure automation, this obviously was a great step in the right direction, but it also became a growing pile of technical debt and a constant source of bikeshedding around how to actually use it.
Maybe you don’t have those problems. Maybe you have strict guidelines on how to use your configuration management, or some chief architect to tame the chaos, but I argue that a lot of larger companies with agile and independent teams run into similar problems. If you disagree, please leave a comment.
There are several categories of problems with this setup. Some are of an organizational nature, like the way it was (ab)used to drive the complete infrastructure. There are also very specific issues with Chef, like its internal complexity, both from an operational perspective and in its interface. From its 15 unintuitive precedence levels for node attributes to its multi-step execution flow and leaky cookbook abstractions, it's often frustrating to use, and it accumulates a lot of technical debt because it's hard to test and refactor without potentially breaking your infrastructure.
That being said, there isn’t an obvious better alternative. In the meantime I have used Ansible and Salt, which have their very own problems. Even though they try to be less complex than Chef, they depend heavily on template-driven metaprogramming and, much like Chef, struggle with proper code reuse and testability.
Over the years I came to the conclusion that configuration management, in its current form and as it is actually used, has some fundamental design issues.
The idea of defining a (distributed) system's state and mutating it to eventually converge to the desired state is sound, but the interface the distributed systems provide is simply too complex.
All the mentioned configuration management systems use an agent or SSH access to execute commands on the systems, similar to the imperative design of user interfaces: you run commands in a specific order to modify the state of the systems.
But since there is no unified interface to configure applications, how to achieve a specific state is highly dependent on the application.
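To make that concrete, here is a minimal, hypothetical sketch of the imperative approach: a fixed sequence of shell commands pushed to a host over SSH, where both the order of the commands and whatever state the host already happens to be in determine the outcome. The host name and commands are invented for illustration.

```python
import subprocess

# Hypothetical example: imperatively mutating a host over SSH.
# The result depends on the command order and on the host's current state.
HOST = "web-1.example.com"  # placeholder host
COMMANDS = [
    "apt-get update",
    "apt-get install -y nginx",
    "cp /tmp/nginx.conf /etc/nginx/nginx.conf",
    "systemctl restart nginx",
]

for cmd in COMMANDS:
    subprocess.run(["ssh", HOST, cmd], check=True)
```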
Configuration management systems try to solve this by abstracting a “thing” in a system and giving it a clear interface with some idempotent functions to move it into a given state, like installed, whether that thing is called a Chef cookbook, a Salt state or (the most misleading name of all) an Ansible role.
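The shared pattern can be sketched in a few lines of Python; this is only an illustration of the idempotent interface those tools expose, not how any of them is actually implemented, and it assumes a Debian-style system where dpkg and apt-get are available.

```python
import subprocess

def package_installed(name: str) -> None:
    """Idempotent 'resource': only install the package if it's missing."""
    # dpkg -s exits non-zero if the package isn't installed.
    current = subprocess.run(["dpkg", "-s", name], capture_output=True)
    if current.returncode == 0:
        return  # already in the desired state, nothing to do
    subprocess.run(["apt-get", "install", "-y", name], check=True)

package_installed("nginx")  # safe to apply as often as you like
```

Applying such a resource repeatedly converges to the same state, which is what makes the declarative model attractive in the first place.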
If the “thing” is your web application, this might work very well, but in reality you often have to configure third-party applications that have subtle dependencies on specifics of other “things”. Or you have low-level system configuration which affects and depends on other things installed. At this point the abstractions usually break, and you often end up introducing site-specific changes to what are supposed to be reusable, generic components.
The lack of a generic system configuration interface that can be used to configure every aspect of a system imposes a lot of complexity on the configuration management.
As long as there is configuration, there is configuration management. You will always have some form of desired state you want your infrastructure to be in. The question is not if but how to manage configuration. Since fewer things can go wrong with configuration that changes rarely, I believe the best way is to identify the different configuration life cycles and find the right way to manage each kind, trading some runtime dynamism for correctness.
If the lifetime of a single host or container image is shorter than the lifetime of a configuration option, it often makes sense to move this configuration to install/build time. Since no change can happen during the life cycle of the host or container, it's easier to reason about the infrastructure: changes to this set of configuration can be ruled out as the reason for a given observation.
Moving configuration to install time might mean having your bare metal installer preseed some configuration, or building a static OS image for your cloud provider. The point is to bake in this configuration and simply rebuild when changes are necessary.
All configuration that is the same across all environments (dev/test/prod) should be considered for hardcoding, while environment-specific configuration (credentials, URLs, service identifiers) should be passed in at runtime, so you can build, test and deploy the same static artifact.
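Inside the application, the split could look roughly like this; the setting and variable names are made up for illustration. Everything that is identical across dev, test and prod is a constant baked into the artifact, while everything environment-specific is read from the environment at startup.

```python
import os

# Environment-agnostic configuration: identical in dev/test/prod,
# so it is simply hardcoded into the artifact.
LISTEN_PORT = 8080
REQUEST_TIMEOUT_SECONDS = 30

# Environment-specific configuration: injected at runtime by whatever
# starts the service, so the same artifact runs everywhere.
DATABASE_URL = os.environ["DATABASE_URL"]
API_CREDENTIALS = os.environ["API_CREDENTIALS"]

def config_summary() -> dict:
    """Expose the effective configuration, e.g. for a debug endpoint."""
    return {
        "port": LISTEN_PORT,
        "timeout": REQUEST_TIMEOUT_SECONDS,
        "database": DATABASE_URL,
    }
```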
What actually gets hardcoded is a case-by-case decision and depends on a lot of factors. On a bare metal infrastructure where all services are deployed straight to the host, there will be a lot of configuration that is site-wide and environment-agnostic but simply changes so often that reinstalling the whole host isn't feasible.
But in a containerized infrastructure you have both host image and container image life cycles. There is little host configuration, so usually all of it can be hardcoded. Even if reinstalling the host takes 20 minutes, if it only happens every few weeks and is fully automated, that's probably fine. Building the container images in a continuous deployment pipeline might take just a minute from a change until it's deployed, so here again it's feasible to bake in all suitable configuration.
Especially environment-specific configuration like credentials should be passed to the services by whatever deploys your application. Even though a lot of people still deploy their applications with configuration management instead of cluster schedulers, I'm convinced that will change in the next few years. Whether you're using Mesos/Marathon, Kubernetes or Omega, the high-level concepts are similar: you define your application and the scheduler decides, based on the available resources, where to run it. Whether services are deployed by a configuration management system or a cluster scheduler, whatever starts the service is the right place to pass configuration to it. Instead of writing configuration files, 12-factor-style configuration is usually better suited.
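Whatever the deploying entity is, a scheduler or something much simpler, the mechanism boils down to injecting the environment-specific values into the process environment when the service is started. A minimal, hypothetical sketch of that deployer side, with invented paths and variable names:

```python
import os
import subprocess

def start_service(binary: str, env_specific: dict) -> None:
    """Start a service with environment-specific config injected as
    environment variables, roughly what a scheduler or deploy tool does."""
    env = dict(os.environ)
    env.update(env_specific)
    subprocess.Popen([binary], env=env)

# Hypothetical production deployment of the same artifact used in staging.
start_service("/srv/myapp/bin/myapp", {
    "DATABASE_URL": "postgres://db.prod.internal:5432/myapp",
    "API_CREDENTIALS": "file:/run/secrets/api-credentials",
})
```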
Instead of configuring your systems at a regular interval with some configuration management daemon, it's often a better pattern to have the application, or a wrapper around it, determine its configuration at runtime. Instead of making configuration management orchestrate various services or instances, it's more robust and arguably less cognitively challenging to consider each service its own independent entity. This only works well if site-wide configuration is baked in; determining all of it at runtime leads to similar complexity as we have with full-blown configuration management today.
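One way to picture that wrapper, again as a hedged sketch with invented names: resolve only what genuinely has to be decided at runtime, here the address of a backend found via DNS, just before starting the service, and leave everything else baked into the artifact.

```python
import os
import socket

def resolve_backend() -> str:
    """Determine runtime-only configuration at startup, e.g. via DNS."""
    # Hypothetical: the backend is discovered through a well-known DNS name.
    host = os.environ.get("BACKEND_HOST", "backend.service.example")
    addr = socket.gethostbyname(host)
    return f"{addr}:9090"

def main() -> None:
    backend = resolve_backend()
    # Hand the resolved value to the actual service process and get out of
    # the way; the wrapper holds no further state after exec.
    os.execvp("/srv/myapp/bin/myapp",
              ["/srv/myapp/bin/myapp", "--backend", backend])

if __name__ == "__main__":
    main()
```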
Isolating change to specific points in the life cycle of systems and services reduces the complexity of runtime configuration and simplifies the mental model when reasoning about the infrastructure.
By Johannes Ziemke.
Cool, cool, but need help? You can hire me!