Organizing your configuration data
Chef runs on configuration data. This data can be stored in a variety of locations and serve a variety of purposes. When computing the final configuration for a given node, the sources of configuration data are "squashed" into a single, authoritative configuration to be deployed. Those sources include cookbook attribute files, environments, roles, and the node itself.
Data from these locations is combined to produce a final hash of attributes when a client requests its run list from the server. Cookbooks provide a baseline set of attributes that the recipes inside rely on. These attributes act as "sane defaults" for the recipes that, in the absence of overriding values, are sufficient to execute the recipes without extra work. Other sources, including the environment, role and node itself, may in turn override these attributes in order to provide the final configuration.
When developing recipes, these attributes can be accessed through the node hash and are computed by Chef using a set of rules to determine precedence. The order of precedence when computing this hash is broken down into the following levels (lowest to highest priority):
- Default
- Normal (also known as set)
- Override
Within each level, the sources of attribute data in order of increasing precedence are as follows:
- The attributes file inside of a cookbook
- Environment
- Role
- Node
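To make the flattening concrete, here is a sketch in plain Ruby. This is an illustration of the idea, not Chef's actual implementation: each level is applied in order, and within a level each source deep-merges over the ones before it, so later (higher-precedence) values win.

```ruby
# Sketch of Chef-style attribute flattening; NOT Chef's real code.
# Nested hashes merge recursively; scalar values are overwritten.
def deep_merge(base, over)
  base.merge(over) do |_key, old_val, new_val|
    if old_val.is_a?(Hash) && new_val.is_a?(Hash)
      deep_merge(old_val, new_val)
    else
      new_val
    end
  end
end

# levels: a hash of :default / :normal / :override, each an array of
# sources ordered cookbook, environment, role, node (low to high).
def flatten_attributes(levels)
  [:default, :normal, :override].reduce({}) do |acc, level|
    (levels[level] || []).reduce(acc) { |merged, src| deep_merge(merged, src) }
  end
end

cookbook_default = { 'postgresql' => { 'port' => '5432',
                                       'data_dir' => '/usr/local/pg/data' } }
node_override    = { 'postgresql' => { 'data_dir' => '/opt/data' } }

final = flatten_attributes(default: [cookbook_default],
                           override: [node_override])
final['postgresql']['data_dir'] # => "/opt/data"
final['postgresql']['port']     # => "5432"
```

Note how the node-level override replaces only the key it names; the cookbook's default port survives untouched.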
This means that a node-specific override attribute takes precedence over all others, which in turn is more important than the role, environment, and cookbook override attributes, and so on down the chain of precedence. As your scope narrows from the most generic description of a role (the recipes) to the most specific component in the system (the actual node itself), these settings override the more generic values. A node knows best what its authoritative configuration should be, whereas a recipe knows nothing about the resources on the host. For example, consider the following scenario in which you have two hosts, potassium and chromium. For some legacy reason, their disks are configured slightly differently, as follows:
Potassium:
- 16 GB root partition
- 250 GB SSD data partition in /opt

Chromium:
- 32 GB root partition
- 400 GB EBS disk mounted at /usr
In order to install the PostgreSQL database server, you need to make sure you install it at a location that provides enough storage space for the data. In this example, there will be more data than either root disk can contain. As a result, the data directory will need to reside in /opt on potassium and /usr on chromium. There is no way that the PostgreSQL recipe can account for this, and the postgresql_server recipe does not know anything about its resources. Consequently, the logical place to configure the data directory is at the node level. If the default location according to the recipe were /usr/local, then a node-level override may not be needed for chromium; in the case of potassium, however, it could be directed to store data in /opt/data instead.
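Such a node-level override for potassium could be expressed in the node's JSON. This is a hedged sketch: the names follow the example above, and a real node object also carries fields such as the run list.

```json
{
  "name": "potassium",
  "override": {
    "postgresql": {
      "data_dir": "/opt/data"
    }
  }
}
```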
What all this means is that as you develop recipes, any default attribute set by your cookbook will be the lowest priority. You can safely set some reasonable defaults in your cookbook knowing that they will only be used as a fallback if nobody overrides them further down the chain.
Example attribute data
A simple default attributes file for a PostgreSQL cookbook might look like the following:
default['postgresql']['port'] = "5432"
default['postgresql']['data_dir'] = "/usr/local/pg/data"
default['postgresql']['bind_address'] = "127.0.0.1"
Notice that the attributes for a cookbook are stored in a Ruby hash. Good practice dictates that the namespace (the first key in the hash) matches the name of the cookbook (in this case, postgresql), but this does not need to be the case. Because cookbooks often contain multiple recipes, a cookbook's attributes file will often have per-recipe default configurations. Consider a further evolution of the PostgreSQL attributes file if it were to contain recipes for both the server and the client installation:
default[:postgresql][:client][:use_ssl] = true
default[:postgresql][:server][:port] = "5432"
default[:postgresql][:server][:log_dir] = "/var/log/pglog"
There are times when just a simple attributes file doesn't make sense because the configuration may depend on some property of the node being managed. The fact that the attributes file is just a Ruby script allows us to implement some logic inside our configuration (though you should avoid being overly clever). Consider a recipe where the default group for the root user depends on the platform you are using: "wheel" on the BSDs, "admin" on Ubuntu Linux, and "root" elsewhere. Chef provides a method, value_for_platform, that allows the attribute to be changed depending on the platform the recipe is being executed on, as the following example demonstrates:
default[:users][:root][:primary_group] = value_for_platform(
  :openbsd => { :default => "wheel" },
  :freebsd => { :default => "wheel" },
  :ubuntu  => { :default => "admin" },
  :default => "root"
)
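To make the lookup behavior concrete, here is a hypothetical, simplified stand-in for value_for_platform. The real Chef helper reads the platform from the node object and also supports matching on specific platform versions; this sketch takes the platform as an explicit argument instead.

```ruby
# Hypothetical, simplified stand-in for Chef's value_for_platform.
# The real helper inspects node['platform'] itself and can match
# per-version values; here the platform is passed in directly.
def value_for_platform(mapping, platform)
  entry = mapping[platform]
  return entry[:default] if entry.is_a?(Hash)
  mapping[:default]
end

mapping = {
  :openbsd => { :default => "wheel" },
  :freebsd => { :default => "wheel" },
  :ubuntu  => { :default => "admin" },
  :default => "root"
}

value_for_platform(mapping, :freebsd) # => "wheel"
value_for_platform(mapping, :centos)  # => "root"
```

Any platform without an explicit entry falls through to the top-level :default value, which is exactly how the root group example resolves to "root" elsewhere.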
Where it makes sense, attributes can also be shared between cookbooks. There are limited uses for this, and it should be used with care as it blurs the boundaries between cookbooks and causes them to become too tightly coupled with one another.
Data bags
There are times when configuration data transcends a recipe, role, environment, or node. This type of configuration data tends to be system-wide data such as the following:
- Firewall rules for various types of hosts
- User accounts
- SSH keys
- IP address lists (white lists and black lists)
- API keys
- Site configuration data
- Anything that is not unique to a specific entity in your infrastructure
Data bags are very free-form, as the name implies; recipes that rely on data from data bags will impose their own expectations of the organization within a data bag, but Chef itself does not. Data bags can be considered, like all other Chef configuration data, to be one large hash of configuration data that is accessible to all the recipes across all the nodes in the system.
Building firewall rules is a good use case for data bags. A good cookbook is an island unto itself; it makes as few assumptions about the world as possible in order to be as flexible and useful as it can be. For example, the PostgreSQL cookbook should not concern itself with firewall rules; that is the realm of a firewall cookbook. Instead, an administrator would leverage a generic firewall model and a cookbook with a specific firewall implementation, such as the UFW cookbook, to provide those features. In this case, if you were to look at the UFW cookbook, you would see the ufw::databag recipe making use of data bags to make the firewall rules as flexible as possible.
In this case, ufw::databag expects that there is a specific data bag named firewall, and inside of it are items that share names with roles or nodes; this is in line with the notion that data bags are free-form, but certain cookbooks expect a certain structure. If our infrastructure model had two roles, web_server and database_server, then our firewall data bag would contain two items named accordingly. The web_server item could look like the following hash:
{
  "id": "web_server",
  "rules": [{
    "HTTP": {
      "dest_port": "80",
      "protocol": "tcp"
    },
    "HTTPS": {
      "dest_port": "443",
      "protocol": "tcp"
    }
  }]
}
Here, the id of the item maps to the name of the item, which is also the name of the role, so that the ufw::databag recipe knows where to fetch the data it needs to build its internal firewall rules. To compute the list of firewall rules to apply, the ufw::databag recipe examines the list of roles that the node is configured with and then loads the corresponding data from the items in the firewall data bag.
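A rough sketch of that lookup logic might look like the following in plain Ruby. This is an illustration of the idea, not the actual ufw cookbook source; the data bag is modeled as a plain hash rather than fetched from a Chef server.

```ruby
# Illustrative sketch, not the real ufw::databag implementation:
# collect firewall rules for a node by matching its roles against
# items in a "firewall" data bag (modeled here as a plain hash).
def firewall_rules_for(node_roles, firewall_bag)
  node_roles.flat_map do |role|
    item = firewall_bag[role]
    item ? item['rules'] : []
  end
end

firewall_bag = {
  'web_server' => {
    'id' => 'web_server',
    'rules' => [{ 'HTTP'  => { 'dest_port' => '80',  'protocol' => 'tcp' },
                  'HTTPS' => { 'dest_port' => '443', 'protocol' => 'tcp' } }]
  }
}

rules = firewall_rules_for(['web_server', 'database_server'], firewall_bag)
# database_server has no item in the bag, so only web_server's rules apply
```

Roles with no matching item simply contribute nothing, which keeps the data bag authoritative: adding a database_server item later would change the firewall without touching any recipe.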
As you can see, data bags allow you to store centralized, free-form configuration data that does not necessarily pertain to a specific node, role, or recipe. By using data bags, cookbooks for configuring users, databases, firewalls, or just about any piece of software that needs shared data can benefit from the data stored in a data bag.
One might wonder why we have data bags when we already have attribute data, and that would be a good question to ask. Attributes represent the state of a node at a particular point in time, meaning that they are the result of a compaction of attribute data that is being supplied to a node at the time the client is being executed. When the Chef client runs, the attribute data for all the components contributing to the node's run list is evaluated at that time, flattened according to a specific priority chain, and then handed to the client. In contrast, data bags contain arbitrary data that has no attachment to a specific node, role, or cookbook; it is free-form data that can be used from anywhere for any purpose. One would not, for example, be likely to store user configuration data in a cookbook or on a specific node because that wouldn't make much sense; users exist across nodes, roles, and even environments. The same goes for other data such as network topology information, credentials, and other global data that would be shared across a fleet.