Attributes or data bags: what should I use?

In a way, building IT infrastructure isn’t particularly exciting. Want to turn a Linux server into a MySQL database server? Enter some commands to install MySQL. Congratulations, you have successfully used some data (yum -y install mysql-server) to change the generic model (the out-of-the-box operating system). Much of our infrastructure is built the same way.

Chef takes this idea to its logical conclusion by formalizing the relationship between data (attributes, or properties of a machine) and the model (recipes and cookbooks). One of the most important things for Chefs to contemplate, then, is where to put their data and how to model it. Your data model changes the syntax and structure of your recipe code, and vice versa, so it’s important to consider not only the design of attribute data structures but also when and where to use data bags.

In the Beginning…

When Chef was first invented, there were only node attributes. In fact, there was only one level of node attribute: what we know now as “set” or “normal”. As more data structures got added to Chef (roles and environments being chief among them), attribute precedence and merge order were invented to resolve conflicts, and “default” and “override” levels were added for even more flexibility.

It soon became clear that a higher-order, global data structure was needed. Hence, data bags were born. If you’re a Dungeons & Dragons player, you’ll notice that the name is a humorous take on “bag of holding”.

What are Data Bags and Data Bag Items Good For?

Data bags are generally used to hold global information (stored as “data bag items”) that is pertinent to your infrastructure but is not a property of the nodes themselves. In the majority of scenarios, you will continue to model most of your infrastructure using node attributes. Here are a few guidelines for when a data bag item should be used to represent a piece of data:

  • If it is global across all of your infrastructure, and you think you might need to change that item en masse at some point. Examples: an external service API key that does not vary per environment; an office gateway’s external IP address; a license key.
  • If it needs to be encrypted. Not only can data bag items be encrypted, but each item can have a different encryption key if desired. Encrypted data bag items secure sensitive information on the Chef server, so that an intruder who gained access to the server could not read your secrets. Nor could a man-in-the-middle attack reveal sensitive information by sniffing the traffic between the Chef client and the server, because ciphertext is decrypted only on the client. (This is less of a concern given that client-server communication is performed over SSL.)
  • If it needs to be written to by another system and we want to isolate the scope of the data that system can write to. Example: application release information which could eventually be written by a continuous integration pipeline.
  • If an external team needs to update limited pieces of information and that team does not normally write Chef recipes. Example: the DBA that needs to occasionally modify a database password or connection string.

If none of these conditions is true, implement the configuration as an attribute.
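
To make the first guideline concrete, here is a minimal sketch of what a data bag item looks like on disk: a small JSON document whose only required key is `id`. The item name `office_gateway`, the `external_ip` field, and the address are hypothetical values chosen for illustration.

```ruby
require 'json'

# Build the JSON document for a data bag item. Every item must carry an
# "id" key; the remaining fields are free-form.
# The id and fields used below are hypothetical examples.
def data_bag_item_json(id, fields)
  JSON.pretty_generate({ 'id' => id }.merge(fields))
end

puts data_bag_item_json('office_gateway', 'external_ip' => '203.0.113.10')
```

A file with this content could then be uploaded with `knife data bag from file` and read inside a recipe with `data_bag_item`.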

Writing to Data Bags from External Systems

One of the strengths of the Chef server is that its API is well-documented, open, and easy to integrate with. Experienced customers and open-source users alike have written plugins and add-ons to the Chef client to enable it to do things that our software engineers never even thought of.

However, it is often undesirable for an external system to write to a node attribute, for the simple reason that it would need to write to one of three objects: a cookbook, a role, or an environment. A mistake could have far-ranging side effects beyond just the intended change. On the other hand, modifying a data bag item is a small, self-contained operation. The programming API is also far easier to use.

Here’s an example of a custom Ruby script that uses Chef as a library to update application release information for an app “foo”:

require 'net/http'
require 'chef/rest'
require 'chef/config'
require 'chef/data_bag'
require 'chef/data_bag_item'

bagname = 'myapps'
appname = 'foo'
version = '1.0.0'

# Use the same config as knife uses
Chef::Config.from_file(File.join(ENV['HOME'], '.chef', 'knife.rb'))

# Load the data bag item, or create it if it doesn't exist yet
begin
  item = Chef::DataBagItem.load(bagname, appname)
rescue Net::HTTPServerException => e
  # A 404 means the item was not found; anything else is a real error
  if e.response.code == "404"
    puts("INFO: Creating a new data bag item")
    item = Chef::DataBagItem.new
    item.data_bag(bagname)
    item['id'] = appname
  else
    puts("ERROR: Received an HTTP error with status #{e.response.code}")
    raise
  end
end

item['version'] = version
item.save

Many customers use this approach to implement a continuous delivery pipeline. Successful completion of a pipeline stage (e.g. “passed unit tests”) might trigger a corresponding data bag update (e.g. “update QA’s data bag item with the build #”). Next, the Chef recipe handling application deployment will pick up the change and deploy the new version of the application. This can happen either asynchronously (next time Chef Client runs) or synchronously (by using “knife ssh” or even a Push Job as another pipeline action).
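
The recipe side of that hand-off can be sketched in plain Ruby. This is an illustration under stated assumptions, not the deployment logic of any particular cookbook: `item` stands in for the hash-like object a recipe would get back from `data_bag_item('myapps', 'foo')`, and `deploy_needed?` and `installed_version` are hypothetical names.

```ruby
# Decide whether a deploy is needed by comparing the version the pipeline
# recorded in the data bag item with the version currently installed.
# `item` is a stand-in for a loaded data bag item (hash-like);
# `installed_version` is a hypothetical value the recipe would discover.
def deploy_needed?(item, installed_version)
  wanted = item['version']
  return false if wanted.nil?            # pipeline has not published anything yet
  return true if installed_version.nil?  # nothing installed, so deploy
  wanted != installed_version
end

item = { 'id' => 'foo', 'version' => '1.0.1' }
puts deploy_needed?(item, '1.0.0')
puts deploy_needed?(item, '1.0.1')
```

Comparing version strings for inequality, rather than ordering them, keeps the sketch usable when the "version" is an opaque build number.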

Final Points

To summarize, data bag items provide a way to store data that is not directly associated with any particular node in the infrastructure. Data bags are also searchable: the name of the index is the name of the bag, so don’t name a bag “role” or “node” or it will never be found!
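
A small guard along these lines could catch a naming collision before a bag is created. It is a sketch that assumes the built-in search indexes are `node`, `role`, `client`, and `environment`; the constant and method names are hypothetical.

```ruby
# Index names assumed to be taken by Chef's built-in object types.
RESERVED_INDEXES = %w[node role client environment].freeze

# Returns true when a proposed bag name does not collide with a built-in index.
def safe_bag_name?(name)
  !RESERVED_INDEXES.include?(name.to_s)
end

%w[myapps role].each do |name|
  verdict = safe_bag_name?(name) ? 'ok' : 'collides with a built-in index'
  puts "#{name}: #{verdict}"
end
```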

Keep data bag items small. The data is transmitted from the server to the client on every Chef run, so you don’t want an 8K data bag item being queried by 1000 machines every 15 minutes — that’s 32MB/hour of JSON on the network!
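
The arithmetic behind that figure can be checked with a few lines of Ruby. The method name is hypothetical, and the sketch takes 1 MB to be 1000 KB.

```ruby
# Hourly data bag traffic in megabytes, taking 1 MB = 1000 KB.
def hourly_traffic_mb(item_kb:, nodes:, interval_minutes:)
  runs_per_hour = 60.0 / interval_minutes
  item_kb * nodes * runs_per_hour / 1000.0
end

# 8 KB item, 1000 nodes, one Chef run every 15 minutes
puts hourly_traffic_mb(item_kb: 8, nodes: 1000, interval_minutes: 15)
```

Halving the item size or doubling the run interval halves the traffic, which is why small items matter most on large fleets.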

Finally, if in doubt, store data as a node attribute, until you find a need to convert it to a data bag item. You can always refactor your code.

Julian is a senior consulting engineer with Chef. His first experience with Chef was at SecondMarket, a New York-based alternative markets startup, and he has over a decade of systems administration experience at outfits large and small. When he's not helping customers with Chef, he enjoys good craft beer, indie music, and writing biographies about himself in the third person.

  • http://sethvargo.com/ Seth Vargo

    I’d also like to make a suggestion/plea: please don’t force the use of a data bag in a public cookbook. This is a mistake that we’ve learned from in the past. Instead of forcing people to use a certain schema, I recommend using a hybrid model, as described in the new Jenkins cookbook.

    Requiring a certain data bag structure forces people to manage their infrastructure in a certain manner. This is a violation of one of the guiding principles of Chef: you know your infrastructure best. The `users` cookbook is a big culprit here. It forces users to conform to a certain data structure, which rarely meets the ever-changing and unique demands of an organization.

    Alternatively, I recommend using attribute-driven cookbooks and then encouraging users to populate those attributes how they see fit. In the case of the `users` cookbook, users could **choose** to populate an attribute by specifying it manually, loading it from a data bag, or using a third-party service (like LDAP).

    • bbytheway

      So in the deep dive of the splunk cookbook, posted today on the blog, it detailed the hard dependency on chef-vault, which is making a dependency on data bags + a lot more.

      In my own cookbooks, I’ve found forcing data bag use to make testing a huge pain, especially when you have other cookbooks that depend on the one using the data bags. I’m currently ripping that stuff out and making the cookbook more flexible in what it takes.

      I was surprised that the splunk cookbook, which is being held up as a “here’s how to do it” cookbook, has this hard dependency.
