How To Manage Your Servers Just Like Google With This Simple Strategy

When you have only a single Linux server, life is easy. You SSH into it, you install what you need from source or from the distribution’s package repositories, and you are done. Easy peasy.

One day, you realize that this single server is not enough, so you bring up a second one and install what you need on it.

Then, as you grow, more and more servers come online, and each time you SSH into each one.

Unfortunately, as more servers come to life, it gets harder to manage them. It is difficult to keep track of what is installed where.

Even when you do keep track, different servers sometimes end up with different versions of the same software. As a result, it can be hard to explain why something works on one machine and not on another.

When you set up that service on the old machine, you installed five additional packages to make it work, and now you no longer remember which ones they were.

Sometimes, you remember only when you bring the server online in production and it crashes due to missing dependencies.

What’s more, installing every server by hand is a real pain and painfully slow. It takes a lot of time to type every command for every package, and setting up each configuration again is no small task either.

What you get in the end is snowflake servers. Every one is unique, and you don’t even remember in what ways.

When a new security patch comes out for a piece of software you use, it can be hard to pinpoint which machines need updating.

Wouldn’t it be nice if you always knew what was installed on each server and it was super easy to update?

I love snowflakes, but only in winter and not when it comes to servers.

Configuration Management can help, but…

For a long time, I’ve been looking into ways to improve the situation.

The first approach to a better world that I found was configuration management solutions such as Puppet, Chef, Salt and Ansible.

I especially loved Ansible and thought that this was it, but it had a few issues.

For example, we almost never use the official distribution repositories for most of the software running on our machines.

When I need nginx, I compile it from source to get a recent version with the latest security and performance fixes. The same goes for Redis, NodeJS, PHP and the other technologies which I need for myself or my clients.

With Ansible I can automate most of the steps, but I am still required to compile nginx on every machine where I need it.

What’s even worse, with all of the above solutions I’ve run into situations where, due to a crash or some other failure, I end up in a state where the tool cannot operate, and I have to intervene manually over SSH.

In the end, these tools help, but the situation is still far from ideal.

The Google Way

Google has always been known for innovation.

They were among the first to use commodity servers to run their web services, a practice which is now widespread.

They were also among the first to use containers, a decade before they became popular (thanks to Docker).

So it was no surprise when I found a different way to manage servers, and it came, again, from Google.

It came in the form of a presentation by Marc Merlin, who works at Google.

It is a fascinating read, but to make a long story short, Google has found a very simple way to keep their millions of servers easy to manage.

First, to avoid the snowflake scenario, they keep all their servers identical.

It doesn’t sound very groundbreaking until you read how they do it, which is where the interesting part is.

They use file-level syncing to keep the machines identical. Yep, they basically sync / on one server with / on another server.

I am sure you will be quick to point out that there are things they cannot sync, like /dev and /proc, and you will be right, but you get the idea. They sync everything they can.

They designate one server as the master image, then install whatever they need on it, from either source or a package. Then they sync that server’s files with all the other servers in their fleet of millions.
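With stock tools, the core of that idea could look something like the sketch below. The hostname and exclude list are illustrative assumptions on my part, and, as we will see, Google actually uses an internal tool rather than rsync itself.

```bash
# Sketch: push the master image's root filesystem to one target server.
# The hostname and exclude list are illustrative assumptions; pseudo and
# volatile filesystems like /dev and /proc cannot be synced.
rsync -aHAX --delete \
    --exclude=/dev --exclude=/proc --exclude=/sys \
    --exclude=/run --exclude=/tmp \
    / root@target-host:/
```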

Each time they need to update something, they just repeat the process.

The result is millions of identical servers, with no snowflakes. They always know what is installed and in which version, and they can be confident that all machines run the same software with the same versions, and as a result behave the same way.

This vastly simplifies server fleet management.

Of course, Google being Google, the process is a little more complicated than what I described, but it is still much simpler than anything else I’ve heard of.

I am not Google, what do I do? How can I adapt this to my company?

One of our clients needed to revamp their infrastructure for multiple reasons, and one of the issues they had was snowflake servers. So together we decided to adopt the Google strategy.

If you read the Google presentation mentioned above, you will notice, however, that they also manage all installed packages, even basic ones, from their own repositories, removing triggers and making other modifications so that everything works with file-level syncing.

This is a luxury few companies can afford, but it shouldn’t prevent you from adopting the main part of this strategy.

As I already told you, besides the base software, everything else that we use is compiled from source. This ensures we get the latest security and performance features.

When we want to install a new piece of software, one machine is designated as the manager, or main, machine. This is where the installation happens.

Once a piece of software is compiled, it is installed under /usr/local/software-name-version. Then a symlink is created from /usr/local/software-name to that folder.

This is a simple but effective way to keep multiple versions of each piece of software and easily switch between them if a regression or serious bug is found in one of them.
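For nginx, for instance, the whole dance could look like this sketch, with made-up version numbers:

```bash
# Sketch with made-up version numbers: install each build under a
# versioned prefix, then point a stable symlink at the active one.
./configure --prefix=/usr/local/nginx-1.24.0
make
make install
ln -sfn /usr/local/nginx-1.24.0 /usr/local/nginx

# Rolling back after a regression is just repointing the symlink:
ln -sfn /usr/local/nginx-1.22.1 /usr/local/nginx
```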

After the previous steps are done, rsync is used to keep /usr/local on the manager server in sync with the /usr/local folder on all other servers.
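In its simplest form, that sync is a single command per server. The flags and hostname here are my assumptions, not a prescribed setup:

```bash
# Sketch: push /usr/local from the manager to one of the other servers.
# The trailing slashes make rsync map the folder contents one-to-one.
rsync -aH --delete /usr/local/ root@web-01:/usr/local/
```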

Rsync is a very handy tool. It makes it super easy to keep two folders on different machines identical.

Even when these folders contain millions of files, or when a network disruption interrupts a transfer, you can just launch it again and it will bring everything back in sync.

Actually, this is even more powerful than it sounds. Because you can always rerun the sync, file-level syncing makes it incredibly easy to recover from pretty much any type of crash and virtually impossible to end up in an unrecoverable or unknown state.

This is a unique feature of file-level syncing. The more servers you have, the more critical it becomes, and the more you will appreciate it.

Google doesn’t use rsync. Instead, they use an internal tool which is very similar but better suited to their needs.

In addition to the installed software, we had two more special folders under /usr/local/ which of course were also synced.

These two special folders were /usr/local/etc/default and /usr/local/etc/systemd.

The first, /usr/local/etc/default, contained the configurations of every piece of software.

It contained multiple versions of each configuration. For example, it contained a default configuration; then, if we had two database servers with slightly different settings, it contained two more configurations, one for each of them.

However, these configurations were not active. When we launched a database, we first copied its configuration from /usr/local/etc/default to /etc, and the database then used the copy in /etc.

When we needed to make a change, we usually made it under /etc and then copied it back to /usr/local/etc/default.
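As a sketch, with a hypothetical Redis configuration for one of the database servers (the file names are assumptions):

```bash
# Sketch with hypothetical file names: activate the config on this host...
cp /usr/local/etc/default/redis/redis-db1.conf /etc/redis/redis.conf

# ...and after editing it locally, copy it back so rsync distributes it:
cp /etc/redis/redis.conf /usr/local/etc/default/redis/redis-db1.conf
```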

Thanks to rsync, the /usr/local/etc/default folder was always identical on all machines.

The second folder, /usr/local/etc/systemd, contained the systemd services for every piece of software, but again not the active ones.

When we needed to run a particular piece of software, we copied its systemd service from /usr/local/etc/systemd to its usual place at /etc/systemd/system.
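Again as a sketch, assuming a unit file named redis.service:

```bash
# Sketch, assuming a unit file named redis.service.
cp /usr/local/etc/systemd/redis.service /etc/systemd/system/
systemctl daemon-reload           # let systemd pick up the new unit
systemctl enable --now redis      # start it now and enable it at boot
```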

As a result, every machine was identical to every other in folder and file structure and in what was installed on it.

The only difference was what was running on each server and what data was stored on it.

We had database machines to store user data, front end machines displaying user content and more. Each was running what it needed to run and only that.

The data on the servers was not synced as we wanted to sync only the stateless part of our machines. There are better ways to back up data.

An example

Let’s have a look at nginx, for example. When you want to run nginx on a load-balancer machine, you just copy its configuration from /usr/local/etc/default/nginx/conf to /etc/nginx/.

You might need to make some minor modifications to that config if the machine has special uses. If you do, you copy the configuration back so that it can be synced.

Then you repeat those steps for the systemd service for nginx.

Once ready, you can just start your service and it will work, the same way it works on other servers in your fleet. You can be confident in that because all those machines are identical.

You also have to rsync any changes you made out to all machines.

When you do this regularly, each rsync only transfers a small amount of data and takes less than a second.

Some advantages

One advantage of this approach is that you only need to compile software like nginx once. Then you just file-sync with all the other machines and it works. This saves huge amounts of time.

Another advantage is that it makes recovery from serious machine crashes very fast. Whether you use a dedicated hardware server or a virtual machine, as soon as you have a new machine live, you can just run rsync and the new server is ready to go in a few minutes.

Then you just select what software you would like to run on that machine and copy its configuration, the same one that was running before the hardware failed. It will run just as well as it did before.

A small disadvantage is that you might need to rsync many servers, but this can be mitigated with a simple script that just executes rsync in a loop, as sketched below.

You can run it as many times as you want. On servers that are already synced it will do nothing, and on all other servers it will fix anything that is missing.
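A minimal version of such a script could look like this; the host list and paths are illustrative assumptions:

```bash
#!/bin/bash
# Sketch: fan /usr/local out from the manager to every server.
# The host list is an illustrative assumption.
SERVERS="web-01 web-02 db-01 db-02"

for host in $SERVERS; do
    rsync -aH --delete /usr/local/ "root@${host}:/usr/local/" \
        || echo "sync to ${host} failed; rerun it later" >&2
done
```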

When you grow to millions of servers you might want to do that in a little bit more sophisticated way, like Google did, but even then, you don’t need to do anything too complicated.

Doesn’t it waste space?

Another concern you might have is that, since every server contains all libraries and software, even those it will never use, this approach might waste some space.

It will, but the waste is negligible. Usually all the required software adds up to no more than a few hundred megabytes, often less than a hundred, and on most cloud services a gigabyte of storage costs around $0.04 per month, so you waste less than that.

What about Ansible, Puppet or Chef?

Ansible, Puppet, Chef or Salt might help in some situations. I am actually a fan of Ansible, and have seen good things from the others, too.

However, file-level sync is more robust and faster. The more servers you manage, the more evident those advantages become.

Still, there is a place for those tools. They work very well with the strategy discussed so far.

For example, you can use Ansible to install everything on your manager server. That way, the entire setup is written down in an Ansible playbook.

Each time you need something new, you update your playbook and run Ansible.

Once that server is set up, you can rsync from it to all the other machines.
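The combination can be as simple as this sketch, where the playbook and host names are my assumptions:

```bash
# Sketch: apply the playbook to the manager only, then fan out the result.
# Playbook and host names are illustrative assumptions.
ansible-playbook -i manager-host, setup.yml
rsync -aH --delete /usr/local/ root@web-01:/usr/local/
```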

Packages

As I told you earlier, one big difference from what we do is how Google treats packages. They keep every package they need in a local repository, heavily modified so that it does not produce side effects after installation; as a result, file-level syncing alone is enough.

However, this is a luxury that few can afford. I have used two other approaches.

We install most of the software from source, but it might still depend on many other libraries.

What do we do about them?

One solution is to compile most of them from source, too. Then you can easily sync them. However, this might mean more compilation than you’d like, as well as resolving many dependency and compilation issues manually.

The other approach is to record the versions of all installed packages and install the same ones on every new machine.

Try to keep the number of those packages to a minimum, for example just some base libraries and compilers, so that you benefit the most from the file-syncing strategy.
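On Debian or Ubuntu, recording and replaying the package list could look like the sketch below; the file path is an illustrative choice, and other distributions have equivalents:

```bash
# Sketch, Debian/Ubuntu flavored; the file path is an illustrative choice.
# On the manager, record which packages are installed:
dpkg --get-selections > /usr/local/etc/default/packages.txt

# On a fresh machine, replay that list:
dpkg --set-selections < /usr/local/etc/default/packages.txt
apt-get -y dselect-upgrade
```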

Next

Using file-level syncing for your entire server fleet has been great for us, and apparently for large companies like Google, too.

You can easily start today. Choose one piece of software that you will install in only one place, then use rsync on just its folder to sync it to the other servers. The more folders you add, the more confident you will become that this is a very robust approach.

What’s more, you can initially do that syncing between just two of your servers, then add one more, then another, and so on until you have all of them. We’ve done that before, and it has worked beautifully.