When I first started looking after Unix systems, I spent most of my days in a remote shell. If you didn’t live in the data centre, that’s what you did, even before ssh replaced telnet.

Since we moved to Kubernetes, logging into a server has become less common, but when things go really wrong, I still just need a shell – usually in a hurry.

Even as we embraced automation to manage cloud-based servers, the problem of managing SSH access has never gone away.

In fact, the less often it is needed, the more reliable the tooling has to be.

The nightmare of synchronizing files
What makes managing access to remote servers hard is the way OpenSSH traditionally authorizes users: either by password or by public key.

Both mechanisms require a local database that is synchronized across all machines. Because of their added security, keys are now largely preferred.

Here is how they work:

  • The client presents a public key to the server along with signed session data
  • The server verifies the signature, thus ensuring the user possesses the private key
  • Then the server consults a list of keys, usually ~/.ssh/authorized_keys, to decide whether the login should be allowed

This only works as long as all servers have an up-to-date copy of these public keys.

You need to gather them from users and then populate a file on new servers with the list. When a key changes or a user is no longer authorized, existing servers have to be updated accordingly. Managing this across hundreds or thousands of machines can become very brittle.
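To make the problem concrete, here is a minimal sketch of that synchronization done by hand – hosts.txt and team_keys.pub are hypothetical stand-ins for a server inventory and the collected public keys:

    # team_keys.pub is a concatenation of everyone's public keys, one per line:
    #   ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... alice@example.com
    # Every key change means re-running this loop against every machine.
    while read -r host; do
      scp team_keys.pub "admin@${host}:/tmp/authorized_keys"
      ssh "admin@${host}" 'install -m 600 /tmp/authorized_keys ~/.ssh/authorized_keys'
    done < hosts.txt

Any host that is down, unreachable, or simply missing from the inventory silently keeps its stale list.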

Most people solve this problem in one of two ways:

  1. Share a single key pair across all users
  2. Have a tool like Puppet or Chef update authorized_keys files across their estate

The former is feasible if the number of users is small, but weakens security by multiplying the chance of the key being stolen or lost.

If a new pair is needed, it has to be securely shared and all servers updated with the new public key – a time-consuming process.

The second approach, using something like Puppet to manage the list, works for estates that already employ such a tool, but it’s not perfect.

To mention only the obvious drawback, the most common reason why I need to log into a box is to investigate why Puppet isn’t working.

For our new Kubernetes platform, we decided not to follow this pattern. Instead, all our infrastructure is immutable: rolling out changes means replacing servers, not updating them.

A better way: SSH Certificate Authorities
Fortunately, OpenSSH offers a third authorization mechanism: the use of a certificate.

Although it was introduced into OpenSSH back in 2010, it has seen only limited use, primarily because of the added difficulty of setting up the certificate authority (CA).

The mechanism works like this:

  • Instead of a public key, the user presents a certificate signed by the CA.
  • The server verifies the certificate using the CA’s public key.
  • The server also verifies the signed session data, confirming the user holds the private key matching the certificate.
  • If everything checks out, the server applies its local rules – such as the certificate’s principals and validity period – to decide whether to grant access.

This means that, given only a single CA public key, a server can allow access to any key signed by that CA. Authorizing a new user is as simple as signing their key.
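On the server, that trust boils down to a single sshd configuration directive pointing at the CA public key; the paths and names below are illustrative:

    # On each server: trust any user certificate signed by our CA.
    echo 'TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pem' >> /etc/ssh/sshd_config

    # On the client: ssh picks up id_ed25519-cert.pub automatically when it
    # sits next to the matching private key, so logging in is simply:
    ssh alice@server.example.com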

Since the CA key rarely changes, machines can retrieve it on first boot.

Taken by itself, however, this approach only shifts the complexity from one place to another. Servers no longer need a list of authorized keys, but now a CA has to be established and access to its key needs to be controlled.

OpenSSH does not provide any tooling for this beyond ssh-keygen, which can sign keys from the command line using a second key as the CA.
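Signing a key this way looks roughly like the following – the file names and principal are illustrative:

    # Sign alice's public key with the CA private key "ca":
    #   -s: CA key to sign with, -I: certificate identity (appears in logs),
    #   -n: allowed principals, -V: validity window
    ssh-keygen -s ca -I alice@example.com -n alice -V +30m ~/.ssh/id_ed25519.pub
    # This writes the certificate to ~/.ssh/id_ed25519-cert.pub.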

Worse, the only way to block such certificates before they expire is to add them to a revocation list. In trying to eliminate the need for an up-to-date list of keys, we have only replaced it with an up-to-date list of revoked certificates.
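For reference, revocation uses OpenSSH's binary Key Revocation List (KRL) format, and that list still has to reach every server – the file names below are illustrative:

    # Create a KRL containing a compromised certificate (add -u to update
    # an existing list instead):
    ssh-keygen -k -f /etc/ssh/revoked_keys lost-cert.pub
    # Every sshd then has to be configured with it, and kept supplied with it:
    echo 'RevokedKeys /etc/ssh/revoked_keys' >> /etc/ssh/sshd_config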

Enter Vault
HashiCorp’s Vault offers an elegant solution to all of these problems in one of its many secrets engines: a fully managed SSH CA.
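Setting it up is a matter of two commands; mounting the engine at ssh-client-signer is our choice of path, not a default:

    # Enable the SSH secrets engine and have Vault generate the CA key pair:
    vault secrets enable -path=ssh-client-signer ssh
    vault write ssh-client-signer/config/ca generate_signing_key=true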

Once set up, authorized users send their public key to Vault via vault write and receive a signed certificate in return.
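With a signing role in place (here named engineer, an illustrative name), that round trip is a single command:

    # Ask Vault to sign our public key and store the certificate where ssh
    # will find it automatically:
    vault write -field=signed_key ssh-client-signer/sign/engineer \
        public_key=@"$HOME/.ssh/id_ed25519.pub" > ~/.ssh/id_ed25519-cert.pub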

In addition, Vault provides a convenient way to retrieve the CA public key for validation.
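The CA public key is served from an unauthenticated endpoint, so machines can fetch it during provisioning; the host and mount path are illustrative:

    # Fetch the CA public key into the file TrustedUserCAKeys points at:
    curl -o /etc/ssh/trusted-user-ca-keys.pem \
        https://vault.example.com:8200/v1/ssh-client-signer/public_key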

The CA private key never leaves Vault at all.

Vault applies the same security mechanisms to all secrets, thus giving us both access management and auditing right out of the box.

This process is so easy that we were able to set the certificate validity to a mere 30 minutes. Each time an engineer requires ssh access, they acquire a new certificate for that session.
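The lifetime is just an attribute of the signing role. A sketch of such a role – the principals and default user are illustrative, not our real settings:

    # A role that issues user certificates valid for 30 minutes:
    vault write ssh-client-signer/roles/engineer \
        key_type=ca \
        allow_user_certificates=true \
        allowed_users='*' \
        default_user=ops \
        ttl=30m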

Managing revocation lists is no longer necessary at all.

Now, when a user leaves the company, their access to Vault itself is revoked, and with it their access to the CA.

The one downside we have found over a year of using this mechanism is that it creates an added dependency on Vault.

To account for this, we run Vault in its high-availability setup, with multiple servers sharing a single encrypted storage backend.

In addition, we securely retain a copy of the key pair that AWS requires when EC2 instances are provisioned. This gives us an easy fallback should we ever lose Vault itself.

Using certificate-based authorization with Vault as the certificate authority is by far the easiest way we have found to ensure all our engineers can access servers when they need to, while giving us the power to manage and audit that access.