Deploying certificate-based SSH with Ansible
December 25, 2016
A few months ago, I read “Scalable and secure access with SSH” by Marlon Dutra on the Facebook Engineering blog. It’s an informative look into how an organization of Facebook’s size is able to keep authentication manageable across a very large, dynamic, and scalable environment without a single point of failure. If you haven’t read the article, do that before reading mine. Otherwise, nothing below is going to make any sense. It’s a pretty short and enjoyable read.
Facebook mentions the need for some type of automation system inside larger environments, as manually signing and distributing certificates would be an otherwise arduous process. An automation system can also allow for an additional level of security: it becomes much easier to frequently refresh key pairs and certificates when an automated system can perform the heavy lifting.
I’ve been (infrequently) working with Ansible on personal projects for a few years now. Implementing portions of a certificate-based SSH architecture seemed like a good project to reacquaint myself with the Ansible way of doing things. Below, I’ll discuss the Ansible approach that I chose, some of the challenges that I ran into, and room for future improvement.
All of the code discussed below is available on GitHub. The Vagrantfile for the live demo is also in there, so you can easily follow along if you’d like.
Variables
Very generally, there are 3 classifications of hosts within the certificate-based SSH architecture proposed in Facebook’s article: a certificate authority, bastion hosts, and regular production servers. The regular production servers can be further classified based on their purpose and security zone: webservers, database servers, log servers, etc.
Additionally, the environment has one or more users who log into the bastion hosts and then jump into production systems via SSH. These users are able to log in to production servers as the root user, and their ability to access production servers is based on the principals found in the SSH certificate that is signed by the certificate authority. For example: a website administrator would only be allowed to access webservers, because their certificate only has the zone-webservers
principal. All of the webservers in the environment are configured to allow logins from certificates with the zone-webservers
principal.
It turns out that these constructs map rather nicely into Ansible variables. The Ansible hosts
file can categorize each host (webservers, dbservers, etc). We can also define users, servers, and their respective principals in the various group_vars
files. We’ll take a closer look at each of these files during the demo.
Roles
Considering the overall Facebook architecture, it also becomes clear that each category of host has a certain role to play. The purpose of the certificate authority is to sign public keys from the bastion hosts. The purpose of a webserver is to accept root logins from users who present a valid certificate with the zone-webservers
principal. These well-defined purposes map well to Ansible roles. I’m going to dig deeper into each of these roles, but let’s start with a simple list:
- ntpClient
- automationServer
- newCA
- bastion
- existingCA
- sshServers
Now let’s take a closer look at the actions performed by each role. We’ll examine each role in the order of execution from the root main.yml file so that we can better understand the dependencies and relationships between each role.
ntpClient
Certificate-based implementations rely on correct timing among hosts. While this requirement isn’t based on a strictly defined clock-skew limit (as it is with Kerberos), it is important that all hosts have the same approximate time. If a host with the incorrect time is presented with a certificate that appears to be signed in the future, it will reject it. Likewise, a certificate could appear to have expired if the clock skew between hosts is too great.
Needless to say, I learned this the hard way. The ntpClient
role simply installs and starts the NTP service with the default NTP servers. If you’re using my code in your environment, you may choose to roll this in with a higher-level “common” task.
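To make the clock-skew failure concrete, here is a minimal sketch of the time check a server applies to a certificate. The function name cert_time_ok and the timestamps are made up for illustration; the comparison mirrors the usual "not yet valid" and "expired" conditions rather than reproducing any particular implementation.

```shell
# Succeeds when NOW falls inside [VALID_AFTER, VALID_BEFORE).
# Arguments: NOW VALID_AFTER VALID_BEFORE, all Unix timestamps in seconds.
cert_time_ok() {
  [ "$1" -ge "$2" ] && [ "$1" -lt "$3" ]
}

signed_at=1482630000              # arbitrary example timestamp
expires_at=$((signed_at + 3600))  # certificate valid for one hour

# A server whose clock runs five minutes slow sees a certificate
# that appears to come from the future.
slow_clock=$((signed_at - 300))

if cert_time_ok "$slow_clock" "$signed_at" "$expires_at"; then
  echo "accepted"
else
  echo "rejected: not yet valid on this host"
fi
```

One common way to tolerate small skew is to backdate the start of the validity interval when signing, for example with ssh-keygen -V -5m:+1w, though keeping NTP running is still the real fix.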
automationServer
The automationServer
role is just a fancy way of saying “localhost” and is used to perform some basic prep work on our Ansible server. Currently, it just creates a temporary directory for storing SSH public keys and certificates.
newCA
The first tangible step in the deployment of a certificate-based SSH system is the creation of a certificate authority. A CA in the SSH world is fairly simple: It really just contains a normal SSH keypair. The private CA key is used to sign and generate SSH certificates for normal user keys. The CA public key is then trusted by all of the servers in the environment.
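As a rough sketch of what "creating a CA" amounts to, the commands below generate a key pair in a scratch directory. The file name ssh_ca, the comment, the key type, and the empty passphrase are assumptions chosen for a lab; they are not necessarily what the newCA role uses.

```shell
# Work in a scratch directory so nothing touches real keys.
workdir="$(mktemp -d)"

# Create the CA key pair. -N '' sets an empty passphrase, which is only
# reasonable for a throwaway lab CA.
ssh-keygen -q -t ed25519 -f "$workdir/ssh_ca" -N '' -C 'lab-ssh-ca'

# ssh_ca     : private signing key, stays on the certificate authority
# ssh_ca.pub : public key that every server is configured to trust
ls -l "$workdir"
```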
bastion
Bastion hosts are used as SSH jump hosts for users trying to reach production servers. The bastion
role creates all of the user accounts and SSH key pairs for users seeking to access the environment. We also need to download the created public keys so that we can have them signed by the certificate authority.
existingCA
The existingCA
role is used to sign public keys with the existing certificate authority created by the newCA
role. We need to generate certificates for the public key of each user on each bastion host. These certificates need to be generated with the correct principals for each user. After the certificates have been generated, they must be downloaded so that they can be distributed to the appropriate bastion hosts.
It’s worth noting that this distribution of signed certificates is performed directly in the root main.yml play, instead of being split out into a separate role.
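The signing step itself boils down to a single ssh-keygen invocation per public key. The sketch below uses throwaway keys in a scratch directory; the file names, the identity string, and the one-week validity are assumptions for illustration, not values taken from the existingCA role.

```shell
workdir="$(mktemp -d)"

# Stand-ins for the real keys: a throwaway CA and a throwaway user key.
ssh-keygen -q -t ed25519 -f "$workdir/ssh_ca" -N '' -C 'lab-ssh-ca'
ssh-keygen -q -t ed25519 -f "$workdir/webuser01" -N '' -C 'webuser01'

# Sign the user's public key.
#   -s  CA private key to sign with
#   -I  key identity, which shows up in server logs when the certificate is used
#   -n  comma-separated list of principals
#   -V  validity interval (+1w is an arbitrary choice here)
ssh-keygen -s "$workdir/ssh_ca" -I webuser01 -n zone-webservers -V +1w "$workdir/webuser01.pub"

# The certificate is written next to the public key as webuser01-cert.pub.
# -L prints its contents, including the principals and validity window.
ssh-keygen -L -f "$workdir/webuser01-cert.pub"
```

The certificate only needs to sit alongside the user's private key (as the matching -cert.pub file) for the SSH client to present it automatically.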
sshServers
The sshServers
role is applied to all production servers that accept user logins via certificate-based SSH. Each of these servers must trust the public key of the certificate authority, and they must be configured to allow root login with the appropriate principals.
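On the server side, this comes down to two sshd_config directives plus a principals file. The sketch below writes them under a scratch directory rather than the real /etc/ssh. The CA public key path /etc/ssh/ssh_ca.pub and the snippet file name are assumptions; the principals file path and the two principals are the ones used for the webservers later in this article.

```shell
# Write everything under a scratch root instead of the real /etc/ssh.
root="$(mktemp -d)"
mkdir -p "$root/etc/ssh/allowed_principals"

# sshd_config additions: trust certificates signed by the CA, and look up
# the principals allowed for each local account (%u expands to the user name).
cat > "$root/etc/ssh/sshd_config.snippet" <<'EOF'
TrustedUserCAKeys /etc/ssh/ssh_ca.pub
AuthorizedPrincipalsFile /etc/ssh/allowed_principals/%u
EOF

# One principal per line. A certificate is accepted for root if any of
# its principals appears in this file.
printf '%s\n' root-everywhere zone-webservers > "$root/etc/ssh/allowed_principals/root"

cat "$root/etc/ssh/allowed_principals/root"
```

On a real host, the two directives would go into /etc/ssh/sshd_config, and sshd would need to be reloaded before they take effect.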
Tags
While all of these deployment activities by Ansible are useful, it would be even more helpful to simply execute parts of the overall play on an as-needed basis. It probably isn’t necessary to generate a new certificate authority key pair every week (although this could be desirable). Rather, it might be useful to simply re-sign the existing public keys of each user with the existing certificate authority private key. Likewise, it could be useful to generate new SSH keys for all users and have these signed by the existing certificate authority. This is pretty simple to accomplish by using Ansible tags. The code implements tags for both of the above scenarios.
For example, executing ansible-playbook main.yml -i hosts --tags reSignOnly
will simply generate new certificates for all users by signing their existing public keys with the existing CA private key.
Challenges
I did run into some challenges when building this set of playbooks. A few of these are based on strange shortcomings in Ansible that had to be circumvented.
Not using Ansible key generation
It might seem odd that I don’t utilize the key generation capabilities of Ansible, namely the authorized_key module. Unfortunately, the authorized_key
module is incapable of overwriting an existing SSH key pair, which is necessary in the use cases of these playbooks. Specifically, tasks making use of the initialBuild
and reKeyReSign
tags would fail, as Ansible would be unable to generate new SSH keys for each user.
Multiple bastions
The most frustrating issue that I ran into was Ansible’s limitations around loops. For the playbook to work on multiple bastion hosts, it’s necessary to iterate over both a list and a dictionary. Namely, the list of bastion hosts and the users dictionary from the all.yml
file need to be processed due to the way that Ansible fetch operations structure directories. I don’t want to rehash the issue too much here, so I’ll just reference the Stack Overflow question about the issue.
The “solution” to this, as explained on Stack Overflow, is to use a hack that compresses a dictionary into a list. While this certainly works in my use case, it’s a bit frustrating that Ansible doesn’t provide more robust mechanisms to loop over arbitrary combinations of data structures.
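To see why both data structures are involved, it helps to spell out the combinations. By default, fetch saves each file under the destination directory, then the inventory host name, then the full remote path, so every (bastion, user) pair ends up at its own location on the Ansible server. The sketch below simply enumerates those locations with nested loops; the key file name id_rsa.pub is a placeholder, since the real name depends on how the keys are generated.

```shell
# Sample data mirroring the demo: two bastions and three users.
bastions="10.100.0.10 10.100.0.11"
users="webuser01 dbuser01 webadmin01"
temp_dir="/tmp/ansible_ssh"

# Every (bastion, user) pair needs its own iteration.
list_key_paths() {
  for host in $bastions; do
    for user in $users; do
      # Default fetch layout: <dest>/<inventory host>/<remote path>
      echo "$temp_dir/$host/home/$user/.ssh/id_rsa.pub"
    done
  done
}

list_key_paths
```

Two bastions and three users yield six paths, which is the cross product that the playbook has to walk when copying keys to the CA and certificates back to the bastions.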
Existing and new CA roles
It might seem strange that I’ve created roles for both an existingCA
and a newCA
even though these roles are executed on the same host (the certificate authority). It would make more sense to use tags for this use case. Unfortunately, Ansible doesn’t provide for the ability to execute tags in the main playbook. To use tags, the main playbook would have to execute the same role with different tags at different times.
For example, let’s suppose we used one role called certificateAuthority
. Let’s say that all tasks involved in building a certificate authority are tagged with the NewCATasks
tag. Then, let’s say that all the tasks involved with certificate generation on an already-built certificate authority are tagged with the existingCATasks
tag. The main tasks.yml
file would have to call the certificateAuthority
role twice at different times: once with the NewCATasks
tags and then again with the existingCATasks
tags.
Currently, Ansible doesn’t provide the ability to call other playbooks using tags. This might be by design: it can encourage cleaner code and less confusing tag dependencies. At any rate, it seemed sensible to divide the tasks into different roles.
Doing it live
So this is all interesting in theory, but it becomes much more fun to actually run the scripts and watch what they do. A sample Vagrantfile is available in the GitHub repository and can be used to follow along with the sample files provided in the repo. The Vagrantfile will build an infrastructure with the following hosts:
- A certificate authority at 10.100.0.20
- Bastion hosts (bastion01 and 02) at 10.100.0.10 and 10.100.0.11
- Web hosts (web01 and 02) at 10.100.0.101 and 10.100.0.102
- A database host at 10.100.0.121
- An ansible deployment server at 10.100.0.9
- A Java host (java01) at 10.100.0.131
Note that the web, database, and Java hosts don’t actually do anything. No real server software is installed. Rather, they’re just for demonstration purposes. You’ll also notice that this constitutes quite a few VMs, and may not run well on resource-constrained machines. I’ve got 16 GB of RAM in my laptop, so it works fine for me. If I were smart, I would probably use a Docker provider for Vagrant.
Once the machines are built, Vagrant runs a shell provisioning script to automate some of the Ansible setup steps. This just creates an Ansible user (with password “ansible”), allows Ansible to perform passwordless sudo, and allows password SSH login. It goes without saying, but these sample Vagrant scripts should never be used for anything in production.
Note: If Vagrant throws errors about not being able to mount the vboxsf file system, just disable folder sharing in the Vagrantfile. For some reason, VirtualBox, Vagrant, and CentOS 7 never play nicely with shared folders on any Windows machine that I’ve used. I don’t know if it’s a Vagrant issue, a VirtualBox issue, or a problem with the CentOS 7 provider. I don’t care, but it’s extremely annoying and broken 85% of the time. A haphazard combination of kernel updates, manual guest tools installation, and VirtualBox updates resolves the issue about 90% of the time. You can also install the vagrant-vbguest plugin, but that also only solves the problem sometimes. Of course, all of this manual work largely defeats the purpose of using Vagrant, but that’s a topic for another soapbox.
PS C:\Users\tony\Google Drive\Projects and Presentations\FB Ansible SSH\fbSSH> vagrant up
Bringing machine 'certAuthority' up with 'virtualbox' provider...
Bringing machine 'bastion01' up with 'virtualbox' provider...
Bringing machine 'bastion02' up with 'virtualbox' provider...
Bringing machine 'web01' up with 'virtualbox' provider...
Bringing machine 'web02' up with 'virtualbox' provider...
Bringing machine 'db01' up with 'virtualbox' provider...
Bringing machine 'ansible' up with 'virtualbox' provider...
Bringing machine 'java01' up with 'virtualbox' provider...
<< output truncated >>
At some point, we’ll have a working environment. I’m going to quickly install epel-release, git, and Ansible. Then we’ll generate SSH keys for the ansible user, and copy those keys to all of the hosts in the environment:
[ansible@ansible certificateSSH]$ sudo yum install epel-release -y > /dev/null
[ansible@ansible ~]$ sudo yum install -y git ansible > /dev/null
[ansible@ansible ~]$ ssh-keygen -t ecdsa
<< output omitted >>
[ansible@ansible ~]$ ssh-copy-id 10.100.0.20
<< output and commands for all additional hosts omitted >>
Next, I’ll clone the project from GitHub:
[ansible@ansible ~]$ git clone https://github.com/acritelli/certificateSSH.git
Cloning into 'certificateSSH'...
remote: Counting objects: 75, done.
remote: Total 75 (delta 0), reused 0 (delta 0), pack-reused 75
Unpacking objects: 100% (75/75), done.
[ansible@ansible ~]$ cd certificateSSH/
Now that Ansible can access all of the servers in the environment and we have the code from GitHub, we should be good to run the main playbook. First, let’s take a quick look at some of the important file contents. We’ll start with the group_vars/all.yml
file:
[ansible@ansible certificateSSH]$ cat group_vars/all.yml
users:
  webuser01: "zone-webservers"
  dbuser01: "zone-dbservers"
  webadmin01: "zone-dbservers,zone-webservers"
ansible_temp_directory: /tmp/ansible_ssh
These variables apply to all hosts, and can be accessed within any of the role playbooks. Notice that we have a users
dictionary with a list of users and a comma-separated string of principals. There’s also an ansible_temp_directory
variable that can be used to adjust where public keys and certificates are stored on the Ansible host. Next, let’s look at the group_vars/webservers.yml
file:
[ansible@ansible certificateSSH]$ cat group_vars/webservers.yml
principals:
- zone-webservers
This is a pretty simple variable file. It simply contains a list of allowed principals for a set of hosts. For our webservers, we only allow the zone-webservers
principal. The root-everywhere
principal is allowed on all hosts by default. The list of allowed principals will be placed into the /etc/ssh/allowed_principals/root
file by the sshServers role. Finally, let’s take a look at the hosts
file for this sample topology:
[ansible@ansible certificateSSH]$ cat hosts
[certificateAuthority]
10.100.0.20
[bastionHosts]
10.100.0.10
10.100.0.11
[webservers]
10.100.0.101
10.100.0.102
[dbservers]
10.100.0.121
Our hosts file is fairly straightforward. We define a single certificate authority, a few bastion hosts, some webservers, and a database server. The IPs of each server correspond to the IPs from the Vagrantfile. Notice that the java01
server is missing from the hosts. We’ll add it in later.
OK, now that we have all of this boilerplate out of the way, let’s go ahead and run the playbook:
[ansible@ansible certificateSSH]$ ansible-playbook main.yml -i hosts
<< output omitted >>
With any luck, everything should be working. Let’s see if webuser01
can log into the webservers from bastion01
:
[webuser01@bastion01 vagrant]$ ssh root@10.100.0.101
Last login: Sun Dec 25 02:45:29 2016 from 10.100.0.10
[root@web01 ~]# whoami
root
[root@web01 ~]# exit
logout
Connection to 10.100.0.101 closed.
[webuser01@bastion01 vagrant]$ ssh root@10.100.0.102
Last login: Sun Dec 25 02:45:35 2016 from 10.100.0.10
[root@web02 ~]# whoami
root
[root@web02 ~]# exit
logout
Connection to 10.100.0.102 closed.
Cool! Can webuser01
log into the database server?
[webuser01@bastion01 vagrant]$ ssh root@10.100.0.121
The authenticity of host '10.100.0.121 (10.100.0.121)' can't be established.
ECDSA key fingerprint is 0f:30:d3:98:5d:37:9e:de:11:a0:c7:e5:ed:55:dc:31.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.100.0.121' (ECDSA) to the list of known hosts.
no such identity: /home/webuser01/.ssh/id_ed25519: No such file or directory
root@10.100.0.121's password:
[webuser01@bastion01 vagrant]$
Nope, and that’s what we want to see. Note that SSH prompts for a password login, which will fail as there is no webuser01
on any of the servers. Root login is enabled everywhere. You may choose to disable SSH password login entirely. I leave it enabled by default for this set of playbooks, mainly because removing it prevents me from logging in as the vagrant user in my lab.
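If you do want to turn off password logins, the change is a single sshd_config directive. The sketch below edits a scratch copy with two sample lines rather than the real /etc/ssh/sshd_config; the sed pattern is one possible way to flip the setting and is not part of the playbooks.

```shell
# Demonstrate on a scratch file rather than the real /etc/ssh/sshd_config.
conf="$(mktemp)"
printf '%s\n' 'PermitRootLogin yes' 'PasswordAuthentication yes' > "$conf"

# Set the directive to "no", whether or not the line was commented out.
# -i.bak keeps a backup copy of the original file.
sed -i.bak -E 's/^#?[[:space:]]*PasswordAuthentication[[:space:]].*/PasswordAuthentication no/' "$conf"

cat "$conf"
```

On a real server, sshd has to be reloaded afterwards, and it is worth confirming that certificate logins work before closing your existing session.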
Using Tags
So this set of playbooks will build our entire infrastructure out, which is useful for the initial build stages and for rebuilding a certificate authority and distributing its public key to all the servers in the environment. However, we discussed earlier that tags provide a valuable way to only execute certain tasks.
What if we want to generate new key pairs and re-sign them for all users? We can simply use the reKeyReSign
tag, as seen below. Notice that our play only ran the tasks associated with user key generation and signing. I’ve truncated the output to only the task list for brevity.
[ansible@ansible certificateSSH]$ ansible-playbook main.yml -i hosts --tags reKeyReSign
PLAY [all] *********************************************************************
TASK [setup] *******************************************************************
TASK [ntpClient : install ntp] *************************************************
TASK [ntpClient : start ntp] ***************************************************
TASK [ntpClient : ensure that ntp will start on boot] **************************
PLAY [localhost] ***************************************************************
TASK [setup] *******************************************************************
TASK [automationServer : ensure that ansible directory for public keys and certs exists] ***
PLAY [certificateAuthority] ****************************************************
TASK [setup] *******************************************************************
PLAY [bastionHosts] ************************************************************
TASK [setup] *******************************************************************
TASK [bastion : create SSH keys for each user account] *************************
TASK [bastion : adjust permissions for user's private and public SSH key] ******
TASK [bastion : download created SSH public key for each user] *****************
PLAY [certificateAuthority] ****************************************************
TASK [setup] *******************************************************************
TASK [existingCA : make temp directory for SSH public keys] ********************
TASK [existingCA : copy downloaded SSH public keys to CA for signing] **********
TASK [existingCA : create signed certificates with appropriate principals] *****
TASK [existingCA : download signed certificates] *******************************
TASK [existingCA : delete temporary public key directory] **********************
PLAY [bastionHosts] ************************************************************
TASK [setup] *******************************************************************
TASK [place signed certificates into user's .ssh folder] ***********************
PLAY [all:!bastionHosts:!certificateAuthority] *********************************
TASK [setup] *******************************************************************
PLAY RECAP *********************************************************************
10.100.0.10 : ok=10 changed=3 unreachable=0 failed=0
10.100.0.101 : ok=5 changed=0 unreachable=0 failed=0
10.100.0.102 : ok=5 changed=0 unreachable=0 failed=0
10.100.0.11 : ok=10 changed=3 unreachable=0 failed=0
10.100.0.121 : ok=5 changed=0 unreachable=0 failed=0
10.100.0.20 : ok=11 changed=5 unreachable=0 failed=0
localhost : ok=2 changed=0 unreachable=0 failed=0
How about just re-signing the existing key pairs? That’s pretty easy too, with the reSignOnly
tag. Notice that only the key signing tasks ran, and new keys were not generated for each user.
[ansible@ansible certificateSSH]$ ansible-playbook main.yml -i hosts --tags reSignOnly
PLAY [all] *********************************************************************
TASK [setup] *******************************************************************
TASK [ntpClient : install ntp] *************************************************
TASK [ntpClient : start ntp] ***************************************************
TASK [ntpClient : ensure that ntp will start on boot] **************************
PLAY [localhost] ***************************************************************
TASK [setup] *******************************************************************
TASK [automationServer : ensure that ansible directory for public keys and certs exists] ***
PLAY [certificateAuthority] ****************************************************
TASK [setup] *******************************************************************
PLAY [bastionHosts] ************************************************************
TASK [setup] *******************************************************************
TASK [bastion : download created SSH public key for each user] *****************
PLAY [certificateAuthority] ****************************************************
TASK [setup] *******************************************************************
TASK [existingCA : make temp directory for SSH public keys] ********************
TASK [existingCA : copy downloaded SSH public keys to CA for signing] **********
TASK [existingCA : create signed certificates with appropriate principals] *****
TASK [existingCA : download signed certificates] *******************************
TASK [existingCA : delete temporary public key directory] **********************
PLAY [bastionHosts] ************************************************************
TASK [setup] *******************************************************************
TASK [place signed certificates into user's .ssh folder] ***********************
PLAY [all:!bastionHosts:!certificateAuthority] *********************************
TASK [setup] *******************************************************************
PLAY RECAP *********************************************************************
10.100.0.10 : ok=8 changed=1 unreachable=0 failed=0
10.100.0.101 : ok=5 changed=0 unreachable=0 failed=0
10.100.0.102 : ok=5 changed=0 unreachable=0 failed=0
10.100.0.11 : ok=8 changed=1 unreachable=0 failed=0
10.100.0.121 : ok=5 changed=0 unreachable=0 failed=0
10.100.0.20 : ok=11 changed=5 unreachable=0 failed=0
localhost : ok=2 changed=0 unreachable=0 failed=0
Adding hosts and users
OK, now let’s see how easy it is to add new hosts and users. We’re going to introduce our java01
host to the environment and place it into the zone-javaservers
security zone. We’ll also define a user called javauser01
who has access to this zone.
First, we’ll create a section in the hosts
file for javaservers:
[ansible@ansible certificateSSH]$ cat hosts
[certificateAuthority]
10.100.0.20
[bastionHosts]
10.100.0.10
10.100.0.11
[webservers]
10.100.0.101
10.100.0.102
[dbservers]
10.100.0.121
[javaservers]
10.100.0.131
Next, we’ll add the group_vars/javaservers.yml
variable file to specify the allowed principal for the javaservers hosts:
[ansible@ansible certificateSSH]$ cat group_vars/javaservers.yml
principals:
- zone-javaservers
Finally, we’ll define our javauser01
with the zone-javaservers
principal in the group_vars/all.yml
file:
[ansible@ansible certificateSSH]$ cat group_vars/all.yml
users:
  webuser01: "zone-webservers"
  dbuser01: "zone-dbservers"
  webadmin01: "zone-dbservers,zone-webservers"
  javauser01: "zone-javaservers"
ansible_temp_directory: /tmp/ansible_ssh
Now that those three steps are completed, we can re-run our playbook. You might be wondering if it’s possible to use a tag for this, instead of rebuilding the entire environment. That’s a great question! And the answer is: not currently, but I’ve got a GitHub issue open for that.
[ansible@ansible certificateSSH]$ ansible-playbook main.yml -i hosts
<< output omitted >>
Once the playbook has finished running, we should be able to log in to the java01
server as javauser01
from either of the bastion servers:
[javauser01@bastion01 vagrant]$ ssh root@10.100.0.131
Last login: Sun Dec 25 03:02:39 2016 from 10.100.0.10
[root@java01 ~]# whoami
root
[root@java01 ~]# exit
And that’s it! This set of playbooks makes it pretty straightforward to roll out a certificate-based SSH environment and make modifications to users and hosts. I wanted it to be as turnkey as possible so that it can be run in other environments with minimal changes.
Improvement Opportunities
If you’ve made it this far, or just reviewed the code, you’ve probably noticed that there are some issues and room for improvement in this implementation. Definite issues and planned enhancements can be seen directly in the GitHub Issues tracker, so I won’t spend any time on them. However, there are some key points that I would like to discuss.
Scalability
As you can probably imagine, it’s not exactly scalable (or even really necessary) to regenerate the certificates or SSH keys for all users in one shot. Facebook’s article discusses a system where a user logs in, security checks are run on their login, a key pair is created, and a certificate is generated upon login. This type of system makes far more sense in any large environment, and would entirely remove the need to constantly refresh all of the certificates or key pairs in the environment.
While my set of playbooks doesn’t accomplish this, that’s actually by design. The type of system that Facebook described would be very environment-dependent. I could imagine a company using a RADIUS attribute to represent the allowed principals for a user. When the user logs into the bastion, the RADIUS response kicks off Ansible scripts for the key and certificate generation processes.
A similar scalability challenge exists within a very dynamic server environment. It would be inconvenient to constantly update the hosts file and group variables with new hosts and principals.
This set of playbooks can be used as building blocks for robust and dynamic environments, or they can be used “as-is” for smaller, more static shops. Either way, these playbooks aim to provide a reasonable foundation.
Additional roles and configuration
Facebook’s “everyone logs in as root” paradigm relies on a thorough accounting infrastructure. This set of playbooks makes no attempt to set up any sort of logging server or accounting configuration on hosts. Again, I’ll leave that up to each environment to implement according to its needs (although I may build some sort of basic logging role when I have time).
There is also some “missing” firewall configuration. Specifically, a few firewall rules should be in place:
- The bastion hosts should only be accessible from authorized external networks
- Regular servers should only be accessible via SSH from the bastion hosts
- Access to the certificate authority should be heavily restricted
Again, as firewall rules can be somewhat environment-specific, I’ve opted not to implement any with this set of playbooks. There’s a chance that a shop willing to Ansible-ize the rollout of SSH certificates may already have playbooks designed to implement their firewall and security strategies.
Conclusion
If you’re really still reading at this point: thanks for following along! This project was mostly just an excuse for me to pick back up on Ansible, and Facebook’s paradigm for SSH authentication provided an interesting project to work on. While I definitely ran into some strange issues in Ansible, particularly with iterating over more complex data structures, I think it was a great tool for this purpose. The certificate-based SSH methods described by Facebook’s article are very simple and straightforward, and Ansible allows for an automation system with equally low overhead.
Overall, I hope that this set of playbooks can provide some useful boilerplate for anyone who wants to automate a certificate-based SSH system. Feel free to fork the repository, raise issues, or submit pull requests.