Ansible at Grofers (Part 2) - Managing PostgreSQL

We have diverse data needs at Grofers. And we use different kinds of databases to fulfill different needs. We started with Postgres in the beginning as our primary data store. And even though we have become polyglot with databases, Postgres remains to be the most used database at Grofers. This blog post is about how we manage Postgres with Ansible.

Background

In the beginning, we had one monolith database that was used by all the services. We didn’t have operational expertise of running Postgres in production back then. So we chose to use Amazon RDS so that we do not have to spend much time in setting up the infrastructure or have any operational overheads. All operations on the database were done manually by a few people who were familiar with the setup.

As we started adopting microservices architecture and move away from the monolithic code base, we started splitting our database and added new databases for different services. Today, we have many PostgreSQL clusters running for different services, each having one or more databases on them. The infrastructure is still managed using Amazon RDS.

With the increase in the scale of our Postgres setup, it was becoming difficult to manage day-to-day operations like managing databases, monitoring, access controls, etc. It became necessary to introduce new toolchain for making these operations easier.

Ansible for All Database Ops

We did some looking around to decide which tool to use for managing Postgres. While there were a few tools available, we chose to give Ansible a try since we were already using it for managing our EC2 infrastructure and we were quite liking it. Ansible comes with built-in modules for managing Postgres and Amazon RDS as well. These modules had enough features to fulfill our needs. Using Ansible made a lot of sense also because we strongly believe in formalizing our infrastructure in code as much as we can and committing it in a Git repo. In unforeseen circumstances where you need to recreate the entire infrastructure, this comes in very handy and speed things up. You can also peer review infrastructure changes through pull requests - something that most of the GUI tools do not support.

Today we have one shared Ansible repo which has all the database configurations committed. All Postgres clusters are setup and managed using this codebase. You can get a bird’s eye view of the entire setup in one place. Other developers can also submit pull requests for making changes to the database infrastructure that their team owns. While somewhat limited in what it does, it fulfills our major operational needs and can easily be extended to add more new features if needed.

The two major things that we have achieved with this setup are:

  1. Managing Users and ACLs - ease in managing users who have access to Postgres. All ACL management is automated and done using Ansible.

  2. Monitoring Databases - setting alerts on various metrics provided by RDS for individual Postgres clusters.

Managing Users

In the early days, managing users on one main database was relatively easy. We didn’t have to do a lot of operations on a daily basis other than running database migrations. As we added more clusters and databases, managing users on each of them became a disaster. We had to frequently do database operations. Conventions and best practices could not be enforced. Knowledge sharing was just too difficult no matter how much you document.

The process that we followed on all the databases was similar. So we started with formalizing the process and automated it using Ansible. We created an Ansible role which manages users across all our databases and Postgres clusters.

We have 4 types of users on every PostgreSQL instance:

  1. Master admin
    There is only one master admin on every instance and only the infrastructure team has access to the credentials. This is created at the time of creating the RDS instance. This user is used to manage the database configurations and create/edit users on the instance. This user is not allowed to view any data in any of the databases.

  2. Database Admin
    The number of database admins on a single instance is equal to the number of databases on that instance. This user is the owner of a single database and doesn't have access to any other databases on that instance. This user has privileges to run CREATE, ALTER and DROP table commands. The main role of this user in our environment is to run migrations. The credentials for this database user are only shared with experienced/senior engineers in the project’s team.

  3. Database Application User
    This user has SELECT, UPDATE and DELETE privileges on all the tables in a single database and is typically used by applications for their data needs. There can be multiple users of this type. For example, there will be different users for a web server and applications running celery queues.

  4. Database Reader
    This user only has the SELECT privilege. This user type can be used for multiple purposes. For example, in our data pipeline to move data to our data warehouse, in our business intelligence dashboard for data analysis and reporting, etc.

We create roles in the database for Database Application User and Database Reader first at the time of creating the Postgres database. Later when users are created, we just assign these roles to new users. This helps us in cases when we need to make some minor tweaks to all the users. We just need to update only the role instead of all the users in case we need to tweak privileges like revoking access to read from some table for all the users.

To create the database and users on an instance, we just need to define variables for our Ansible role to consume like so:

database_hosts:  
  consumer_db:
    host: consumer-db.rln395ai2rx.us-east-1.rds.amazonaws.com
    host_admin: host_admin
    dbs:
      paymentsdb:
        users:
          paymentsdb_application_server:
            role_membership: application

consumer_db is just an identifier of the host and is only relevant to our Ansible role. host is the RDS host address of the Postgres cluster. host_admin is the master admin username provided by RDS at the time of creating the cluster. The playbook will create the following entities:

  • a database named paymentsdb
  • a role - paymentsdb_application
  • two users - paymentsdb_admin, paymentsdb_application_server

The admin user (paymentsdb_admin in this case) is created for every new database that is created in a Postgres cluster and can be used to run migrations on this database. The user (paymentsdb_application_server) with application role (paymentsdb_application in this case) can be used by applications like web servers.

Monitoring databases

We have our own system for monitoring and alerting based on collectd, InfluxDB and Grafana but since we are currently using AWS's managed service RDS for databases, there is not much we can collect from outside and have to rely on AWS CloudWatch for alerts. To create and update alarms, we use an Ansible role written in-house.

We create two types of alarms which are triggered based on certain thresholds:

  1. Warning - default thresholds are 70%

  2. Critical - default thresholds are 90%

We tweak the default thresholds based on the criticality and workloads of the database. For our mission critical databases, we keep lower thresholds.

Whenever an alarm is triggered, infrastructure team and the team using that database instance are notified.

So whenever we create a database, we just need to run our playbook. Upon execution of the Ansible playbook, all the alarms are created.

Cloudwatch allows monitoring only system level metrics like CPU, memory, etc. While they are helpful, they don’t tell much about what’s going inside Postgres itself. For monitoring the internals of Postgres, we use PGObserver to track database level metrics like table size, index size, sequential scans, etc. and pgbadger to track slow queries. We run pgbadger hourly on our logs to identify the slow queries that come in our system and take action on them.

The source code for the Ansible role for monitoring RDS instances is publicly available on Github and Ansible Galaxy.

Conclusion

Things were easier to manage when there was a single database. But as they started to increase, management issues started to creep in. We had to do similar tasks every other day. Doing them manually was too error prone.

"If there is something you need to do more than once, automate it."

This is what the infrastructure team at Grofers believes. So we automated our workflow using Ansible and this setup has helped us save time, increase our productivity and save us from getting in a mess of managing multiple databases.

Want to work with us? We are hiring.