The perfect dev environment using AWS for large databases
Best practices for large DB based development environments and how to leverage AWS regions
During the start of a product, the database is quite small, mostly empty or populated with dummy data. Developers prefer having a local database instance running their favorite database like PostgreSQL or MySQL. This frees them from dependencies, and has the added advantage of having near zero latency.
As the product grows, the database inevitably grows with it. In some cases, replicating production issues also requires running the code against a copy of the production database. As a result, development databases often end up being created by copying a database dump from production and importing it. Since database dumps are text, they compress well, which can result in a relatively small file to copy over. But importing the dump can still take a lot of time and put a high load on the dev computer as it rebuilds tables and indexes. As long as your data is relatively small, this process may be perfectly acceptable.
Our team also went through the journey of setting up an acceptable strategy to work with our ever-growing database on Cassandra.
When we started out, the database was small. The feature set was growing faster than usage, which meant we had to resort to dummy data. We wrote a script to generate dummy data based on several parameters and use cases. It worked fairly well and kept everything on the developer’s device.
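As a rough illustration of what such a generator can look like, here is a minimal Python sketch. The record fields (`name`, `email`, `source`), the `use_case` parameter and the function name are all hypothetical and are not taken from our actual script.

```python
import random

# Hypothetical lead sources per use case; illustrative only.
SOURCES = {
    "web": ["landing_page", "contact_form"],
    "campaign": ["email_blast", "sms"],
}


def generate_leads(count, use_case="web", seed=None):
    """Return `count` fake lead records for the given use case."""
    rng = random.Random(seed)  # seeded so a data set can be regenerated
    leads = []
    for i in range(count):
        leads.append({
            "id": i,
            "name": f"Test Lead {i}",
            "email": f"lead{i}@example.com",
            "source": rng.choice(SOURCES[use_case]),
        })
    return leads
```

Passing a fixed `seed` means two developers can generate identical data sets, which helps when comparing behaviour across machines.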
For production issues, since our customers were friendly beta users, we would on rare occasions, and with their consent, restore a copy of our daily backup to reproduce the issue in our development environment.
Zero to one
MetroLeads’ data strategy is built on a schemaless model. Although we know the shape of the data, we can rarely count on it being complete or consistent. As the data passes through the data pipeline, it gets normalized for consumption by various stakeholders. When the product added features such as 3rd party integrations and bring-your-vendor models, the situation was exacerbated. The data grew too large to be housed on a developer’s laptop, and the need for a consistent database that multiple microservices could run against grew with it.
To combat this situation we introduced the “shrinking process”. Shrinking was a way to run a particular backup through a processing pipeline that:
- Removed all non-customer data
- Anonymized or scrubbed the remaining data to remove traces of any PII (Personally Identifiable Information)
- Left testing sandboxes intact
- Reduced the number of events by time, e.g. only kept events from the last 7 days
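The steps above can be sketched roughly as follows. This is a simplified, in-memory illustration and not our actual pipeline: the record layout (`sandbox`, `customer`, `timestamp`) and the list of PII field names are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed PII field names; a real list would depend on the data model.
PII_FIELDS = ("name", "email", "phone")


def shrink(records, now, keep_days=7):
    """Apply the four shrinking steps to an iterable of record dicts."""
    cutoff = now - timedelta(days=keep_days)
    kept = []
    for record in records:
        if record.get("sandbox"):
            kept.append(record)  # testing sandboxes stay intact
            continue
        if not record.get("customer"):
            continue  # drop non-customer data
        if record["timestamp"] < cutoff:
            continue  # keep only events from the last `keep_days` days
        scrubbed = dict(record)  # copy so the input is not mutated
        for field in PII_FIELDS:
            if field in scrubbed:
                scrubbed[field] = "REDACTED"
        kept.append(scrubbed)
    return kept
```

Taking `now` as a parameter rather than calling `datetime.now()` inside the function keeps the cutoff deterministic, so the same backup always shrinks to the same result.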
Developers have their own production accounts, which are connected to dummy vendors and QA communication stacks. For example, we use fake data generators such as Mockaroo (my personal favorite) together with a combination of Excel functions to generate large import payloads.
MetroLeads provides a sandbox for each organization. This makes it easy for us to remove all customer organization data in one go during the shrinking process.
Over time we extended the shrinking process to target a specific organization. This allowed us to run the same scrubbing process on a customer account without compromising security or data policies.
We ran this process on our tools server, which grew to an m4.2xlarge to handle simultaneous requests. We ended up scheduling the runs so that the load did not overwhelm the server. This was not scalable, although it worked for quite a long time.
Subsequently we hit an upper limit and decided to run the process as a daily cron with the developer accounts being refreshed every night. To avoid this being another tool that developers had to learn, we hooked it up as a Slack bot. Developers could simply run a Slack bot command and within a few hours the database would be made available on S3.
The shared solution
With the advent of the AWS Mumbai region, latency was no longer a problem. Our development team is mainly in Pune, and latency to the nearby Mumbai region was now below 40ms. That was quite alright for our use case, because our earlier architecture decisions lent themselves to handling this latency.
MetroLeads uses a combination of a source-of-truth DB (Cassandra), a search engine (Elasticsearch), a message bus (RabbitMQ) and local caching (Redis). We decided to set up a shared database for all developers in the Mumbai region. The idea was to set up all of our databases in AWS and keep only Redis local to the developer laptop. As expected, this worked really well for 80% of our UI based scenarios. The event processing flow was always meant to handle delays, so that was never a problem.
We quickly changed our onboarding documentation to not require any database installations. Developers only had to set up their Python servers with Redis and connect to the shared AWS server that housed all of the remaining databases.
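A developer configuration along these lines might look like the sketch below. The environment variable names and the default hostname `shared-dev.example.internal` are placeholders invented for this example, not our real settings.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DevSettings:
    cassandra_host: str
    elasticsearch_host: str
    rabbitmq_host: str
    redis_host: str = "localhost"  # Redis stays on the developer laptop


def load_settings(env=None):
    """Point every service except Redis at the shared AWS server."""
    env = os.environ if env is None else env
    # Placeholder default; replace with the real shared server address.
    shared = env.get("SHARED_DB_HOST", "shared-dev.example.internal")
    return DevSettings(
        cassandra_host=env.get("CASSANDRA_HOST", shared),
        elasticsearch_host=env.get("ELASTICSEARCH_HOST", shared),
        rabbitmq_host=env.get("RABBITMQ_HOST", shared),
    )
```

Per-service overrides such as `CASSANDRA_HOST` leave room for pointing a single service somewhere else without touching the rest.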
The non-shared solution
At times several developers would work on the same data set and inadvertently overwrite each other’s changes. Since we already had a blueprint for setting up a remote database, we tweaked the blueprint to restore a shrunk database to any server of choice. This led to an interesting setup: developers would launch their own servers in the nearest region and restore the blueprint on them. With large databases, this had two problems:
- Increased cost of running servers
- Stale data as developers would restore their copy less frequently
After a brainstorming session, we resorted to:
- Launch only spot instances for non-shared servers
- Use local port forwarding to switch between shared and non-shared servers
- Kill non-shared servers as early as possible
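For the spot instance part, a request could be assembled roughly as below. This sketch only builds the keyword arguments for the EC2 `run_instances` call; the function name, the tag keys and the example values are assumptions made for illustration.

```python
def spot_instance_params(image_id, instance_type, owner):
    """Build keyword arguments for EC2 run_instances requesting one spot instance."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceMarketOptions": {
            "MarketType": "spot",
            "SpotOptions": {
                # One-time request: the instance is not relaunched after
                # an interruption, which suits short-lived dev servers.
                "SpotInstanceType": "one-time",
                "InstanceInterruptionBehavior": "terminate",
            },
        },
        # Hypothetical tags to make forgotten servers easy to find and kill.
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [
                {"Key": "owner", "Value": owner},
                {"Key": "purpose", "Value": "non-shared-dev-db"},
            ],
        }],
    }
```

With boto3 installed and credentials configured, the result can be passed on as `boto3.client("ec2").run_instances(**params)`. Tagging by owner is one possible way to support the "kill early" rule, since untagged or stale servers become easy to list.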
Port forwarding was a great idea. It let us switch between development servers without changing config every time. We’ve seen local.dev2.env way too many times. With a port forward, you always point to localhost:9160, but depending on which server you’ve connected to, it will route it appropriately. On a Mac, we recommend using Core Tunnel.
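If you prefer the command line over a GUI tool, the same effect can be had with plain SSH local port forwarding. The sketch below assumes SSH access to the database server; the hostnames and the `dev` user are placeholders.

```python
import subprocess


def tunnel_command(remote_host, port=9160, user="dev"):
    """Build an ssh command forwarding localhost:<port> to the same port on the server."""
    return [
        "ssh",
        "-N",  # do not run a remote command, only forward ports
        "-L", f"{port}:localhost:{port}",  # local port -> port on the server itself
        f"{user}@{remote_host}",
    ]


def open_tunnel(remote_host, port=9160, user="dev"):
    """Start the tunnel in the background and return the process handle."""
    return subprocess.Popen(tunnel_command(remote_host, port, user))
```

Switching from the shared server to a non-shared one then means terminating the running tunnel and calling `open_tunnel` with the other host, while the application config keeps pointing at localhost:9160.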
QA and other environments
This post focuses only on the developer’s setup; however, QA environments are a subset of that problem. Spinning up a new instance with dummy, test or production data is a cinch using the above techniques.
We built these approaches over a number of years, refining and tweaking them for our use case. There is a lot of room to improve and we are constantly striving to learn from others and get better at it.
Hope you enjoyed the article.