How we scaled our monolith to execute 250Mn+ trades in a day

You might have read dozens of articles on monolith vs microservices. Most of them suggest keeping your distance from microservices until you have solid reasons to adopt them.

But microservices are tempting, and those solid reasons may show up early in your development cycle or never at all. There are numerous articles on how to split a monolith into microservices, but very few talk about how to do your monolith right. This is an attempt at the latter.

Context

At Probo, we are a small team of about 30 engineers. One basic principle we try to follow is to keep things simple. As a result, we have been running a monolithic backend since day 1, and we are well into our 4th year. In that time we have gone from 0 trades to 250Mn+ trades in a day.

Things that helped us in this journey

Code structure -
1. Our code structure is inspired by Domain-Driven Design. From the very start we recognised three major bounded contexts and developed the code around them. It was as simple as user (profile, referral, etc.), trading (order management, trade management, etc.) and payments (user balance, transaction history, etc.) ecosystems.
  1. Interactions between these ecosystems were loosely coupled.
2. We had a very basic layered code architecture - Presentation → Application logic → Data management layer.
3. Treating each ecosystem as a service inside the monolith, with a focus on preventing domain leakage, helped us scale better.
  1. When one service needed something from another, it called that service's methods (service<>service) rather than going through its data layer.
  2. Most database joins stayed within their own bounded context, which helped maintain loose coupling.
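To make the service<>service idea concrete, here is a minimal TypeScript sketch. The class names, method names and in-memory storage are all hypothetical and only illustrate the boundary; they are not taken from the actual codebase. The trading context asks the payments context to debit a balance instead of touching payments data directly.

```typescript
// Hypothetical payments context: owns balance data and exposes it only via methods.
class PaymentsService {
  // Stand-in for the payments data layer; private so other contexts cannot reach it.
  private balances: Record<string, number> = {};

  deposit(userId: string, amount: number): void {
    this.balances[userId] = this.getBalance(userId) + amount;
  }

  getBalance(userId: string): number {
    return this.balances[userId] ?? 0;
  }

  // Deduct funds for an order; returns false when the balance is insufficient.
  debit(userId: string, amount: number): boolean {
    const balance = this.getBalance(userId);
    if (amount <= 0 || balance < amount) return false;
    this.balances[userId] = balance - amount;
    return true;
  }
}

// Hypothetical trading context: depends on PaymentsService, not on its tables.
class TradingService {
  private orders: { userId: string; price: number; quantity: number }[] = [];

  constructor(private readonly payments: PaymentsService) {}

  placeOrder(userId: string, price: number, quantity: number): boolean {
    // Service<>service call instead of a cross-context join or direct table write.
    if (!this.payments.debit(userId, price * quantity)) return false;
    this.orders.push({ userId, price, quantity });
    return true;
  }

  orderCount(): number {
    return this.orders.length;
  }
}

const payments = new PaymentsService();
const trading = new TradingService(payments);
payments.deposit("user-1", 100);
```

Because trading only sees the public methods of payments, the payments storage can change (new tables, a cache, a separate database) without any change in the trading code.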
Splitting into Server & Worker modes
1. We ran into performance issues early in our journey, when everything was handled by a single process. To solve this, we first identified the background tasks and moved them out of the server process.
2. As a result, the server process and the various worker processes could scale independently without affecting one another.
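One common way to get this split from a single codebase is to pick the mode at startup. The sketch below assumes a hypothetical mode value (for example read from an environment variable such as APP_MODE, a made-up name) and placeholder runner functions; it shows the shape of the idea rather than a real entrypoint.

```typescript
type Mode = "server" | "worker";

// Parse the mode from configuration; defaulting to "server" is an assumption.
function parseMode(raw: string | undefined): Mode {
  return raw === "worker" ? "worker" : "server";
}

// Each mode starts only what it needs: the server never runs background jobs,
// and workers never bind an HTTP port.
function start(mode: Mode, modeRunners: Record<Mode, () => string>): string {
  return modeRunners[mode]();
}

// Placeholder runners; real ones would mount routes or subscribe to a queue.
const runners: Record<Mode, () => string> = {
  server: () => "http server listening",
  worker: () => "consuming background jobs",
};

// In a real entrypoint the raw value would come from configuration,
// e.g. process.env.APP_MODE (hypothetical variable name).
const started = start(parseMode("worker"), runners);
```

The same build artifact can then be deployed to two groups of instances, each started with a different mode and scaled on its own signal (request load for servers, queue depth for workers).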
Language Specific Tweaking
1. Whatever language you pick, some basic configuration tuning is always required to scale efficiently.
2. We mostly use Node.js, and since it runs JavaScript on a single thread, we run it in cluster mode using PM2 to make full use of our compute resources.
3. Some other options that we used:
  1. --max-old-space-size=SIZE sets the maximum size (in MiB) of V8's old memory section.
  2. --prof enables V8 profiling.
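As an illustration, a PM2 app definition for cluster mode could look like the sketch below. The app name, script path and heap size are assumptions chosen for the example; the option keys (name, script, instances, exec_mode, node_args) are standard PM2 ecosystem-file options. PM2 usually reads this from an ecosystem.config.js file, so the TypeScript here only builds the same object shape.

```typescript
// Subset of PM2 app options used in this sketch.
interface Pm2App {
  name: string;
  script: string;
  instances: number | "max";
  exec_mode: "cluster" | "fork";
  node_args: string;
}

function buildApp(appName: string, script: string, heapMiB: number): Pm2App {
  return {
    name: appName,
    script,
    instances: "max",      // one process per available CPU core
    exec_mode: "cluster",  // PM2 distributes incoming connections across the processes
    node_args: `--max-old-space-size=${heapMiB}`, // V8 old-space limit in MiB
  };
}

// An ecosystem.config.js file would export this shape via module.exports.
// "api", "dist/server.js" and 2048 are illustrative values.
const ecosystem = { apps: [buildApp("api", "dist/server.js", 2048)] };
```

When sizing the heap limit, keep in mind that every cluster process gets its own heap, so the per-process limit multiplied by the number of processes has to fit in the instance's memory.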
The bottleneck is always the database
1. As mentioned by Kailash Nadh, “97.42%* of all scaling bottlenecks stem from databases”. And we can confirm that it is true!
2. Focus on fundamentals. “No indexes” or “incorrect indexing” is the most common answer I get in interviews when I ask about downtimes.
3. Learn about database config parameters even if you use a managed service. This helped us scale our database better and avoid downtimes.
Independent endpoint scaling
1. Segregating our high-throughput endpoints helped us scale and isolate issues faster.
2. Example - when a notification is sent to our users, we update its read/view status on our end. This creates burst traffic that hurts the performance of other APIs. Identifying this set of APIs and routing them to their own set of dedicated instances helped us prevent slowness.
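The routing rule itself is small. In practice it would usually live in a load balancer or reverse proxy configuration rather than in application code; the TypeScript sketch below just expresses the rule, with hypothetical path prefixes and pool names.

```typescript
// Hypothetical path prefixes for the burst-heavy notification endpoints.
const BURST_PREFIXES = ["/notifications/read", "/notifications/view"];

type Pool = "burst" | "default";

// Decide which instance pool should serve a request path.
function pickPool(path: string): Pool {
  const isBurst = BURST_PREFIXES.some((prefix) => path.indexOf(prefix) === 0);
  return isBurst ? "burst" : "default";
}
```

With the burst endpoints on their own pool, a notification spike can saturate only those instances, and the pool can be scaled on its own traffic pattern.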

Not everything is good in Wonderland.

Deployment time - Deployment time has increased significantly due to
1. CI pipeline tests and checks
2. Rolling deployments, where the time taken to complete a deployment is directly proportional to the number of instances
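A rough way to reason about the rolling part, assuming instances are replaced in fixed-size batches and each batch takes about the same time (both simplifying assumptions, with made-up numbers):

```typescript
// Estimated rolling deployment time, assuming fixed-size batches
// that each take roughly minutesPerBatch to drain, replace and warm up.
function rollingDeployMinutes(
  instances: number,
  batchSize: number,
  minutesPerBatch: number
): number {
  const batches = Math.ceil(instances / batchSize);
  return batches * minutesPerBatch;
}

// Illustrative numbers only: doubling the fleet doubles the rollout time.
const smallFleet = rollingDeployMinutes(40, 4, 3);
const largeFleet = rollingDeployMinutes(80, 4, 3);
```

Under these assumptions, a larger batch size shortens the rollout at the cost of having fewer healthy instances serving traffic during each step.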
Fear of change - There is always a fear that a change might take the complete service down. This has led to apprehension about how often we deploy, about deploying during high-traffic periods, and so on. But as of now we still do multiple deployments throughout the day!
Multiple dependencies & Resource Sizing
1. Dependencies such as databases, caches, etc. keep growing with use cases, which increases your surface area for downtimes.
2. Horizontal scaling also puts pressure on resources like database connections, which leads to unoptimized sizing. Proxy & pooling to the rescue!
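The connection pressure is easy to underestimate because it multiplies. The sketch below uses made-up numbers and hypothetical function names to show how instances, processes per instance (for example cluster-mode processes) and per-process pool size combine against a database connection limit.

```typescript
// Worst-case connections opened by the application fleet.
function totalConnections(
  instances: number,
  processesPerInstance: number,
  poolSizePerProcess: number
): number {
  return instances * processesPerInstance * poolSizePerProcess;
}

// Check demand against the database limit, keeping some connections
// reserved for admin access, migrations and other tooling.
function fitsWithin(maxConnections: number, reserved: number, needed: number): boolean {
  return needed <= maxConnections - reserved;
}

// Illustrative numbers only: 20 instances x 8 processes x 10 connections each.
const demand = totalConnections(20, 8, 10);
```

With these example numbers the fleet could open 1600 connections, far more than a limit of a few hundred. Shrinking the per-process pool helps, and a connection proxy or pooler in front of the database lets many application connections share a smaller number of real database connections.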