A marketplace we work on had difficulty engaging customers: informing them about new promotions and encouraging them to buy and sell their products. They needed a new communication channel, more effective than emails or push notifications.
They wanted to be able to send marketing messages with specific content and to specific user bases at scheduled times, ideally using the existing chat user interface.
Although altering the existing chat functionality required us to solve a few technical problems, it allowed us to deliver the feature quickly and, more importantly, provide the expected business value for our partner.
Altering the existing system to speed up development
To develop the feature quickly, we decided to leverage our existing chat service, originally designed for general text communication between customers. It only allowed users to send and receive chat messages. We changed that logic slightly so that our system (or any other entity) could also send and receive messages.
We also needed to hide marketing messages after a certain period of time. Previously, there was no way to delete or hide chat messages. Since our roadmap already included plans to let users archive some of their conversations, we opted to implement this functionality now, at least on the backend. We abstracted the concept and used it to automatically archive marketing messages once the time specified by the marketing team when crafting the message has elapsed.
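The archiving rule described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the `Conversation` class, its `archive_after` field and the `archive_expired` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Conversation:
    id: int
    # Hypothetical field: set by the marketing team when crafting the message,
    # None for ordinary user-to-user conversations.
    archive_after: Optional[datetime] = None
    archived: bool = False


def archive_expired(conversations: list[Conversation], now: datetime) -> list[int]:
    """Archive every conversation whose expiry time has passed; return their ids."""
    archived_ids = []
    for conv in conversations:
        # Skip conversations that are already archived or never expire.
        if conv.archived or conv.archive_after is None:
            continue
        if conv.archive_after <= now:
            conv.archived = True
            archived_ids.append(conv.id)
    return archived_ids


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        Conversation(1, archive_after=now - timedelta(days=1)),  # expired
        Conversation(2, archive_after=now + timedelta(days=1)),  # still visible
        Conversation(3),                                         # ordinary chat
    ]
    print(archive_expired(sample, now))
```

Because the same `archived` flag would also serve user-initiated archiving later, a periodic job calling something like `archive_expired` is all the marketing case needs on top of it.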
With these small changes, we had completed most of the work for marketing messages, including sending and viewing them, as well as delivering notifications.
Balancing performance with stability
Unfortunately, when we sent marketing messages for the first time as part of a load test, it turned out that our system was not yet prepared to send hundreds of thousands of messages in such a short span of time. Sending the messages itself worked correctly, but it overwhelmed our event and async job queues, resulting in significant delays in other parts of our system, such as refreshing a product after editing.
Knowing that system stability should always take priority over new features, we decided to slow down sending to prevent these negative effects. Users can live without marketing messages, but they will be upset if they suddenly cannot buy or sell products on the platform.
Thanks to this, even though messages were sent more slowly than we had initially expected, we were able to release the feature and let our client use it. They were satisfied, especially since they quickly found that it provided significant business value.
Only then did we start to think about the steps needed to make sending these messages fast and smooth.
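Slowing down the sender can be as simple as spacing out individual sends. The sketch below shows one way to do it, assuming a hypothetical `send` callback; the `send_throttled` function and its parameters are illustrative and not taken from our codebase. The clock and sleep functions are injectable so the pacing can be tested without waiting.

```python
import time
from typing import Callable, Iterable


def send_throttled(
    recipients: Iterable[int],
    send: Callable[[int], None],
    max_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call send() for each recipient, spacing calls at least 1/max_per_second apart."""
    interval = 1.0 / max_per_second
    next_slot = clock()  # earliest moment the next send may happen
    sent = 0
    for user_id in recipients:
        wait = next_slot - clock()
        if wait > 0:
            sleep(wait)
        send(user_id)
        sent += 1
        # If we fell behind, do not burst to catch up: restart from the current time.
        next_slot = max(next_slot, clock()) + interval
    return sent


if __name__ == "__main__":
    # Example: at most 5 sends per second to ten hypothetical user ids.
    count = send_throttled(range(10), lambda uid: print("sent to", uid), 5.0)
    print("total sent:", count)
```

The rate is a single number, so it can be raised later once the rest of the system is able to absorb the load.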
Horizontal scaling and asynchronous workload
While the feature was already running safely in production, our engineering team started discussing how to make it faster. Why can such a feature have a negative impact on the system? Ideally, if a system is overloaded with too much data or work, it should be possible to fix it by scaling its infrastructure horizontally (i.e. adding more machines). In our case, several things stood in the way:
- First, we noticed an inefficiency in a query that checked whether a conversation about a specific topic existed between two users. Once marketing messages were introduced, this query started to time out and put a high load on the database, as our "system user" had millions of open conversations. Upon investigation, it turned out that the WHERE clause in that query was incomplete: we had forgotten to reference a column that was part of the composite index on that table. This index was meant to be used to verify whether a certain conversation exists. Once we added the missing column to the WHERE clause, the query performed smoothly again. To avoid such problems, it's always a good idea to make sure that complex queries on large tables actually match their index keys. For more information on this topic, check out this in-depth article on The Where Clause.
- The second thing we did was to increase the number of our async job processors and event consumers. This was necessary because sending millions of chat messages at once caused a lag due to a limit on how many jobs or events they could process per second. After deploying an increased container count in the Kubernetes configuration, the lag disappeared. Another solution would be to introduce autoscaling, which launches more containers if there is more work to be done.
- As we use Kafka under the hood to process events, increasing the number of running event consumers was not enough. You also have to increase the partition count of your Kafka topics, because within a consumer group each partition is consumed by at most one consumer. We’ve covered that topic already in a separate article called “How to increase Kafka topic partition count on Confluent platform using docker and kafkactl”.
- Scaling just Kafka was not enough either, because our events were also published to EventStore, a legacy solution we had not yet migrated away from. As it was based on a single Postgres database, scaling it horizontally wasn’t really an option, at least not a simple one. Once we tackled that debt and removed EventStore completely, it stopped being a bottleneck.
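To illustrate the first point, here is a small, self-contained sketch of how a missing column in the WHERE clause changes the way a composite index is used. The table, column and index names are hypothetical, and SQLite is used only because it ships with Python; the article does not specify our production schema, and the exact plan output differs between databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE conversations (
        id INTEGER PRIMARY KEY,
        sender_id INTEGER NOT NULL,
        topic_id INTEGER NOT NULL,
        recipient_id INTEGER NOT NULL
    );
    -- Hypothetical composite index meant for "does this conversation exist?" checks.
    CREATE INDEX idx_conversation_lookup
        ON conversations (sender_id, topic_id, recipient_id);
    """
)

# Missing topic_id: only the leading index column can narrow the search,
# so every row of a sender with millions of conversations may be visited.
BROKEN = (
    "SELECT 1 FROM conversations "
    "WHERE sender_id = ? AND recipient_id = ? LIMIT 1"
)

# All three index columns constrained: the lookup can go straight to the row.
FIXED = (
    "SELECT 1 FROM conversations "
    "WHERE sender_id = ? AND topic_id = ? AND recipient_id = ? LIMIT 1"
)


def query_plan(sql: str, params: tuple) -> str:
    """Return the planner's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # the 4th column holds the detail text


if __name__ == "__main__":
    print("broken:", query_plan(BROKEN, (1, 3)))
    print("fixed: ", query_plan(FIXED, (1, 2, 3)))
```

Comparing the two plans shows which index columns each query constrains. Running the equivalent of `EXPLAIN` on your own database is a cheap way to catch this kind of mistake before one user accumulates millions of rows.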
After overcoming these technical difficulties, we could finally speed up the sending process, for example from 18 to 50 messages per second, almost triple the speed!
Messages sent per second before tackling the tech debt
Messages sent per second after tackling the tech debt
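The Kafka point above, that adding consumers without adding partitions does not help, can be illustrated with a simplified model. This is not Kafka's actual assignor, only a round-robin approximation of the rule that each partition has exactly one owner within a consumer group; the function names are made up for this example.

```python
def assign_partitions(partition_count: int, consumer_count: int) -> dict[int, list[int]]:
    """Distribute partitions round-robin; every partition gets exactly one consumer."""
    assignment: dict[int, list[int]] = {c: [] for c in range(consumer_count)}
    for partition in range(partition_count):
        assignment[partition % consumer_count].append(partition)
    return assignment


def active_consumers(assignment: dict[int, list[int]]) -> int:
    """Count consumers that own at least one partition; the rest sit idle."""
    return sum(1 for partitions in assignment.values() if partitions)


if __name__ == "__main__":
    # Ten consumers on a 4-partition topic versus a 12-partition topic.
    print(active_consumers(assign_partitions(4, 10)))
    print(active_consumers(assign_partitions(12, 10)))
```

In this model the number of working consumers never exceeds the partition count, which is why both had to be raised together.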
Resilience and reliability
Having noticed that sending messages to millions of users can take a while, we added precautions to make sure that they were not sent during inconvenient hours. For example, the process is currently paused at 9 PM and automatically resumes at 7 AM.
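A check for such a quiet window might look like the sketch below. The 9 PM and 7 AM boundaries come from the paragraph above; the function name is hypothetical, and we assume here that `now` is already expressed in the recipients' local time.

```python
from datetime import datetime, time

QUIET_START = time(21, 0)  # sending pauses at 9 PM
QUIET_END = time(7, 0)     # sending resumes at 7 AM


def sending_allowed(now: datetime) -> bool:
    """Return True outside the quiet window, which wraps around midnight."""
    current = now.time()
    # Because the window crosses midnight, it is "after start OR before end".
    in_quiet_window = current >= QUIET_START or current < QUIET_END
    return not in_quiet_window


if __name__ == "__main__":
    print(sending_allowed(datetime(2024, 1, 10, 12, 0)))  # midday
    print(sending_allowed(datetime(2024, 1, 10, 23, 30))) # late evening
```

A sender loop can call this before each batch and simply wait until it returns True again, which gives the automatic resume in the morning.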
Additionally, sending and archiving messages can be activated, paused and resumed manually from the administration panel. Naturally, the admin can also see the progress of any of those tasks.
We were able to satisfy the business requirements by implementing a working solution fast, focusing on providing value to our customers. Users read the new chat messages more often than the previously used push notifications (24% compared to 15%). This resulted in a 1.5% increase in conversion on the offer creation form and 150,000 new offers published on the marketplace. The new marketing channel was received better than expected, maybe because it used a feature that was already familiar to end-users.