yet another XMPP instant messenger

VaxBot performance challenge

A few days ago, VaxBot, a new XMPP-based vaccination appointment notification service, was launched in the USA. The service recommends Monal and yaxim as client applications, and both apps register accounts on yax.im by default.

This has brought us hundreds of new users per day and a significant amount of new traffic from the notifications, sometimes up to 200 messages per second. The additional load caused some short service interruptions over the last few days, and we are working with the VaxBot team on mitigations.

The VaxBot service started in mid-March in Massachusetts and is gradually expanding to more states, with hopes of eventually spreading overseas as well. It was even featured on TV (archive links for EU citizens: FOX5, WBTV)!

This has led to a significant uptick in yax.im account registrations, as can be seen in this graph:

[Figure: yax.im account registrations over the last few weeks]

Once a user registers with VaxBot, the bot automatically sends them notifications for their region as soon as appointments become available.

This means that for each potential appointment, a message is sent to each registered user in the region. When a large chain opens up additional capacity, thousands of messages are generated and sent out in a single burst.
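To make the fan-out concrete, here is a minimal Python sketch of "one message per registered user per appointment". It is not VaxBot's actual code: the `Notification` type, the `fan_out` helper, the region key and the example addresses are all hypothetical and only illustrate how the message count multiplies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    recipient: str  # address of the registered user (hypothetical examples below)
    body: str       # human-readable appointment text


def fan_out(subscribers_by_region, region, appointment_text):
    """Build one chat message per user registered for the given region."""
    return [
        Notification(recipient=jid, body=appointment_text)
        for jid in subscribers_by_region.get(region, [])
    ]


# Hypothetical data: three users in one region, two slots opening at once.
subscribers = {"MA": ["alice@example.org", "bob@example.org", "carol@example.org"]}
slots = ["Slot A, 10:00", "Slot B, 10:15"]

burst = [n for slot in slots for n in fan_out(subscribers, "MA", slot)]
print(len(burst))  # 2 slots x 3 users = 6 messages
```

The burst size is the product of new appointments and registered users in the region, which is why a single provider opening many slots can produce thousands of messages at once.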

As those are chat messages, they need to be stored in the respective user’s account, delivered to online devices, and forwarded to the respective push service to wake up a mobile device.

[Figure: yax.im messages from VaxBot]

Because of how the server processes messages, a large message flood from one connection can “capture” the processor for multiple seconds or even longer. This starves the other connections, causing delivery delays and even disconnects.
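The starvation effect can be illustrated with a deliberately simplified model: a single worker that handles stanzas strictly in arrival order. This is an assumption made for illustration, not a description of the real server's internals, and the per-message cost of 1 ms and the burst size of 5000 are made-up numbers.

```python
def queueing_delays(arrivals, cost_ms):
    """Simulate one worker processing messages in arrival order.

    arrivals: list of (arrival_time_ms, connection_name), sorted by time.
    Returns a list of (connection_name, delay_ms), where delay_ms is the time
    from arrival until the message has been fully processed.
    """
    clock = 0
    delays = []
    for arrival_ms, connection in arrivals:
        start = max(clock, arrival_ms)  # wait until the worker is free
        clock = start + cost_ms
        delays.append((connection, clock - arrival_ms))
    return delays


# Hypothetical burst: 5000 bot messages at t=0, one user message at t=100 ms.
arrivals = [(0, "bot")] * 5000 + [(100, "user")]
delays = queueing_delays(arrivals, cost_ms=1)

user_delay_ms = delays[-1][1]
print(user_delay_ms)  # 4901: the user waits almost five seconds behind the burst
```

In this model the unrelated user pays for the whole burst queued ahead of them, which matches the multi-second stalls described above.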

As a preliminary measure, we have implemented a rate-limiting mechanism to reduce the impact of message bursts from VaxBot. We are also working on reducing the number of messages generated by the bot and on increasing server performance so that we can scale up further.
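The post does not say which rate-limiting scheme is used. One common choice is a token bucket, sketched below in Python as an assumption rather than as the mechanism actually deployed on yax.im. The `TokenBucket` class, its parameters and the injected clock are all illustrative.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, then `rate` messages per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if one message may pass now, consuming one token."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Demonstration with a fake clock so the outcome does not depend on timing.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=3, clock=lambda: fake_now[0])

burst_results = [bucket.allow() for _ in range(5)]
print(burst_results)  # [True, True, True, False, False]

fake_now[0] = 1.0          # one second later, two tokens have been refilled
after_wait = bucket.allow()
print(after_wait)          # True
```

Messages that are not allowed would have to be queued or delayed by the caller. Dropping them would lose notifications, so a real deployment would more likely slow the sending connection down than discard stanzas.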

A long-term solution based on XEP-0060: Publish-Subscribe would be interesting and probably much more efficient for the infrastructure, but it would require significant changes to all clients, and vaccination cannot wait.
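For comparison, with XEP-0060 the bot would publish a single item to a node and leave delivery to the pubsub service, instead of sending one chat message per user. The Python sketch below only builds such a publish request with the standard library. The service address, node name, stanza id and payload namespace are hypothetical, and nothing here reflects a planned VaxBot design.

```python
import xml.etree.ElementTree as ET

CLIENT_NS = "jabber:client"
PUBSUB_NS = "http://jabber.org/protocol/pubsub"
PAYLOAD_NS = "urn:example:appointment"  # hypothetical payload namespace


def build_publish(service, node, text):
    """Build an XEP-0060 publish request carrying one appointment item."""
    iq = ET.Element(f"{{{CLIENT_NS}}}iq",
                    {"type": "set", "to": service, "id": "publish-1"})
    pubsub = ET.SubElement(iq, f"{{{PUBSUB_NS}}}pubsub")
    publish = ET.SubElement(pubsub, f"{{{PUBSUB_NS}}}publish", {"node": node})
    item = ET.SubElement(publish, f"{{{PUBSUB_NS}}}item")
    payload = ET.SubElement(item, f"{{{PAYLOAD_NS}}}appointment")
    payload.text = text
    return iq


# Hypothetical service and node: one node per region.
stanza = build_publish("pubsub.example.org", "appointments/MA", "Slot A, 10:00")
print(ET.tostring(stanza, encoding="unicode"))
```

The bot then sends one stanza per appointment regardless of how many users are subscribed to the region's node. The service still has to notify every subscriber, so the saving is mainly on the bot's connection and on per-message handling, and clients would need to understand pubsub notifications.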