On the benefits of slot scheduling
Ensuring smooth operation for thousands of users with regular workloads is hard. Slot scheduling can make your workload and throughput more predictable and keep your users happy. It is a good remedy for load spikes on your background jobs cluster.
The traditional way of performing recurring tasks for multiple users in a system is to schedule them using cron or similar. The issue, however, is that scheduling “all the work for all the users” into one time slot creates a large load spike. This is bad for your shared services - the DB, Redis, APIs you call into - but also bad for your wallet, as you have to scale your system up sharply whenever those jobs run. In our company we need to run sync jobs to external APIs for every user - tens of thousands of users - and we want those jobs to run with predictable frequency. We previously selected “least-recently-synced” workloads first, but this came with load spikes and all the accompanying issues.
Slot scheduling tackles the problem differently. It takes the workload owner ID (like the user ID) modulo a fixed number of slots, which places every user into exactly one slot, and only one slot's worth of work is due at a time. This lets us run an approximately even number of tasks at any given point in time, which gives near-flat throughput. It is better for the shared services, better for the queue - as jobs enter the job queue in smaller batches and do not stick around for long - and better for autoscaling, as much less of it is needed.
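To make the bucketing concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not our production code: the slot count of 60, the one-minute slot length, and the names `slot_for`, `current_slot` and `owners_due` are all hypothetical, and it assumes owner IDs are integers that are spread roughly evenly.

```python
from collections import Counter

SLOT_COUNT = 60    # assumption: 60 slots in one full cycle
SLOT_SECONDS = 60  # assumption: each slot lasts one minute


def slot_for(owner_id: int, slot_count: int = SLOT_COUNT) -> int:
    """Bucket an owner (for example a user ID) into a fixed slot via modulo."""
    return owner_id % slot_count


def current_slot(epoch_seconds: int,
                 slot_seconds: int = SLOT_SECONDS,
                 slot_count: int = SLOT_COUNT) -> int:
    """Return the slot that is active at the given Unix timestamp."""
    return (epoch_seconds // slot_seconds) % slot_count


def owners_due(owner_ids, epoch_seconds: int) -> list:
    """Select only the owners whose slot is active right now."""
    active = current_slot(epoch_seconds)
    return [oid for oid in owner_ids if slot_for(oid) == active]


if __name__ == "__main__":
    owner_ids = range(1, 601)  # hypothetical: 600 sequential user IDs
    # How many owners land in each slot: sequential IDs spread evenly.
    per_slot = Counter(slot_for(oid) for oid in owner_ids)
    print(min(per_slot.values()), max(per_slot.values()))
    # Owners to enqueue at a given tick (timestamp 125 falls into slot 2).
    print(owners_due(owner_ids, 125))
```

A scheduler tick (from cron, say) would call something like `owners_due` once per slot and enqueue one job per returned owner, so each tick enqueues roughly `total_owners / SLOT_COUNT` jobs instead of all of them at once. If IDs are not sequential integers (UUIDs, for instance), you would hash them to an integer first; that step is left out of this sketch.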
We are also going to discuss a useful strategy for designing systems like this: discrete simulations, which, when applied properly, can save hours of testing.