FantaSanremo system design
Last week was that week in Italy. No, not the week when the Olympics ended, but the week when the Sanremo contest went on air.
For those who don’t know, the Sanremo Festival is a famous Italian music competition where 30 artists perform new (and unknown) songs. They are then evaluated based on different criteria and by different people and audiences.
At the end of the 5 days, we have a winner. The winner will represent Italy at Eurovision, which I’m sure many more people know and follow.
In general, I don’t care about Sanremo, but in the last few years, there has been a new game that has gained popularity and that I played as well: FantaSanremo.
FantaSanremo is like fantasy football, but instead of buying soccer players with virtual money, you buy artists. They score points (positive or negative) based on different actions over the entire contest duration, and at the end, you have a winner there as well, which is likely different from the real winner.
The basic rules are:
- You start with a limited amount of fake money (called “Baudi”)
- You can buy artists with Baudi and thus create one or more teams
- Every team can join one or more leagues
- Leagues can be public or private
- You usually play in leagues with your friends, so you have your own ranking at the end (and you can be a winner too!)
As an engineer, this sparked my curiosity, and I decided to do a thought experiment and design what a FantaSanremo system would look like if I had to create it from scratch, but following how it behaves now.
One important note: this isn’t a fully working solution, and there will probably be holes. Also, given that it’s a cloud-agnostic solution, I’m missing some shortcuts that AWS/GCP might offer.
Last note: I’m not part of the team/company developing and supporting FantaSanremo; this is just speculation, and every opinion is my own.
You can find the diagrams and the data model in the repository.
Let’s start!
Scale
Numbers reported in the news confirm that the game was very popular again this year.
It had around 3M active users, over 4M teams, with a peak of 2M teams created in the last few days before the contest. More than 600k leagues were created to challenge friends and family.
The activity pattern is highly bursty: team creation and modifications happen in the days leading up to the festival, while reads spike during the five live nights when users check leaderboards. This read-heavy, concentrated traffic pattern will drive many of the decisions below.
The Database
As with every system design, I like to start bottom-up, so I’ll dive into the database and the data model first.
FantaSanremo allows you to create and modify your teams and leagues within specific time windows. That’s done for different reasons:
- People can’t change the artists’ roles during the night performance.
- It allows score recalculations the day after, without write conflicts.
- It allows better resource management and, thus, better money management.
Given this, and also based on how many times I checked the leaderboards rather than creating/updating the team, I’m assuming the system is a read-heavy one, and this will have an impact on the database choice.
Besides that, I can see many relationships between entities. Users have one or more teams, which can join one or more leagues. Every team has many artists (up to 7). Every artist can earn many (predefined) points. Every user can earn many achievements.
Given the above-mentioned functional requirements, I exclude NoSQL options. Imagine a global league document containing all the teams; it would be huge.
Let’s go with the classic, then, and let’s use PostgreSQL as our relational database.
Our setup will be a primary with two or more replicas. The primary will accept all writes, while the replicas will serve all the reads. Reading from replicas will offload traffic from the primary, enabling better scaling.
This might have a small impact on users’ ability to see their writes straight away because of the eventual consistency it introduces, but I acknowledge it as a trade-off.
One last thing: I’ll use the semi-synchronous replication pattern. So, whenever a write comes in, before returning to the client, we wait for at least one replica to acknowledge the write synchronously, while others will replicate the data in the background. This is a good trade-off between the added latency in the writing path and the data safety we gain.
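As a sketch, this semi-synchronous setup maps to PostgreSQL’s quorum-based synchronous replication; the standby names below are placeholders:

```ini
# postgresql.conf on the primary -- a minimal sketch.
# "replica1" and "replica2" are placeholder standby names.
# ANY 1 (...) means: before committing, wait for at least one
# of the listed standbys to confirm the write; the others
# catch up asynchronously in the background.
synchronous_standby_names = 'ANY 1 (replica1, replica2)'
synchronous_commit = on
```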
The Data Model
I sketched the schema I would use and made it as normalized as possible.
All entities have:
- An id defined as BIGSERIAL PRIMARY KEY, to be used for the joins
- An external_id defined as UUID DEFAULT uuid_generate_v4() NOT NULL UNIQUE, to be used outside of the system
- created_at and updated_at, because why not? They come almost for free in PostgreSQL: created_at uses a default value, while updated_at is kept in sync via a database trigger. This way, we don’t have to manage them in our application logic.
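As an illustration, here is how these shared columns could be declared on a teams table, with a trigger keeping updated_at in sync. The trigger function name and any column beyond the shared ones are my own guesses:

```sql
-- Needed for uuid_generate_v4().
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Shared columns, sketched on a hypothetical teams table.
CREATE TABLE teams (
    id            BIGSERIAL PRIMARY KEY,
    external_id   UUID NOT NULL UNIQUE DEFAULT uuid_generate_v4(),
    name          TEXT NOT NULL,
    score         INTEGER NOT NULL DEFAULT 0,
    score_history JSONB NOT NULL DEFAULT '{}',
    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Trigger function that refreshes updated_at on every UPDATE.
CREATE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
    NEW.updated_at = now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER teams_touch_updated_at
    BEFORE UPDATE ON teams
    FOR EACH ROW EXECUTE FUNCTION touch_updated_at();
```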
There are a couple of things that need an explanation.
The teams table has both the score and the score_history fields. The different leaderboards would use the first one to show the correct ranking. The second one, instead, is defined as a JSONB field. This was a shortcut I took to show users what an artist did on a specific night without having to create other tables. The official app allows it, so we have to as well.
An example for day 1 could be:
```json
{
  "day1": {
    "total_points": "<score_for_the_day>",
    "artists": [
      {
        "id": "<artist_id>",
        "role_for_the_day": "<artist_role>",
        "points": [
          {
            "id": "<point_id>",
            "value": "<point_value>",
            "applies_to_reserve": true
          }
        ]
      }
    ]
  }
}
```
The other thing I want to tell you is about the messages table. This doesn’t cover any part of the game, but it will be helpful as we want to support achievements and thus events. More on this later.
The Services
All our systems are behind an API gateway that also handles authentication and authorization.
The most important service is the customer-facing one. This will handle the user requests, such as reads and writes. Writes are directed to the primary, while reads are always performed from the replicas. As simple as that.
Of course, the service will be replicated with at least 2-3 instances to better scale and avoid a complete failure if one instance dies unexpectedly.
It’s important to note that every write to the system is performed in a single transaction that updates both the relevant entity table AND the messages table, recording the event that triggered the write.
This prevents the double-write problem (i.e., writing to the DB and another system, such as a queue, without transactions) and enables the classic outbox pattern, which leads us to introduce the Message Relay Service.
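Such a write could look like the following sketch; the table and column names, and the payload shape, are assumptions:

```sql
-- Outbox pattern sketch: the entity write and the event write
-- share one transaction, so they commit or roll back together.
BEGIN;

UPDATE teams
   SET name = 'My New Team Name'
 WHERE external_id = '...';  -- the team's external_id

INSERT INTO messages (event_type, payload, status)
VALUES ('team_updated',
        '{"team_external_id": "...", "field": "name"}'::jsonb,
        'Pending');

COMMIT;
```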
The service’s role is simple: it reads Pending messages, pushes them to our queue, and marks them as Processed so they won’t be read again.
We can have more instances for this service as well. The trick is to lock the rows read in a single run so that other instances won’t see those messages either. PostgreSQL provides the FOR UPDATE SKIP LOCKED statement to do precisely that.
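A single relay run could then be sketched as follows; again, the table and column names are assumptions:

```sql
-- Claim a batch of pending messages. Rows locked here are
-- skipped by concurrent relay instances thanks to SKIP LOCKED,
-- so each message is claimed by exactly one instance.
BEGIN;

SELECT id, event_type, payload
  FROM messages
 WHERE status = 'Pending'
 ORDER BY id
 LIMIT 100
   FOR UPDATE SKIP LOCKED;

-- After pushing the batch to the queue, still inside the
-- same transaction ($1 is the array of ids just pushed):
UPDATE messages
   SET status = 'Processed'
 WHERE id = ANY($1);

COMMIT;
```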
The queue will be read by all interested services, among which we have:
- The achievements service updates users’ progress towards the achievements available in the game.
- The notification service notifies users when something notable happens, such as earning an achievement.
- The FantaLive service pushes the points earned by the performing artist to a dedicated page in real time.
Both the notification service and the FantaLive service will leverage WebSockets to push such events.
Of particular interest is the achievements flow. Whenever a user performs an action, it is written to the database. The relay reads the new events and pushes them to the queue. The achievements service consumes them, calculates that user’s progress towards the relevant achievements, and updates the DB, writing the new progress and, if the user completed one, a notification event. At that point, the relay reads this latest event, but this time the notification service picks it up and pushes it to the user.
Last, but not least, we have:
- A manually triggered job that updates all the scores after the night ends. This can’t be automated because you don’t know in advance when the show will end (yes, they are very long nights).
- An admin service (internal) for admins to update points or do other administrative tasks. This is important for both the FantaLive feature and the scoring system. Without an easy way to add points, it would be hell for maintainers.
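The nightly recalculation job could boil down to a query along these lines; the team_artists and artist_points tables are my assumption:

```sql
-- Nightly recalculation sketch: roll the points earned by each
-- team's artists during the night into the team's total score.
UPDATE teams t
   SET score = t.score + night.points
  FROM (
        SELECT ta.team_id, SUM(ap.value) AS points
          FROM team_artists ta
          JOIN artist_points ap ON ap.artist_id = ta.artist_id
         WHERE ap.night = 3          -- the night being settled
         GROUP BY ta.team_id
       ) AS night
 WHERE t.id = night.team_id;
```

In the same pass, the job would also append the night’s breakdown to each team’s score_history JSONB field.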
The only thing we’re missing to discuss now is what we expose outside.
The mobile app and the mobile website look the same, and since I don’t think the organizers can afford separate teams for each platform, I would bet they’re built with a cross-platform framework like Flutter or React Native/Expo.
Anyway, the user’s handheld device communicates with our system via a set of REST APIs.
The Missing Cache
As you might have noticed, I deliberately left out a cache layer from the architecture diagram. With 3M users hammering leaderboards during live nights, a cache (e.g., Redis) in front of the database replicas would reduce read latency and protect the replicas from traffic spikes.
That said, caching brings its own complexities like cache invalidation (when do you refresh after a score update?) and stale data (users seeing outdated leaderboards). Given that the read replicas already offload traffic from the primary and that scores only change once per night, the replicas alone might handle the load just fine.
I’d treat caching as an optimization, rather than a day-one requirement.
Conclusion
We reached the end of our analysis, and the system resonates with me. You can find the final diagram in the repository.
Engineering a system follows the same set of rules everywhere, so I wouldn’t expect that many differences. Using a cloud provider might make some things easier or different, such as dispatching events and running functions/lambdas in response, but the data flow will likely remain similar.
I had so much fun designing this, and it also brought some learnings:
- Gathering requirements from a running system poses different challenges than those in greenfield projects. I had to adapt to what I saw rather than do what I had in mind.
- Achievements and gamification in general are not easy to get right, and introduce lots of complexities.
- Jobs look so ’90s, but they’re still so widely used!
That said, I’m still amazed by how simple and great this game is, and surprised by how incredible its traction was!