Universal Scalability Law
Have you ever added more vCPUs to a cloud database or service instance expecting a nice performance boost, only to see disappointing results? Think about the last time you scaled 2x or 4x expecting a corresponding improvement. While it seems logical that doubling the vCPU count should double performance, real systems rarely work exactly that way. There’s actually fascinating math behind this common frustration.
Welcome the Universal Scalability Law (USL), developed by Dr. Neil Gunther in 1993. It’s not just another academic formula - it’s a practical tool that explains why systems hit performance walls. Whether we’re vertically scaling a database instance or adding instances to scale a service horizontally, USL helps predict when we’ll hit diminishing returns and why throwing more hardware at the problem won’t necessarily help.
The formula might look intimidating, but it captures a fundamental truth about system performance:
\[C(N) = \frac{N}{1 + \alpha \cdot (N-1) + \beta \cdot N \cdot (N-1)}\]
N represents any resource or capacity we’re trying to scale - this could be hardware resources (like vCPU count or I/O throughput) or software capacity (like concurrent database connections). The function C(N) gives us a relative performance ratio that shows how performance changes as we scale up, always compared to our baseline case.
Let’s look at two common scenarios:
When measuring database throughput, N might represent the number of simultaneous users accessing the system. In this case, C(N) shows us the relative throughput compared to having just one user. For example, if we measure that one user achieves 10 requests per second, and two users together achieve 19 requests per second, we can calculate:
C(1) = 1 (baseline)
C(2) = 19/10 = 1.9 (meaning we get 1.9 times higher throughput with two users)
When testing vertical scaling, N represents the number of vCPUs, and C(N) shows the performance ratio compared to using a single vCPU. For instance, if a single-vCPU configuration processes 10 transactions per second at maximum, and a four-vCPU configuration processes 35 transactions per second, then:
C(1) = 1 (baseline)
C(4) = 35/10 = 3.5 (meaning we get 3.5 times the performance with four vCPUs)
Where things get really interesting is in those two Greek letters - α (alpha) and β (beta) - contention and coherency. We might think of contention as baristas sharing a single espresso machine - more baristas mean longer queues for the shared resource, because each barista has to wait their turn to use the machine. Coherency is trickier - it’s the overhead of keeping everything in sync, like baristas coordinating who makes which drinks and updating the shared order board. As we add more baristas, these effects compound until they start eating into our performance gains.
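To see how quickly these two terms bite, here is a worked example with purely illustrative parameter values (not fitted to any real system) - say α = 0.05 and β = 0.001 - plugged into the formula for N = 4:
\[C(4) = \frac{4}{1 + 0.05 \cdot 3 + 0.001 \cdot 4 \cdot 3} = \frac{4}{1.162} \approx 3.44\]
Even with just 5% contention and 0.1% coherency overhead, four units of capacity deliver only about 3.4x the baseline - and because the β term grows roughly with the square of N, the gap widens much faster as N keeps growing.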
This is where USL really pays off - it lets us predict scaling behavior before spending money on hardware. Run some benchmarks with different vCPU counts, estimate the parameters with a bit of math (for example, Python’s curve_fit from scipy.optimize), plot the results, and USL will tell you whether upgrading to that 32-core instance will actually help or just waste money.
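Here’s a minimal sketch of what that fitting step can look like in Python - the throughput numbers below are placeholders, not the benchmark results from this post:

```python
import numpy as np
from scipy.optimize import curve_fit

# Universal Scalability Law: relative capacity as a function of load N
def usl(n, alpha, beta):
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Measured concurrency levels and throughput (placeholder numbers)
clients = np.array([1, 2, 4, 8, 16, 32, 64])
tps = np.array([100, 190, 350, 590, 830, 970, 940])

# Normalize throughput so that C(1) = 1, as USL expects
relative = tps / tps[0]

# Fit alpha (contention) and beta (coherency); both constrained to be non-negative
(alpha, beta), _ = curve_fit(usl, clients, relative,
                             p0=[0.01, 0.001], bounds=(0, 1))

print(f"alpha (contention): {alpha:.4f}")
print(f"beta  (coherency):  {beta:.5f}")

# Extrapolate: what does the model predict at 128 clients?
print(f"predicted C(128): {usl(128, alpha, beta):.1f}")
```

If the fit reports a noticeable beta, the curve will eventually bend downward - that’s the retrograde region where adding more load actually reduces throughput.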
Let’s have a look at how USL works in practice using PostgreSQL 16 on AWS. I ran some benchmarks using different AWS RDS for PostgreSQL instance sizes to see how vertical scaling really performs. I wanted to simulate a real-world scenario, so I used TPC-B - a classic banking benchmark that has long been an industry standard. Think of it as simulating a bank with multiple tellers processing transactions simultaneously. Each transaction updates account balances, records history, and updates branch statistics - pretty much what you’d expect from a write-heavy financial system.
I picked three RDS instances running on AWS’s Graviton2 processors: db.m6g.large, db.m6g.xlarge, and db.m6g.2xlarge. To keep things clean, I ran the load generator (pgbench) on a separate EC2 instance in the same availability zone - no point in letting extra network latency mess with the results. The database was initialized with 5 million accounts, 500 tellers, and 50 branches (-s 50). Why these numbers? Simple - I wanted the dataset too big to fit in memory. Real databases don’t always have the luxury of keeping everything in RAM. I then hammered each instance with an increasing number of concurrent connections, starting from 1 and going all the way up to 128. This way, we can see how each instance size handles growing concurrency pressure. Each test was run 4 times and the results were averaged to get more consistent data.
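The exact pgbench invocations aren’t reproduced here, but a driver along these lines captures the shape of the experiment - a minimal sketch assuming pgbench is installed on the load generator and with placeholder connection details and test duration:

```python
import re
import statistics
import subprocess

# Client counts to sweep through (1 .. 128), matching the benchmark description
CLIENT_COUNTS = [1, 2, 4, 8, 16, 32, 64, 128]
RUNS_PER_POINT = 4          # each test repeated 4 times and averaged
DURATION_SECONDS = 300      # illustrative duration; the post doesn't state the exact value

def run_pgbench(clients: int) -> float:
    """Run one pgbench test and return the reported TPS."""
    # -c: client connections, -j: worker threads, -T: duration in seconds.
    # Host, user and database name are placeholders to be replaced.
    cmd = [
        "pgbench",
        "-h", "my-rds-endpoint.example.com",
        "-U", "benchmark",
        "-c", str(clients),
        "-j", str(min(clients, 16)),
        "-T", str(DURATION_SECONDS),
        "tpcb",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # pgbench prints a summary line like: "tps = 7312.45 (without initial connection time)"
    return float(re.search(r"tps = ([\d.]+)", out).group(1))

for clients in CLIENT_COUNTS:
    tps_samples = [run_pgbench(clients) for _ in range(RUNS_PER_POINT)]
    print(f"{clients:>3} clients: {statistics.mean(tps_samples):.1f} TPS (avg of {RUNS_PER_POINT})")
```

The averaged TPS per client count is exactly the data the USL fit from the previous section needs.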
The upper chart shows raw throughput (TPS) for the three RDS instances as I increase client connections from 1 to 128. Dotted lines represent actual measurements, while solid lines show the USL approximation that closely follows the real-world data. The bottom chart shows performance ratios relative to the large instance.
The benchmark results tell a story about what happens when we throw more hardware at PostgreSQL. Our biggest db.m6g.2xlarge instance peaked at around 7,300 transactions per second before hitting a wall, while the much smaller db.m6g.large peaked at 2,400 transactions per second. At first glance, that might seem reasonable - bigger instance, better performance. But the second graph compares the performance ratios between instance sizes, and this is where it gets interesting. In a perfect world, the xlarge (2x the price of the large) should give us 2x the performance, and the 2xlarge (4x the price) should give us 4x. Reality? The xlarge actually delivered more than 2x - a 2.2x improvement over the large instance (probably thanks to the bigger cache). The 2xlarge, on the other hand, hovers at only about 3.5x - meaning that last doubling of resources gave us diminishing returns.
Why does this happen? Again, two key factors: contention and coherency. They represent the real-world friction that stops us from getting that perfect linear scaling. Contention reflects queries waiting for table locks or transactions competing for storage bandwidth. Coherency captures the overhead of keeping everything synchronized - what it takes to maintain ACID properties across multiple database connections, or to keep caches coherent across parallel transactions. We can see this clearly in how the curves flatten out and later actually start dropping at higher connection counts. The xlarge instance, in particular, shows this pattern beautifully.
Instead of a conclusion, here’s a practical guide to using USL: start by running load tests on the current system with increasing concurrency (1, 2, 4, 8, 16, 32, 64 concurrent operations) and measure throughput at each level. Normalize these results with the single-operation throughput as baseline 1.0, then fit the points to the USL formula. The resulting alpha and beta values reveal your next steps - if alpha (contention) dominates and the curve flattens early, the system is hitting resource contention and hardware upgrades might help. But if beta (coherency) dominates and performance actually drops at higher concurrency, the system is hitting fundamental coordination overhead that more hardware probably won’t fix - time to look at architectural changes instead. That’s the beauty of USL - it tells you whether to spend money on infrastructure or invest time in optimization before you make the decision.
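The fitted parameters also point at the sweet spot directly: setting the derivative of C(N) to zero gives the concurrency level where USL predicts throughput peaks, N* = sqrt((1 − α)/β). A couple of lines of Python make this concrete (the alpha and beta below are illustrative stand-ins for values you’d get from the earlier fit):

```python
import math

def usl(n, alpha, beta):
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Fitted parameters from the curve_fit step (illustrative values, not real measurements)
alpha, beta = 0.03, 0.0004

# USL peaks where the derivative of C(N) is zero: N* = sqrt((1 - alpha) / beta)
n_peak = math.sqrt((1 - alpha) / beta)
print(f"Throughput is predicted to peak around N = {n_peak:.0f} "
      f"with relative capacity C = {usl(n_peak, alpha, beta):.1f}")
```

Past that point the beta term dominates, and adding more concurrency (or paying for more hardware to host it) starts working against you.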