Have you ever heard someone throw around the term “service level indicator” and wondered what on earth they were talking about? As someone who’s worked in IT for over a decade, I’ve had to get familiar with these concepts, but I know it can sound like tech-jargon gibberish to many folks.
That’s why I wanted to write this post – to explain in simple terms what a service level indicator is, why it matters, and how it’s used. My goal is to break it down so that anyone can understand these ideas. Whether you’re an IT pro looking to brush up or a civilian trying to decipher tech-speak, this post is for you!
What Exactly is a Service Level Indicator?
A service level indicator (SLI) is a specific metric used to measure and track the performance of a service. The “service” could be anything from an internal software system to an online customer-facing platform. Common examples of services include:
- A company website
- An online payment processing system
- An internal HR application
- Cloud storage
- Email platforms
The SLI provides a standard way to evaluate how well that service is working. It tracks quantifiable metrics like:
- Uptime/downtime
- Response time
- Transaction volume
- Error rate
- Throughput
By monitoring the SLI, you can instantly check the “health status” of a service. An SLI acts like a doctor’s chart, giving you vital signs to know if everything is OK.
Some typical SLIs include:
- Availability – % of time the service is accessible and working
- Latency – how long it takes the service to respond to a request
- Error rate – % of requests resulting in errors
- Throughput – requests processed per second
So in plain English, the SLI measures performance factors like speed, reliability, and capacity. It provides hard data to understand if a service is operating smoothly or needs attention.
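To make those factors concrete, here is a minimal sketch of computing a few SLIs from a batch of request records. The record format, field names, and numbers are invented for illustration, and availability here uses the request-success definition rather than wall-clock uptime:

```python
# Hypothetical request records collected over a 2-second window.
requests = [
    {"latency_ms": 42,  "ok": True},
    {"latency_ms": 180, "ok": True},
    {"latency_ms": 95,  "ok": False},
    {"latency_ms": 61,  "ok": True},
]
window_seconds = 2

availability_pct = 100 * sum(r["ok"] for r in requests) / len(requests)
error_rate_pct = 100 - availability_pct
avg_latency_ms = sum(r["latency_ms"] for r in requests) / len(requests)
throughput_rps = len(requests) / window_seconds

print(f"availability: {availability_pct:.1f}%")   # 75.0%
print(f"error rate:   {error_rate_pct:.1f}%")     # 25.0%
print(f"avg latency:  {avg_latency_ms:.1f} ms")   # 94.5 ms
print(f"throughput:   {throughput_rps:.1f} rps")  # 2.0 rps
```

Real monitoring systems compute these continuously over sliding windows, but the arithmetic is exactly this simple.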
Why Are SLIs Important?
SLIs give tangible insight into the user experience. The metrics allow you to track quality of service and identify problems.
For an online platform, latency directly impacts how fast pages load for customers. Error rate can reveal frequent crashes or bugs affecting users. Measuring these SLIs ensures you understand what your customers are truly experiencing.
Internally, SLIs help IT teams stay proactive. By monitoring SLIs, you can catch issues before they become outages. For instance, a sudden spike in latency may indicate an impending bottleneck. Or an uptick in errors could foreshadow a component failure down the road.
Spotting these early warning signs allows teams to get ahead of problems before they cascade into full-on crises. Having clear SLIs provides the necessary telemetry to stay on top of systems.
They also give a standardized language around service performance. Different teams can discuss specifics like “we need sub-100ms latency on the payment API” rather than vague notions of “faster” or “more reliable”. Quantifiable SLIs create clarity and alignment.
How Do SLIs Get Used?
SLIs provide the foundation for service level objectives (SLOs) and service level agreements (SLAs). An SLI defines the metric, while SLOs and SLAs determine expected values and minimum standards for those metrics.
Service Level Objectives
A service level objective (SLO) sets specific targets for an SLI. It establishes measurable performance goals like:
- 99.95% uptime
- Average latency below 100ms
- Peak throughput of 500 requests/sec
SLOs represent the performance level your service is designed to deliver. These targets should be based on user expectations and operational capabilities.
For instance, an internal HR portal will have very different uptime needs than an e-commerce site. The SLOs should account for these use cases. An overly strict SLO could impose unreasonable costs, while too lax an SLO could disappoint users.
Setting appropriate SLOs requires collaborating with stakeholders – users, managers, developers, etc. Alignment is crucial for creating meaningful objectives.
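As a sketch of how targets like the ones above might be checked in a periodic reporting job, here is a minimal comparison of measured SLIs against SLO targets. The names, structure, and thresholds are illustrative, not taken from any particular monitoring tool:

```python
# Illustrative SLO targets for a service (values are examples only).
slos = {
    "availability_pct": 99.95,  # at least this much availability
    "avg_latency_ms":   100.0,  # at most this average latency
}

# Measured SLI values for the current reporting period.
slis = {
    "availability_pct": 99.97,
    "avg_latency_ms":   112.0,
}

def slo_violations(slis, slos):
    """Return the names of SLOs that the measured SLIs fail to meet."""
    violations = []
    if slis["availability_pct"] < slos["availability_pct"]:
        violations.append("availability_pct")
    if slis["avg_latency_ms"] > slos["avg_latency_ms"]:
        violations.append("avg_latency_ms")
    return violations

print(slo_violations(slis, slos))  # ['avg_latency_ms']
```

Note that the comparison direction depends on the metric: availability should stay above its target, latency below it.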
Service Level Agreements
A service level agreement (SLA) formally defines the expected level of service between a provider and consumer. SLAs will specify consequences for failing to meet SLOs, such as penalties or breaches of contract.
Whereas SLOs are internal targets, SLAs make guarantees to customers. For example, a cloud storage provider may offer 99.9% uptime in their SLA. Falling short could require payouts to affected users.
Not all services warrant SLAs. They tend to appear in mission-critical systems or paid services with outside users. SLAs provide accountability and assure customers that performance will meet certain standards.
Best Practices for SLIs
Now that you understand the purpose of SLIs, how do you go about choosing good indicators for your services? Here are a few best practices:
Focus on user experience – The best SLIs directly measure what matters most to your users. Is it speed, reliability, capacity? Design SLIs to gauge their real-world experience.
Pick simple metrics – Complex aggregations can obscure changes in performance. Opt for straightforward metrics like error rate and latency.
Measure distribution – Don’t just look at average performance. Use percentiles to expose outliers and reveal the full distribution of values.
Monitor client-side – Catch issues users experience that server-side metrics miss. For web apps, measure latency from the browser.
Standardize definitions – Adopt consistent terminology, calculation methods, etc. Standardization avoids confusion down the road.
Avoid vanity metrics – Don’t get distracted tracking metrics that sound impressive but provide little value. Stick to meaningful performance indicators.
Review regularly – Reevaluate your SLIs to ensure they still map to user needs as systems evolve. Retire outdated indicators that no longer fit.
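To illustrate the “measure distribution” point above: an average can look merely sluggish while the tail is terrible. A sketch with made-up latency samples and a simple nearest-rank percentile:

```python
# Made-up latency samples (ms): mostly fast, with a slow tail.
samples = sorted([20, 22, 25, 24, 21, 23, 26, 22, 480, 510])

def percentile(sorted_values, pct):
    """Nearest-rank percentile of a pre-sorted list."""
    k = max(0, int(round(pct / 100 * len(sorted_values))) - 1)
    return sorted_values[k]

avg = sum(samples) / len(samples)
p50 = percentile(samples, 50)
p99 = percentile(samples, 99)

print(f"average: {avg:.1f} ms")  # 117.3 ms -- looks merely slow
print(f"p50:     {p50} ms")      # 23 ms    -- the typical user is fine
print(f"p99:     {p99} ms")      # 510 ms   -- tail users are suffering
```

The average alone would hide the fact that one user in ten waits half a second; the percentiles expose it immediately.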
SLI Gotchas to Avoid
While SLIs are powerful tools, they can also lead you astray if misused. Steer clear of these common SLI mishaps:
- Setting arbitrary targets without operational knowledge
- Defining vague, ambiguous metrics open to interpretation
- Measuring too many indicators that become information overload
- Optimizing strictly for SLIs and losing sight of big picture goals
- Having SLOs misaligned with business priorities and user expectations
Now You’re an SLI Expert!
That covers the key basics about service level indicators and why they matter. SLIs provide vital telemetry into system and service performance. When designed thoughtfully, they can unlock immense value.
But like any powerful tool, SLIs can do damage if used carelessly. Define indicators that truly reflect user experience. Rigorous testing will reveal if metrics match reality.
Hopefully this overview has demystified SLIs for you. The next time someone mentions service level indicators, you can smile knowingly and dazzle them with your newfound knowledge. Now you just have to convince them to explain SLAs and SLOs!
Chapter 4 – Service Level Objectives
It’s impossible to manage a service correctly, let alone well, without understanding which behaviors really matter for that service and how to measure and evaluate those behaviors. To this end, we would like to define and deliver a given level of service to our users, whether they use an internal API or a public product.
We use intuition, experience, and an understanding of what users want to define service level indicators (SLIs), objectives (SLOs), and agreements (SLAs). These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.
This chapter describes the framework we use to wrestle with the problems of metric modeling, metric selection, and metric analysis. Much of this explanation would be quite abstract without an example, so we’ll use the Shakespeare service outlined in Shakespeare: A Sample Service to illustrate our main points.
Many readers are likely familiar with the concept of an SLA, but the terms SLI and SLO are also worth careful definition, because in common use, the term SLA is overloaded and has taken on a number of meanings depending on context. We prefer to separate those meanings for clarity.
An SLI is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service that is provided.
Most services consider request latency (how long it takes to return a response to a request) as a key SLI. Other common SLIs include the error rate, often expressed as a fraction of all requests received, and system throughput, typically measured in requests per second. The measurements are often aggregated: i.e., raw data is collected over a measurement window and then turned into a rate, average, or percentile.
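That aggregation step can be sketched as follows; the counter names, window length, and values are invented for illustration:

```python
# Raw counters collected over one measurement window (illustrative numbers).
window = {"seconds": 60, "requests": 1200, "errors": 6, "latency_sum_ms": 54000}

qps = window["requests"] / window["seconds"]                     # rate
error_fraction = window["errors"] / window["requests"]           # error rate
mean_latency_ms = window["latency_sum_ms"] / window["requests"]  # average

print(qps, error_fraction, mean_latency_ms)  # 20.0 0.005 45.0
```

Percentiles require keeping the raw samples (or a histogram) rather than just sums, which is why latency SLIs are often exported as histogram buckets.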
Ideally, the SLI directly measures a service level of interest, but sometimes only a proxy is available because the desired measure may be hard to obtain or interpret. For example, client-side latency is often the more user-relevant metric, but it might only be possible to measure latency at the server.
Another kind of SLI important to SREs is availability, or the fraction of the time that a service is usable. It is often defined in terms of the fraction of well-formed requests that succeed, sometimes called yield. (Durability, the likelihood that data will be retained over a long period of time, is equally important for data storage systems.) Although 100% availability is impossible, near-100% availability is often readily achievable, and the industry commonly expresses high-availability values in terms of the number of “nines” in the availability percentage. For example, availabilities of 99% and 99.999% can be referred to as “2 nines” and “5 nines” availability, respectively, and the current published target for Google Compute Engine availability is “three and a half nines”: 99.95% availability.
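The “nines” notation translates directly into an allowed-downtime budget. A quick sketch of that arithmetic:

```python
# Allowed downtime per year implied by an availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525960

def downtime_minutes_per_year(availability_pct):
    """Minutes per year a service may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.95, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} min/year")
```

At 99.95% ("three and a half nines"), the budget works out to roughly 263 minutes of downtime per year, or about 4.4 hours.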
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. For example, we might decide that we will return Shakespeare search results “quickly,” adopting an SLO that our average search request latency should be less than 100 milliseconds.
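That "lower bound ≤ SLI ≤ upper bound" structure can be expressed as a tiny predicate; the function and parameter names here are ours, not from any monitoring system:

```python
def meets_slo(sli_value, lower=None, upper=None):
    """True if the SLI falls within the SLO's bounds (either may be open)."""
    if lower is not None and sli_value < lower:
        return False
    if upper is not None and sli_value > upper:
        return False
    return True

# "Average search latency should be less than 100 ms" is an upper bound only.
print(meets_slo(87.0, upper=100.0))   # True
print(meets_slo(130.0, upper=100.0))  # False
```

Leaving one bound open covers the common one-sided cases: latency has only an upper bound, availability only a lower one.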
Choosing an appropriate SLO is complex. To begin with, you don’t always get to choose its value! For incoming HTTP requests from the outside world to your service, the queries per second (QPS) metric is essentially determined by the desires of your users, and you can’t really set an SLO for that.
On the other hand, you can say that you want the average latency per request to be under 100 milliseconds, and setting such a goal could in turn motivate you to write your frontend with low-latency behaviors of various kinds or to buy certain kinds of low-latency equipment. (100 milliseconds is obviously an arbitrary value, but in general lower latency numbers are good. There are excellent reasons to believe that fast is better than slow, and that user-experienced latency above certain values actually drives people away; see “Speed Matters” [Bru09] for more details.)
Again, this is more subtle than it might at first appear, in that those two SLIs, QPS and latency, might be connected behind the scenes: higher QPS often leads to larger latencies, and it’s common for services to have a performance cliff beyond some load threshold.
Choosing and publishing SLOs to users sets expectations about how a service will perform. This strategy can reduce unfounded complaints to service owners about, for example, the service being slow. Without an explicit SLO, users often develop their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service. This dynamic can lead to both over-reliance on the service, when users incorrectly believe that a service will be more available than it actually is (as happened with Chubby: see The Global Chubby Planned Outage), and under-reliance, when prospective users believe a system is flakier and less reliable than it actually is.
Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region. Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.
The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system. In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
Finally, SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain. The consequences are most easily recognized when they are financial (a rebate or a penalty) but they can take other forms. An easy way to tell the difference between an SLO and an SLA is to ask “what happens if the SLOs aren’t met?”: if there is no explicit consequence, then you are almost certainly looking at an SLO.
SRE doesn’t typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions. SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs. They can also help to define the SLIs: there obviously needs to be an objective way to measure the SLOs in the agreement, or disagreements will arise.
Google Search is an example of an important service that doesn’t have an SLA for the public: we want everyone to use Search as fluidly and efficiently as possible, but we haven’t signed a contract with the whole world. Even so, there are still consequences if Search isn’t available: unavailability results in a hit to our reputation, as well as a drop in advertising revenue. Many other Google services, such as Google for Work, do have explicit SLAs with their users. Whether or not a particular service has an SLA, it’s valuable to define SLIs and SLOs and use them to manage the service.
So much for the theory; now for the experience.
Given that we’ve made the case for why choosing appropriate metrics to measure your service is important, how do you go about identifying what metrics are meaningful to your service or system?