Microsoft published a paper describing the challenges of dealing with gray failures, which it defines as subtle underlying faults that frequently cause major availability breakdowns and performance anomalies. The authors point out that “as cloud systems increase in scale and complexity, gray failure becomes more common”. Indeed, gray failures are the most common type of failure in network infrastructures, and they are the hardest to detect and to trace to a root cause. In this blog we explain how Apstra AOS 2.1 addresses gray failures by reviewing some of the examples and observations from the paper.
Example 1: One-Dimensional View of the Problem
The first example the authors cite is as follows:
“For instance, if a system’s request-handling module is stuck but its heartbeat module is not, then an error-handling module relying on heartbeats will perceive the system as healthy while a client seeking service will perceive it as failed.”
The root cause of this issue is that the described fault detection system is (a) one-dimensional (only one metric, the heartbeat, is used), and (b) concerned only with system health and not service health. Apstra’s AOS considers service expectations to be the most important aspect of closed-loop validation. System health is considered as well, but the ultimate goal of the system is delivering a service, not keeping any particular component healthy for health’s sake. Think about it: do you even care whether the system has a heartbeat as long as it is reliably delivering a service? In fact, redundancy is incorporated into how infrastructure is designed, so that in many cases the service is delivered even if a particular system fails. The most important insight is understanding how systems are composed to deliver a service. In AOS, that composition context is an AOS reference design. It is a first-class citizen, and its artifacts are present in the single source of truth that AOS maintains.
Example 2: Harnessing Spatial Patterns
The second example cited is:
“As another example, if a link is operating at significantly lower bandwidth than usual, a connectivity test will reveal no problems but an application using the link may obtain bad performance.”
This is another example of a one-dimensional fault detection system. In AOS one would define connectivity service expectations to consist of both:
Connectivity test.
Bandwidth above threshold test.
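To make the two-dimensional check concrete, here is a minimal Python sketch. It is not AOS code: the `LinkSample` type, the `link_anomalies` function, and the threshold values are hypothetical names chosen for illustration. It only shows that a link which passes the connectivity test can still fail the bandwidth expectation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinkSample:
    """One telemetry sample for a link (hypothetical shape, not an AOS type)."""
    reachable: bool         # result of the connectivity test
    bandwidth_mbps: float   # measured throughput


def link_anomalies(sample: LinkSample, min_bandwidth_mbps: float) -> List[str]:
    """Return the unmet expectations; an empty list means the link is healthy."""
    anomalies = []
    if not sample.reachable:
        anomalies.append("connectivity")
    if sample.bandwidth_mbps < min_bandwidth_mbps:
        anomalies.append("bandwidth")
    return anomalies


# A degraded link passes the connectivity test but still raises an anomaly.
degraded = LinkSample(reachable=True, bandwidth_mbps=120.0)
print(link_anomalies(degraded, min_bandwidth_mbps=1000.0))  # ['bandwidth']
```

A heartbeat-style detector would look only at `reachable` and report this link as fine.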
Now let’s make this example a bit more interesting. Say the client had multiple connections, and performance would degrade if fewer than two of them were free of the anomalies described above. First, you need to identify where all the connections are. Because the single source of truth is stored in AOS, you can use Graph-Based Live Queries in AOS to identify these links. The word “Live” implies that the number and identities of these connections are tracked as they change. In other words, AOS live queries enable you to harness the changing spatial patterns. You can find examples of these patterns in our blog on Intent-Based Analytics (IBA).
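The idea of tracking a changing set of connections can be sketched in a few lines of Python. This is not the AOS live query API: `HealthyLinkTracker` and the link names are hypothetical, and the caller is assumed to push topology and health changes as they happen.

```python
class HealthyLinkTracker:
    """Tracks a changing set of client links and flags too few healthy ones.

    Hypothetical stand-in for a live query: callers push changes as they occur.
    """

    def __init__(self, min_healthy: int = 2):
        self.min_healthy = min_healthy
        self._healthy = {}  # link id -> True when the link has no anomalies

    def upsert(self, link_id: str, healthy: bool) -> None:
        """Record a new link or a health change on a known link."""
        self._healthy[link_id] = healthy

    def remove(self, link_id: str) -> None:
        """Forget a link that no longer exists in the topology."""
        self._healthy.pop(link_id, None)

    def anomaly(self) -> bool:
        """True when fewer than min_healthy links are currently healthy."""
        return sum(self._healthy.values()) < self.min_healthy


tracker = HealthyLinkTracker(min_healthy=2)
tracker.upsert("leaf1-eth1", True)   # made-up link names
tracker.upsert("leaf2-eth1", True)
print(tracker.anomaly())             # False: two healthy links
tracker.upsert("leaf2-eth1", False)  # one link degrades
print(tracker.anomaly())             # True: only one healthy link left
```

The point of the sketch is that the expectation is evaluated against whatever links exist right now, not against a list fixed at configuration time.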
Example 3: Harnessing Temporal Patterns
To make the above example even more interesting, let’s also say that the bandwidth expectation takes temporal patterns into consideration and, for example, raises an anomaly when bandwidth stays below a certain threshold for more than n seconds. Handling temporal patterns is also part of Apstra’s Intent-Based Analytics.
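A “below threshold for more than n seconds” rule can be sketched as a small stateful check. The class below is an illustration only, not how IBA is implemented; `SustainedLowBandwidth` and its parameters are hypothetical, and samples are assumed to arrive in time order.

```python
from typing import Optional


class SustainedLowBandwidth:
    """Raises an anomaly when bandwidth stays below a threshold for too long."""

    def __init__(self, threshold_mbps: float, min_duration_s: float):
        self.threshold_mbps = threshold_mbps
        self.min_duration_s = min_duration_s
        self._below_since: Optional[float] = None  # start of the current dip

    def observe(self, timestamp_s: float, bandwidth_mbps: float) -> bool:
        """Feed one sample; return True when the anomaly should be raised."""
        if bandwidth_mbps >= self.threshold_mbps:
            self._below_since = None  # recovery resets the timer
            return False
        if self._below_since is None:
            self._below_since = timestamp_s
        return timestamp_s - self._below_since > self.min_duration_s


detector = SustainedLowBandwidth(threshold_mbps=1000.0, min_duration_s=10.0)
print(detector.observe(0.0, 500.0))   # False: dip just started
print(detector.observe(5.0, 500.0))   # False: only 5 seconds so far
print(detector.observe(11.0, 500.0))  # True: below threshold for more than 10 seconds
```

A short dip that recovers before n seconds never raises an anomaly, which keeps transient noise from paging anyone.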
Another example related to temporal patterns:
“Gray failure tends to exhibit an interesting evolution pattern along the temporal dimension: initially, the system experiences minor faults (latent failure) that it tends to suppress. Gradually, the system transits into a degraded mode (gray failure) that is externally visible but which the observer does not see. Eventually, the degradation may reach a point that takes the system down (complete failure), at which point the observer also realizes the problem. A typical example is a memory leak.”
Armed with IBA, an AOS operator can provide early detection by setting temporal analysis to raise an anomaly when a latent failure happens. Programmed actions can then react to it before the system transitions to a degraded mode. These actions range from simple logs and alerts to triggering additional telemetry and debug/analysis chains.
Example 4: Cascading Failures
Sometimes gray failures lead to cascading failures:
“In one instance, a certain data server was experiencing a severe capacity constraint, but a subtle resource-reporting bug caused the storage manager to not detect this gray failure condition. Thus, the storage manager continued routing write requests to this degraded server, causing it to crash and reboot. Of course, the reboot did nothing to fix the underlying problem, so the storage manager once again routed new write requests to it, causing it to crash and reboot again. After a while, a failure detector detected that the data server was repeatedly rebooting, concluded that it was irreparable, and took it out of service. This, along with another subtlety in the replication workflow, reduced the total available storage in the system and put pressure on the remaining healthy servers, causing more servers to degrade and experience the same ultimate fate. Naturally, this eventually led to a catastrophic cascading failure.”
While this problem could also have been avoided by multi-dimensional fault detection, we want to point out another aspect where Apstra helps. In the above example, a faulty failure detection system caused healthy servers to be taken out of service. In AOS, the operator programs a semantic validation test that raises an anomaly when it observes that a failure detector is taking servers out of service at a higher rate than normal, according to some temporal pattern (a fancy word for “too much”). For example, if “more than x servers are taken out within y seconds”, AOS will signal that the root cause is faulty resource reporting.
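The “more than x servers within y seconds” rule is a sliding-window count. The Python sketch below shows one way to express it; `RemovalRateMonitor` is a hypothetical name, not an AOS construct, and it assumes removal events are reported with non-decreasing timestamps.

```python
from collections import deque


class RemovalRateMonitor:
    """Flags a failure detector that removes servers faster than expected."""

    def __init__(self, max_removals: int, window_s: float):
        self.max_removals = max_removals  # x in the text
        self.window_s = window_s          # y in the text
        self._events = deque()            # timestamps of recent removals

    def record_removal(self, timestamp_s: float) -> bool:
        """Record one server removal; return True if the rate is anomalous."""
        self._events.append(timestamp_s)
        # Drop removals that have fallen out of the sliding window.
        while self._events and timestamp_s - self._events[0] >= self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_removals


monitor = RemovalRateMonitor(max_removals=2, window_s=60.0)
print(monitor.record_removal(0.0))    # False: 1 removal in the window
print(monitor.record_removal(10.0))   # False: 2 removals in the window
print(monitor.record_removal(20.0))   # True: 3 removals within 60 seconds
```

In the cascading scenario from the paper, a check like this would fire while the storage manager was still draining servers, well before the remaining healthy ones were overloaded.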
Example 5: Blame Game
Another example that screams “lack of single source of truth that one can programmatically reason about in the presence of change”:
“Occasionally, a storage or network issue makes a VM unable to access its virtual disk, and thus causes the VM to crash. If no failure detector detects the underlying problem with the storage or network, the compute-cluster failure detector may incorrectly attribute the failure to the compute stack in the VM. For this reason, such gray failure is challenging to diagnose and respond to. Indeed, we have encountered cases where teams responsible for different subsystems blame each other for the incidents since no one has clear evidence of the true cause.”
Apstra’s AOS is a single source of truth for the intended state of the infrastructure. It understands how resources (VM, network, storage) are composed to deliver a service, which allows the problem to be properly identified and eliminates the blame game.
The Microsoft paper uses the following figure to describe a model that characterizes gray failure.
The paper argues that “a key feature of gray failure is differential observability: that the system’s failure detectors may not notice problems even when applications are afflicted by them.” It goes on to state that “A natural solution to gray failure is to close the observation gaps between the system and the apps that it services.”
In AOS, this gap is eliminated by the very essence of our approach: a service/system composite is declared operational only when expectations related to both system and service (app) are met. From the point a developer programs those expectations into an AOS reference design, AOS auto-generates them, auto-executes the tests, and raises anomalies when test results do not match expectations. All of this happens in the presence of change.
The authors go on to describe how one can leverage scale to tackle the challenges of gray failures.
“… perhaps with the help of global-scale probing from many devices, we can obtain enough data points to apply statistical inference and thereby identify components with persistent but occasional failures … For instance, we can aggregate observations of VM virtual disk failure events and map them to cluster and network topology information.”
In order to leverage global-scale probing, one must have a system capable of ingesting this large amount of data. At #NFD16 we demonstrated how within AOS this is enabled by a granular pub/sub approach, in addition to mechanisms for horizontal scaling of the data store and processing in AOS.
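The aggregation the authors describe, mapping failure events onto topology, can be sketched as a simple group-and-count. This is an illustration only: `failures_by_component`, the VM ids, and the switch names are hypothetical, and a real system would apply proper statistical inference rather than a raw count.

```python
from collections import Counter


def failures_by_component(failed_vm_ids, vm_to_component):
    """Count failure events per topology component (for example, a ToR switch).

    failed_vm_ids: iterable of VM ids, one entry per observed failure event.
    vm_to_component: dict mapping a VM id to the component it sits behind.
    """
    counts = Counter()
    for vm_id in failed_vm_ids:
        component = vm_to_component.get(vm_id)
        if component is not None:  # skip VMs missing from the topology map
            counts[component] += 1
    return counts


# Made-up data: three of the four failure events sit behind the same switch.
topology = {"vm1": "tor-a", "vm2": "tor-a", "vm3": "tor-b"}
events = ["vm1", "vm2", "vm3", "vm1"]
print(failures_by_component(events, topology).most_common(1))  # [('tor-a', 3)]
```

Individually, each VM failure looks like an isolated incident; grouped by shared component, the common suspect stands out.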
In conclusion, the authors suggest:
“Therefore, we advocate moving from singular failure detection (e.g., with heart-beats) to multi-dimensional health monitoring. This is analogous to making assessments of a human body’s condition: we need to monitor not only his heartbeat, but also other vital signs including temperature and blood pressure”.
Apstra wholeheartedly agrees with what the Microsoft authors are advocating, and our approach achieves what they are after. To see for yourself, check out AOS.