At Apstra, we have declared war against the complexities and inefficiencies that plague data center network operations today, preventing organizations from delivering on their digital transformation goals. We set ourselves a core mission, which is to deliver on the vision of an autonomous Self-Operating Network. We promised our customers that we would continue delivering on functionality that every day gets us closer to our goal.
Today, we take a significant step forward in our commitment to this challenge by announcing Apstra AOS 2.1. It is an exciting release because it includes for the first time the Intent-Based Analytics (IBA) capability that we pioneered, and that my co-founder Sasha Ratkovic first introduced to the industry in a 2017 blog post. AOS 2.1 is generally available this month.
AOS IBA is a feature that our customers have been excited about and eagerly awaiting. It embeds automated big data analytics into the real-time continuous validation capability of AOS, and provides our customers unprecedented control of their network infrastructures to address their digital transformation and IoT goals.
In a nutshell, AOS IBA liberates network operators from today’s status quo, which is to sift through mountains of raw telemetry and stare at network visualizations 24/7 to detect unusual patterns. And unlike traditional big-data analytics, AOS IBA relieves network operators of having to write complex, low-level imperative programs that must be integrated and constantly kept in sync.
Instead, network operators specify, using a simple, dynamic, declarative interface, exactly how they expect their network to operate: beyond mere connectivity, this includes traffic patterns, performance, and tolerance for gray failures. AOS then continuously validates the network operators’ intent, generating anomalies when it detects a deviation. With AOS IBA, network operators can quickly detect and prevent a wide range of service level violations, including security breaches, performance degradations, and traffic imbalances.
And as is a hallmark of the Apstra approach, AOS IBA works across devices from both established vendors and open alternatives. AOS IBA provides turn-key functionality yet is fully extensible; and we’re engaging the community by creating a catalog of open source probes on our community website.
AOS 2.1 with IBA is a big step forward in our commitment to simplify your network design, build, and operations while leaving you free to choose your own hardware. It takes the AOS distributed operating system approach to a new level, unlocking tremendous value by integrating intent, configuration, and continuous validation to eliminate network outages and gray failures, reduce cost, and build a modern, agile, multi-vendor intent-based data center network. In the process, it helps you realize log-scale improvements in the CapEx, OpEx, and capacity of your network infrastructure.
Next time you are deploying a new green patch (a new rack or a new POD), please feel free to contact us; we would love to help! Join an IBA webinar, schedule a demo, download the data sheet, or read the press release.
Microsoft published a paper describing the challenges of dealing with gray failures, defined as subtle underlying faults that frequently cause major availability breakdowns and performance anomalies. The authors point out that “as cloud systems increase in scale and complexity, gray failure becomes more common”. Indeed, gray failures are the most common type of failure in network infrastructures, and they are the hardest to detect and root-cause. In this blog we explain how Apstra AOS 2.1 addresses gray failures by reviewing some of the examples and observations from the paper.
Example 1: One-Dimensional View of the Problem
The first example the authors cite is as follows:
“For instance, if a system’s request-handling module is stuck but its heartbeat module is not, then an error-handling module relying on heartbeats will perceive the system as healthy while a client seeking service will perceive it as failed.”
The root cause of this issue is that the described fault detection system (a) is one-dimensional (it uses only one metric, the heartbeat), and (b) is concerned only with system health, not service health. Apstra’s AOS considers service expectations to be the most important aspect of closed-loop validation. System health is considered as well, but the ultimate goal of the system is delivering service, not keeping any particular component healthy for health’s sake. Think of it: do you even care whether the system has a heartbeat as long as it is reliably delivering a service? In fact, redundancy is built into how infrastructure is designed, so that in many cases the service is delivered even if a particular system fails. The most important insight is understanding how systems are composed to deliver a service. In AOS, that composition context is an AOS reference design. It is a first-class citizen, and its artifacts are present in the single source of truth that AOS maintains.
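To make the contrast concrete, here is a minimal Python sketch of a heartbeat-only detector next to one that also checks a service expectation. It is not AOS code: the `Observation` fields, the function names, and the 0.99 success ratio are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One sample of a system's state (all fields are illustrative)."""
    heartbeat_ok: bool      # system-level signal
    requests_served: int    # service-level signal: requests completed
    requests_received: int  # service-level signal: requests submitted


def heartbeat_only_health(obs: Observation) -> bool:
    # One-dimensional detector: trusts the heartbeat alone.
    return obs.heartbeat_ok


def service_health(obs: Observation, min_success_ratio: float = 0.99) -> bool:
    # Service expectation: requests must actually be handled.
    if obs.requests_received == 0:
        return True  # nothing was asked of the service, so nothing failed
    return obs.requests_served / obs.requests_received >= min_success_ratio


def composite_health(obs: Observation) -> bool:
    # Operational only when both system and service expectations hold.
    return heartbeat_only_health(obs) and service_health(obs)


# The request-handling module is stuck, the heartbeat module is not.
stuck = Observation(heartbeat_ok=True, requests_served=0, requests_received=120)
print(heartbeat_only_health(stuck))  # True
print(composite_health(stuck))       # False
```

The heartbeat-only check reports the stuck system as healthy, while the composite check sees what the client sees.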
Example 2: Harnessing Spatial Patterns
The second example cited is:
“As another example, if a link is operating at significantly lower bandwidth than usual, a connectivity test will reveal no problems but an application using the link may obtain bad performance.”
This is another example of a one-dimensional fault detection system. In AOS one would define connectivity service expectations to consist of both:
Connectivity test.
Bandwidth above threshold test.
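A two-dimensional link expectation of this kind can be sketched in a few lines of Python. The `LinkSample` type, the anomaly labels, and the 8 Gbps threshold are hypothetical and are not taken from AOS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinkSample:
    name: str
    reachable: bool        # result of a connectivity test
    bandwidth_gbps: float  # measured throughput


def link_anomalies(sample: LinkSample, min_bandwidth_gbps: float) -> List[str]:
    """Return the expectations this link violates (empty list means healthy)."""
    anomalies = []
    if not sample.reachable:
        anomalies.append("connectivity")
    if sample.bandwidth_gbps < min_bandwidth_gbps:
        anomalies.append("bandwidth_below_threshold")
    return anomalies


# The link from the paper's example: reachable, but far slower than usual.
degraded = LinkSample("leaf1:eth1", reachable=True, bandwidth_gbps=1.2)
print(link_anomalies(degraded, min_bandwidth_gbps=8.0))
# ['bandwidth_below_threshold']
```

A connectivity test alone would have returned an empty list for this link.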
Now let’s make this example a bit more interesting. Say the client has multiple connections, and performance degrades if fewer than two of them are free of the anomalies described above. First, you need to identify where all the connections are. Given that the single source of truth is stored in AOS, you can leverage Graph-Based Live Queries in AOS to identify these links. The word “Live” implies that the number and identities of these connections are tracked as they change. In other words, AOS live queries enable you to harness changing spatial patterns. You can find examples of these patterns in our blog on Intent-Based Analytics (IBA).
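The idea of a query that is re-evaluated whenever the underlying data changes can be illustrated with a toy in-memory store. This is only a sketch of the concept: `LinkStore`, its methods, and the callback mechanism are invented for this example and say nothing about how AOS live queries are actually implemented or expressed.

```python
from typing import Callable, Dict, List


class LinkStore:
    """Toy stand-in for a single source of truth that notifies on change."""

    def __init__(self) -> None:
        self._links: Dict[str, dict] = {}
        self._watchers: List[Callable[[], None]] = []

    def upsert(self, name: str, client: str, healthy: bool) -> None:
        self._links[name] = {"client": client, "healthy": healthy}
        self._notify()

    def remove(self, name: str) -> None:
        self._links.pop(name, None)
        self._notify()

    def healthy_links(self, client: str) -> List[str]:
        return sorted(
            name for name, attrs in self._links.items()
            if attrs["client"] == client and attrs["healthy"]
        )

    def watch(self, callback: Callable[[], None]) -> None:
        self._watchers.append(callback)

    def _notify(self) -> None:
        for callback in self._watchers:
            callback()


store = LinkStore()
events: List[str] = []


def check_redundancy() -> None:
    # Re-evaluated on every change: the "live" part of the query.
    links = store.healthy_links("client-a")
    if len(links) < 2:
        events.append(f"anomaly: client-a has {len(links)} healthy link(s)")


store.upsert("link-1", "client-a", healthy=True)
store.upsert("link-2", "client-a", healthy=True)
store.watch(check_redundancy)
store.upsert("link-2", "client-a", healthy=False)  # triggers the anomaly
print(events)
# ['anomaly: client-a has 1 healthy link(s)']
```

Because the check runs on every change, links that are added, removed, or degraded later are picked up without anyone re-listing them by hand.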
Example 3: Harnessing Temporal Patterns
To make the above example even more interesting, let’s also say that the bandwidth expectation described above takes temporal patterns into consideration and, for example, raises an anomaly when bandwidth stays below a certain threshold for more than n seconds. Handling temporal patterns is also part of Apstra’s Intent-Based Analytics.
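A “below the threshold for more than n seconds” rule can be sketched as a single pass over timestamped samples. The function name, the sample format, and the numbers below are assumptions made for illustration, not an AOS interface.

```python
from typing import Iterable, List, Tuple


def sustained_violations(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration_s: float,
) -> List[Tuple[float, float]]:
    """Return (start, end) intervals where the value stayed below
    `threshold` for more than `min_duration_s` seconds.

    `samples` is an iterable of (timestamp_seconds, value), ordered by time.
    """
    intervals = []
    start = None  # timestamp at which the current dip began
    last = None   # timestamp of the latest below-threshold sample
    for ts, value in samples:
        if value < threshold:
            if start is None:
                start = ts
            last = ts
        else:
            if start is not None and last - start > min_duration_s:
                intervals.append((start, last))
            start = None
    # Close a dip that is still open when the samples run out.
    if start is not None and last - start > min_duration_s:
        intervals.append((start, last))
    return intervals


bandwidth = [(0, 9.5), (10, 4.0), (20, 9.4), (30, 3.9), (40, 3.8), (50, 3.7), (60, 9.6)]
print(sustained_violations(bandwidth, threshold=8.0, min_duration_s=15))
# [(30, 50)]
```

The brief dip at t=10 is ignored, while the dip from t=30 to t=50 lasts long enough to count.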
Another example related to temporal patterns:
“Gray failure tends to exhibit an interesting evolution pattern along the temporal dimension: initially, the system experiences minor faults (latent failure) that it tends to suppress. Gradually, the system transits into a degraded mode (gray failure) that is externally visible but which the observer does not see. Eventually, the degradation may reach a point that takes the system down (complete failure), at which point the observer also realizes the problem. A typical example is a memory leak.”
Armed with IBA, an AOS operator can provide early detection by setting up temporal analysis that raises an anomaly when a latent failure occurs. Programmed actions can then react to it before the system transitions to a degraded mode. These actions range from simple logs and alerts to triggering additional telemetry and debug/analysis chains.
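For the memory leak case, one simple way to catch the latent phase is to fit a trend to recent samples and project when a limit would be reached. The sketch below uses a plain least-squares slope, which is our choice for illustration rather than anything the paper or AOS prescribes; the function names, action labels, and numbers are hypothetical.

```python
from typing import List, Optional, Sequence


def slope_per_second(times: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares slope of values over time (needs two distinct timestamps)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var


def seconds_until_limit(
    times: Sequence[float], values: Sequence[float], limit: float
) -> Optional[float]:
    """Projected time until `limit` is reached, or None if usage is not growing."""
    slope = slope_per_second(times, values)
    if slope <= 0:
        return None
    return (limit - values[-1]) / slope


def react(eta_s: Optional[float], horizon_s: float) -> List[str]:
    """Pick actions for a latent failure; the action names are placeholders."""
    if eta_s is None or eta_s > horizon_s:
        return []
    return ["log", "alert", "enable_detailed_telemetry"]


times = [0, 60, 120, 180, 240]              # seconds
memory_mb = [1000, 1010, 1020, 1030, 1040]  # slow, steady growth
eta = seconds_until_limit(times, memory_mb, limit=2000)
print(react(eta, horizon_s=2 * 3600))
```

Here the projected exhaustion time falls inside the two-hour horizon, so the actions fire while the system is still serving requests normally.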
Example 4: Cascading Failures
Sometimes gray failures lead to cascading failures:
“In one instance, a certain data server was experiencing a severe capacity constraint, but a subtle resource-reporting bug caused the storage manager to not detect this gray failure condition. Thus, the storage manager continued routing write requests to this degraded server, causing it to crash and reboot. Of course, the reboot did nothing to fix the underlying problem, so the storage manager once again routed new write requests to it, causing it to crash and reboot again. After a while, a failure detector detected that the data server was repeatedly rebooting, concluded that it was irreparable, and took it out of service. This, along with another subtlety in the replication workflow, reduced the total available storage in the system and put pressure on the remaining healthy servers, causing more servers to degrade and experience the same ultimate fate. Naturally, this eventually led to a catastrophic cascading failure.”
While this problem could also have been avoided by multi-dimensional fault detection, we want to point out another aspect where Apstra helps. In the above example, a faulty failure detection system caused healthy servers to be taken out of service. In AOS, the operator programs a semantic validation test that raises an anomaly when it observes that a failure detector is taking servers out of service at a higher rate than normal, according to some temporal pattern (a fancy way of saying “too much”). For example, if more than x servers are taken out within y seconds, AOS will signal that the root cause is faulty resource-reporting.
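The “more than x servers within y seconds” rule is a sliding-window rate check. A minimal Python sketch follows; the class name and parameters are assumptions, and this is not how AOS expresses the test.

```python
from collections import deque
from typing import Deque


class RemovalRateGuard:
    """Flags when more than `max_removals` happen within `window_s` seconds."""

    def __init__(self, max_removals: int, window_s: float) -> None:
        self.max_removals = max_removals
        self.window_s = window_s
        self._events: Deque[float] = deque()

    def record_removal(self, ts: float) -> bool:
        """Record one removal; return True if the rate is now anomalous."""
        self._events.append(ts)
        # Drop removals that have fallen out of the window.
        while self._events and ts - self._events[0] >= self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_removals


guard = RemovalRateGuard(max_removals=3, window_s=60)
results = [guard.record_removal(t) for t in (0, 10, 20, 30, 200)]
print(results)
# [False, False, False, True, False]
```

The fourth removal within a minute trips the guard, which is the point at which an operator would want the failure detector itself questioned rather than yet another server removed.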
Example 5: Blame Game
Another example that screams “lack of single source of truth that one can programmatically reason about in the presence of change”:
“Occasionally, a storage or network issue makes a VM unable to access its virtual disk, and thus causes the VM to crash. If no failure detector detects the underlying problem with the storage or network, the compute-cluster failure detector may incorrectly attribute the failure to the compute stack in the VM. For this reason, such gray failure is challenging to diagnose and respond to. Indeed, we have encountered cases where teams responsible for different subsystems blame each other for the incidents since no one has clear evidence of the true cause.”
Apstra’s AOS is a single source of truth of the intended state of the infrastructure, which understands the way resources (VM, network, storage) are composed to deliver a service, and allows proper identification of the problem, eliminating the blame game.
The Microsoft paper uses the following figure to describe a model that characterizes gray failure.
The paper argues that “a key feature of gray failure is differential observability: that the system’s failure detectors may not notice problems even when applications are afflicted by them.” It goes on to state that “A natural solution to gray failure is to close the observation gaps between the system and the apps that it services.”
In AOS, this gap is eliminated by the very essence of our approach: a service/system composite is declared operational only when expectations related to both system and service (app) are met. From the point a developer programs in those expectations as part of an AOS reference design, AOS auto-generates these expectations, auto-executes the tests, and raises anomalies when test results do not match expectations. All of this in the presence of change.
The authors further go on to describe how one can leverage scale to tackle the challenges of gray failures.
“… perhaps with the help of global-scale probing from many devices, we can obtain enough data points to apply statistical inference and thereby identify components with persistent but occasional failures … For instance, we can aggregate observations of VM virtual disk failure events and map them to cluster and network topology information.”
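The aggregation the authors describe, mapping failure events onto topology, can be sketched as follows. The comparison against a fleet-wide rate is a deliberately crude heuristic of our own, not the statistical inference the paper has in mind, and the VM names, switch names, and `factor` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def failures_by_component(
    failed_vms: Iterable[str], vm_to_component: Dict[str, str]
) -> Counter:
    """Count failure events per topology component (e.g. top-of-rack switch)."""
    return Counter(
        vm_to_component[vm] for vm in failed_vms if vm in vm_to_component
    )


def suspects(
    counts: Counter, vms_per_component: Dict[str, int], factor: float = 2.0
) -> List[str]:
    """Components whose per-VM failure rate exceeds `factor` times the fleet rate."""
    total_failures = sum(counts.values())
    total_vms = sum(vms_per_component.values())
    if total_failures == 0 or total_vms == 0:
        return []
    fleet_rate = total_failures / total_vms
    return sorted(
        component
        for component, vm_count in vms_per_component.items()
        if vm_count and counts[component] / vm_count > factor * fleet_rate
    )


vm_to_tor = {
    "vm1": "tor-a", "vm2": "tor-a",
    "vm3": "tor-b", "vm4": "tor-b",
    "vm5": "tor-c", "vm6": "tor-c",
}
disk_failures = ["vm3", "vm4", "vm3", "vm4", "vm3", "vm1"]

counts = failures_by_component(disk_failures, vm_to_tor)
vms_per_tor = dict(Counter(vm_to_tor.values()))
print(suspects(counts, vms_per_tor))
# ['tor-b']
```

No single event points at the switch, but the aggregate view over the topology does.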
In order to leverage global-scale probing, one must have a system capable of ingesting this large amount of data. At #NFD16 we demonstrated how within AOS this is enabled by a granular pub/sub approach, in addition to mechanisms for horizontal scaling of the data store and processing in AOS.
In conclusion, the authors suggest:
“Therefore, we advocate moving from singular failure detection (e.g., with heart-beats) to multi-dimensional health monitoring. This is analogous to making assessments of a human body’s condition: we need to monitor not only his heartbeat, but also other vital signs including temperature and blood pressure”.
Apstra wholeheartedly agrees with what the Microsoft authors are advocating, and our approach achieves what they are after. To see for yourself, check out AOS.