Pushing the State-of-the-Art with Intent-Based Analytics

Apstra Blog


At Apstra, we have declared war against the complexities and inefficiencies that plague data center network operations today, preventing organizations from delivering on their digital transformation goals. We set ourselves a core mission: to deliver on the vision of an autonomous Self-Operating Network™. We promised our customers that we would keep delivering functionality that every day brings us closer to that goal.

Today, we take a significant step forward in our commitment to this challenge by announcing Apstra AOS™ 2.1. It is an exciting release because it includes for the first time the Intent-Based Analytics™ (IBA) capability that we pioneered, and that my co-founder Sasha Ratkovic first introduced to the industry in a 2017 blog post. AOS 2.1 is generally available this month.

AOS IBA is a feature that our customers have been excited about and eagerly awaiting. It embeds automated big data analytics into the real-time continuous validation capability of AOS, and provides our customers unprecedented control of their network infrastructures to address their digital transformation and IoT goals.

In a nutshell, AOS IBA liberates network operators from today’s status quo: sifting through mountains of raw telemetry and staring at network visualizations 24/7 to detect unusual patterns. And contrary to the traditional big-data analytics status quo, AOS IBA relieves network operators from having to write complex, low-level imperative programs that need to be integrated and constantly kept in sync.

Instead, network operators specify, using a simple, dynamic, declarative interface, exactly how they expect their network to operate — beyond mere connectivity and including traffic patterns, performance, and tolerance for gray failures. AOS then continuously validates the network operators’ intent, simply generating anomalies when it detects a deviation. With AOS IBA, network operators can quickly detect and prevent a wide range of service level violations — including security breaches, performance degradations, and traffic imbalances.

And as is a hallmark of the Apstra approach, AOS IBA works across devices from both established vendors and open alternatives. AOS IBA provides turn-key functionality yet is fully extensible; and we’re engaging the community by creating a catalog of open source probes on our community website.

AOS 2.1 with IBA is a big step forward in our commitment to simplify your network design, build, and operations, while freeing you from your choice of hardware. It takes the AOS distributed operating system approach to a new level, unlocking tremendous value by integrating intent, configuration, and continuous validation to eliminate network outages and gray failures, reduce cost, and build a modern, agile, multi-vendor intent-based data center network. In the process, it helps you realize log-scale improvements in the CapEx, OpEx, and capacity of your network infrastructure.

Next time you are deploying a new green patch (a new rack or a new POD), please feel free to contact us — we would love to help! Join an IBA webinar, schedule a demo, download the data sheet, or read the press release.

Apstra Addresses Microsoft Azure Network Gray Failures


Microsoft published a paper describing challenges in dealing with gray failures, which are defined as subtle underlying faults that frequently cause major availability breakdowns and performance anomalies. The authors point out that “as cloud systems increase in scale and complexity, gray failure becomes more common”. Indeed, gray failures are the most common type of failure in network infrastructures, and they are the hardest to detect and root-cause. In this blog we explain how Apstra AOS™ 2.1 addresses gray failures by reviewing some of the examples and observations from the paper.

Example 1: One-Dimensional View of the Problem

The first example the authors cite is as follows:

“For instance, if a system’s request-handling module is stuck but its heartbeat module is not, then an error-handling module relying on heartbeats will perceive the system as healthy while a client seeking service will perceive it as failed.”

The root cause of this issue is that the described fault detection system is (a) one-dimensional (only one metric used: heartbeat), and (b) concerned only with system health, not service health. Apstra’s AOS™ considers service expectations the most important aspect of closed-loop validation. System health is considered as well, but the ultimate goal of the system is delivering a service, not keeping any particular component healthy for health’s sake. Think of it: do you even care whether the system has a heartbeat as long as it is reliably delivering a service? In fact, redundancy is incorporated into how infrastructure is designed, so that in many cases the service is delivered even if a particular system were to fail. The most important insight is understanding how systems are composed to deliver a service. In AOS, that composition context is an AOS reference design. It is a first-class citizen, and its artifacts are present in the single source of truth that AOS maintains.

Example 2: Harnessing Spatial Patterns

The second example cited is:

“As another example, if a link is operating at significantly lower bandwidth than usual, a connectivity test will reveal no problems but an application using the link may obtain bad performance.”

This is another example of a one-dimensional fault detection system. In AOS, one would define connectivity service expectations to consist of both:

A connectivity test
A bandwidth-above-threshold test.
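To illustrate why combining the two expectations matters, here is a minimal sketch of a multi-dimensional link health check. This is not AOS code; the function and field names (`link_is_healthy`, `connectivity_ok`, `bandwidth_mbps`) are assumptions standing in for whatever telemetry source is actually available.

```python
def link_is_healthy(link, min_mbps=1000):
    """A link passes only if BOTH expectations hold:
    (a) it answers the connectivity test, and
    (b) its measured bandwidth is above the threshold.
    A one-dimensional detector would stop after check (a)."""
    if not link["connectivity_ok"]:          # connectivity test
        return False
    return link["bandwidth_mbps"] >= min_mbps  # bandwidth-above-threshold test

# A link can pass the connectivity test yet still be anomalous:
degraded = {"connectivity_ok": True, "bandwidth_mbps": 40}
healthy = {"connectivity_ok": True, "bandwidth_mbps": 9400}
```

A connectivity-only check would report both links as fine; the two-dimensional check flags the degraded one.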

Now let’s make this example a bit more interesting. Say the client had multiple connections, and performance would degrade if fewer than two connections were free of the anomalies described earlier. First, you need to identify where all the connections are. Given that the single source of truth is stored in AOS, you can leverage Graph-Based Live Queries in AOS™ to identify these links. The word “Live” implies that the number and identities of these connections are tracked as they change. In other words, AOS live queries enable you to harness changing spatial patterns. You can find examples of these patterns in our blog on Intent-Based Analytics (IBA).
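As a toy stand-in for such a query (AOS’s Graph-Based Live Queries are a product feature not reproduced here), the sketch below walks a link table, collects a client’s anomaly-free connections, and flags an anomaly when fewer than two remain. The graph structure and field names are illustrative assumptions.

```python
# Hypothetical link table; in AOS this information would come from the
# graph-based single source of truth, queried live as topology changes.
links = [
    {"src": "client1", "dst": "leaf1", "anomalous": False},
    {"src": "client1", "dst": "leaf2", "anomalous": True},
    {"src": "client2", "dst": "leaf1", "anomalous": False},
    {"src": "client2", "dst": "leaf2", "anomalous": False},
]

def healthy_links(graph, node):
    """All of `node`'s connections that carry no earlier-described anomaly."""
    return [l for l in graph if l["src"] == node and not l["anomalous"]]

def redundancy_anomaly(graph, node, minimum=2):
    """Raise an anomaly when fewer than `minimum` healthy connections remain."""
    return len(healthy_links(graph, node)) < minimum
```

Because the query runs against the current graph, re-evaluating it after links are added or removed automatically tracks the changing spatial pattern.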

Example 3: Harnessing Temporal Patterns

To make the above example even more interesting, let’s also say that the bandwidth expectation described above takes temporal patterns into consideration, for example raising an anomaly when bandwidth stays below a certain threshold for more than n seconds. Handling temporal patterns is also part of Apstra’s Intent-Based Analytics.
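The sustained-threshold logic can be sketched as follows. This is an illustration of the idea, not AOS’s probe API; the class and method names are assumptions.

```python
class SustainedLowBandwidth:
    """Raise an anomaly only when bandwidth stays below `threshold_mbps`
    for more than `n_seconds`, filtering out momentary dips."""

    def __init__(self, threshold_mbps, n_seconds):
        self.threshold = threshold_mbps
        self.n = n_seconds
        self.low_since = None  # timestamp when the current low run began

    def observe(self, t, mbps):
        """Feed one (timestamp, bandwidth) sample; return True on anomaly."""
        if mbps >= self.threshold:
            self.low_since = None      # recovered; reset the clock
            return False
        if self.low_since is None:
            self.low_since = t         # first sub-threshold sample
        return (t - self.low_since) > self.n

# A dip at t=0..3 stays quiet; only the run still low at t=6 fires.
probe = SustainedLowBandwidth(threshold_mbps=100, n_seconds=5)
```

A brief dip resets silently once bandwidth recovers; only a deviation that persists past n seconds becomes an anomaly.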

Another example related to temporal patterns:

“Gray failure tends to exhibit an interesting evolution pattern along the temporal dimension: initially, the system experiences minor faults (latent failure) that it tends to suppress. Gradually, the system transits into a degraded mode (gray failure) that is externally visible but which the observer does not see. Eventually, the degradation may reach a point that takes the system down (complete failure), at which point the observer also realizes the problem. A typical example is a memory leak.”

Armed with IBA, an AOS operator provides early detection simply by setting a temporal analysis to raise an anomaly when a latent failure happens. Programmed actions then react to it before the system transitions into a degraded mode. The operator can program actions that range from simple logs and alerts to triggering additional telemetry and debug/analysis chains.

Example 4: Cascading Failures

Sometimes gray failures lead to cascading failures:

“In one instance, a certain data server was experiencing a severe capacity constraint, but a subtle resource-reporting bug caused the storage manager to not detect this gray failure condition. Thus, the storage manager continued routing write requests to this degraded server, causing it to crash and reboot. Of course, the reboot did nothing to fix the underlying problem, so the storage manager once again routed new write requests to it, causing it to crash and reboot again. After a while, a failure detector detected that the data server was repeatedly rebooting, concluded that it was irreparable, and took it out of service. This, along with another subtlety in the replication workflow, reduced the total available storage in the system and put pressure on the remaining healthy servers, causing more servers to degrade and experience the same ultimate fate. Naturally, this eventually led to a catastrophic cascading failure.”

While this problem could also have been avoided by multi-dimensional fault detection, we want to point out another aspect where Apstra helps. In the above example, a faulty failure detection system caused healthy servers to be taken out of service. In AOS, the operator programs a semantic validation test that raises an anomaly when it observes a failure detector taking servers out of service at a higher rate than normal, according to some temporal pattern (a fancy phrase for “too much, too fast”). For example, if more than x servers are taken out within y seconds, AOS will raise an anomaly pointing at the faulty resource reporting as the likely root cause.
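A rate check of this kind is a sliding-window count. The sketch below is a generic illustration of the “more than x removals within y seconds” rule, with hypothetical names; it is not the actual AOS probe definition.

```python
from collections import deque

class RemovalRateProbe:
    """Flag an anomaly when more than `max_removals` servers are taken
    out of service within any `window_seconds` span."""

    def __init__(self, max_removals, window_seconds):
        self.max = max_removals
        self.window = window_seconds
        self.events = deque()  # timestamps of recent removals

    def server_removed(self, t):
        """Record a removal at time `t`; return True if the rate is anomalous."""
        self.events.append(t)
        # Drop removals that have aged out of the window.
        while self.events and self.events[0] <= t - self.window:
            self.events.popleft()
        return len(self.events) > self.max

# Three removals in 3 seconds trips a "more than 2 in 10s" rule;
# a fourth removal much later does not.
probe = RemovalRateProbe(max_removals=2, window_seconds=10)
```

The anomaly fires while the underlying cause (the resource-reporting bug) is still localized, before the cascade reaches the remaining healthy servers.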

Example 5: Blame Game

Another example that screams “lack of a single source of truth that one can programmatically reason about in the presence of change”:

“Occasionally, a storage or network issue makes a VM unable to access its virtual disk, and thus causes the VM to crash. If no failure detector detects the underlying problem with the storage or network, the compute-cluster failure detector may incorrectly attribute the failure to the compute stack in the VM. For this reason, such gray failure is challenging to diagnose and respond to. Indeed, we have encountered cases where teams responsible for different subsystems blame each other for the incidents since no one has clear evidence of the true cause.”

Apstra’s AOS is a single source of truth of the intended state of the infrastructure, which understands the way resources (VM, network, storage) are composed to deliver a service, and allows proper identification of the problem, eliminating the blame game.


The Microsoft paper uses the following figure to describe a model for characterizing gray failure.

The paper argues that “a key feature of gray failure is differential observability: that the system’s failure detectors may not notice problems even when applications are afflicted by them.” It goes on to state that “A natural solution to gray failure is to close the observation gaps between the system and the apps that it services.”

In AOS, this gap is eliminated by the very essence of our approach: a service/system composite is declared operational only when expectations related to both the system and the service (app) are met. Once a developer programs those expectations into an AOS reference design, AOS auto-generates the expectations, auto-executes the tests, and raises anomalies when test results do not match expectations, all in the presence of change.

The authors further go on to describe how one can leverage scale to tackle the challenges of gray failures.

“… perhaps with the help of global-scale probing from many devices, we can obtain enough data points to apply statistical inference and thereby identify components with persistent but occasional failures … For instance, we can aggregate observations of VM virtual disk failure events and map them to cluster and network topology information.”

In order to leverage global-scale probing, one must have a system capable of ingesting this large amount of data. At #NFD16 we demonstrated how within AOS this is enabled by a granular pub/sub approach, in addition to mechanisms for horizontal scaling of the data store and processing in AOS.

In conclusion, the authors suggest:

“Therefore, we advocate moving from singular failure detection (e.g., with heart-beats) to multi-dimensional health monitoring. This is analogous to making assessments of a human body’s condition: we need to monitor not only his heartbeat, but also other vital signs including temperature and blood pressure”.

Apstra wholeheartedly agrees with what the Microsoft authors are advocating, and our approach achieves what they are after. To see for yourself, check out AOS.


The Fallacy of the Network Greenfield Versus Brownfield Conundrum


I was a panelist at the ONUG conference in New York, where we discussed the impact of automation and machine learning on IT jobs. It was a great conversation, which I enjoyed thoroughly. The debate of greenfield versus brownfield came up a few times in the conversation, and I remember jumping in at one point and clarifying that from a customer perspective, there was no “brownfield” versus “greenfield.”

Customers almost never “upgrade” brownfield environments; that would be akin to upgrading components or adding a new engine to a 10-year-old car that doesn’t meet any of the performance, emissions, or safety standards readily available when one purchases a new car. And customers rarely deploy pure greenfield environments. Their upgrade processes are constrained by the need to support legacy technologies that keep the lights on for the business, which prevents them from performing forklift upgrades of their entire environment. And even if they were able to do so, such an approach carries unreasonable cost and the risk of disrupting the business if the upgrade doesn’t go smoothly.

Instead, what I’ve seen customers do far more often is deploy what we like to call “green patches”. That is, they upgrade their infrastructure one incremental step at a time. This incremental new step could be a new rack or, more often, a new POD. This new green patch is based on the latest thinking in terms of architecture, and incorporates the guiding principles needed to meet the ever-evolving requirements of the business.

This is more true today than ever before. Because of the urgent need to embrace the digital transformation of their business, enterprises are embarking on an accelerated schedule to upgrade their infrastructure. The guiding principles that CIOs apply to their new green patches need to set the foundation for log-scale improvements in how compute and network infrastructures are built and operated: reducing capital and operational costs while increasing capacity and reducing risk in their platforms. These guiding principles are as follows:

Simplicity of operations through turn-key automation of the entire lifecycle of their network services, delivering on autonomous infrastructure operations freed from the inefficiencies of manual redundant configuration and troubleshooting tasks.

Free yourself from your hardware: there are plenty of hardware options on the market, from established vendors to open source offerings. Customers should have the ability to deploy the highest-capacity, most cost-effective option for their needs, and they need to do that seamlessly, without disrupting their operational model.

Ability to scale to meet the needs of the business. This is accomplished through proper scale-out architectures, and through operational models that make it possible to grow infrastructure or replace devices with newer, higher-capacity technology seamlessly.

Our customers choose Apstra for their data center green patches because we uniquely deliver on those three guiding principles, enabling them to achieve log-scale efficiencies in the costs of building and operating their new incremental infrastructure deployments, especially when compared to how they built and operate their brownfield environments. Our customers choose Apstra because they understand the exorbitant opportunity costs of doing nothing: a high rate of outages and a lack of agility, both of which amount to a basic inability to compete.

They believe that Apstra is the right choice because they appreciate the deep technological innovations we have incorporated in AOS, which make these guiding principles a reality: our distributed operating system approach, which provides a foundation for scale, extensibility, and reliability; our turn-key intent-based approach, which provides unprecedented simplicity through powerful automation of the entire lifecycle of their data center network services; our disaggregated approach, which allows them to treat hardware as a commodity; the systematic leveraging of open APIs and standards, which gives them control over their destiny; and last but not least, our graph-based representations, intent-based analytics, and continuous validation, which give them full confidence that their infrastructure is indeed operating as intended.

Don’t waste your time arguing the two sides of the greenfield versus brownfield conundrum. And don’t fall behind because of an inability to build the infrastructure your business requires. Upgrade your infrastructure using a green patch approach instead, setting yourself on the right path for the log-scale efficiency improvements required by your digital transformation initiative. We would love to help. Please feel free to contact us or schedule a demo.


Full Network Lifecycle Automation: So Easy Even a CIO Could Use It!


I’ve been around networking since the early 1990’s. My first router was a Cisco AGS running IOS 8.2(3) and had interface cables (appliques?) that sliced the back of your hands when installing them. I would use mainframe telnet emulation software to get to the CLI prompt and type away using show and config commands.

Fast forward 25 years… my hands have healed and telnet morphed into ssh, but it’s still the same process. This, despite the fact that “compute” (whatever happened to calling them “servers”?) was automated over a decade ago and switches effectively became servers with a bunch of ports. But compute automation tools — think Ansible, Puppet, and Salt — were not built for this type of work, because networks are distributed systems intertwined in ways that servers never will be. And these tools were meant to handle provisioning, which happens once at the start and rarely after that. Yet you still need to operate networks, which inevitably requires yet another tool! Why? Well, it can all be traced back to siloed organizational structures.

I remember one large bank I worked for had a Configuration Management Group (CMG). All they did was create/push configs based on diagrams an Engineering group gave them. I was in that Engineering group then and I loved it! I would create a before and after diagram and hand it over to them to figure out “the nitty gritty.” I felt sorta like a CIO, only vastly underpaid.

The engineering, of course, was based on an Architecture team’s 30,000 foot view of how the network should be. And after configuration was updated, the operations team had the privilege of taking over for the rest of their (network) lives.

Documenting all these stages and finger pointing between silos was a total nightmare. It went like this:

Ops: “We lost connectivity to Boston DC but I didn’t even know we had a Boston DC.”

Eng: “Check the Visio, it’s somewhere on shared drive, I think.”

Ops: “I don’t have permissions to see it…oh wait, now I do but this diagram is dated two years ago and there’s no Boston.”

Eng: “Damn, well I’m not in office right now, I’ll try to update it later but you could try Arch team.”

Ops: “OK.”

Arch: “Why are you calling me, I’m an Architect!”

But let’s say there was a tool that could automate each silo’s tasks: Design, Build, Deploy, and Validate/Operate, all derived from your high-level business intent.

That, my friends, is what Apstra does. And we aren’t only working with one switch vendor because unlike those vendors, we don’t want to sell you switches. Choose Cisco, Arista, Juniper, Cumulus, or TBD…and we’ve got you covered. We are a software solution and aren’t replacing the Network OS, because they are good at what they do! But we are automating and optimizing the NOS in this vendor-agnostic, top down way and it’s very, very cool.

Soooooo, I know everyone is busy, busy, busy – I still love this graphic:

It’s because of this and how amazing I think Apstra’s intent-based OS is that I created this demo and broke down each network lifecycle stage into 5-7 minute chunks. A mere 26 minutes total. Even a CIO could find this much time.

And as you’ll see, there’s nothing here that same CIO couldn’t do themselves, because we are intent-based: we take the business logic from you, then take care of everything else. We want to remove the mundane and repeatable so you can focus on what humans do best, which is creating new ways to help your business.

Have a look and drop a line to sales@apstra.com. I would love to go deeper on your needs and all the other amazing things Apstra does that I didn’t have time to discuss here, because I’m very, very busy.