The introduction of overlays in the early days of SDN enabled organizations to bridge the gap between the dynamic nature of their business policies and the static nature of their network. At the same time, overlays introduced significant challenges, which limited their adoption in the enterprise.
Responding to customer requests in this area, Apstra is today announcing AOS 2.0. Leveraging the recent advances in network operating system APIs and switch silicon support for VXLAN, AOS 2.0 delivers the first intent-based integrated underlay and overlay solution for the data center network.
Around a decade ago, when SDN discussions first began, switches had no APIs, and to deliver dynamic policy there was no choice but to bypass network engineering teams and extend an overlay on top of the physical network. This approach created a number of problems that limited the adoption of the technology:
Underlays and overlays are opaque to each other. Because the underlay and overlay are completely decoupled, it is that much harder for IT teams to debug networking problems. Was it caused by the overlay? The underlay? Through which links or interfaces do the packets pertaining to this particular overlay tunnel flow?
Organizational processes break with decoupled underlays and overlays. Overlays made it unclear who was really responsible for network services. The network engineering team? The compute team? The cloud team? Compute teams are often driven to buy and operate an overlay without the participation of network teams. That could mean that two network operators in the same data center don’t really work together, or even acknowledge each other. Worse, the networking team is often blamed, frequently without evidence. The network is the most critical asset in the data center; one operational team should be empowered and responsible, not two.
Overlays don’t easily work with bare metal devices. While most workloads are virtualized, there is a lot of bare metal out there: storage, databases, and many devices and appliances. The common solution is to build a gateway that encapsulates and decapsulates flows between the overlay and a bare metal segment. I am bullish on the use of an overlay, but quite honestly, a gateway for bare metal is a hack that only became necessary because of the unnatural fracture between underlay and overlay.
Introducing AOS 2.0:
Integrated overlay/underlay: Leveraging the innovative AOS state repository and intent modeling technology, all the state pertaining to the physical underlay, its topology, its logical entities, virtual networks, and all related telemetry are stored in the AOS distributed data store, and represented in a graph that captures all the pertinent relationships. As a result, AOS 2.0 provides powerful visibility into network state, including the physical and virtual, through its process of closed-loop, continuous validation of state against intent. In short, with AOS 2.0, the underlay/overlay correlation problem that has plagued first generation SDN solutions becomes a thing of the past.
All under the control of the network engineering team! AOS 2.0 enables a network infrastructure built on a modern Leaf/Spine architecture with multi-vendor, state-of-the-art equipment: an L3 underlay with L2 services stitched on top as an overlay, within the rack and across racks. Organizations can then deliver L2 connectivity for their applications, and enforce policies and security zones across their various application service tiers, all under the control of network teams.
Natural support for bare metal servers. With AOS 2.0, configuring a virtual network spanning two separate racks is done automatically using an API or our Web interface: “Please create a virtual network connecting these endpoints”. Under the hood, AOS 2.0 (1) configures VLANs connecting endpoints to Top of Rack (TOR) switches, (2) configures VXLAN tunnels between TOR switches that belong to different racks, and (3) validates in real time that the virtual network was properly configured (e.g., by continuously ensuring that VTEP routes are seen in the routing tables of all devices). This approach applies to both virtual and bare-metal endpoints and doesn’t require gateways.
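To make step (3) a little more concrete, here is a minimal Python sketch of that kind of check: every TOR switch should have a route to every other switch's VTEP address. The `Switch` class, the switch names, and the addresses are all made up for illustration; this is not AOS code or the AOS data model, just one way to picture the validation.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    # Hypothetical model of a TOR switch, for illustration only.
    name: str
    vtep_ip: str  # address used as the VXLAN tunnel endpoint
    routes: set[str] = field(default_factory=set)  # host routes currently installed


def missing_vtep_routes(switches: list[Switch]) -> dict[str, list[str]]:
    """For each switch, list the remote VTEP addresses it has no route to."""
    missing: dict[str, list[str]] = {}
    for sw in switches:
        expected = {peer.vtep_ip for peer in switches if peer is not sw}
        absent = sorted(expected - sw.routes)
        if absent:
            missing[sw.name] = absent
    return missing


tors = [
    Switch("tor-rack1", "10.0.0.1", {"10.0.0.2", "10.0.0.3"}),
    Switch("tor-rack2", "10.0.0.2", {"10.0.0.1"}),  # has lost its route to rack 3
    Switch("tor-rack3", "10.0.0.3", {"10.0.0.1", "10.0.0.2"}),
]
print(missing_vtep_routes(tors))  # {'tor-rack2': ['10.0.0.3']}
```

Running such a check continuously, rather than once at deployment time, is what turns it into closed-loop validation.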
In addition, AOS 2.0 leverages the same AOS core to provide the same unique AOS advantages:
Intent-Based, Vendor-Agnostic, and Closed-Loop: The vendor-independent approach of AOS is taken to another level with our multi-vendor implementation of VXLAN-based virtual networks. Owing to the Intent-Based approach of AOS, arcane vendor-specific configurations of VXLAN are abstracted away from network users. And owing to its closed loop continuous validation, complicated vendor-specific troubleshooting procedures are eliminated. The result is unprecedented hardware vendor choice and interoperability across both underlay and overlay.
Fully Automated: AOS 2.0 gets us closer to the vision of a self-operating and autonomous network infrastructure. It leverages the extensible foundation of AOS to deliver end-to-end automation of all phases in the life cycle of network services across the underlay and the overlay: design, build, deploy, and validate. This includes Day 0 Design and Initial Provisioning, Day 1 Builds and Day 2 Operational Changes and Troubleshooting capabilities. With unique system-wide commit capabilities for change operations and sophisticated continued validation and troubleshooting capabilities through intent-based analytics, AOS 2.0 delivers the most powerful autonomous operation capabilities available today.
Adding AOS 2.0’s new enterprise-class features (including RBAC, HTTPS, and Headless Operations), organizations can confidently start the process of migrating from legacy L2 data center infrastructures to modern Leaf-Spine infrastructures with fully automated and integrated L3 underlay and L2 overlay — all under the control of networking teams.
AOS 2.0 demonstrates that we are rapidly extending AOS capabilities. Customer-driven feature velocity is a key part of our vision, enabled by the extensible AOS architecture. This brings our customers expanded device support, and advanced intent-based analytics — which are coming as part of turn-key applications in future releases. Contact us to learn what AOS 2.0 can do for your network infrastructure and organization. A new era has begun, and we’re not looking back!
In my previous blog on Intent-Based Networking (IBN) I tried to provide some clarity about our vision of what IBN is, and what it is not. We stressed the importance of automating each and every phase of a service lifecycle by leveraging a single source of truth. Arguably, the most important phase is “closed-loop validation in the presence of change”. And at the core of it is what we call Intent-Based Analytics (IBA). Let’s define it and dig deeper into it.
The most important aspect of IBA is its ability to reason about change in a closed-loop fashion. And change can take place during any of the service lifecycle phases: design, build, deploy or validate. During the design phase I may want to add a rack. Or I may want to add a new virtual network or isolation domain. During the build phase I may want to change some IP or ASN assignments. Or swap an instance of a vendor A device with an instance of a vendor B device. During the deploy phase I may want to roll back some of the changes. Or I may want to tweak the way config is generated. But the key question is: have these changes produced the expected results?
As we mentioned earlier, changes can originate from an operator (business rule change). Could the AWS outage of February 2017 have been prevented if the automation system had semantic reasoning about the impact of the change? You could have a semantic validation that will allow an operator for example to have at most two, or at most 10% (whichever is smaller) of spine switches placed into “maintenance” mode at any given time. And if you intend to go over that limit, semantic validation will prevent you from doing it.
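As an illustration, the maintenance-mode rule above fits in a few lines of Python. The function name and its parameters are hypothetical (this is not an AOS API), and the sketch takes “whichever is smaller” literally, using integer arithmetic to round the percentage down.

```python
def maintenance_allowed(
    total_spines: int,
    already_in_maintenance: int,
    requested: int,
    max_count: int = 2,
    max_percent: int = 10,
) -> bool:
    """Return True if putting `requested` more spines into maintenance stays within policy."""
    # The limit is the smaller of the absolute cap and the percentage cap (rounded down).
    limit = min(max_count, (total_spines * max_percent) // 100)
    return already_in_maintenance + requested <= limit


print(maintenance_allowed(total_spines=40, already_in_maintenance=1, requested=1))  # True
print(maintenance_allowed(total_spines=40, already_in_maintenance=2, requested=1))  # False
print(maintenance_allowed(total_spines=10, already_in_maintenance=1, requested=1))  # False
```

Note that under this literal reading a fabric with fewer than ten spines can never put a spine into maintenance, because 10% rounds down to zero. A real policy would probably want to decide that edge case explicitly.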
In the case of operational status changes, it is even more important to reason about the changes in closed-loop. This is because (a) you are not in control of the change: you cannot control nature, and you cannot fully control your data center gear either, and (b) there could be thousands of these changes happening at once and you need to be able to handle this at scale. But wait; operational status is not part of the intent, is it? Of course it is. Intent is not only about configuring devices using a declarative specification of intent. (For more details about Apstra’s definition of IBN see these videos from NFD16.) It is, even more importantly, about specifying your expectations. You expect your Spine1 switch to be up. You expect Ethernet5 on it to be down. If they are not, you want an anomaly to be raised. You want the right people to be alerted. You want to reason about the change in the context of intent in an automated, programmatic way.
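Here is a minimal sketch of what “specifying your expectations” could look like in code, using the Spine1 and Ethernet5 example. The dictionary layout and the function name are assumptions made for illustration, not how AOS represents intent.

```python
def find_anomalies(expected: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Compare observed operational state against expectations and describe each mismatch."""
    anomalies = []
    for item, want in sorted(expected.items()):
        got = observed.get(item, "unknown")  # missing telemetry is itself a mismatch
        if got != want:
            anomalies.append(f"{item}: expected {want}, observed {got}")
    return anomalies


expected = {"Spine1": "up", "Spine1/Ethernet5": "down"}
observed = {"Spine1": "up", "Spine1/Ethernet5": "up"}
print(find_anomalies(expected, observed))
# ['Spine1/Ethernet5: expected down, observed up']
```

The point is that an interface being “up” is an anomaly here, because the expectation, not the raw state, defines what is normal.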
In the examples above, Spine1 was treated as an indispensable and unique system with identity, that can never be down. In the “pets” vs “cattle” analogy, Spine1 is your “pet”. But what if you designed your system for failure, and no system was indispensable or irreplaceable, i.e. you “routed around the failure”? You want to know how the ensemble “cattle” is doing. There is no single operational status to answer that. You need to synthesize that operational status by extracting knowledge from a multitude of raw telemetry data. And all of this in the context of intent that serves as your single source of truth. So what does “extracting knowledge” actually mean in this context?
Say for example you have a complex application “myApp”, that has multiple components running as VMs/containers on multiple servers, spread across racks in your data center. The servers fail and as a result workloads move around. Or the workload placement algorithm moves them around for optimization purposes. You don’t have “pet” servers on which your “pet” application is running. But you do want to know how your app is doing. So you may say for example, “I want a sufficiently large percentage of the links that carry the traffic of ‘myApp’ to (a) be up, and (b) not overloaded”.
Challenges and Solutions
So, what are the challenges associated with the ask above?
Challenge 1: Identifying Relevant Resources
The first challenge is identifying these links in the presence of change. You may not have the single source of truth to give you this info. Or the source of truth may not be up to date. Or it was updated but you are not notified of the change. My cabling is in spreadsheet A and my VM placement information is in system B. You could solve the above with some scripting around the limitations, but in general building an asynchronous notification mechanism from scratch is a daunting task.
Solution: In Apstra Operating System (AOS™), these complexities are handled by the core platform. So how does this information get into AOS in the first place? Consider the two scenarios. In the first scenario, AOS is not controlling the placement of workloads. If this is done externally you need to update AOS with the mapping of VMs to servers via well-defined and published APIs. This could be a simple integration. In the second scenario, you build a reference design that runs the workload placement within AOS, and as a result, AOS is aware of it, and nothing extra needs to be done.
Once the info is in AOS you can use a Live Query pattern to identify the links of interest. To put it in plain English, you ask a question: “What are the links that carry traffic between VMs that comprise the application ‘myApp’, and notify me of any change from now on?”. In Fig. 1 below we see myApp consisting of two VMs that are hosted on servers attached to leafs 1 and 2. The set of links and switches that carry the traffic of interest is highlighted in blue in Fig. 1.
Now, suppose a change took place that moved vm2 to a server on leaf 9. There is a set of links and switches that no longer carries myApp traffic; these are highlighted in red in Fig. 2. This fact is communicated via Live Query to AOS so that the telemetry on these links is no longer included in the calculations. Additionally, the new resources that do need to be included in the calculations (the new server hosting vm2 attached to leaf 9, and the associated links) are also communicated to AOS via the Live Query pattern.
Now you have a continuously up-to-date picture of the resources that matter to you. Challenge 1 solved with AOS.
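The essence of the Live Query pattern, as described above, is “give me the matching set now, and tell me what was added or removed from then on”. The sketch below is a toy version of that idea. The `LiveLinkQuery` class and the link names are hypothetical, and in a real system the matching set would come from querying the source of truth rather than being passed in by hand.

```python
from typing import Callable

Link = tuple[str, str]


class LiveLinkQuery:
    """Tracks the set of links matching a query and reports changes to a callback."""

    def __init__(self, on_change: Callable[[set[Link], set[Link]], None]) -> None:
        self._current: set[Link] = set()
        self._on_change = on_change

    def update(self, matching_links: set[Link]) -> None:
        # Diff the new result against the previous one and notify only on change.
        added = matching_links - self._current
        removed = self._current - matching_links
        self._current = set(matching_links)
        if added or removed:
            self._on_change(added, removed)


events = []
query = LiveLinkQuery(lambda added, removed: events.append((sorted(added), sorted(removed))))

# Initial placement: traffic between leaf1 and leaf2 through spine1.
query.update({("leaf1", "spine1"), ("spine1", "leaf2")})
# vm2 moves to a server on leaf9: one link drops out, one link comes in.
query.update({("leaf1", "spine1"), ("spine1", "leaf9")})

print(events[1])  # ([('spine1', 'leaf9')], [('spine1', 'leaf2')])
```

Whatever consumes these notifications can then start collecting telemetry on the added links and stop counting the removed ones.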
Challenge 2: I don’t have the required (raw) telemetry
The more data you have and know how to reason about, the more powerful your analytics are.
Solution: AOS supports the quick and easy addition of new data via telemetry collection extensions. Look for upcoming blogs on this topic.
Challenge 3: Composing The Analytics Workflow
The next challenge is that the raw telemetry data may not be sufficient for you. It may tell you that a link is “up” now, but is that enough info given that it was possibly down for 2 minutes just before you checked? So an instantaneous state may not be what you care about, but rather some “aggregated” information about recent history. Say you want the link to be “up” for 99.9% of the time in the last hour and “down” for no more than 1s during any “down” event. Otherwise you declare its recent connectivity status “unhealthy”. As for load, since you are ok with occasional spikes, you may want a link not to exceed a configurable utilization threshold for more than, say, 5s at a time. Otherwise you declare it “overloaded”.
Also, say you have built some resiliency into your application and can tolerate up to 10% of the links being “unhealthy” or “overloaded” (or both). Otherwise you want an anomaly, “myApp network has an issue,” to be raised, encapsulating all the complex knowledge extraction described above. This extraction is implemented by a logical data processing workflow consisting of the following stages (Fig. 3 below):
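To show what “aggregated information about recent history” might mean in practice, here is a rough Python sketch of the two per-link checks just described. The input formats are assumptions chosen for the example: link status arrives as time-ordered state changes, and utilization as time-ordered samples. None of these function names come from AOS.

```python
def down_events(changes, window_start, window_end):
    """Durations (in seconds) of each "down" period inside the window.

    `changes` is a time-ordered list of (timestamp, state) state changes;
    the first entry gives the state at the start of the window.
    """
    durations = []
    for (t, state), (t_next, _) in zip(changes, changes[1:] + [(window_end, None)]):
        if state == "down":
            start, end = max(t, window_start), min(t_next, window_end)
            if end > start:
                durations.append(end - start)
    return durations


def connectivity_status(changes, window_start, window_end,
                        min_up_fraction=0.999, max_down_event=1.0):
    """"healthy" if the link was up long enough and no single outage was too long."""
    downs = down_events(changes, window_start, window_end)
    up_fraction = 1 - sum(downs) / (window_end - window_start)
    if up_fraction >= min_up_fraction and all(d <= max_down_event for d in downs):
        return "healthy"
    return "unhealthy"


def is_overloaded(samples, threshold, max_seconds=5.0):
    """True if utilization stays above `threshold` for longer than `max_seconds`."""
    run_start = None
    for t, utilization in samples:
        if utilization > threshold:
            if run_start is None:
                run_start = t
            if t - run_start > max_seconds:
                return True
        else:
            run_start = None  # the spike ended, so start over
    return False


HOUR = 3600
print(connectivity_status([(0, "up"), (100, "down"), (100.5, "up")], 0, HOUR))  # healthy
print(connectivity_status([(0, "up"), (100, "down"), (220, "up")], 0, HOUR))    # unhealthy
print(is_overloaded([(0, 0.95), (2, 0.95), (4, 0.95), (6, 0.95)], threshold=0.9))  # True
```

The second call is the “down for 2 minutes just before you checked” case: the link is up right now, but its recent history says otherwise.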
Collect statuses from all links carrying “myApp” traffic
Collect traffic utilization info from all links carrying “myApp” traffic
Create recent history time series (denoted as “TS” in Fig. 3) for the two metrics above
Perform the analysis on the time series to calculate time spent in “up”/“down” states and time over the utilization threshold.
Combine the results to calculate the percentage of “unhealthy” or “overloaded” links
Raise an anomaly if more than 10% of links are “unhealthy” or “overloaded” or both
This logical analytics workflow is essentially a directed acyclic graph of stages where each of the stages is one of (a) accumulation of resources of interest, (b) post-processing of telemetry, (c) some conditional expression that raises an anomaly.
Now, assume you solved challenges 1 and 2 (you have the right resources dynamically updated and all the required telemetry available). You have two options: (a) implement this logical knowledge extraction workflow in your mind by staring at a “dashboard” with tens or hundreds of plots, each corresponding to a stage in Fig. 3, and doing the correlation yourself, or (b) build your own analytics workflow engine from scratch.
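The last two stages of that workflow (combine, then raise) can be sketched in a few lines of Python. This assumes the earlier stages have already labelled each link; the function names and link names are hypothetical. The tolerated fraction is a parameter, and the example uses the 10% resiliency figure mentioned earlier.

```python
from typing import Optional


def bad_link_fraction(link_health: dict[str, str], link_load: dict[str, str]) -> float:
    """Fraction of links that are "unhealthy", "overloaded", or both."""
    if not link_health:
        return 0.0
    bad = [
        link for link in link_health
        if link_health[link] == "unhealthy" or link_load.get(link) == "overloaded"
    ]
    return len(bad) / len(link_health)


def evaluate(link_health: dict[str, str], link_load: dict[str, str],
             tolerated_fraction: float) -> Optional[str]:
    """Return an anomaly message if too many links are bad, otherwise None."""
    fraction = bad_link_fraction(link_health, link_load)
    if fraction > tolerated_fraction:
        return f"myApp network has an issue: {fraction:.0%} of links unhealthy or overloaded"
    return None


# 20 links carrying myApp traffic; two are unhealthy and a third is overloaded.
health = {f"link{i}": "healthy" for i in range(20)}
load = {f"link{i}": "normal" for i in range(20)}
health["link3"] = health["link7"] = "unhealthy"
load["link11"] = "overloaded"

print(evaluate(health, load, tolerated_fraction=0.10))
# myApp network has an issue: 15% of links unhealthy or overloaded
```

A link that is both unhealthy and overloaded is counted once, which matches the “or both” wording of the stage.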
Solution: Wouldn’t it be nice if you could achieve the above by designing such a logical workflow, specifying it declaratively (as a directed graph of stages) and then deploying it with a single POST request to AOS, and have the rest done automatically? Challenge 3 solved with AOS.
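To give a feel for what “specifying it declaratively” could look like, here is a sketch that writes the six stages down as plain data and checks that they form a valid directed acyclic graph. The schema (stage names, `type`, `inputs`) is invented for this example and is not the AOS format. The resulting JSON string is simply the kind of body a single POST request might carry, and no real endpoint is shown or implied.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical declarative description of the workflow: each stage lists its inputs.
workflow = {
    "name": "myApp_network_health",
    "stages": {
        "link_status": {"type": "collect", "inputs": []},
        "link_utilization": {"type": "collect", "inputs": []},
        "status_history": {"type": "time_series", "inputs": ["link_status"]},
        "utilization_history": {"type": "time_series", "inputs": ["link_utilization"]},
        "unhealthy_links": {"type": "analyze", "inputs": ["status_history"]},
        "overloaded_links": {"type": "analyze", "inputs": ["utilization_history"]},
        "bad_link_percentage": {
            "type": "combine",
            "inputs": ["unhealthy_links", "overloaded_links"],
        },
        "myApp_anomaly": {"type": "raise_anomaly", "inputs": ["bad_link_percentage"]},
    },
}


def execution_order(spec: dict) -> list[str]:
    """Order the stages so that every stage runs after its inputs."""
    graph = {name: set(stage["inputs"]) for name, stage in spec["stages"].items()}
    # static_order() raises graphlib.CycleError if the stages do not form a DAG.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
payload = json.dumps(workflow)  # what a client would send in one request
print(order[-1])  # myApp_anomaly
```

The appeal of the declarative form is that the description is just data: it can be validated, versioned, and handed to an engine that does the rest.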
Ok, we got this information-rich anomaly. Is there anything else one may want to do?
Challenge 4: Automated Drill-down
Say you have now identified that the number of unhealthy/overloaded links is higher than your application can tolerate. What do you want to do about it? Ideally, you want to follow up with the next level of investigation and troubleshooting. But to do that you again need a system that stores all relevant context (single source of truth) and that you can ask the right questions (reason programmatically) in the presence of change (asynchronously). Building it from scratch? Daunting.
Solution: Again, all of these features are an essential part of AOS. It takes your intent as context. It reacts to change. It synthesizes complex anomalies programmatically. And then you can react to these anomalies to automatically trigger the next level of detailed follow-ups that you could not afford (or did not want) to do all the time, across all the resources. For example, once the above anomaly is raised, you may want to do one or more of the following by leveraging the same mechanism used for any reasoning logic in AOS: the Live Query mechanism. It reacts to the creation of the anomaly, and then during the processing stage (callback function) of the Live Query you can:
Initiate the next level of telemetry to identify which flows are contributing to overloaded links, and try to identify elephant flows
Take a snapshot of relevant info (which other applications were running on the servers experiencing problems)
Log the info above for post mortem analysis
Any other secret technique under your belt
Best of all, these follow-up actions are simply just the next set of stages in our knowledge extraction analytics workflow. Challenge 4 solved by AOS.
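As a rough picture of follow-up actions being “just the next set of stages”, the sketch below registers a couple of actions that run whenever an anomaly is reported. Everything here is hypothetical (the class, the action functions, the shape of the anomaly), and the actions only return a description of what they would do rather than touching any device.

```python
from typing import Callable


class AnomalyFollowUps:
    """Runs registered follow-up actions whenever an anomaly is reported."""

    def __init__(self) -> None:
        self._actions: list[Callable[[dict], str]] = []

    def register(self, action: Callable[[dict], str]) -> None:
        self._actions.append(action)

    def report(self, anomaly: dict) -> list[str]:
        # Each action receives the anomaly with its context and returns a log line.
        return [action(anomaly) for action in self._actions]


def start_flow_telemetry(anomaly: dict) -> str:
    return f"collect per-flow counters on {', '.join(anomaly['links'])}"


def snapshot_neighbors(anomaly: dict) -> str:
    return f"snapshot other applications on servers behind {', '.join(anomaly['links'])}"


follow_ups = AnomalyFollowUps()
follow_ups.register(start_flow_telemetry)
follow_ups.register(snapshot_neighbors)

log = follow_ups.report({"name": "myApp network has an issue", "links": ["leaf1-spine1"]})
print(log[0])  # collect per-flow counters on leaf1-spine1
```

Because the anomaly carries its own context (here, the affected links), the expensive telemetry is only collected where and when it is needed.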
If you were to do this yourself, in the absence of a single source of truth that you can reason about programmatically in the presence of change (which is functionality provided by AOS Core), you would have to do the following:
Provide a coordination layer that will consolidate different sources of truth
Make that coordination layer capable of providing granular asynchronous notifications
Have a way to define your domain model and persist it
Have a framework to auto execute tests driven by your domain model
Build from scratch or manage a 3rd party data processing pipeline
One of my favourite quotes comes from Mike O’Dell, a pioneer in building UUNET, likely the most sophisticated MPLS network of its time. I attended one of his presentations about challenges in building MPLS networks and his first slide had the following text: “If you are not scared, you don’t understand.” So, if you feel scared trying to build from scratch the system described above, that is a good sign. You are not alone. Whether you really want to do it is a different question.
Now imagine you could achieve the above by designing such a workflow: specify it declaratively (as a directed graph of stages) and deploy it with a single POST request. Have AOS automate the collection of the telemetry and create a data processing pipeline, resulting in actionable and context-rich data, including, but not limited to, raising anomalies. Extract more knowledge while collecting less data. Or better yet, forget about the “imagine” part above, and stay tuned.