It’s because I had an Aha! moment when they walked me through what AOS is and how it intends to change the way networks are managed. I’ve experienced these sorts of moments a few times in my career and they’ve always led to transformational shifts in the industry, and for me personally, as I grew in my career.
I am extremely excited to be joining the talented and visionary team at Apstra. Networking has been at the heart of my career (3 decades strong!). I’ve seen firsthand the creative destruction of the status quo that makes the world of networking so interesting. Today Intent-Based Networking is challenging the status quo and I love that we at Apstra are at the center of helping our customers on this revolutionary journey.
This is the second part of a blog series on using AOS to easily operate your data network. Click here to read part one.
Last time, we were talking about how we can use AOS to gracefully drain traffic off network devices in order to perform maintenance with a “Software-First” approach. You were probably scratching your head at the end saying “Now what?” Obviously, getting application traffic off the network is just step one in the operational process. So what happens next? Well, typically you are doing one of the following tasks:
- Troubleshoot device offline
- Replace failed or damaged device
- Upgrade the Device OS
Let’s focus on that last option, which every operator has been challenged by.
Network Device OS Upgrades are a required function in any modern enterprise. The OS can be affected by known or unknown bugs, security vulnerabilities, or more. The operator may also wish to upgrade the OS in order to activate a new feature. Whatever the reason is, with a fixed form factor device with a single CPU (or supervisor), we can expect some sort of outage due to the reload process. So as we have previously described, it typically makes sense to place the equipment into maintenance mode before performing these actions.
Multivendor Fabrics? No Problem.
Like every feature in AOS, the Intent is separated from the method by which we accomplish the stated goal. You shouldn’t have to be an expert in every vendor’s OS in order to manage a mixed hardware network. So the device OS upgrade process does not require any CLI input from the administrator: they simply select the devices to be upgraded, pick the proper OS from a drop-down list, and submit the job. AOS can manage multiple simultaneous upgrades and even multiple active jobs for different OS types.
Positioning and Validating the Images
In advance of scheduling the upgrade, the operator should upload the OS images to the AOS server. When the images are being copied, you have the option to store the hash value provided by the vendor, so that each image can be verified as intact and free of errors. This information is stored in the AOS Server. In fact, we can use the hash value to regularly check the devices to make sure no corruption has taken place in our file system.
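To make the hash check concrete, here is a minimal Python sketch of the idea. It is not how AOS implements it. It assumes the vendor publishes a SHA-512 digest (vendors vary, and the algorithm is our assumption), and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha512(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large OS images never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_valid(path: Path, vendor_hash: str) -> bool:
    """Compare the computed digest with the vendor-published value."""
    # Normalize whitespace and case, since published hashes are formatted inconsistently.
    return file_sha512(path) == vendor_hash.strip().lower()
```

Running the same comparison on a schedule against the copy stored on each device is one way to catch file system corruption before an upgrade depends on that image.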
AOS also supports storing the images on a dedicated HTTP server, which allows for an engineering team to centralize all the images even if they are not being used with AOS.
Upgrading, or even Downgrading
Once the job has been kicked off, AOS copies the image to the storage location on each device. When the file copy is complete, the bootfile command is set to the new image and the device is reloaded. If the operator has placed the device into Maintenance Mode, then the device comes back in the same state. If the mode was not set on the device, it will come back up in a fully operational state with the service configuration.
You can use this workflow to both upgrade and downgrade the OS. Sometimes when we upgrade we encounter new problems and have to go back to the previous version. For this reason, we recommend storing all OS images on the AOS Server or the related HTTP server.
Hey, that wasn’t there before!
Frequently when you perform an upgrade, new default command settings appear in the running configuration syntax. This is due to subtle changes in the vendor’s source code. Typically these changes can be anticipated by reading the release notes, but on occasion new commands or even modified default settings will appear after a reload. AOS provides an easy way to identify these problems and resolve them.
In an AOS managed network, every device configuration is checked every 60 seconds for changes. When a device is reloaded and the AOS Agent is activated during boot, the configuration is checked immediately. ANY CHANGE, no matter how subtle, will be identified by AOS and presented to the administrator as a Config Anomaly. At that point, you can manually adjust the new command settings, or have AOS automate the change, or simply accept the new settings as a baseline, or “Golden” config. Once the change has been accepted, AOS expects that command to be there, and you will be alerted with another anomaly if any other change appears.
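The compare-then-accept loop described above can be sketched in a few lines of Python. This is only an illustration of the concept, not AOS code: the function name and the sample configuration lines are hypothetical, and a real system would compare structured configuration rather than raw text.

```python
import difflib


def config_anomalies(golden: str, running: str) -> list[str]:
    """List lines removed from ("- ") or added to ("+ ") the golden config."""
    golden_lines = golden.splitlines()
    running_lines = running.splitlines()
    matcher = difflib.SequenceMatcher(a=golden_lines, b=running_lines, autojunk=False)
    anomalies: list[str] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        anomalies += [f"- {line}" for line in golden_lines[i1:i2]]
        anomalies += [f"+ {line}" for line in running_lines[j1:j2]]
    return anomalies


# Hypothetical configs: the reload introduced one new default command.
golden = "hostname leaf1\nrouter bgp 65001\n"
running = "hostname leaf1\nservice new-default enabled\nrouter bgp 65001\n"

print(config_anomalies(golden, running))

# Accepting the change makes the running config the new baseline,
# so the same comparison no longer reports anything.
golden = running
print(config_anomalies(golden, running))
```

Any later drift from the accepted baseline shows up again as a new entry in the list, which mirrors the "another anomaly" behavior described above.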
Automated OS Version Compliance Checks
AOS uses Intent-Based Analytics (IBA) to check elements of the network at regular intervals for problems. One of the more popular IBA “probes” is the OS Version Check, which looks at the running OS version on all devices from a certain vendor and triggers an anomaly if the version does not match. In AOS 3.0 we augmented this function with the Global SLA feature. This allows a member of the network or security team to set a preferred value once in AOS and refer to that variable in any number of IBA probes. For example, you could set the approved EOS version for Arista devices to “184.108.40.206F”. This is like envvars for your entire network.
The OS Check probe will then create an anomaly for every device running any version that differs from that. If the security team is alerted to a vulnerability in that version, they can simply change the version listed in the SLA and the IBA probes will automatically be updated to look for the new version.
As a result, all devices that were running the old version will now show an anomaly, and the count of these anomalies defines how many OS upgrades you need to do to be in compliance. The security team can watch this number in real time; it will decrease with every successful OS upgrade. By using Role Based Access Control (RBAC), any member of the organization can be fully in the loop on the OS remediation process.
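As a rough illustration of the compliance check described above, the Python sketch below models the approved version as a single shared value that a probe reads. The class, function, device names, and version strings are all hypothetical placeholders and do not reflect the actual IBA probe API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    vendor: str
    os_version: str


def version_anomalies(devices: list[Device], approved: dict[str, str]) -> list[Device]:
    """Return devices whose running version differs from the approved one for their vendor."""
    return [
        device
        for device in devices
        if device.vendor in approved and device.os_version != approved[device.vendor]
    ]


# Placeholder inventory and version strings, for illustration only.
devices = [
    Device("leaf1", "vendor-a", "1.0.0"),
    Device("leaf2", "vendor-a", "1.0.0"),
    Device("spine1", "vendor-a", "1.0.1"),
]

approved = {"vendor-a": "1.0.0"}
print(len(version_anomalies(devices, approved)))

# The security team changes the approved value once; every probe reading it follows.
approved["vendor-a"] = "1.0.1"
print(len(version_anomalies(devices, approved)))
```

The length of the returned list plays the role of the anomaly count: it is the number of upgrades still outstanding, and it shrinks as devices move to the approved version.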
9 out of 10 Engineers Agree, Do Your Upgrades!
AOS was designed to improve the lives of operators and increase the efficiency of businesses by rapidly upleveling the capabilities of the people who manage these systems. Prior to using AOS, Bloomberg had a single engineer working on OS upgrades, taking upwards of 8 months to upgrade 174 switches. The same tasks could have been completed with AOS Maintenance Mode and Device OS Upgrade in approximately 87 hours. In fact, with parallel OS upgrade job support, the entire network could have been upgraded in a single day.
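For a back-of-the-envelope check of those numbers, here is a small Python sketch. It assumes the 87 hours is the serial total and that every switch takes the same time to upgrade. Both are simplifications on our part, not measured data.

```python
import math

SWITCHES = 174
TOTAL_HOURS_SERIAL = 87

# Average time per switch if the upgrades run one after another.
hours_per_switch = TOTAL_HOURS_SERIAL / SWITCHES


def wall_clock_hours(parallel_jobs: int) -> float:
    """Elapsed time when upgrades run in equal-length batches of parallel_jobs."""
    batches = math.ceil(SWITCHES / parallel_jobs)
    return batches * hours_per_switch


def min_parallel_jobs(deadline_hours: float) -> int:
    """Smallest number of parallel jobs that finishes within the deadline."""
    if deadline_hours < hours_per_switch:
        raise ValueError("deadline is shorter than a single upgrade")
    jobs = 1
    while wall_clock_hours(jobs) > deadline_hours:
        jobs += 1
    return jobs


print(hours_per_switch)        # 0.5 hours, i.e. 30 minutes per switch
print(min_parallel_jobs(24))   # 4 parallel jobs fit inside a single day
```

Under these assumptions, four upgrades running side by side bring the whole fleet in at about 22 hours.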
While OS upgrades are typically not the most exciting task, they are absolutely required and need to be completed in relatively short time periods. Don’t believe me? Check out what Andrew Lerner at Gartner had to say about it (https://blogs.gartner.com/andrew-lerner/2014/10/13/sorrydrb/). With the automated tools built into AOS, we can easily upgrade complex multivendor network topologies with a consistent workflow and validation assistance. This ensures that the upgrades are completed quickly and accurately, allowing you to return to working on more challenging projects.
For detailed documentation on how Device OS Upgrades function, please contact Apstra Sales at email@example.com.
If you’ve ever worked in network operations, you undoubtedly have a few crazy stories about network outages and the unusual hours you had to work to resolve them. Outages never seem to happen at 10AM on Monday when everyone is at their desk, freshly loaded with coffee. No, the network Gods seem intent on things failing in the middle of the night or when traffic is close to its peak. And if the failure was during working hours, undoubtedly you’ll have to do some sort of remediation at unpleasant times, without the benefit of having the entire team available to help.
Why has this become the standard process for dealing with issues? Shouldn’t there be some intelligent software that can automate workflows across an entire network regardless of what vendor hardware is being used? (Spoiler Alert: A “Software First” approach will help solve many of these issues).
No Shortage of Potential Problems
The number of potential problems in a network is unlimited. Devices misbehave or fail outright, optics fail regularly, copper and fiber cables become crimped or even cut, and security vulnerabilities affecting multiple device operating systems are routine. As owners and caretakers of these environments, we are responsible for designing and operating the network to avoid the most common issues, but we can always expect some sort of issue to appear.
Hardware failures will always exist. We can design redundant hardware and topologies, but when things break, how can we easily replace the equipment without causing further degradation to services?
Software bugs and issues don’t always appear immediately. Sometimes specific conditions set them off, and sometimes increased load drives them to the surface. Occasionally we discover them through notifications from the vendor, typically in the form of a security notice or PSIRT.
Probably worse than outright failures are “gray failures” which add uncertainty to our networks. A device that is dropping a large number of packets still needs to be replaced, but until the operator has been able to perform this work, real application traffic will continue to be forwarded through it.
When a device has to be replaced but is still participating in the routing and switching for the fabric, several steps need to be taken to ensure that traffic is gracefully drained. For example, we don’t want to simply shut down the BGP routing process, as this would impact existing traffic flows. We also don’t want to modify the L2 switching plane as this could prove disastrous to the rest of the network. We want a consistent and well-tested routine for placing devices into maintenance mode. With the increased level of trust in this routine, we can perform corrective actions during more potential change windows.
Operator Quality of Life
Most network operators are used to the fact that the business is rarely tolerant of changes in the middle of the day. In fact, as the network provides the lifeblood for all of IT, most businesses do not permit changes outside of very small maintenance windows, which are typically early morning on the weekends. Who wants to go into their datacenter at 4AM on Sunday morning? We need a tool that works across different platforms and vendors to reliably drain a device of application traffic and place it into a quarantined state while we repair it.
Introducing AOS Maintenance Mode
Apstra AOS was built by network engineers, for network engineers. As a result, it includes tools and workflows specifically designed to address the challenges of running a modern data center. AOS supports a complete workflow for taking fabric devices out of service while minimizing impact to active flows. In addition, these workflows are vendor-agnostic, so the same process will occur on different vendor devices even if the command syntax and workflow differ. Lastly, and probably most importantly, by placing a device into Maintenance Mode, we update the overall intent for the network, so monitoring and service expectations are automatically adjusted. No more false positives or the famous “sea of red” in our dashboards.
AOS enables Maintenance Mode by making small changes to route filters within the BGP process. This causes packets to be routed onto alternate paths according to the ECMP load balancing algorithm. Also, AOS can shut down server-facing ports to force traffic onto the MLAG peer. These combined changes result in one or fewer lost packets within our application flows, which are easily recovered by TCP or upper-layer protocols. This change occurs in seconds, and the device can then be removed from the network for corrective actions.
Adding the route-map to the BGP process eliminates path selection through the drained node.
Draining devices with disabled ports on an MLAG pair.
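To picture what draining means for path selection, here is a toy Python model. It is a deliberate simplification and not how any switch implements ECMP: real hardware hashes packet headers in silicon, and the drain is driven by BGP route filtering rather than a Python set. The function and device names are hypothetical.

```python
import hashlib


def pick_next_hop(flow: tuple, next_hops: list[str], drained: frozenset = frozenset()) -> str:
    """Pick one next hop for a flow by hashing it across the non-drained candidates."""
    candidates = sorted(hop for hop in next_hops if hop not in drained)
    if not candidates:
        raise ValueError("no next hop available for this flow")
    # Hash the flow identifier so a given flow consistently maps to one path.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]


spines = ["spine1", "spine2", "spine3", "spine4"]
flow = ("10.0.0.5", "10.0.1.9", 6, 49152, 443)  # src, dst, protocol, sport, dport

print(pick_next_hop(flow, spines))
# With spine2 drained, it is never selected, whatever the flow hashes to.
print(pick_next_hop(flow, spines, drained=frozenset({"spine2"})))
```

The point of the sketch is only that once the drained node is filtered out of the candidate set, no flow can be hashed onto it, while the remaining equal-cost paths keep carrying traffic.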
AOS also provides advanced monitoring of drained devices with prebuilt Intent-Based Analytics probes. These probes run within the network all the time. When a device enters maintenance mode, an anomaly is issued if the device has more than a standard level of traffic flowing through the fabric ports. This ensures that we do not modify or power off devices that have not been fully drained. The operator can select the exact traffic level they wish to alert on.
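The check behind such a probe can be sketched as a simple threshold comparison. The Python below is illustrative only: the function names, port names, and traffic figures are hypothetical, and the real probes operate on live telemetry rather than a static dictionary.

```python
def undrained_ports(fabric_port_bps: dict[str, float], threshold_bps: float) -> list[str]:
    """Return fabric ports still carrying more traffic than the alert threshold."""
    return sorted(port for port, bps in fabric_port_bps.items() if bps > threshold_bps)


def safe_to_power_off(fabric_port_bps: dict[str, float], threshold_bps: float) -> bool:
    """A device counts as drained only when no fabric port exceeds the threshold."""
    return not undrained_ports(fabric_port_bps, threshold_bps)


# Hypothetical traffic rates (bits per second) on a device in maintenance mode.
rates = {"Ethernet1": 1_200.0, "Ethernet2": 850_000.0}

print(undrained_ports(rates, threshold_bps=10_000.0))
print(safe_to_power_off(rates, threshold_bps=10_000.0))
```

Leaving the threshold as a parameter reflects the point above that the operator chooses the traffic level worth alerting on, since a small amount of control-plane chatter is normal even on a drained device.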
Once the device has entered Maintenance Mode, we can perform other automated actions. In AOS this typically involves moving a device into the Ready state. The Ready state shuts down all L2 and L3 features (L3 routed mode for all interfaces, no neighbor adjacencies, LLDP only).
Once a device is in the Ready state we can simply turn it off and remove it from the rack. A replacement device can be the same vendor hardware or even a different vendor or form factor, freeing us up from finding the exact model previously used. In fact, you can use this method to gradually replace existing switches or even a vendor completely, as AOS automatically renders a configuration for each vendor type without the operator having to do anything aside from selecting the new vendor type from a dropdown box. So we have complete freedom to insert new hardware and ensure that it behaves exactly as the previous device did.
Finally, this process is perfect for performing NOS upgrades.
Stay tuned for Part 2 of this series on how AOS automates and validates NOS upgrades across an entire IP fabric with a few simple clicks in the UI.
For detailed documentation on how Maintenance Mode functions, please contact Apstra Sales at firstname.lastname@example.org.
Core components to improving your security posture through Intent-Based Networking include a single source of truth, continuous real-time validation and the ability to swap or upgrade devices quickly. These components are not enough, though.
Audit, time machine, and roll-back your infrastructure
Without a single source of truth, it is impossible to properly audit all changes that are taking place across your infrastructure. And without a proper audit, it is impossible to know whether your infrastructure has been compromised. With Intent-Based Networking, not only do you have one source of truth, but because all changes are done through software, they’re all recorded. You can go back to an audit trail, or even go back in time through a “time machine” like functionality. Doing so helps improve your security posture at many levels:
- You have the ability to monitor your audit trail for any suspicious activities
- Your Intent-Based Networking system can be programmed to look for suspicious activities. Examples of such activities that the system can easily detect are the creation of new agents and processes, or changes to agents that enable them to accept incoming connections.
- When you do witness suspicious activity, your Intent-Based Networking system can automatically raise an alarm, or allow you to roll back to a known “safe” state.
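As a rough sketch of what scanning an audit trail for these activities could look like, consider the Python below. Everything in it is hypothetical: the event fields, the action names, and the sample data are ours and do not describe the AOS audit format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    timestamp: str
    user: str
    action: str
    target: str


# Hypothetical action names for the suspicious activities listed above.
SUSPICIOUS_ACTIONS = {"agent_created", "process_created", "agent_listen_enabled"}


def suspicious_events(trail: list[AuditEvent]) -> list[AuditEvent]:
    """Return the audit events whose action is on the watch list, in original order."""
    return [event for event in trail if event.action in SUSPICIOUS_ACTIONS]


trail = [
    AuditEvent("2019-01-01T02:00:00Z", "alice", "config_committed", "leaf1"),
    AuditEvent("2019-01-01T02:05:00Z", "unknown", "agent_listen_enabled", "leaf1"),
]

for event in suspicious_events(trail):
    print(f"ALERT {event.timestamp}: {event.action} on {event.target} by {event.user}")
```

Because every change goes through software and is recorded, a filter like this can run continuously, and a match can either raise an alarm or trigger a roll-back to the last known safe state.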
Never, ever log into a device!
In today’s world, operators log into devices, and use the Command Line Interface (CLI) to make changes, or debug problems. This approach is fundamentally broken and insecure because it is far too easy for bad actors to take control of the devices.
With an Intent-Based Networking system, operators never log into a device. Moreover, devices never accept incoming connections. Devices only talk to the Intent-Based Networking system, which controls and protects the connection to those devices.
Properly secure, distributed architecture
Last but not least, none of this would matter if your Intent-Based Networking solution itself gets compromised. This is why a proper security posture also requires architecting the solution itself with security in mind. Apstra AOS is a software-first distributed system, which consists of many processes, each process only connecting to the Graph Datastore with secure, encrypted connections. The processes themselves do not accept any incoming connections; and the distributed data store authenticates connections and imposes access control.
Improve your security posture by adopting Intent-Based Networking and a “Software-First” approach
With security being of paramount concern, organizations should build their infrastructures with security as a top priority. By taking a “software-first” approach and deploying Intent-Based Networking, organizations can make quick progress in terms of their security posture by avoiding some of the most common causes of security mishaps — including lack of visibility, lack of consistency and uniformity, lack of accountability, and inability to resolve problems quickly when they arise.
Intent-Based Networking forces discipline into the operational model, driven at the core by a single source of truth. The single source of truth guarantees uniformity of policy and consistency of workflows; it is the foundation of real-time continuous validation tests; and it ensures visibility and dramatically reduces the mean time to insight when problems and security vulnerabilities do occur. It reliably prevents many security problems. Intent-Based Networking also helps to fix problems quickly when they arise, either by swapping devices quickly, upgrading software, or reverting to a known state.
In summary, Intent-Based Networking can dramatically improve organizations’ security posture. This is in addition to Intent-Based Networking’s proven benefits in delivering an order of magnitude acceleration in business velocity, an order of magnitude improvement in infrastructure reliability and an 83% reduction in costs.
If you’re interested in joining our Fortune 500 customers who are well on their way to transforming their infrastructures using a software-first approach, please contact us — we’d love to hear from you!
[Read the first blog in this two part series here]
You may have read in the news about horrific security gaps that have the potential of bringing down whole infrastructures, leaking critical business and personal data, and exposing organizations to massive liability.
There is no question that improving organizations’ security posture is a critical requirement for infrastructure and security teams.
While there are thousands of security point solutions addressing specific security threats, it is important that infrastructure teams are also diligent and implement approaches that, at the foundational level, enforce the level of discipline and hygiene required to maintain a good security posture. With that in mind, “Software-First” Intent-Based Networking can offer organizations significant improvements to their security posture. This blog explains why.
Single Source of Truth, Continuous Real-Time Validation
Without a single source of truth
Most organizations today do not have a single source of truth to capture the intent of their infrastructure. Intent is captured across various systems, in some cases spreadsheets and documents. The lack of a single source of truth for intent means there is often a deviation between what the architect originally intended, and what is actually implemented in the network. Changes are made to these networks over time and often documented by individuals who may no longer be at the company. We see so many operators worry about “touching anything” because they don’t know what’s there. For example, network engineers fear removing or changing access lists because they don’t know why they are there in the first place.
Needless to say, this situation creates an environment which can introduce dangerous security vulnerabilities that are easily exploited.
Data center infrastructures are becoming more distributed, more heterogeneous, and increasingly span multiple domains (various locations, private and public clouds, campus and edge).
Different domains are operated by multiple organizations using different systems within the same company. In some cases, the systems in place are completely manually operated. In other cases, there may be a software defined layer that controls some aspect of the security policy, while connectivity is managed by some other systems.
As a result, there is no consistent method by which an operator can enforce one uniform set of security policies across more than one domain, let alone across all their domains. In fact, blatant gaps exist in today’s environments. For example, you may be able to enforce security policies over your virtualized environment, but it can’t extend to bare metal servers or storage arrays. Operators are forced to program these policies manually, which is error prone. These gaps create dangerous security vulnerabilities.
Even if you had control of those domains, and think you pushed the correct configurations, there may be bugs in hardware or the device operating system that prevent the configuration from taking effect. Unless you have the ability to test that your configuration actually worked, and that your security policy has been applied, you are still at risk.
Multidomain unified group-based policy and automation
“Software-First” Intent-Based Networking provides an ability to define global intent and security policy using a single source of truth. It also offers the capability to enforce these security policies across multiple heterogeneous domains. Changes in intent are updated automatically in the single source of truth, and then in turn, automatically enforced by the infrastructure. Last but not least, an intent-based system continuously validates in real time that the infrastructure is delivering on intent; therefore, operators can be confident the policies they’ve defined are indeed being enforced.
In summary, “software-first” Intent-Based Networking addresses these policy gaps and, as a result, significantly improves an organization’s security posture. The term software-first indicates that the entire multi-domain infrastructure is defined, programmed and operated through a single software-based system. This remains true regardless of the systems, products or vendors the engineers have chosen to implement the infrastructure. Software-first consolidates policy definition and enforces that policy end-to-end.
Ability to swap or upgrade devices quickly
Today, organizations are at the mercy of their hardware vendors’ bugs and quality problems (both hardware and device operating systems). Security vulnerabilities are common and are routinely discovered on infrastructure devices. When a hardware vendor discovers a security vulnerability in a customer’s hardware and device OS, the customer must wait for the hardware vendor to provide a patch, which may take months. When the patch is finally delivered, the customer will need to go through their own qualification process for the new security patch, which may take many more months.
Skipping the qualification process is akin to rolling the dice on new potential unknown bugs (a very common occurrence with new device OS versions). This may potentially cause bigger problems, such as new security vulnerabilities or even outages. Gartner analyst Andrew Lerner wrote a great blog about the pain involved in network upgrades, where he compares the process to going to the dentist!
By taking a software-first approach, Intent-Based Networking enables companies to qualify new hardware or software very rapidly, and upgrade to those versions very quickly:
- If you learn that a version of a Switch Operating System that you have deployed has a security vulnerability, then you can quickly upgrade to another version. This is a process that can otherwise take months (8 months on average for businesses we’ve talked to).
- If you learn that a specific piece of hardware that you have installed has a security vulnerability, then you can swap it for another device (this could even be a device from another vendor!) very quickly. Again, this is a process that can otherwise take months. Your software-first deployment ensures that even with a change of devices or vendors, there is no change to the way these products are operated and validated. There is no need to learn anything new.
To learn why “Software-First” Intent-Based Networking gives you that ability, you can read my blog on “software-first” Intent-Based Networking, specifically the section titled “Five Million Tests a Day”, which describes how Apstra has built and operates the most powerful automated testbed in the industry.