Device OS Upgrades: Let’s Make This Simple

This is the second part of a blog series on using AOS to easily operate your data network.  Click here to read part one.

Last time, we were talking about how we can use AOS to gracefully drain traffic off network devices in order to perform maintenance with a “Software-First” approach. You were probably scratching your head at the end saying “Now what?” Obviously, getting application traffic off the network is just step one in the operational process.  So what happens next? Well, typically you are doing one of the following tasks:

  • Troubleshoot device offline
  • Replace failed or damaged device
  • Upgrade the Device OS

Let’s focus on that last option, which every operator has been challenged by.

Network Device OS Upgrades are a required function in any modern enterprise.  The OS can be affected by known or unknown bugs, security vulnerabilities, or more.  The operator may also wish to upgrade the OS in order to activate a new feature. Whatever the reason is, with a fixed form factor device with a single CPU (or supervisor), we can expect some sort of outage due to the reload process.  So as we have previously described, it typically makes sense to place the equipment into maintenance mode before performing these actions.

Multivendor Fabrics? No Problem.

As with every feature in AOS, the Intent is separated from the method by which we accomplish the stated goal. You shouldn’t have to be an expert in every vendor’s OS in order to manage a mixed hardware network. So the device OS upgrade process does not require any CLI input from the administrator. They simply select the devices to be upgraded, pick the proper OS from a drop-down list, and submit the job. AOS can manage multiple simultaneous upgrades and even multiple active jobs for different OS types.

Positioning and Validating the Images

In advance of scheduling the upgrade, the operator should upload the OS images to the AOS server. When the images are being copied, you have the option to store the hash value provided by the vendor, so that each image can be verified as an intact, unmodified copy. This information is stored in the AOS Server. In fact, we can use the hash value to regularly check the devices to make sure no corruption has taken place in our file system.
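To make the hash check concrete, here is a minimal Python sketch of the general technique: compute a digest of the image file and compare it with the value the vendor published. This is not AOS code. The function names are made up for illustration, and SHA-256 is only an assumed default, since vendors publish different hash types.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_valid(path: Path, published_hash: str, algorithm: str = "sha256") -> bool:
    """Compare the computed digest with the vendor-published value (hypothetical helper)."""
    return file_digest(path, algorithm) == published_hash.strip().lower()
```

Running the same comparison on a schedule against the copy stored on each device is one way to catch file system corruption before an upgrade window rather than during it.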

AOS also supports storing the images on a dedicated HTTP server, which allows for an engineering team to centralize all the images even if they are not being used with AOS.

Upgrading, or even Downgrading

Once the job has been kicked off, AOS copies the image to the storage location on each device. When the file copy is complete, the bootfile command is set to the new image and the device is reloaded. If the operator has placed the device into Maintenance Mode, then the device comes back in the same state. If the mode was not set on the device, it will come back up in a fully operational state with the service configuration.
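The sequence above can be sketched as a small Python model. Everything here is hypothetical (the Device class, its fields, and run_os_change are illustrative names, not AOS internals), and the "reload" is simulated by copying the selected boot image into the running version.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy model of a switch; all fields are assumptions for illustration."""
    name: str
    os_version: str
    maintenance_mode: bool = False
    stored_images: set = field(default_factory=set)
    boot_image: str = ""


def run_os_change(device: Device, target_version: str) -> list:
    """Model copy -> set boot image -> reload, and return the steps performed."""
    steps = []
    mode_before = device.maintenance_mode
    if target_version not in device.stored_images:
        device.stored_images.add(target_version)  # copy the image to device storage
        steps.append("copy image")
    device.boot_image = target_version            # point the boot setting at the image
    steps.append("set boot image")
    device.os_version = device.boot_image         # simulated reload boots that image
    steps.append("reload")
    device.maintenance_mode = mode_before         # device returns in the same mode
    return steps
```

Because the function only cares about the target version, the same path serves both directions: calling it with an older version models a downgrade.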

You can use this workflow to both upgrade and downgrade the OS. Sometimes when we upgrade we encounter new problems and have to go back to the previous version. For this reason, we recommend storing all OS images on the AOS Server or the related HTTP server.

Hey, that wasn’t there before!

Frequently when you perform an upgrade, new default command settings appear in the running configuration syntax. This is due to subtle changes in the vendor’s source code. Typically these changes can be anticipated by reading the release notes, but on occasion new commands or even modified default settings will appear after a reload. AOS provides an easy way to identify these problems and resolve them.

In an AOS managed network, every device configuration is checked every 60 seconds for changes. When a device is reloaded and the AOS Agent is activated during boot, the configuration is checked immediately. ANY CHANGE, no matter how subtle, will be identified by AOS and presented to the administrator as a Config Anomaly. At that point, you can manually adjust the new command settings, or have AOS automate the change, or simply accept the new settings as a baseline, or “Golden” config. Once the change has been accepted, AOS expects that command to be there, and you will be alerted with another anomaly if any other change appears.
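The idea of comparing a running configuration against an accepted baseline can be illustrated with Python's standard difflib module. This is a sketch of the concept only, not how AOS implements it, and the configuration lines are invented examples rather than real vendor defaults.

```python
import difflib


def config_anomalies(golden: str, running: str) -> list:
    """Return lines added to ("+ ") or removed from ("- ") the golden config."""
    diff = difflib.ndiff(golden.splitlines(), running.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only changes.
    return [line for line in diff if line.startswith(("+ ", "- "))]


golden = "hostname leaf1\nrouter bgp 65001"
running = golden + "\nlogging buffered 10000"  # invented post-upgrade default

print(config_anomalies(golden, running))   # ['+ logging buffered 10000']

# Accepting the change makes the running config the new baseline.
golden = running
print(config_anomalies(golden, running))   # []
```

Once the baseline is updated, any further drift shows up as a fresh difference, which mirrors the behavior described above.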

Automated OS Version Compliance Checks

AOS uses Intent-Based Analytics (IBA) to check elements of the network at regular intervals for problems. One of the more popular IBA “probes” is the OS Version Check, which looks at the running OS version on all devices from a certain vendor and triggers an anomaly if the version does not match. In AOS 3.0 we augmented this function with the Global SLA feature. This allows a member of the network or security team to set a preferred value once in AOS and refer to that variable in any number of IBA probes. For example, you could set the approved EOS version for Arista devices to “”. This is like envvars for your entire network.

The OS Check probe will then create an anomaly for every device running any version that differs from that. If the security team is alerted to a vulnerability in that version, they can simply change the version listed in the SLA and the IBA probes will automatically be updated to look for the new version.

As a result, all devices that were running the old version will now show an anomaly, and the count of these anomalies defines how many OS upgrades you need to do to be in compliance. The security team can watch this number in real time; it will decrease with every successful OS upgrade. By using Role Based Access Control (RBAC), any member of the organization can be fully in the loop on the OS remediation process.
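Here is a rough Python sketch of the compliance logic described above: one shared approved value, a check against every device, and an anomaly count that shrinks as upgrades complete. The function name, device names, and version strings ("A.1", "A.2") are placeholders, not AOS probe definitions or real OS releases.

```python
def os_version_anomalies(running_versions: dict, approved_version: str) -> list:
    """Return the names of devices whose running version differs from the approved one."""
    return sorted(
        name for name, version in running_versions.items()
        if version != approved_version
    )


fleet = {"leaf1": "A.1", "leaf2": "A.1", "spine1": "A.1"}

# Everything matches the approved value, so there are no anomalies.
print(os_version_anomalies(fleet, "A.1"))       # []

# The security team changes the single approved value; every device is now flagged.
print(len(os_version_anomalies(fleet, "A.2")))  # 3

# Each successful upgrade lowers the count.
fleet["leaf1"] = "A.2"
print(os_version_anomalies(fleet, "A.2"))       # ['leaf2', 'spine1']
```

Keeping the approved version in one place is the point: the check itself never changes, only the value it is compared against.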

9 out of 10 Engineers Agree, Do Your Upgrades!

AOS was designed to improve the lives of operators and increase the efficiency of businesses by rapidly upleveling the capabilities of the people who manage these systems. Prior to using AOS, Bloomberg had a single engineer working on OS upgrades, taking upwards of 8 months to upgrade 174 switches. The same tasks could have been completed with AOS Maintenance Mode and Device OS Upgrade in approximately 87 hours. In fact, with parallel OS upgrade job support, the entire network could have been upgraded in a single day.
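The arithmetic behind those figures is easy to check. The short Python calculation below uses only the two numbers quoted above (174 switches, roughly 87 hours) and assumes, for illustration, that every switch takes the same time and that parallel jobs do not slow each other down. It also ignores redundancy constraints, such as not upgrading both members of an MLAG pair at once.

```python
import math

SWITCHES = 174
SERIAL_HOURS = 87

# 87 hours spread over 174 switches is 30 minutes per switch.
minutes_per_switch = SERIAL_HOURS * 60 / SWITCHES
print(minutes_per_switch)  # 30.0


def wall_clock_hours(parallel_jobs: int) -> float:
    """Elapsed hours when switches are upgraded in equal-sized parallel batches."""
    batches = math.ceil(SWITCHES / parallel_jobs)
    return batches * minutes_per_switch / 60


print(wall_clock_hours(1))  # 87.0
print(wall_clock_hours(3))  # 29.0
print(wall_clock_hours(4))  # 22.0
```

Under these assumptions, as few as four upgrades running side by side would bring the whole job under 24 hours.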

While OS upgrades are typically not the most exciting task, they are absolutely required and need to be completed in relatively short time periods. Don’t believe me? Check out what Andrew Lerner at Gartner had to say about it. With the automated tools built into AOS, we can easily upgrade complex multivendor network topologies with a consistent workflow and validation assistance. This ensures that the upgrades are completed quickly and accurately, allowing you to return to working on more challenging projects.

For detailed documentation on how Device OS Upgrades function, please contact Apstra Sales at

* This article was originally published here


Network Maintenance Mode: Improving Your Network and Improving Your Life

If you’ve ever worked in network operations, you undoubtedly have a few crazy stories about network outages and the unusual hours you had to work to resolve them. Outages never seem to happen at 10AM on Monday when everyone is at their desk, freshly loaded with coffee. No, the network Gods seem intent on things failing in the middle of the night or when traffic is close to its peak. And if the failure was during working hours, undoubtedly you’ll have to do some sort of remediation at unpleasant times, without the benefit of having the entire team available to help.

Why has this become the standard process for dealing with issues? Shouldn’t there be some intelligent software that can automate workflows across an entire network regardless of what vendor hardware is being used? (Spoiler Alert: A “Software First” approach will help solve many of these issues).

No Shortage of Potential Problems

The number of potential problems in a network is unlimited. Devices misbehave or fail outright, optics fail regularly, copper and fiber cables become crimped or even cut, and security vulnerabilities affecting multiple device OSes are routine. As owners and caretakers of these environments, we are responsible for designing and operating the network to avoid the most common issues, but we can always expect some sort of issue to appear.

Hardware failures will always exist. We can design redundant hardware and topologies, but when things break, how can we easily replace the equipment without causing further degradation to services?

Software bugs and issues don’t always appear immediately. Sometimes specific conditions set them off, and sometimes increased load drives them to the surface. Occasionally we discover them through notifications from the vendor, typically in the form of a security notice or PSIRT.

Probably worse than outright failures are “gray failures” which add uncertainty to our networks. A device that is dropping a large number of packets still needs to be replaced, but until the operator has been able to perform this work, real application traffic will continue to be forwarded through it.

Replacement Challenges

When a device has to be replaced but is still participating in the routing and switching for the fabric, several steps need to be taken to ensure that traffic is gracefully drained. For example, we don’t want to simply shut down the BGP routing process, as this would impact existing traffic flows. We also don’t want to modify the L2 switching plane as this could prove disastrous to the rest of the network. We want a consistent and well-tested routine for placing devices into maintenance mode. With the increased level of trust in this routine, we can perform corrective actions during more potential change windows.

Operator Quality of Life

Most network operators are used to the fact that the business is rarely tolerant of changes in the middle of the day. In fact, as the network provides the lifeblood for all of IT, most businesses do not permit changes outside of very small maintenance windows, which are typically early morning on the weekends. Who wants to go into their datacenter at 4AM on Sunday morning? We need a tool that works across different platforms and vendors to reliably drain a device of application traffic and place it into a quarantined state while we repair it.

Introducing AOS Maintenance Mode

Apstra AOS was built by network engineers, for network engineers. As a result, it includes tools and workflows specifically designed to address the challenges of running a modern data center. AOS supports a complete workflow for taking fabric devices out of service while minimizing impact to active flows. In addition, these workflows are vendor-agnostic, so the same process will occur on different vendor devices even if the command syntax and workflow differ. Lastly, and probably most importantly, by placing a device into Maintenance Mode, we update the overall intent for the network, so monitoring and service expectations are automatically adjusted. No more false positives or the famous “sea of red” in our dashboards.

AOS enables Maintenance Mode by making small changes to route filters within the BGP process. This causes packets to be routed onto alternate paths according to the ECMP load balancing algorithm. AOS can also shut down server-facing ports to force traffic onto the MLAG peer. These combined changes result in one or fewer lost packets within our application flows, which are easily recovered by TCP or upper-layer protocols. This change occurs in seconds, and the device can then be removed from the network for corrective actions.

Adding the route-map to the BGP process eliminates path selection through the drained node.


Draining devices with disabled ports on an MLAG pair.
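To picture what the route filter change does to path selection, here is a deliberately simplified Python model. It is an assumption-laden sketch: real ECMP hashing is vendor-specific and done in hardware, and the function names and the SHA-256 based hash are stand-ins chosen for illustration. The point is only that once the drained node stops being a valid next hop, every flow lands on one of the remaining paths.

```python
import hashlib


def usable_next_hops(all_hops: list, drained: set) -> list:
    """Next hops still offered once routes via drained nodes are filtered out."""
    return [hop for hop in all_hops if hop not in drained]


def pick_next_hop(flow: tuple, next_hops: list) -> str:
    """Choose a path by hashing the flow tuple over the available next hops."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]


spines = ["spine1", "spine2", "spine3", "spine4"]
# (source IP, destination IP, protocol, source port, destination port)
flows = [("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443) for i in range(100)]

remaining = usable_next_hops(spines, drained={"spine2"})
paths = {pick_next_hop(flow, remaining) for flow in flows}
print("spine2" in paths)  # False
```

With this naive modulo hashing, flows that were on healthy spines can also move when the path count changes. How much they move on a real switch depends on the hashing scheme the platform uses.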

AOS also provides advanced monitoring of drained devices with prebuilt Intent-Based Analytics probes. These probes run within the network all the time. When a device enters maintenance mode, an anomaly is issued if the device has more than a standard level of traffic flowing through the fabric ports. This ensures that we do not modify or power off devices that have not been fully drained. The operator can select the exact traffic level they wish to alert on.
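The drain check itself boils down to a threshold comparison. The Python sketch below shows the shape of that logic with made-up port names and rates; it is not the actual IBA probe definition, and the function names and the bits-per-second unit are assumptions.

```python
def drain_anomalies(fabric_port_bps: dict, threshold_bps: float) -> list:
    """Return fabric ports still carrying more traffic than the alert threshold."""
    return sorted(
        port for port, rate in fabric_port_bps.items() if rate > threshold_bps
    )


def safe_to_power_off(in_maintenance_mode: bool, fabric_port_bps: dict,
                      threshold_bps: float) -> bool:
    """A device is safe to remove only if it is in maintenance mode and fully drained."""
    return in_maintenance_mode and not drain_anomalies(fabric_port_bps, threshold_bps)


# Hypothetical per-port rates, in bits per second, after entering maintenance mode.
rates = {"fabric1": 1_200.0, "fabric2": 48_000_000.0}

print(drain_anomalies(rates, threshold_bps=10_000.0))          # ['fabric2']
print(safe_to_power_off(True, rates, threshold_bps=10_000.0))  # False
```

A small nonzero threshold is a reasonable choice here, since control-plane chatter such as LLDP can keep a drained port from ever reading exactly zero.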

Once the device has entered Maintenance Mode, we can perform other automated actions. In AOS this typically involves moving a device into the Ready state. The Ready state shuts down all L2 and L3 features (L3 routed mode for all interfaces, no neighbor adjacencies, LLDP only).

Once a device is in the Ready state we can simply turn it off and remove it from the rack. A replacement device can be the same vendor hardware or even a different vendor or form factor, freeing us up from finding the exact model previously used. In fact, you can use this method to gradually replace existing switches or even a vendor completely, as AOS automatically renders a configuration for each vendor type without the operator having to do anything aside from selecting the new vendor type from a dropdown box. So we have complete freedom to insert new hardware and ensure that it behaves exactly as the previous device did.

Finally, this process is perfect for performing NOS upgrades.

Stay tuned for Part 2 of this series on how AOS automates and validates NOS upgrades across an entire IP fabric with a few simple clicks in the UI.

For detailed documentation on how Maintenance Mode functions, please contact Apstra Sales at

* This article was originally published here
