The story of one switch


In our LAN aggregation layer there were six pairs of Arista DCS-7050CX3-32S switches and one pair of Brocade VDX 6940-36Q switches. It is not that the Brocade switches were particularly troublesome: they worked and did their job. But we were preparing full automation of certain actions, and these switches lacked the capabilities we needed. We also wanted to move from 40GE interfaces to the option of using 100GE, to leave ourselves headroom for the next 2-3 years. So we decided to swap the Brocade pair for Arista.

These switches aggregate the LAN of an entire data center. Distribution switches (the second level of aggregation) connect directly to them, and the Top-of-Rack access switches in the server racks connect, in turn, to the distribution switches.


Each server connects to one or two access switches. Each access switch connects to a pair of distribution switches: two distribution switches and two physical links from the access switch to different distribution switches are used for redundancy.

Each server belongs to a client, so a dedicated VLAN is allocated per client. The same VLAN is then assigned to that client's other servers in any rack. The data center consists of several rows of racks (PODs); each row has its own distribution switches, and those distribution switches connect to the aggregation switches.


Clients can order a server in any row, and there is no way to predict in advance which particular row or rack a server will be allocated or installed in. As a result, there are about 2500 VLANs on the aggregation switches in each data center.
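
To make the VLAN-per-client idea concrete, here is a minimal sketch in Arista-style syntax (the VLAN number and interface names are made-up placeholders, not our actual values): the client VLAN is set as the access VLAN on the server port of the ToR switch and carried in the trunk up towards the distribution and aggregation layers.

! ToR access switch: server port of client C1
vlan 1005
   name client-C1
interface Ethernet10
   description server of client C1
   switchport mode access
   switchport access vlan 1005
! uplink towards the distribution pair carries the client VLAN in a trunk
interface Port-Channel1
   switchport mode trunk
   switchport trunk allowed vlan add 1005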

The DCI (Data Center Interconnect) equipment also connects to the aggregation switches. It provides either L2 connectivity (a pair of switches forming a VXLAN tunnel to another data center) or L3 connectivity (two MPLS routers).
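
For the L2 variant, the idea is that the pair of switches next to the aggregation stretches the client VLAN into a VXLAN tunnel towards the remote data center. A hedged sketch of such a mapping in Arista-style syntax (the VNI, loopback and remote VTEP address are illustrative assumptions, and the actual VXLAN switches need not be Arista):

interface Loopback0
   ip address 192.0.2.1/32        ! local VTEP address (placeholder)
interface Vxlan1
   vxlan source-interface Loopback0
   vxlan udp-port 4789
   vxlan vlan 1005 vni 101005     ! client VLAN stretched to the other data center
   vxlan flood vtep 192.0.2.2     ! remote VTEP (placeholder), head-end replication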


As I already wrote, to unify the processes of automating service configuration on the equipment within one data center, the central aggregation switches had to be replaced. We installed the new switches next to the existing ones, combined them into an MLAG pair and began preparing for the work. They were immediately connected to the existing aggregation switches, so that both pairs shared a common L2 domain across all client VLANs.

Details of the scheme


For concreteness, let us call the old aggregation switches A1 and A2, and the new ones N1 and N2. Imagine that PODs 1 and 4 contain servers of client C1, whose VLAN is shown in blue. This client uses the L2 connectivity service to another data center, so its VLAN is served by the pair of VXLAN switches.

Client C2 has servers in PODs 2 and 3; we mark this client's VLAN in dark green. This client also uses a connectivity service to another data center, but an L3 one, so its VLAN is served by the pair of L3VPN routers.


We need these client VLANs to understand what happens at each stage of the replacement, where the communication break occurs and how long it may last. STP is not used in this scheme: the tree would be too wide, and the protocol's convergence time grows rapidly with the number of devices and links between them.

All devices connected by double links form a stack, an MLAG pair or a VCS Ethernet fabric. No such technology is used for the pair of L3VPN routers: they do not need L2 redundancy, and it is enough that they have L2 connectivity to each other through the aggregation switches.
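
For reference, here is roughly what an MLAG pair involves on the Arista side, as a deliberately minimal sketch (the domain name, VLAN 4094, addresses and port numbers are textbook placeholders, not our production values). The two switches synchronize state over a peer link and present a single LACP system-id to every dual-homed neighbor:

! the same idea on both N1 and N2, with the peer addresses mirrored
vlan 4094
   trunk group mlagpeer
no spanning-tree vlan 4094
interface Port-Channel10
   description MLAG peer link
   switchport mode trunk
   switchport trunk group mlagpeer
interface Vlan4094
   ip address 10.255.255.1/30     ! 10.255.255.2/30 on the peer
mlag configuration
   domain-id AGG1
   local-interface Vlan4094
   peer-address 10.255.255.2
   peer-link Port-Channel10
! a dual-homed downlink: the same mlag id on both peers makes it one logical Po
interface Port-Channel20
   description to distribution pair D
   switchport mode trunk
   mlag 20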

Implementation options


While analyzing how events could unfold, we realized there are several ways to carry out this work: from a global outage of the entire local network down to small, literally 1-2 second interruptions in parts of the network.

Network, stand! Switches, replace!


The easiest way is, of course, to announce a global outage of all PODs and all DCI services and move all the links from the A switches to the N switches in one go.


Besides the break itself, whose duration we cannot reliably predict (yes, we know the number of links, but we do not know how many times something will go wrong, from a broken patch cord or a damaged connector to a faulty port or transceiver), we also cannot know in advance whether the patch cords, DACs and AOCs plugged into the old A switches are long enough to reach the new N switches, which stand next to them but still slightly apart, nor whether the transceivers, DACs and AOCs taken from the Brocade switches will work in the Arista switches.

And all of this under heavy pressure from customers and technical support ("Natasha, wake up! Natasha, nothing works over there! Natasha, we have already written to technical support, honestly! Natasha, everything is down! Natasha, how much longer will it be down? Natasha, when will it work?!"). Even with a break announced in advance and customers notified, a flood of calls at such a moment is guaranteed.

Wait, 1-2-3-4!


What if, instead of declaring one global outage, we announce a series of small communication breaks, per POD and per DCI service? In the first window we switch only POD 1 to the N switches; in the second, a couple of days later, POD 2; then, a couple of days after that, POD 3; then POD 4 and so on up to POD N; then the VXLAN switches; and finally the L3VPN routers.


With the work organized this way, we reduce the scope of any single change and give ourselves more time to deal with problems if something suddenly goes wrong. After its switchover, POD 1 does not lose connectivity with the other PODs and the DCI services. But the work stretches out over a long time; for each window an engineer has to be in the data center to physically perform the switching, and since such work is usually done at night, from 2 to 5 in the morning, a fairly highly qualified network engineer has to be online as well. On the other hand, the breaks are short: as a rule, the work fits into a half-hour window with a break of up to 2 minutes (in practice, often 20-30 seconds when the equipment behaves as expected).

In the example above, clients C1 and C2 would have to be warned about maintenance with a communication break at least three times: first for the work on one POD hosting one of their servers, then for the second POD, and a third time when the DCI equipment is switched.

Switching aggregated communication channels


Why do we keep saying "when the equipment behaves as expected", and how can aggregated channels be switched with minimal interruption? Imagine the following picture:


On one side of the link are the POD distribution switches D1 and D2, which form an MLAG pair (stack, VCS fabric, vPC pair). On the other side, two links, Link 1 and Link 2, terminate on the old aggregation MLAG pair A. On the D switches the aggregated interface is called Port-channel A (Po A); on the aggregation switches A it is called Port-channel D (Po D).

Aggregated interfaces use LACP: the switches on both sides regularly exchange LACPDU packets on both links to make sure that the links:

  • are actually working;
  • terminate on the same partner device.

Each LACPDU carries a system-id value that identifies the device the link terminates on. For an MLAG pair (stack, fabric, etc.) the system-id of the devices forming the aggregated interface is the same: switch D1 sends system-id D on Link 1, and switch D2 sends the same system-id D on Link 2.

Switches A1 and A2 analyze the LACPDUs received on the same Po D interface and verify that the system-id in them matches. If the system-id received on some link suddenly differs from the current working value, that link is removed from the aggregated interface until the situation is corrected. So for the switches on side D the current LACP partner system-id is A, and for the switches on side A the current LACP partner system-id is D.
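
Configuration-wise, nothing exotic is needed for this behavior: the member ports simply run LACP in active mode and inherit the shared system-id from the MLAG pair. A sketch with placeholder port numbers:

! on D1: member port of Po A (Link 1)
interface Ethernet49/1
   description Link 1 to aggregation A
   channel-group 100 mode active    ! LACP active: advertises system-id D
! on D2: member port of Po A (Link 2), same channel-group number
interface Ethernet49/1
   description Link 2 to aggregation A
   channel-group 100 mode active
! the partner system-id seen on each member link:
! show lacp neighbor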

If we need to switch the aggregated interface to new hardware, we can do it in two different ways.

Method 1 - Simple
We disconnect both links from the A switches at once. The aggregated interface on the D side loses all of its members, and communication through it stops.

Then we connect both links to the N switches, where the aggregated interface Po D is configured in advance. LACP negotiates, the D switches receive the new system-id N on both links, the aggregate reassembles, and communication is restored. The break lasts for the entire time of re-cabling plus LACP negotiation.



Method 2 - Minimize the Break
First we disconnect only Link 2 from switch A2. Traffic keeps flowing through Link 1; from the point of view of the D switches, one member has simply left the aggregate, and nothing else has happened.


We connect Link 2 to switch N2. The aggregated interface Po DN is configured on the N switches in advance, so as soon as N2 sees Link 2 come up, it starts sending LACPDUs with system-id N into it.


Switch D2 now receives LACPDUs with system-id N on Link 2, which is still a member of Po A. This system-id differs from the current working value A, so the D switches exclude Link 2 from Po A. On the N side the link does not become active either, since LACP negotiation with D2 has not completed. For now, Link 2 carries no traffic.

Then we disconnect Link 1 from A1, and communication through the aggregated interface stops. At this point the D switches no longer receive system-id A on any link of Po A.


The D and N switches then negotiate the new pair of system-ids over Link 2 and bring up Po A on one side and Po DN on the other. Communication is restored; the break lasts only as long as this renegotiation takes, in practice a couple of seconds.


Finally, we move Link 1 to N1, and it joins the Po A / Po DN aggregate. Its system-id matches the current working value, so the link is added without any interruption.
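
Collapsed into a CLI checklist, the second method looks roughly like this (a sketch: the interface numbers are placeholders, and the A-side commands are written in Arista-like syntax for brevity, although in our case the A switches were Brocade):

! step 1, on A2: take Link 2 out of service; traffic stays on Link 1
interface Ethernet1
   shutdown
! step 2: re-cable Link 2 to N2, where Po DN with LACP is configured in advance;
!         D2 sees system-id N on Link 2 and keeps it out of Po A - still no break
! step 3, on A1: drop Link 1; the short break starts here
interface Ethernet1
   shutdown
! step 4: D and N renegotiate LACP over Link 2, Po A / Po DN comes up - break over
! step 5: re-cable Link 1 to N1; it joins the aggregate without any interruption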



Additional links


But the switching can also be done without an engineer being present at the moment of the switchover. To do this, we need to lay additional links in advance between the D distribution switches and the new N aggregation switches.


We lay new links between the N aggregation switches and the distribution switches of every POD. This means ordering and laying additional patch cords and installing additional transceivers both in the N switches and in the D switches. We can do this because there are free ports in the D switches of each POD (or we free them up first). As a result, each POD is physically connected by two links both to the old A switches and to the new N switches.


Two aggregated interfaces are formed on the D switches: Po A with links Link 1 and Link 2, and Po N with links Link N1 and Link N2. At this stage we verify that the interfaces and links are connected correctly, check the optical signal levels at both ends of the links (via DDM data from the switches), and can even test the links under load or watch the optical levels and transceiver temperatures for a couple of days.
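
On the Arista side these checks boil down to a few show commands (listed as a sketch, without reproducing their output here):

show interfaces transceiver        ! DDM data: optical levels, transceiver temperature
show interfaces counters errors    ! error counters on the member links
show port-channel summary          ! membership and state of Po A and Po N
show lacp neighbor                 ! partner system-id on each member link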

Traffic is still carried by the Po A interface, while Po N carries none. The interface settings look approximately as follows:

interface Port-Channel A
   switchport mode trunk
   switchport trunk allowed vlan C1, C2

interface Port-Channel N
   switchport mode trunk
   switchport trunk allowed vlan none

The D switches, as a rule, support configuration sessions (we deliberately use switch models that have this functionality), so we can change the settings of the Po A and Po N interfaces in one atomic step:

configure session
interface Port-Channel A
   switchport trunk allowed vlan none
interface Port-Channel N
   switchport trunk allowed vlan C1, C2
commit

The configuration change then happens quickly enough, and in practice the break is no more than 5 seconds.

This method lets us do all the preparatory work in advance, run all the necessary checks, coordinate the work with everyone involved, and plan the actions in detail, instead of improvising when "everything went wrong", with a rollback plan ready at hand. The work itself is performed by a network engineer, with no need for a data center engineer on site to physically re-plug anything.

What matters even more with this method is that all the new links are already covered by monitoring before the switchover. Errors, links joining the aggregate, link utilization: all the necessary information is already in the monitoring system and already drawn on the network maps.

D-day


PODs


We chose the path that was least painful for customers and least prone to "something went wrong": switching via additional links. Over a couple of nights we moved all the PODs to the new aggregation switches.


What remained was to switch the equipment providing the DCI services.

L2


For the equipment providing L2 connectivity, we could not do similar work with additional links, for at least two reasons:

  • Lack of free ports of the required speed on VXLAN switches.
  • Lack of functionality for session configuration changes on VXLAN switches.

We did not switch the links "one at a time" with a break only for the duration of the system-id renegotiation either, because we were not 100% confident the procedure would go correctly, and a lab test showed that if "something goes wrong", we would still get a communication break, and worst of all not just for the customers using L2 connectivity with other data centers, but for all customers of the data center.

We had been campaigning for customers to migrate away from L2 DCI channels well in advance, so the number of customers affected by work on the VXLAN switches was already several times smaller than a year before. In the end we decided to accept an interruption of the L2 connectivity service, provided that the local network services within the data center kept working normally. Besides, the SLA for this service allows scheduled maintenance with an interruption.

L3


Why did we recommend that everyone switch to L3VPN for DCI services? One of the reasons is the ability to do maintenance on one of the routers providing the service with nothing worse than a temporary reduction of the redundancy level to N+0, without any interruption of communication.

Let us look at the service scheme more closely. In this service the L2 segment extends from the client servers only as far as the Selectel L3VPN routers, where the client network is terminated.

Each client server, for example S2 and S3 in the diagram, has its own private IP address: 10.0.0.2/24 for server S2 and 10.0.0.3/24 for server S3. The addresses 10.0.0.252/24 and 10.0.0.253/24 are assigned by Selectel to the routers L3VPN-1 and L3VPN-2, respectively. The IP address 10.0.0.254/24 is a VRRP VIP address on the Selectel routers.

You can read more about the L3VPN service on our blog.

Until the moment of switching, everything looked approximately as in the diagram:


The two routers L3VPN-1 and L3VPN-2 were connected to the old aggregation switches A. The master for the VRRP VIP address 10.0.0.254 is the L3VPN-1 router: its priority for this address is set higher than that of the L3VPN-2 router.

unit 1006 {
    description C2;
    vlan-id 1006;
    family inet {
        address 10.0.0.252/24 {
            vrrp-group 1 {
                priority 200;
                virtual-address 10.0.0.254;
                preempt {
                    hold-time 120;
                }
                accept-data;
            }
        }
    }
}

Server S2 uses the 10.0.0.254 gateway to communicate with servers in other locations. Thus, disconnecting the L3VPN-2 router from the network (after first disconnecting it from the MPLS domain, of course) does not affect the connectivity of the client servers; the redundancy level of the scheme simply decreases for a while.


After that, we can safely reconnect the L3VPN-2 router to the pair of N switches: lay the links, change the transceivers. The router's logical interfaces that client services depend on stay disabled until we confirm that everything functions as it should.
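
In Junos terms, keeping a client-facing unit down while the physical layer is checked is a single disable statement on the unit shown earlier (a sketch; the rest of the unit is abbreviated):

unit 1006 {
    description C2;
    disable;    /* unit stays administratively down until the checks pass */
    family inet {
        /* address and vrrp-group configuration as shown above */
    }
}

Deleting the disable statement brings the client services back once we are satisfied with the links.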

After checking the links, transceivers, signal levels and error counters on the interfaces, the router goes back into operation, now connected to the new pair of switches.


Next, we lower the VRRP priority of the L3VPN-1 router, and the VIP address 10.0.0.254 moves to the L3VPN-2 router. This step is also carried out without interrupting communication.
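
In set-command form this failover is a one-line change on L3VPN-1 (the interface name ae1 is an illustrative assumption; any priority below that of L3VPN-2 will do, provided preempt is configured on both routers):

set interfaces ae1 unit 1006 family inet address 10.0.0.252/24 vrrp-group 1 priority 50
commit

After the commit, show vrrp summary on L3VPN-2 should report it as master for 10.0.0.254.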


Moving the VIP address 10.0.0.254 to the L3VPN-2 router lets us disconnect the L3VPN-1 router without any interruption for the client and connect it, in turn, to the new pair of aggregation switches N.


Whether to move the VRRP VIP back to the L3VPN-1 router is a separate question; if we do, that is also done without interruption.

Summing up


After all these steps we had actually replaced the aggregation switches in one of our data centers, while keeping the breaks for our customers to a minimum.


All that remains is the dismantling: removing the old switches, removing the old links between the A and D switches, pulling the transceivers from those links, and updating the monitoring system and the network diagrams in the documentation.

The switches, transceivers, patch cords, AOCs and DACs freed up by the switchover can be reused in other projects or in similar migrations.

“Natasha, we switched everything!”
