Network automation: a real-life case

Hello, Habr!

In this article we would like to talk about automating network infrastructure. We will present the diagram of a working network that runs in one small but very proud company. Any resemblance to real network hardware is coincidental. We will look at an incident in this network that could have halted the business for a long time and caused serious financial losses. The solution fits very well into the concept of "network infrastructure automation". Using automation tools, we will show how complex problems can be solved effectively in a short time, and reflect on why it is more promising to solve such tasks this way rather than by hand through the console.

Disclaimer

Our main automation tools are Ansible (as the automation engine) and Git (as the repository for Ansible playbooks). I want to say up front that this is not an introductory article: we do not explain how Ansible or Git work, or cover basics such as what roles, tasks, modules, inventory files, and variables are in Ansible, or what happens when you run git push or git commit. This is not a story about practicing Ansible by configuring NTP or SMTP on your equipment. It is a story about how to solve a network problem quickly and, preferably, without errors. It also helps to have a good idea of how a network operates, in particular what the TCP/IP protocol stack, OSPF, and BGP are. The choice of Ansible and Git is also outside the scope of this article. If you are still choosing a specific solution, we highly recommend the book "Network Programmability and Automation: Skills for the Next-Generation Network Engineer" by Jason Edelman, Scott S. Lowe, and Matt Oswalt.


Now to the point.

Problem statement


Imagine the situation: it is 3 a.m., you are sound asleep and dreaming. The phone rings. It is the technical director:

- Yes?
- ###, ####, #####, the firewall cluster has gone down and won't come back up!!!

You rub your eyes, try to grasp what is happening, and imagine how such a thing could have happened. Over the phone you can hear the director tearing his hair out, and he asks you to call back because the CEO is calling him on the other line.

Half an hour later, you have gathered the first details from the shift on duty and woken up everyone you could. It turns out the technical director was not lying: the main firewall cluster is down, and none of the basic measures bring it back to life. None of the services the company offers are working.

Pick a cause to your taste; everyone will recall something different. For example, after an overnight update everything worked fine while the load was light, and everyone went to bed satisfied. Then traffic picked up, and the interface buffers began to overflow because of a bug in the network card driver.

Jackie Chan describes the situation well.



Thanks, Jackie.

The situation is not very pleasant, is it?

Let's leave our network bro alone with his sad thoughts for a while.

We will discuss how events unfold from here.

We will present the material in the following order:

  1. Look at the network diagram and analyze how it works;
  2. Describe how we transfer settings from one router to another using Ansible;
  3. Talk about IT infrastructure automation in general.

Network diagram and its description


Scheme





Let's look at how our network is organized. We will not name specific equipment manufacturers; within this article it does not matter (the attentive reader will guess what equipment is used). This is one of the nice advantages of working with Ansible: when configuring, we largely do not care what kind of equipment it is. Just so you understand, the equipment comes from well-known vendors such as Cisco, Juniper, Check Point, Fortinet, Palo Alto... substitute your own.

The network has two main traffic-handling tasks:

  1. Publish the services that make up the company's business;
  2. Provide connectivity with branches, a remote data center, and third-party organizations (partners and customers), as well as Internet access for branches through the central office.

Let's start with the basic elements:

  1. Two border routers (BRD-01, BRD-02);
  2. Firewall cluster (FW-CLUSTER);
  3. Core switch (L3-CORE);
  4. A router that will serve as our lifeline (EMERGENCY); while solving the problem, we will transfer the network settings from FW-CLUSTER to it;
  5. Management network switches (L2-MGMT);
  6. Virtual machine with Git and Ansible (VM-AUTOMATION);
  7. A laptop on which Ansible playbooks are developed and tested (Laptop-Automation).

The OSPF dynamic routing protocol is configured on the network with the following areas:

  • Area 0: contains the routers responsible for moving traffic in the EXCHANGE zone;
  • Area 1: contains the routers responsible for the company's services;
  • Area 2: contains the routers responsible for routing management traffic;
  • Area N: branch network areas.

On each border router a virtual router (VRF-INTERNET) is created, on which eBGP with a full view is brought up under the assigned AS. iBGP is configured between the VRFs. The company has a pool of public ("white") addresses that are announced from these VRF-INTERNETs. Some of the public addresses are routed directly to FW-CLUSTER (the addresses on which the company's services run), and some are routed through the EXCHANGE zone (internal company services that need external IP addresses, and external NAT addresses for the offices). From there, traffic reaches the virtual routers created on L3-CORE with public and private ("gray") addresses (the security zones).

The management networks use dedicated switches and form a physically separate network. The management network is also divided into security zones.
The EMERGENCY router physically and logically duplicates FW-CLUSTER. All of its interfaces are disabled except those facing the management network.
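The standby rule for EMERGENCY (everything disabled except the management-facing interfaces) lends itself to a simple pre-incident check. Below is a minimal Python sketch, assuming the interface states have already been collected into a dictionary by some other means. The interface names and the `MGMT_INTERFACES` set are hypothetical and are not taken from the real network.

```python
# Hypothetical names: the article does not list the real EMERGENCY interfaces.
MGMT_INTERFACES = {"mgmt-1", "mgmt-2"}


def standby_violations(interface_enabled):
    """Return the interfaces whose state breaks the standby rule.

    interface_enabled maps interface name -> True (enabled) / False (disabled).
    Rule: management interfaces must be enabled, all others must be disabled.
    """
    violations = []
    for name, enabled in sorted(interface_enabled.items()):
        should_be_enabled = name in MGMT_INTERFACES
        if enabled != should_be_enabled:
            violations.append(name)
    return violations


# Example: one production interface was left enabled by mistake.
state = {"mgmt-1": True, "mgmt-2": True, "ext-internet": False, "vlan-1001": True}
print(standby_violations(state))
```

An empty result means the standby router is in the state the switchover procedure expects.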

Automation and its description


We have figured out how the network works. Now let's go step by step through what we will do to move traffic from FW-CLUSTER to EMERGENCY:

  1. Disable the interfaces on the core switch (L3-CORE) that connect it to FW-CLUSTER;
  2. Disable the interfaces on the L2-MGMT switch that connect it to FW-CLUSTER;
  3. Configure the EMERGENCY router (by default, all interfaces are disabled on it, except those associated with L2-MGMT):

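  • Enable the interfaces on EMERGENCY;
  • Configure the external IP address (for NAT) that was on FW-CLUSTER;
  • Generate gARP requests so that the MAC addresses in the L3-CORE ARP tables change from FW-CLUSTER to EMERGENCY;
  • Enable the static default route towards BRD-01 and BRD-02;
  • Change the NAT rules;
  • Enable OSPF Area 1 on EMERGENCY;
  • Enable OSPF Area 2 on EMERGENCY;
  • Change the cost of the Area 1 interfaces to 10;
  • Change the default cost in Area 1 to 10;
  • Change the IP addresses on the interfaces facing L2-MGMT (to those that were on FW-CLUSTER);
  • Generate gARP so that the MAC addresses in the L2-MGMT ARP tables change from FW-CLUSTER to EMERGENCY.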
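To make the gARP steps more tangible, here is a minimal Python sketch of what a gratuitous ARP request looks like on the wire. It is only an illustration: in the scenario above the gARP is generated by the router itself, and the MAC and IP values used below are placeholders (a locally administered MAC and a documentation address), not real ones. The target MAC field is zero-filled here; implementations differ on that detail.

```python
import socket
import struct


def build_garp(mac, ip):
    """Build a gratuitous ARP request as raw Ethernet frame bytes.

    mac: sender MAC as 'aa:bb:cc:dd:ee:ff'; ip: sender IPv4 as a dotted quad.
    """
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    sender_ip = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + sender_mac + struct.pack("!H", 0x0806)

    # ARP header: hardware type Ethernet (1), protocol IPv4 (0x0800),
    # MAC length 6, IP length 4, operation 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # In a gratuitous ARP the sender IP and the target IP are the same address,
    # so every listener refreshes its IP-to-MAC mapping for that address.
    arp += sender_mac + sender_ip + b"\x00" * 6 + sender_ip
    return ethernet + arp


frame = build_garp("02:00:00:00:00:01", "192.0.2.1")
print(len(frame), frame.hex())
```

Because the frame is broadcast and announces the sender's own address, neighbours that already hold an entry for that IP replace the old MAC (FW-CLUSTER) with the new one (EMERGENCY).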

Let's return to the original problem statement. Three in the morning, enormous stress, and a mistake at any stage can lead to new problems. Ready to type the commands through the CLI? Yes? OK, then at least go splash some water on your face, have a coffee, and pull yourself together.
Bruce, please help the guys.



Meanwhile, we carry on building our automation.
Below is a diagram of how the playbook works in Ansible terms. It reflects what we described just above, only as a concrete Ansible implementation.
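The same structure can be written down as plain data. The Python sketch below is not the authors' playbook. It only models the plays, hosts, and task names exactly as they appear in the run output shown later in this article, and runs them strictly in order, stopping at the first failure. The `apply` callback is a hypothetical stand-in for whatever actually pushes configuration to a device.

```python
# Play name, target host, ordered task names (copied from the run output).
PLAYS = [
    ("Emergency on VCF", "vcf", [
        "Disable PROD interfaces to FW-CLUSTER",
    ]),
    ("Emergency on MGMT-CORE", "m9-03-sw-03-mgmt-core", [
        "Disable MGMT interfaces to FW-CLUSTER",
    ]),
    ("Emergency on", "m9-04-r-04", [
        "Enable EXT-INTERNET interface",
        "Generate gARP for EXT-INTERNET interface",
        "Enable static default route to EXT-INTERNET",
        "Change NAT rule to EXT-INTERNET interface",
        "Enable OSPF Area 1 PROD",
        "Enable OSPF Area 2 MGMT",
        "Change OSPF Area 1 interfaces costs to 10",
        "Change OSPF area1 default cost for to 10",
        "Change MGMT interfaces ip addresses",
        "Generate gARPs for MGMT interfaces",
    ]),
]


def run_plays(plays, apply):
    """Run every task in order; stop at the first task for which apply() is falsy.

    apply(host, task) is a hypothetical callback returning True on success.
    Returns the list of (host, task) pairs that completed successfully.
    """
    done = []
    for _play_name, host, tasks in plays:
        for task in tasks:
            if not apply(host, task):
                return done
            done.append((host, task))
    return done


# Dry run: pretend every task succeeds and print the resulting order.
for host, task in run_plays(PLAYS, lambda host, task: True):
    print(f"{host}: {task}")
```

Keeping the order explicit matters here: the links to FW-CLUSTER are shut down before EMERGENCY starts answering for the same addresses.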


At this stage, we have worked out what needs to be done, developed the playbook, and tested it; now we are ready to run it.

Another small digression. Don't let the ease of the narrative mislead you. Writing the playbooks was not as simple and quick as it may seem. Testing took quite a lot of time: a virtual test bench was built, the solution was run through many times, and about 100 tests were conducted.

Here we go... It feels as though everything is happening very slowly, that there is an error somewhere, that something will fail in the end. It is like a parachute jump where the parachute does not want to open right away... that's normal.

Next, we read the output of the Ansible playbook run (the IP addresses have been replaced for secrecy):

[xxx@emergency ansible]$ ansible-playbook -i /etc/ansible/inventories/prod_inventory.ini /etc/ansible/playbooks/emergency_on.yml 

PLAY [------->Emergency on VCF] ********************************************************

TASK [vcf_junos_emergency_on : Disable PROD interfaces to FW-CLUSTER] *********************
changed: [vcf]

PLAY [------->Emergency on MGMT-CORE] ************************************************

TASK [mgmt_junos_emergency_on : Disable MGMT interfaces to FW-CLUSTER] ******************
changed: [m9-03-sw-03-mgmt-core]

PLAY [------->Emergency on] ****************************************************

TASK [mk_routeros_emergency_on : Enable EXT-INTERNET interface] **************************
changed: [m9-04-r-04]

TASK [mk_routeros_emergency_on : Generate gARP for EXT-INTERNET interface] ****************
changed: [m9-04-r-04]

TASK [mk_routeros_emergency_on : Enable static default route to EXT-INTERNET] ****************
changed: [m9-04-r-04]

TASK [mk_routeros_emergency_on : Change NAT rule to EXT-INTERNET interface] ****************
changed: [m9-04-r-04] => (item=12)
changed: [m9-04-r-04] => (item=14)
changed: [m9-04-r-04] => (item=15)
changed: [m9-04-r-04] => (item=16)
changed: [m9-04-r-04] => (item=17)

TASK [mk_routeros_emergency_on : Enable OSPF Area 1 PROD] ******************************
changed: [m9-04-r-04]

TASK [mk_routeros_emergency_on : Enable OSPF Area 2 MGMT] *****************************
changed: [m9-04-r-04]

TASK [mk_routeros_emergency_on : Change OSPF Area 1 interfaces costs to 10] *****************
changed: [m9-04-r-04] => (item=VLAN-1001)
changed: [m9-04-r-04] => (item=VLAN-1002)
changed: [m9-04-r-04] => (item=VLAN-1003)
changed: [m9-04-r-04] => (item=VLAN-1004)
changed: [m9-04-r-04] => (item=VLAN-1005)
changed: [m9-04-r-04] => (item=VLAN-1006)
changed: [m9-04-r-04] => (item=VLAN-1007)
changed: [m9-04-r-04] => (item=VLAN-1008)
changed: [m9-04-r-04] => (item=VLAN-1009)
changed: [m9-04-r-04] => (item=VLAN-1010)
changed: [m9-04-r-04] => (item=VLAN-1011)
changed: [m9-04-r-04] => (item=VLAN-1012)
changed: [m9-04-r-04] => (item=VLAN-1013)
changed: [m9-04-r-04] => (item=VLAN-1100)

TASK [mk_routeros_emergency_on : Change OSPF area1 default cost for to 10] ******************
changed: [m9-04-r-04]

TASK [mk_routeros_emergency_on : Change MGMT interfaces ip addresses] ********************
changed: [m9-04-r-04] => (item={u'ip': u'..n.254', u'name': u'VLAN-803'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+1.254', u'name': u'VLAN-805'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+2.254', u'name': u'VLAN-807'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+3.254', u'name': u'VLAN-809'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+4.254', u'name': u'VLAN-820'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+5.254', u'name': u'VLAN-822'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+6.254', u'name': u'VLAN-823'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+7.254', u'name': u'VLAN-824'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+8.254', u'name': u'VLAN-850'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+9.254', u'name': u'VLAN-851'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+10.254', u'name': u'VLAN-852'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+11.254', u'name': u'VLAN-853'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+12.254', u'name': u'VLAN-870'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+13.254', u'name': u'VLAN-898'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+14.254', u'name': u'VLAN-899'})

TASK [mk_routeros_emergency_on : Generate gARPs for MGMT interfaces] *********************
changed: [m9-04-r-04] => (item={u'ip': u'..n.254', u'name': u'VLAN-803'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+1.254', u'name': u'VLAN-805'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+2.254', u'name': u'VLAN-807'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+3.254', u'name': u'VLAN-809'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+4.254', u'name': u'VLAN-820'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+5.254', u'name': u'VLAN-822'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+6.254', u'name': u'VLAN-823'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+7.254', u'name': u'VLAN-824'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+8.254', u'name': u'VLAN-850'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+9.254', u'name': u'VLAN-851'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+10.254', u'name': u'VLAN-852'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+11.254', u'name': u'VLAN-853'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+12.254', u'name': u'VLAN-870'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+13.254', u'name': u'VLAN-898'})
changed: [m9-04-r-04] => (item={u'ip': u'..n+14.254', u'name': u'VLAN-899'})

PLAY RECAP ************************************************************************

Done!
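At 3 a.m. nobody wants to count "changed" lines by eye. As a small illustration, the Python sketch below tallies the per-host result lines of such a run. It assumes the output has been captured as text, and it counts result lines (one per loop item), not tasks, so it is not a replacement for Ansible's own PLAY RECAP.

```python
import re
from collections import Counter

# Matches result lines such as "changed: [m9-04-r-04] => (item=12)".
RESULT_LINE = re.compile(r"^(ok|changed|failed|skipping|fatal): \[([^\]]+)\]")


def summarize(output):
    """Return {host: Counter({status: number_of_result_lines})}."""
    counts = {}
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            status, host = match.groups()
            counts.setdefault(host, Counter())[status] += 1
    return counts


sample = """\
TASK [vcf_junos_emergency_on : Disable PROD interfaces to FW-CLUSTER] ***
changed: [vcf]
TASK [mk_routeros_emergency_on : Change NAT rule to EXT-INTERNET interface] ***
changed: [m9-04-r-04] => (item=12)
changed: [m9-04-r-04] => (item=14)
"""
for host, statuses in summarize(sample).items():
    print(host, dict(statuses))
```

Any host with a non-zero `failed` or `fatal` count is the first place to look.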

In fact, it is not quite done: do not forget about the convergence of the dynamic routing protocols and the loading of a large number of routes into the FIB. We cannot influence this in any way. We wait. It has converged. Now it is done.
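The waiting itself can be scripted too. The Python sketch below polls some measurement until it stops changing. The `read_value` callable is hypothetical, for example a function that returns the number of routes currently installed on the router. A stable route count is only a rough heuristic for convergence, not proof of it.

```python
import time


def wait_until_stable(read_value, stable_samples=3, interval=5.0,
                      max_samples=120, sleep=time.sleep):
    """Poll read_value() until it returns the same value several times in a row.

    Returns the stable value, or raises TimeoutError after max_samples polls.
    """
    last = None
    streak = 0
    for _ in range(max_samples):
        value = read_value()
        if value == last:
            streak += 1
        else:
            # The value changed: start counting identical samples again.
            last, streak = value, 1
        if streak >= stable_samples:
            return value
        sleep(interval)
    raise TimeoutError("value did not stabilise within the allotted samples")


# Simulated readings: the route count grows, then settles.
readings = iter([100, 5000, 80000, 80000, 80000])
print(wait_until_stable(lambda: next(readings), sleep=lambda seconds: None))
```

The `sleep` parameter exists only so the function can be exercised without real delays, as in the simulated run above.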

Meanwhile, in the village of Villabajo (which does not want to automate network configuration), they are still washing the dishes. Bruce (a different one this time, but no less cool) is trying to figure out how much more equipment has to be reconfigured by hand.



I would also like to dwell on one important point: how do we put everything back? After some time, we will bring our FW-CLUSTER back to life. It is the primary equipment, not the backup, and the network should run on it.

Can you feel the network engineers starting to fume? The technical director will hear a thousand arguments for why this is not necessary or why it can be done later. Unfortunately, this is exactly how a network ends up made of patches, scraps, and remnants of former luxury: a patchwork quilt. Our task as IT specialists, not just in this particular situation but in general, is to bring the network to what the beautiful English word "consistency" describes. It is a multifaceted word that covers agreement, non-contradiction, logical order, coherence, systematicity, comparability, and connectedness. It is all of these. Only in this state is the network manageable: we clearly understand what works and how, we clearly understand what needs to be changed when necessary, and we know exactly where to look when problems arise. And only in such a network can you pull off tricks like the one we have just described.

In fact, another playbook was prepared that returns the settings to their original state. Its logic is the same (remember that the order of the tasks is very important), so to avoid lengthening an already long article we decided not to publish its listing. After running drills like this, you will feel much calmer and more confident in the future; besides, any crutches you have piled up in the network will reveal themselves immediately.
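Since the rollback listing is not published, the following Python sketch is only one plausible way to think about it, not the authors' playbook. It assumes that the rollback undoes the steps in reverse order and that each step has a simple inverse. Only enable/disable steps are modelled; a real rollback would also have to restore IP addresses, NAT rules, and OSPF costs. The step descriptions are simplified from the switchover steps listed earlier.

```python
# Assumption: every modelled action has a direct inverse.
INVERSE = {"enable": "disable", "disable": "enable"}


def rollback_plan(plan):
    """Return the inverse of each (device, action, target) step, in reverse order."""
    return [(device, INVERSE[action], target)
            for device, action, target in reversed(plan)]


# A simplified subset of the switchover steps.
emergency_on = [
    ("L3-CORE", "disable", "links to FW-CLUSTER"),
    ("L2-MGMT", "disable", "links to FW-CLUSTER"),
    ("EMERGENCY", "enable", "EXT-INTERNET interface"),
    ("EMERGENCY", "enable", "OSPF Area 1"),
]

for device, action, target in rollback_plan(emergency_on):
    print(f"{device}: {action} {target}")
```

Under these assumptions, EMERGENCY is silenced first and the links to FW-CLUSTER are re-enabled last, so the two devices never answer for the same addresses at once.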

Anyone can write to us and get the source of all the code we wrote, along with all the playbooks. Our contacts are in the profile.

Conclusions


In our opinion, the set of processes that can be automated has not yet crystallized. Based on what we have encountered and what our Western colleagues are discussing, the following topics stand out so far:

  • Device provisioning;
  • Data collection;
  • Reporting;
  • Troubleshooting;
  • Compliance.

If there is interest, we can continue the discussion on one of these topics.

I would also like to reflect a bit on automation. What should it be, as we see it:

  • The system must be able to live without a human, while being improved by humans. The system should not depend on any one person;
  • Operations must be expert-level. There is no class of specialists who perform routine tasks; there are experts who have automated all the routine and solve only complex problems;
  • Routine and standard tasks are done automatically, "at the push of a button", without wasting resources. The result of such tasks is always predictable and understandable.


And what these points should lead to:

  • Transparency of the IT infrastructure (fewer risks in operation, upgrades, and rollouts; less downtime per year);
  • The ability to plan IT resources (a capacity-planning system: you can see how much is consumed and how many resources are needed in a single system, rather than through letters and visits to department heads);
  • The ability to reduce the number of IT staff.

Authors of the article: Alexander Manov (CCIE RS, CCIE SP) and Pavel Kirillov. We are interested in discussing and proposing solutions on the topic of IT infrastructure automation.
