Network Monitoring – Traps vs. Polling

As a network administrator, I have been (at least partly) responsible for monitoring the network infrastructure, and sometimes the entire company, of many customers. For some of them I have even redesigned the monitoring completely.

Many companies currently use tools such as Nagios, SolarWinds or a similar package. These tools are (at least in my opinion) first-generation monitoring software, because setting up the triggers in particular takes a lot of time. If you also have to deal with different vendors, it becomes even more complex; every vendor has its own event codes and descriptions, which leads to inconsistent and therefore unclear alerts.

Many parties also rely on SNMP traps. Although you will probably never disable SNMP traps completely, relying on these traps alone is dangerous, for several reasons.

Why we cannot trust SNMP Traps

SNMP traps are based on UDP: a trap is a single datagram sent by the device. Consider for example a temperature that exceeds its limit; the switch sends a single UDP datagram, once. UDP does not work the way TCP does, because there is no check whether the datagram ever arrived. TCP has retransmissions for this; UDP has no delivery control whatsoever.

In the real world, it can happen that a switch gets into trouble due to (for example) a spanning-tree recalculation and the SNMP trap never arrives as a result. These notifications will not be sent again and will never be registered. The same applies when the connection between your monitoring host and the device is unstable. Not something you can count on in the event of disruptions.
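Polling, on the other hand, is initiated by the monitoring host on a fixed schedule, so a lost packet is simply retried and the next polling cycle fills any gap. As a minimal sketch (the device address and community string are made up), a net-snmp poll with an explicit timeout and retry count looks like this:

# Poll uptime and the status of interface 1; wait 2 seconds per attempt, retry 3 times
snmpget -v2c -c public -t 2 -r 3 10.0.1.10 SNMPv2-MIB::sysUpTime.0 IF-MIB::ifOperStatus.1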

Next-generation alerts

As I wrote above, I see the software packages mentioned as first-generation monitoring, mainly because of the way alerts are generated. For example, alerts are often configured to fire when a gigabit interface reaches 90% bandwidth utilisation or when a hard disk has only 20% of its space left.

But is it useful to know that a lot of bandwidth is going through that specific interface? Maybe someone is watching a video on Netflix that first has to buffer. Is it useful to know that a hard drive has only 20% of its space left? For a small disk it might be, but for a 2 TB disk it hardly seems worth mentioning.

Future-proof approach to Monitoring and Alerts

The only way to generate correct alerts, alerts that actually require action, is to base them on metrics. Metrics, metrics and, I say it again: metrics. You can collect these metrics via SNMP polling of the devices, and from the collected data a trend line can be plotted. Using the examples above, trend analysis tells us whether the gigabit interface hits 90% utilisation regularly or only once. We can also determine how long it will take before a disk is really full and whether it requires action right now.
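To make this concrete, here is a rough sketch of how a single bandwidth metric is born (device, community and interface index are made up): sample the 64-bit input octet counter twice and turn the difference into a rate. A real poller does exactly this continuously and stores every sample, which is what makes the trend line possible.

# Read IF-MIB::ifHCInOctets for interface 1 twice, 60 seconds apart
A=$(snmpget -v2c -c public -Oqv 10.0.1.10 IF-MIB::ifHCInOctets.1)
sleep 60
B=$(snmpget -v2c -c public -Oqv 10.0.1.10 IF-MIB::ifHCInOctets.1)
# Average inbound rate over the interval, in Mbit/s
echo "$A $B" | awk '{ printf "%.1f Mbit/s in\n", ($2 - $1) * 8 / 60 / 1000000 }'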

Collecting metrics opens doors: much more insight into what is happening on the infrastructure, outages that can be prevented instead of resolved, better advice on capacity and growth paths, and an easier life for administrators.

The last huge difference is that all alerts are displayed in the same (uniform) way. We are no longer dependent on the different vendors!

Steps to improve or set up network monitoring

  1. Draw the network infrastructure in a real-time map view
    No network administrator likes to make network drawings. Make it fun by playing with the real-time display of trends and data, and gain insight into the network and up-to-date documentation at the same time.
  2. Transform from network monitoring to service and chain monitoring
    By gaining insight into the actual availability of the service, it also becomes clear what the impact is for the customer.
  3. Reclassify alerts based on impact
    Considering point 2, we now know what the impact is for a customer. Many services are set up redundantly, so the impact is smaller. Combine this with point 1 and you have a real-time view of where the disruption occurs, without first having to search through the various equipment for half an hour.
  4. Create a performance baseline
    By measuring the customer's services on performance (response time), in combination with the availability and response time of the network, it can quickly be determined whether there is congestion.
  5. Work with trend analysis and forecast alerting
    By making use of all the collected data, many false alerts can be prevented. With this data an indication can also be given of future capacity and availability; a minimal forecasting sketch follows after this list.
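To give an idea of what forecast alerting boils down to, the sketch below fits a straight line through daily disk-usage samples and estimates when the disk will be full. The log format and file name are made up, and a real monitoring platform would do this with proper time-series functions, but the principle is the same.

# disk-usage.log contains one sample per day: "<day-number> <used-percent>"
awk '
  { n++; sx += $1; sy += $2; sxx += $1*$1; sxy += $1*$2; last = $1 }
  END {
    slope = (n*sxy - sx*sy) / (n*sxx - sx*sx)   # growth in percent per day
    intercept = (sy - slope*sx) / n
    if (slope > 0)
      printf "Growing %.2f%% per day, disk full in roughly %.0f days\n",
             slope, (100 - (intercept + slope*last)) / slope
    else
      print "Usage is flat or shrinking, no action needed"
  }
' disk-usage.log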

SSH Tunnel to watch Netflix

I often use a ‘hopping server’ when connecting to clients, which means I need to log in twice each time. To make my life easier I sometimes use an SSH tunnel so I can connect to clients directly.

An SSH tunnel can also be useful when your office blocks Netflix 😉

Local Port Forwarding

This allows you to access remote servers directly from your local computer. Let’s assume you want to use RDP (3389) to reach a client’s host (10.0.1.1) and your hopping server is ‘hopping.server’:

ssh -L 6000:10.0.1.1:3389 wieger@hopping.server

Now you can open Remote Desktop and connect to ‘localhost:6000’, directing you through the tunnel!
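On Windows you can also start the client straight from the command line; a small usage example, assuming the tunnel above is running:

mstsc /v:localhost:6000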

Remote Port Forwarding

This makes a local service/port accessible from a remote host. Sometimes I use this to keep a ‘backdoor’ so I can log in remotely (from home or wherever).

Let’s say you want to make a web application (TCP 443) available on port 6000 of the remote SSH server:

ssh -R 6000:localhost:443 wieger@bontekoe.technology

Now you should be able to connect to port 6000 on the remote host (bontekoe.technology).
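One caveat: by default OpenSSH binds remote forwards to the loopback interface of the server, so port 6000 is then only reachable from the server itself. If you want to reach it from other machines, sshd on the remote server has to allow this; a minimal change would be:

# /etc/ssh/sshd_config on the remote SSH server
GatewayPorts yes    # let remote forwards listen on all interfaces instead of only 127.0.0.1

Reload sshd afterwards for the setting to take effect.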

Dynamic Forwarding (Proxy)

This is ideal for people who want to use the internet safely/anonymously, or for offices where Netflix is blocked 😉

Use a remote server (e.g. a home server) to tunnel all web traffic by connecting to it over SSH with the -D flag:

ssh -D 6000 wieger@bontekoe.technology

Now open your browser settings, navigate to the connection properties and manually configure a SOCKS proxy. Use 127.0.0.1 as the host and 6000 as the port. The tunnel remains open as long as you are connected through SSH.
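A quick way to verify that traffic really goes through the tunnel is to compare your public IP with and without the SOCKS proxy (ifconfig.me is just one of many "what is my IP" services):

curl https://ifconfig.me
curl --socks5-hostname 127.0.0.1:6000 https://ifconfig.me    # should now show the SSH server's IP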


BGP Route-Leak causes a significant shift in European internet traffic to Chinese Backbone network

A series of unfortunate configuration mistakes?

Last Thursday, June 6, 2019, at 12:00 Dutch time, a two-hour outage began for various customers, both business fiber connections and private internet connections. At first, the extent of the disruption was not completely clear.

After several reports from customers I noticed that complaints from people with connection problems were also coming in on Dutch websites such as allestoringen.nl. It soon turned out to be a national outage that primarily affected KPN (and indirectly many other parties, including national payment systems).

A first traceroute showed that traffic between Haarlem and Amsterdam was taking different (completely illogical) paths. This had all the symptoms of a BGP hijack; the Dutch website Tweakers.net also reported it and expressed the same suspicion.

Digging a bit deeper into the route my traffic was taking, I saw that my VPN connection between Haarlem and Amsterdam was suddenly taking a detour through AS4134, known as the Chinanet Backbone Network.
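For anyone who wants to check this during such an incident: a traceroute shows the path, and the Team Cymru whois service maps any suspicious hop to the AS that originates it. The addresses below are placeholders; use the far end of your own connection and the hop you want to identify.

traceroute -n 198.51.100.10
whois -h whois.cymru.com " -v 203.0.113.5"    # returns origin AS, BGP prefix and AS name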

For two hours, much of Europe’s Internet traffic passed through Chinese networks

This incident occurred after a Swiss hosting provider (Safe Host SA) started leaking more than 70,000 faulty routes to the Chinese backbone network. This Chinese network, in turn, forwarded these IP address announcements as valid to various large Tier 1 internet providers, causing a huge traffic shift toward the Chinese backbone network. The incident had a huge impact on the networks of Swisscom (AS3303), KPN (AS1130), Bouygues Telecom (AS5410) and SFR (AS21502).

Error or intention?

The official statement is: a configuration error by the Swiss hoster Safe Host. The remarkable thing is that the incorrectly advertised ranges were smaller and more specific than the legitimately advertised ones. Some websites mention that this could indicate the use of route optimizers, but Safe Host SA confirms on Twitter that they do not use BGP optimization software.

Now that Safe Host itself indicates that route optimizers were not the cause of this disruption, the question arises how such specific routes could have been (wrongly) advertised. And why only to Chinanet Backbone? Safe Host SA is still investigating and is not responding to questions on Twitter.

I am very curious about the explanation of this incident, if it ever comes.

Worldwide concern

Chris C. Demchak and Yuval Shavit released a publication in 2018 entitled “The Hidden Story of China Telecom’s BGP Hijacking”. In this publication they describe how China has repeatedly been able to divert specific traffic through Chinese PoPs via BGP hijacks, and what the possible implications of this are.

They also describe their concerns about the large (infrastructural) network presence of the Chinese backbone network in America (which is no different in the EU), while no other international network has a similar presence in China. This large presence is one of the things that makes it easy for China to conduct BGP hijacks on a large scale while protecting its own infrastructure at the same time.

Using these numerous PoPs, CT (China Telecom) has already relatively seamlessly hijacked domestic US and cross US traffic and redirected it to China over days, weeks, and months as demonstrated in the examples below. The patterns of traffic revealed in traceroute research suggest repetitive IP hijack attacks committed by China Telecom.

The report argues that the Chinese government is using local ISPs for intelligence gathering by systematically hijacking BGP routes to reroute western traffic through its country, where it can be logged for later analysis, and provides the following examples:

  • Starting in February 2016, and for about 6 months, routes from Canada to Korean government sites were hijacked by China Telecom and routed through China.
  • In October 2016, traffic from several locations in the USA to a large Anglo-American bank headquarters in Milan, Italy was hijacked by China Telecom to China.
  • Traffic from Sweden and Norway to the Japanese network of a large American news organization was hijacked to China for about 6 weeks in April/May 2017.
  • Traffic to the mail server (and other IP addresses) of a large financial company in Thailand was hijacked several times during April, May, and July 2017.

Doug Madory from Oracle confirmed that AS4134 was redirecting traffic in a blog post published on the 5th of November 2018:

In this blog post, I don’t intend to address the paper’s claims around the motivations of these actions. However, there is truth to the assertion that China Telecom (whether intentionally or not) has misdirected internet traffic (including out of the United States) in recent years. I know because I expended a great deal of effort to stop it in 2017.

How to prevent/secure this?

The problem with incidents like this is that the internet runs on the BGP protocol. BGP is a global protocol, spoken between organizations and countries across international borders. There is no single centralized authority, just internet providers that collaborate on the basis of trust. The internet is intended to be open and transparent, and so all ISPs are trusted to play nice. Furthermore, there are initiatives that can improve the security of BGP in general, but these must be adopted on a large scale.
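One of those initiatives is RPKI route origin validation, which lets anyone check whether a prefix is announced by the AS that is actually authorized to originate it. As a rough sketch, and assuming the RIPEstat 'rpki-validation' data call (the ASN and prefix are just examples), such a check can be done with a single request:

# Ask RIPEstat whether AS3333 is a valid origin for 193.0.0.0/21 according to the published ROAs
curl -s "https://stat.ripe.net/data/rpki-validation/data.json?resource=AS3333&prefix=193.0.0.0/21"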

On the 12th of June, the Dutch National Cyber Security Centre issued a report in which it once again shows how great the (digital) risks are for Dutch society. In this report they also specifically mention the influence of China (among others).

“Countries such as China, Iran and Russia have offensive cyber programs aimed at the Netherlands. This means that these countries use digital resources to achieve both geopolitical and economic objectives at the expense of Dutch interests.”

It is time to immediately acknowledge and address these risks.


Test internet speed using speedtest-cli

Speedtest-cli is a great tool to test your internet speed against Speedtest.net servers. Make sure you have Python installed before installing speedtest-cli.

Installing and Using Speedtest-CLI

  1. Update APT and install packages

    apt-get update; apt-get install python-pip speedtest-cli

  2. Test your speed!

    speedtest-cli
    Testing download speed........................................
    Download: 913.12 Mbit/s
    Testing upload speed..................................................
    Upload: 524.12 Mbit/s

  3. Share your speed 🙂

    speedtest-cli --share
    This will provide you with an image you can share to prove the speed.
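By default speedtest-cli picks the closest server; if a result looks off you can list the available servers and test against a specific one (the server ID below is just an example):

speedtest-cli --list
speedtest-cli --server 1234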


Load Balancing Remote Desktop

Using HAProxy to load balance between RDS servers is useful if you have more than one RDS server and want users to connect to a single IP.

  1. Install Haproxy

    sudo apt-get update
    sudo apt-get install haproxy

  2. Add the RDP VIP (virtual IP) and RDP hosts to /etc/haproxy/haproxy.cfg

    defaults
        clitimeout 1h
        srvtimeout 1h

    listen VIP1 193.x.x.x:3389
        mode tcp
        tcp-request inspect-delay 5s
        tcp-request content accept if RDP_COOKIE
        # Keep a user on the same RDS host by persisting on the RDP cookie (mstshash)
        persist rdp-cookie
        balance rdp-cookie
        option tcpka
        option tcplog
        option redispatch
        server win2k19-A 192.168.10.5:3389 weight 10 check inter 2000 rise 2 fall 3
        server win2k19-B 192.168.10.6:3389 weight 10 check inter 2000 rise 2 fall 3

Now HAProxy is listening on the 193.x.x.x IP address; when you connect to that IP it will direct you to one of the Windows Server 2019 machines. If one of them dies, the health checks take it out of rotation and you can reconnect to the one that is still online.
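To keep an eye on which RDS host is up and where connections land, you can optionally add a small statistics listener to the same haproxy.cfg; a minimal sketch (port and credentials are arbitrary examples):

    listen stats
        bind *:8404
        mode http
        stats enable
        stats uri /stats
        stats refresh 10s
        stats auth admin:changeme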


Installing phpIPAM

phpIPAM is an open-source web IP address management application (IPAM). Its goal is to provide light, modern and useful IP address management. It has lots of features and can be integrated with PowerDNS!

  1. Install Apache, PHP 7.2, MariaDB and GIT client

    apt-get install apache2 mariadb-server php7.2 libapache2-mod-php7.2 php7.2-curl php7.2-mysql php7.2-gd php7.2-intl php-pear php7.2-imap php-memcache php7.2-pspell php7.2-recode php7.2-tidy php7.2-xmlrpc php7.2-mbstring php-gettext php7.2-gmp php7.2-json php7.2-xml git wget -y

  2. Run the secure installation for MariaDB

    mysql_secure_installation
    Enter current password for root (enter for none):
    Set root password? [Y/n]: N
    Remove anonymous users? [Y/n]: Y
    Disallow root login remotely? [Y/n]: Y
    Remove test database and access to it? [Y/n]: N
    Reload privilege tables now? [Y/n]: Y

  3. Create database and user

    MariaDB [(none)]> create database phpipam;
    MariaDB [(none)]> grant all on phpipam.* to phpipam@localhost identified by 'bontekoe123';

    MariaDB [(none)]> FLUSH PRIVILEGES;
    MariaDB [(none)]> EXIT;

  4. Clone it from GitHub

    cd /var/www/html
    git clone https://github.com/phpipam/phpipam.git /var/www/html/phpipam/
    cd /var/www/html/phpipam
    git checkout 1.3
    git submodule update --init --recursive

  5. Edit the config file

    cp config.dist.php config.php
    nano config.php

    Add your MariaDB database name, username and password there (the $db settings near the top of the file).

  6. Import the phpIPAM database schema

    mysql -u root -p phpipam < db/SCHEMA.sql

  7. Set the correct permissions

    chown -R www-data:www-data /var/www/html/phpipam
    chmod -R 755 /var/www/html/phpipam

  8. Apache configuration

    Either you use the default Apache configuration or you create a virtual host specifically for phpIPAM; that I leave up to you. An example virtual host is shown below, after step 10.

  9. Finish the installer

    Go to http://x.x.x.x/phpipam and you will see the wizard, asking you to finish the installation. Follow the easy steps. After that you can log in with admin/admin and change the default password.

  10. Create your first subnet

    Here you can create your first subnet. phpIPAM can automatically help you with host discovery, status checks and more.
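For step 8, a virtual host could look like the sketch below; the ServerName is a placeholder for your own hostname and the paths match the clone location used above. Enable it with a2ensite, enable mod_rewrite if you want phpIPAM's pretty URLs, and reload Apache.

    <VirtualHost *:80>
        ServerName ipam.example.com
        DocumentRoot /var/www/html/phpipam

        <Directory /var/www/html/phpipam>
            Options FollowSymLinks
            AllowOverride All
            Require all granted
        </Directory>

        ErrorLog ${APACHE_LOG_DIR}/phpipam_error.log
        CustomLog ${APACHE_LOG_DIR}/phpipam_access.log combined
    </VirtualHost>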


Ceph Infrastructure Planning

Ceph is open-source storage meant to facilitate highly scalable object, block and file-based storage. Because of its open-source nature it can run on any type of hardware. The CRUSH algorithm balances and replicates all data throughout the cluster.

At first I was running Ceph on regular servers (HP ProLiant, Dell), but about two years ago I decided to play around with custom configurations, hoping to squeeze more IOPS out of it. As I found out, a big part of getting the most out of Ceph is the network. My last Ceph configuration was so fast it took down all our servers 🙁

What was I doing?

In order to get the most out of Ceph I created a setup using only NVMe drives. Here is the configuration of my Ceph nodes:

Supermicro SYS-1028U-TN10RT (link)
Intel E5-2640v4 (10 Core, 2.4Ghz)
32GB DDR4-SDRAM
1x Intel SSD DC P4600
1x ASUS M.2 Hyper x16 Card
4x Samsung Pro 960 NVMe
1x Intel X520-DA2 Dual 10Gbe

I am using the Intel SSD DC P4600 as the main (fast) storage and the Samsung Pro NVMe drives for slightly slower storage. Everything was looking dandy.

So what happened?

One of our clients woke up and thought, “hey, let’s migrate some stuff today”. Generating only 1 Gbps of traffic on the uplinks to the internet, everything was looking fine. Until 15 minutes later, when everything went offline: all VMs stopped and storage was down. It turned out that the client was migrating a set of databases to our network; after the scp copy finished they extracted the files and imported them into the database server, generating a lot of IO on that virtual server. That virtual server, however, was part of a database cluster, so it replicated all that data to three other VMs.

As Ceph wrote the data to disk, it also replicated it across the Ceph pool and over the network, generating far too much traffic (20 Gbps per Ceph host) on the access layer. Ceph was so fast, it took down the entire network.

How to resolve

As storage (especially Ceph 😉) is getting faster, the network topology needs to change in order to handle the traffic. A design that is becoming very popular is the leaf-spine design.

This topology replaces the well-known access/aggregation/core design; by putting more bandwidth on the connections to the spines we can handle much more traffic.

A 10 Gbit network is the recommended minimum for Ceph. 1 Gbit will work, but the latency and bandwidth limitations are close to unacceptable for a production network; 10 Gbit is a significant improvement in latency. It is also recommended to connect each Ceph node to two different leafs and to separate external traffic (clients/VMs) from internal traffic (replication).
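Ceph supports this separation natively: client traffic can go over one network while replication and recovery traffic goes over another. A minimal sketch of the relevant ceph.conf lines (the subnets are illustrative):

# /etc/ceph/ceph.conf
[global]
public_network  = 10.10.10.0/24    # client/VM traffic, via one leaf
cluster_network = 10.10.20.0/24    # OSD replication and recovery traffic, via the other leaf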


Vyos configuration using Ansible

The goal is to create configurations for VyOS devices and apply them using Ansible. I have used VyOS as my home router, as a VPN endpoint (with 1300+ IPsec tunnels) and as a datacenter router connected to the AMS-IX using 10 Gbps uplinks.

Prerequisites

Make sure you have a VyOS installation (it can be virtual, a physical box, or even a Ubiquiti EdgeRouter) with SSH enabled.

Inventory file

[vyos]
10.0.1.1 ansible_user=ansible-adm ansible_network_os=vyos

Sample Playbook

---

- name: VYOS | Config eth1
  hosts: vyos
  connection: network_cli
  tasks:
    - name: VYOS | BACKUP
      vyos_config:
        backup: yes
    - name: VYOS | Apply eth1 config
      vyos_l3_interface:
        name: eth1
        ipv4: 10.0.2.1/24
        state: present

Run ansible playbook:

$ ansible-playbook -i hosts vyos.yml --ask-pass
SSH password:

PLAY [VYOS | Config eth1] *************************************************************************************************************************************************************

TASK [Gathering Facts] **************************************************************************************************************************************************
ok: [10.0.1.1]

TASK [VYOS | BACKUP ] *************************************************************************************************************************************************************
changed: [10.0.1.1]

TASK [VYOS | Apply eth1 config] *************************************************************************************************************************************************************
changed: [10.0.1.1]

PLAY RECAP **************************************************************************************************************************************************************
10.0.1.1               : ok=2    changed=2    unreachable=0    failed=0

Configure DNS server on VyOS

---
- hosts: vyos
  connection: network_cli
  tasks:
  - name: VYOS | DNS servers and hostname
    vyos_system:
      host_name: "{{inventory_hostname}}"
      domain_name: my.vyos.test
      name_server:
        - 1.1.1.1
        - 8.8.4.4

Configuration of HP IRF (Intelligent Resilient Framework)

HP’s Intelligent Resilient Framework (IRF) is an advanced technology that allows you to virtualize two or more switches into a single switching and routing system, also known as a “virtual switch”. IRF is available on the HP A-series switches such as the A5120 model and the A5500-5800 models.

From Wikipedia:
Intelligent Resilient Framework (IRF) is a software virtualization technology developed by H3C (3Com). Its core idea is to connect multiple network devices through physical IRF ports and perform necessary configurations, and then these devices are virtualized into a distributed device. This virtualization technology realizes the cooperation, unified management, and non-stop maintenance of multiple devices.[1] This technology follows some of the same general concepts as Cisco’s VSS and vPC technologies.

Basic Configuration:

Step 1:
Log in to the switch through the console port

Step 2:
Ensure that both switches are running the same software version

<H3C> system-view
[H3C] display version

Step 3:
Reset the configuration of the switches.

reset saved-configuration
reboot

Step 4:
Assign a unit number (1 or 2) to each S5800. (Later you will see the unit number on the front-panel LED on the right side of the switch.)

On unit 1:

[H3C]irf member 1 renumber 1
Warning: Renumbering the switch number may result in configuration change or loss. Continue?[Y/N]:y

On unit 2:

[H3C]irf member 1 renumber 2
Warning: Renumbering the switch number may result in configuration change or loss. Continue?[Y/N]:y

Step 5:
Save the configuration and reboot the switches

[H3C]quit
save irf.cfg
startup saved-configuration irf.cfg
reboot

Step 6:
Set the priority on the master S5800.
On unit 1:

[H3C]irf member 1 priority 32

Step 7:
Shut down the 10 Gbps ports that will form the IRF group (on both switches)
On Unit 1:

[H3C]int TenGigabitEthernet 1/0/25
[H3C-Ten-GigabitEthernet1/0/25]shutdown
[H3C]int TenGigabitEthernet 1/0/26
[H3C-Ten-GigabitEthernet1/0/26]shutdown

On Unit 2:

[H3C]int TenGigabitEthernet 2/0/27
[H3C-Ten-GigabitEthernet2/0/27]shutdown
[H3C]int TenGigabitEthernet 2/0/28
[H3C-Ten-GigabitEthernet2/0/28]shutdown

Step 8:
Assign the 10 Gbps ports to an IRF port group
On Unit 1:

[H3C]irf-port 1/1
[H3C-irf-port]port group interface TenGigabitEthernet 1/0/25
[H3C-irf-port]port group interface TenGigabitEthernet 1/0/26
[H3C-irf-port]quit

On Unit 2:

[H3C]irf-port 2/2
[H3C-irf-port]port group interface TenGigabitEthernet 2/0/27
[H3C-irf-port]port group interface TenGigabitEthernet 2/0/28
[H3C-irf-port]quit

Step 9:
Enable the 10 Gbps ports that will form the IRF (on both switches)
On unit 1:

[H3C]int TenGigabitEthernet 1/0/25
[H3C-Ten-GigabitEthernet1/0/25]undo shutdown
[H3C]int TenGigabitEthernet 1/0/26
[H3C-Ten-GigabitEthernet1/0/26]undo shutdown

On unit 2:

[H3C]int TenGigabitEthernet 2/0/27
[H3C-Ten-GigabitEthernet2/0/27]undo shutdown
[H3C]int TenGigabitEthernet 2/0/28
[H3C-Ten-GigabitEthernet2/0/28]undo shutdown

Step 10:
Activate the IRF port configuration (on both switches)

[H3C]irf-port-configuration active

Step 11:
Save the configuration

[H3C]quit
save

Step 12:
Connect the two 10GbE Direct Attach Cables (DACs) as shown in the IRF diagram
NOTE: The secondary switch (unit 2) will now reboot automatically.

Step 13:
The IRF stack should now be formed. Verify IRF operation

[H3C]display irf
[H3C]display irf configuration
[H3C]display irf topology
[H3C]display device

Done!


Ubuntu Bonding (trunk) with LACP

Linux allows us to bond multiple network interfaces into a single interface using a special kernel module named bonding: the bonding driver combines multiple network interfaces into a single logical “bonded” interface. First, install the ifenslave package:

sudo apt-get install ifenslave-2.6

Now we have to make sure that the bonding kernel module is present and loaded at boot time.
Edit the /etc/modules file:

# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
# Parameters can be specified after the module name.
bonding

As you can see we added “bonding”.
Now stop the network service:

service networking stop

Load the module (or reboot server):

sudo modprobe bonding

Now edit the interfaces configuration (/etc/network/interfaces) to support bonding and LACP.

auto eth1
iface eth1 inet manual
    bond-master bond0
 
auto eth2
iface eth2 inet manual
    bond-master bond0
 
auto bond0
iface bond0 inet static
    # For jumbo frames, change mtu to 9000
    mtu 1500
    address 192.31.1.2
    netmask 255.255.255.0
    network 192.31.1.0
    broadcast 192.31.1.255
    gateway 192.31.1.1
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200
    # Mode 4 = IEEE 802.3ad dynamic link aggregation (LACP)
    bond-mode 4
    bond-slaves none

Now start the network service again

service networking start

Verify the bond is up:

cat /proc/net/bonding/bond0

Output should be something like:

~$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
 
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
 
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 2
    Actor Key: 33
    Partner Key: 2
    Partner Mac Address: cc:e1:7f:2b:82:80
 
Slave Interface: eth1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:29:4f:26:c5
Aggregator ID: 1
Slave queue ID: 0
 
Slave Interface: eth2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:29:4f:26:cf
Aggregator ID: 1
Slave queue ID: 0