
The solution and challenges
For AI to work for us and alert us in advance, it needs good-quality, reliable data over time, and that data can be retrieved from our classic logs whenever an event is triggered. Ping and SNMP only provide data at polling intervals of two or three minutes, which gives us a blurred picture of reality; they won’t tell us the current state or project where a trend is heading.
So the research began: What logging level should we be collecting? Informational. We were collecting logs from around 2,500 global devices, so we needed to scale our collector servers for capacity, which is not a problem in a large organization.
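To get a feel for what 2,500 devices at informational level implies for collector capacity, a quick back-of-envelope estimate helps. The per-device message rate and average message size below are illustrative assumptions, not our measured figures:

```python
# Rough log-volume estimate for sizing collectors.
# The per-device rate and message size are illustrative assumptions.
DEVICES = 2500                 # global devices sending syslog
MSGS_PER_DEVICE_PER_SEC = 5    # assumed average at informational level
AVG_MSG_BYTES = 300            # assumed average syslog message size

msgs_per_day = DEVICES * MSGS_PER_DEVICE_PER_SEC * 86_400
bytes_per_day = msgs_per_day * AVG_MSG_BYTES

print(f"{msgs_per_day:,} messages/day")
print(f"{bytes_per_day / 1e9:.1f} GB/day raw, before indexing overhead")
```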
We were now collecting every informational-level log from our SD-WAN routers, which included SLA violations, CPU spikes on hardware, bandwidth threshold crossings, and configuration changes, logged every second, and we were even collecting NetFlow…because let’s just agree brownouts usually hide between “user” and “app,” not inside a single device.
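The first processing step on that firehose is simply sorting raw syslog lines into the event categories we care about. The keyword patterns below are hypothetical; real message formats vary by vendor and software release, but the shape of the classifier is the same:

```python
import re

# Hypothetical keyword patterns mapping raw SD-WAN syslog lines to event
# categories; real message formats differ by vendor and release.
CATEGORIES = {
    "sla_violation":       re.compile(r"SLA.*(violat|breach)", re.I),
    "cpu_spike":           re.compile(r"CPU.*(threshold|high)", re.I),
    "bandwidth_threshold": re.compile(r"bandwidth.*exceed", re.I),
    "config_change":       re.compile(r"CONFIG_I|configured from", re.I),
}

def categorize(line: str) -> str:
    """Return the first matching category for a syslog line, else 'other'."""
    for name, pattern in CATEGORIES.items():
        if pattern.search(line):
            return name
    return "other"

print(categorize("Jan 10 10:01:02 edge-rtr1 %SDWAN-5-SLA: SLA violated for tunnel 12"))
```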
Our SD-WAN routers have SLA monitors configured for DNS, HTTPS and SaaS applications. These worked as our synthetic emulators, creating a log whenever an SLA was breached for a layer 7 service or a website was “slow,” which let us monitor layer 7 protocols straight from the router.
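Conceptually, each SLA monitor is just a timed probe compared against a threshold. The routers do this natively; the sketch below only illustrates the idea with a hypothetical HTTPS probe and threshold, not the routers' actual implementation:

```python
import time
import urllib.request

# Conceptual HTTPS SLA probe: time a request and flag a breach when it
# exceeds a threshold. URL and threshold are illustrative; on the SD-WAN
# routers this is handled natively by the configured SLA monitors.
SLA_THRESHOLD_MS = 500

def probe(url: str) -> None:
    start = time.monotonic()
    urllib.request.urlopen(url, timeout=5)
    elapsed_ms = (time.monotonic() - start) * 1000
    if elapsed_ms > SLA_THRESHOLD_MS:
        print(f"SLA BREACH: {url} took {elapsed_ms:.0f} ms")
    else:
        print(f"OK: {url} took {elapsed_ms:.0f} ms")

probe("https://example.com/")
```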
From our RADIUS/TACACS servers, we were receiving logs on security violations on layer 2 ports and, occasionally, MAC flooding. Not just that: on our wireless infrastructure we even collected granular data like signal strength, SSID, channel bandwidth, and the number of clients on each access point, all thanks to a vendor API that made quick work of it. Similarly, from our switches we were collecting everything from layer 2 VLAN changes to OSPF convergence, from RADIUS server health to interface statistics.
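The vendor API isn't named here, so the endpoint and field names in this sketch are hypothetical; it only shows the general shape of the wireless collection: poll a controller's REST API, flatten each access point record, and ship it to the lake.

```python
import requests  # third-party HTTP client

# Hypothetical wireless-controller REST endpoint and field names; the real
# vendor API and schema differ, but the collection pattern is the same.
API_URL = "https://wlc.example.net/api/v1/access-points"
TOKEN = "REDACTED"

def collect_ap_metrics() -> list[dict]:
    resp = requests.get(
        API_URL, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=10
    )
    resp.raise_for_status()
    records = []
    for ap in resp.json().get("access_points", []):
        records.append({
            "ap_name": ap.get("name"),
            "ssid": ap.get("ssid"),
            "signal_strength_dbm": ap.get("rssi"),
            "channel_width_mhz": ap.get("channel_width"),
            "client_count": ap.get("client_count"),
        })
    return records
```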
After all the heavy lifting, we were able to get all this data into a data lake, but it turned out to be more like a swamp: the data arrived with 10 different timestamp formats and was not labeled correctly. And AI without labels is wishful thinking.
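This is the cleanup problem in miniature: every source stamps time differently, so nothing correlates until everything lands on one timeline. A minimal sketch of normalizing a few such formats to UTC ISO 8601 (the three formats shown are illustrative, not our exact ten):

```python
from datetime import datetime, timezone

# Illustrative subset of timestamp formats seen across sources; the real lake
# had around ten. Everything is normalized to UTC ISO 8601 so events from
# different devices can be correlated on a single timeline.
FORMATS = [
    "%b %d %H:%M:%S",        # classic syslog: "Jan 10 10:01:02" (no year)
    "%Y-%m-%dT%H:%M:%S%z",   # ISO 8601 with UTC offset
    "%Y/%m/%d %H:%M:%S",     # vendor-API style
]

def normalize(ts: str) -> str | None:
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(ts, fmt)
        except ValueError:
            continue
        if dt.year == 1900:                       # syslog lines carry no year
            dt = dt.replace(year=datetime.now(timezone.utc).year)
        if dt.tzinfo is None:                     # assume UTC when no offset
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc).isoformat()
    return None

print(normalize("Jan 10 10:01:02"))
print(normalize("2024/01/10 10:01:02"))
```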