It seems that every vendor has their own definition of what a microburst is. As usual, vendors’ definitions are heavily influenced by what they can and can’t measure. At the macro level there’s broad agreement that a microburst is a “short period of time when a burst of network traffic occurs that is significantly higher than normal” – but how short a period of time and how much higher than normal are areas where opinions start to differ.
To make any sense of this question it’s important to understand what causes a microburst and what impact they can have on your business systems.
Microbursts are a problem for organizations when the spike in network traffic overwhelms the capacity of systems in the data path to handle the load. When systems (like trading systems, for example) are overwhelmed, they typically drop or delay the packets directed at them. When trades are measured in microseconds, dropped or delayed packets have the potential to cost organizations serious money. For firms engaged in high frequency trading, knowing the true capacity of systems in the data path and being able to spot bursts that could compromise those systems is a big deal and worthy of serious investment.
Microbursts are not just isolated events on a particular link. They are caused by traffic from multiple downstream devices converging on a single upstream device. Essentially, a ‘glut’ of packets all arrive at the same point in the network at exactly the same time. Put another way, think of a 3-lane highway merging into a single lane at rush hour – the burst cannot exceed the total available bandwidth, but it can saturate it for a period of time.
With that clear, how big does a spike have to be and how long does it have to last before it should be called a microburst? The answer depends on the capacity of the least capable system in the data path. A microburst should be defined as any traffic spike that causes that system to fail, which means in reality every network should have its own unique microburst definition.
If you’re looking for a guide, based on our industry experience a sustained burst of traffic that saturates 75% of line capacity over a period of 100 microseconds is enough to cause problems for some under-powered systems. Assuming that you can work out what your unique definition of a microburst is (normally through a process of trial and error, comparing trace files at different line rates), how do you go about alarming on threshold breaches that could potentially cause packet loss? The answer, of course, is that you buy some kind of microburst detection solution.
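To make the 75%-over-100-microseconds rule of thumb concrete, here is a minimal sketch of the arithmetic. The function name and parameters are illustrative, not from any particular product:

```python
def burst_threshold_bits(line_rate_bps, window_us=100, saturation=0.75):
    """Bits that must transit the wire within one measurement window
    to saturate the given fraction of line capacity."""
    return int(line_rate_bps * (window_us / 1_000_000) * saturation)

# On a 10 Gbps link, 75% saturation over 100 microseconds works out to
# 750,000 bits -- roughly 62 full-size (1500-byte) Ethernet frames.
print(burst_threshold_bits(10_000_000_000))  # 750000
```

Note how small that number is: a burst of a few dozen back-to-back full-size frames is enough to breach the threshold, which is why one-second averages hide microbursts completely.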
Vendor solutions in this market can generally be put into one of three groups:
- Group one are vendors that use sampled network statistics (NetFlow) collected from routers and switches in the data path to look for bandwidth-over-time spikes. Being based on sampled data, these systems offer relatively low levels of visibility and typically struggle to report bandwidth utilization at increments below one second.
- Group two are the vendors that manufacture the routers and switches that sit in line. These solutions see network traffic at a bit level (rather than at a packet level) and calculate bandwidth utilization from true data throughput speeds. The problem is that these systems don’t have the processing power to generate a truly accurate calculation of bandwidth over time. In many instances they simply generate NetFlow statistics which they pass off to other systems, which really puts them back in Group one.
- Group three are vendors, like Endace, that use passive packet capture solutions to calculate the volume of traffic based on packet time stamps and packet sizes. This approach provides far and away the best mathematical framework for detecting microbursts if (and it’s a big IF) the packet time stamps are of a sufficiently high resolution and they really have captured every packet. In our experience you should take neither of these things as read.
Once you’ve got all the timestamped packets, the process of plotting bandwidth utilization is fairly straightforward: every 100 microseconds, sum all of the bits that have transited the wire and plot them on a line graph. It’s essential to make this an exercise in addition rather than multiplication. Taking a short traffic sample and multiplying the data volume out to provide a view over a longer period adds unacceptable inaccuracies into the equation – particularly as network speeds head towards 10Gbps.
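The summing step above can be sketched in a few lines. This is a simplified illustration of the technique, not any vendor’s actual implementation; the function name and the `(timestamp, size)` tuple format are assumptions for the example:

```python
from collections import defaultdict

def bits_per_window(packets, window_us=100):
    """Bucket timestamped packets into fixed windows and sum the bits in each.

    packets: iterable of (timestamp_us, size_bytes) tuples.
    Returns {window_index: total_bits} -- pure addition, no extrapolation.
    """
    buckets = defaultdict(int)
    for ts_us, size_bytes in packets:
        buckets[int(ts_us // window_us)] += size_bytes * 8
    return dict(buckets)

# Three 1500-byte packets land in the first 100 us window, one in the next.
packets = [(10, 1500), (40, 1500), (90, 1500), (150, 1500)]
print(bits_per_window(packets))  # {0: 36000, 1: 12000}
```

Each window’s total could then be compared against the site-specific threshold discussed earlier to raise an alarm. The key design point is the one the paragraph makes: every packet is added into its window, rather than sampling a few packets and multiplying up.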
If you’re worried about the impact that microbursts might be having on your critical systems here are 4 questions to ask any vendors that may be pitching you their wares.
- Over what time period do you calculate a microburst? (anything coarser than 100 microseconds is unlikely to give you the answers that you’re looking for)
- How do you guarantee the accuracy of time stamps and to what resolution do you offer time stamping? (look for an answer in the region of +/- 8 nanoseconds)
- What sort of alarms and threshold monitoring do you provide and at what level of granularity?
- What tools do you provide to allow me to extract the packets of interest so I can compare them with the output of my business systems?
If you’re interested in seeing how we monitor for microbursts, feel free to get in touch and we’ll gladly show you.