Measuring failover times

Started by SimonV, January 24, 2018, 03:17:33 PM


SimonV

I'm wondering if there is any tool that will accurately measure failover times in the lab, possibly even down to the millisecond. Anything that's more accurate than a ping would be great...  :)

deanwebb

What about debug logging? Or are their timestamps not specific down to the millisecond? Would syslogs be the answer?
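A minimal sketch of the syslog approach in Python, assuming millisecond timestamps are enabled on the device (e.g. "service timestamps log datetime msec" on Cisco IOS) and that the down/up events contain the usual "changed state to" strings; the markers and timestamp format are assumptions to adjust for your platform:

from datetime import datetime

DOWN_MARKER = "changed state to down"   # assumed IOS-style log text
UP_MARKER = "changed state to up"

def parse_ts(line):
    # Leading "*Jan 24 15:17:33.123:" style stamp; strip the decorations
    stamp = " ".join(line.split()[:3]).strip("*:")
    return datetime.strptime(stamp, "%b %d %H:%M:%S.%f")

def failover_ms(log_lines):
    # First down event to last up event, in milliseconds
    down = up = None
    for line in log_lines:
        if DOWN_MARKER in line and down is None:
            down = parse_ts(line)
        elif UP_MARKER in line and down is not None:
            up = parse_ts(line)
    return (up - down).total_seconds() * 1000.0 if down and up else None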
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

icecream-guy

What constitutes failover time? Therein lies the problem: is it when the packets make it to the redundant server, or when the service is responding?
:professorcat:

My Moral Fibers have been cut.

SimonV

Quote from: ristau5741 on January 24, 2018, 09:03:09 PM
What constitutes failover time? Therein lies the problem: is it when the packets make it to the redundant server, or when the service is responding?

When full end-to-end connectivity has been restored.

Something like iperf, with two nodes at either end of the topology you're testing, keeping a steady stream of timestamped traffic open and reporting back exactly when end-to-end connectivity has been restored. Thinking about it, it might be possible to use BFD for this.
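A minimal sketch of that idea in Python: one node floods small timestamped UDP probes, the other reports the longest inter-arrival gap, which approximates the outage. The port and the 1 ms interval are arbitrary choices, and the interval bounds the measurement resolution:

import socket
import struct
import time

PORT = 5005        # arbitrary probe port
INTERVAL = 0.001   # 1 ms between probes; this bounds the resolution

def sender(dst_ip):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    while True:
        # 16-byte probe: sequence number plus local send timestamp
        s.sendto(struct.pack("!Qd", seq, time.time()), (dst_ip, PORT))
        seq += 1
        time.sleep(INTERVAL)

def receiver():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    last = None
    worst = 0.0
    while True:
        s.recvfrom(64)
        now = time.time()
        if last is not None and now - last > worst:
            worst = now - last
            print(f"worst gap so far: {worst * 1000:.1f} ms")
        last = now

Run receiver() on one end and sender() on the other before triggering the failure; the worst gap reported after recovery is the failover estimate.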

wintermute000

IXIA and Spirent TestCenter (L4) / Avalanche (L7) are two big-name commercial traffic generator options.

Here's an open source one: https://trex-tgn.cisco.com/

This is a VERY deep field once you get into the weeds... be warned. The exact nature of the traffic has a bearing on what you measure and on what you're actually making the device do - this is especially true for security platforms. Example: I'm driving 2k concurrent TCP connections (Reno) at ~70k connections/sec with ~4 Gb/s total throughput (lots of tiny 8 KB HTTP GETs). It tells me I lost 15,000 packets and 500 connections. What's the convergence time? LOLOLOLOLOL


For simple routing and switching, something like a Spirent generating a simple stream of packets works: count the lost packets and you're done - sure. But as soon as you get into stateful devices or real-world traffic patterns... hmmmm.
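For that simple stateless case, the arithmetic is the classic loss-derived convergence calculation: at a constant offered rate, every lost packet represents a known slice of time. A sketch, with a purely illustrative rate:

def loss_derived_convergence(packets_lost, offered_rate_pps):
    # At a constant offered rate, blackout time = lost packets / rate
    return packets_lost / offered_rate_pps

# wintermute000's 15,000 lost packets at a hypothetical 100k pps:
print(loss_derived_convergence(15_000, 100_000))  # 0.15 seconds

This only holds for a steady-rate stateless stream; as noted above, it falls apart once stateful traffic is involved.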

dlots

I have used JPerf/Wireshark for this. Just send a stream of UDP traffic as fast as possible with a destination of a PC, and have the PC capture with Wireshark. Cause the failure; once the failover is done, stop the capture. Filter so that only traffic from the JPerf box is visible.

Go to View > Time Display Format > Seconds Since Previous Displayed Packet, then sort by time. The largest value there should be your failover time.
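Those Wireshark steps can also be scripted. A minimal sketch with scapy (assuming it's installed; the capture filename and sender IP are placeholders):

from scapy.all import rdpcap, IP

SENDER = "10.0.0.1"  # placeholder: the JPerf box

# Arrival times of packets from the sender, in capture order
times = [float(p.time) for p in rdpcap("failover.pcap")
         if IP in p and p[IP].src == SENDER]

# Largest inter-arrival gap approximates the failover time
gaps = [b - a for a, b in zip(times, times[1:])]
print(f"largest gap (~failover time): {max(gaps) * 1000:.1f} ms")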