A brief story about hunting down the issue that caused SIP packets to show out of order in a call logging system.
There are times during the build of a large global phone service that you need to debug SIP messages. For those that aren’t familiar with SIP it’s the protocol commonly used to setup and tear down VoIP/phone calls. For example, who is placing the call, the number they are dialling, the state of the call (e.g. Ringing/ Answered/Complete/Transferred to XYZ/etc…).
One of our logging systems shows the SIP messages in a ladder diagram. This is useful for quickly understanding the main stages a call went through, grouped by components.
A few times whilst investigating call issues I’d noticed that packets were occasionally showing in an unusual order. For example, a call was answered and then started ringing, or a call showed as ringing before it had even been placed. Things were not making sense.
Perhaps Server Time?
My initial reasoning for the out of order packets was some sort of NTP issue. There are a plenty of machines that handle packets for a given call, I wondered if subtle time differences between the machines were to blame. This theory was short lived as I found an example where the same machine was reporting packets in an unusual order.
Perhaps Storage Time?
SIP packets themselves have no time information embedded. Perhaps it’s a storage time problem? We do have a cluster of machines that handling the logging. Perhaps the time was recorded based on the arrival of the packet in to the reporting system?
A further look at the reporting process put an end to this. The SIP packets for a call are captured in a wrapper reporting packet which has its own timestamp field filled in by the original process. This means the time the packet arrives at a reporting server doesn’t matter at all.
A Reporting Packet Error Surely…
What makes the reporting packet? The issue has to be there! A very quick attempt to group recent problem occurrences showed it was often calls that routed through a Perl component that had the unexpected time packets. Pulling up the code in question, one of the third-party libraries we use very roughly does this:
use Time::HiRes qw(time); my $t = time(); my @parts = split(/\./, $t); my $seconds = $parts; my $microseonds = $parts;
On quick glance this looks reasonable, but this doesn’t quite work! The issue
it turns out is all down to microseconds… Whilst Perl’s default
offers seconds, the
Time::HiRes module is used to get a microseconds figure as
well. The manual for
Time::HiRes gives a clear explanation for our problem:
[…] if you print the value of Time::HiRes::time() you seem to be getting only five decimals, not six as promised (microseconds). Not to worry, the microseconds are there (assuming your platform supports such granularity in the first place). What is going on is that the default floating point format of Perl only outputs 15 digits. In this case that means ten digits before the decimal separator and five after. To see the microseconds you can use either printf/sprintf with “%.6f” , or the gettimeofday() function in list context […]
Wait… Is that actually the problem?
If you think this through you might not see the problem. For example, instead of 123456 microseconds we were ending up with 12345, but so what? How are things getting out of order? The final piece of the jigsaw is that we have lots of other components logging in the same format but not using the same broken library. That means a packet that came before our 123456 microseconds example at 123000 microseconds is suddenly showing as arriving later. This checks out as even the examples I found where the same machine was showing out of order packets, the packets were logged by different components.
This was a surprisingly enjoyable issue to fix. One of those bugs that at the start seems so mysterious but effectively simplifies down to another variation of the classic off-by-one error that makes up so much of programming.