Distress - what does it mean?

The "distress" feature of the Saisei STM gives a quick synopsis of the state of the network from the point of view of a given user, application, geolocation or AS, as a score from 0 (perfect) to 100 (terrible). The score is meant to be understandable by anyone using the STM, without diving into the arcane details of the TCP protocol.

How is it calculated? What does it signify? This article sets out to answer these questions.

First, distress only applies to TCP. Distress is calculated for each TCP flow, based on two kinds of TCP-related events: retransmissions and timeouts. Retransmissions are a normal part of life in TCP. They occur when the network is trying to control the bandwidth that a flow uses, and in small quantities they don't indicate that anything bad is happening. Timeouts are a different story - they indicate that the network is seriously congested, typically that it is dropping several consecutive packets.

The distress scores for higher-level objects (applications, etc.) are computed by combining the distress scores of individual flows. The result is a little more than just the average over all flows:

  • It is strongly biased towards recent flows - older values decay with a time constant of one minute, so flows from more than about five minutes ago make an insignificant contribution.
  • It also considers the variance of the distress values of individual flows, as well as the average. The idea is that it is better for all flows to be OK than for some to be perfect and some to be poor.
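As a rough illustration of these two points, the Python sketch below weights each flow's score by exp(-age / 60 s) and computes a weighted mean and standard deviation. The function names, the (age, distress) input format, and especially the spread_weight term that folds the spread into the final score are assumptions made for this example. The article does not state the exact formula the STM uses.

```python
import math

TIME_CONSTANT_S = 60.0  # one-minute decay time constant, as described above


def decay_weight(age_s, tau=TIME_CONSTANT_S):
    """Exponential decay weight for a flow score that is age_s seconds old."""
    return math.exp(-age_s / tau)


def aggregate_distress(flows, spread_weight=0.5):
    """Combine per-flow distress scores (0-100) into one score.

    flows: iterable of (age_seconds, distress) pairs.
    spread_weight: hypothetical factor for how much the spread of the
    per-flow scores adds to the result (an assumption, not the STM's value).
    """
    flows = list(flows)
    weights = [decay_weight(age) for age, _ in flows]
    total = sum(weights)
    if total == 0:
        return 0.0
    # Decay-weighted mean and variance of the per-flow scores.
    mean = sum(w * d for w, (_, d) in zip(weights, flows)) / total
    var = sum(w * (d - mean) ** 2 for w, (_, d) in zip(weights, flows)) / total
    # Assumed combination: mean plus a fraction of the standard deviation.
    score = mean + spread_weight * math.sqrt(var)
    return min(100.0, max(0.0, score))
```

With these assumptions, a flow that is five minutes old has a weight of exp(-5), under 1% of a fresh flow's weight. Two flows scoring 20 and 20 aggregate to a lower value than two flows scoring 0 and 40, even though the averages are equal.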

Flow Distress Calculation

We keep three raw statistics for a TCP flow that are relevant to distress:

  • Retransmissions: packets whose sequence number is lower than the highest seen so far (all calculations take account of the fact that the 32-bit sequence number wraps around, i.e. comparisons are made modulo 2^32).
  • Retransmission events: packets whose sequence number is lower than the previous packet.
  • Timeouts: indicated by a retransmission of a packet that is now considered lost.
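The difference between the first two counters can be sketched in Python as follows. This is a minimal illustration, not the STM's implementation: the function names and the plain list of sequence numbers are hypothetical, and the comparison uses the usual half-window rule for wrapping 32-bit counters. Timeouts are left out because they depend on when a packet is judged lost, which is not specified here.

```python
SEQ_MOD = 1 << 32  # TCP sequence numbers are 32-bit and wrap around


def seq_lt(a, b):
    """True if a comes before b in sequence space, allowing for wraparound."""
    return a != b and ((b - a) % SEQ_MOD) < (SEQ_MOD >> 1)


def count_retransmissions(seqs):
    """Return (retransmissions, retransmission_events) for one direction.

    seqs: sequence numbers of data packets, in arrival order.
    """
    retransmissions = 0
    events = 0
    highest = None   # highest sequence number seen so far
    previous = None  # sequence number of the previous packet
    for seq in seqs:
        if highest is None:
            highest = seq
        elif seq_lt(seq, highest):
            retransmissions += 1  # lower than the highest seen so far
        else:
            highest = seq
        if previous is not None and seq_lt(seq, previous):
            events += 1  # lower than the previous packet: a new event
        previous = seq
    return retransmissions, events
```

For the arrival order 1000, 2000, 3000, 1000, 2000, 4000 this returns (2, 1): two packets are retransmissions, but they belong to a single retransmission event because only the first of them steps backwards from its predecessor.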

The distress score for a flow depends on:

  • the number of retransmissions, as a proportion of the total packet count
  • ...but we subtract retransmissions which are due to STM’s own packet drops, since these don't indicate a problem elsewhere in the network
  • the number of timeouts, as a proportion of the total packet count, but we square the contribution this makes, doubling its effect on the total value (see below for the detailed mathematics)

These are combined to give the distress value between 0 and 100.

What Distress Is Not

There are some things which you might expect to figure in the distress score but which do not. They are described in detail below.

  • Bandwidth: it would be great if distress indicated whether the flow is getting the bandwidth it needs for a satisfactory user experience. The problem is that this is very difficult to determine. For example, low-definition YouTube-type video is perfectly happy at around 200 kbit/sec (and even if more is available, YouTube doesn't use it), but streaming HDTV needs more like 6 Mbit/sec. If we see a video-like flow, we cannot tell what quality the stream is meant to be. Another example: file transfers (bulk upload/download) will use all the bandwidth they can get, within the constraints of the server and the user's network connection, and the system has no way of knowing what the user expects. Finally, the user's own network connection is very often the bottleneck. Giving a high distress score to almost all flows in the network for these reasons would misrepresent the user experience.
    As a result, we do not make any attempt to include the current or average bandwidth of a flow in the distress calculation.
  • Round Trip Time: RTT is generally dominated by the topology of the network and the laws of physics. From Saisei’s Sunnyvale office, it's 10 ms to Google (right next door and highly optimised), 30 ms to other neighborhood servers, 60 ms to the East Coast, and about 200 ms to Europe or Japan. In addition, it varies widely in real time - consecutive pings to a server in Europe often vary by 50%. Hence, we do not include RTT in the distress calculation.