In my last post, I talked about how frequent, short outages prevent video calling from being comfortable on Starlink. If you were curious about exactly how short and how frequent I meant, this post is for you.
Starlink's satellite dish exposes statistics that it keeps about its connection. The small "ping success" graphs I shared in the last post are visualizations provided by the Starlink app, which are driven by these statistics.
Thanks to starlink-grpc-tools, assembled by sparky8512 and neurocis on GitHub, I have instructions and some scripts to extract and decode these statistics myself. I haven't been great at collecting the data regularly, but I have six bundles of second-by-second stats, each covering roughly 7.5 to 12 hours. (February 1 saw a couple of reboots, so the segments there are approximately 7.5 and 11 hours, instead of 12 for the other segments.)
The raw data exposes a per-second percentage of ping success. It's somewhat common for a single ping's reply to go missing. Several pings are sent per second, though, and one missing every once in a while is mostly no big deal. The script I'm using tallies the number of times /all/ of the pings within a given second went missing (percent lost = 100, or "d=1" in the data's lingo). It also tracks "runs" of seconds where all of the pings in contiguous seconds went missing.
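For the curious, here is a minimal sketch of that kind of tally. It is not the starlink-grpc-tools code; the function name and the input format (a plain list with one ping-drop fraction per second, where 1.0 means every ping in that second was lost) are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def count_outage_runs(drop_fractions):
    """Map run length in seconds to the number of runs of that length.

    A second counts as an outage only when every ping was lost (d == 1).
    """
    runs = Counter()
    # groupby clusters consecutive seconds that share the same outage flag.
    for all_lost, group in groupby(drop_fractions, key=lambda d: d >= 1.0):
        if all_lost:
            runs[sum(1 for _ in group)] += 1
    return runs


# Made-up sample: a partial loss (0.25) does not count as an outage second.
samples = [0.0, 1.0, 0.0, 0.25, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
print(dict(count_outage_runs(samples)))
```

In this made-up sample there are two 1-second runs, one 2-second run, and one 3-second run.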
These first two graphs (Figure 1) explain what I mean by "frequent" and "short". The left-hand histogram displays one bar per "run length" of all-pings-lost seconds. That is, the left-most bar tracks when all pings were lost for only one second, the next bar to the right tracks when all pings were lost for two consecutive seconds, the third bar tracks when all pings were lost for three consecutive seconds, and so on. The height of the bar represents the number of times an outage of that length was observed. The histogram is stacked, so that the outages on the morning of February 1 (green) begin where the outages on January 31 (blue) end.
Over the 66.5 hours for which I have data, we counted 739 1-second outages. That's an average of just over eleven 1-second outages per hour, or slightly more often than one every 6 minutes. The decay of this data is pretty nice: two-second outages are approximately half as likely (344, averaging just over 5/hr, or just under one every 12 min), three-second outages just a bit less than that, and so on. By the time we get to 8 seconds, we're looking at only one per hour.
If we look at only the 1s-8s outages, i.e. those that on average happen once per hour or more, we have a total of 2018. That's an average of just over 30 disconnects per hour, or one every two minutes. For once, data proves the subjective experience correct. On a video call, it feels like you get something between a hiccup and a "the last thing I heard you say was…" every couple of minutes.
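The per-hour figures above are simple division, and can be checked with a few lines. This is only a sketch: the counts and the 66.5 hours come from this post, while the helper name `outage_rate` is made up for illustration.

```python
HOURS_OBSERVED = 66.5  # total span of the six data bundles


def outage_rate(count, hours=HOURS_OBSERVED):
    """Return (outages per hour, average minutes between outages)."""
    per_hour = count / hours
    return per_hour, 60 / per_hour


for label, count in [("1-second", 739), ("2-second", 344), ("1s-8s total", 2018)]:
    per_hour, gap = outage_rate(count)
    print(f"{label}: {per_hour:.1f} per hour, one every {gap:.1f} minutes")
```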
The right-hand graph is laid out in the same way, but the bars represent minute-long outages. You can just barely see a few counted as 1 minute and 2 minutes in length. Last Thursday, February 4 (red), was the first time we've had a significant Starlink outage, long enough for me to spend time poking around trying to figure out if it's "just us or everyone."
I've been mostly concerned with frequency: how often I can expect outages of each severity. The tool I've used to extract the statistics data exposes the outages differently. It is instead concerned with the total amount of downtime observed.
These graphs (Figure 2) are the data as the extraction tool provides it. Each bar represents outages of a certain length, as before. But now the height of the bar represents the total number of seconds of downtime they caused. The 1-second and 2-second bars are now about the same height because there were about half as many 2-second outages as 1-second outages, but they each lasted twice as long. The total amount of downtime they caused is about the same.
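The relationship between the two views is just a multiplication: seconds of downtime equals run length times the number of runs. A small sketch, using the 1-second and 2-second counts quoted earlier in this post (the function name is hypothetical):

```python
def downtime_by_length(run_counts):
    """Convert {run length in seconds: number of runs} into total seconds lost."""
    return {length: length * count for length, count in sorted(run_counts.items())}


counts = {1: 739, 2: 344}  # counts quoted earlier in this post
print(downtime_by_length(counts))
```

The 739 one-second outages add up to 739 seconds, and the 344 two-second outages add up to 688 seconds, which is why the two bars end up nearly level.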
That giant red line that has appeared in the right-hand graph is eye-catching. Thirty-seven and a half minutes of downtime, caused by one 37-minute outage. That 1-minute outage stack is quite a bit taller too, accounting for ten minutes of total downtime itself. This is how the significant outage on Thursday appeared to us. There was a large chunk of time where we obviously had no connection to the internet (37 minutes), surrounded by quite a bit of time where we'd start getting something to download, but then it would stop (ten 1- and 2-minute outages).
The sum of all 1-second-or-longer downtime we experienced in this 66.5 hours of data is 14686 seconds, or just over 4 hours. That's roughly 94% uptime.
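As a quick check of that percentage, here is the arithmetic spelled out. Both inputs are the figures quoted above; nothing else is assumed.

```python
observed_seconds = 66.5 * 3600  # 66.5 hours of data
downtime_seconds = 14686        # sum of all 1-second-or-longer outages

downtime_hours = downtime_seconds / 3600
uptime = 1 - downtime_seconds / observed_seconds

print(f"downtime: {downtime_hours:.2f} hours")
print(f"uptime: {uptime:.1%}")
```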
We didn't see the 37-minute outage in the earlier frequency graphs, because it has only happened once. If we zoom in on those graphs (Figure 3), so that most of the 1-13s bars are way off the chart, we can see a few more one-time-only outages. Each day has had some small hiccup in the "long tail" of over twenty seconds. I see hope in the fact that the grey color, which is the most recent data, from the day after the long outage, is nearly absent from the longer-run counts.
I'm curious about the sharp decline between 13 and 14 seconds. Is that a sweet spot for some fault recovery in Starlink's system, or is it just an aberration in my data? I'll have to keep collecting to see if it persists.
I've posted the summary data I used to generate these graphs in a gist on GitHub.
Post Copyright © 2021 Bryan Fink