It's always neat to see nice data collection like this, but unfortunately the average speed to the authoritative name servers isn't a very meaningful measurement. Real-world resolvers bias heavily towards the fastest name server for your zone, and they are so latency-sensitive that they'll do things like issue concurrent queries to several name servers at once.
The upshot is that what really matters is the latency to the closest name server, or at worst the latency to the 3rd-fastest server for the rare bootstrapping cases. BIND, by far the most common resolver, will issue up to 3 concurrent queries to different name servers as part of its SRTT algorithm. The next most common resolvers (Unbound, OpenDNS, and Google Public DNS) perform pre-fetching, so those latencies don't contribute to the user experience except for extreme outlier queries.
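To make the selection behaviour concrete, here's a toy model of SRTT-style name server selection. This is not BIND's actual implementation, and the server names and tuning parameters are made up; it only illustrates the idea that a resolver tracks a smoothed RTT per name server, usually queries the fastest one, and occasionally probes the others to keep their estimates fresh.

```python
import random

class SrttSelector:
    """Toy SRTT-style server selection (illustrative, not BIND's code)."""

    def __init__(self, servers, alpha=0.3, explore=0.05):
        self.srtt = {s: 0.0 for s in servers}  # 0.0 = untried, gets tried first
        self.alpha = alpha      # smoothing factor for new RTT samples
        self.explore = explore  # chance of probing a non-best server

    def pick(self):
        # Occasionally probe a random server so slower servers' SRTT
        # estimates don't go stale; otherwise use the fastest known.
        if random.random() < self.explore:
            return random.choice(list(self.srtt))
        return min(self.srtt, key=self.srtt.get)

    def record(self, server, rtt_ms):
        # Exponentially smooth the RTT estimate for this server.
        old = self.srtt[server]
        self.srtt[server] = rtt_ms if old == 0.0 else (
            (1 - self.alpha) * old + self.alpha * rtt_ms)

# Hypothetical zone with three name servers at very different distances:
sel = SrttSelector(["ns1.example", "ns2.example", "ns3.example"])
sel.record("ns1.example", 20)
sel.record("ns2.example", 90)
sel.record("ns3.example", 180)
# Once estimates are in place, the fastest server wins almost every pick,
# so the slower servers barely affect user-visible latency.
```

The consequence for measurement is visible directly: averaging the three recorded RTTs says 97ms, but the resolver above spends nearly all of its queries on the 20ms server.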
Some large DNS providers design for this behavior and deliberately increase the average distance to their DNS servers by operating the name servers for each domain in different data centers. That gives routing and path diversity for the DNS queries and responses. Since network path diversity increases with distance, this works best when you include a location or two that are quite far away, which raises the average latency to those servers, but thanks to resolver behavior does little to the user experience.
Where the average latencies are low, all of the name servers are in close proximity to the measurement point, and I would wager that the network path diversity is quite low. A small number of link failures or DDoS/congestion events, maybe even one, might make all of the servers unreachable.
A more meaningful measurement is to perform regular DNS resolutions using real-world DNS resolvers spread out across your users. In-browser tests like Google Analytics go a long way here, and it's fairly easy to A/B test different providers. The differences tend to be very small: caching dominates, as others here have mentioned.
Apologies if I seemed to rain on dnsperf's parade here; it's a neat visualization and measuring this stuff is tough. It's always good to see someone take an interest in measuring DNS!
[Full disclosure: I've worked on Amazon Route 53 ;)]
The RTT mechanisms in resolvers have a high degree of randomness and will aggressively retry the other, slower name servers. For example, out of 1000 samples, my desktop in the Netherlands (via XS4All) sees low latencies from Route 53 only ~60% of the time:
This looks decent at the median (20ms), but falls off beyond that: 185ms at the 90th percentile and an 88ms average, with one >1s outlier removed.
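For anyone wanting to reproduce this kind of summary from their own measurements, a minimal sketch (the sample data below is made up; the percentile is a crude index-based estimate, not a formal quantile method):

```python
import statistics

def summarise(rtts_ms, outlier_ms=1000):
    """Median, rough 90th percentile, and mean of RTT samples,
    with extreme outliers above outlier_ms dropped first."""
    kept = sorted(r for r in rtts_ms if r <= outlier_ms)
    p90 = kept[int(0.9 * (len(kept) - 1))]  # crude percentile by index
    return {
        "median_ms": statistics.median(kept),
        "p90_ms": p90,
        "mean_ms": statistics.fmean(kept),
    }

# Hypothetical samples: mostly fast, with a slow tail and one outlier.
samples = [18, 19, 20, 21, 22, 25, 150, 185, 200, 1200]
print(summarise(samples))
```

The median/p90 split is the interesting part: a handful of slow tail samples barely moves the median but dominates the 90th percentile, which is exactly the shape described above.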
As you pointed out, Route 53 optimizes for availability and DDoS resilience over RTT performance. There are 4 name server IPs to choose from, which gives me 4 different paths to 4 different server locations via anycast, and therefore 4 different RTT buckets. Few DNS providers go to such lengths for availability. Still, 185ms is a lot. It's probably because anycast/BGP advertisements from the US reach AMS-IX in fewer hops than competing advertisements from European locations. I would guess Route 53's current striping is not heavily tuned for RTTs.
Caching solves part of this, but there are a lot of resolvers out there. As a thought experiment: assume your sources of traffic are uniformly distributed among 75000 resolvers and you use 60-second TTLs (pretty standard); then you won't see significant benefit from caching until you get well past 1000 requests/s.
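The back-of-the-envelope behind that threshold, under the stated uniform-distribution assumption: each resolver sees q·T/R queries per TTL window; the first is a miss and the rest are hits, so the cache only helps once q·T/R is well above 1.

```python
# Assumptions from the thought experiment above (not measured values):
R = 75_000   # resolvers, traffic uniformly distributed
T = 60       # TTL in seconds

def hit_rate(q):
    """Approximate cache hit rate at total query rate q (req/s)."""
    per_resolver = q * T / R  # queries per resolver per TTL window
    # First query per window misses; the rest hit. Below 1 query
    # per window, the record has usually expired: effectively 0.
    return 1 - 1 / per_resolver if per_resolver > 1 else 0.0

for q in (100, 1_250, 10_000, 100_000):
    print(f"{q:>7} req/s -> {hit_rate(q):.0%} cache hit rate")
```

The break-even point is R/T = 75000/60 = 1250 req/s; only at roughly 10x that rate does the hit rate become substantial, which is where the ">>1000 requests/s" figure comes from.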
Many applications also have a long tail of DNS names and basically won't benefit from caching at all. This can be motivated by availability as well (think shuffle sharding :). I'm building one where DNS query time currently dominates page load time (especially aliasing to CloudFront can be slow :). It's useful to understand that there's a general availability vs. latency trade-off in DNS that is only partially addressed by the resolver.
> Assume your sources of traffic are uniformly distributed among 75000 resolvers
I think your argument is flawed because your users are not uniformly distributed among those 75000 resolvers.
In practice just ~1% of the resolvers (Comcast, NTT, Telekom, etc.) handle >90% of your users. Consequently, the benefits of caching kick in much earlier and more strongly than you suggest.
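Reworking the uniform model with a skewed split shows the size of the effect. The numbers here are illustrative assumptions (1% "large" resolvers carrying 90% of queries), not measurements:

```python
T = 60                      # TTL in seconds
big, small = 750, 74_250    # 1% large resolvers vs. the remaining 99%

def hit_rate_skewed(q):
    """Approximate hit rate when 90% of q req/s goes to the large resolvers."""
    warm = q * 0.9 * T / big    # queries per TTL at a large resolver
    cold = q * 0.1 * T / small  # queries per TTL at a small resolver

    def hr(n):
        # First query per TTL window misses, the rest hit.
        return 1 - 1 / n if n > 1 else 0.0

    # Weight each hit rate by the share of queries going there.
    return 0.9 * hr(warm) + 0.1 * hr(cold)

print(f"{hit_rate_skewed(100):.0%}")  # large resolvers are already mostly warm
```

At 100 req/s the uniform model predicts essentially no caching benefit, while this skewed model already gives a hit rate near 80%, because the handful of large resolvers stay warm.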
Large ISPs and public DNS resolvers typically don't use a single server, but rather a fleet of DNS resolvers, each with its own cache. Some providers, like Google Public DNS, use a two-layered cache, but it's still fragmented per server location. A lot of people also run their own resolvers (companies especially).
The 75000 was mostly a thought experiment; it's very hard to know what a good number is, although there is a Route 53-related reason for that number. In any case, the benefit of DNS caching is probably much less than you think due to short TTLs and the number of resolvers.
> Large ISPs and public DNS resolvers typically don't use a single server, but rather a fleet [...]
I assume all major ISPs use 2 or 3 layers of cache, which makes the size of their perimeter fleet largely irrelevant.
> it's very hard to know what a good number is
Could you perhaps ask your former Route53 colleagues for some log-file insight?
> the benefit of DNS caching is probably much less than you think due to short TTLs and the number of resolvers.
I don't think so. The overwhelming majority of clients use their ISP's resolver. So all it takes is one hit per major ISP per TTL to keep it zippy for almost everyone. That's why DNS works so well, after all?
> I assume all major ISPs use 2 or 3 layers of cache, which makes the size of their perimeter fleet largely irrelevant.
Not really. The resolvers tend to be geographically dispersed and use anycast. Having a multi-layered cache would probably decrease performance, except within a specific location.
> Could you perhaps ask your former Route53 colleagues for some log-file insight?
They see what's behind the cache, not how much traffic the resolvers are taking. Could be the same, could be 100x more, hard to tell.
> So all it takes is one hit per major ISP per TTL to keep it zippy for almost everyone. That's why DNS works so well, after all?
Caching works great with long TTLs, e.g. as used for NS, MX, and CNAME records. The problem is the 60-second TTLs commonly used for A records in cloud services. Except for reasonably high-volume names, it's unlikely that your A records will be in a given cache at a given time. Many applications also use many different domain names (e.g., one per user), which creates a long tail of low-volume names.
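One rough way to quantify "unlikely to be in a given cache": if queries for a name reach a particular resolver as a Poisson process at rate lam, the record is cached there only if at least one query arrived within the last TTL seconds, so P(cached) = 1 - exp(-lam * TTL). The query rates below are made-up examples.

```python
import math

def p_cached(lam, ttl=60):
    """Probability a record is in a given cache, assuming Poisson
    arrivals at lam queries/sec and the given TTL in seconds."""
    return 1 - math.exp(-lam * ttl)

# A name queried at this resolver once a minute vs. once an hour:
print(f"{p_cached(1 / 60):.0%}")    # ~63%
print(f"{p_cached(1 / 3600):.0%}")  # ~2%
```

For a long-tail name that a given resolver sees once an hour, a 60-second TTL means the cache is cold ~98% of the time, which is the failure mode described above.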
Of course, traffic is not uniformly distributed in any way, so there might be parts of the day when your name will be constantly served from cache everywhere, or parts of the world where it is never served from cache.
A write-up of Route 53's consideration of these trade-offs is here: http://www.awsarchitectureblog.com/2014/05/a-case-study-in-g... (there's also a video about the role this plays in withstanding DDoS attacks: https://www.youtube.com/watch?v=V7vTPlV8P3U around the 10-minute mark).