Analysis of CDNs Serving Top Alexa Sites

Here’s a quick-and-dirty analysis of the CDNs used by the top Alexa websites that I recently knocked together for a side project. I am leaving it here in case it is useful to others interested in the subject.


CDN Discovery

Using a basic Python script, I queried the top 500 Alexa sites to identify whether they are served by a CDN. Because CNAME records are not returned for all DNS lookups, I relied on another methodology: check whether the IP address in the ANSWER section of a DNS lookup belongs, according to WHOIS, to the domain in question or to a third party. The results are summarised in the following figure (restricted to CDNs serving 3 or more of the top 500 Alexa sites).

[Figure: cdn_freqs]
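
The ownership check described above can be sketched roughly as follows. This is a minimal illustration and not the script used for this post: the helper names are hypothetical, the WHOIS lookup assumes Team Cymru's IP-to-ASN WHOIS service (whose verbose replies end in a pipe-separated row with the AS name last), and the name comparison is a deliberately naive substring match.

```python
import socket


def resolve_ipv4(domain):
    """Return one IPv4 address for `domain` using the system resolver."""
    return socket.gethostbyname(domain)


def lookup_owner(ip, server="whois.cymru.com", port=43, timeout=5.0):
    """Ask a WHOIS server which network announces `ip`.

    Assumption: the server replies in Team Cymru's verbose format, where
    the last pipe-separated field of the last row is the AS name.
    """
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(f" -v {ip}\r\n".encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    reply = b"".join(chunks).decode("utf-8", errors="replace")
    rows = [line for line in reply.splitlines() if "|" in line]
    return rows[-1].split("|")[-1].strip() if rows else ""


def owner_matches_domain(domain, owner):
    """Naive check: does the domain's main label appear in the owner name?

    Takes the second-to-last label, so it mishandles suffixes such as
    .co.uk; a real script would need a public-suffix list.
    """
    parts = domain.lower().rstrip(".").split(".")
    label = parts[-2] if len(parts) >= 2 else parts[0]
    return label in owner.lower()


def served_by_third_party(domain):
    """Return (is_third_party, owner) for `domain`. Needs network access."""
    owner = lookup_owner(resolve_ipv4(domain))
    return (not owner_matches_domain(domain, owner)), owner
```

Calling `served_by_third_party("example.com")` performs a live DNS and WHOIS lookup, and any site whose owner string does not contain its own name is counted as being served by that owner.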

Insights

There are no groundbreaking insights from this plot: it shows CDN popularity amongst the top 500 websites. The providers are a mix of different types. As expected, traditional CDNs like Cloudflare, Fastly, and Akamai are among the most popular. Cloud providers are also popular, e.g. Amazon CloudFront and Aliyun (a.k.a. Alibaba Cloud). There are also telcos (e.g. China Unicom), hosting companies (e.g. OVH, Reflected Networks, and Automattic) and others (e.g. IAC Search & Media).

On closer inspection, the Chinese CDNs (such as Chinanet) are used for websites that primarily target users behind the Great Firewall, and also users in Laos. This last fact is interesting because, as far as I am aware, there is no Internet censorship in Laos, so perhaps this warrants further investigation. It might simply be due to its geographical proximity to China and, thus, to Chinese CDN PoPs. Google serves various Alphabet sites and, as the one exception, Evernote (perhaps a link formed by Chris O’Neill). Apart from this, the other CDNs serve a mix of different sites with no discernible pattern.

Limitations

There are a number of limitations to the above approach. It does not detect CDNs that serve content through HTTP redirects or anycast. It also relies on WHOIS, so it is vulnerable to any inaccuracies in the data retrieved from there. A potentially more robust alternative would be to track CNAME chains.
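
As a rough idea of what a CNAME-based check could look like, here is a minimal sketch. It was not part of this analysis. The suffix table is a small, non-exhaustive assumption, the function names are hypothetical, and `socket.gethostbyname_ex` only exposes whatever canonical name and aliases the system resolver reports, which varies by platform.

```python
import socket

# Assumed, non-exhaustive mapping from hostname suffixes to CDN names.
CDN_SUFFIXES = {
    "cloudfront.net": "Amazon CloudFront",
    "fastly.net": "Fastly",
    "akamaiedge.net": "Akamai",
    "edgekey.net": "Akamai",
    "cdn.cloudflare.net": "Cloudflare",
}


def cdn_from_name(hostname):
    """Return the CDN whose suffix matches `hostname`, or None."""
    name = hostname.lower().rstrip(".")
    for suffix, cdn in CDN_SUFFIXES.items():
        if name == suffix or name.endswith("." + suffix):
            return cdn
    return None


def detect_cdn(domain):
    """Check the resolver's canonical name and aliases. Needs network access."""
    canonical, aliases, _addresses = socket.gethostbyname_ex(domain)
    for name in [canonical, *aliases]:
        cdn = cdn_from_name(name)
        if cdn:
            return cdn
    return None
```

A fuller version would walk every hop of the chain with a proper DNS library, since the system resolver does not necessarily report intermediate CNAMEs.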


CDN Response Times

I wrote another Python script to measure the response times (TTFB) of each site. These times were separated into the time needed for DNS resolution (with the DNS cache disabled so that every lookup was measured), TCP connection establishment, SSL negotiation (the local certificate bundle was used), and content transfer (RX). libcurl was used to retrieve these times. Each connection attempt was aborted if no response was received within 5 seconds. Each site’s index was first requested over HTTPS. If that timed out, the ‘www’ subdomain was requested instead, again over HTTPS. Failing these, the script fell back to HTTP at the domain and then at the ‘www’ subdomain. If all four attempts failed, the site was skipped.
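
The measurement loop described above can be sketched along these lines. This is an approximation and not the original script: it assumes the pycurl bindings for libcurl, the function names are hypothetical, and the way the cumulative libcurl timers are split into DNS, TCP, SSL and RX phases is one plausible reading of the description.

```python
import io


def candidate_urls(domain):
    """URLs to try, in order: HTTPS (bare, www), then HTTP (bare, www)."""
    return [
        f"{scheme}://{host}/"
        for scheme in ("https", "http")
        for host in (domain, "www." + domain)
    ]


def split_phases(namelookup, connect, appconnect, starttransfer, total):
    """Turn libcurl's cumulative timers (in seconds) into per-phase durations.

    Assumption: RX is taken as the time between the first byte and the end
    of the transfer. appconnect is 0 for plain HTTP, so SSL is reported as 0.
    """
    ssl = appconnect - connect if appconnect > 0 else 0.0
    return {
        "dns": namelookup,
        "tcp": connect - namelookup,
        "ssl": ssl,
        "rx": total - starttransfer,
    }


def measure(url, timeout=5):
    """Fetch `url` once and return its phase timings. Needs pycurl and network."""
    import pycurl  # imported lazily so the helpers above work without it

    body = io.BytesIO()
    curl = pycurl.Curl()
    curl.setopt(pycurl.URL, url)
    curl.setopt(pycurl.WRITEDATA, body)
    # TIMEOUT caps the whole request, which only approximates
    # "no response within 5 seconds".
    curl.setopt(pycurl.TIMEOUT, timeout)
    curl.setopt(pycurl.DNS_CACHE_TIMEOUT, 0)  # disable libcurl's DNS cache
    try:
        curl.perform()
        return split_phases(
            curl.getinfo(pycurl.NAMELOOKUP_TIME),
            curl.getinfo(pycurl.CONNECT_TIME),
            curl.getinfo(pycurl.APPCONNECT_TIME),
            curl.getinfo(pycurl.STARTTRANSFER_TIME),
            curl.getinfo(pycurl.TOTAL_TIME),
        )
    finally:
        curl.close()


def measure_site(domain):
    """Try each candidate URL in turn; return (url, timings) or None."""
    import pycurl

    for url in candidate_urls(domain):
        try:
            return url, measure(url)
        except pycurl.error:
            continue
    return None
```

Note that this sketch falls back to the next URL on any libcurl error, not only on timeouts.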

The results are summarised in the plots below, where visualisation is again restricted to the CDNs serving 3 or more of the top 500 Alexa websites. The results are colour-coded by CDN, and the CDN names are printed on the right-hand side of each plot.

Insights

This analysis gives us some general CDN comparison results. Some CDNs perform better across the board than others. Fastly in particular displays low values for all times. Cloudflare is better than others for DNS, TCP, and SSL, but not RX. Similarly, Line’s DNS performance is extremely good, but its other times are average. One follow-up study would be to look into whether these CDNs operate differently from the others, and whether the sites they serve are of a different nature.

Another general observation is that telco CDNs (like Chinanet and Japan NIC) seem to fare worse than traditional CDNs (like Akamai and Cloudflare).

The plots uncover other interesting variations within the performance of the sites served by each CDN. For example, there is a wide disparity in the performance of sites served by Akamai. This is particularly evident for DNS and TCP times. The same goes for Chinanet and Google across all times.

Limitations

The approach uses cURL, which returns a slightly different TTFB from that experienced by a user of a web browser. This is something I have looked into in the past (cf. “Can SPDY Really Make the Web Faster?”), where I used the Chromium browser to retrieve webpages over SPDY and then analysed the resulting HAR and tcpdump files.
An alternative is to use a headless browser like PhantomJS or a browser automation tool like Selenium, but I found these to be unreliable, with varying results that fall in between those of the current methodology and the one used in the paper cited above.

Admittedly, the presented analysis is only a superficial one. The distribution of each CDN’s response times needs further analysis to identify where the relatively extreme values come from. An example of this is paytm.com (served by Amazon). Furthermore, a deeper comparison between CDNs across all times could show where some CDNs outperform others.
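
One simple way to start on the distribution analysis is to flag sites whose time sits far above the rest of their CDN’s sites. The sketch below is only an illustration and was not run on the data in this post: the function name is hypothetical, and the interquartile-range fence with k = 1.5 is a common convention, not something derived from these measurements.

```python
import statistics


def flag_outliers(times_by_site, k=1.5):
    """Return sites whose time exceeds Q3 + k * IQR for one CDN.

    `times_by_site` maps a site name to one timing in seconds and needs
    at least two entries.
    """
    values = sorted(times_by_site.values())
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    return sorted(site for site, t in times_by_site.items() if t > fence)
```

Running this per CDN and per timing (DNS, TCP, SSL, RX) would give a shortlist of sites worth inspecting by hand.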

The methodology also favours sites not supporting HTTPS as their SSL negotiation times would be 0. This could be resolved by confining the search space to HTTPS results only.
