Analysis of CDNs Serving Top Alexa Sites

Here’s a quick-and-dirty analysis of the CDNs used by the top Alexa websites, which I recently knocked together for a side project. Leaving it here in case it is useful for others interested in the subject.


CDN Discovery

Using a basic Python script, I queried the top 500 Alexa sites to identify whether each is being served by a CDN. Because CNAME records are not returned for all DNS lookups, I relied on a different methodology: checking whether the IP address in the ANSWER section of a DNS lookup belongs to the domain in question or not. The results are summarised in the following figure (restricted to CDNs serving 3 or more of the top 500 Alexa sites).

[Figure: frequency of CDNs serving 3 or more of the top 500 Alexa sites]
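For the curious, the gist of the ownership check can be sketched as below. This is a hypothetical reconstruction, not my actual script: the function names are mine, and the DNS and WHOIS lookups are injected as callables so the decision logic can be shown offline (in practice you would plug in something like dnspython and ipwhois).

```python
import re

def looks_first_party(domain, whois_org):
    """Crude heuristic: does the WHOIS organisation name share a token
    with the domain's second-level label? e.g. 'google.com' vs 'Google LLC'."""
    label = domain.lower().split(".")[-2]
    tokens = re.findall(r"[a-z0-9]+", whois_org.lower())
    return label in tokens

def classify(domain, resolve_ip, whois_org_of):
    """resolve_ip and whois_org_of are injected lookups so the
    heuristic itself stays testable without network access."""
    ip = resolve_ip(domain)
    org = whois_org_of(ip)
    return "self-hosted" if looks_first_party(domain, org) else org
```

Note the heuristic is deliberately naive (it would mis-handle domains like bbc.co.uk, where the second-level label is ‘co’); a real script needs public-suffix handling on top.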

Insights

There are no groundbreaking insights from this plot: it shows CDN popularity amongst the top 500 websites. This contains a mix of different CDN provider types. As expected, traditional CDNs like Cloudflare, Fastly, and Akamai are among the most popular. Cloud providers are also popular, e.g. Amazon CloudFront and Aliyun (a.k.a. Alibaba Cloud). There are also telcos (e.g. China Unicom), hosting companies (e.g. OVH, Reflected Networks, and Automattic) and others (e.g. IAC Search & Media).

Upon closer inspection, the Chinese CDNs (such as Chinanet) are used for websites primarily targeting users behind the Great Firewall, and also in Laos. This last fact is interesting: as far as I am aware there is no Internet censorship in Laos, so perhaps it warrants further investigation. It might simply be due to Laos’s geographical proximity to China and, thus, to Chinese CDN PoPs. Google is used by different Alphabet sites, with the exception of Evernote (perhaps a link formed by Chris O’Neill). Apart from this, other CDNs display a mix of different sites with no discernible pattern.

Limitations

There are a number of limitations to the above approach. It does not capture CDNs that serve through HTTP redirects or anycast. It also relies on WHOIS, so it is vulnerable to any inaccuracies in the data retrieved from there. A potentially more robust alternative would be to track CNAME chains instead.
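To illustrate the CNAME-chain alternative, here is a minimal sketch. The suffix table is a partial, illustrative list of well-known CDN CNAME targets, and the lookup function is injected (in practice you would wrap a dnspython resolver); none of this is from my actual scripts.

```python
# Partial, illustrative map of well-known CDN CNAME suffixes.
CDN_SUFFIXES = {
    ".cloudfront.net": "Amazon CloudFront",
    ".fastly.net": "Fastly",
    ".akamaiedge.net": "Akamai",
    ".cdn.cloudflare.net": "Cloudflare",
}

def follow_cnames(name, lookup, max_hops=10):
    """Follow a CNAME chain; `lookup` maps a name to its CNAME target,
    or None when the name resolves straight to an address record."""
    chain = [name]
    while len(chain) <= max_hops:
        target = lookup(chain[-1])
        if target is None:
            break
        chain.append(target)
    return chain

def cdn_from_chain(chain):
    """Match any name in the chain against known CDN CNAME suffixes."""
    for name in chain:
        n = name.lower().rstrip(".")
        for suffix, cdn in CDN_SUFFIXES.items():
            if n.endswith(suffix):
                return cdn
    return None
```

For example, a site whose assets CNAME to d111.cloudfront.net would be attributed to Amazon CloudFront even when WHOIS data for the final IP is messy.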


CDN Response Times

I wrote another Python script to measure the response times (TTFB) of each site. These times were separated into the time needed for DNS resolution (the DNS cache was disabled for this purpose), TCP connection establishment, SSL negotiation (using the local certificate bundle), and content transfer (RX). libcurl was used to retrieve these times. Each connection attempt was aborted if no response was received within 5 seconds. Each site’s index was first requested over HTTPS. If that timed out, the ‘www’ subdomain was tried instead, again over HTTPS. Failing these, the script fell back to HTTP at the bare domain and the ‘www’ subdomain, respectively. If all four attempts failed, the site was skipped.
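One detail worth noting: libcurl reports cumulative timestamps measured from the start of the request (its CURLINFO_*_TIME values), so the per-phase durations come from subtraction. A sketch of that arithmetic, plus the fallback-URL order described above (the function names here are mine, not from the actual script):

```python
def phase_times(namelookup, connect, appconnect, starttransfer, total):
    """Split libcurl's cumulative timers (seconds since request start)
    into per-phase durations. For plain-HTTP fetches appconnect is 0,
    so the SSL share clamps to 0."""
    dns = namelookup
    tcp = connect - namelookup
    ssl = max(appconnect - connect, 0.0)
    handshake_end = appconnect if appconnect else connect
    wait = starttransfer - handshake_end   # server think time
    rx = total - starttransfer             # content transfer (RX)
    return {"dns": dns, "tcp": tcp, "ssl": ssl, "wait": wait, "rx": rx}

def candidate_urls(domain):
    """The fallback order: HTTPS first, then www over HTTPS, then HTTP."""
    return [
        f"https://{domain}/",
        f"https://www.{domain}/",
        f"http://{domain}/",
        f"http://www.{domain}/",
    ]
```

So a fetch with cumulative timers (0.02, 0.05, 0.15, 0.30, 0.45) breaks down into 20 ms DNS, 30 ms TCP, 100 ms SSL, 150 ms waiting, and 150 ms transfer.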

The results are summarised in the plots below, where visualisation is again restricted to the CDNs serving 3 or more of the top 500 Alexa websites. The results are colour-coded by CDN, with the CDN names printed on the right-hand side of each plot.

Insights

This analysis gives us some general CDN comparison results. Some CDNs are better across the board than others. Fastly in particular displays low values for all times. Cloudflare is better than the others for DNS, TCP, and SSL times, but not RX. Similarly, Line’s DNS performance is extremely good while it is average for the other times. One follow-up study would be to look into how these CDNs operate differently from the rest, if at all, and whether the sites they serve are of a different nature.

Another general observation is that Telco CDNs (like Chinanet and Japan NIC) seem to fare worse than traditional CDNs (like Akamai and Cloudflare).

The plots uncover other interesting variations within the performance of the sites served by each CDN. For example, there is a wide disparity in the performance of sites served by Akamai. This is particularly evident for DNS and TCP times. The same goes for Chinanet and Google across all times.

Limitations

The approach uses cURL, which returns a slightly different TTFB from that experienced by a user with a web browser. This is something I have looked into in the past (cf. “Can SPDY Really Make the Web Faster?”), where I used the Chromium browser to retrieve webpages over SPDY and then analysed the resulting HAR and tcpdump files. An alternative is to use a headless browser like PhantomJS or Selenium, but I found these to be unreliable, with varying results that fall between those of the current methodology and the one used in the paper cited above.

Admittedly, the presented analysis is only a superficial one. More analysis of the distribution of each CDN’s response times is needed to identify where the relatively extreme values come from; one example is paytm.com (served by Amazon). Furthermore, a deeper comparison between CDNs across all times could reveal where some CDNs outperform others.

The methodology also favours sites that do not support HTTPS, as their SSL negotiation times would be 0. This could be resolved by confining the search space to HTTPS results only.


Collaborative LaTeX Editors: ShareLaTeX vs Overleaf

I have used both Overleaf [referral link] and ShareLaTeX [referral link] for multiple collaborative papers over the last 2 years or so. For many months I thought there was little to separate them, but I have finally come to a conclusion about which I prefer. Yeah, I took my time. Anyway, here’s why.

First, it’s important to note that both are pretty good at what they do. If you’re simply after a way of writing and compiling LaTeX documents with your collaborators and you’re not particularly picky, then honestly either would do. Also note that I’m only reviewing the free tiers of these services for us pauper academics.

Ok, let’s split some hairs!

Let’s start with ShareLaTeX, which I find myself using more and more. The interface is minimal but effective: it is designed for folks comfortable with LaTeX, without many bells and whistles. Having said that, this minimal interface still offers a lot to help your LaTeXing, such as:

  • Choice of 4 compilers (pdfLaTeX = default, LaTeX, XeLaTeX, LuaLaTeX)
  • Automated checkpointing with roll-backs of full document history
  • Built-in chat server
  • See online collaborators and what they are editing
  • Simple keyboard shortcuts that could be changed to Emacs or Vim keybindings
  • Syntax highlighting (with about 2 dozen color themes)
  • Manually-triggered output preview pane, with different compile options and syntax checks
  • Total word count
  • Tag projects with keywords
  • Sync with DropBox and GitHub
  • Private documents by default
  • Autocompletion of commands and reference names

There’s more; I just picked the things that seem most important to me. Some of the above features are not enabled for free accounts, but you can reach them easily as I explain at the end of this post.

Let’s move swiftly on to Overleaf. This is clearly designed for people who are new to, or skeptical of, LaTeX. The interface has 2 modes: ‘Source’ for the LaTeX-competent and ‘Rich Text’ for everyone else. The latter is not intended to be a WYSIWYG editor (à la LEd) but is much more user-friendly, with as little markup as possible and a few edit buttons (bold, italic, new section, bullets, etc.).

I tend to instinctively recommend Overleaf to colleagues who are allergic to LaTeX. Having said that, the rich editor is not really that rich. A common response I get from such folks is “this is very restrictive” and “I don’t know how to…”, which are fair comments. Most other online editors they use (on blogs, Moodle, etc.) have more functionality, which raises their expectations and leaves them disappointed. I don’t worry much about this issue because I think (1) it is a difficult problem to solve; and (2) solving it completely would dissolve the benefits of LaTeX editing anyway. But it is noteworthy because many academics are not LaTeX people and will ask about this before anything else.

The main features Overleaf gives you are:

  • Choice of 4 compilers (LaTeX dvipdf, pdfLaTeX, XeLaTeX, LuaLaTeX)
  • Automated checkpointing but only for very limited number of recent changes
  • Built-in comments
  • Simple keyboard shortcuts that could be changed to Emacs or Vim keybindings
  • Syntax highlighting (with about 2 dozen color themes)
  • Automatic and manually-triggered output preview pane
  • Total word count
  • Tag projects with keywords
  • Clone projects with git
  • Autocompletion of commands and reference names
  • Auto-closing brackets
  • Access to an impressive repository of templates

What it lacks, compared to ShareLaTeX, in descending order of importance IMHO is:

  1. Cannot see online collaborators and what they are editing
  2. Documents are public by default until you pay for the Pro level
  3. No built-in chat server
  4. No full history
  5. No sync with DropBox or GitHub
  6. No choice of compilers (though see the correction below)

Let’s go through these.

You wouldn’t think #1 is important until you use Overleaf and find the document changing in front of your very own overworked eyes without knowing who is doing it. This becomes especially alarming when my paranoia kicks in due to #2. Each new Overleaf document gets a URL containing a 13-digit hash. This URL is unlisted and not indexed by search engines, but the page itself is unprotected: anyone with the URL can view and edit the document. Of course, the URL is quite difficult to guess, but coupled with #1 it makes me uneasy whenever the aforementioned ghost-editing experience happens. Also, you cannot un-invite a collaborator or anyone else who ends up with the URL for whatever reason (apart from making a copy and purging the original – not an elegant solution). What’s more, #3 means you have to have Skype or whatever else open on the side to send panic messages like “WHO’S EDITING SECTION 4.3!??!?!” and hope someone responds.

#4 and #5 are such an unhappy couple. Full document history is preserved only for Pro levels, as is syncing to DropBox. This means us peasant academics have to be content with “goldfish documents”, i.e. ones with an extremely short memory (which I think is capped at 24 hours). I don’t like this. Although I don’t need to refer to the full history of a paper all the time, every now and again I do, in order to check where the document has gone off on a tangent or whether collaborators are undoing and redoing each other’s changes. These are things you sometimes have to deal with in collaborative papers, and I will not be online all the time to look out for them. Hence the importance of having a full history and the ability to roll back as a possible resolution. Overleaf really disappoints in this regard. “Never mind, I can sync and do this outside Overleaf.” Nope. Syncing options are not for you, peasant. You can clone the project with git like the nerd you are, and have fun trawling through the “Update on Overleaf” commits. Just pray none of them is by

Anonymous <anonymous@overleaf.com>

Finally, compilers. I originally wrote here that you cannot choose your compiler in Overleaf. That never caused me any loss of sleep, but I know a couple of XeLaTeX devotees out there who would flip tables over it. It turns out you can choose your compiler in Overleaf; this is something I had missed until they got in touch.

What does Overleaf have over ShareLaTeX? A few things, but all of which I can happily live without.

  1. “Rich” text editor
  2. Built-in comments
  3. Automatic update of output preview pane
  4. Clone projects with git
  5. Auto-closing brackets
  6. Access to an impressive repository of templates

I already covered #1. #2 only works in the rich editor, so I have no need for it. #3 is not bad but is RAM hungry which makes Overleaf a bit heavier (see below). #4 is pretty neat, but I find that I’m either collaborating through a git repo or using an online editor. Maybe others need the hybrid setup, in which case this side of Overleaf would be quite handy. #5 is nice. #6 is great but open for all anyway.

On memory use, I did a few quick tests on Firefox 50.1. Here are the results:

  • Homepage: 15MB for ShareLaTeX vs. 24MB for Overleaf
  • Project open: 40MB for ShareLaTeX vs. 69-72MB for Overleaf
  • Project compiling: 49-50MB for ShareLaTeX vs. 71-74MB for Overleaf
  • Project co-editing + compiling: 58MB for ShareLaTeX vs. 76-79MB for Overleaf

(For context: typical Google Doc ~65MB, Twitter ~67MB, BBC ~22MB)

Overall, ShareLaTeX is noticeably lighter, and you can tell.

Another thing on Overleaf. On a recent paper it lost some changes I made. I got all my co-authors to pinky swear they didn’t edit that part, so it might be one of those ghosts I discussed above. This happened on the deadline day of that paper! Thankfully it wasn’t a lot I needed to re-do, and I couldn’t recreate this issue so I’ll leave it there but it sure left a bad taste.

A final note on free tiers and upgrade models. Both give you freebies the more users you bring to them, but at very different rates. Yes, you start with only 1 collaborator and no history/DropBox in ShareLaTeX, but you can easily ramp up: bring 1 user and you get another free collaborator; bring 3 and you get 3 more; bring 6 and you get another 3 plus DropBox syncing and full history; bring 9 and you get unlimited collaborators. I did this quite quickly by talking to colleagues on my floor, who signed up. In Overleaf, you get unlimited collaborators from the get-go, which is great! Referrals, however, only bring you additional storage capacity (you start with 1GB, which is enough). You cannot bring enough users to get full history or DropBox syncing.

Final pet peeve: neither of them highlights BibTeX syntax ಠ_ಠ

Domains in a HAR file

Here’s some basic bash that parses a HAR file to extract the list of unique resource domains. It uses jq:

jq -r '.log.entries[].request | select(.method == "GET") | .url' "$1" | grep -Eo '^https?://[^/]+' | sed -E 's~^https?://~~' | sort -u

On Gist here.

P.S. You can create HAR files using a number of tools, such as PhantomJS, hdrgrab or chrome-HAR-capturer.

CrossCloud Workshop

We’re organizing CrossCloud, an IEEE INFOCOM’14 workshop on cloud interoperability and federation. The aim is to bring together systems researchers with relevant knowledge and experience in assembling federated cloud architectures and handling the divergence in emerging cloud APIs. We’ll also have an exciting keynote speaker, to be confirmed very soon, so stay tuned!

Please consider submitting and participating. All details on the website:

     http://www.comp.lancs.ac.uk/~elkhatib/crosscloud

Converting JSON to CSV

A long time ago (I’m slightly embarrassed to admit) I mentioned that I wrote a script to flatten JSON to CSV, and I promised to share it. I didn’t fulfill that promise until today. The reason is that the script was very rough and I wanted to sharpen it before letting it loose on the interwebs. I still haven’t got around to properly cleaning it up, and probably won’t for a couple more months or more. However, I decided today to release it as is. I’m sure some will find it useful enough, and maybe someone will be willing to tidy it up.

So here’s the story. There are plenty of tools out there to convert from CSV to JSON; e.g. Mr. Data Converter, CSV to JSON, csv2couch. However, converting from JSON, a flexible self-describing format, to CSV, a much more rigid format, is not as simple. I found some attempts (like jackson-dataformat-csv) but these require a predefined schema. This did not really suit my needs as I do not want to define a schema for each JSON file I need to convert. I just know that I have a JSON with a reliably consistent structure and I want to convert it to CSV/TSV for another program that can only accept such format.

I developed my code with these requirements in mind, so be careful: an inconsistent JSON structure will result in wrong CSV. The code parses the JSON twice: once to discover the schema, and again to convert.
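The two-pass idea can be sketched in a few lines. To be clear, this is not my released script but a minimal illustration with names of my own choosing, and it ignores lists inside the JSON:

```python
import csv

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys: {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

def json_records_to_csv(records, out):
    """Pass 1: discover the union of keys (the 'schema').
    Pass 2: write every record against that schema."""
    flat_rows = [flatten(r) for r in records]
    fields = sorted({k for row in flat_rows for k in row})
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    writer.writerows(flat_rows)
```

This is also where the consistency caveat bites: a record with an unexpected shape silently changes the discovered schema, so its values end up in different columns.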

Here it is. Enjoy, and be gentle 🙂

Multi-author comments in LaTeX

This is a simple set of commands to ease adding comments from multiple users editing a single LaTeX document, such as a manuscript shared between several authors.

\usepackage{xcolor}  % needed for \textcolor
\newcommand{\notesAlice}[1]{\textbf{\textcolor{blue}{Alice: #1}}}
\newcommand{\notesBob}[1]{\textbf{\textcolor{green}{Bob: #1}}}

…and so on. Obviously edit styles to your liking. Then each author enters their notes like this:

\notesBob{This paragraph needs more text.}

To enable switching the notes on and off, use the ifthen package. Disabling all notes is handy to check document length, for instance. Here’s how to do it:

\usepackage{ifthen}
\newboolean{showNotes}
\setboolean{showNotes}{false}
\newcommand{\notesAlice}[1]{\ifthenelse{\boolean{showNotes}}
  {\textbf{\textcolor{blue}{Alice: #1}}}
  {}
}
\newcommand{\notesBob}[1]{\ifthenelse{\boolean{showNotes}}
  {\textbf{\textcolor{green}{Bob: #1}}}
  {}
}