The time two guys hacked HMRC

Once upon a time (not so long ago – within the last year) a friend of mine who used to work at FedEx briefed me on the scandalous situation around VAT database access layers. Presumably, the entire market is owned by an Austrian operator who, at the time of his master's thesis, thought it appropriate to query the relevant countries' VAT registration offices for every valid VAT number in a given country and then compile them into a browsable database. What a questionable fellow that is!

The idea he came forward with was that we should query the HMRC VAT API for every possible valid VAT number and compile a database of our own. It would certainly have been possible, if not for his indisposition.

First, we had to obtain their VAT number generation algorithm. There were two algorithms: one public, the other secret. Fortunately, a prior lawsuit had solved that for us, so we had a quick generator.
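For reference, the publicly documented part works roughly like this – a sketch of the standard mod-97 check on UK VAT numbers and its newer "mod 9755" variant, where the first seven digits are weighted and the last two digits are the check pair (function names are mine, not from any official source):

```python
def uk_vat_check_digits(first7: str, mod9755: bool = False) -> int:
    """Return the two check digits for a 9-digit UK VAT number.

    Standard scheme: weight the first seven digits by 8..2, sum them,
    then subtract 97 until the total goes negative; the absolute value
    is the check pair. The newer "mod 9755" scheme adds 55 to the sum
    before doing the same subtraction.
    """
    weights = (8, 7, 6, 5, 4, 3, 2)
    total = sum(int(d) * w for d, w in zip(first7, weights))
    if mod9755:
        total += 55
    while total > 0:
        total -= 97
    return abs(total)

def looks_valid(vat9: str) -> bool:
    """A candidate is plausible if either scheme matches its last two digits."""
    check = int(vat9[7:])
    return check in (uk_vat_check_digits(vat9[:7]),
                     uk_vat_check_digits(vat9[:7], mod9755=True))
```

With a checker like this, enumerating all plausible 9-digit numbers is a trivial loop – which is exactly what makes the "query them all" plan feasible.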

Of course I used the splendid requests library, both for the API queries and for talking to our central server. The scraper ran entirely on my cluster of machines; before Bright Data we allocated some machines on AWS, but once proxying became available it was simply best to use our 1 Gbps DSL line.

Also: HMRC seems to have no rate limiter. Requests will proceed happily at maximum speed until somebody notices and hands you a massive IP range ban.

I even prepared the scraper for the event that a certain range of VAT numbers is simply unavailable (the Polish VAT authorities taught me that lesson well): expecting them to return HTTP 500 (or similar), the scraper would simply proceed to the next block and mark the current one for a later retry. This, curiously, never happened, and for once I consider HM Government Digital Service an institution of higher culture.

The scraper wrote its data to an internal Cassandra database, which also coordinated block assignment from the backend, so that there could be many scrapers. Each scraper would make an HTTP request to HMRC with a newly generated VAT number and save every hit to local SQLite storage. On a 404 it would proceed to the next number, and on anything along the lines of a 500 it would save that number for a later retry. Admittedly, I never saw a single 500 – their API is rock solid. By the way, I advise you to check out alphagov.
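A minimal sketch of that per-response bookkeeping – the class and table names here are my invention, not the actual scraper's:

```python
import sqlite3

class BlockScraper:
    """Per-block state: hits go to local SQLite, 5xx numbers are queued for retry."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS hits (vat TEXT PRIMARY KEY)")
        self.retry_later: list[str] = []

    def handle(self, vat: str, status: int) -> None:
        if status == 200:            # registered number -> save locally
            self.db.execute("INSERT OR IGNORE INTO hits VALUES (?)", (vat,))
        elif status == 404:          # not registered -> just move on
            pass
        elif status >= 500:          # server hiccup -> retry this number later
            self.retry_later.append(vat)

    def hits(self) -> list[str]:
        return [row[0] for row in self.db.execute("SELECT vat FROM hits ORDER BY vat")]
```

In the real setup the SQLite file was just a local buffer; the results were shipped to Cassandra once the block was done.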

Upon finishing a block, the data would be uploaded to the central server and the next block would be issued.

Please note that Polish institutions would frequently return 500s for some records as an alias for “information not yet available”. The HMRC API did a better job: not one request ended with a 500 – only 200 or 404.

As far as the scraping limits go – the rules say one request every 3 seconds. We ended up scrapping the rules and just loading the servers with as many requests as we could. This prompted their system administrator to ban our IPs manually; from then on we would receive either 401 or 403.

Therefore, a switch was made to the Bright Data proxy. Their awesome server-class pricing was okay for two guys with $10 tops (we actually spent $3; they greeted us with an extra $7). We proceeded from there at unprecedented speed – I rewrote the scraper, called bigbertha, to poll them using multiple threads. In less than a week we had a complete database of the 2.4 million VAT taxpayers in the UK.
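I won't reproduce bigbertha here, but its threaded polling boiled down to something like this – the `fetch` callable stands in for a `requests.get` through the Bright Data proxy, and all names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

def scan_block(vats: Iterable[str],
               fetch: Callable[[str], int],
               workers: int = 32) -> list[str]:
    """Check a block of candidate VAT numbers concurrently.

    `fetch` maps a VAT number to an HTTP status code; in the real
    scraper it was an HTTP request routed through the proxy pool.
    Returns the subset of numbers that came back 200 (registered).
    """
    vats = list(vats)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = pool.map(fetch, vats)      # preserves input order
    return [v for v, s in zip(vats, statuses) if s == 200]
```

Since each request is independent, a plain thread pool is enough – the bottleneck is network latency, not CPU, so dozens of threads per machine keep the proxy pipe full.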

Now, to make the records searchable, I used Elasticsearch. I made them searchable by name and by VAT registration address; as a bonus we also scraped the EORI database in the same manner.
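The index setup was nothing fancy – something along these lines, with the field names being my guess rather than the actual mapping:

```json
{
  "mappings": {
    "properties": {
      "vat":     { "type": "keyword" },
      "name":    { "type": "text" },
      "address": { "type": "text" }
    }
  }
}
```

With a mapping like that, lookups are plain `match` queries on `name` or `address`, and exact lookups by number hit the `keyword` field.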

The solution is ready at (and connected to my credit card, so all users welcome xD!).
