Once upon a time (not so long ago, within the last year) a friend of mine who used to work at FedEx (totally check him out, he’s an amazing person) briefed me on the scandalous situation around VAT database access layers. Presumably, the entire market is owned by a single Austrian operator who, back when writing his master’s thesis, thought it appropriate to query the relevant countries’ VAT registration offices for every valid VAT number in a given country and compile them all into a browsable database. What a questionable fellow!
The idea he came forward with was that we should query the HMRC VAT API for every possible valid VAT number and compile our own database. It would certainly have been possible, if not for his later indisposition.
First, we had to get hold of their VAT number generation algorithm. There were two algorithms: one public, the other secret. Fortunately, a prior lawsuit had settled that for us, so we quickly had a working generator.
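For the curious: the publicly documented scheme for 9-digit UK VAT numbers is a mod-97 check on the first seven digits. A minimal sketch of a validator and generator follows – the `make_uk_vat` helper is my own illustration, not the project’s actual generator, and I am assuming the two published variants (the classic mod-97 check and the later “+55” variant) cover the numbers in question:

```python
WEIGHTS = [8, 7, 6, 5, 4, 3, 2]

def is_valid_uk_vat(vat: str) -> bool:
    """Check a 9-digit UK VAT number against the two published
    mod-97 check-digit schemes (classic, and the '+55' variant
    used for newer registrations)."""
    if len(vat) != 9 or not vat.isdigit():
        return False
    digits = [int(c) for c in vat]
    total = sum(w * d for w, d in zip(WEIGHTS, digits[:7]))
    check = digits[7] * 10 + digits[8]
    # Classic scheme: weighted sum plus the check pair is divisible by 97.
    # Newer scheme: the same, after adding 55 to the total.
    return (total + check) % 97 == 0 or (total + 55 + check) % 97 == 0

def make_uk_vat(prefix: int) -> str:
    """Append classic-scheme check digits to a 7-digit prefix."""
    digits = [int(c) for c in f"{prefix:07d}"]
    total = sum(w * d for w, d in zip(WEIGHTS, digits))
    check = (-total) % 97  # makes (total + check) % 97 == 0
    return f"{prefix:07d}{check:02d}"
```

Iterating `make_uk_vat` over consecutive prefixes gives exactly the kind of exhaustive candidate stream the scraper needed.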
Of course I used the splendid requests library, both for the API queries and for talking to our own server. The scraper ran entirely on my cluster of machines; before Bright Data we allocated some machines on AWS, but once proxying became available it was simply best to use our 1 Gbps DSL line.
Also: HMRC seems to have no rate limiter. Requests proceed happily at maximum speed until somebody notices and hands you a massive IP range ban. That appears to happen manually, because ours landed precisely on a Thursday at 9:24 AM.
I prepared the scraper even for the event that a certain range of VAT numbers would simply be unavailable (the Polish VAT authorities taught me well): expecting HTTP 500 (or the like), the scraper would simply proceed to the next block and mark the current one for a later retry. Surprisingly, this never happened, and for once I consider HM Government Digital Service an institution of higher culture.
The scraper wrote its data to an internal Cassandra database, which also coordinated block assignment from the backend, so that there could be many scrapers. A scraper would make an HTTP request to HMRC with a newly generated VAT number and save any success to local SQLite storage. On a 404 it would proceed to the next number; on anything along the lines of a 500 it would save the number for a later retry (the sort of thing the Polish administration would do). Admittedly, I never saw a single 500. Their API is rock solid.
Upon finishing a block, the data would be uploaded to the central server and the next block issued.
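The per-block loop described above can be sketched roughly like this. The function and table names are mine, and `check(vat)` stands in for the HTTP call to HMRC (returning a status code and payload), so the control flow can be shown without the network:

```python
import sqlite3

def process_block(numbers, check, db_path=":memory:"):
    """Walk one block of candidate VAT numbers.

    `check(vat)` is a stand-in for the HTTP call to HMRC and
    returns (status_code, payload). 200 hits are saved to local
    SQLite, 404s are skipped, and 5xx numbers are queued for a
    later retry.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits (vat TEXT PRIMARY KEY, payload TEXT)"
    )
    retries = []
    for vat in numbers:
        status, payload = check(vat)
        if status == 200:
            conn.execute("INSERT OR REPLACE INTO hits VALUES (?, ?)",
                         (vat, payload))
        elif status == 404:
            continue             # number not registered: move on
        elif status >= 500:
            retries.append(vat)  # Polish-style "not yet available"
    conn.commit()
    return conn, retries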
Please note that Polish institutions would frequently return 500s for some records as an alias for “information not yet available”. The HMRC API did a good job: not one request ended in a 500 – only 200s and 404s.
As far as scraping limits go – the rules say one request every 3 seconds, so we ended up scrapping the rules and loading the servers with as many requests as we could. After about a week this triggered their system administrator to ban our IPs manually. From then on, we would receive either 401 or 403.
Therefore, a switch was made to a Bright Data proxy. Their awesome server-class pricing was fine for two guys with $10 tops (we actually spent $3; they greeted us with an extra $7 of credit). From there we proceeded with unprecedented speed – I rewrote the scraper, called bigbertha, to poll them from multiple threads. In less than a week we had a complete database of 2.4 million VAT payers in the UK.
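The multithreaded polling through a proxy can be sketched with requests and a thread pool. This is not bigbertha itself – the proxy address is a placeholder and the endpoint path is my guess at HMRC’s public check-VAT lookup, so treat both as assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder credentials/host, not the ones we used.
PROXY = {"http": "http://user:pass@proxy.example.com:22225",
         "https": "http://user:pass@proxy.example.com:22225"}
# Assumed endpoint shape for HMRC's check-VAT lookup.
URL = "https://api.service.hmrc.gov.uk/organisations/vat/check-vat-number/lookup/{}"

def fetch(vat):
    """One HTTP lookup through the proxy; returns (vat, status, payload)."""
    import requests  # imported here so the rest of the sketch runs offline
    r = requests.get(URL.format(vat), proxies=PROXY, timeout=10)
    return vat, r.status_code, (r.json() if r.status_code == 200 else None)

def scan(numbers, fetch_fn=fetch, workers=32):
    """Fan the candidate numbers out over a thread pool; the
    fetch function is injectable so the pipeline can be tested
    without touching the network."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(fetch_fn, numbers)
```

With I/O-bound lookups like these, threads are plenty – the GIL is released while waiting on the socket, so a few dozen workers saturate a proxy pool easily.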
Now, to make the records searchable, I used Elasticsearch, indexing them by company name and VAT address; as a bonus, we also scraped the EORI database in the same manner.
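The index shape for such a dataset is simple enough to show as plain dicts that any Elasticsearch client accepts – the field names here are my guesses at the schema, not the production mapping:

```python
# Assumed mapping: exact-match VAT numbers, full-text name/address.
VAT_MAPPING = {
    "mappings": {
        "properties": {
            "vat":     {"type": "keyword"},
            "name":    {"type": "text"},
            "address": {"type": "text"},
        }
    }
}

def name_query(text: str) -> dict:
    """Query DSL for a full-text search across name and address."""
    return {"query": {"multi_match": {"query": text,
                                      "fields": ["name", "address"]}}}
```

The `keyword` type keeps VAT numbers as exact terms for lookups, while `text` fields get analyzed for the fuzzy by-name search the site offered.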
The solution was ready and online until the friend I spoke of apparently misappropriated the money I sent him to renew the domain (don’t blame him, he’s in a rough spot right now).
And, as a data scientist, I encourage you to visit another article spurred by this great hack, called “What can you read from a company’s name in the UK” (it’s in Polish; English speakers can use a translator – they tend to do good work, and butcher less and less text each time).