The Billion Problem

May 2017 · 3 minute read

I will eventually write a post about my experience with Upwork. I won't comment on it just yet, as I don't feel I've had the full Upwork experience so far. I did get some work, and so far it has been pretty pleasant, in all honesty. But I digress.

A client approached me – or rather, I approached the client, given how Upwork works – to scrape a few databases of foods and their respective nutritional values for what seems to be a clone of MyFitnessPal. So obviously, the main victim of my scraping was going to be MyFitnessPal's food database. Is this illegal? I'm tempted to think so, but my client's lawyer said it isn't. Who am I to disagree? I get paid regardless.

The first issue is that MFP is, by design, hard to scrape. A crawler would take ages to find all the pages, and even then… you never know. So the idea is to do it in two parts.

Step 1: Enumerate all the URLs (i.e., DDoS them)

Step 2: Scrape all the existing URLs (i.e., DDoS them again)

Now this is all fine in theory, but in practice the URLs don't increment consistently and seem to range from 1 to a very large number. So naturally, I decided it was a sound idea to scan every single possible page until I found all the real ones. I expect to find somewhere between 2 and 10 million valid pages.

Never did I think that running 1 billion (yes, with a "B") web requests would be difficult. Boy, was I wrong.
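For a sense of scale: even at a hypothetical sustained rate of 1,000 requests per second, a billion requests takes a million seconds, or roughly eleven and a half days of non-stop hammering.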

Sadly I did not keep the failing code, but each time something failed I learned something. The main takeaway is that NodeJS's http library is bad for anything requiring high performance. It failed at every single point in the stack where something could have failed. It was Murphy's law playing out right before my eyes.

I am not exaggerating in the slightest when I say that, either. The first issue was that it was running out of memory without actually doing anything - just allocating it. I refactored to limit the request rate, but that didn't work out well either - the DNS resolver couldn't keep up. For some reason it wanted to make sure the IP hadn't changed every millisecond. Thus I added an /etc/hosts entry for the domain, which seemed to fix that. Nonetheless it kept finding other ways to break itself - I gave up.
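For reference, pinning the hostname like that boils down to a single /etc/hosts line, something along these lines (the IP below is a documentation-range placeholder, not MFP's actual address):

```bash
# Pin the target hostname to a fixed IP so every request skips the resolver.
# 203.0.113.10 is a placeholder address, not MFP's real IP.
echo "203.0.113.10 www.myfitnesspal.com" | sudo tee -a /etc/hosts
```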

But then my pathological tendency to over-use NodeJS for literally everything came to an end when a thought crossed my mind: "I'm on Linux". Suddenly I realized all the power I had in my hands - the power of the GNU/Linux tools. Why would I want to use request when I could use curl? Why would I waste my time with some async libraries when I had GNU parallel all along? No reason whatsoever.

And so I refactored my hundred-line application into a ~10-line bash script, and it worked beautifully. No errors, just a nice counter streaming in front of my eyes on a dedicated VPS instance. I will update this post when it's done running.
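For the curious, here is a rough sketch of what such a script can look like. This is a reconstruction under assumptions, not the actual script: the URL pattern, the ID range, the directory name, and the concurrency level are all placeholders.

```bash
#!/usr/bin/env bash
# Sketch: probe candidate food pages and record which IDs exist.
# URL pattern, ID range, and -j value are placeholders, not the real ones.

mkdir -p hits

probe() {
    local id="$1"
    # -s: silent, -f: treat HTTP errors (e.g. 404) as a non-zero exit code,
    # -o /dev/null: discard the body, we only care whether the page exists
    if curl -sf -o /dev/null "https://www.myfitnesspal.com/food/calories/${id}"; then
        # First version: mark each valid page with its own empty file
        # (this is the part Update 1 below walks back)
        touch "hits/${id}"
    fi
}
export -f probe

# GNU parallel fans the probes out across 100 concurrent curl processes
seq 1 1000000000 | parallel -j 100 probe
```

curl does the fetching and GNU parallel handles the fan-out across workers - no async plumbing required.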

A follow-up post will be made.

Update 1

It seems making a new file for each valid page was a terrible idea. Directories with millions of entries are very hard to process as a stream. Lesson learned.

Just dumping all the file names to another file fixed it.
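Concretely, the change amounts to printing the IDs instead of touching files, and letting a single output file collect everything - again a sketch, with the same assumed URL pattern as above:

```bash
probe() {
    local id="$1"
    # Same existence check as before, but print the ID instead of creating a file per hit
    if curl -sf -o /dev/null "https://www.myfitnesspal.com/food/calories/${id}"; then
        echo "$id"
    fi
}
export -f probe

# parallel serializes each job's output, so one redirect gives a clean list of valid pages
seq 1 1000000000 | parallel -j 100 probe > valid_ids.txt
```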