Happy Birthday Sitemap Crawler: A 30% Increase in Crawling Performance

January 2018 · 6 minute read

Brief Intro

This post is more informative than actionable. Unless you are working with something similar to what I am writing about, you will not reap any benefits beyond a mildly informative read.

Beyond that, this post is split into the two changes which achieved the performance increase. One concerns the language, while the other is an optimization uniquely available to my script (only because I hadn't taken advantage of it earlier).

Brief Context

Now before getting into technicalities, I would like to say “Happy Birthday!” to my text-based child for which this entire post is written: Sitemap Generator Crawler. Over three years ago, my laziness to manually produce a list of URLs or build a reasonable automation system for sitemaps produced a crawler which builds XML sitemaps in the worst way possible: Over web-requests and parsing HTML with regular expressions. Yet the interwebs have opted to forgive my sins and, based on the git repository’s analytics, saying that I have hundreds of users would be a very conservative estimation.

I am very happy for all the learning experiences this project has provided me with. Below you will see a graph which illustrates the activity of this repository.

[Graph: lines of code over time]

Development has occurred over some dozen spikes, with most of it done last year when I broke my initial intention of keeping it all in one file and split it up into 3 main files.

I have gotten to collaborate with almost 10 people and learn a lot of what git has to offer.

Lastly, I have been compensated for my work through over 50 stars, 30 forks and a few hundred dollars in unsolicited donations. As with all celebrations, the script got a gift: a significant performance improvement.

Optimizations

The language optimization

While both the most significant and the simplest change, making proper use of PHP's features is the most important mention in this post and likely what most readers came for. It is about arrays and hashmaps.

There are 2 main ways to check whether an element is in an array (something the script needed to do more and more often as it progressed through the site, against an ever-growing list): iterating over the array or using a hashmap.

Array:

<?php
in_array($value, $array);  // iterates over the array: O(n)

Hashmap:

<?php
isset($hashmap[$value]);   // hash lookup: O(1)

The key issue is complexity. The former has a complexity of O(n), meaning it runs slower for every element that's added to it. A bit of an oversimplification, but fair. The hashmap has a complexity of O(1), meaning that no matter how many elements it holds, a lookup takes a roughly constant amount of time.
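
If you want to see the difference for yourself, a quick benchmark along these lines illustrates it (this is not the project's code; the URL pattern and element count are arbitrary, and exact timings will vary by machine):

<?php
// Build a plain array and a hashmap containing the same 100,000 entries
$n = 100000;
$list = array();
$map = array();
for ($i = 0; $i < $n; $i++) {
    $url = "https://example.com/page/$i";
    $list[] = $url;
    $map[$url] = 1;
}

// Worst case for in_array: the element we look for is at the very end
$needle = "https://example.com/page/" . ($n - 1);

$start = microtime(true);
in_array($needle, $list);   // walks the array until it finds a match: O(n)
$linear = microtime(true) - $start;

$start = microtime(true);
isset($map[$needle]);       // a single hash lookup: O(1)
$hashed = microtime(true) - $start;

printf("in_array: %.6fs, isset: %.6fs\n", $linear, $hashed);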

The sitemap crawler was continuously checking and populating an array of already-scanned URLs so as not to scan them again:

<?php
$scanned = array();

function get_data($url) {
    global $scanned;
    if (!in_array($url, $scanned)) {  // O(n) check against every scanned URL
        array_push($scanned, $url);
        // ...fetch and parse the page...
    }
}

The above is obviously not real code and serves only to demonstrate. It is easy to move to isset:

<?php
$scanned = array();

function get_data($url) {
    global $scanned;
    if (!isset($scanned[$url])) {  // O(1) hash lookup
        $scanned[$url] = 1;
        // ...fetch and parse the page...
    }
}

After some 3000 items, the difference was very noticeable. While the first variant was advancing at about one link every few seconds, the latter never slowed down. You can see the speed tests I posted in the GitHub issue.

The performance increase is notable, and I am confident it would become much more so after 10,000 or 100,000 pages.

The logical optimization

This one is harder to explain and drawings would aid immensely. Yet I am garbage at drawing so text will have to satisfy you, dear reader.

Prior to making changes, the flow of the program was a loop based on the following:

  1. Get a page and extract all links
  2. Make note of the page being scanned so it is not scanned again
  3. Go to step 1 with the first link and defer the rest
  4. If there are no links, go to step 1 with the closest deferred link

The issue this created is that the deferred links would not necessarily be unique, as I had no system in place to enforce that. Adding a second, $scanned-like hashmap, but for deferred links, was an easy fix and resulted in a less noticeable but much more universal performance increase.
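
A minimal sketch of the idea (not the project's actual code; the function and variable names below are illustrative):

<?php
$scanned = array();   // URLs that have already been fetched, keyed by URL
$deferred = array();  // URLs queued for later, also keyed by URL for O(1) checks

function defer_url($url) {
    global $scanned, $deferred;
    // Only queue a URL if it has been neither scanned nor already deferred
    if (!isset($scanned[$url]) && !isset($deferred[$url])) {
        $deferred[$url] = 1;
    }
}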

GitHub issue

Bonus: The failed cURL magic

This one is a bonus because it failed but was fun to implement. No, I’m sorry. That was a lie. It was very painful to implement and made me angry at times.

PHP has this way of offering light abstractions over C utility calls, which makes it both very powerful and a pain to use if you don't know C.

The idea was to actively fail cURL requests which I knew would be wasteful. The trade-off in that case was the time saved by not downloading the full file versus the time lost by forcing cURL to relaunch. After a while, I started realizing that it would result in a net loss and a maintenance burden, yet I pressed on, as this had become personal.

This is the function of sadness:

<?php
curl_setopt($curl_client, CURLOPT_HEADERFUNCTION, "inspect_headers");

function inspect_headers($ch, $data)
{
    global $index_pdf, $is_pdf;
    $redirect_url = curl_getinfo($ch, CURLINFO_REDIRECT_URL);
    $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    $http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($redirect_url) {
        logger("URL is a redirect.", 1);
        // Strip the query string before following the redirect
        if (strpos($redirect_url, '?') !== false) {
            $redirect_url = explode("?", $redirect_url)[0];
        }
        scan_url($redirect_url);
    }

    // Abort the transfer for anything that isn't HTML
    if ($content_type !== false && stripos($content_type, "html") === false) {
        if (stripos($content_type, "application/pdf") !== false && $index_pdf) {
            $is_pdf = true;
        }
        echo "It's a PDF!!";
        // Returning anything other than the number of bytes received makes cURL fail the transfer
        return -1;
    }
    if ($http_code >= 400) {
        return -1;
    }
    return strlen($data);
}

The way this works is that by setting CURLOPT_HEADERFUNCTION, the given function, in this case inspect_headers, is called synchronously with the cURL handle and the header data received in that chunk, and it is expected to return the length of that data, otherwise the transfer is failed. Yes, this function gets called for every chunk of headers received. As such, before checking whether some expected header matched what I wanted, I had to check whether that header had even been received yet. THAT IS TERRIBLE!

This was my primary reference. Official docs, great, aren't they? They document every single function and option? Gold mine! Hahah NO. This is the example function in the docs:

static size_t header_callback(char *buffer, size_t size,
                              size_t nitems, void *userdata)
{
  /* received header is nitems * size long in 'buffer' NOT ZERO TERMINATED */
  /* 'userdata' is set with CURLOPT_HEADERDATA */
  return nitems * size;
}

Now go look back at my PHP code above. What’s that? The arguments aren’t the same?

The arguments I listed came from hours of var_dumping anything I could lay my hands on to figure out what was actually being passed in and what I was expected to return, because not only were the parameters wrong, the expected return value was bonkers too! The docs say it wants CURLE_OK (equivalent to the integer 0); I had to dig through ten-year-old mailing lists, which I can't even find anymore, to work that out.

I’m angry just writing this. I suppose I will end this post here.