An improved approach to the crawler

Hello!

Quite a lot has happened since our last post two weeks ago. In our last meeting with Mika it became apparent that we had not quite grasped the sheer volume of articles we would have to crawl. Initially, we were under the impression that we would only be crawling articles containing specific search words possibly related to ALS. It has now become clear that we will, in fact, be crawling pretty much the entire PubMed database, which currently holds around 23 million articles. The difference is big, but it changes nothing in terms of our technology choices: we still think PostgreSQL and Python are excellent tools for the job.

Because PubMed is a dynamic website with dynamically generated content (like most websites), we could not simply crawl a list of URLs. Instead, we had to automate a web browser that would search for keywords and move on to the next result page until all pages had been crawled. We used Selenium for this, and it worked fantastically. The only drawback was that it was somewhat slow, but we did not see this as a big problem, since the spider did not have any significant time constraints. Naturally, when we discovered how drastically our scope had changed, we needed to rethink our crawling algorithm entirely.
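For the curious, here is a minimal sketch of what that kind of Selenium loop looks like. The element name "term" and the "Next >" link text are assumptions for illustration, not the exact selectors from our spider:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.ncbi.nlm.nih.gov/pubmed")

# Type the search term into the search box and submit the query.
# The element name "term" is an assumption for illustration.
search_box = driver.find_element(By.NAME, "term")
search_box.send_keys("ALS")
search_box.send_keys(Keys.RETURN)

# Click through result pages until there is no "Next" link left.
while True:
    page_source = driver.page_source  # hand this off to the parser
    next_links = driver.find_elements(By.LINK_TEXT, "Next >")
    if not next_links:
        break
    next_links[0].click()

driver.quit()
```

Driving a real browser like this is robust against JavaScript-heavy pages, but every page load pays the full rendering cost, which is where the slowness comes from.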

The problem with the desktop version of PubMed was that the URL stays the same regardless of the search term or page number; it always looks like this: http://www.ncbi.nlm.nih.gov/pubmed. Manipulating the URL was simply not possible. We then discovered that the mobile version of the site generates its URLs systematically for each result page. For example, when we search for all authors whose surname starts with “Aa”, the URL looks like this: http://www.ncbi.nlm.nih.gov/m/pubmed/?term=aa*[Author]&page=2. This means we can simply construct the URL ourselves and increment the page number as many times as needed. This new solution makes the crawling process roughly 20 times faster.
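Since the mobile site puts both the search term and the page number in the query string, the result pages can be fetched with plain HTTP requests and no browser at all. Below is a minimal sketch using the requests library; the function name, page limit, and one-second delay are our own illustrative choices, not part of the site's API:

```python
import time
import requests

BASE_URL = "http://www.ncbi.nlm.nih.gov/m/pubmed/"

def fetch_result_pages(term, max_pages):
    """Yield the raw HTML of each result page for a given search term.

    The mobile site exposes the term and page number in the query
    string, so we can walk the result pages without automating a browser.
    """
    for page in range(1, max_pages + 1):
        response = requests.get(BASE_URL, params={"term": term, "page": page})
        response.raise_for_status()
        yield response.text
        time.sleep(1)  # be polite: throttle requests between pages

# Example: the first few result pages for authors whose surname starts with "Aa"
for html in fetch_result_pages("aa*[Author]", max_pages=3):
    print(len(html), "bytes fetched")
```

Skipping the browser and its page rendering is what accounts for most of the speed-up.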

While testing the new solution, we ran into a slight problem: we received an “Access denied due to possible misuse” message. We first suspected a temporary ban for not having throttled the crawler, but a little googling revealed that the problem was not limited to us; users all over the world were unable to access the site and were seeing the same “access denied” message. The error appeared 10 seconds into our test, and we are not sure whether we caused it or whether it was simply a coincidence. The site was unusable for exactly 60 minutes, for us and for the rest of the world. Below is a screen capture of Twitter at the time of the event.
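Throttling the crawler and backing off whenever the server refuses a request would guard against this kind of scare. Here is a rough sketch of such logic; the function name, retry count, and delay values are placeholders rather than anything from our actual spider:

```python
import time
import requests

def polite_get(url, params=None, max_retries=5, base_delay=2.0):
    """Fetch a URL, backing off and retrying if the server refuses us."""
    for attempt in range(max_retries):
        response = requests.get(url, params=params)
        if response.status_code == 200 and "Access denied" not in response.text:
            return response
        # Wait progressively longer before retrying (exponential backoff).
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("Gave up after repeated 'Access denied' responses")
```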

[Screenshot: Twitter users reporting the PubMed “access denied” outage]

Things are now looking good once again and we will focus on improving our algorithm this week.

All the best,

Jussi, Joni, Joona & Miikka
