We have made a lot of progress since our last post. The crawler is complete, and all the functionality for our proof of concept is in place. The crawler can now crawl PubMed in its entirety and compare the words in each article's abstract against the search words in our database. When there is a match, the metadata of the potentially interesting article is saved into our database.
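To give a rough idea of the flow described above, here is a minimal sketch in Python. It is not our actual crawler code: the search words, the article fields (`pmid`, `title`, `abstract`) and the in-memory `store` list standing in for the database are all assumptions made for illustration.

```python
import re

# Hypothetical search words; in the real project they come from the database.
SEARCH_WORDS = ["aspirin", "insulin", "statin"]


def tokenize(abstract):
    """Lowercase the abstract and split it into alphabetic words."""
    return re.findall(r"[a-z]+", abstract.lower())


def find_matches(abstract, search_words):
    """Return the search words found in the abstract, in order of first appearance."""
    matches = []
    for word in tokenize(abstract):
        if word in search_words and word not in matches:
            matches.append(word)
    return matches


def process_article(article, search_words, store):
    """Append the article's metadata to `store` if its abstract matches any search word."""
    matches = find_matches(article["abstract"], search_words)
    if matches:
        # `store` is a plain list standing in for the database table.
        store.append(
            {"pmid": article["pmid"], "title": article["title"], "matches": matches}
        )
    return bool(matches)
```

Calling `process_article` once per crawled article leaves only the matching articles in `store`. The sketch matches single words only; multi-word search terms or stemming would need extra handling.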
In terms of future development, the most challenging aspect of this project is optimisation. The crawler's bottleneck is the word-checking function, where every word of the abstract is compared against a list of thousands of words. Granted, our cloud server is quite limited in processing power, so upgrading its CPU would most likely improve crawling performance. A distributed processing system would also be worth considering.
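One software-side option worth exploring, besides better hardware, is to replace the list scan with hash-based lookups. The sketch below is in Python and only illustrates the idea under our own assumptions: the function names and the generated `term0` ... `term4999` search words are hypothetical, and we have not measured how much this would help on our server.

```python
def match_with_list(abstract_words, search_words):
    """Baseline: each abstract word is checked against the list entry by entry."""
    return [w for w in abstract_words if w in search_words]  # linear scan per word


def match_with_set(abstract_words, search_set):
    """Set intersection: each lookup is a hash probe instead of a scan of the list."""
    return set(abstract_words) & search_set


# Hypothetical data: 5000 search words, built once and reused for every abstract.
search_words = [f"term{i}" for i in range(5000)]
search_set = frozenset(search_words)

abstract_words = "a study of term42 and term4999 in mice".split()
```

With a list, the cost of checking one abstract grows with the number of abstract words times the number of search words. With a set built once up front, it grows roughly with the number of abstract words alone, since average lookup time does not depend on how many search words there are.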
In terms of schedule, our project is now only a couple of weeks away from completion. Over the last week or so, we have worked on refactoring and documenting our code, and we have decided to dedicate the remaining hours of the project to further documentation and testing. We plan to deliver a well-documented and comprehensive product to our customer, which should hopefully make future development a breeze for new developers.
We will write one more devblog post at the very end of our project in late April to wrap everything up.
Until then, we wish you all a nice spring.
Jussi, Joni, Joona & Miikka