What is the most efficient language for web scraping?

We ran this small test to find out which programming language is the most efficient for web scraping in terms of speed, CPU usage and RAM usage. To keep the comparison fair, we wrote all of the scraping scripts in the same manner and ran each of them in a single thread. Each scraper ran for 10 minutes on the same machine, at almost the same time. The test machine was Linux Ubuntu 14.04 (under VirtualBox) with 1 CPU core and 4 GB of RAM.

We compared the following: the Diggernaut meta-language (based on a Golang backend), Perl, PHP5, Python 2.7, Python + Scrapy, and Ruby. As the target we used the U.S. Department of Health & Human Services website.
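For readers who want to reproduce this kind of measurement, here is a minimal sketch of a single-threaded, time-boxed benchmark loop in Python. It is not the harness we used. The function `run_benchmark`, the stand-in `fake_fetch` and the `example.invalid` URLs are all hypothetical names for illustration, and the `resource` module is Unix-only. CPU usage is approximated as the process's CPU time divided by wall-clock time.

```python
import resource  # Unix-only; ru_maxrss is reported in kilobytes on Linux
import time


def run_benchmark(fetch_page, urls, duration_s):
    """Fetch URLs one at a time until the list or the time budget runs out."""
    start_wall = time.monotonic()
    start_cpu = time.process_time()
    pages = 0
    for url in urls:
        if time.monotonic() - start_wall >= duration_s:
            break
        fetch_page(url)
        pages += 1
    wall = time.monotonic() - start_wall
    cpu = time.process_time() - start_cpu
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {
        "pages": pages,
        "cpu_percent": 100.0 * cpu / wall if wall > 0 else 0.0,
        "peak_rss_mb": peak_kb / 1024.0,
    }


def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request plus parsing."""
    time.sleep(0.01)
    return "<html></html>"


# Assumed placeholder URLs; a real run would use the target site's pages
# and a 600-second budget to match the 10-minute test.
urls = ["http://example.invalid/page/%d" % i for i in range(5)]
result = run_benchmark(fake_fetch, urls, duration_s=600)
print(result)
```

Swapping `fake_fetch` for a real fetch-and-parse function gives the three numbers compared in this post: pages fetched, CPU usage and peak RAM.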

Let's look at the speed chart:

chart1

As you can see, there are three leaders: Diggernaut fetched almost 3K pages, Ruby approximately 2.5K, and Python + Scrapy approximately 1.5K. The other languages were much slower.

However, if we look at the CPU usage chart, we see a slightly different picture:

chart2

First place here goes to PHP5, which used just 2.5% of the CPU, followed by Diggernaut with 3.5% and Perl with approximately 4%. The other languages are close behind, except Python + Scrapy: we think 11% is far too much.

The last parameter we measured is RAM usage:

chart3

The winner here is Diggernaut with 26 MB, followed by Perl with 29 MB and PHP5 with 39 MB. Ruby is the outlier with 154 MB of RAM usage.

To summarize the measurements, we scored each language on a 100-point scale. Each metric is scored separately (the best result gets 100 points, the worst gets 0 points), and the final score is the average of the three.
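The scoring can be sketched in a few lines of Python. Note one assumption: the text above only fixes the endpoints (best = 100, worst = 0), so this sketch assumes results in between are scaled linearly (min-max scaling). The function names and the input numbers for `scraper_a`, `scraper_b` and `scraper_c` are made up for illustration and are not the measured results.

```python
def score_metric(values, higher_is_better):
    """Scale one metric to 0..100 (assumes linear min-max scaling)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All results equal: everyone ties for best.
        return {name: 100.0 for name in values}
    scores = {}
    for name, value in values.items():
        frac = (value - lo) / (hi - lo)
        scores[name] = 100.0 * (frac if higher_is_better else 1.0 - frac)
    return scores


def overall(metrics):
    """Average the per-metric scores for each scraper."""
    per_metric = [score_metric(vals, hib) for vals, hib in metrics]
    names = per_metric[0].keys()
    return {n: sum(s[n] for s in per_metric) / len(per_metric) for n in names}


# Made-up illustrative numbers, NOT the results from the charts.
metrics = [
    ({"scraper_a": 3000, "scraper_b": 2000, "scraper_c": 1000}, True),    # pages fetched
    ({"scraper_a": 4.0, "scraper_b": 2.0, "scraper_c": 10.0}, False),     # CPU %
    ({"scraper_a": 20.0, "scraper_b": 60.0, "scraper_c": 100.0}, False),  # RAM, MB
]
totals = overall(metrics)
print(totals)
```

For speed a higher value is better, while for CPU and RAM a lower value is better, which is why each metric carries its own `higher_is_better` flag.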

chart4

Diggernaut, with its Golang backend, is the clear winner in this run. We should also mention that developing the Diggernaut scraper generally took 1.5 to 2 times less time than writing the other scrapers.

We have attached the files we used for the test, so you can run them and check the results yourself: scripts

5 comments

    • Evgeniy Solomanidin

      Async may not be so fair in this comparison, but we plan to make a test for it. If you can help us write the scripts, you are welcome. In the multi-threading test we will use our own site as the source, as we don’t want to abuse other sites by hammering them.

  • Andrei

    I had a look at the Perl script (since I’m a Perl dev) and I saw that you guys loaded the ‘Data::Dumper’ module, probably for debugging purposes, but I haven’t seen any use of its Dumper function. Since Data::Dumper is seen as a heavy module, removing it (line no. 8) from your script will bring the startup time and memory usage down a bit in favor of Perl. Also, which version of Perl are you using? For example, the latest ones (5.16 and above) are noticeably faster than 5.8.8. Cheers
