Benchmarking Python PageRank calculation methods

Aka how to benchmark Python functions ?

It is no secret, PageRank has been and is still used by Google to compare pages popularity amongst the web.
When auditing a website, internal PageRank is a valuable information: it shows the most efficiently linked pages, and helps detect problem in your internal linking.
However, when working on high volumes of pages, this kind of calculus can take a fair bit of time.

There are a few methods to calculate PageRank in Python. We’ll be benchmarking three:

TL;DR

Benchmarking how-to

Along with comparing PageRank functions, what I was interested in when writing this post is defining a process for benchmarking code in Python.

A lot of parameters can influence the results of a benchmark. Number of tests, other processes running, garbage collection, versions of the language and the libs used, …
Moreover, I believe it is essential to be as precise as possible about which part of your code you want to benchmark.
These concepts are clearly explained in this post by Peter Bengtsson, with a few examples.
For this benchmark, I used a slightly modified version of his boilerplate.

My setup

I used Python 3.5.3, and updated each library used for this benchmark. Here is my requirements.txt file:

networkx==2.1  
numpy==1.14.0  
python-igraph==0.7.1.post6  
scipy==1.0.0  
statistics==1.0.3.5  

For more realistic graph generation, I generated for each run of the benchmark a single random directed graph, with a fixed number of nodes and a max number of edges.
I thought the number of nodes and edges in the graph would infuence the results, so I added a few parameters to test these two variables. As you will see in the results, I tested different sizes of websites, from very small ones to very big ones, with an increasing number of possible edges.
The networks created could have reciprocal links but no self-links (a node could not link to itself).
Each pair of values (nodes and links) would be tested 500 times.

Most importantly, I focused the benchmark on the time taken by the pagerank() functions themselves, and not by the graph generation.

Finally, for a more stable system, I ran these tests on a dedicated server, with no other meaningful process running.

The results

Here’s a big, ugly table with the results (in milliseconds):

nodesedgesnx_prnx_prnumpynx_prscipyigraph_prthomas_pr
10010027,1013,564,880,062,67
10020032,2235,364,990,157,23
10050031,4735,894,170,307,80
1001 00040,8435,502,890,2711,41
1002 00057,2435,393,820,2919,83
1005 00091,3435,765,300,3440,96
1 0001 000204,153 436,297,270,7328,34
1 0002 000201,484 148,387,420,9940,22
1 0005 000242,694 186,9010,521,5473,29
1 00010 000370,894 077,2916,852,32126,68
1 00020 000640,344 026,7160,513,77233,66
1 00050 0001 323,504 026,97192,197,91561,46
10 00010 0001 368,85869 861,7841,818,60287,04
10 00020 0001 539,521 184 237,2380,2411,63417,16
10 00050 0002 293,451 211 680,18141,2316,61799,00
10 000100 0003 258,011 194 091,98278,3225,061 488,96
10 000200 0006 257,781 202 512,00556,3744,562 911,54
10 000500 00012 798,451 193 336,291 776,31118,337 267,36

TL;DR

iGraph is way faster than the others. This is not a surprise as a lot of this lib is written in C. If speed is your only concern, just go for it.

However, iGraph might not be the friendliest lib. Even though it’s way slower, NetworkX is much more flexible for other use cases, like adding attributes to nodes. If you have to create complex nodes, NetworkX’s pagerank_scipy() method could be a good compromise.
Finally, writing your own function allows you to tweak the algorithm to try and follow Google’s evolutions, and still be quite fast. Furthermore, you don’t need to create a complex Object (ie Graph) which will surely save RAM.

I hope you’ll find this post useful ! Please share any comments both on the benchmark methodology or on the results !

Let's work together !

Contact me !