Benchmarking Python PageRank calculation methods

Aka how to benchmark Python functions ?

By Julien on March 9, 2018

It is no secret, PageRank has been and is still used by Google to compare pages popularity amongst the web.
When auditing a website, internal PageRank is a valuable information: it shows the most efficiently linked pages, and helps detect problem in your internal linking.
However, when working on high volumes of pages, this kind of calculus can take a fair bit of time.

There are a few methods to calculate PageRank in Python. We’ll be benchmarking three:

iGraph’s function,
Three functions from the NetworkX lib,
and a home-made function courtesy of Thomas Largillier.

TL;DR

Benchmarking how-to

Along with comparing PageRank functions, what I was interested in when writing this post is defining a process for benchmarking code in Python.

A lot of parameters can influence the results of a benchmark. Number of tests, other processes running, garbage collection, versions of the language and the libs used, …
Moreover, I believe it is essential to be as precise as possible about which part of your code you want to benchmark.
These concepts are clearly explained in this post by Peter Bengtsson, with a few examples.
For this benchmark, I used a slightly modified version of his boilerplate.

My setup

I used Python 3.5.3, and updated each library used for this benchmark. Here is my requirements.txt file:

networkx==2.1  
numpy==1.14.0  
python-igraph==0.7.1.post6  
scipy==1.0.0  
statistics==1.0.3.5

For more realistic graph generation, I generated for each run of the benchmark a single random directed graph, with a fixed number of nodes and a max number of edges.
I thought the number of nodes and edges in the graph would infuence the results, so I added a few parameters to test these two variables. As you will see in the results, I tested different sizes of websites, from very small ones to very big ones, with an increasing number of possible edges.
The networks created could have reciprocal links but no self-links (a node could not link to itself).
Each pair of values (nodes and links) would be tested 500 times.

Most importantly, I focused the benchmark on the time taken by the pagerank() functions themselves, and not by the graph generation.

Finally, for a more stable system, I ran these tests on a dedicated server, with no other meaningful process running.

The results

Here’s a big, ugly table with the results (in milliseconds):

nodes	edges	nx_pr	nx_prnumpy	nx_prscipy	igraph_pr	thomas_pr
100	100	27,10	13,56	4,88	0,06	2,67
100	200	32,22	35,36	4,99	0,15	7,23
100	500	31,47	35,89	4,17	0,30	7,80
100	1 000	40,84	35,50	2,89	0,27	11,41
100	2 000	57,24	35,39	3,82	0,29	19,83
100	5 000	91,34	35,76	5,30	0,34	40,96
1 000	1 000	204,15	3 436,29	7,27	0,73	28,34
1 000	2 000	201,48	4 148,38	7,42	0,99	40,22
1 000	5 000	242,69	4 186,90	10,52	1,54	73,29
1 000	10 000	370,89	4 077,29	16,85	2,32	126,68
1 000	20 000	640,34	4 026,71	60,51	3,77	233,66
1 000	50 000	1 323,50	4 026,97	192,19	7,91	561,46
10 000	10 000	1 368,85	869 861,78	41,81	8,60	287,04
10 000	20 000	1 539,52	1 184 237,23	80,24	11,63	417,16
10 000	50 000	2 293,45	1 211 680,18	141,23	16,61	799,00
10 000	100 000	3 258,01	1 194 091,98	278,32	25,06	1 488,96
10 000	200 000	6 257,78	1 202 512,00	556,37	44,56	2 911,54
10 000	500 000	12 798,45	1 193 336,29	1 776,31	118,33	7 267,36

TL;DR

iGraph is way faster than the others. This is not a surprise as a lot of this lib is written in C. If speed is your only concern, just go for it.

However, iGraph might not be the friendliest lib. Even though it’s way slower, NetworkX is much more flexible for other use cases, like adding attributes to nodes. If you have to create complex nodes, NetworkX’s pagerank_scipy() method could be a good compromise.
Finally, writing your own function allows you to tweak the algorithm to try and follow Google’s evolutions, and still be quite fast. Furthermore, you don’t need to create a complex Object (ie Graph) which will surely save RAM.

I hope you’ll find this post useful ! Please share any comments both on the benchmark methodology or on the results !

Benchmarking Python PageRank calculation methods

Benchmarking how-to

My setup

The results

TL;DR

Let's work together !