It is no secret, PageRank has been and is still used by Google to compare pages popularity amongst the web.
When auditing a website, internal PageRank is a valuable information: it shows the most efficiently linked pages, and helps detect problem in your internal linking.
However, when working on high volumes of pages, this kind of calculus can take a fair bit of time.
There are a few methods to calculate PageRank in Python. We’ll be benchmarking three:
- iGraph’s function,
- Three functions from the NetworkX lib,
- and a home-made function courtesy of Thomas Largillier.
Along with comparing PageRank functions, what I was interested in when writing this post is defining a process for benchmarking code in Python.
A lot of parameters can influence the results of a benchmark. Number of tests, other processes running, garbage collection, versions of the language and the libs used, …
Moreover, I believe it is essential to be as precise as possible about which part of your code you want to benchmark.
These concepts are clearly explained in this post by Peter Bengtsson, with a few examples.
For this benchmark, I used a slightly modified version of his boilerplate.
I used Python 3.5.3, and updated each library used for this benchmark. Here is my
networkx==2.1 numpy==1.14.0 python-igraph==0.7.1.post6 scipy==1.0.0 statistics==188.8.131.52
For more realistic graph generation, I generated for each run of the benchmark a single random directed graph, with a fixed number of nodes and a max number of edges.
I thought the number of nodes and edges in the graph would infuence the results, so I added a few parameters to test these two variables. As you will see in the results, I tested different sizes of websites, from very small ones to very big ones, with an increasing number of possible edges.
The networks created could have reciprocal links but no self-links (a node could not link to itself).
Each pair of values (nodes and links) would be tested 500 times.
Most importantly, I focused the benchmark on the time taken by the
pagerank() functions themselves, and not by the graph generation.
Finally, for a more stable system, I ran these tests on a dedicated server, with no other meaningful process running.
Here’s a big, ugly table with the results (in milliseconds):
|1 000||1 000||204,15||3 436,29||7,27||0,73||28,34|
|1 000||2 000||201,48||4 148,38||7,42||0,99||40,22|
|1 000||5 000||242,69||4 186,90||10,52||1,54||73,29|
|1 000||10 000||370,89||4 077,29||16,85||2,32||126,68|
|1 000||20 000||640,34||4 026,71||60,51||3,77||233,66|
|1 000||50 000||1 323,50||4 026,97||192,19||7,91||561,46|
|10 000||10 000||1 368,85||869 861,78||41,81||8,60||287,04|
|10 000||20 000||1 539,52||1 184 237,23||80,24||11,63||417,16|
|10 000||50 000||2 293,45||1 211 680,18||141,23||16,61||799,00|
|10 000||100 000||3 258,01||1 194 091,98||278,32||25,06||1 488,96|
|10 000||200 000||6 257,78||1 202 512,00||556,37||44,56||2 911,54|
|10 000||500 000||12 798,45||1 193 336,29||1 776,31||118,33||7 267,36|
iGraph is way faster than the others. This is not a surprise as a lot of this lib is written in C. If speed is your only concern, just go for it.
However, iGraph might not be the friendliest lib. Even though it’s way slower, NetworkX is much more flexible for other use cases, like adding attributes to nodes. If you have to create complex nodes, NetworkX’s
pagerank_scipy() method could be a good compromise.
Finally, writing your own function allows you to tweak the algorithm to try and follow Google’s evolutions, and still be quite fast. Furthermore, you don’t need to create a complex Object (ie Graph) which will surely save RAM.
I hope you’ll find this post useful ! Please share any comments both on the benchmark methodology or on the results !