Crawl Analysis in Python

Also known as an introduction to Pandas and Jupyter Notebooks for SEOs.

By Julien on January 29, 2018

Crawling is one of the most common tasks in technical SEO. However, analyzing a crawl can take quite a long time, especially when working on high-volume websites.
Good tools help you go faster and open the door to more advanced analysis. Speaking of advanced analysis, I believe Python is one of the best tools I have at my disposal.

In this post, I want to show you some basic crawl analysis in Python. I crawled the Python Software Foundation Website using Screaming Frog SEO Spider, and simply exported some data to CSV files.

In this GitLab project, you’ll find a Jupyter Notebook that walks through some basic data analysis using Pandas, a very useful Python library providing data structures and analysis tools.
Using Pandas, it’s quite easy to filter data, generate charts or include additional data, which are common tasks for SEOs.
For example, in this notebook, with a few lines of code, we’ll:

count URLs per response code,
generate and save charts,
categorize URLs,
export data to CSV,
…

Moreover, you’ll be able to automate these steps, which saves a lot of time.
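To give an idea of what these steps can look like, here is a minimal sketch, not the notebook's actual code. It assumes the crawl export has an "Address" column and a "Status Code" column (check the header of your own CSV), uses a tiny inline sample instead of a real export, and categorizes URLs by their first path segment, which is only one possible rule. The `categorize` helper is a hypothetical name introduced for illustration.

```python
import io
from urllib.parse import urlparse

import pandas as pd

# Tiny inline stand-in for a crawl export. The column names ("Address",
# "Status Code") are assumptions; adapt them to your own file.
SAMPLE_CSV = """Address,Status Code
https://www.python.org/,200
https://www.python.org/about/,200
https://www.python.org/downloads/,200
https://www.python.org/downloads/release/old/,404
https://www.python.org/blogs,301
"""

crawl = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Count URLs per response code.
status_counts = crawl["Status Code"].value_counts()


def categorize(url):
    """Return the first path segment of a URL, or "home" for the root."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else "home"


# Categorize URLs (hypothetical rule: first directory of the path).
crawl["Category"] = crawl["Address"].apply(categorize)
category_counts = crawl["Category"].value_counts()

# Export to CSV. With no path given, to_csv returns the CSV as a string;
# pass a file name instead to write it to disk.
csv_text = crawl.to_csv(index=False)

print(status_counts)
print(category_counts)
```

Once the steps live in a script or a notebook cell like this, re-running them on a fresh crawl is only a matter of pointing `pd.read_csv` at the new file.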

How to use this notebook

Download or clone the repository, then open a Terminal (Mac/Linux) or Command Prompt (Windows) in the screamingfrog-python-analysis directory.

You’ll need to install the requirements first:

pip install -r requirements.txt

Once everything is OK, simply run:

jupyter notebook

This should open a new tab in your browser. You can also access your notebooks at http://localhost:8888.

You can run the code with your own data by replacing the Screaming Frog exports in the data directory.
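If you want to check that your own exports load correctly before running the whole notebook, something along these lines may help. It is a sketch under assumptions: `load_export` and `load_all_exports` are hypothetical helpers, the file name `internal_all.csv` is only an example, and the `skiprows` parameter is there because some exports may place a title line above the header row. The demo writes a throwaway file to a temporary directory so that the block runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_export(path, skiprows=0):
    """Load one crawl export CSV into a DataFrame.

    Set skiprows=1 if your export has a title line above the header row.
    """
    return pd.read_csv(path, skiprows=skiprows)


def load_all_exports(data_dir, skiprows=0):
    """Return a {file stem: DataFrame} dict for every CSV in data_dir."""
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    return {p.stem: load_export(p, skiprows) for p in csv_paths}


# Demo with a temporary directory standing in for the data directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "internal_all.csv"  # hypothetical file name
    sample.write_text(
        "Address,Status Code\nhttps://www.python.org/,200\n",
        encoding="utf-8",
    )
    exports = load_all_exports(tmp)

print(list(exports))
print(exports["internal_all"].columns.tolist())
```

In practice you would call `load_all_exports("data")` from the project directory and compare the printed column names with the ones the notebook expects.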

About Jupyter notebooks

Notebooks contain both live code and text elements. They are very useful to test, explore data or explain code.
Learn how to use them with this notebooks basics guide.

What’s next?

This example notebook mostly shows basic use of the Pandas library. The most advanced users might be disappointed, but I hope it suits beginners.
Either way, I urge you to try these techniques, as this is, in my opinion, one of the best ways to progress toward more advanced Python usage.
There are loads of Python libraries you can use to go further, whether you’re into NLP, graph analysis, machine learning, …
