I would like to spider a few blogs and programmatically analyze their HTML and CSS-based layouts to see, e.g., whether the sidebar is to the left or right of the main content, how many columns there are, and how wide they are.

How would I do this the best way? Are there any tools or libraries I can use?

(I would prefer a solution in Python or PHP.)


It sounds difficult to do this generically. You might be helped by your constraint of only checking blogs, because there may be some uniformity — if, for example, they're using a known template.

Written by thirtydot

Accepted Answer

This sounds like an extremely hard task to do using pure server-side CSS and HTML parsing - you would effectively have to recreate the browser's rendering engine to get reliable results.

Depending on what you need this for, I can think of an approach along these lines:

  • Fetch pages and style sheets using something like wget with --page-requisites

  • Then either:

    • Walk through each downloaded page with a tool like Selenium, locate the relevant elements, and read their positions (Selenium exposes each element's location and size)

    • Create a piece of jQuery that you inject into each of the downloaded pages. The jQuery searches for elements named "sidebar", "toolbar", etc., gets their positions, sends the results back via an AJAX call to a local endpoint, and continues to the next downloaded page. You only need to open the first page in the browser; the rest happens automatically. Not trivial to implement, but possible.
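The Selenium route above could be sketched in Python roughly as follows. This is a sketch under assumptions: it requires `pip install selenium` plus a matching browser driver, and the selectors (`#sidebar`, `#content`, etc.) are guesses — real blogs will need per-template tweaking. The left/right decision itself is kept in a pure helper so it can be tested without a browser.

```python
def classify_sidebar(content_x, content_width, sidebar_x):
    """Pure helper: decide whether the sidebar sits left or right of
    the main content, given their horizontal offsets in pixels."""
    if sidebar_x >= content_x + content_width:
        return "right"
    if sidebar_x < content_x:
        return "left"
    return "overlapping"

def measure_layout(url):
    """Load a page in a real browser and measure candidate elements.
    Requires Selenium and a browser driver; the selectors are guesses."""
    # Deferred import so classify_sidebar() is usable without Selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        content = driver.find_element("css selector", "#content, .content")
        sidebar = driver.find_element("css selector", "#sidebar, .sidebar")
        # element.location is {"x": ..., "y": ...}; element.size is
        # {"width": ..., "height": ...}, both as rendered by the browser.
        side = classify_sidebar(
            content.location["x"], content.size["width"],
            sidebar.location["x"])
        return {"sidebar_side": side,
                "content_width": content.size["width"],
                "sidebar_width": sidebar.size["width"]}
    finally:
        driver.quit()
```

Because the browser does the rendering, this sidesteps the "recreate the rendering engine" problem entirely — floats, flexbox, and template quirks are all resolved before you read the coordinates.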

If you can use a client-side application platform like .NET, you may be better off building a custom application that incorporates a browser control, whose DOM you can access more freely than through jQuery alone.
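For the fetch step, as a lighter-weight alternative to `wget --page-requisites`, here is a standard-library Python sketch that pulls out the stylesheet URLs a page references, so they can be downloaded alongside the HTML. The function names are illustrative, not from any particular library.

```python
# Sketch: collect the stylesheet URLs referenced by a page's HTML,
# a minimal stand-in for wget's --page-requisites CSS handling.
from html.parser import HTMLParser
from urllib.parse import urljoin

class StylesheetFinder(HTMLParser):
    """Collects resolved href values of <link rel="stylesheet"> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and "stylesheet" in (a.get("rel") or "").lower():
            # Resolve relative hrefs against the page URL.
            self.stylesheets.append(urljoin(self.base_url, a.get("href", "")))

def find_stylesheets(html, base_url):
    finder = StylesheetFinder(base_url)
    finder.feed(html)
    return finder.stylesheets
```

For example, feeding it a page containing `<link rel="stylesheet" href="/css/main.css">` with base URL `http://example.com/blog/` yields `["http://example.com/css/main.css"]`. Note this only gathers the files; working out the rendered layout from the raw CSS still runs into the rendering-engine problem described above.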

Written by Pekka
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki