I would like to spider a few blogs and programmatically analyze their HTML- and CSS-based layouts to see, for example, whether the sidebar is to the left or right of the main content, how many columns there are, and how wide they are.
What is the best way to do this? Are there any tools or libraries I can use?
(I would prefer a solution in Python or PHP.)
This sounds like an extremely hard task to do with pure server-side CSS and HTML parsing: you would effectively have to recreate the browser's rendering engine to get reliable results.
Depending on what you need this for, I could think of a way somewhere along these lines:
Fetch pages and style sheets using something like
Walk through each downloaded page using a tool like Selenium, search for elements by name and output their positions (I assume Selenium can report element positions, but I do not know for sure).
Create a piece of jQuery that you inject into each of the downloaded pages. The jQuery searches for elements named "sidebar", "toolbar" etc., gets their positions, sends the results via AJAX to a local script, and continues to the next downloaded page. You only need to open the first page in the browser; the rest will happen automatically. Not trivial to implement, but possible.
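To make the Selenium idea more concrete, here is a minimal sketch in Python, since the question prefers it. It assumes the Selenium Python bindings and a Firefox driver are installed. The selectors `#content` and `#sidebar` and the function names `classify_layout` and `measure_page` are made up for illustration; real blogs will use different IDs or class names, so you would have to try several candidates per page.

```python
def classify_layout(main, sidebar):
    """Compare two bounding boxes (dicts with x, y, width, height).

    Decides the sidebar side by comparing horizontal centers, which is
    a simplifying assumption: it ignores stacked or overlapping layouts.
    """
    main_center = main["x"] + main["width"] / 2
    sidebar_center = sidebar["x"] + sidebar["width"] / 2
    side = "left" if sidebar_center < main_center else "right"
    return {
        "sidebar_side": side,
        "main_width": main["width"],
        "sidebar_width": sidebar["width"],
    }


def measure_page(url, main_selector="#content", sidebar_selector="#sidebar"):
    """Render the page in a real browser and measure two elements.

    The default selectors are hypothetical placeholders.
    """
    # Imported here so classify_layout can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # WebElement.rect holds the rendered position and size in pixels.
        main = driver.find_element(By.CSS_SELECTOR, main_selector).rect
        sidebar = driver.find_element(By.CSS_SELECTOR, sidebar_selector).rect
        return classify_layout(main, sidebar)
    finally:
        driver.quit()
```

Calling `measure_page` with a blog URL would open a browser, load the page, and return the measured widths plus the side the sidebar is on. Because the browser does the rendering, floats, absolute positioning, and external style sheets are all taken into account, which is the part that is so hard to reproduce on the server.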
If you can use a client-side application platform like .NET, you may be better off building a custom application that incorporates a browser control, whose DOM you can access more freely than with jQuery alone.
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki