Google App Engine (GAE) lets you build and run applications on Google's infrastructure. I have built two applications with GAE and Python. The previous version of my tool for fetching Blogger avatars (avafavico) used regular expressions to parse the Blogger profile page for the profile photo or, if no profile photo was found, the favicon of the first blog the user contributes to. As you might know, regular expressions are not the best way to pull data out of HTML. It is better to parse the page into a document object model (DOM) tree and query that, because the code is then less sensitive to changes in the page's HTML. Of course, I did write the regular expressions so that they should tolerate variations in the HTML.
Previously I used the GAE Python 2.5 environment, which was the default and supported version. On February 27th Python 2.7 became fully supported. GAE with Python 2.7 bundles more external libraries than version 2.5 did, one of them being lxml.
I searched for different solutions and examples, but there were not many. With GAE Python 2.5 one can use BeautifulSoup, but it has some issues (the problematic 3.1.0 version, the uncertain development future of BS, etc.). There is also minidom, but it may not handle broken HTML well. The Blogger profile page should not have broken HTML, but you never know. lxml is the better choice here: it is fast, it copes with broken HTML, and it supports XPath.
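To illustrate the broken-HTML point, here is a small sketch comparing the two parsers on a snippet whose img tag is never closed. The sample string and the function names are made up for this example, and the code is written to run on both Python 2.7 and Python 3.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

from lxml import etree

# Made-up sample: the <img> tag is never closed, which is normal in HTML
# but is not well-formed XML.
BROKEN_HTML = '<div><img id="profile-photo" src="photo.png"></div>'


def src_with_minidom(html):
    """Return the first img src using minidom, or None if parsing fails."""
    try:
        dom = minidom.parseString(html)
    except ExpatError:
        # minidom is a strict XML parser and rejects the unclosed tag.
        return None
    images = dom.getElementsByTagName("img")
    return images[0].getAttribute("src") if images else None


def src_with_lxml(html):
    """Return the first img src using lxml's forgiving HTML parser, or None."""
    tree = etree.HTML(html)
    if tree is None:
        return None
    found = tree.xpath("//img/@src")
    return found[0] if found else None
```

With this sample, minidom gives up on the unclosed tag, while lxml repairs the markup and still finds the src attribute.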
The day before yesterday I updated the GAE app to use Python 2.7 and lxml. I found few to no examples of using python27, lxml and GAE together, so I'll show a working example here. I started with the file app.yaml: I changed runtime to python27, added "threadsafe" (set to false), and added lxml (latest) to the libraries section. I also increased the version number to 5. Now app.yaml looks like this:
In blogava.py I added "from lxml import etree" and then used etree functions instead of regular expressions to find things in the HTML. Here is how the DOM tree is constructed (the variable "result" contains the page HTML as a string) and how an XPath query is then applied to the tree:
>>> tree = etree.HTML(result)
>>> r = tree.xpath("//img[@id='profile-photo']/@src")
This selects the img tag whose id is "profile-photo" and returns that tag's src attribute (xpath returns a list, so r[0] is the value when there is a match). In the full script, if no img with id='profile-photo' is found, the search falls back to the first image that has class "photo". If both fail, it looks for the first "contributed-to" blog and uses its favicon; if that is not found either, it uses the Blogger favicon. And here is the blogava.py source file:
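As a rough sketch of the fallback order described above (this is not the actual blogava.py code), the chain could look something like the following. The function name find_avatar, the default favicon URL, the XPath used for the class lookup, and the assumption that "contributed-to" appears in a link's rel attribute are all illustrative guesses; check them against the real profile page markup.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7, as on GAE

from lxml import etree

# Assumed URL: the post only says "Blogger favicon".
DEFAULT_FAVICON = "http://www.blogger.com/favicon.ico"

# Marker text taken from the post's wording; where it appears in the
# markup (here: a link's rel attribute) is an assumption.
CONTRIBUTED_MARKER = "contributed-to"

# Tried in order: profile photo by id, then first image with class "photo".
IMAGE_QUERIES = (
    "//img[@id='profile-photo']/@src",
    "//img[contains(concat(' ', normalize-space(@class), ' '), ' photo ')]/@src",
)


def find_avatar(page_html):
    """Return an avatar URL for a profile page, trying each fallback in turn."""
    try:
        tree = etree.HTML(page_html)
    except etree.XMLSyntaxError:
        tree = None
    if tree is None:
        return DEFAULT_FAVICON

    for query in IMAGE_QUERIES:
        found = tree.xpath(query)
        if found:
            return found[0]

    # Next fallback: favicon of the first blog the user contributes to.
    links = tree.xpath("//a[contains(@rel, $marker)]/@href",
                       marker=CONTRIBUTED_MARKER)
    if links:
        return urljoin(links[0], "/favicon.ico")

    # Last resort: the generic Blogger favicon.
    return DEFAULT_FAVICON
```

Keeping the image queries in a tuple makes the order of the fallbacks easy to read and to change if the page markup changes.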
This new version of avafavico has been up and running for two days. I'm very pleased that I got lxml working with GAE; all in all, it was quite easy. I hope this example is useful to someone. If it helped you, please leave a comment. :)