Getting information I need out of XML is one of those tasks that occurs rarely enough that I never get to develop that profound understanding of XML documents and their parsing that we all so desire. It also means that every time I want to work with a piece of XML I have to re-learn the bare minimum of Python XML processing.
To fix this problem I am posting the snippet of bare-minimum, cheap XML processing code I usually come up with, using my latest XML processing problem as an illustration.
Be sure to advise me if I’m doing something dreadfully wrong, or if you’ve come up with a superior code snippet for processing XML that involves less cruft. Also stay tuned to my announcements page to see how this code is going to contribute to automatically adding birthdays to Google Calendar!
And finally, here is how I turn XML into a list of dictionaries, where each dictionary contains the important values of one calendar entry element. The XML looks like this:
<entry>
  <id>http://www.google.com/calendar/feeds/default/private/full/...</id>
  <published>2006-09-15T04:55:44.000Z</published>
  <updated>2006-09-15T04:55:44.000Z</updated>
  <category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/g/2005#event"/>
  <title type="text">Mom's Birthday</title>
  <content type="text"/>
  <link href="http://www.google.com/calendar/event?eid=..." rel="alternate" title="alternate" type="text/html"/>
  <link href="http://www.google.com/calendar/feeds/default/private/full/..." rel="self" type="application/atom+xml"/>
  <link href="http://www.google.com/calendar/feeds/default/private/full/..." rel="edit" type="application/atom+xml"/>
  <author>
    <name>....</name>
    <email>....</email>
  </author>
  <gd:comments>
    <gd:feedLink href="http://www.google.com/..."/>
  </gd:comments>
  <gd:visibility value="http://schemas.google.com/g/2005#event.default"/>
  <gd:eventStatus value="http://schemas.google.com/g/2005#event.confirmed"/>
  <gd:transparency value="http://schemas.google.com/g/2005#event.opaque"/>
  <gCal:sendEventNotifications value="true"/>
  <gd:where valueString=""/>
  <gd:when endTime="2006-09-21" startTime="2006-09-20">
    <gd:reminder minutes="2880"/>
  </gd:when>
</entry>
Here is the code. I just feed this function the raw XML returned by Google Calendar and it gives me a list of dictionaries, where each dictionary holds information from one event, such as title and startTime. It's not that complicated, but I always get hung up on all that firstChild and data stuff, so it's easier for me to copy this snippet and modify it for whatever XML I'm dealing with than to redo it each time.
import xml.dom.minidom

def parse_entries(raw_xml):
    dom = xml.dom.minidom.parseString(raw_xml)  # Make the DOM from the raw XML
    entries = dom.getElementsByTagName('entry')  # Pull out all the entries
    result_entries = []  # Make an empty container to fill up and return
    for entry in entries:
        dentry = {}  # Make an empty dict to hold info on one entry
        # Fill up the dict
        dentry['id'] = entry.getElementsByTagName('id')[0].firstChild.data
        dentry['published'] = entry.getElementsByTagName('published')[0].firstChild.data
        dentry['updated'] = entry.getElementsByTagName('updated')[0].firstChild.data
        dentry['title'] = entry.getElementsByTagName('title')[0].firstChild.data
        try:
            dentry['content'] = entry.getElementsByTagName('content')[0].firstChild.data
        except AttributeError:
            # An empty <content/> element has no children, so firstChild is None
            dentry['content'] = ''
        dentry['startTime'] = entry.getElementsByTagName('gd:when')[0].getAttribute('startTime')
        dentry['endTime'] = entry.getElementsByTagName('gd:when')[0].getAttribute('endTime')
        result_entries.append(dentry)
    return result_entries
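One way to trim the repeated getElementsByTagName(...)[0].firstChild.data chains might be a small helper. The sketch below is only an illustration and makes a few assumptions: text_of, parse_entries_compact and SAMPLE_FEED are made-up names, the namespace URIs on the sample feed are assumed to be the usual Atom and Google Data ones, and it targets Python 3. Unlike the function above, it strips surrounding whitespace and returns an empty string for any missing or empty element.

```python
import xml.dom.minidom


def text_of(node, tag, default=''):
    """Return the stripped text of the first <tag> under node, or default.

    Hypothetical helper: assumes the element's first child, if any,
    is a plain text node (true for id, title, etc. in this feed).
    """
    matches = node.getElementsByTagName(tag)
    if not matches or matches[0].firstChild is None:
        return default
    return matches[0].firstChild.data.strip()


def parse_entries_compact(raw_xml):
    """Turn a feed into a list of dicts, one per <entry>."""
    dom = xml.dom.minidom.parseString(raw_xml)
    result_entries = []
    for entry in dom.getElementsByTagName('entry'):
        dentry = {tag: text_of(entry, tag)
                  for tag in ('id', 'published', 'updated', 'title', 'content')}
        # startTime and endTime are attributes of <gd:when>, not text nodes
        when = entry.getElementsByTagName('gd:when')
        dentry['startTime'] = when[0].getAttribute('startTime') if when else ''
        dentry['endTime'] = when[0].getAttribute('endTime') if when else ''
        result_entries.append(dentry)
    return result_entries


# Cut-down sample feed; the xmlns declarations are assumptions, needed
# because minidom rejects a gd: prefix that is not bound to a namespace.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <id>http://example.com/entry/1</id>
    <published>2006-09-15T04:55:44.000Z</published>
    <updated>2006-09-15T04:55:44.000Z</updated>
    <title type="text"> Mom's Birthday </title>
    <content type="text"/>
    <gd:when endTime="2006-09-21" startTime="2006-09-20"/>
  </entry>
</feed>"""

if __name__ == '__main__':
    for event in parse_entries_compact(SAMPLE_FEED):
        print(event['title'], event['startTime'], event['endTime'])
```

The trade-off is that a missing required element no longer raises an error, so this version suits quick scripts better than anything that needs to notice malformed feeds.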
In the future I'd like to try pyRXP instead, which promises to be 97% faster than minidom and parses XML directly into some kind of mix of tuples and other Python primitives.
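Another route that needs no extra install might be ElementTree from the standard library. This is a rough sketch rather than a tested replacement: parse_entries_et, NS and SAMPLE_FEED_ET are made-up names, and the namespace URIs are assumed to be the usual Atom and Google Data ones, since the fragment above does not show its xmlns declarations.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the xmlns attributes
# on the real feed before relying on this.
NS = {
    'atom': 'http://www.w3.org/2005/Atom',
    'gd': 'http://schemas.google.com/g/2005',
}


def parse_entries_et(raw_xml):
    """Turn a feed into a list of dicts, one per Atom <entry>."""
    root = ET.fromstring(raw_xml)
    result_entries = []
    for entry in root.iter('{%s}entry' % NS['atom']):
        dentry = {}
        for tag in ('id', 'published', 'updated', 'title', 'content'):
            # findtext gives '' for an empty element and the default
            # for a missing one, so no try/except is needed here
            dentry[tag] = entry.findtext('atom:' + tag, default='',
                                         namespaces=NS).strip()
        when = entry.find('gd:when', NS)
        dentry['startTime'] = when.get('startTime', '') if when is not None else ''
        dentry['endTime'] = when.get('endTime', '') if when is not None else ''
        result_entries.append(dentry)
    return result_entries


SAMPLE_FEED_ET = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <id>http://example.com/entry/1</id>
    <title type="text"> Mom's Birthday </title>
    <content type="text"/>
    <gd:when endTime="2006-09-21" startTime="2006-09-20"/>
  </entry>
</feed>"""

if __name__ == '__main__':
    print(parse_entries_et(SAMPLE_FEED_ET))
```

ElementTree matches on namespace URIs rather than on the gd: prefix as written, which is why the NS mapping is needed.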
By the way, if you're interested in learning more about working with Google Calendar's API, I'll be making a few posts on Answer My Searches soon detailing how to do that. (I may even end up releasing a library for Python.) So go ahead and subscribe if you haven't yet.
Tags: Python, XML, Python XML, XML Parsing, minidom, PyRXP, dom, parse
Posted by Greg Pinero (Primary Searcher) as Python, Web Services at 6:23 PM MST