Super Easy XML Parsing in Python

Getting information I need out of XML is one of those tasks that occurs rarely enough that I never get to develop that profound understanding of XML documents and their parsing that we all so desire. It also means that every time I want to work with a piece of XML I have to re-learn the bare minimum of Python XML processing.

To fix this problem I am posting the snippet of bare-minimum, cheap XML processing code I usually come up with, using my latest XML processing problem as an illustration.

Be sure to advise me if I’m doing something dreadfully wrong, or if you’ve come up with a superior code snippet for processing XML that involves less cruft. Also stay tuned to my announcements page to see how this code is going to contribute to automatically adding birthdays to Google Calendar!

And finally here is how I get XML into a list of dictionaries where each dictionary contains the important values of the calendar entry elements for a piece of XML like this:

<entry>
    <id>
        http://www.google.com/calendar/feeds/default/private/full/...
    </id>
    <published>
        2006-09-15T04:55:44.000Z
    </published>
    <updated>
        2006-09-15T04:55:44.000Z
    </updated>
    <category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/g/2005#event"/>
    <title type="text">
        Mom's Birthday
    </title>
    <content type="text"/>
    <link href="http://www.google.com/calendar/event?eid=..." rel="alternate" title="alternate" type="text/html"/>
    <link href="http://www.google.com/calendar/feeds/default/private/full/..." rel="self" type="application/atom+xml"/>
    <link href="http://www.google.com/calendar/feeds/default/private/full/..." rel="edit" type="application/atom+xml"/>
    <author>
        <name>
            ....
        </name>
        <email>
            ....
        </email>
    </author>
    <gd:comments>
        <gd:feedLink href="http://www.google.com/..."/>
    </gd:comments>
    <gd:visibility value="http://schemas.google.com/g/2005#event.default"/>
    <gd:eventStatus value="http://schemas.google.com/g/2005#event.confirmed"/>
    <gd:transparency value="http://schemas.google.com/g/2005#event.opaque"/>
    <gCal:sendEventNotifications value="true"/>
    <gd:where valueString=""/>
    <gd:when endTime="2006-09-21" startTime="2006-09-20">
        <gd:reminder minutes="2880"/>
    </gd:when>
</entry>

Here is the code. I just feed this function the raw XML returned by Google Calendar and it gives me a list of dictionaries, where each dictionary holds information from one event, like its title and start time. It's not that complicated, but I always get hung up on all that firstChild and data stuff, so it's easier for me to copy this snippet and modify it for whatever XML I'm dealing with than to redo it each time.

import xml.dom.minidom

def parse_entries(raw_xml):
    dom = xml.dom.minidom.parseString(raw_xml)  # Build the DOM from the raw XML
    entries = dom.getElementsByTagName('entry')  # Pull out all <entry> elements
    result_entries = []  # Empty container to fill up and return
    for entry in entries:
        dentry = {}  # Empty dict to hold info on one entry
        # Fill up the dict
        dentry['id'] = entry.getElementsByTagName('id')[0].firstChild.data
        dentry['published'] = entry.getElementsByTagName('published')[0].firstChild.data
        dentry['updated'] = entry.getElementsByTagName('updated')[0].firstChild.data
        dentry['title'] = entry.getElementsByTagName('title')[0].firstChild.data
        try:
            dentry['content'] = entry.getElementsByTagName('content')[0].firstChild.data
        except AttributeError:
            # An empty <content/> has no children, so firstChild is None
            dentry['content'] = ''
        # The start and end times are attributes of <gd:when>, not child text
        when = entry.getElementsByTagName('gd:when')[0]
        dentry['startTime'] = when.getAttribute('startTime')
        dentry['endTime'] = when.getAttribute('endTime')
        result_entries.append(dentry)
    return result_entries
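
Since the firstChild and data dance is the part I keep tripping over, here is a minimal sketch of a helper that hides it. The names child_text, parse_entries_short and SAMPLE_FEED are hypothetical, and the sample feed is a trimmed-down stand-in I made up for illustration (with the usual Atom and GData namespace declarations assumed on the root), not a real Google Calendar response. It also strips the surrounding whitespace, which matters if the XML has been pretty-printed like the sample above.

```python
import xml.dom.minidom

# Hypothetical, trimmed-down feed used only for illustration. The namespace
# URIs on the root element are assumptions about what a real feed declares.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <title type="text">
      Mom's Birthday
    </title>
    <content type="text"/>
    <gd:when endTime="2006-09-21" startTime="2006-09-20"/>
  </entry>
</feed>"""


def child_text(parent, tag, default=''):
    """Return the stripped text of the first <tag> under parent, or default."""
    nodes = parent.getElementsByTagName(tag)
    # Missing element, or an empty one such as <content/>: fall back
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.data.strip()


def parse_entries_short(raw_xml):
    """Same idea as parse_entries, with the repetition moved into child_text."""
    dom = xml.dom.minidom.parseString(raw_xml)
    result = []
    for entry in dom.getElementsByTagName('entry'):
        dentry = {tag: child_text(entry, tag) for tag in ('title', 'content')}
        when = entry.getElementsByTagName('gd:when')[0]
        dentry['startTime'] = when.getAttribute('startTime')
        dentry['endTime'] = when.getAttribute('endTime')
        result.append(dentry)
    return result


print(parse_entries_short(SAMPLE_FEED))
```

The helper assumes the first child of the element is a text node, which holds for simple elements like title but not for ones that contain other elements, like author.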

For the future I'd like to consider trying pyRXP instead, which promises to be 97% faster than minidom and parses XML directly into some kind of mix of tuples and other Python primitives.

By the way, if you’re interested in learning more about working with Google Calendar’s API, I’ll be making a few posts on Answer My Searches soon detailing how to do that. (And I may end up even releasing a library for Python). So go ahead and subscribe if you haven’t yet ;-)


2 Responses to “Super Easy XML Parsing in Python”

  1. Narayan Desai says:

    You should also take a look at ElementTree, which is now included with Python 2.5. It provides a similar, though simpler, API for XML handling. lxml also implements a similar (though more comprehensive) API, but uses libxml2, so it is quite fast. I think this API feels a lot more Pythonic than DOM. Your code could be implemented using lxml as follows. (To use ElementTree, change the parse call to use ElementTree's XML function.)


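    import lxml.etree

    def parse_entries(raw_xml):
        root = lxml.etree.XML(raw_xml)
        result_entries = []
        for entry in root.findall('entry'):
            dentry = {}
            # startTime and endTime are attributes of <gd:when>, not child
            # elements, so find(key).text cannot reach them
            for key in ['id', 'published', 'updated', 'title', 'content']:
                # .text is None for an empty element such as <content/>
                dentry[key] = entry.find(key).text or ''
            result_entries.append(dentry)
        return result_entries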

  2. Thanks, Narayan! That does look better. I even read ElementTree's website when solving this problem. They should probably emphasize their Pythonic-ness more so I could have known to use them.
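
One thing I would expect to bite with a real feed is namespaces: if the root element declares a default Atom namespace, a plain findall('entry') finds nothing, and the start and end times live in attributes of gd:when rather than in child elements. Here is a rough sketch of how that might look with the standard library's xml.etree.ElementTree. The NS map, the parse_entries_et name and the sample feed are all assumptions for illustration; I have not checked them against a live Google Calendar response.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used only for lookups. These URIs are assumed to match
# the namespace declarations on the feed's root element.
NS = {
    'atom': 'http://www.w3.org/2005/Atom',
    'gd': 'http://schemas.google.com/g/2005',
}

# Hypothetical, trimmed-down feed used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <updated>2006-09-15T04:55:44.000Z</updated>
    <title type="text">Mom's Birthday</title>
    <content type="text"/>
    <gd:when endTime="2006-09-21" startTime="2006-09-20"/>
  </entry>
</feed>"""


def parse_entries_et(raw_xml):
    """Parse feed entries into dicts using namespace-aware lookups."""
    root = ET.fromstring(raw_xml)
    result = []
    for entry in root.findall('atom:entry', NS):
        dentry = {}
        for key in ('id', 'published', 'updated', 'title', 'content'):
            # findtext gives the default for a missing element and '' for
            # an element that is present but empty
            text = entry.findtext('atom:' + key, default='', namespaces=NS)
            dentry[key] = text.strip()
        # The times are attributes of <gd:when>, so read them with get()
        when = entry.find('gd:when', NS)
        if when is not None:
            dentry['startTime'] = when.get('startTime')
            dentry['endTime'] = when.get('endTime')
        result.append(dentry)
    return result


print(parse_entries_et(SAMPLE_FEED))
```

Elements missing from the sample, like id and published, simply come back as empty strings here instead of raising an error.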