Wednesday, July 20

Python for dummies

I have been using Python extensively in my research at UCL. Since it offers simple and elegant solutions for some issues I had, I'm posting my development sketches for future reference.
The IM logs I'm analyzing are stored in XML files, and from them I'm extracting and normalizing the following data:

  • duration of sessions
  • number of messages
  • words per message
  • frequency of words
I addressed each of the above tasks separately and then combined them all (plus a GUI) in the final program. I also wanted to collect more data, such as the language preference of sender and recipient, but as far as I can tell no IM client records that.
Collecting the duration of a session is a simple matter of recording its start and end times.
Counting the number of messages is a simple loop over the messages in a session.
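
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the sample log, the DateTime attribute on each <Message> element, and its timestamp format are hypothetical and may not match what a real client writes; session_stats is a made-up helper name.

```python
from datetime import datetime
from xml.dom import minidom

# Hypothetical log: the DateTime attribute and its format are assumptions.
SAMPLE = """<Log>
  <Message DateTime="2005-07-20T14:03:12"><Text>hi there</Text></Message>
  <Message DateTime="2005-07-20T14:05:40"><Text>hello</Text></Message>
  <Message DateTime="2005-07-20T14:18:02"><Text>see you later</Text></Message>
</Log>"""

def session_stats(document):
    """Return (duration, message count) for one parsed session."""
    count = 0
    start = end = None
    for message in document.getElementsByTagName('Message'):
        stamp = datetime.strptime(message.getAttribute('DateTime'),
                                  '%Y-%m-%dT%H:%M:%S')
        # Track the earliest and latest timestamps seen so far.
        if start is None or stamp < start:
            start = stamp
        if end is None or stamp > end:
            end = stamp
        count += 1
    # An empty session has no meaningful duration.
    duration = end - start if count else None
    return duration, count

duration, count = session_stats(minidom.parseString(SAMPLE))
print('%d messages in %s' % (count, duration))
```

Tracking the minimum and maximum timestamp, rather than trusting the first and last element, keeps the result correct even if the messages are not stored in chronological order.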

The number of words per message is obtained with this pristine code:

lst = msg.split()   # split the message text on whitespace
len(lst)            # number of words in the message
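
As a rough sketch of how the same idea extends to a whole session, the snippet below applies it to a list of messages and averages the result. The function name words_per_message and the sample messages are made up for illustration. Note that split() with no arguments collapses runs of whitespace and leaves punctuation attached to the neighbouring word.

```python
def words_per_message(messages):
    """Return the word count of each message, splitting on whitespace."""
    return [len(msg.split()) for msg in messages]

# Hypothetical messages, for illustration only.
messages = ["hi there", "hello", "see you   later!"]
counts = words_per_message(messages)
# Guard against an empty session before dividing.
average = float(sum(counts)) / len(counts) if counts else 0.0
print(counts)
print(average)
```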

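The frequency of words is handled with this elegant loop:

freq = {}
for word in lst:   # "word" rather than "str", which would shadow the built-in type
  word = word.lower()   # count "Hello" and "hello" as the same word
  freq[word] = freq.get(word, 0) + 1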
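
The same count can also be sketched with collections.Counter from the standard library, which additionally offers most_common() for ranking. This is an alternative to the dictionary loop, not the code used in the final program. Stripping surrounding punctuation is an assumption about what normalizing should mean here, and word_frequencies is a hypothetical helper name.

```python
import string
from collections import Counter

def word_frequencies(messages):
    """Count lower-cased words across messages, ignoring edge punctuation."""
    freq = Counter()
    for msg in messages:
        for word in msg.split():
            # Assumption: "hello," and "hello" should count as the same word.
            word = word.lower().strip(string.punctuation)
            if word:
                freq[word] += 1
    return freq

# Hypothetical messages, for illustration only.
freq = word_frequencies(["Hello there!", "hello, HELLO again"])
print(freq.most_common(1))
```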


Next, a proper XML parser was needed. This took a bit longer to write, since I was re-learning Python at the same time, so I started by tackling MSN log files. Generalizing this to other IM clients that log to XML should be mostly a matter of changing the tag names.

from xml.dom import minidom

def load(filename):
  return minidom.parse(filename)

def getElementsByTagName(node, tagName):
  children = node.getElementsByTagName(tagName)
  if len(children):
    return children
  return []

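def first(node, tagName):
  # Return the first descendant element named tagName, or None if there is none.
  children = getElementsByTagName(node, tagName)
  return children[0] if children else None

def textOf(node):
  # Concatenate the text (and CDATA) children of node; "" if node is None.
  if node is None:
    return ""
  return "".join([child.data for child in node.childNodes
                  if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)])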

if __name__ == '__main__':
  document = load("msn_log_example.xml")
  for item in getElementsByTagName(document, 'Message'):
    # Print the text of each message, or an empty string if it has no <Text>.
    print('Message: ' + textOf(first(item, 'Text')))
  print('')


Note that the example above is tailored to MSN Messenger logs, where the <Text> element (holding the message content) is nested inside the <Message> element. Part 2 of this tutorial will follow soon.
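
To try the parsing and counting steps together without a real log file, here is a minimal sketch that parses an in-memory document. The sample XML is hypothetical and much simpler than an actual MSN log (the <Log> and <From> elements are made up). The helper message_texts and its message_tag and text_tag parameters are also assumptions, added only to show how other tag names could be plugged in.

```python
from xml.dom import minidom

# Hypothetical sample; real MSN logs carry more attributes and elements.
SAMPLE = """<Log>
  <Message><From>alice</From><Text>Hi there</Text></Message>
  <Message><From>bob</From><Text>hi Alice, how are you</Text></Message>
  <Message><From>alice</From></Message>
</Log>"""

def message_texts(document, message_tag='Message', text_tag='Text'):
    """Yield the text of each message ("" when the text element is missing)."""
    wanted = (minidom.Node.TEXT_NODE, minidom.Node.CDATA_SECTION_NODE)
    for message in document.getElementsByTagName(message_tag):
        texts = message.getElementsByTagName(text_tag)
        if not texts:
            yield ""
            continue
        # Join only text-like children of the first text element.
        yield "".join(child.data for child in texts[0].childNodes
                      if child.nodeType in wanted)

texts = list(message_texts(minidom.parseString(SAMPLE)))

# Same dictionary-based frequency count as earlier in the post.
freq = {}
for text in texts:
    for word in text.split():
        word = word.lower()
        freq[word] = freq.get(word, 0) + 1

print(len(texts))                       # number of messages
print([len(t.split()) for t in texts])  # words per message
print(freq['hi'])                       # how often "hi" appears
```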
