Python for dummies
I have been using Python extensively in my research at UCL. Since it offers simple and elegant solutions for some issues I had, I'm posting my development sketches for future reference.
The IM logs I'm analyzing are stored in XML files and I'm extracting and normalizing the following data:
- duration of sessions
- number of messages
- words per message
- frequency of words
Collecting the duration of the session is a simple matter of keeping the start and end time of the session.
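As a rough sketch of the duration step: the function name and the timestamp format below are assumptions for illustration, since the real format depends on the IM client that wrote the log.

```python
from datetime import datetime

# Assumed timestamp format; adjust it to whatever the log actually uses.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"

def session_duration(timestamps):
    """Return the session length in seconds, given message timestamps."""
    times = [datetime.strptime(stamp, TIME_FORMAT) for stamp in timestamps]
    # The session runs from the earliest to the latest message.
    return (max(times) - min(times)).total_seconds()

# Made-up sample timestamps.
duration = session_duration(["2008-03-01T10:00:00", "2008-03-01T10:05:30"])
```

Taking the minimum and maximum means the messages do not have to arrive in order.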
Counting the number of messages is a simple matter of iterating over them and incrementing a counter.
The number of words per message comes from splitting the message on whitespace and counting the pieces:
lst = msg.split()
len(lst)
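Putting the message count and the words-per-message count together might look like the sketch below. The function name and the sample messages are made up for illustration, and the mean is just one possible summary.

```python
def words_per_message(messages):
    """Return (message count, word count per message, mean words per message)."""
    counts = [len(msg.split()) for msg in messages]
    total = len(counts)
    # float() keeps the division exact on Python 2 as well as Python 3.
    mean = sum(counts) / float(total) if total else 0.0
    return total, counts, mean

# Made-up sample data.
messages = ["hi there", "how are you doing", "fine"]
total, counts, mean = words_per_message(messages)
```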
Frequency of words is addressed with the elegant:
freq = {}
for word in lst:
    # Lowercase first so "Hello" and "hello" count as the same word.
    word = word.lower()
    freq[word] = freq.get(word, 0) + 1
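The same count can also be written with collections.Counter from the standard library. This is an alternative sketch, not the code used above. The function names are hypothetical, and reading "normalizing" as relative frequencies is my assumption.

```python
from collections import Counter

def word_frequencies(messages):
    """Count case-insensitive word occurrences across all messages."""
    freq = Counter()
    for msg in messages:
        freq.update(word.lower() for word in msg.split())
    return freq

def relative_frequencies(freq):
    """Scale raw counts so that they sum to 1 (assumed normalization)."""
    total = sum(freq.values())
    if not total:
        return {}
    return dict((word, count / float(total)) for word, count in freq.items())

# Made-up sample data.
freq = word_frequencies(["Hi there", "hi again"])
rel = relative_frequencies(freq)
```

Counter also offers most_common(n), which is handy for listing the most frequent words.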
Next, some hardcore XML parsing was needed. This took a bit longer to write, since I was re-learning Python at the same time, so I started by tackling MSN log files. Generalizing this to other IM clients should be straightforward.
from xml.dom import minidom

def load(filename):
    # Parse the XML log file into a DOM document.
    return minidom.parse(filename)

def getElementsByTagName(node, tagName):
    # All descendants of node with the given tag name (empty list if none).
    return list(node.getElementsByTagName(tagName))

def first(node, tagName):
    # First matching descendant, or None if there is no match.
    children = getElementsByTagName(node, tagName)
    return children[0] if children else None

def textOf(node):
    # Concatenated text of the node's direct text children; "" for None.
    if node is None:
        return ""
    return "".join(child.data for child in node.childNodes
                   if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE))

if __name__ == '__main__':
    document = load("msn_log_example.xml")
    for item in getElementsByTagName(document, 'Message'):
        print('Message: %s' % textOf(first(item, 'Text')))
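To try the same idea without a log file at hand, the sketch below parses a small in-memory document with minidom.parseString. The sample XML, its root element name and the function name are hypothetical. Real MSN logs carry more elements and attributes than shown here.

```python
from xml.dom import minidom

# Hypothetical, stripped-down sample; not a real MSN log.
SAMPLE = """<Log>
  <Message><Text>Hello there</Text></Message>
  <Message><Text>How are you?</Text></Message>
  <Message></Message>
</Log>"""

def message_texts(xml_string):
    """Return the text of every <Message>, or "" when it has no <Text>."""
    document = minidom.parseString(xml_string)
    texts = []
    for message in document.getElementsByTagName("Message"):
        text_nodes = message.getElementsByTagName("Text")
        if not text_nodes:
            texts.append("")
            continue
        # Join only the text children of the first <Text> element.
        texts.append("".join(child.data for child in text_nodes[0].childNodes
                             if child.nodeType == child.TEXT_NODE))
    return texts

texts = message_texts(SAMPLE)
```

Each string in texts can then be fed to the word counting code shown earlier.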
Note that the example above is customized for MSN Messenger logs, where the <Text> element (containing the message content) is nested inside the <Message> element. Part 2 of this tutorial will follow soon.