Wednesday, July 27

Python for dummies, Part II

In my previous Python entry I posted code to collect string/word information from a document/string, plus a simple MSN-specific parser that allows me to navigate a DOM document and extract its nodes and content.

Now I need functionality to search the disk for the files to parse, and I need a GUI.

Search first.
I spent an awful lot of time playing around with OS-specific code to traverse directory structures, only to find out that Python has most of this functionality built in, and that the built-in version is both faster and cross-platform.

I didn't want my disk search to take ages. Hence, I decided to split it into three phases:

1 - Search for log files in the current folder, in case the user drops the executable there. Assuming the parser from Part I is named 'my_parser':

import os, fnmatch
for f in os.listdir('.'):
  # only parse XML files
  if fnmatch.fnmatch(f, '*.xml'):
    my_parser.parse(f)
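
As a rough sketch, the same first phase can be wrapped in a reusable function. The name xml_files_in and its parameters are my own invention for illustration, not something from Part I; the function only returns the paths and leaves the parsing to the caller.

```python
import fnmatch
import os

def xml_files_in(folder='.', wildcard='*.xml'):
    """Return paths of regular files directly inside folder that match wildcard."""
    # fnmatch.filter applies the wildcard to a whole list of names at once
    matches = fnmatch.filter(sorted(os.listdir(folder)), wildcard)
    paths = [os.path.join(folder, name) for name in matches]
    # skip directories that happen to be named like '*.xml'
    return [p for p in paths if os.path.isfile(p)]
```

Each returned path could then be handed to my_parser.parse, as in the listing above.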


2 - Search for files, starting from the user's profile folder. This prunes the search space by avoiding OS- and user-specific folders. I separated searching for the right folders from listing the log files:

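import os, fnmatch, re
# Python 2 only: os.path.walk and tuple parameters were removed in Python 3
folders = []
files = []
# msn_username is passed by the user; re.escape matches it literally
r = re.compile(re.escape(msn_username))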

def browse((r, folders), dirpath, namelist):
  for name in namelist:
    # collect entries whose names contain the given msn_username
    if r.search(name):
      folders.append(os.path.join(dirpath, name))

def listfiles((wildcard, files), dirpath, namelist):
  for name in namelist:
    # only append 'wildcard'-specified filenames
    if fnmatch.fnmatch( name, wildcard ):
      files.append(os.path.join(dirpath, name))

userbase = os.environ['USERPROFILE']
# change directory to user profile folder
os.chdir(userbase)
os.path.walk(os.getcwd(), browse, (r, folders))

if folders:
  # populate the list of XML files from the folders
  wcard = '*.xml'
  for fld in folders:
    os.path.walk(fld, listfiles, (wcard, files))
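
The listing above relies on os.path.walk and on tuple parameters in function signatures, both of which exist only in Python 2. Below is a minimal sketch of the same two-step idea using os.walk, which also works in Python 3. The function name find_log_files and its parameters are assumptions made for this example. Unlike the listing, it only looks at directory names and does not descend a second time into a folder that has already matched.

```python
import fnmatch
import os
import re

def find_log_files(base, msn_username, wildcard='*.xml'):
    """Collect wildcard-matching files under folders whose names contain msn_username."""
    pattern = re.compile(re.escape(msn_username))
    files = []
    for dirpath, dirnames, _ in os.walk(base):
        matched = [d for d in dirnames if pattern.search(d)]
        # prune matched folders from the outer walk; the inner walk covers them
        dirnames[:] = [d for d in dirnames if d not in matched]
        for folder in matched:
            for sub, _, filenames in os.walk(os.path.join(dirpath, folder)):
                for name in fnmatch.filter(filenames, wildcard):
                    files.append(os.path.join(sub, name))
    return files
```

A call such as find_log_files(os.path.expanduser('~'), 'someone') would start from the home folder on any platform, whereas the USERPROFILE variable used above is set on Windows only.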


3 - Search for files on the whole disk, in case none were found with the two previous methods:

# replace the userbase value with the root directory
userbase = '\\'
os.chdir(userbase)
os.path.walk(os.getcwd(), browse, (r, folders))
...
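
The fallback between the three phases can be sketched as a small helper that runs the cheapest search first and stops at the first one that finds anything. The name search_in_phases is hypothetical, and each phase is assumed to be a function that takes no arguments and returns a list of file paths.

```python
def search_in_phases(phases):
    """Run each search in order and return the first non-empty result."""
    for phase in phases:
        hits = phase()
        if hits:
            # a cheaper phase succeeded, so the more expensive ones never run
            return hits
    return []
```

Because the phases are passed as functions rather than as finished results, the whole-disk search is only executed when the current-folder and profile-folder searches both come back empty.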


And that's it. Amazing how simple it looks now. Part III, the GUI, will follow soon.

UPDATE: This module does most of the above in a nicer, more elegant way. No need to reinvent the wheel.
