the beautiful people

from time to time some good looking person (or, occasionally, an ugly one) will state or imply, in words or through their behaviour, this sentiment:

beautiful people are attracted to beautiful people.  ugly people are attracted to ugly people.

and i die a little inside.  i only figured out why today:

cats are attracted to cats.  dogs are attracted to dogs.

when people say the ‘beautiful/ugly people attraction thing’, what i hear is:

ugly people … they’re not really human, are they?

get called “sub-human” often enough and you’d be miserable too.

what to back up

to celebrate the release of fedora 15 and in the spirit of the “release early” part of “release early, release often” i’ve decided to post my latest script.

this one’s a bit more involved than the last ones but the task is to answer this question:

what files do i need to back up from my fedora system?

in particular there are a few problems to solve:

  1. i don’t want to waste space backing up unmodified files from software packages.
  2. i want to see which parts of the file system are using up the most backup space so i can see if i can do something about that.
  3. i need a concise summary of what needs to be backed up so i can review it.  if an entire directory tree needs to be backed up then i only want to see that mentioned once, not a line of output for each file!

if it wasn’t for 2 and 3 i could just rpm -Va and then maybe something like find / -exec rpm -qf {} \; | grep "not owned by any package" but that’s all rather fiddly and produces vast, unmanageable lists of files.  so i turn to python.  🙂

what i’ve ended up with is a tiny app, a file/package scanning class and a helper class for showing progress – checking every file on the system takes hours so you want to know how you’re getting on.

so, first up, the progress display class.  i’ve saved this as cmdmsg.py.

#!/usr/bin/python
# coding=utf-8

# a utility function taken from stackoverflow
def getTerminalSize():
    """
    returns (lines:int, cols:int)
    """
    import os, struct
    def ioctl_GWINSZ(fd):
        import fcntl, termios
        return struct.unpack("hh", fcntl.ioctl(fd, termios.TIOCGWINSZ, "1234"))
    # try stdin, stdout, stderr
    for fd in (0, 1, 2):
        try:
            return ioctl_GWINSZ(fd)
        except:
            pass
    # try os.ctermid()
    try:
        fd = os.open(os.ctermid(), os.O_RDONLY)
        try:
            return ioctl_GWINSZ(fd)
        finally:
            os.close(fd)
    except:
        pass
    # try `stty size`
    try:
        return tuple(int(x) for x in os.popen("stty size", "r").read().split())
    except:
        pass
    # try environment variables
    try:
        return tuple(int(os.getenv(var)) for var in ("LINES", "COLUMNS"))
    except:
        pass
    # i give up. return default.
    return (25, 80)

from datetime import datetime, timedelta
from os.path import commonprefix
from sys import stderr
class cmdmsg():
    def __init__(self, interval = timedelta(0, 1, 0)):
        self.msg = ""
        self.height, self.width = getTerminalSize()
        self.last = datetime.now()
        self.interval = interval

    def say(self, msg, interval = None):
        if interval is None: interval = self.interval
        if datetime.now() - self.last < interval: return
        self.last = datetime.now()
        # multi-byte characters really futz with this stuff
        msg = msg.replace("\t", " ").decode(
            "utf8", 'replace').encode("ascii", 'replace')
        if len(msg) > (self.width - 1):
            ends = self.width / 2 - 2
            msg = msg[:ends] + "..." + msg[-ends:]
        offset = len(commonprefix([self.msg, msg]))
        # BS moves cursor but doesn't appear to remove content - so print spaces
        if len(self.msg) > len(msg):
            extra = len(self.msg) - len(msg)
            stderr.write("\b" * extra + " " * extra)
        stderr.write("\b" * len(self.msg[offset:]) + msg[offset:])
        self.msg = msg

    def saynow(self, msg):
        self.say(msg, timedelta(0, 0, 0))

    def end(self):
        self.saynow("")

    def spit(self, msg):
        stderr.write("\r" + " " * len(self.msg) + "\r" + msg + "\n" + self.msg)

this allows the scanning module to write and overwrite progress messages to the terminal without lots of annoying scrolling (which takes a lot of CPU and means you lose key messages).
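
as a rough illustration of the overwrite trick (a standalone sketch, not part of cmdmsg.py; the name overwrite_sequence is made up for this example), the function below works out which characters have to be sent to the terminal to turn the message already on screen into the new one.  it assumes the cursor is sitting at the end of the old message and that everything is plain ascii, and it is written so it should run on python 3 as well as python 2.

```python
from os.path import commonprefix

def overwrite_sequence(old, new):
    """Return the characters to write so that `old` (already on screen,
    cursor at its end) is replaced by `new` without a newline."""
    out = ""
    if len(old) > len(new):
        extra = len(old) - len(new)
        # backspace only moves the cursor, so blank the surplus tail
        # with spaces; the cursor ends up back at the end of `old`
        out += "\b" * extra + " " * extra
    # only the part after the shared prefix needs rewriting
    offset = len(commonprefix([old, new]))
    out += "\b" * len(old[offset:]) + new[offset:]
    return out

# only the last three characters differ, so only they are rewritten
sequence = overwrite_sequence("/usr/bin", "/usr/lib")
# sequence is three backspaces followed by "lib"
```

rewriting only the changed tail keeps the amount of output small, which matters when the message changes for every file scanned.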

next, the scanning module.  i didn’t want to maintain a complete list of all files on the system in a big array and i wanted to do the summarising as i went along, so this has got some tricksy fiddling around with ‘references’ into a big dictionary hierarchy.

(each level of filesystem hierarchy uses two levels of hierarchy in the dictionary.  this is because the dictionary entry for a folder doesn’t have the sub-folders as keys, it contains a set of metadata keys and a ‘dirs’ key for the sub-folders.)
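
to make that shape concrete, here is a small standalone sketch that builds such a dictionary by hand.  the key names ('dirs', 'save', 'savesize', 'unmodified', 'unmodifiedsize') are the ones the scanner uses; the helper add_folder, the file names and the sizes are invented for the example.

```python
def add_folder(results, folder_path):
    """Create (if needed) the nested entries for folder_path and
    return the dictionary that holds that folder's metadata."""
    node = results
    for name in folder_path.strip("/").split("/"):
        if not name:
            continue  # "/" itself is the top-level dictionary
        # two dictionary levels per folder: 'dirs', then the folder name
        node = node.setdefault('dirs', {}).setdefault(name, {})
    return node

results = {}

etc = add_folder(results, "/etc")
etc['save'] = ['hosts']          # new or modified files (made-up example)
etc['savesize'] = 158

ssh = add_folder(results, "/etc/ssh")
ssh['unmodified'] = ['moduli']   # files matching their package (made-up)
ssh['unmodifiedsize'] = 1000

# results now looks like:
# {'dirs': {'etc': {'save': ['hosts'], 'savesize': 158,
#                   'dirs': {'ssh': {'unmodified': ['moduli'],
#                                    'unmodifiedsize': 1000}}}}}
```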

but basically it allows you to maintain a list of files and directories with the required information – should they be backed up or not.
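
the summarising idea can be sketched on its own like this.  note the assumptions: this is a simplified, recursive version that runs over a finished tree, whereas the module below collapses folders incrementally as the walk leaves them; the function name collapse and the sample tree are made up for the example.

```python
def collapse(folder):
    """Fold sub-folders that are entirely 'save' or entirely
    'unmodified' into their parent so each is listed only once."""
    subs = folder.get('dirs', {})
    for name in sorted(subs):
        sub = subs[name]
        collapse(sub)  # deepest folders first
        # the parent carries the size totals for the whole tree
        for key in ('savesize', 'unmodifiedsize'):
            if sub.get(key):
                folder[key] = folder.get(key, 0) + sub[key]
        if 'dirs' in sub:
            continue  # mixed content further down: keep it expanded
        if not sub:
            del subs[name]  # nothing of interest in there
        elif 'save' in sub and 'unmodified' not in sub:
            folder.setdefault('save', []).append(name)
            del subs[name]
        elif 'unmodified' in sub and 'save' not in sub:
            folder.setdefault('unmodified', []).append(name)
            del subs[name]
    if 'dirs' in folder and not subs:
        del folder['dirs']

tree = {'dirs': {
    'home': {'dirs': {'me': {'save': ['notes.txt'], 'savesize': 10}}},
    'usr': {'dirs': {'bin': {'unmodified': ['ls'], 'unmodifiedsize': 20}}},
    'etc': {'save': ['hosts'], 'unmodified': ['fstab'],
            'savesize': 1, 'unmodifiedsize': 2},
}}
collapse(tree)
# /home is now a single 'save' entry, /usr a single 'unmodified' entry,
# and /etc stays expanded because it holds a mix of both
```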

#!/usr/bin/python
# coding=utf-8

# a utility function
def fileSize(bytes):
    suffix = [' bytes', 'K', 'M', 'G', 'T', 'P', 'E']
    size = float(bytes)
    index = 0
    while size >= 1000 and index < len(suffix) - 1:
        index += 1
        size /= 1024
    return str(int(round(size))) + suffix[index]

from pwd import getpwuid
import yum
from datetime import datetime, timedelta
from os.path import commonprefix
from cmdmsg import cmdmsg
from os import path, stat

class pkgScanner():
    def __init__(self):
        self._lastroot = ""
        self._results = {}
        self._rootpath = []
        self._mounts = []
        self._root = ""
        self._rpmva = {}
        self._yb=yum.YumBase()
        self._yb.setCacheDir()
        self._cm = cmdmsg(timedelta(0, 0, 25000))
        self._cd = self._results

    def __str__(self):
        return "scanned:\n" + self._pprec() + "\nnot scanned:\n" + "\n".join(
            self._mounts)

# see if sub folders can be 'collapsed' into their parent
    def _check(self, folder, thisroot = None):
        removes = []
        if 'dirs' in folder:
            for sub in folder['dirs']:
# copy sizes to parent so it has totals for the whole tree
                if 'savesize' in folder['dirs'][sub] and\
                    folder['dirs'][sub]['savesize']:
                    if 'savesize' not in folder: folder['savesize'] = 0
                    folder['savesize'] += folder['dirs'][sub]['savesize']
                if 'unmodifiedsize' in folder['dirs'][sub] and\
                    folder['dirs'][sub]['unmodifiedsize']:
                    if 'unmodifiedsize' not in folder: folder['unmodifiedsize'] = 0
                    folder['unmodifiedsize'] += folder['dirs'][sub]['unmodifiedsize']
                if not len(folder['dirs'][sub]):
                    if thisroot: self._cm.spit(
                        "removing empty " + sub + " from " + thisroot)
                    removes.append(sub)
                elif 'dirs' not in folder['dirs'][sub] and\
                    'unmodified' not in folder['dirs'][sub] and\
                    'save' in folder['dirs'][sub]:
                    if 'save' not in folder: folder['save'] = []
                    folder['save'].append(sub)
                    if thisroot: self._cm.spit(
                        "removing all new/modified " + sub + " from " + thisroot)
                    removes.append(sub)
                elif 'dirs' not in folder['dirs'][sub] and\
                    'unmodified' in folder['dirs'][sub] and\
                    'save' not in folder['dirs'][sub]:
                    if 'unmodified' not in folder: folder['unmodified'] = []
                    folder['unmodified'].append(sub)
                    if thisroot: self._cm.spit(
                        "removing all unmodified " + sub + " from " + thisroot)
                    removes.append(sub)
            for sub in removes:
                del folder['dirs'][sub]
            if not len(folder['dirs']): del folder['dirs']

# figure out which folders should be checked now
    def _checkpath(self):
        self._rootpath = self._root.split("/")
        if self._rootpath[-1] == "": self._rootpath = self._rootpath[:-1]
        lastrootpath = self._lastroot.split("/")
        if lastrootpath[-1] == "": lastrootpath = lastrootpath[:-1]
        if len(self._rootpath) <= len(lastrootpath):
            # find common path of root and lastroot
            n = 1 # skip leading blank before "/"
            folder = self._results
            while n < len(self._rootpath) and self._rootpath[n] == lastrootpath[n]:
                folder = folder['dirs'][self._rootpath[n]]
                n += 1
            checkpath = "/".join(lastrootpath[:n])
            tocheck = []
            checkpaths = []
            while n < len(lastrootpath):
                folder = folder['dirs'][lastrootpath[n]]
                checkpath += "/" + lastrootpath[n]
                tocheck.append(folder)
                checkpaths.append(checkpath)
                n += 1
            tocheck.reverse()
            checkpaths.reverse()
            for index, folder in enumerate(tocheck):
                self._check(folder)

# nicely formatted 'pretty print' of hierarchy
# should probably just make this two levels per recursion
# rather than checking if depth % 2
    def _pprec(self, p = None, depth=0):
        if not p:
            p = self._results
            output = "/"
        else: output = ""
        if type(p) is dict:
            if not depth % 2:
                if 'unmodifiedsize' not in p and 'savesize' not in p:
                    output += " (empty)"
                elif 'savesize' not in p or not p['savesize']: output += " (none)"
                elif 'unmodifiedsize' not in p or not p['unmodifiedsize']:
                    output += " (all)"
                else: output += " save " + fileSize(p['savesize']) + "/" +\
                    fileSize(p['savesize'] + p['unmodifiedsize']) + "=" +\
                    str((100 * p['savesize']) / (p['savesize'] +\
                    p['unmodifiedsize'])) + "%"
                if 'save' in p:
                    names = ", ".join(p['save'])
                    if len(names) > 30: names = names[:30] + "..."
                    output += " (save " + str(len(p['save'])) + " local: " + names + ")"
                if 'unmodified' in p:
                    names = ", ".join(p['unmodified'])
                    if len(names) > 30: names = names[:30] + "..."
                    output += " (unmodified " + str(len(p['unmodified'])) + " local: " +\
                        names + ")"
                output += "\n"
                if 'dirs' in p: output += self._pprec(p['dirs'], depth + 1)
            else:
                output += ''.join("  " * depth + str(x) + self._pprec(p[x],
                     depth + 1) for x in sorted(p))
        else: output += "  " * depth + str(p) + "\n"
        return output

    def dump(self, p = None, path = "/", depth=0):
        if not p:
            p = self._results
        output = ""
        if type(p) is dict:
            if not depth % 2:
                if 'save' in p:
                    output += "".join(path + x + "\n" for x in p['save'])
                if 'dirs' in p: output += self.dump(p['dirs'], path, depth + 1)
                if 'save' not in p and 'dirs' not in p: output += path + "\n"
            else:
                output += ''.join(self.dump(p[x], path + x + "/",
                    depth + 1) for x in sorted(p))
        return output

    def setRoot(self, root):
        self._cm.say(root)
        self._root = root
        self._checkpath()
        self._lastroot = self._root

        self._cd = self._results
        for folder in self._rootpath[1:]: # skip blank before leading "/"
            if 'dirs' not in self._cd: self._cd['dirs'] = {}
            if folder not in self._cd['dirs']: self._cd['dirs'][folder] = {}
            self._cd = self._cd['dirs'][folder]

    def processFiles(self, files):
        if not files: return
        locallinks = []
        for f in files:
            thispath = path.join(self._root, f)
            if path.islink(thispath): locallinks.append(f)
        for link in locallinks: files.remove(link)
        files.sort()
        linenum = 0
        morepackagesthanfiles = False
        packages = {}
        newfiles = {}
        for doc in files:
            thispath = path.join(self._root, doc)
            thisdoc = {'size': 0, 'owner': 0}
            try:
                thisstat = stat(thispath)
            except OSError: # assume permission denied
                continue
            thisdoc['size'] = thisstat.st_size
            try:
                thisdoc['owner'] = getpwuid(thisstat.st_uid).pw_name
            except KeyError:
                thisdoc['owner'] = thisstat.st_uid
            self._cm.say(thispath)
            self._cm.saynow(thispath + " - providers")
# get yum to ask rpm if this file is from a package
            pckgs = self._yb.rpmdb.whatProvides(thispath, None, (None, None, None))
            self._cm.saynow(thispath)
            if not len(pckgs): newfiles[doc] = thisdoc
            else:
                package = pckgs[0] # assume first match will do
                if package not in packages: packages[package] = {}
                packages[package][doc] = thisdoc

        modified = {}
        unmodified = {}
        pk = packages.keys()
        pk.sort()
        for p in pk:
            if p not in self._rpmva:
                self._cm.say(self._root + " - checking " + str(p))
                self._cm.saynow(self._root + " - " + str(p) + " - checking")
# get yum to ask rpm to verify this package
                self._rpmva[p] = dict((f, ", ".join(list(x.message for x in m)))
                    for f,m in self._yb.rpmdb.searchNevra(p[0], p[2], p[3], p[4],
                    p[1])[0].verify().iteritems())
                self._cm.saynow(self._root + " - " + str(p))
            for f in packages[p]:
                if path.join(self._root, f) in self._rpmva[p]:
                    modified[f] = packages[p][f]
                else: unmodified[f] = packages[p][f]
        #self._rpmva = {} # trash the cache - trade speed for memory
        if modified or newfiles:
            self._cd['save'] = modified.keys() + newfiles.keys()
            self._cd['savesize'] = sum(modified[x]['size'] for x in modified) +\
                sum(newfiles[x]['size'] for x in newfiles)
        if unmodified:
            self._cd['unmodified'] = unmodified.keys()
            self._cd['unmodifiedsize'] = sum(
                unmodified[x]['size'] for x in unmodified)

    def processFolders(self, dirs):
        localmounts = []
        locallinks = []
        for folder in dirs:
            thispath = path.join(self._root, folder)
            if path.islink(thispath): locallinks.append(folder)
            elif path.ismount(thispath): localmounts.append(folder)
        for link in locallinks: dirs.remove(link)
        for mount in localmounts:
            dirs.remove(mount)
            self._mounts.append(path.join(self._root, mount))
        dirs.sort()
        return

    def close(self):
        self._root = "/"
        self._checkpath()
        self._check(self._results)
        self._cm.end()

    def getRootPath(self):
        return self._rootpath

and so the actual app is nice and small.  it prints those progress messages and the final summary to stderr and the flat list of files and whole directories to stdout.  and it takes hours so i usually run it like this: time backup.py > backup-datetime.out; paplay --volume 30000 /usr/share/sounds/gnome/default/alerts/sonar.ogg

#!/usr/bin/python
# coding=utf-8

from os import walk
from sys import stdout, stderr
from pkgscan import pkgScanner

ps = pkgScanner()
for root, dirs, files in walk("/"):
    ps.setRoot(root)
    ps.processFolders(dirs)
    ps.processFiles(files)

ps.close()
stderr.write(str(ps))
stdout.write(ps.dump())

and that’s that.  oh, sometimes it can use up an awful lot of memory.  keep an eye on it.

what’s next?  i’d like to specify a set of starting points for the scan on the command line, maybe pass in an exclusions file.  also i want to check that i actually have permission to read those files i want to back up.
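
one possible shape for that, purely as a sketch: nothing here exists in the scripts above, and the function name scan and its parameters are hypothetical.  it walks several starting points, prunes excluded directory trees in place so os.walk never descends into them, and reports whether the current user can read each file.

```python
import os

def scan(starts, excluded=()):
    """Yield (path, readable) for each file under the starting points,
    skipping any directory tree listed in `excluded`."""
    excluded = set(os.path.normpath(e) for e in excluded)
    for start in starts:
        for root, dirs, files in os.walk(start):
            # assigning to dirs[:] changes what os.walk visits next
            dirs[:] = sorted(
                d for d in dirs
                if os.path.normpath(os.path.join(root, d)) not in excluded)
            for name in sorted(files):
                full = os.path.join(root, name)
                # os.access answers for the user running the scan
                yield full, os.access(full, os.R_OK)
```

the starting points could come from sys.argv[1:] and the exclusions from a file with one path per line.  bear in mind that root can read nearly everything, so the readable flag is mostly useful when the scan runs as an ordinary user.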

it’d be nice to be able to generate a backup list for my non-admin user, then pass that list in to the scanner when run as root to generate a short list of stuff that has to be backed up by root.

looking at the output generated so far i’ll need to start writing some (possibly plugin-based) rules to handle/exclude certain files – some config files should be diffed rather than just saved, some files should be backed up by their application’s own backup system (e.g. databases), some files should only be backed up when the user isn’t logged in, some only on shutdown/startup, some only in single user mode.

wnck FTW!

well, yesterday’s hack turned out to be pretty useless.  effectively it just converted hamster into zeitgeist journal (or gnome activity monitor)  – which is okay until you start using an app which isn’t a zeitgeist data provider or which doesn’t happen to emit a signal for the activity you’re performing.

so, i thought i’d have a go at converting hamster into creeper instead.  thanks to creeper for showing me how it’s done.  i’ve called this cree.py 😉

#!/usr/bin/python

# a python version of creeper (a vala app) to monitor active windows
# actually turns hamster into creeper

# because the mainloop appears to catch exceptions
from traceback import print_exc
import hamster.client
class hamster_handler(hamster.client.Storage):
   def __init__(self):
      self.nc = None
      hamster.client.Storage.__init__(self)
      
   def add_fact(self, fact):
      # FIXME insert clever rules here
      fact = fact.replace(",", ";") # don't accidentally create descriptions
      fact = fact.replace("@", "(a)") # don't accidentally create categories
      hamster.client.Storage.add_fact(self, fact)

   def handler(self, scr, prev = None):
      try:
         if prev and self.nc != None: prev.disconnect(self.nc)
         win = scr.get_active_window()
         if win:
            self.add_fact(win.get_name())
            self.nc = win.connect("name_changed", self.name_handler)
      except KeyboardInterrupt: raise
      except:
         print_exc()
        
   def name_handler(self, win):
      try:
         self.add_fact(win.get_name())
      except KeyboardInterrupt: raise
      except:
         print_exc()

hh = hamster_handler()
from gobject import MainLoop
ml = MainLoop()
from wnck import screen_get_default
sc = screen_get_default()
sc.connect("active_window_changed", hh.handler)
sc.connect("window_stacking_changed", hh.handler)
hh.handler(sc)
ml.run()

this is working pretty well for me so far .. on fedora 14.  just tried on 13 and the hamster python library is too old i think.  ah well – yet another reason to upgrade!

python, hamster and zeitgeist FTW!

update: note that this turned out not to be all that useful – tracking active windows into hamster turned out better.

in response to gnome bug 639018 and my general desire to track automatically what i’ve done, i’ve made a python script which connects to the zeitgeist activity monitor and copies its messages to the hamster time tracker.  it goes like this:

#!/usr/bin/python

# monitor zeitgeist and do stuff
from zeitgeist.client import ZeitgeistClient
from zeitgeist.datamodel import TimeRange, Event
from gobject import MainLoop

import hamster.client
class hamster_handler(hamster.client.Storage):
   def handler(self, tr, ev):
      # because the mainloop appears to catch exceptions
      from traceback import print_exc
      from urlparse import urlparse
      try:
         # FIXME insert clever rules here
         app = urlparse(ev[0].actor).netloc
         with open("/usr/share/applications/" + app) as desk:
            comments = filter(lambda x: x.startswith("Comment[en_GB]="), desk)
         # split on the first "=" only, in case the comment itself contains one
         comment = comments[0].split("=", 1)[1].strip()
         self.add_fact(comment + " - " + ev[0].subjects[0].text)
      except:
         print_exc()

hh = hamster_handler()
ml = MainLoop()
ZeitgeistClient().install_monitor(
    TimeRange.from_now(),
    [Event()],
    hh.handler,
    hh.handler)
ml.run()

it never ends until it’s killed so you’ll probably want to run it in the background – i’ve added it to my session ‘startup applications’.  if it doesn’t appear to be working then run it from the command line instead – you should see some error messages if it’s failing to update hamster.

on my fedora 14 system i only get updates for local text files, images and videos opened in gedit, EoG and totem.  on ubuntu i imagine you’ll get a lot more updates.  OTOH, on ubuntu the script will probably need some tweaking for the hard-coded paths and locale.
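
for the locale part of that tweaking, something along these lines might do.  it is a sketch only: the function name desktop_comment is made up, the sample lines are invented, and it handles just the plain lang_COUNTRY form of a locale.  it looks for the Comment in the requested locale, then in the bare language, then falls back to the unlocalised Comment key.

```python
def desktop_comment(lines, locale="en_GB"):
    """Pick the best Comment from the lines of a .desktop file:
    Comment[lang_COUNTRY], then Comment[lang], then plain Comment."""
    wanted = ["Comment[%s]" % locale,
              "Comment[%s]" % locale.split("_")[0],
              "Comment"]
    found = {}
    for line in lines:
        # partition splits on the first "=" only, so values may contain "="
        key, sep, value = line.partition("=")
        if sep and key.strip() in wanted:
            found.setdefault(key.strip(), value.strip())
    for key in wanted:
        if key in found:
            return found[key]
    return None

# invented sample content, not copied from a real .desktop file
sample = [
    "[Desktop Entry]\n",
    "Name=gedit\n",
    "Comment=Edit text files\n",
    "Comment[en_GB]=Edit text files (a=b)\n",
    "Comment[fr]=Editer des fichiers texte\n",
]
```

with the sample above, the default en_GB lookup finds the exact match, "fr_FR" falls back to the Comment[fr] line, and a locale with no translation such as "de_DE" gets the plain Comment.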

under surveillance

i’ve decided not to link this to my hamster-to-empathy updater – i don’t really want to broadcast a stream of every little thing i do .. particularly if my IM accounts include twitter and facebook status feeds.  🙂

half wire

this week i could have:

enabled six volunteers, living in remote villages in Uganda, to get the training they need to give life-saving medical treatment to children with malaria.

but i decided that, all things considered, it’d be better for everyone if i just spent the money on some nice sunglasses for me which are almost, but not quite, just what i was looking for.