what to back up

to celebrate the release of fedora 15 and in the spirit of the “release early” part of “release early, release often” i’ve decided to post my latest script.

this one’s a bit more involved than the last ones, but the task is to answer this question:

what files do i need to back up from my fedora system?

in particular there are a few problems to solve:

  1. i don’t want to waste space backing up unmodified files from software packages.
  2. i want to see which parts of the file system are using up the most backup space so i can see if i can do something about that.
  3. i need a concise summary of what needs to be backed up so i can review it.  if an entire directory tree needs to be backed up then i only want to see that mentioned once, not a line of output for each file!

if it wasn’t for 2 and 3 i could just run rpm -Va and then maybe something like find / -exec rpm -qf {} \; | grep "not owned by any package", but that’s all rather fiddly and produces vast, unmanageable lists of files.  so i turn to python.  🙂

what i’ve ended up with is a tiny app, a file/package scanning class and a helper class for showing progress – checking every file on the system takes hours so you want to know how you’re getting on.

so, first up, the progress display class.  i’ve saved this as cmdmsg.py.

#!/usr/bin/python
# coding=utf-8

# a utility function taken from stackoverflow
def getTerminalSize():
    """
    returns (lines:int, cols:int)
    """
    import os, struct
    def ioctl_GWINSZ(fd):
        import fcntl, termios
        return struct.unpack("hh", fcntl.ioctl(fd, termios.TIOCGWINSZ, "1234"))
    # try stdin, stdout, stderr
    for fd in (0, 1, 2):
        try:
            return ioctl_GWINSZ(fd)
        except:
            pass
    # try os.ctermid()
    try:
        fd = os.open(os.ctermid(), os.O_RDONLY)
        try:
            return ioctl_GWINSZ(fd)
        finally:
            os.close(fd)
    except:
        pass
    # try `stty size` (if stty fails it prints nothing, so check we actually got two values)
    try:
        size = tuple(int(x) for x in os.popen("stty size", "r").read().split())
        if len(size) == 2:
            return size
    except:
        pass
    # try environment variables
    try:
        return tuple(int(os.getenv(var)) for var in ("LINES", "COLUMNS"))
    except:
        pass
    # i give up. return default.
    return (25, 80)

from datetime import datetime, timedelta
from os.path import commonprefix
from sys import stderr
class cmdmsg():
    def __init__(self, interval = timedelta(0, 1, 0)):
        self.msg = ""
        self.height, self.width = getTerminalSize()
        self.last = datetime.now()
        self.interval = interval

    def say(self, msg, interval = None):
        if interval == None: interval = self.interval
        if datetime.now() - self.last < interval: return
        self.last = datetime.now()
        # multi-byte characters really futz with this stuff
        msg = msg.replace("\t", " ").decode(
            "utf8", 'replace').encode("ascii", 'replace')
        if len(msg) > (self.width - 1):
            ends = self.width / 2 - 2
            msg = msg[:ends] + "..." + msg[-ends:]
        offset = len(commonprefix([self.msg, msg]))
        # BS moves cursor but doesn't appear to remove content - so print spaces
        if len(self.msg) > len(msg):
            extra = len(self.msg) - len(msg)
            stderr.write("\b" * extra + " " * extra)
        stderr.write("\b" * len(self.msg[offset:]) + msg[offset:])
        self.msg = msg

    def saynow(self, msg):
        self.say(msg, timedelta(0, 0, 0))

    def end(self):
        self.saynow("")

    def spit(self, msg):
        stderr.write("\r" + " " * len(self.msg) + "\r" + msg + "\n" + self.msg)

this allows the scanning module to write and overwrite progress messages to the terminal without lots of annoying scrolling (which takes a lot of CPU and means you lose key messages).
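
to give a feel for how it's used, here's a tiny made-up example: say() updates the progress line at most once per interval, saynow() forces an update, spit() prints a permanent line above the progress line, and end() clears it when you're done.

from time import sleep
from cmdmsg import cmdmsg

cm = cmdmsg()                       # default interval: one update per second
for n in range(500):
    cm.say("processing item %d of 500" % n)  # rate-limited, overwrites in place
    if n == 250:
        cm.spit("half way there")   # a permanent message above the progress line
    sleep(0.01)
cm.end()                            # blank out the progress line when finished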

next, the scanning module.  i didn’t want to maintain a complete list of all files on the system in a big array and i wanted to do the summarising as i went along, so this has got some tricksy fiddling around with ‘references’ into a big dictionary hierarchy.

(each level of filesystem hierarchy uses two levels of hierarchy in the dictionary.  this is because the dictionary entry for a folder doesn’t have the sub-folders as keys, it contains a set of metadata keys and a ‘dirs’ key for the sub-folders.)
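
so after a scan the results dictionary might look something like this (paths and sizes made up, purely to show the shape):

results = {
    'dirs': {
        'etc': {
            'save': ['motd'],              # new/modified files directly in /etc
            'savesize': 120,
            'unmodified': ['hosts'],       # package files that rpm says are untouched
            'unmodifiedsize': 158,
        },
        'usr': {
            'unmodifiedsize': 123456789,   # nothing under /usr needs saving
        },
    },
}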

but basically it allows you to maintain a list of files and directories with the required information – should they be backed up or not.

#!/usr/bin/python
# coding=utf-8

# a utility function
def fileSize(bytes):
    suffix = [' bytes', 'K', 'M', 'G', 'T', 'P', 'E']
    size = float(bytes)
    index = 0
    while size >= 1024 and index < len(suffix) - 1:
        index += 1
        size /= 1024
    return str(int(round(size))) + suffix[index]

from pwd import getpwuid
import yum
from datetime import datetime, timedelta
from os.path import commonprefix
from cmdmsg import cmdmsg
from os import path, stat

class pkgScanner():
    def __init__(self):
        self._lastroot = ""
        self._results = {}
        self._rootpath = []
        self._mounts = []
        self._root = ""
        self._rpmva = {}
        self._yb=yum.YumBase()
        self._yb.setCacheDir()
        self._cm = cmdmsg(timedelta(0, 0, 25000))
        self._cd = self._results

    def __str__(self):
        return "scanned:\n" + self._pprec() + "\nnot scanned:\n" + "\n".join(
            self._mounts)

    # see if sub folders can be 'collapsed' into their parent
    def _check(self, folder, thisroot = None):
        removes = []
        if 'dirs' in folder:
            for sub in folder['dirs']:
                # copy sizes to parent so it has totals for the whole tree
                if 'savesize' in folder['dirs'][sub] and\
                    folder['dirs'][sub]['savesize']:
                    if 'savesize' not in folder: folder['savesize'] = 0
                    folder['savesize'] += folder['dirs'][sub]['savesize']
                if 'unmodifiedsize' in folder['dirs'][sub] and\
                    folder['dirs'][sub]['unmodifiedsize']:
                    if 'unmodifiedsize' not in folder: folder['unmodifiedsize'] = 0
                    folder['unmodifiedsize'] += folder['dirs'][sub]['unmodifiedsize']
                if not len(folder['dirs'][sub]):
                    if thisroot: self._cm.spit(
                        "removing empty " + sub + " from " + thisroot)
                    removes.append(sub)
                elif 'dirs' not in folder['dirs'][sub] and\
                    'unmodified' not in folder['dirs'][sub] and\
                    'save' in folder['dirs'][sub]:
                    if 'save' not in folder: folder['save'] = []
                    folder['save'].append(sub)
                    if thisroot: self._cm.spit(
                        "removing all new/modified " + sub + " from " + thisroot)
                    removes.append(sub)
                elif 'dirs' not in folder['dirs'][sub] and\
                    'unmodified' in folder['dirs'][sub] and\
                    'save' not in folder['dirs'][sub]:
                    if 'unmodified' not in folder: folder['unmodified'] = []
                    folder['unmodified'].append(sub)
                    if thisroot: self._cm.spit(
                        "removing all unmodified " + sub + " from " + thisroot)
                    removes.append(sub)
            for sub in removes:
                del folder['dirs'][sub]
            if not len(folder['dirs']): del folder['dirs']

    # figure out which folders should be checked now
    def _checkpath(self):
        self._rootpath = self._root.split("/")
        if self._rootpath[-1] == "": self._rootpath = self._rootpath[:-1]
        lastrootpath = self._lastroot.split("/")
        if lastrootpath[-1] == "": lastrootpath = lastrootpath[:-1]
        if len(self._rootpath) <= len(lastrootpath):
            # find common path of root and lastroot
            n = 1 # skip leading blank before "/"
            folder = self._results
            while n < len(self._rootpath) and self._rootpath[n] == lastrootpath[n]:
                folder = folder['dirs'][self._rootpath[n]]
                n += 1
            checkpath = "/".join(lastrootpath[:n])
            tocheck = []
            checkpaths = []
            while n < len(lastrootpath):
                folder = folder['dirs'][lastrootpath[n]]
                checkpath += "/" + lastrootpath[n]
                tocheck.append(folder)
                checkpaths.append(checkpath)
                n += 1
            tocheck.reverse()
            checkpaths.reverse()
            for index, folder in enumerate(tocheck):
                # pass the path so _check can report what it collapses
                self._check(folder, checkpaths[index])

    # nicely formatted 'pretty print' of hierarchy
    # should probably just make this two levels per recursion
    # rather than checking if depth % 2
    def _pprec(self, p = None, depth=0):
        if not p:
            p = self._results
            output = "/"
        else: output = ""
        if type(p) is dict:
            if not depth % 2:
                if 'unmodifiedsize' not in p and 'savesize' not in p:
                    output += " (empty)"
                elif 'savesize' not in p or not p['savesize']: output += " (none)"
                elif 'unmodifiedsize' not in p or not p['unmodifiedsize']:
                    output += " (all)"
                else: output += " save " + fileSize(p['savesize']) + "/" +\
                    fileSize(p['savesize'] + p['unmodifiedsize']) + "=" +\
                    str((100 * p['savesize']) / (p['savesize'] +\
                    p['unmodifiedsize'])) + "%"
                if 'save' in p:
                    names = ", ".join(p['save'])
                    if len(names) > 30: names = names[:30] + "..."
                    output += " (save " + str(len(p['save'])) + " local: " + names + ")"
                if 'unmodified' in p:
                    names = ", ".join(p['unmodified'])
                    if len(names) > 30: names = names[:30] + "..."
                    output += " (unmodified " + str(len(p['unmodified'])) + " local: " +\
                        names + ")"
                output += "\n"
                if 'dirs' in p: output += self._pprec(p['dirs'], depth + 1)
            else:
                output += ''.join("  " * depth + str(x) + self._pprec(p[x],
                     depth + 1) for x in sorted(p))
        else: output += "  " * depth + str(p) + "\n"
        return output

    def dump(self, p = None, path = "/", depth=0):
        if not p:
            p = self._results
        output = ""
        if type(p) is dict:
            if not depth % 2:
                if 'save' in p:
                    output += "".join(path + x + "\n" for x in p['save'])
                if 'dirs' in p: output += self.dump(p['dirs'], path, depth + 1)
                if 'save' not in p and 'dirs' not in p: output += path + "\n"
            else:
                output += ''.join(self.dump(p[x], path + x + "/",
                    depth + 1) for x in sorted(p))
        return output

    def setRoot(self, root):
        self._cm.say(root)
        self._root = root
        self._checkpath()
        self._lastroot = self._root

        self._cd = self._results
        for folder in self._rootpath[1:]: # skip blank before leading "/"
            if 'dirs' not in self._cd: self._cd['dirs'] = {}
            if folder not in self._cd['dirs']: self._cd['dirs'][folder] = {}
            self._cd = self._cd['dirs'][folder]

    def processFiles(self, files):
        if not files: return
        locallinks = []
        for f in files:
            thispath = path.join(self._root, f)
            if path.islink(thispath): locallinks.append(f)
        for link in locallinks: files.remove(link)
        files.sort()
        packages = {}
        newfiles = {}
        for doc in files:
            thispath = path.join(self._root, doc)
            thisdoc = {'size': 0, 'owner': 0}
            try:
                thisstat = stat(thispath)
            except OSError: # assume permission denied
                continue
            thisdoc['size'] = thisstat.st_size
            try:
                thisdoc['owner'] = getpwuid(thisstat.st_uid).pw_name
            except KeyError:
                thisdoc['owner'] = thisstat.st_uid
            self._cm.say(thispath)
            self._cm.saynow(thispath + " - providers")
            # get yum to ask rpm if this file is from a package
            pckgs = self._yb.rpmdb.whatProvides(thispath, None, (None, None, None))
            self._cm.saynow(thispath)
            if not len(pckgs): newfiles[doc] = thisdoc
            else:
                package = pckgs[0] # assume first match will do
                if package not in packages: packages[package] = {}
                packages[package][doc] = thisdoc

        modified = {}
        unmodified = {}
        pk = packages.keys()
        pk.sort()
        for p in pk:
            if p not in self._rpmva:
                self._cm.say(self._root + " - checking " + str(p))
                self._cm.saynow(self._root + " - " + str(p) + " - checking")
                # get yum to ask rpm to verify this package
                self._rpmva[p] = dict((f, ", ".join(list(x.message for x in m)))
                    for f,m in self._yb.rpmdb.searchNevra(p[0], p[2], p[3], p[4],
                    p[1])[0].verify().iteritems())
                self._cm.saynow(self._root + " - " + str(p))
            for f in packages[p]:
                if path.join(self._root, f) in self._rpmva[p]:
                    modified[f] = packages[p][f]
                else: unmodified[f] = packages[p][f]
        #rpmva = {} # trash the cache - trade speed for memory
        if modified or newfiles:
            self._cd['save'] = modified.keys() + newfiles.keys()
            self._cd['savesize'] = sum(modified[x]['size'] for x in modified) +\
                sum(newfiles[x]['size'] for x in newfiles)
        if unmodified:
            self._cd['unmodified'] = unmodified.keys()
            self._cd['unmodifiedsize'] = sum(
                unmodified[x]['size'] for x in unmodified)

    def processFolders(self, dirs):
        localmounts = []
        locallinks = []
        for folder in dirs:
            thispath = path.join(self._root, folder)
            if path.islink(thispath): locallinks.append(folder)
            elif path.ismount(thispath): localmounts.append(folder)
        for link in locallinks: dirs.remove(link)
        for mount in localmounts:
            dirs.remove(mount)
            self._mounts.append(path.join(self._root, mount))
        dirs.sort()
        return

    def close(self):
        self._root = "/"
        self._checkpath()
        self._check(self._results)
        self._cm.end()

    def getRootPath(self):
        return self._rootpath

and so the actual app is nice and small.  it prints those progress messages and the final summary to stderr and the flat list of files and whole directories to stdout.  and it takes hours so i usually run it like this: time backup.py > backup-datetime.out; paplay --volume 30000 /usr/share/sounds/gnome/default/alerts/sonar.ogg

#!/usr/bin/python
# coding=utf-8

from os import walk
from sys import stdout, stderr
from pkgscan import pkgScanner

ps = pkgScanner()
# walk the whole filesystem; processFolders prunes 'dirs' in place so that
# os.walk doesn't descend into other mounts or symlinked directories
for root, dirs, files in walk("/"):
    ps.setRoot(root)
    ps.processFolders(dirs)
    ps.processFiles(files)

ps.close()
stderr.write(str(ps))    # the final summary goes to stderr
stdout.write(ps.dump())  # the flat list of files/directories goes to stdout

and that’s that.  oh, sometimes it can use up an awful lot of memory.  keep an eye on it.
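
for reference, the stderr summary ends up looking something like this (heavily trimmed, with made-up paths and sizes):

scanned:
/ save 120 bytes/118M=0%
  etc save 120 bytes/278 bytes=43% (save 1 local: motd) (unmodified 1 local: hosts)
  usr (none)

not scanned:
/boot
/mnt/media

and the stdout list is just one path per line - individual files to save plus whole directories whose entire contents need saving:

/etc/motd
/srv/www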

what’s next?  i’d like to specify a set of starting points for the scan on the command line, maybe pass in an exclusions file.  also i want to check that i actually have permission to read those files i want to back up.

it’d be nice to be able to generate a backup list for my non-admin user, then pass that list in to the scanner when run as root to generate a short list of stuff that has to be backed up by root.

looking at the output generated so far i’ll need to start writing some (possibly plugin-based) rules to handle/exclude certain files – some config files should be diffed rather than just saved, some files should be backed up by their application’s own backup system (e.g. databases), some files should only be backed up when the user isn’t logged in, some only on shutdown/startup, some only in single user mode.
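
a very rough sketch of what one of those rules might look like (the class name and action strings are entirely made up, just to sketch the shape):

class diffConfigRule():
    """changed config files should be diffed against the packaged version
    rather than copied wholesale."""
    def matches(self, filepath):
        # purely illustrative: claim anything under /etc
        return filepath.startswith("/etc/")

    def action(self, filepath):
        return "diff"  # as opposed to, say, "save", "skip" or "defer-to-app"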
