Update 2015-08-18: Boy do I feel silly! It turns out there's a much simpler and much more robust way of doing what I've done with the scripts below. Using any revision control system (e.g. cvs, git, svn) that stores revisions as deltas (and most if not all do), all you need to do is copy anything into a repository and commit it. Tada! The rcs takes care of the incremental part for you through its use of revision deltas (i.e. patches).

As a big fan of git I was hoping there was a way for it to fill this role. I had mistakenly thought that git stores whole files, without diffs/deltas, for every revision. That's only true until git garbage collects, as I found out with my Stack Overflow question: Can git use patch/diff based storage? There's some great reading there, check it out. Simply garbage collect after adding and committing in git and you automatically get space-efficient incremental backups, with the bonus of the robustness and reliability of git (or whatever rcs you choose).

Bonus: you can delta anything you can store in an rcs repository: files, binary or text, archives, images, etc. You still get the space savings! So, quite literally, my database backup is now something like this: (1) mysql dump, (2) git add dump, (3) git commit dump, (4) git gc. Simple, powerful, elegant, beautiful. As it should be!
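Here's a minimal sketch of those four steps, assuming a git repository has already been initialized in a /backup/db-git directory and that the dump is tracked as dump.sql (both names are hypothetical, adjust to taste):

#!/bin/sh
# Sketch of the git-based approach from the update above. git's delta
# compression (after gc packs the objects) handles the incremental storage.
cd /backup/db-git || exit $?
mysqldump --all-databases --events --lock-all-tables >dump.sql
git add dump.sql
git commit -m "db backup `date +'%Y%m%d%H%M%S'`"
git gc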
Space Efficient Mysqldump Backups Using Incremental Patches
I’m now using Duplicity for super convenient one-liner style incremental backup commands in a simple shell script (seriously, it’s like three commands long) but what I’m missing is incremental space-savings on my database dump. Right now my mysqldump produces about a 40MB file, about 10MB compressed. It’s irked me for some time that there’s no simple way to do intra-file incremental backups. I’ve also wanted to do intra-day, not just daily, backups. Duplicity’s incremental backups allow for that but full database backups add up quickly. Well, I finally went ahead and wrote a shell script to do it and a recover script that can recover to any date in the series of backups – just like duplicity. The key was interdiff for incremental patches. Here’s how I did it…
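To make the interdiff idea concrete before the full scripts (file names here are hypothetical): each new dump is diffed against the original full dump, and interdiff then drops everything already covered by the previous patch, so each stored patch contains only that run's changes.

# Full patch from the original base dump to today's dump
diff -u mysql.base.sql mysql.today.sql >today.patch
# Keep only what yesterday's patch (made against the same base) doesn't already cover
interdiff yesterday.patch today.patch >today.incremental.patch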
Caution! The following code should NOT be considered ready for critical systems. I’m using it on my personal server in parallel with a separate full database backup mechanism. The script should work fine but I wouldn’t trust it as my only backup just yet.
The first file is backup-db.sh; I hope the code comments help explain what's going on:
#!/bin/sh

cd /backup/backup-db || exit $?

ts=`date +'%Y%m%d%H%M%S'`
newSQLFile="mysql.$ts.sql"

# Find original mysqldump to determine whether we need to make an incremental
# backup or a full one. This is the original mysqldump from the first clean
# run. We always use patches after the first full dump so we can save space.
lastSQLFile=`ls -1 mysql.*.sql.gz 2>/dev/null | sort -rdfu | head -n 1`

# Standard mysqldump backup
mysqldump --all-databases --events --lock-all-tables >"$newSQLFile"

# If there was a previous mysqldump then let's make a patch against the new one
if [ -n "$lastSQLFile" ]; then
    # Find last patch file. We'll use this to generate incremental patches
    # if a previous one exists. That helps save space rather than having
    # ever-growing full patches against the original mysqldump.
    lastPatchFile=`ls -1 mysql.*.sql.patch.gz 2>/dev/null | sort -rdfu | head -n 1`

    # Uncompress the last mysqldump so we can diff against it
    gunzip "$lastSQLFile"
    lastSQLFile=`echo "$lastSQLFile" | perl -np -e 's/\.gz$//'`

    # Diff the last mysqldump and the new mysqldump to generate a patch file
    diff -u "$lastSQLFile" "$newSQLFile" >"$newSQLFile.patch"

    # Generate an incremental patch based on the last patch if it exists.
    if [ -n "$lastPatchFile" ]; then
        gunzip "$lastPatchFile"
        lastPatchFile=`echo "$lastPatchFile" | perl -np -e 's/\.gz$//'`

        # Read up on interdiff. It's a tool to find only the unique
        # updates in the second patch as compared to the first patch so
        # that there are no duplicate diffs in the second. This helps
        # save space.
        interdiff "$lastPatchFile" "$newSQLFile.patch" >"$newSQLFile.patch.inter"

        gzip "$lastPatchFile"
        mv "$newSQLFile.patch.inter" "$newSQLFile.patch"
    fi

    gzip "$lastSQLFile"
    gzip "$newSQLFile.patch"
    gzip "$newSQLFile"
    #rm "$newSQLFile"  # gzip to test the script, rm once you're confident it's working
else
    # Else, no previous mysqldump so we compress and we're done. Next time
    # around we'll do the incremental backup part.
    gzip "$newSQLFile"
fi
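A script like this would typically be driven from cron to get the intra-day backups mentioned above. A hypothetical crontab entry, assuming the script is saved as /backup/backup-db/backup-db.sh and marked executable:

# Dump and generate an incremental patch every six hours
0 */6 * * * /backup/backup-db/backup-db.sh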
Now, here’s the recovery script:
#!/bin/sh

# Ensure we have the correct arguments
if [ $# -ne 2 ]; then
    echo "usage: $0 sql-file.gz last-patch-file.gz"
    exit 1
fi

sqlGZFile=$1
sqlFile=`echo "$sqlGZFile" | perl -np -e 's/\.gz$//'`
lastPatch=$2

# Find all patch files matching our pattern and sort in ascending order. This
# determines the sequence the patches will be applied in and it *must* be
# correct or else patching will fail.
patchFiles=`ls -1 mysql.*.sql.patch.gz 2>/dev/null | sort -dfu`

# Uncompress the backup to a new file which we'll work on so we don't touch any
# original files.
gunzip -c "$sqlGZFile" >"$sqlFile.patched"

# Loop over all patches in sequence and apply them to the new patched file
for patchFile in $patchFiles; do
    # Stop patching if we're past the last patch
    if [ "$patchFile" \> "$lastPatch" ]; then
        break;
    fi

    # Apply each patch in sequence
    gunzip -c "$patchFile" | patch "$sqlFile.patched" || exit $?
done
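As a usage sketch (the script name and file names here are hypothetical): run it from the backup directory, pass the original full dump and the last patch you want applied, then feed the rebuilt dump back into mysql.

# Rebuild the dump up to and including the chosen patch...
./recover-db.sh mysql.20150801000000.sql.gz mysql.20150803120000.sql.patch.gz
# ...then restore the patched result
mysql -u root -p <mysql.20150801000000.sql.patched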
There it is! I hope it helps someone out there who has wished for the same thing.
*For the record, the scripts were written and tested on CentOS 6 using the Bash shell.