Space Efficient Mysqldump Backups Using Incremental Patches

Update 2015-08-18: Boy do I feel silly! It turns out there’s a much simpler and much more robust way of doing what I’ve done with the scripts below. Using any revision control system (e.g. cvs, git, svn) that stores revisions as deltas (and most if not all do), all you need to do is copy anything into a revision control repository and commit it. Tada! The rcs takes care of the incremental part for you by its use of revision deltas (i.e. patches).

As a big fan of git I was hoping there was a way for it to fill this role. I had mistakenly thought that git stores whole files, without diffs/deltas, for every revision. This is true only until git garbage collects, as I found out with my Stack Overflow question: Can git use patch/diff based storage? There’s some great reading there, check it out. Simply garbage collect after adding and committing in git and you automatically get space efficient incremental backups, with the bonus of the robustness and reliability of git (or whatever rcs you choose). Bonus: you can delta anything you can store in an rcs repository, meaning files, binary or text, archives, images, etc. You still get the space savings!

So, quite literally, my database backup is now something like this: (1) mysql dump, (2) git add dump, (3) git commit dump, (4) git gc. Simple, powerful, elegant, beautiful. As it should be!
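In script form, that’s something like the following (a minimal sketch: it assumes the repository was already created with git init and that mysql credentials come from ~/.my.cnf; the database name and paths are illustrative):

```sh
#!/bin/sh
# Minimal sketch of the git-based backup. Assumes `git init` was already
# run in /backup/db and ~/.my.cnf supplies credentials; the database name
# and paths are illustrative.

cd /backup/db || exit 1

mysqldump --single-transaction mydatabase > mydatabase.sql
git add mydatabase.sql
git commit -m "backup $(date +%Y-%m-%dT%H:%M:%S)"
git gc --quiet     # repack so revisions are stored as deltas
```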


I’m now using Duplicity for super convenient one-liner-style incremental backup commands in a simple shell script (seriously, it’s like three commands long), but what I’m missing is incremental space savings on my database dump. Right now my mysqldump produces about a 40MB file, about 10MB compressed. It’s irked me for some time that there’s no simple way to do intra-file incremental backups. I’ve also wanted to do intra-day, not just daily, backups. Duplicity’s incremental backups allow for that, but full database backups add up quickly. Well, I finally went ahead and wrote a shell script to do it, and a recovery script that can recover to any date in the series of backups – just like Duplicity. The key was interdiff for incremental patches. Here’s how I did it…
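In outline, the backup side works something like this (a simplified sketch of the interdiff approach with illustrative file names, not the full scripts):

```sh
#!/bin/sh
# Simplified sketch: keep one full dump (base.sql) and store only an
# incremental patch per run. File names and paths are illustrative.

cd /backup/db || exit 1

mysqldump mydatabase > current.sql
diff -u base.sql current.sql > cumulative.new    # full diff against the base

if [ -f cumulative.patch ]; then
    # interdiff turns two cumulative patches into the small patch that
    # transforms the previous state into the current one
    interdiff cumulative.patch cumulative.new > "inc-$(date +%Y%m%d%H%M).patch"
else
    cp cumulative.new "inc-$(date +%Y%m%d%H%M).patch"
fi

mv cumulative.new cumulative.patch
rm current.sql

# Recovery to any point in the series: start from base.sql and apply the
# incremental patches in order:
#   cp base.sql restore.sql
#   for p in inc-*.patch; do patch restore.sql < "$p"; done
```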


How to fix WordPress Twenty Eleven Featured Image

I like the Twenty Eleven theme and I still haven’t upgraded it. Luckily, some people are keeping it up to date and compatible with the latest WordPress in 2015! But there was one bug that bothered me for a long time: featured images were broken. These are the big header images at the top of posts (for example, the big image at the top of this post, or the grid-of-cars image at the top of one of my other posts). These images are just broken and will not display your custom featured image in stock Twenty Eleven. So here’s how I fixed it…


Request for a Versioning File System on Linux


The people who make file systems are developers. As a developer myself, the value of a versioning file system is so keenly clear that I’m surprised it’s not, at the very least, a standard option on every single file system ever created. So this is a request to all the file system developers out there: please, what can we do to get a versioning file system?

I’ve googled for years looking for a production-ready versioning file system. I think a fuse-based fs would even suffice. I may even take the plunge and see what I can do with fuse and perl. But I’m just so surprised it’s not already done.

There are projects out there like ext3cow, which looks like the Right Way(TM) to do it but seems dead, and Wayback FS, which looks dead too. I love the way they implemented versioning in ext3cow. It looks exactly like what I’d want: everything integrated at the console, as a first-class citizen of the fs.
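To show what I mean by console integration, ext3cow’s interface, as I recall it from the project’s paper (the syntax here is approximate), looks something like this:

```sh
echo "first draft" > notes.txt
snapshot /mnt/ext3cow        # take an epoch snapshot; prints the epoch, e.g. 1057845263
echo "second draft" > notes.txt
cat notes.txt@1057845263     # read the file as it was at that epoch
```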

Now, I know people say to just use a revision control system. But initializing, adding and committing files on an ongoing basis is just not something I’m going to do. Anything requiring manual labour will slip through the cracks one day. We have computers to do things for us; this should just be one more thing.

Other arguments against versioning file systems point to constantly changing files, like database blobs and log files, or just plain very large files (images, movies, data, etc.). I agree there are some things you don’t want to version, but the benefit to a developer of versioning would be immense.

I once lost a day’s work because of a bad console rm. Since it was within the span of a work day, what was supposed to save me? Frequent rcs commits? I don’t use rcs for backup (neither should you, but that’s a whole other story) and I don’t commit unfinished code; that’s just bad practice and a symptom of a problem. So daily backups wouldn’t have saved me, but a versioning file system would have. It got me so worked up that I created a poor man’s versioning “file system” in shell script, using rsync to mirror a directory and timestamp the backups. Unfortunately, its IO requirements caused stuttering when doing normal work. I think I’ll post that shell script in case anyone wants to improve on it and reduce the IO requirements. Ooh, good time to try github.
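The core of it went something like this (a reconstruction of the idea, not the actual script; paths are illustrative):

```sh
#!/bin/sh
# Poor man's versioning "file system": rsync the working directory into a
# timestamped snapshot, hard-linking unchanged files against the previous
# snapshot so each backup only costs the space of what changed.

SRC="$HOME/work"
DST="/backup/work"
STAMP=$(date +%Y%m%d-%H%M%S)

# On the very first run "latest" won't exist yet; rsync warns and simply
# copies everything.
rsync -a --delete --link-dest="$DST/latest" "$SRC/" "$DST/$STAMP/"

# Point "latest" at the new snapshot for the next run's --link-dest.
rm -f "$DST/latest"
ln -s "$DST/$STAMP" "$DST/latest"
```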

So, what can we do to get a versioning file system on Linux? I’ve wanted one for years and I can’t understand why it’s not a priority. If I don’t have the knowledge to code a file system myself, what can I do to bring this goal a little closer?

git – a love/hate relationship (with maybe a little more hate)


Update 2013-10-05: This is a really old post. After an initial phase of skepticism, I am now an enthusiastic git fan and would use it over anything else. It really is great. I had to let go of a few things, but once you do, it turns out better than you thought. And nothing is absolutely hidden anyway; all the commands are there to get at what you want. And all the negatives I talk about below have been solved, and relatively easily at that.

As per my previous post on git, we’ve been investigating and running trial repositories using git. A tiny bit of background: we’re a cvs shop and I have a tonne of experience with it and I’m comfortable using it, although not so comfortable that I won’t consider other solutions. So far, though, git has been frustrating me. But let me start with the positives.

There are many wins with git, but I’ll mention what I feel are its biggest assets: disconnected operation and speed. I would love to have these in cvs. But there are also negatives to git.

The biggest negative for me is the difficulty, if not impossibility, of centrally enforcing some rules that we could easily enforce in cvs. I don’t like meaningless hashes for tree revisions, but I could live with that.

And why can’t add/commit hooks on the committer side accomplish the same things as push hooks on the server side?

Is there a way, when reading file content during pushes to do policy enforcement, to avoid recursing the entire tree and every commit that has ever been made? That’s important because a committer who wasn’t using the proper hooks will run into trouble if the server-side git has been enforcing prohibitive hooks the entire time (and remember, some functionality appears impossible with add/commit hooks on the committer side).
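For what it’s worth, a server-side update hook is handed exactly the range of commits a push brings in, so a policy check can at least limit itself to that range rather than all of history. A sketch of a commit-message check (the BUG-number convention is an illustrative stand-in for a real policy):

```sh
#!/bin/sh
# hooks/update in the bare repository: called once per pushed ref with
# the ref name, the old tip, and the new tip.
refname=$1
oldrev=$2
newrev=$3

# Only walk the commits new to this push. Simplified: assumes the branch
# already exists; a brand-new branch arrives with an all-zero $oldrev and
# needs extra handling.
for commit in $(git rev-list "$oldrev..$newrev"); do
    if ! git log -1 --format=%B "$commit" | grep -qE 'BUG-[0-9]+'; then
        echo "rejected: commit $commit has no bug number in its message" >&2
        exit 1
    fi
done
exit 0
```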

For the sake of argument, assume it was a requirement that a typical cvs model be followed in git and that you had to do some policy enforcement on the server side: how can you maintain and enhance server-side hooks and ensure that every user gets those hooks on the committer side, too? (git doesn’t copy hooks when a repository is cloned, so there’s no built-in way to distribute them.) Some have said the committer side doesn’t need them, that they can just fix things at the time of push/pull. But I’ve run into prohibitive server-side hooks that require the committer side to recurse through every commit and fix the commit message.

These are my hooks so you may say I’m the author of my own trouble but, remember, sometimes in a company you have to have these policies enforced centrally. And, besides, these hooks save us time, trouble and sanity.

And so, what I found natural and easy to accomplish in cvs, I find difficult in git, and some things, so far, impossible. Maybe it’s because git is opaque compared to cvs. It just seems difficult to accomplish these tasks given git’s design and model.

What can I do?

git – the fast version control system


Update 2013-10-05: See my other post on git for my latest thoughts on git. I’m now an enthusiastic fan of git. :)

At work, we’ve been investigating revision control systems to replace our current system, CVS.

Primarily, we’re investigating git (wikipedia article) as it has basically become the de facto distributed version control system. And we definitely see the advantage of being distributed. While we’re focusing on git, I have been reading up on lots and lots of other systems: dcvs, svn, aegis, bazaar, mercurial, and many, many more. I’ve been compiling an internal wiki document with many resources for each so we can better learn and compare them.

The odd thing with me is that I was late to the rcs game, starting to use cvs only around 2004 or so. I started on cvs and have continued to use it until now. I know its ins and outs. I’m comfortable with its plain-text repository files and their contents. I’ve written scripts to parse them and present structured data back to the user. For example: which files and revisions have we marked with a specific bug in the comments? A great way to get a changeset for one bug, if you’re disciplined enough to “tag” comments with the right bug number. Again, I’m comfortable with tagging, branching and all the discipline, structure and branch modeling needed for the proper branching and merging that will necessarily occur in significant projects.
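That bug-to-changeset lookup is the kind of thing a short script can do. Something along these lines (a reconstruction with an illustrative “BUG-n” comment convention, not the actual script):

```sh
#!/bin/sh
# List file/revision pairs whose cvs log comments mention a bug number.
# Usage: ./bug-changeset.sh 1234   (assumes comments are tagged "BUG-1234")
BUG=$1

cvs log 2>/dev/null | awk -v bug="BUG-$BUG" '
    /^RCS file: /  { file = $3 }            # the ,v file being reported
    /^revision /   { rev  = $2 }            # the revision whose log follows
    index($0, bug) { print file, rev }      # comment line mentions the bug
'
```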

So, I see the dotted version numbers cvs uses for revision specifiers and they make sense. Everybody versions their code releases with dotted version numbers (1.0, 1.1, 2.0, 3.0, etc.), whether they use revision control or not, so it makes sense to version files that way. But it seems modern systems don’t like that; they use hashes to represent file versions.

I also see the plain way in which the files and directories of my checkout are recorded as same-named files and directories in the repository. It makes sense. There are some limitations, like symbolic links, but they either don’t come up that often or they can be dealt with in other ways (we use project-initialization shell scripts that set up the environment complete with symlinks for perl package namespaces, apache config symlinks, etc.). But, again, modern systems use custom structures which allow them to model plain files and plain directories in their own format and export them as needed.
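For illustration, such an initialization script can be just a few lines (the paths here are hypothetical):

```sh
#!/bin/sh
# Recreate the symlinks cvs can't version, after a fresh checkout.
# All paths here are hypothetical.
ln -sf "$PWD/lib/MyCompany" /usr/local/lib/site_perl/MyCompany
ln -sf "$PWD/conf/myproject-apache.conf" /etc/apache2/conf.d/myproject.conf
```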

I can totally understand where the mindset for cvs came from back in the 1980s. Its design and architecture were probably significantly, if not directly, inspired by the Unix Philosophy. The Unix Philosophy, by the way, is a brilliant thing of utter common sense and simplicity. It is the only philosophy in program design that has produced real interoperability: the command line.

Then I look at all of the modern, distributed revision control systems and they completely eschew the idea of plain files and plain directories. They also mostly do away with file version numbers. I understand there are advantages, namely performance and the ability to model more than simple files and directories. The possibility of more easily versioned directories is there as well. There are definitely advantages, but I’m still on the fence as to whether they’re worth it.

As I said, I grew up on cvs and, as necessity usually forces one to do, I invented means to get around cvs’s shortcomings and now have an ecosystem that I am comfortable with, have a good knowledge of, and have the confidence that I can extend it to do what I need when I need it without a lot of hassle.

Modern systems abstract out concrete file system concepts such as files and directories and simply model them in their own objects. They can then be exported to concrete file-system files and directories at will. I’ve actually always been one for abstracting and modeling things in a generalized way that gives power to the application rather than to external forces, but it does have its disadvantages; things become more opaque and so more difficult to work with.

We have a number of server-side “locks” in place to prevent things that might be “bad” when users commit. For example, a file, directory or branch may be locked, or the commit message may be invalid, or the file may contain dos line endings, which we don’t want, or we’d like to enforce a certain policy regarding the content of the file. This is all very easy because, during the commit, these hooks have access to the plain files. I tried to figure out how to do the file-access hooks in git and didn’t come away with any real answers – yet.
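One avenue that does seem workable: git can stream a file’s content straight out of the object database, so a server-side hook can inspect pushed files without any working tree. A sketch of a dos-line-endings check (simplified to the tip commit; paths with whitespace and deleted files would need extra handling):

```sh
#!/bin/sh
# Inside hooks/update: reject pushes whose tip commit touches files with
# dos (CRLF) line endings. Simplified: checks only the tip commit.
newrev=$3
CR=$(printf '\r')

# Files touched by the tip commit:
for path in $(git diff-tree --no-commit-id --name-only -r "$newrev"); do
    # Stream the blob straight from the object database (no checkout).
    # A deleted path just makes `git show` complain on stderr.
    if git show "$newrev:$path" | grep -q "$CR"; then
        echo "rejected: $path contains dos line endings" >&2
        exit 1
    fi
done
exit 0
```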

I guess my concern is that perhaps modern revision control systems have abandoned some good principles of the Unix Philosophy and, as those with experience know, those principles came with a lot of pain. Discard those principles and you’ll probably learn them all over again with a lot of pain. To paraphrase a common unix adage: those who forget their unix history are doomed to repeat it, badly.

Or perhaps we understand version control well enough now that it’s no longer exciting and we don’t want to implement it plainly again, so we’re taking the next step? I find this at work in my own code. At times I’m excited by a new technique or implementation and I hone it and hone it until, at some point, I get bored of it and no longer want to know it inside and out. So I might use a perl cpan module, and now it’s a black box and I don’t have to worry about reinventing the wheel.

Whatever the case may be, if hiding file and directory details behind custom, binary data structures, and using meaningless hashes for version identifiers, is the way of the future, perhaps I’ll just need to try a little harder to wrap my head around these new systems and the new ideas they’re bringing to the table.

I’ll mention just one other thing that I think is important. RCS can do everything that modern systems can do; it just needs glue scripts. CVS can do everything that modern systems can do; it just needs glue scripts. Git, being a modern system, does everything it does without the need for glue scripts.

What I’m saying is, I get the advantages of cvs’s plain files and plain directories and dotted version numbers, and I can get the advantages of changesets and a distributed nature if I put my own solutions for these things in place – they’re not in by default. Some, if not all, of this can be relatively automated. The point is, I almost feel I can get that well-known 80/20 win, where the biggest part of the win comes from the first, relatively small amount of effort. Perhaps it would be a nightmare of details to make cvs distributed, but here’s the important thing to remember:

Give me something that follows the Unix Philosophy and I’ll likely be able to extend it to do pretty much whatever I need.

Get a little too complex with your data formats and abstract models and you’ll start suffocating innovation, because the result is difficult to work with.