Saturday, 26 May 2012

Managing sites with automatic upload: mmpupload

An archive site obviously needs to be maintained. There are various aspects to that, such as checking that it is still up, backing up the data and introducing new data. Towards solving the backup problem I have been writing some scripts for the Harpur archive that will automatically upload the entire site. My eventual idea is to manage several sites, each consisting of a folder of files, and to upload and download them (for backup), so that it will be easy to change the content without having to go through tedious GUI-based editing tools such as form-based file import. It is this last problem that I have been trying to automate.

The situation with Harpur is common to many archives: a set of works collected from various physical texts. Each work is a folder containing files to be merged. In my case they are all XML or text files, and the import service in hritserver can handle both formats, even within the one work. There are also plain files, such as anthologies and biographies, which exist in only one version.

I thought it would be easy to script the actual upload using a tool like curl. However, I ran into problems getting curl to emulate a MIME multipart upload (multipart/form-data) such as that generated by an HTML form. I wanted to emulate that precisely, because I also wanted the user to be able to use a form on an 'import' tab to upload individual works. Such added works would then get backed up and eventually join the main upload/download cycle. Also, curl is a complex tool that supports many features that get in the way of simple uploads. It is sensitive to spaces and brackets in file paths, and I wanted my folders to have natural names like 'GENIUS LOST. THE SORROWS PART EIGHT WHITHER' and 'FRAGMENT (1)'. You have no idea how hard it was to write a script that builds a set of path names like that. After trying for a week I decided to write my own upload tool, which I called mmpupload.

It is a simple tool that takes a list of file names or paths and a set of name-value pairs, wraps them up into a MIME multipart document, sends it to a named service, and prints the response to the console along with whether the upload succeeded. With Harpur I already have 259 folders and there are rumoured to be around 700, so this tool is a key component of the automated archive solution. Each archive has to run on a virtual machine that supports ssh and Java, so that I can run hritserver. rsync can then be used in combination with ssh to update the site on a regular basis. I already have three archives planned, and possibly a fourth: the Harpur archive, Tagore and Digital Variants, plus possibly a Conrad archive. If I can manage all these (and it's just a question of managing the scale) I can add others in time without going mad. That's the beauty of a software solution that is mostly shared by different projects and entirely automatable.
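To give an idea of what "wrapping up" files and name-value pairs into a multipart document involves, here is a minimal Python sketch. It is not the mmpupload source, just an illustration of the same general technique under some assumptions: the function names build_multipart and upload are made up for this example, and so is the form field name "uploadedfile[]" (the real import service may expect something else).

```python
import urllib.request
import uuid
from pathlib import Path

# Hypothetical form field name for uploaded files; the real service may differ.
FILE_FIELD = "uploadedfile[]"


def build_multipart(fields, files, boundary=None):
    """Return (content_type, body) for a multipart/form-data request.

    fields: dict of name -> value pairs, e.g. {"AUTHOR": "harpur"}
    files:  list of file paths; spaces and brackets in names are fine
            because each name is placed inside a quoted header value
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    parts = []
    for name, value in fields.items():
        parts.append(delimiter)
        parts.append(f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"))
        parts.append(b"")  # blank line separates headers from content
        parts.append(str(value).encode("utf-8"))
    for path in files:
        p = Path(path)
        parts.append(delimiter)
        parts.append(
            f'Content-Disposition: form-data; name="{FILE_FIELD}"; '
            f'filename="{p.name}"'.encode("utf-8")
        )
        parts.append(b"Content-Type: application/octet-stream")
        parts.append(b"")
        parts.append(p.read_bytes())
    parts.append(delimiter + b"--")  # closing delimiter
    parts.append(b"")                # so the body ends with CRLF
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(parts)


def upload(url, fields, files):
    """POST the multipart document and return (status, response text).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    content_type, body = build_multipart(fields, files)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8", errors="replace")
```

Building the body by hand keeps the sketch free of third-party dependencies. A file name containing a double quote would need escaping, which this sketch does not attempt.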

Here's an example of mmpupload:

mmpupload -f AUTHOR=harpur  -f "WORK=Eden Lost" -u http://localhost/upload.php "EDEN LOST/C376.xml" "EDEN LOST/A88.xml"

Simple, huh? There's even a man page, and it builds correctly with make on OS X and Linux.
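To run that command over every work folder, one option is a small driver script. The Python sketch below is only an illustration and rests on several assumptions: that the folder name can double as the WORK value, that every .xml or .txt file directly inside a folder is a version to upload, and that mmpupload exits with status 0 on success. The function names build_command and upload_all are invented for the example. Only the -f and -u options shown above are used.

```python
import subprocess
from pathlib import Path

# Assumption: only these file types are versions to be merged.
VERSION_SUFFIXES = {".xml", ".txt"}


def build_command(folder, author, url):
    """Build the mmpupload argument list for one work folder."""
    folder = Path(folder)
    # Assumption: the folder name is used as the WORK title.
    args = ["mmpupload", "-f", f"AUTHOR={author}",
            "-f", f"WORK={folder.name}", "-u", url]
    versions = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in VERSION_SUFFIXES
    )
    args.extend(str(p) for p in versions)
    return args


def upload_all(root, author, url):
    """Run mmpupload once per sub-folder of root and report each result."""
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        # Passing a list (and no shell) means spaces and brackets in
        # folder names reach mmpupload untouched, with no quoting needed.
        result = subprocess.run(build_command(folder, author, url))
        # Assumption: a zero exit status means the upload worked.
        print(folder.name, "ok" if result.returncode == 0 else "FAILED")
```

Because the arguments never pass through a shell, names like 'FRAGMENT (1)' need no escaping at all, which sidesteps the path-building trouble described earlier.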
