Saturday, 26 May 2012

Managing sites with automatic upload: mmpupload

An archive site obviously needs to be maintained. There are various aspects to that: checking that it is still up, backing up the data, and introducing new data. Towards solving the backup problem I have been writing some scripts for the Harpur archive that will automatically upload the entire site. My eventual idea is to manage several sites, each consisting of a folder of files, uploading and downloading them (for backup) so that it is easy to change the content without having to go through tedious GUI-based editing tools such as form-based file import. It is this last problem that I have been trying to automate.

The situation with Harpur is common to many archives: a set of works collected from various physical texts. Each work is a folder containing files to be merged. In my case they are all XML or text files, and the import service in hritserver can handle both formats, even within the one work. There are also plain files, such as anthologies and biographies, which only exist in one version. I thought it would be easy to script the actual upload using a tool like curl. However, curl presents problems in emulating a MIME multipart upload such as the one generated by an HTML form. I wanted to emulate that precisely, because I also wanted the user to be able to use a form on an 'import' tab to upload individual works. Such added works would then get backed up and eventually join the main upload/download cycle. Also, curl is a complex tool that supports many features that get in the way of simple uploads. It is sensitive to spaces and brackets in filename paths, and I wanted my folders to have natural names like 'GENIUS LOST. THE SORROWS PART EIGHT WHITHER' and 'FRAGMENT (1)'. You have no idea how hard it was to write a script that builds a set of path names like that. After trying for a week I decided to write my own upload tool, which I called mmpupload.
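To give an idea of the problem, here is a rough sketch of the curl equivalent of such a form upload, using the same endpoint and files as the mmpupload example below (the 'uploadedfile' field name is just a guess at what the form would call it):

# every path containing spaces or parentheses has to be quoted carefully,
# and characters like ';' inside an -F value can be misread as field options
curl -F "AUTHOR=harpur" \
     -F "WORK=Eden Lost" \
     -F "uploadedfile=@EDEN LOST/C376.xml" \
     -F "uploadedfile=@EDEN LOST/A88.xml" \
     http://localhost/upload.php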

It is a simple tool that takes a list of file names or paths and a set of name-value pairs, wraps them up into a MIME multipart document, sends it to a named service, and reports the response to the console, along with whether it worked or not. With Harpur I already have 259 folders and there are rumoured to be around 700, so this tool is a key component of the automated archive solution. Each archive has to run on a virtual machine that supports ssh and Java, so that I can run hritserver. rsync can then be used in combination with ssh to update the site on a regular basis. I already have three planned - the Harpur archive, Tagore and Digital Variants - and possibly a Conrad archive as well. If I can manage all these - and it's just a question of managing the scale - I can add others in time without going mad. That's the beauty of a software solution that is mostly shared by different projects and entirely automatable.
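As a sketch of what that cycle might look like (the host name and paths here are invented), something like this run on a schedule would keep a site in step with the local folder:

# mirror the local Harpur folder to the archive VM over ssh;
# --delete makes the remote copy an exact mirror of the local one
rsync -avz --delete -e ssh harpur/ user@archive.example.org:/srv/harpur/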

Here's an example of mmpupload:

mmpupload -f AUTHOR=harpur -f "WORK=Eden Lost" -u http://localhost/upload.php "EDEN LOST/C376.xml" "EDEN LOST/A88.xml"

Simple, huh? There's even a man page, and it builds correctly with make on OSX and Linux.

Friday, 18 May 2012

couchdb: delete document if it exists

I am uploading some files via a script to a couchdb database. In many cases the files are already there, but out of date, so I want to replace them. However, couch complains that there is an 'update conflict', and even an explicit attempt to delete the resource fails without its revision id. So I followed the instructions: test whether the document is present, extract the rev-id and then compose a delete operation. In bash, using curl to do the HTTP, it looks like this:

erase_if_exists()
{
    # $1 is the document URL; --head fetches only the HTTP headers
    RESULT=$(curl -s --head "$1")
    # characters 9-11 of "HTTP/1.1 200 OK" are the status code
    if [ "${RESULT:9:3}" == "200" ]; then
        # the rev-id is the only double-quoted value in the headers (the Etag)
        REVID=$(expr "$RESULT" : '.*\"\(.*\)\"')
        curl -X DELETE "$1?rev=$REVID"
    fi
}
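
A call then looks something like this (the port, database name and document id are only an illustration, assuming CouchDB is running locally on its default port):

erase_if_exists "http://localhost:5984/harpur/C376"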

Why HEAD? Because HEAD retrieves only the HTTP document headers, not the document itself, which could be large. Curl can send a HEAD request using curl -X HEAD, but it hangs on reading the response, so the formulation above is an alternative that works. The -s tells it not to print any download progress, and in this function $1 is the URL. The HTTP response code is extracted from the RESULT string by indexing into it; if it is 200, the document is present. Here's a sample response:

HTTP/1.1 200 OK
Server: CouchDB/1.1.1 (Erlang OTP/R15B)
Etag: "3-04bacb883737ffed82700eacdf4e74f2"
Date: Fri, 18 May 2012 21:17:04 GMT
Content-Type: text/plain;charset=utf-8
Content-Length: 8077
Cache-Control: must-revalidate

The REVID gets extracted from the Etag header via a regular expression; in the example it is 3-04bacb883737ffed82700eacdf4e74f2. This is then passed as a URL parameter in another call to curl. You also have to embed the username and password into the URL for this to work, or use the -u option to supply them.
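
For example, either of these would work (the credentials, database name and document id here are made up; the rev-id is the one from the example above):

# credentials supplied with -u
curl -u admin:secret -X DELETE "http://localhost:5984/harpur/C376?rev=3-04bacb883737ffed82700eacdf4e74f2"
# or embedded in the URL itself
curl -X DELETE "http://admin:secret@localhost:5984/harpur/C376?rev=3-04bacb883737ffed82700eacdf4e74f2"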