Tuesday 25 December 2012

Last piece in the import puzzle

People wanting to import XML into the HRIT system probably have XSLT scripts that transform the files into some other form and then format them as HTML. Perhaps the two steps are not even separated. If they try to use the current HRIT import system then TEI constructs like the following (taken from the TEI-Lite manual) will fail:

<list>
 <head>A short list</head>
 <label>1</label>
 <item>First item in list.</item>
 <label>2</label>
 <item>Second item in list.</item>
 <label>3</label>
 <item>Third item in list.</item>
</list>

The reason is that the <head> element is inside the <list> element. If we translate the elements of XML one-for-one into elements of HTML we will have to delete the <head> element, because none of the <h1>, <h2>, <h3> elements in HTML can appear inside <ul> or <ol>. But what we really want to do is move it outside <list> and give it an attribute like type="list". But that's manipulation of the XML DOM, which neither stripper nor formatter (my two import tools) currently performs.
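So, for example, one sane rearrangement of the XML above, before any conversion into standoff properties, might be (the type attribute value is only a suggestion):

<head type="list">A short list</head>
<list>
 <label>1</label>
 <item>First item in list.</item>
 <label>2</label>
 <item>Second item in list.</item>
 <label>3</label>
 <item>Third item in list.</item>
</list>

Now <head> can map onto a heading element and <list> onto <ol> without any clash.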

Brainwave

Whatever new facilities I add to stripper to allow such transforms, you can bet that someone will need something from XSLT that isn't supported. My stripper program would just keep on getting more and more complex, and I would waste more and more time. So I had the simple idea not to modify stripper or formatter at all, but to add a further step to the import process: allow XSLT transforms to be applied to the imported XML files as the first step of importation, using a tool like Xalan. By default a TEI-Lite stylesheet could perform the necessary transforms to turn the XML into sane XML for easy conversion into standoff properties. Not only is this a trivial change to implement, it is also extremely powerful. Although existing stylesheets may have to be modified for this to work, no one can any longer claim that the HRIT system loses functionality compared with existing XML-based digital editions. A neat result, indeed.

It seems that the only XSLT processor that works on MacOSX and Linux any more is libxslt. XML may not be dead yet but its tools at least are dying. A sign of the times?
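If you go that route, libxslt's command-line tool xsltproc is all the preprocessing step needs. A sketch (the stylesheet and file names are invented for the example):

xsltproc -o sane.xml tei-lite-cleanup.xsl original.xml

That writes the transformed document to sane.xml, ready for stripper and formatter.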

D'Oh!

I forgot that Java has an XSLT processor built in, so after getting libxslt to work via JNI I had to scrap it and redo it more simply. Which just goes to show that having a coffee and a walk in the garden before you code something is often time well spent, even though it looks and feels like you're loafing.

Sunday 23 December 2012

Running couchdb as a daemon on OSX

When you install couchdb on OSX using homebrew it doesn't get set up to run continuously. If you launch it with sudo couchdb -b, the process will be killed whenever your Mac goes to sleep, even if you used nohup. To make it run until you shut down your computer, and relaunch whenever you start up, you have to run it as a "daemon". Here's how I did it.

  1. Find the launch daemon file "org.apache.couchdb.plist", located in /usr/local/Cellar/couchdb/1.2.0/Library/LaunchDaemons or some similar place.
  2. Now edit the file. I used:
    sudo nano org.apache.couchdb.plist
    Change the contents of the XML <string> element immediately following the UserName key from "couchdb" to "root" (see the fragment after this list).
  3. Now copy the file to the right place:
    /Library/LaunchDaemons
  4. Finally use the launchctl command to load it:
    sudo launchctl load /Library/LaunchDaemons/org.apache.couchdb.plist
    (You can also unload it using the same command but with "unload" instead of "load".) I've modified the hritserver-start.sh script so it no longer kills and relaunches couchdb. If you are using hritserver make sure you have an up-to-date copy of that.
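For reference, the edited part of the plist should end up looking something like this (the surrounding keys in your copy may differ):

<key>UserName</key>
<string>root</string>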

Friday 7 December 2012

Installation of hritserver

I thought I would post some instructions for installing hritserver, since we don't yet have an installer. I should probably write one.

Installation on MacOSX

1. Homebrew

Homebrew can be fetched from http://mxcl.github.com/homebrew/. Follow the instructions there.

2. gcc

To build the C libraries used by hritserver you will first have to install gcc:

brew tap homebrew/dupes
brew install apple-gcc42

3. Install couchdb

The command is

brew install couchdb

4. Set up databases

You first need to download the hritserver package, available from www.github.com/HRIT-Infrastructure. Click on the hritserver repository, then the Downloads tab. Select "hritserver-0.1.3.tar.gz" and download it (don't click on "Download as .tar.gz" or .zip). Once it is on your hard disk it may unpack automatically; otherwise double-click it or unpack it on the command line (OSX removes the .gz automatically):

tar xvf hritserver-0.1.3.tar

Now move the folder to a convenient location:

cd ~/Downloads
mv hritserver-0.1.3 ~
cd ~/hritserver-0.1.3

First set up the database for admin access. In the hritserver-0.1.3 folder are two scripts, add-user.sh and couchdb.sh. Run this command first:

sudo ./couchdb.sh

After running this you have to press return to get the prompt back. Check that couchdb is running:

ps aux | grep couchdb

You should get two process entries: a one-line one (that's the command you just ran) and a longer one about six lines long. The long one is couch running.

Now run:

./add-user.sh

It should respond with "-hashed-9222..." and a lot of hex numbers.

Now test that couch is running by typing the url in a browser: http://localhost:5984/_utils

There should be two entries in red: _replicator and _users.

Now install the test databases:

cd backup
./upload-all.sh

The script asks for a password. Type in jabberw0cky (with a zero). This is a master script that calls all the other upload-*.sh files (so you can also upload them individually).

5. Run the installer

Finally, run the installer and start hritserver from the command line:

cd ..
sudo ./install.sh
sudo ./hritserver-start.sh

Hit return. The service will run even when you log off. To stop the service, log in as the same user who launched it initially and type:

sudo ./hritserver-stop.sh

Access the service on http://localhost:8080/tests/ (the trailing slash is significant). To make it visible on port 80, add the following lines to the end of /etc/apache2/httpd.conf and restart apache.

On OSX:

ProxyPass /tests/ http://localhost:8080/tests/ retry=0
ProxyPass /corpix/ http://localhost:8080/corpix/ retry=0
ProxyPass /import/ http://localhost:8080/import/ retry=0
ProxyPass /upload/ http://localhost:8080/upload/ retry=0
Now restart apache: sudo apachectl restart

Ubuntu

Add the above lines to /etc/apache2/mods-available/proxy.conf.

Restart apache2: sudo service apache2 restart

Adding other texts

The mmpupload tool can be used to upload files in XML or plain text in batches. See www.github.com/HRIT-Infrastructure.

Monday 3 September 2012

Consolidation of code on Github

I've put all my free software offerings on Github. This service has the advantage of unlimited repositories and collaborators. So I created an "organization" called Hrit-Infrastructure into which I put all the general software components. Outside the organization, under my own account, I'm putting more specific tools that I use for ingesting external formats etc. HritServer needs a better installer but I've put a basic set of instructions onto the wiki.

At the moment Hritserver uses couchdb, but my colleagues are porting it to Mongodb for me, which is a lot faster. We want speed, because this is where digital humanities software often falls down. I notice that a lot of projects, new and old, use interpreted languages like ruby, php and python. Sure they're cool, but they are slooooow. One benchmark I saw tested ruby and found it to be at least 200 times slower than C and Java, and php 500 times slower. That's why I'm happy that hritserver is written in a composite of those two languages. We will do a lot more than the competition and we will do it blindingly fast.

Sunday 19 August 2012

Is Node.js the answer to all our problems?

I had a look at Node.js today and yesterday, and I can now say that I understand what it is. Apart from the Javascript-like syntax and its support for JSON, Node.js uses a totally new set of objects written in C. The original Javascript was a client technology and it doesn't do servlets, sockets or I/O. Even with the V8 engine Node.js is a lot slower than Java.

Ryan Dahl claims that Nginx is faster than Apache and uses fewer resources because it doesn't use threads and instead serves requests asynchronously. So it makes sense to try to mimic that performance in our web applications and their servers. It also sounds great that at both the server and the client end we use Javascript. However, I have a number of concerns:

  1. I don't understand how using just one thread can possibly be a virtue. Hardware is very much designed for multitasking these days. Software that doesn't take advantage of that has little chance of being competitive.
  2. Is Node.js really faster than Apache or Java? The benchmarks I've seen so far are mostly against Node.js.
  3. The end to end Javascript idea sounds cool until you wonder if what was designed as a lightweight, typeless language for the client is really suited to the more demanding programming tasks on the server.
  4. What they don't talk about one iota is security. Long polling sounds like a cool idea, but it's an open invitation to denial of service. Ruby was cool too, but when we attacked the simple ruby web server with a trivial XML denial of service it crippled the whole machine.
  5. It may be good for massive web services that have to deal with lots of small requests. For the rest of us, who don't need to service thousands of requests per second or who want to do more involved things on the server, Node.js offers no advantage.
  6. Clarity, reliability and reusability should be the goals of every programmer. But a style that uses callbacks and closures instead of standard object oriented techniques sounds like a recipe for confusion and endless bugs to me. None of these techniques are new, and they haven't caught on for a reason.
  7. If Node programmers have to write their own web-servers in 10 lines, how can that rival the configurability, flexibility and security of a professional web server that is tens of thousands of lines long? I prefer to write an independent server application and then choose to run it on one web-server or another, not hard-wire the server into the code. If you want the benefits of asynchronous I/O then why not just run your existing app on Nginx and forget about Node.js?

Saturday 26 May 2012

Managing sites with automatic upload: mmpupload

An archive site obviously needs to be maintained. There are various aspects to that, such as checking that it is still up, backing up the data and introducing new data. Towards solving the backup problem I have been writing some scripts for the Harpur archive that will automatically upload the entire site. My idea is eventually to manage several sites, each consisting of a folder of files, with uploading and downloading (for backup), so that it will be easy to change the content without going through tedious GUI-based editing tools such as form-based file import. It is this last problem that I have been trying to automate.

The situation with Harpur is common to many archives: a set of works collected from various physical texts. Each work is a folder containing files to be merged. In my case they are all XML or text files, and the import service in hritserver can handle both formats, even within the one work. There are also plain files, such as anthologies and biographies, which exist in only one version. I thought it would be easy to script the actual upload using a tool like curl. However, curl presents problems in emulating the mime/multipart upload generated by an HTML form. I wanted to emulate that precisely because I wanted the user also to be able to use a form on an 'import' tab to upload individual works. Such added works would then get backed up and eventually join the main upload/download cycle. Also, curl is a complex tool that supports many features that get in the way of simple uploads. It is sensitive to spaces and brackets in filename paths, and I wanted my folders to have natural names like 'GENIUS LOST. THE SORROWS PART EIGHT WHITHER' and 'FRAGMENT (1)'. You have no idea how hard it was to write a script that builds a set of path names like that. After trying for a week I decided to write my own upload tool, which I called mmpupload.

It is a simple tool that takes a list of file names or paths and a set of name-value pairs, wraps them up into a mime multipart document, sends it to a named service, and reports the response to the console, along with whether it worked or not. With Harpur I already have 259 folders, and there are rumoured to be around 700, so this tool is a key component of the automated archive solution. Each archive has to be a virtual machine that supports ssh and Java, so I can run hritserver. rsync can then be used in combination with ssh to update the site on a regular basis. I already have three planned: the Harpur archive, Tagore and Digital Variants, with possibly a Conrad archive as well. If I can manage all these - and it's just a question of managing the scale - I can add others in time without going mad. That's the beauty of a software solution that is mostly shared by different projects and entirely automatable.

Here's an example of mmpupload:

mmpupload -f AUTHOR=harpur  -f "WORK=Eden Lost" -u http://localhost/upload.php "EDEN LOST/C376.xml" "EDEN LOST/A88.xml"

Simple, huh? There's even a man page, and it builds correctly with make on OSX and Linux.
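For the curious, the document a command like that wraps the fields and files in is ordinary mime multipart. Roughly, and with an invented boundary and field names (mmpupload's actual output will differ in detail), the request looks like this on the wire:

POST /upload.php HTTP/1.1
Host: localhost
Content-Type: multipart/form-data; boundary=----hrit1234

------hrit1234
Content-Disposition: form-data; name="AUTHOR"

harpur
------hrit1234
Content-Disposition: form-data; name="WORK"

Eden Lost
------hrit1234
Content-Disposition: form-data; name="file1"; filename="C376.xml"
Content-Type: text/xml

...contents of EDEN LOST/C376.xml...
------hrit1234--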

Friday 18 May 2012

couchdb: delete document if it exists

I am uploading some files via a script to a couchdb database. In many cases the files are already there, but out of date, so I want to replace them. However, couch complains that there is an 'update conflict' whenever I try to delete a resource without supplying its revision id. So I followed the instructions: test that the document is present, extract the rev-id, and then compose a delete operation. In bash, using curl as the HTTP client, it looks like this:

erase_if_exists()
{
    # $1 is the document URL; HEAD fetches just the headers
    RESULT=`curl -s --head "$1"`
    # characters 9-11 of "HTTP/1.1 200 OK" are the status code
    if [ "${RESULT:9:3}" == "200" ]; then
      # the Etag header holds the current revision id in quotes
      REVID=$(expr "$RESULT" : '.*\"\(.*\)\"')
      curl -X DELETE "$1?rev=$REVID"
    fi
}

Why HEAD? Because HEAD retrieves only the HTTP document headers but not the document, which could be large. Curl can send a HEAD request using curl -X HEAD but it hangs on reading the response, so the above formulation is an alternative that works. The -s tells it not to print out any download progress, and in this function $1 is the URL. The HTTP response code is extracted from the RESULT string using indexing. If it is 200, the document is present. Here's a sample response:

HTTP/1.1 200 OK
Server: CouchDB/1.1.1 (Erlang OTP/R15B)
Etag: "3-04bacb883737ffed82700eacdf4e74f2"
Date: Fri, 18 May 2012 21:17:04 GMT
Content-Type: text/plain;charset=utf-8
Content-Length: 8077
Cache-Control: must-revalidate

The REVID gets extracted from the Etag property via a regular expression, and in the example is set to 3-04bacb883737ffed82700eacdf4e74f2. This is passed in another call to curl as a URL parameter. You also have to embed the username and password into the URL for this to work, or you can use the -u option to specify a username.
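For example, with the -u option (the username and the database and document names here are placeholders):

curl -u admin -X DELETE "http://localhost:5984/mydb/mydoc?rev=$REVID"

curl then prompts for the password rather than exposing it in the URL.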

Friday 30 March 2012

Bare bones multi-file upload

What I need for HritServer is an import facility that can handle multiple file uploads. I've found many fancy scripts for doing this that are way too complex and too specific. Sometimes all you need is a bare bones solution you can tailor to suit your needs rather than a fully fledged product you have to spend ages understanding and cutting down to size. Also I hate building in dependencies that only increase the tendency for code to break. So here's my simple contribution to the HTML multifile upload problem.

The basic idea is to have one <input type="file"> for every file you want to upload, but to hide all but the current empty one. So when the user selects the only visible file input element it sets itself to the chosen file, adds itself to an invisible list of file input elements, and creates a fresh one for the next time. To make it easier to see what's already been selected I maintain a secondary list of paths in a table. Next to each entry is a remove button that lets you take out individual entries. That's it. Here is the php code to handle the upload on the server side. Replace this with something else if you like, such as Apache file upload in Java:

And here is the HTML, with self-contained javascript.

Call the first "upload.php" and the second "upload.html", then put them both in the root document directory of your web-server. If you have php installed, navigate to /upload.html and take it from there. You can add styling and change the server script easily, because it is simple.

Wednesday 7 March 2012

Uploading a directory of images to couchdb

I wanted to have a general script to upload a set of images to couchdb. Couch can't store images as documents, because it uses JSON for those. But you can still create a JSON document and attach a set of images to it as attachments. But here's the catch: you have to specify the document's revision id, and that changes after every image you add.

But I had two further problems. My directory of images contained sub-directories, and I also wanted to access the images directly using a simple URL. So if I had a directory structure like:

    list
        file1.png
        one
            file2.png
            file3.png
        two
            file4.png

I would want the relative URLs to be: /list/file1.png and /list/one/file2.png etc. The trick in writing the script is to extract the revid from the server response. I used awk for that, then used the returned value to upload the next entry in that directory. The second trick is to use %2F not / as a directory separator when creating the docids. Couch doesn't allow nesting in the database structure but you can simulate it by creating documents called:

list
list%2Fone
list%2Ftwo

The first posting to each of those documents doesn't need a revid, but subsequent ones do. That's just a feature of couch. So here's the script. It's a bit sloppy, because if couch responds with an error to any upload it will fall over. That bit of error-handling is currently left as an exercise for the reader. I'll put it in later and may update the post then. To use the script put it into a file called upload.sh and then invoke it thus: ./upload.sh images, where "images" is the master image directory.
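A sketch of such a script follows; the couch URL, the credentials and the PNG content type are assumptions to adapt, and, as promised, there is no error handling:

#!/bin/bash
# upload.sh: put every file under the master directory into couchdb as an
# attachment, using %2F in the docid to simulate nested directories.
DB=http://admin:password@localhost:5984/images   # assumed database URL
ROOT="$1"                                        # the master image directory

find "$ROOT" -type f | sort | while read -r FILE; do
    REL=${FILE#"$ROOT"/}            # path relative to the master directory
    DIR=$(dirname "$REL")
    NAME=$(basename "$REL")
    # the docid is the master directory name plus %2F-separated
    # sub-directories, e.g. "images" or "images%2Fone"
    if [ "$DIR" = "." ]; then
        DOCID="$ROOT"
    else
        DOCID="$ROOT%2F${DIR//\//%2F}"
    fi
    URL="$DB/$DOCID/$NAME"
    # the first upload to a document needs no revid; later uploads to the
    # same document must pass the revid returned by the previous one
    if [ "$DOCID" = "$LASTDOC" ] && [ -n "$REV" ]; then
        URL="$URL?rev=$REV"
    fi
    RESPONSE=$(curl -s -X PUT "$URL" -H "Content-Type: image/png" \
        --data-binary @"$FILE")
    # extract the new revid from couch's JSON reply with awk
    REV=$(echo "$RESPONSE" | awk -F'"' '{for(i=1;i<NF;i++)if($i=="rev")print $(i+2)}')
    LASTDOC="$DOCID"
done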

Saturday 25 February 2012

The importance of deletions

Standoff properties describe what happens to runs of text, but what about runs of text you don't want, or that need to go into attributes when the text is formatted as HTML? This problem was revealed when I tried to use formatter to generate a dropdown menu from a list of versions spat out by nmerge. What nmerge gave me was the description of the work on one line, then the name of the version-group, followed by short names and long names for each version:

King Lear
Base F1 Version F1
 F2 Version F2
 F3 Version F3
 F4 Version F4
 Q1 Version Q1
 Q2 Version Q2

But I needed to turn this into HTML that looked like this:

<select name="versions" class="list">
<option title="Version F1" class="version">F1</option> 
<option title="Version F2" class="version">F2</option> 
<option title="Version F3" class="version">F3</option> 
<option title="Version F4" class="version">F4</option> 
<option title="Version Q1" class="version">Q1</option> 
<option title="Version Q2" class="version">Q2</option> 
</select>

It's obvious that strings like "Version F1" need to be moved from the text into attributes, and strings like "Base" and "King Lear" need to be deleted outright. No doubt XSLT wizards are laughing at me, pointing out that if only I had embraced XML I would be able to transform XML into exactly this format easily. But choosing a simpler model for representing textual properties doesn't mean any loss of functionality. In fact "less is more". All it needs is for these problems to be thought through and solved.

All you have to do is allow each standoff property to be "removed". I already had this feature, but the length had to be zero: I used it for removing the TEI header from TEI documents, saving the content in an annotation of an empty range, and the text the elements formerly enclosed was removed permanently from the base text. Since I wanted to reuse the same text for other applications I didn't want to remove any text permanently. So I just allowed the length of a "removed" property to be greater than zero. A deleted property now removes the text it affects from the output, along with any ranges that it encloses, as well as parts of any ranges it overlaps with. So in the JSON all you do is define some "empty" ranges that have the "removed" property set to true:

{ "name": "list", "annotations": [ { "name": "versions" } ], "len": 94, "reloff": 10 },{ "name": "empty", "removed": true, "len": 5, "reloff": 0 },{ "name": "top-group", "annotations": [ { "name": "Base" } ], "len": 88, "reloff": 5 },{ "name": "version", "annotations": [ { "description": "Version F1" } ], "len": 2, "reloff": 0 },{ "name": "empty", "removed": true, "len": 10, "reloff": 3 }...

What this machine-generated JSON says is that the string "Base" should be deleted. Long version names like "Version F1" should be moved to attributes called "description" for the short-names. The resulting HTML given above is exactly that output by formatter. So now we can also pick and choose which text gets selected and where it appears.