Tuesday 25 December 2012

Last piece in the import puzzle

People wanting to import XML into the HRIT system probably have XSLT scripts that transform the files into some other form and then format them into HTML. Perhaps the two steps are not even separated. If they try to use the current HRIT import system then TEI constructs like the following (taken from the TEI-Lite manual) will fail:

<list>
 <head>A short list</head>
 <label>1</label>
 <item>First item in list.</item>
 <label>2</label>
 <item>Second item in list.</item>
 <label>3</label>
 <item>Third item in list.</item>
</list>

The reason is that the <head> element is inside the <list> element. If we translate the elements of XML one-for-one into elements in HTML we will have to delete the <head> element, because none of the <h1>, <h2>, <h3> elements in HTML can appear inside <ul> or <ol>. But what we really want to do is move it outside <list> and give it an attribute like type="list". That is manipulation of the XML DOM, which neither stripper nor formatter (my two import tools) currently performs.

Brainwave

Whatever new facilities I add to stripper to allow such transforms, you can bet that someone will need something in XSLT that isn't supported. My stripper program would just keep on getting more and more complex, and I would waste more and more time. So I had the simple idea not to modify stripper or formatter at all, but just to add a further step to the import process. All we have to do is allow XSLT transforms to take place on the imported XML files as a first step in the importation process, through the use of a tool like Xalan. By default a TEI-Lite stylesheet could perform the necessary transforms on the XML to turn it into sane XML for easy conversion into standoff properties. Not only is this a trivial change to implement, it is also extremely powerful. Existing stylesheets may have to be modified for this to work, but nobody can any longer claim that the HRIT system loses functionality compared with existing XML-based digital editions. A neat result, indeed.

It seems that the only XSLT processor that works on MacOSX and Linux any more is libxslt. XML may not be dead yet but its tools at least are dying. A sign of the times?

D'Oh!

I forgot that Java has an XSLT processor built in, so after getting libxslt to work via JNI I had to scrap it and redo it more simply. Which just goes to show that having a coffee and a walk in the garden before you code something is often time well spent, even though it looks and feels like you're loafing.
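Here is a minimal sketch of what such a first step might look like using the XSLT processor built into Java (javax.xml.transform). It is not the actual hritserver code: the class name PreImportTransform, the stylesheet and the wrapping <div> are illustrative assumptions, and it assumes un-namespaced TEI as in the TEI-Lite sample. The stylesheet copies everything unchanged except <list>, whose <head> is moved in front of it and given type="list".

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class PreImportTransform {
    // Hypothetical stylesheet: identity transform plus one rule for <list>
    public static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        // copy every attribute and node unchanged by default
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // for a list: first emit its head outside, then the list without the head
        + "<xsl:template match='list'>"
        + "<xsl:for-each select='head'>"
        + "<head type='list'><xsl:apply-templates select='node()'/></head>"
        + "</xsl:for-each>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()[not(self::head)]'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Shortened version of the TEI-Lite example, wrapped in a <div> so the
    // result still has a single root element after the head is moved
    public static final String SAMPLE =
        "<div><list><head>A short list</head>"
        + "<label>1</label><item>First item in list.</item>"
        + "<label>2</label><item>Second item in list.</item>"
        + "</list></div>";

    // Apply the XSLT text to the XML text and return the result as a string
    public static String transform(String xml, String xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(SAMPLE, STYLESHEET));
    }
}
```

The transformed XML can then be handed to stripper and formatter unchanged, and a user with different needs only has to supply a different stylesheet.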

Sunday 23 December 2012

Running couchdb as a daemon on OSX

When you install couchdb on OSX using homebrew it isn't set up to run continuously. If you launch it using sudo couchdb -b, the process will be killed whenever your Mac goes to sleep, even if you used nohup. To keep it running for as long as the computer is on, and to have it launch at startup, you have to run it as a "daemon". Here's how I did it.

  1. Locate the launch daemon file "org.apache.couchdb.plist" in /usr/local/Cellar/couchdb/1.2.0/Library/LaunchDaemons or some similar place.
  2. Now edit the file. I used:
    sudo nano org.apache.couchdb.plist
    and change the contents of the XML <string> element immediately following <key>UserName</key> to "root" instead of "couchdb".
  3. Now copy the file to the right place, /Library/LaunchDaemons:
    sudo cp org.apache.couchdb.plist /Library/LaunchDaemons/
  4. Finally use the launchctl command to load it:
    sudo launchctl load /Library/LaunchDaemons/org.apache.couchdb.plist
    (You can also unload it using the same command but with "unload" instead of "load".)

I've modified the hritserver-start.sh script so it no longer kills and relaunches couchdb. If you are using hritserver make sure you have an up-to-date copy of that.

Friday 7 December 2012

Installation of hritserver

I thought I would post some instructions for installing hritserver, since we don't yet have an installer. I should probably write one.

Installation on MacOSX

1. Homebrew

Homebrew can be fetched from http://mxcl.github.com/homebrew/. Follow the instructions there.

2. gcc

To build the C libraries used by hritserver you will first have to install gcc:

brew tap homebrew/dupes
brew install apple-gcc42

3. Install couchdb

The command is

brew install couchdb

4. Set up databases

You first need to download the hritserver package, available from www.github.com/HRIT-Infrastructure. Click on the hritserver repository, then the Downloads tab. Select "hritserver-0.1.3.tar.gz" and download it (don't click on "Download as .tar.gz" or .zip). Once it is on your hard disk it will either unpack automatically, or you can unpack it by double-clicking on it, or on the command line from the folder it was downloaded to (OSX removes the .gz automatically):

tar xvf hritserver-0.1.3.tar

Now move the folder to a convenient location:

cd ~/Downloads
mv hritserver-0.1.3 ~
cd ~/hritserver-0.1.3

First set up the database for admin access. In the hritserver-0.1.3 folder are two scripts, add-user.sh and couchdb.sh. Run this command first:

sudo ./couchdb.sh

After running this you have to press return to get back the prompt. Check that couchdb is running:

ps aux | grep couchdb

You should get two process numbers, one 1-line long (that's the command you just ran) and a longer one about 6 lines long. That's couch running.

Now run:

./add-user.sh

It should respond with "-hashed-9222..." and a lot of hex numbers.

Now test that couch is running by typing the url in a browser: http://localhost:5984/_utils

There should be two entries in red: _replicator and _users.
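The same check can be scripted instead of done in a browser. This is only a rough sketch in Java, assuming couchdb answers on the default port 5984 and that its root URL returns a small JSON greeting containing a "couchdb" key (for 1.2.0 something like {"couchdb":"Welcome","version":"1.2.0"}). The class and method names are made up for illustration and are not part of hritserver.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CouchCheck {
    // True if the reply body looks like couchdb's JSON greeting
    public static boolean looksLikeCouch(String body) {
        return body != null && body.contains("\"couchdb\"");
    }

    // Fetch the body of the given URL as a UTF-8 string
    public static String fetch(String address) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(address).openConnection();
        conn.setConnectTimeout(2000);
        conn.setReadTimeout(2000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        try {
            // assumed default address of a local couchdb
            String body = fetch("http://localhost:5984/");
            System.out.println(looksLikeCouch(body)
                ? "couchdb is up: " + body
                : "unexpected reply: " + body);
        } catch (Exception e) {
            System.out.println("couchdb not reachable: " + e);
        }
    }
}
```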

Now install the test databases:

cd backup
./upload-all.sh

The script asks for a password. Type in jabberw0cky (with a zero). This is a master script that calls all the other upload-*.sh files, so you can also upload the databases individually. Finally, install and run hritserver from the command line:

5. Run the installer

cd ..
sudo ./install.sh
sudo ./hritserver-start.sh

Hit return. The service will run even when you log off. To stop the service, log in as the same user who launched it initially and type:

sudo ./hritserver-stop.sh

Access the service on http://localhost:8080/tests/ (the trailing slash is significant). To make it visible on port 80, add the following lines to the end of /etc/apache2/httpd.conf, and restart apache.

On OSX:

ProxyPass /tests/ http://localhost:8080/tests/ retry=0
ProxyPass /corpix/ http://localhost:8080/corpix/ retry=0
ProxyPass /import/ http://localhost:8080/import/ retry=0
ProxyPass /upload/ http://localhost:8080/upload/ retry=0

Now restart apache: sudo apachectl restart

Ubuntu

Add the above lines to /etc/apache2/mods-available/proxy.conf.

Restart apache2: sudo service apache2 restart

Adding other texts

The mmpupload tool can be used to upload files in XML or plain text in batches. See www.github.com/HRIT-Infrastructure.