Saturday 23 October 2010

Unexpanding entities in expat

Expat 'expands' (that is, resolves) entities, so if you parse a file containing &amp; it will turn into &. Expat does this automatically and there is no option to turn it off. The problem is that any XML file written out from the parsed data will no longer be well-formed, because the bare & is not escaped. The documentation says you can specify a 'default handler', which will turn off the expansion as a side-effect. However, if you also define a character handler, as you normally would in a SAX parser, it will receive the resolved entity instead:

XML_Char ch = (XML_Char) XmlPredefinedEntityName(enc,
          s + enc->minBytesPerChar,
          next - enc->minBytesPerChar);
if (ch) {
    if (characterDataHandler)
        characterDataHandler(handlerArg, &ch, 1);
    else if (defaultHandler)
        reportDefault(parser, enc, s, next);
    break;
}

Note the else: the default handler is only called when no character handler is set, so with a character handler in place you always get the resolved ampersand, not the literal entity reference. Ho hum. But the key is the line characterDataHandler(handlerArg, &ch, 1);. Since every resolved &amp; reaches the character handler as a one-character call containing an actual ampersand, you can just turn it back into an entity reference there and the problem is solved, e.g.

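#include <stdio.h>
#include <stdlib.h>
#include <expat.h>

/* Character data handler. text_dest (FILE *) and current_text_offset
   are globals defined elsewhere in the program. */
static void XMLCALL charhndl( void *userData, const XML_Char *s,
    int len )
{
    size_t  n, expected;
    (void)userData;
    if ( len == 1 && s[0] == '&' )
    {
        /* a resolved &amp;: write the entity reference back out */
        expected = 5;
        n = fwrite( "&amp;", 1, expected, text_dest );
        current_text_offset += 5;
    }
    else
    {
        expected = (size_t)len;
        n = fwrite( s, 1, expected, text_dest );
        current_text_offset += len;
    }
    /* check the write in both branches, and exit with a failure status */
    if ( n != expected )
    {
        fprintf( stderr, "write error on text file\n" );
        exit( EXIT_FAILURE );
    }
}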
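
The handler above only deals with &amp;, but expat resolves the other predefined entities (such as &lt; and &gt;) through the same XmlPredefinedEntityName path, so they will need the same treatment if they occur in your files. Here is a rough sketch of how the re-escaping could be generalised. The helper write_escaped is my own hypothetical name, not part of expat, and it assumes a build where XML_Char is plain char (UTF-8). It escapes every &, < and > it sees rather than relying on len == 1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an expat function): writes len bytes of
   character data to dest, replacing '&', '<' and '>' with their
   entity references. Returns the number of bytes written, or -1 on
   a write error. Assumes XML_Char is char, i.e. UTF-8 output. */
static long write_escaped(FILE *dest, const char *s, int len)
{
    long written = 0;
    int i;

    assert(dest != NULL && s != NULL && len >= 0);
    for (i = 0; i < len; i++) {
        const char *rep;
        size_t replen;

        switch (s[i]) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  rep = NULL;    break;
        }
        if (rep != NULL) {
            replen = strlen(rep);
        } else {
            /* ordinary byte: copy it through unchanged */
            rep = &s[i];
            replen = 1;
        }
        if (fwrite(rep, 1, replen, dest) != replen)
            return -1;
        written += (long)replen;
    }
    return written;
}
```

Inside charhndl you would then call write_escaped( text_dest, s, len ), treat a return of -1 as the write error, and otherwise add the return value to current_text_offset. Working byte by byte is safe for UTF-8 because the bytes of a multi-byte character never equal &, < or >.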