[OSM-dev] osmosis utf-8

Fri Nov 9 08:38:06 GMT 2007

Martijn van Oosterhout wrote:
> On Nov 8, 2007 1:24 PM, Brett Henderson <brett at bretth.com> wrote:
>   
>> I guess Cp1252 isn't quite what mysql uses after all.  Although it seems
>> like we're on the right track.  Perhaps I need to write my own encoding
>> ...  I guess I need to find out what mysql truly does use for latin1.
>>     
>
> Not so quick, you're on the right track. If you compare the two you
> will see they differ by one (1!) byte, a 0x81 is converted to a
> question mark. The reason probably being that 0x81 is not a valid
> character in cp1252. Mysql being what it is doesn't complain.
>
> The choices from here are a bit tricky. You can get the charset
> mapping in various places. Perhaps the easiest solution would be to
> set the "unmappable char" character to 0x81, if it'll let you. I'm
> just worried about the other possible unrepresentable char 0xAD.
>
> Here's the charset we're talking about:
> http://demo.icu-project.org/icu-bin/convexp?conv=ibm-5348_P100-1997&s=ALL
>
> I'm fresh out of ideas here. Part of me says to make your own mapping
> table or converter but how you'd do that within the Java framework I
> have absolutely no idea.
>   
Well, the custom decoder part seems fairly easy.  You just have to
subclass java.nio.charset.Charset, java.nio.charset.CharsetEncoder and
java.nio.charset.CharsetDecoder.
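
For anyone following along, here is a minimal sketch of what that subclassing looks like.  It is not the osmosis code: the class name SketchAsciiCharset and the charset name "X-Sketch-Ascii" are made up for illustration, and it only handles plain ASCII so that the three pieces stay visible.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical example charset: ASCII only, everything else is unmappable.
class SketchAsciiCharset extends Charset {
    SketchAsciiCharset() {
        // Canonical name is invented for this sketch; null means no aliases.
        super("X-Sketch-Ascii", null);
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof SketchAsciiCharset;
    }

    @Override
    public CharsetDecoder newDecoder() {
        // One byte always becomes at most one char.
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    int value = in.get() & 0xFF;
                    if (value >= 0x80) {
                        // Leave the position on the offending byte and report it.
                        in.position(in.position() - 1);
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((char) value);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        // One char always becomes at most one byte.
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    char c = in.get();
                    if (c >= 0x80) {
                        // Leave the position on the offending char and report it.
                        in.position(in.position() - 1);
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((byte) c);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}
```

An instance can be passed directly wherever a Charset is accepted, for example new InputStreamReader(stream, new SketchAsciiCharset()).  Making it available through Charset.forName would additionally need a java.nio.charset.spi.CharsetProvider, which this sketch leaves out.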

I have written most of the decoder but it is still failing because I
haven't completed all the byte conversions.  Specifically, I'm missing:
0x81
0x8D
0x8F
0x90
0x9D

If I receive one of those bytes I currently throw an exception because I
don't know how to map it yet.  I'll fix each one as I encounter it and
can see an example of the correct output.  The first one I've hit is
0x81, which occurs in my favourite node:
http://www.openstreetmap.org/api/0.5/node/21683296/history
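
One candidate mapping, offered as an assumption rather than something verified against this node: the MySQL reference manual describes its latin1 as cp1252 with the five undefined bytes assigned to the Unicode code points of the same number (0x81 to U+0081, and so on).  If that holds, the missing cases could be handled along these lines.  The class and method names are hypothetical.

```java
// Sketch only: assumes MySQL latin1 maps each byte that cp1252 leaves
// undefined to the code point with the same numeric value.
class UndefinedCp1252Bytes {
    /** True for the five byte values that cp1252 leaves undefined. */
    static boolean isUndefinedInCp1252(int value) {
        switch (value) {
        case 0x81:
        case 0x8D:
        case 0x8F:
        case 0x90:
        case 0x9D:
            return true;
        default:
            return false;
        }
    }

    /** Maps one of the five undefined bytes to the same-numbered code point. */
    static char mapLikeMysqlLatin1(int value) {
        if (!isUndefinedInCp1252(value)) {
            throw new IllegalArgumentException("Byte 0x"
                    + Integer.toHexString(value) + " is not one of the five undefined bytes.");
        }
        return (char) value;
    }
}
```

The real test is still whether the decoded node history matches the expected output, so this should be checked against the 0x81 example before being trusted for the other four.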

I need food now but will start working through these remaining codes soon.

The code snippet performing the conversion is below.  It is inside the 
decodeLoop method implementation.
while (true) {
    int nextValue;

    if (!in.hasRemaining()) {
        return CoderResult.UNDERFLOW;
    }
    if (!out.hasRemaining()) {
        return CoderResult.OVERFLOW;
    }

    // Read the next byte.  Widening it to int sign-extends values of
    // 0x80 and above.
    nextValue = in.get();
    // Mask off the sign-extended bits so that the value is in the
    // range 0x00 to 0xFF.
    nextValue = nextValue & 0xFF;

    if (nextValue >= 0x00 && nextValue < 0x80) {
        // No translation required for this range of characters.
        out.put((char) nextValue);
    } else if (nextValue >= 0x00A0 && nextValue < 0x0100) {
        // No translation required for this range of characters.
        out.put((char) nextValue);
    } else {
        switch (nextValue) {
        case 0x80:
            out.put((char) 0x20AC);
            break;
        case 0x82:
            out.put((char) 0x201A);
            break;
        case 0x83 :
            out.put((char) 0x0192);
            break;
        case 0x84 :
            out.put((char) 0x201E);
            break;
        case 0x85 :
            out.put((char) 0x2026);
            break;
        case 0x86 :
            out.put((char) 0x2020);
            break;
        case 0x87 :
            out.put((char) 0x2021);
            break;
        case 0x88 :
            out.put((char) 0x02C6);
            break;
        case 0x89 :
            out.put((char) 0x2030);
            break;
        case 0x8A :
            out.put((char) 0x0160);
            break;
        case 0x8B :
            out.put((char) 0x2039);
            break;
        case 0x8C :
            out.put((char) 0x0152);
            break;
        case 0x8E :
            out.put((char) 0x017D);
            break;
        case 0x91 :
            out.put((char) 0x2018);
            break;
        case 0x92 :
            out.put((char) 0x2019);
            break;
        case 0x93 :
            out.put((char) 0x201C);
            break;
        case 0x94 :
            out.put((char) 0x201D);
            break;
        case 0x95 :
            out.put((char) 0x2022);
            break;
        case 0x96 :
            out.put((char) 0x2013);
            break;
        case 0x97 :
            out.put((char) 0x2014);
            break;
        case 0x98 :
            out.put((char) 0x02DC);
            break;
        case 0x99 :
            out.put((char) 0x2122);
            break;
        case 0x9A :
            out.put((char) 0x0161);
            break;
        case 0x9B :
            out.put((char) 0x203A);
            break;
        case 0x9C :
            out.put((char) 0x0153);
            break;
        case 0x9E :
            out.put((char) 0x017E);
            break;
        case 0x9F :
            out.put((char) 0x0178);
            break;
        default:
            throw new OsmosisRuntimeException("Byte 0x"
                    + Integer.toHexString(nextValue) + " is not recognised.");
        }
    }
}
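
As a possible alternative to the long switch, the same mappings can be kept in a lookup table indexed by the byte value.  This is only a sketch: the class name Cp1252Table and the convention of returning -1 for an unmapped byte are assumptions for illustration, and the table values are copied from the switch statement above.

```java
// Sketch of a table-driven lookup for the 0x80..0x9F range.
class Cp1252Table {
    // Code points for bytes 0x80..0x9F.  A 0 entry marks one of the five
    // bytes that has no mapping yet (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    private static final char[] HIGH = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, // 0x80-0x87
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,      // 0x88-0x8F
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, // 0x90-0x97
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178  // 0x98-0x9F
    };

    /** Returns the decoded code point, or -1 if the byte has no mapping yet. */
    static int decode(byte b) {
        // Mask so that bytes of 0x80 and above become 128..255, not negative.
        int value = b & 0xFF;
        if (value < 0x80 || value >= 0xA0) {
            // These ranges need no translation.
            return value;
        }
        char mapped = HIGH[value - 0x80];
        return mapped == 0 ? -1 : mapped;
    }
}
```

Inside decodeLoop the -1 case would take the place of the default branch, so the current exception (or a CoderResult error) can still be raised there.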