[OSM-dev] osmosis utf-8
Brett Henderson
brett at bretth.com
Fri Nov 9 08:38:06 GMT 2007
Martijn van Oosterhout wrote:
> On Nov 8, 2007 1:24 PM, Brett Henderson <brett at bretth.com> wrote:
>
>> I guess Cp1252 isn't quite what mysql uses after all. Although it seems
>> like we're on the right track. Perhaps I need to write my own encoding
>> ... I guess I need to find out what mysql truly does use for latin1.
>>
>
> Not so quick, you're on the right track. If you compare the two you
> will see they differ by one (1!) byte, a 0x81 is converted to a
> question mark. The reason probably being that 0x81 is not a valid
> character in cp1252. Mysql being what it is doesn't complain.
>
> The choices from here are a bit tricky. You can get the charset
> mapping in various places. Perhaps the easiest solution would be to
> set the "unmappable char" character to 0x81, if it'll let you. I'm
> just worried about the other possible unrepresentable char 0xAD.
>
> Here's the charset we're talking about:
> http://demo.icu-project.org/icu-bin/convexp?conv=ibm-5348_P100-1997&s=ALL
>
> I'm fresh out of ideas here. Part of me says to make your own mapping
> table or converter but how you'd do that within the Java framework I
> have absolutely no idea.
>
Well, the custom decoder bit seems fairly easy. You just have to
subclass java.nio.charset.Charset, java.nio.charset.CharsetEncoder and
java.nio.charset.CharsetDecoder.
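
For reference, a minimal sketch of how the three subclasses fit together.
The class name SketchLatin1Charset and the charset name "X-Sketch-Latin1"
are hypothetical, and the byte-to-char mapping is a plain identity
placeholder (0x00-0xFF to U+0000-U+00FF), not the cp1252 variant discussed
in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical placeholder charset: every byte maps to the code point of the same value.
class SketchLatin1Charset extends Charset {

    SketchLatin1Charset() {
        // Canonical name is an assumption for this sketch; null means "no aliases".
        super("X-Sketch-Latin1", null);
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof SketchLatin1Charset;
    }

    @Override
    public CharsetDecoder newDecoder() {
        // One input byte always produces exactly one char.
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    // Mask to 0..255 so bytes 0x80 and above do not sign-extend.
                    out.put((char) (in.get() & 0xFF));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        // One input char always produces exactly one byte.
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    char c = in.get();
                    if (c > 0xFF) {
                        // Step back onto the offending char and report it as unmappable.
                        in.position(in.position() - 1);
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((byte) c);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}
```

An instance can then be used directly, for example
new SketchLatin1Charset().decode(ByteBuffer.wrap(bytes)).toString().
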
I have written most of the decoder but it is still failing because I
haven't completed all the character conversions. Specifically I'm missing:
0x81
0x8D
0x8F
0x90
0x9D
If I receive one of those codes I currently throw an exception because I
don't know how to map each one yet. I'll fix each one as I come across it
and have an example of the correct output to check against. The first one
I've hit is
0x81 which occurs in my favourite node:
http://www.openstreetmap.org/api/0.5/node/21683296/history
I need food now but will start working through these remaining codes soon.
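
Throwing from decodeLoop is one option. A minimal sketch of another,
under stated assumptions: decodeLoop can return
CoderResult.unmappableForLength(1) for the five unmapped bytes, which
leaves the choice between failing and substituting a replacement to
whoever configures the decoder. The class UnmappableSketch and its helper
methods are hypothetical, ISO_8859_1 is only a stand-in owner charset, and
the identity mapping for all other bytes is a placeholder rather than the
real table.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of osmosis.
final class UnmappableSketch {

    // True for the five bytes that have no mapping yet.
    static boolean isUnassigned(int value) {
        return value == 0x81 || value == 0x8D || value == 0x8F
                || value == 0x90 || value == 0x9D;
    }

    static CharsetDecoder newDecoder() {
        // ISO_8859_1 is a stand-in; a real decoder would pass its own Charset here.
        return new CharsetDecoder(StandardCharsets.ISO_8859_1, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    int value = in.get() & 0xFF;
                    if (isUnassigned(value)) {
                        // Leave the position on the offending byte and report it.
                        in.position(in.position() - 1);
                        return CoderResult.unmappableForLength(1);
                    }
                    // Placeholder identity mapping; the 0x80-0x9F table is omitted.
                    out.put((char) value);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    // Decodes with each unassigned byte replaced by the given string.
    static String decodeReplacing(byte[] bytes, String replacement) {
        try {
            return newDecoder()
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .replaceWith(replacement)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true when the default REPORT action rejects the input.
    static boolean isRejected(byte[] bytes) {
        try {
            newDecoder().decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }
}
```

With this shape a caller that wants strict behaviour keeps the default
REPORT action, while a caller that prefers substitution configures
REPLACE and picks the replacement string.
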
The code snippet performing the conversion is below. It is inside the
decodeLoop method implementation.

while (true) {
    int nextValue;

    if (!in.hasRemaining()) {
        return CoderResult.UNDERFLOW;
    }
    if (!out.hasRemaining()) {
        return CoderResult.OVERFLOW;
    }

    // Read the next byte into an int so that we can use unsigned values
    // from this point on.
    nextValue = in.get();
    // Clear any "non-byte" bits resulting from byte to int conversion
    // for values 0x80 and above.
    nextValue = nextValue & 0xFF;

    if (nextValue >= 0x00 && nextValue < 0x80) {
        // No translation required for this range of characters.
        out.put((char) nextValue);
    } else if (nextValue >= 0x00A0 && nextValue < 0x0100) {
        // No translation required for this range of characters.
        out.put((char) nextValue);
    } else {
        // 0x80 to 0x9F: translate to the matching Unicode character.
        switch (nextValue) {
        case 0x80:
            out.put((char) 0x20AC);
            break;
        case 0x82:
            out.put((char) 0x201A);
            break;
        case 0x83:
            out.put((char) 0x0192);
            break;
        case 0x84:
            out.put((char) 0x201E);
            break;
        case 0x85:
            out.put((char) 0x2026);
            break;
        case 0x86:
            out.put((char) 0x2020);
            break;
        case 0x87:
            out.put((char) 0x2021);
            break;
        case 0x88:
            out.put((char) 0x02C6);
            break;
        case 0x89:
            out.put((char) 0x2030);
            break;
        case 0x8A:
            out.put((char) 0x0160);
            break;
        case 0x8B:
            out.put((char) 0x2039);
            break;
        case 0x8C:
            out.put((char) 0x0152);
            break;
        case 0x8E:
            out.put((char) 0x017D);
            break;
        case 0x91:
            out.put((char) 0x2018);
            break;
        case 0x92:
            out.put((char) 0x2019);
            break;
        case 0x93:
            out.put((char) 0x201C);
            break;
        case 0x94:
            out.put((char) 0x201D);
            break;
        case 0x95:
            out.put((char) 0x2022);
            break;
        case 0x96:
            out.put((char) 0x2013);
            break;
        case 0x97:
            out.put((char) 0x2014);
            break;
        case 0x98:
            out.put((char) 0x02DC);
            break;
        case 0x99:
            out.put((char) 0x2122);
            break;
        case 0x9A:
            out.put((char) 0x0161);
            break;
        case 0x9B:
            out.put((char) 0x203A);
            break;
        case 0x9C:
            out.put((char) 0x0153);
            break;
        case 0x9E:
            out.put((char) 0x017E);
            break;
        case 0x9F:
            out.put((char) 0x0178);
            break;
        default:
            // 0x81, 0x8D, 0x8F, 0x90 and 0x9D are not mapped yet.
            throw new OsmosisRuntimeException("Byte 0x"
                    + Integer.toHexString(nextValue) + " is not recognised.");
        }
    }
}
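
The same mappings could also be held in a lookup table instead of a
switch. A minimal sketch, assuming a hypothetical class Cp1252HighTable
and a sentinel value UNASSIGNED for the five bytes that are still
unmapped; the table entries are the ones from the switch above.

```java
// Hypothetical table-driven variant of the switch; not part of osmosis.
final class Cp1252HighTable {

    // Sentinel returned for 0x81, 0x8D, 0x8F, 0x90 and 0x9D.
    static final char UNASSIGNED = '\uFFFF';

    // Index 0 corresponds to byte 0x80, index 31 to byte 0x9F.
    private static final char[] HIGH = {
        '\u20AC', UNASSIGNED, '\u201A', '\u0192', // 0x80-0x83
        '\u201E', '\u2026', '\u2020', '\u2021',   // 0x84-0x87
        '\u02C6', '\u2030', '\u0160', '\u2039',   // 0x88-0x8B
        '\u0152', UNASSIGNED, '\u017D', UNASSIGNED, // 0x8C-0x8F
        UNASSIGNED, '\u2018', '\u2019', '\u201C', // 0x90-0x93
        '\u201D', '\u2022', '\u2013', '\u2014',   // 0x94-0x97
        '\u02DC', '\u2122', '\u0161', '\u203A',   // 0x98-0x9B
        '\u0153', UNASSIGNED, '\u017E', '\u0178'  // 0x9C-0x9F
    };

    private Cp1252HighTable() {
    }

    // Returns the char for one byte, or UNASSIGNED for the unmapped bytes.
    static char decode(byte b) {
        // Mask to 0..255 so bytes 0x80 and above do not sign-extend.
        int value = b & 0xFF;
        if (value < 0x80 || value >= 0xA0) {
            // These ranges need no translation.
            return (char) value;
        }
        return HIGH[value - 0x80];
    }
}
```

Inside decodeLoop the result of decode would be compared against
UNASSIGNED before calling out.put, keeping the error handling in one
place.
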