[OSM-talk] Osmosis UTF-8 problem (again)

Brett Henderson brett at bretth.com
Sat Jan 12 01:18:57 GMT 2008


Frederik Ramm wrote:
> Hi,
>
>   
>> Any idea what the user name should be? I find it hard to believe that 
>> user="jos??????¯®????" (from the API) is correct.
>>     
>
> Well on 05 December I did have a problem with the planet diff, quoting
> from old E-Mail:
>
>   
>
>    latest daily planet diff has an UTF-8 problem on line 58267:
> <node id="25254929" timestamp="2007-12-04T17:26:52Z" user="josé" ...
> Seems like the user names don't get encoded properly.
>
> <<<<<<
>
> Username looks conspicuously similar ;)
>   
I remember that email; I was hoping the problem would magically 
disappear ;-)

Checking the history of that node from the API again gives user="jos逴巊 
»H´" (hopefully this is coming through okay, it includes a bunch of 
Chinese-like characters).

I'll check it out in more detail soon. It looks like it should be 
user="josé", but given that the API is also returning "interesting" data, 
it sounds like there's a deeper problem somewhere. Either way, osmosis 
shouldn't be emitting invalid UTF-8, but fixing it may not be easy. It 
might have something to do with characters that can't be represented 
in a single 16-bit character. If it does turn out to be a problem 
elsewhere, I can try to put a hack in place to at least emit valid UTF-8, 
but that will require me to do some more reading of the Unicode standards, 
which I'm not excited about :-)
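
A minimal sketch of what such a hack could look like, assuming the invalid 
output comes from unpaired surrogates in Java's 16-bit chars (that is a 
guess, not a diagnosis of the actual bug). The class and method names are 
hypothetical and not part of osmosis. Each lone surrogate is replaced with 
U+FFFD before the string is encoded, so the resulting bytes are always 
well-formed UTF-8:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of osmosis.
final class Utf8Sanitizer {

    // Returns a copy of input in which every unpaired surrogate is
    // replaced by U+FFFD (the Unicode replacement character).
    static String replaceUnpairedSurrogates(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                // Valid pair: copy both halves unchanged.
                out.append(c).append(input.charAt(i + 1));
                i += 2;
            } else if (Character.isSurrogate(c)) {
                // Lone high or low surrogate: cannot be encoded as UTF-8.
                out.append('\uFFFD');
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A user name followed by a stray high surrogate (made-up input).
        String broken = "jos\u00e9" + '\uD800';
        String fixed = replaceUnpairedSurrogates(broken);
        byte[] bytes = fixed.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

This only guarantees valid output. It cannot recover the original user 
name if the data was already damaged before osmosis read it.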
