[OSM-dev] 04to05.pl doesn't like Puerto Rico

Gabriel Ebner ge at gabrielebner.at
Thu Jan 17 18:11:16 GMT 2008


On Thu, Jan 17, 2008 at 09:28:54AM -0800, Dave Hansen wrote:
> Well, I've done virtually the entire US's TIGER data with the script,
> with no issues, but it finally choked on Puerto Rico.
> 
> It gets this:
> 
> not well-formed (invalid token) at line 330, column 38, byte 14569
> at /usr/local/lib/perl/5.8.8/XML/Parser.pm line 187
> 
> when running on this file:
> 
> http://dev.openstreetmap.org/~daveh/tiger.files/counties/PR/Adjuntas.osm
> 
> I think it's the crazy characters in tags like this:
> 
> 	<tag k="name" v="Carr Sillo de Calder�n"/>
> 	<tag k="tiger:name_base" v="Carr Sillo de Calder�n"/>
> 
> Being a stupid American, I have no real knowledge of character sets and
> that fun.  Any idea what the right way to fix this is?

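As others have pointed out, the "Calderón" in those tags is encoded as
ISO-8859-15 (the ó is the single byte 0xF3), while the XML file says it's
UTF-8, so the parser rejects that byte as an invalid token.  A simple iconv
run should make it work: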

$ iconv -f ISO-8859-15 -t UTF-8 ~/Adjuntas.osm | ./04to05.pl
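
One caveat with running that over every county: a file that is already valid
UTF-8 would get its non-ASCII characters double-encoded.  Below is a rough
sketch of a wrapper that only converts when the input fails to decode as
UTF-8.  The function name to_utf8 is made up for illustration, and the
fallback encoding is simply assumed to be ISO-8859-15 (it is not detected),
so treat this as a starting point rather than a tested fix:

```shell
#!/bin/sh
# to_utf8 FILE (hypothetical helper): print FILE on stdout as UTF-8.
# If FILE already decodes cleanly as UTF-8, pass it through unchanged.
# Otherwise ASSUME it is ISO-8859-15 and convert it; nothing here
# actually detects the source encoding.
to_utf8() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        cat "$1"
    else
        iconv -f ISO-8859-15 -t UTF-8 "$1"
    fi
}
```

With that defined in the shell, the pipeline above would become something
like "to_utf8 ~/Adjuntas.osm | ./04to05.pl".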

  Gabriel.



