[OSM-dev] UTF8 problem with last night's daily .osc

Richard Fairhurst richard at systemeD.net
Sat Aug 30 09:12:01 BST 2008

Frederik Ramm wrote:

> Frederik Ramm wrote:
>> Closer inspection reveals that this is a tag value that has been
>> truncated at character #255, which happens to be in the MIDST of a
>> UTF-8 sequence. Ouch! Who truncates tags to 255 characters?
> It's a bit embarrassing to keep talking to myself here but in case
> anyone else is interested:
> The culprit is way #26604650 which was newly created with Potlatch
> 0.10b, apparently with the tag value being truncated in the middle
> of a UTF-8 sequence

Well, the relevant bit of the migration is

     create_table "current_way_tags", myisam_table do |t|
       t.column "id", :bigint, :limit => 64
       t.column "k",  :string, :default => "", :null => false
       t.column "v",  :string, :default => "", :null => false

and a :string means a MySQL 255-character VARCHAR (http:// 
rails-datatypes/)... so yes, that'll be why it's happening.
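
To illustrate the failure mode, here is a minimal Ruby sketch. It assumes the cut is made on a byte count rather than on character boundaries, which is what the symptom suggests but which this thread does not confirm. The sample value is made up, and `byteslice` needs a modern Ruby (1.9.3 or later).

```ruby
# Illustration only: a made-up value whose 255th and 256th bytes
# together form one two-byte UTF-8 character (U+00E9).
value = ("a" * 254) + "\u00e9"

# Naive cut on a byte count, ignoring character boundaries.
naive = value.byteslice(0, 255)

puts value.bytesize          # 256
puts naive.valid_encoding?   # false: the last byte is a lone lead byte
```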

So I guess the solution is either for Osmosis to conform to Postel's
Law; or to change the datatype (presumably breaks indexing?); or for
Potlatch/amf_controller, which don't currently have any limit on
key/value lengths (well, 64k :) ), to preprocess keys/values by
truncating at the nearest UTF-8 boundary before 255 bytes.
Suggestions welcome as to how this should be done.
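
As one possible starting point for the third option, here is a rough sketch of boundary-aware truncation in Ruby. The helper name `truncate_utf8` is hypothetical and is not part of Potlatch or amf_controller. It assumes the input is already valid UTF-8 and that the limit is counted in bytes, and it relies on `byteslice`, so it needs Ruby 1.9.3 or later.

```ruby
# Hypothetical helper, not existing Potlatch/amf_controller code.
# Returns str cut to at most max_bytes bytes without splitting a
# multi-byte UTF-8 sequence. Assumes str is valid UTF-8.
def truncate_utf8(str, max_bytes = 255)
  return str if str.bytesize <= max_bytes

  cut = max_bytes
  # UTF-8 continuation bytes look like 10xxxxxx. If the first byte
  # past the cut is one of those, the cut falls inside a sequence,
  # so step back until the cut sits just before a lead byte.
  cut -= 1 while cut > 0 && (str.getbyte(cut) & 0xC0) == 0x80
  str.byteslice(0, cut)
end

# Made-up sample: 254 ASCII bytes followed by a two-byte character.
sample = ("a" * 254) + "\u00e9"
short  = truncate_utf8(sample)

puts short.bytesize          # 254
puts short.valid_encoding?   # true
```

This drops the whole trailing character rather than leaving half of it, so the result can come out a few bytes shorter than the limit.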


