[Openstreetmap-dev] CSV transport encoding scheme

David Sheldon dave-osm at earth.li
Thu Jan 26 23:37:23 GMT 2006


On Thu, Jan 26, 2006 at 11:36:32PM +0100, Lars Aronsson wrote:
> Immanuel Scholz wrote:
> 
> > I am very unsatisfied with the current server performance,
> > especially for map.rb - requests.
> 
> If this is indeed XML related, maybe it is because of how XML tags 
> are concatenated into longer strings (or parsed?), with a lot of 
> char buffer reallocation?  (That's just a guess.  Many performance 
> bottlenecks are due to copying data around needlessly.) Is there 
> another way to handle XML in Ruby than is currently used in 
> map.rb? 

It looks like REXML's fault.  Looking at the code, it looks like the
time is taken up printing the attributes, and almost half of it in
Text::normalize.

In my entirely unscientific test I tried to print 10000 elements, each
with 4 attributes. Doing this with simple  out << "<foo"  style lines
took about 0.17 seconds.
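
A rough sketch of how such a comparison could be set up, for anyone who
wants to repeat it. This is not the script I used; the element name
"foo", the attribute names a-d and the two method names are made up for
illustration, and the numbers will differ between machines and REXML
versions.

```ruby
require 'benchmark'
require 'rexml/document'

COUNT = 10_000 # number of elements, each with 4 attributes

# Build the output by appending plain strings; nothing is escaped.
def build_with_strings(count)
  out = +''
  out << '<root>'
  count.times do |i|
    out << "<foo a='#{i}' b='#{i}' c='#{i}' d='#{i}'/>"
  end
  out << '</root>'
  out
end

# Build the same document through REXML and serialise it to a string.
def build_with_rexml(count)
  doc = REXML::Document.new
  root = doc.add_element('root')
  count.times do |i|
    v = i.to_s
    root.add_element('foo', { 'a' => v, 'b' => v, 'c' => v, 'd' => v })
  end
  out = +''
  doc.write(out)
  out
end

Benchmark.bm(8) do |bm|
  bm.report('strings') { build_with_strings(COUNT) }
  bm.report('rexml')   { build_with_rexml(COUNT) }
end
```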

The to_string in the Attribute class in REXML is
 def to_string
   "#@expanded_name='#{to_s().gsub(/'/, '&apos;')}'"
 end

Left like that, it takes 5.5 seconds.

With it replaced by:
      "#@expanded_name='#{@value}'"
it takes 1.7 seconds.

With:
      "#@expanded_name='#{to_s()}'"
it takes 5.4 seconds, so the gsub takes very little time (relatively;
it is still almost as long as the whole thing takes when not using
REXML at all).

In to_s, there is a line 
      @value = @normalized = Text::normalize( @value, doctype )

This caches the normalized value, but that isn't much use to us as we
only print each attribute once.

Replace that line with "@value = @normalized = @value" (it doesn't do
much any more), and the time taken is 2.2 seconds.

The Text::normalize function is therefore quite an overhead.
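
To look at that cost in isolation, something like the following could
be used. It is only a sketch: it calls REXML::Text.normalize directly
on 40000 short strings (10000 elements times 4 attributes) and compares
that with simply copying them. The method names normalize_all and
copy_all are made up here, and this is not how REXML calls normalize
internally.

```ruby
require 'benchmark'
require 'rexml/document'

# 10000 elements x 4 attributes worth of simple numeric values.
VALUES = Array.new(40_000) { |i| i.to_s }

# Run every value through REXML's entity normalisation.
def normalize_all(values)
  values.map { |v| REXML::Text.normalize(v) }
end

# Baseline: just copy each string without looking at its contents.
def copy_all(values)
  values.map { |v| v.dup }
end

Benchmark.bm(10) do |bm|
  bm.report('normalize') { normalize_all(VALUES) }
  bm.report('copy')      { copy_all(VALUES) }
end
```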

I think that if we think hard about our data, and write out XML using
print statements, we can recover a lot of time. On the other hand, we
might end up shooting ourselves in the foot if we are not careful and
start producing malformed XML. That said, there is quite a chance of
similar shootings if we write our own CSV printer.


> XML in itself solves so many problems with quoting and structure, 
> that there should be very little reason to use anything else, for 
> applications that are designed today (legacy free).  You already 
> said that UTF-8 was mandatory and that you wouldn't approve of any 
> other character set.  I agree.  And the same should go for XML.

I agree. XML has significant benefits, and though I am normally an
advocate of using XML libraries to read and write XML, I think that we
can justify the time to write XML out manually. We just need to think
about exactly what values we will be getting out of the database. We
KNOW that the doubles we get out of the db are not going to contain
characters that need escaping as they are of the form [0-9]+(\.[0-9]+)?
so we could save there. Similarly for uids from the database. This only
leaves "tags" for both the nodes and segments. 
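
As a minimal sketch of what I mean (not the actual map.rb code): the
numeric values are interpolated as they are, and only the tags go
through an escaping helper. The element and attribute names, and the
helper names escape_attr and write_node, are assumptions made for this
example.

```ruby
# Characters that need escaping inside a quoted XML attribute value.
XML_ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  "'" => '&apos;',
  '"' => '&quot;'
}.freeze

# Escape free-text values; only the tags need this.
def escape_attr(text)
  text.to_s.gsub(/[&<>'"]/, XML_ESCAPES)
end

# Append one node element to out. The uid, lat and lon are numbers
# from the database, so they are written without any escaping.
def write_node(out, uid, lat, lon, tags)
  out << "<node uid='#{uid}' lat='#{lat}' lon='#{lon}' " \
         "tags='#{escape_attr(tags)}'/>\n"
end

out = +''
write_node(out, 42, 51.5, -0.12, 'name=Dog & Duck')
print out
```

The same approach would apply to segments. The important part is that
anything which is not known to be a plain number still goes through
escape_attr.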

I'm away for a few days, but if you want I can rewrite map.rb without
using REXML. Alternatively, does Ruby have a wrapper for expat that can
be used to write the XML? expat is a native XML library and should be a
little faster here.

David
-- 
"[Hackers] then only have to crack the password to take control"
  -- IT Week on a terrible Unix security flaw



