[OSM-dev] Rantings about API 0.6

Tue Feb 10 12:37:35 GMT 2009

On Tue, Feb 10, 2009 at 1:01 AM, Frederik Ramm <frederik at remote.org> wrote:
> Hi,
>
> Iván Sánchez Ortega wrote:
>> [2009-02-10 01:00:34.809017 #1664]   User Load (0.000144)   SELECT * FROM
>> `users` WHERE (`users`.`id` = 1)
>> [2009-02-10 01:00:34.810457 #1664]   SQL (0.000144)   SELECT `display_name`
>> FROM `users` WHERE (`users`.display_name = 'ivansanchez' AND `users`.id <> 1)
>> [2009-02-10 01:00:34.811241 #1664]   SQL (0.000108)   SELECT `email` FROM
>> `users` WHERE (`users`.email = 'ivan at sanchezortega.es' AND `users`.id <> 1)
>>
>> WTF does this happen on every object (read: node) I upload? The API already
>> knows who I am the second I started to upload the diff!
>
> Probably the same thing that Matt Amos just posted, the individual
> object processing code being re-used for each object in the change file,
> and so for each object it checks whether you're really the owner of the
> changeset.

its not quite that - the objects "belong to" a changeset and validate
that changeset each time they're saved. the changesets "belong to" a
user and validate that user each time they're saved. what you're
seeing is the node (and historical node) reloading your user entry to
validate it. the check for whether the user owns the changeset is done
in ruby code without going back to the database.

> Maybe we should indeed re-think this one; it is the only optimisation
> that would influence the protocol. (Then again it would do so in a
> backwards-compatible way; if we drop the changeset=... attribute
> requirement later and clients still send it, who cares.)

while this is a good idea which would only slightly uglify the code,
the user and changeset validations would be run regardless.

>> [2009-02-10 01:00:34.863175 #1664]   SQL (0.000244)   SELECT `id` FROM
>> `current_nodes` WHERE (`current_nodes`.id IS NULL)
>>
>> current_nodes.id is defined as NOT NULL. So, WTF?

this is probably coming from the node's validations. one of
  validates_presence_of :id, :on => :update
  validates_uniqueness_of :id
  validates_numericality_of :id, :on => :update, :integer_only => true
is probably to blame. there might be an optimisation to be done here
by telling rails that one or more of these validations doesn't need to
be run.

>> [2009-02-10 01:00:34.864872 #1664]   SQL (0.000258)   SELECT `id` FROM
>> `changesets` WHERE (`changesets`.id = 44 AND `changesets`.id <> 44)
>>
>> ***WTF***???!!!!
>
> I can only assume that these things are somehow deeply magic Rails
> incantations that will provide good Karma for the requests to come. And
> I can assure you that it would be worse with Hibernate ;-)

i think this is also coming from a similar validation, presumably
checking numericality or something like that... looks a bit weird i
agree, but i presume some rails guru would be able to tell us what its
trying to do and why thats the best way to do it... maybe.

>> [2009-02-10 01:00:34.859181 #1664]   Changeset Update (0.000314)   UPDATE
>> `changesets` SET `num_changes` = 18443 WHERE `id` = 44
>>
>> Couldn't this wait until the diff upload is complete?
>
> As we're in a transaction, I guess you're right. - You have probably
> found a weak spot where we traded efficiency for maintainability.
> Re-using the individual object change building blocks as-is makes the
> code easier to read, understand, maintain, but carries a performance
> penalty...

just for fun i've uglified the code by adding a parameter to save
which controls whether the changeset gets saved when the element does.
this saves the grand total of one update query in the midst of 18
others. so for teh moar fun, i disabled the node and old_node
changeset validations and brought the query count for a node creation
down to 6:

[2009-02-10 12:26:07.139315 #13734]   Node Create (0.000156)   INSERT
INTO `current_nodes` (`latitude`, `changeset_id`, `timestamp`, `tile`,
 `id`, `version`, `longitude`, `visible`) VALUES(0, 1, '2009-02-10
12:26:07', 3221225472, NULL, 1, 0, 1)
[2009-02-10 12:26:07.140367 #13734]   NodeTag Load (0.000181)   SELECT
* FROM `current_node_tags` WHERE (`current_node_tags`.`id` = 263)
[2009-02-10 12:26:07.140666 #13734]   NodeTag Delete all (0.000099)
DELETE FROM `current_node_tags` WHERE (id = 263)
[2009-02-10 12:26:07.142385 #13734]   OldNode Create (0.000155)
INSERT INTO `nodes` (`latitude`, `changeset_id`, `timestamp`, `tile`,
`id`,
`version`, `longitude`, `visible`) VALUES(0, 1, '2009-02-10 12:26:07',
3221225472, 263, 1, 0, 1)
[2009-02-10 12:26:07.143013 #13734]   OldNode Load (0.000164)   SELECT
* FROM `nodes` WHERE (id = 263 AND timestamp = '2009-02-10 12:26:07'
AND version = 1) LIMIT 1
[2009-02-10 12:26:07.144963 #13734]   SQL (0.000124)   SELECT `id`
FROM `current_nodes` WHERE (`current_nodes`.id IS NULL)

>> I do know OSM will have some brand new DB server (now with 0.1 more API!) that
>> should be able to handle the extra overhead, and that most of those queries
>> will hit data that will be cached anyway.
>
> Also, the above requests would all happen if you did individual uploads,
> *plus* you would have the http setup overhead, so even as-is, diff
> uploads will bring a considerable gain.

as you say, with any luck the queries for users and changesets will be
cached. so disabling validations would gain us little and admit the
possibility that invalid data gets into the DB somehow... not saving
the changesets on each diff might be a win - but is it big enough to
justify uglifying the code?

>> Is it worth to optimize the code for diff uploads?
>
> It can probably wait till later.

i agree. there is scope for optimisation within rails with varying
degrees of ugliness ;-)

cheers,

matt