[openstreetmap/openstreetmap-website] Why are some diff uploads slow? (#1710)

mmd notifications at github.com
Sun Dec 24 09:34:36 UTC 2017


I ran a very small local test based on [Changeset 53407241](http://www.openstreetmap.org/changeset/53407241), which took around 15 minutes to upload via JOSM. Most CPU time was spent in the _puma 001_ process (Rails). On the console I noticed a rather large number of single-row INSERT/DELETE/UPDATE operations:

  - One for each single node (in both the current and history tables)
  - One for each single tag
  - One for each single node reference within a way
  - Frequent re-checks of the current changeset

Overall, I also suspect that shared/exclusive locks contribute to the longer processing time on the server. With long-running transactions, the chances increase that other people start working on the same objects.

I think it's time for a bit of out-of-the-box thinking, leaving the current implementation behind for a minute. That's why I gave [pyosmium](http://osmcode.org/pyosmium/) and some bulk SQL statements a short try.
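For illustration, a handler doesn't need much more than this (a minimal sketch; the class name, counters, and file name are illustrative, not my exact script):

```python
import osmium

# Minimal sketch: walk an .osc diff file and count the contained
# objects. SimpleHandler and apply_file are standard pyosmium API;
# the file name is made up.
class DiffCounter(osmium.SimpleHandler):
    def __init__(self):
        super().__init__()
        self.nodes = self.ways = self.relations = 0

    def node(self, n):
        self.nodes += 1

    def way(self, w):
        self.ways += 1

    def relation(self, r):
        self.relations += 1

handler = DiffCounter()
handler.apply_file("changeset-53407241.osc")
print(handler.nodes, handler.ways, handler.relations)
```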

The same 8 MB diff file was parsed completely in 140 ms. Based on this, I generated some bulk INSERT statements, effectively creating all 4300+ nodes in one go:

```sql
BEGIN;
INSERT INTO "current_nodes" ("latitude", "longitude", "changeset_id", "visible", "timestamp", "tile", "version") VALUES
(423712730, -832172277, 19, 't', '2017-12-24 09:15:27.204641', 1234, 1),
(423712670, -834339064, 19, 't', '2017-12-24 09:15:27.204675', 1234, 1),
(423715247, -832172360, 19, 't', '2017-12-24 09:15:27.204684', 1234, 1),
(423715140, -834339084, 19, 't', '2017-12-24 09:15:27.204691', 1234, 1),
(423717479, -834339184, 19, 't', '2017-12-24 09:15:27.204697', 1234, 1),
(423672264, -832171144, 19, 't', '2017-12-24 09:15:27.204704', 1234, 1),
(423672601, -834336964, 19, 't', '2017-12-24 09:15:27.204710', 1234, 1),
(423674831, -832171161, 19, 't', '2017-12-24 09:15:27.204717', 1234, 1),
[...many more...]
RETURNING id;
COMMIT;
```

This query only took a few hundred ms on my local db (which doesn't hold much data). The number of rows per INSERT statement (the batch size) could of course be made configurable, and the returned node ids could be used for the later placeholder-replacement operations. They could even be stored in a temporary table.
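As a sketch of how that could look from the Python side: psycopg2's `execute_values` already offers a configurable `page_size` and can fetch the `RETURNING` ids (the connection string is made up; the column list follows the schema above):

```python
import psycopg2
from psycopg2.extras import execute_values

# Sketch: bulk-insert parsed nodes with a configurable batch size and
# collect the generated ids for later placeholder replacement.
# The connection string is made up.
conn = psycopg2.connect("dbname=openstreetmap")

rows = [
    (423712730, -832172277, 19, True, '2017-12-24 09:15:27.204641', 1234, 1),
    (423712670, -834339064, 19, True, '2017-12-24 09:15:27.204675', 1234, 1),
    # ... all 4300+ parsed nodes ...
]

with conn, conn.cursor() as cur:
    returned = execute_values(
        cur,
        'INSERT INTO current_nodes '
        '(latitude, longitude, changeset_id, visible, "timestamp", tile, version) '
        'VALUES %s RETURNING id',
        rows,
        page_size=1000,  # the configurable batch size
        fetch=True,      # collect the RETURNING rows (psycopg2 2.8+)
    )

new_ids = [r[0] for r in returned]  # ids in insertion order
```

The collected ids could then feed the placeholder replacement for ways and relations that reference the new nodes.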

Regarding modify and delete operations, a huge part of the task is to pull data out of the database node by node and tag by tag, and to INSERT it into the history tables. I wonder why we don't let the database do this job itself: provide a list of object ids and do the copying in one go, which takes maybe a few seconds.

Here's a very rough idea:

```sql
INSERT INTO node_tags (node_id, version, k, v)
SELECT node_id, version, k, v
FROM current_node_tags t
INNER JOIN current_nodes n ON t.node_id = n.id
WHERE n.id IN (19302, 19301, 19300, 19299, 19298, 19297, 19296, 19295, 19294,
               19293, 19292, 19291, 19290, 19289, 19288, 19287, 19286, 19285,
               19284, 19283, 19282, 19281, 19280, 19279, 19278, 19277, 19276,
               19275, 19274, 19273, 19272, 19271, 19270, 19269, 19268, 19267,
               19266);
```
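The same pattern would cover the node rows themselves. As a sketch from Python, with the id list passed as a parameter (the history-table column list is my reading of the rails schema and may need adjusting):

```python
import psycopg2

# Sketch: let the database copy the current rows into the history
# table in a single statement; psycopg2 adapts a Python tuple for IN %s.
conn = psycopg2.connect("dbname=openstreetmap")
node_ids = (19302, 19301, 19300)  # ... the full id list from above

with conn, conn.cursor() as cur:
    cur.execute(
        'INSERT INTO nodes '
        '(node_id, latitude, longitude, changeset_id, visible, "timestamp", tile, version) '
        'SELECT id, latitude, longitude, changeset_id, visible, "timestamp", tile, version '
        'FROM current_nodes WHERE id IN %s',
        (node_ids,),
    )
```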

I'm far from understanding all the details of the current implementation. Nevertheless, I think there's a lot of potential for improvement here, even though it could mean that we can't reuse much of the current code. Wouldn't that be a good fit for cgimap?


-- 
Reply to this email directly or view it on GitHub:
https://github.com/openstreetmap/openstreetmap-website/issues/1710#issuecomment-353774921