<p>I ran a very small local test based on <a href="http://www.openstreetmap.org/changeset/53407241" rel="nofollow">Changeset 53407241</a>, which took around 15 minutes to upload via JOSM. Most of the CPU time was spent in the <em>puma 001</em> process (Rails). On the console I noticed a rather large number of single-row INSERT/DELETE/UPDATE operations:</p>
<ul>
<li>One statement per node (for both current and old nodes)</li>
<li>One statement per tag</li>
<li>One statement per node in a way</li>
<li>Frequent re-checks of the current changeset</li>
</ul>
<p>Overall, I also suspect that shared/exclusive locks contribute to the longer processing time on the server. With long-running transactions, the chances increase that other people start working on the same objects.</p>
<p>I think it's time for a bit of out-of-the-box thinking, leaving the current implementation behind for a minute. That's why I gave <a href="http://osmcode.org/pyosmium/" rel="nofollow">pyosmium</a> and some bulk SQL statements a quick try.</p>
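<p>For illustration only (the actual test used pyosmium, which is far faster): the per-node information we need can be pulled out of an osmChange diff with a few lines of plain Python. The sketch below uses the stdlib's <code>ElementTree</code> and a made-up two-node diff; the function name <code>created_nodes</code> is hypothetical.</p>

```python
# Sketch only: the real test used pyosmium. This just shows the shape of
# the data that has to come out of an osmChange diff before bulk-inserting.
import xml.etree.ElementTree as ET

# Tiny made-up osmChange fragment with two created nodes.
osm_change = """<osmChange version="0.6">
  <create>
    <node id="-1" lat="42.3712730" lon="-83.2172277" changeset="19"/>
    <node id="-2" lat="42.3712670" lon="-83.4339064" changeset="19"/>
  </create>
</osmChange>"""

def created_nodes(xml_text):
    """Return (lat, lon, changeset) tuples for every node inside <create> blocks."""
    root = ET.fromstring(xml_text)
    return [
        (float(n.get("lat")), float(n.get("lon")), int(n.get("changeset")))
        for create in root.findall("create")
        for n in create.findall("node")
    ]
```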
<p>The same 8&#8239;MB diff file was completely parsed in 140&#8239;ms. Based on this, I generated bulk INSERT statements, effectively creating all 4,300+ nodes in one go:</p>
<div class="highlight highlight-source-sql"><pre><span class="pl-k">BEGIN</span>;
<span class="pl-k">INSERT INTO</span> <span class="pl-s"><span class="pl-pds">"</span>current_nodes<span class="pl-pds">"</span></span> (<span class="pl-s"><span class="pl-pds">"</span>latitude<span class="pl-pds">"</span></span>, <span class="pl-s"><span class="pl-pds">"</span>longitude<span class="pl-pds">"</span></span>, <span class="pl-s"><span class="pl-pds">"</span>changeset_id<span class="pl-pds">"</span></span>, <span class="pl-s"><span class="pl-pds">"</span>visible<span class="pl-pds">"</span></span>, <span class="pl-s"><span class="pl-pds">"</span>timestamp<span class="pl-pds">"</span></span>, <span class="pl-s"><span class="pl-pds">"</span>tile<span class="pl-pds">"</span></span>, <span class="pl-s"><span class="pl-pds">"</span>version<span class="pl-pds">"</span></span>) <span class="pl-k">VALUES</span>
(<span class="pl-c1">423712730</span>, <span class="pl-k">-</span><span class="pl-c1">832172277</span>, <span class="pl-c1">19</span>, <span class="pl-s"><span class="pl-pds">'</span>t<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span>2017-12-24 09:15:27.204641<span class="pl-pds">'</span></span>, <span class="pl-c1">1234</span>, <span class="pl-c1">1</span>),
(<span class="pl-c1">423712670</span>, <span class="pl-k">-</span><span class="pl-c1">834339064</span>, <span class="pl-c1">19</span>, <span class="pl-s"><span class="pl-pds">'</span>t<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span>2017-12-24 09:15:27.204675<span class="pl-pds">'</span></span>, <span class="pl-c1">1234</span>, <span class="pl-c1">1</span>),
(<span class="pl-c1">423715247</span>, <span class="pl-k">-</span><span class="pl-c1">832172360</span>, <span class="pl-c1">19</span>, <span class="pl-s"><span class="pl-pds">'</span>t<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span>2017-12-24 09:15:27.204684<span class="pl-pds">'</span></span>, <span class="pl-c1">1234</span>, <span class="pl-c1">1</span>),
(<span class="pl-c1">423715140</span>, <span class="pl-k">-</span><span class="pl-c1">834339084</span>, <span class="pl-c1">19</span>, <span class="pl-s"><span class="pl-pds">'</span>t<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span>2017-12-24 09:15:27.204691<span class="pl-pds">'</span></span>, <span class="pl-c1">1234</span>, <span class="pl-c1">1</span>),
(<span class="pl-c1">423717479</span>, <span class="pl-k">-</span><span class="pl-c1">834339184</span>, <span class="pl-c1">19</span>, <span class="pl-s"><span class="pl-pds">'</span>t<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span>2017-12-24 09:15:27.204697<span class="pl-pds">'</span></span>, <span class="pl-c1">1234</span>, <span class="pl-c1">1</span>),
(<span class="pl-c1">423672264</span>, <span class="pl-k">-</span><span class="pl-c1">832171144</span>, <span class="pl-c1">19</span>, <span class="pl-s"><span class="pl-pds">'</span>t<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span>2017-12-24 09:15:27.204704<span class="pl-pds">'</span></span>, <span class="pl-c1">1234</span>, <span class="pl-c1">1</span>),
(<span class="pl-c1">423672601</span>, <span class="pl-k">-</span><span class="pl-c1">834336964</span>, <span class="pl-c1">19</span>, <span class="pl-s"><span class="pl-pds">'</span>t<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span>2017-12-24 09:15:27.204710<span class="pl-pds">'</span></span>, <span class="pl-c1">1234</span>, <span class="pl-c1">1</span>),
(<span class="pl-c1">423674831</span>, <span class="pl-k">-</span><span class="pl-c1">832171161</span>, <span class="pl-c1">19</span>, <span class="pl-s"><span class="pl-pds">'</span>t<span class="pl-pds">'</span></span>, <span class="pl-s"><span class="pl-pds">'</span>2017-12-24 09:15:27.204717<span class="pl-pds">'</span></span>, <span class="pl-c1">1234</span>, <span class="pl-c1">1</span>)
[...many more...]
RETURNING (id);
<span class="pl-k">COMMIT</span>;</pre></div>
<p>This query took only a few hundred ms on my local DB (which doesn't hold much data). The number of rows per INSERT statement (i.e. the batch size) could of course be made configurable, and the returned node ids used for the later placeholder-replacement operations. They could even be stored in a temporary table.</p>
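<p>The batching could be sketched roughly as below. This is a hypothetical helper, not code from the current implementation, and a real version would use the driver's parameter binding (e.g. psycopg2's <code>execute_values</code>) rather than string formatting:</p>

```python
# Hypothetical sketch: render multi-row INSERTs like the one above from
# already-parsed node data, one statement per batch of `batch_size` rows.

def bulk_insert_statements(nodes, timestamp, batch_size=1000):
    """Yield one multi-row INSERT ... RETURNING id statement per batch.

    `nodes` holds (lat, lon, changeset_id, tile) tuples, with coordinates
    already scaled to the integer representation (degrees * 1e7).
    """
    cols = ('"latitude", "longitude", "changeset_id", '
            '"visible", "timestamp", "tile", "version"')
    for start in range(0, len(nodes), batch_size):
        rows = ",\n".join(
            "(%d, %d, %d, 't', '%s', %d, 1)" % (lat, lon, cs, timestamp, tile)
            for lat, lon, cs, tile in nodes[start:start + batch_size]
        )
        yield ('INSERT INTO "current_nodes" (%s) VALUES\n%s\nRETURNING id;'
               % (cols, rows))
```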
<p>Regarding modify and delete operations, a huge part of the task is pulling data out of the database node by node and tag by tag, and INSERTing it into the history tables. I wonder why we don't let the database do this job: provide only a list of object ids and do the copying in one go. This takes maybe a few seconds.</p>
<p>Here's a very rough idea:</p>
<div class="highlight highlight-source-sql"><pre><span class="pl-k">insert into</span> node_tags ( <span class="pl-k">select</span> node_id, version, k, v <span class="pl-k">from</span> current_node_tags t
<span class="pl-k">inner join</span> current_nodes n <span class="pl-k">on</span> <span class="pl-c1">t</span>.<span class="pl-c1">node_id</span> <span class="pl-k">=</span> <span class="pl-c1">n</span>.<span class="pl-c1">id</span> <span class="pl-k">where</span> id <span class="pl-k">in</span> ( <span class="pl-c1">19302</span>, <span class="pl-c1">19301</span>, <span class="pl-c1">19300</span>,
<span class="pl-c1">19299</span>, <span class="pl-c1">19298</span>, <span class="pl-c1">19297</span>, <span class="pl-c1">19296</span>, <span class="pl-c1">19295</span>, <span class="pl-c1">19294</span>, <span class="pl-c1">19293</span>, <span class="pl-c1">19292</span>, <span class="pl-c1">19291</span>, <span class="pl-c1">19290</span>, <span class="pl-c1">19289</span>,
<span class="pl-c1">19288</span>, <span class="pl-c1">19287</span>, <span class="pl-c1">19286</span>, <span class="pl-c1">19285</span>, <span class="pl-c1">19284</span>, <span class="pl-c1">19283</span>, <span class="pl-c1">19282</span>, <span class="pl-c1">19281</span>, <span class="pl-c1">19280</span>, <span class="pl-c1">19279</span>, <span class="pl-c1">19278</span>,
<span class="pl-c1">19277</span>, <span class="pl-c1">19276</span>, <span class="pl-c1">19275</span>, <span class="pl-c1">19274</span>, <span class="pl-c1">19273</span>, <span class="pl-c1">19272</span>, <span class="pl-c1">19271</span>, <span class="pl-c1">19270</span>, <span class="pl-c1">19269</span>, <span class="pl-c1">19268</span>, <span class="pl-c1">19267</span>,
<span class="pl-c1">19266</span>));</pre></div>
<p>I'm far from understanding all the details of the current implementation. Nevertheless, I think there's a lot of potential for improvement here, although it could mean that we can't reuse much of the existing code. Wouldn't that be a good fit for cgimap?</p>