[Tile-serving] [openstreetmap/osm2pgsql] OOM killed while building index on table planet_osm_rels (#834)

Josip Rodin notifications at github.com
Tue Mar 27 07:22:30 UTC 2018


Hi,

I think I exhibited symptoms similar to https://github.com/openstreetmap/osm2pgsql/issues/351

The invocation was through Nominatim's setup.php:

```
time /srv/nominatim/Nominatim/build/utils/setup.php --osm-file /opt/import/europe-latest.osm.pbf --all --osm2pgsql-cache $((25*1024)) --threads 4
[...]
osm2pgsql version 0.93.0-dev (64 bit id space)

Using projection SRS 4326 (Latlong)
NOTICE:  table "place" does not exist, skipping
Allocating memory for dense node cache
Allocating dense node cache in one big chunk
Allocating memory for sparse node cache
Sharing dense sparse
Node-cache: cache=25600MB, maxblocks=409600*65536, allocation method=11
Mid: loading persistent node cache from /opt/flatnode/europe.flatnode
Mid: pgsql, cache=25600
Setting up table: planet_osm_nodes
Setting up table: planet_osm_ways
Setting up table: planet_osm_rels

Reading in file: /opt/import/europe-latest.osm.pbf
Using PBF parser.
[...]
Node stats: total(2062884192), max(5503870021) in 3082s
Way stats: total(251002077), max(572904737) in 8024s
Relation stats: total(3916024), max(8147194) in 3547s
Setting up table: planet_osm_nodes
Setting up table: planet_osm_ways
Setting up table: planet_osm_rels

Going over pending ways...
        0 ways are pending

Using 1 helper-processes
^MFinished processing 0 ways in 0 s


Going over pending relations...
        0 relations are pending

Using 1 helper-processes
^MFinished processing 0 relations in 0 s

Stopping table: planet_osm_nodes
Stopped table: planet_osm_nodes in 0s
Stopping table: planet_osm_ways
Stopped table: planet_osm_ways in 0s
Stopping table: planet_osm_rels
Building index on table: planet_osm_rels
Killed
ERROR: Error executing external command: /srv/nominatim/Nominatim-3.1.0/build/osm2pgsql/osm2pgsql --flat-nodes /opt/flatnode/europe.flatnode -lsc -O gazetteer --hstore --number-processes 1 -C 25600 -P 5432 -d nominatim /opt/import/europe-latest.osm.pbf
Error executing external command: /srv/nominatim/Nominatim-3.1.0/build/osm2pgsql/osm2pgsql --flat-nodes /opt/flatnode/europe.flatnode -lsc -O gazetteer --hstore --number-processes 1 -C 25600 -P 5432 -d nominatim /opt/import/europe-latest.osm.pbf

real    247m3.825s
user    322m23.432s
sys     7m0.540s
```

This is the Linux kernel log of the crashed run:

```
[263891.316993] postgres invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[263891.317001] postgres cpuset=/ mems_allowed=0
[263891.317047] CPU: 5 PID: 14798 Comm: postgres Not tainted 4.4.0-116-generic #140-Ubuntu
[263891.317050] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 2:1.10.2-58953eb7 04/01/2014
[263891.317053]  0000000000000286 123cdd746b262241 ffff880701b4f9d8 ffffffff813ffc13
[263891.317057]  ffff880701b4fb90 ffff88077102e200 ffff880701b4fa48 ffffffff8121012e
[263891.317061]  ffff88076fae1c00 ffff88076d659150 ffff88076d658e00 ffff880701b4fb90
[263891.317065] Call Trace:
[263891.317116]  [<ffffffff813ffc13>] dump_stack+0x63/0x90
[263891.317137]  [<ffffffff8121012e>] dump_header+0x5a/0x1c5
[263891.317153]  [<ffffffff811968f2>] oom_kill_process+0x202/0x3c0
[263891.317157]  [<ffffffff81196d19>] out_of_memory+0x219/0x460
[263891.317167]  [<ffffffff8119cd45>] __alloc_pages_slowpath.constprop.88+0x965/0xb00
[263891.317172]  [<ffffffff8119d168>] __alloc_pages_nodemask+0x288/0x2a0
[263891.317179]  [<ffffffff811e6ccc>] alloc_pages_current+0x8c/0x110
[263891.317183]  [<ffffffff81192e0b>] __page_cache_alloc+0xab/0xc0
[263891.317187]  [<ffffffff81195390>] filemap_fault+0x150/0x400
[263891.317208]  [<ffffffff812a81d6>] ext4_filemap_fault+0x36/0x50
[263891.317214]  [<ffffffff811c2216>] __do_fault+0x56/0xf0
[263891.317218]  [<ffffffff811c5d75>] handle_mm_fault+0xfa5/0x1820
[263891.317233]  [<ffffffff810bd0dc>] ? set_next_entity+0x9c/0xb0
[263891.317249]  [<ffffffff8106c747>] __do_page_fault+0x197/0x400
[263891.317254]  [<ffffffff8106ca17>] trace_do_page_fault+0x37/0xe0
[263891.317261]  [<ffffffff81064fb9>] do_async_page_fault+0x19/0x70
[263891.317286]  [<ffffffff81851a08>] async_page_fault+0x28/0x30
[263891.317318] Mem-Info:
[263891.317326] active_anon:7358988 inactive_anon:22263 isolated_anon:0
                 active_file:15 inactive_file:0 isolated_file:0
                 unevictable:913 dirty:0 writeback:0 unstable:0
                 slab_reclaimable:56535 slab_unreclaimable:7638
                 mapped:547407 shmem:548861 pagetables:24885 bounce:0
                 free:48362 free_pcp:732 free_cma:0
[263891.317333] Node 0 DMA free:15908kB min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[263891.317343] lowmem_reserve[]: 0 2963 29419 29419 29419
[263891.317348] Node 0 DMA32 free:114524kB min:6800kB low:8500kB high:10200kB active_anon:2887820kB inactive_anon:10828kB active_file:40kB inactive_file:0kB unevictable:568kB isolated(anon):0kB isolated(file):0kB present:3129200kB managed:3048436kB mlocked:568kB dirty:0kB writeback:0kB mapped:226864kB shmem:227984kB slab_reclaimable:21872kB slab_unreclaimable:2716kB kernel_stack:144kB pagetables:8112kB unstable:0kB bounce:0kB free_pcp:868kB local_pcp:112kB free_cma:0kB writeback_tmp:0kB pages_scanned:1532 all_unreclaimable? yes
[263891.317358] lowmem_reserve[]: 0 0 26456 26456 26456
[263891.317367] Node 0 Normal free:63016kB min:60744kB low:75928kB high:91116kB active_anon:26548132kB inactive_anon:78224kB active_file:20kB inactive_file:0kB unevictable:3084kB isolated(anon):0kB isolated(file):0kB present:27574272kB managed:27091108kB mlocked:3084kB dirty:0kB writeback:0kB mapped:1962764kB shmem:1967460kB slab_reclaimable:204268kB slab_unreclaimable:27836kB kernel_stack:3696kB pagetables:91428kB unstable:0kB bounce:0kB free_pcp:2060kB local_pcp:288kB free_cma:0kB writeback_tmp:0kB pages_scanned:260 all_unreclaimable? yes
[263891.317398] lowmem_reserve[]: 0 0 0 0 0
[263891.317403] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[263891.317448] Node 0 DMA32: 1904*4kB (UMEH) 1458*8kB (UMEH) 895*16kB (UMEH) 730*32kB (MEH) 369*64kB (ME) 132*128kB (ME) 43*256kB (ME) 9*512kB (ME) 1*1024kB (M) 0*2048kB 0*4096kB = 114112kB
[263891.317514] Node 0 Normal: 5952*4kB (UMEH) 4461*8kB (UMEH) 215*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62936kB
[263891.317532] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[263891.317538] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[263891.317540] 549521 total pagecache pages
[263891.317549] 0 pages in swap cache
[263891.317551] Swap cache stats: add 0, delete 0, find 0/0
[263891.317553] Free swap  = 0kB
[263891.317555] Total swap = 0kB
[263891.317557] 7679866 pages RAM
[263891.317558] 0 pages HighMem/MovableOnly
[263891.317560] 141003 pages reserved
[263891.317561] 0 pages cma reserved
[263891.317563] 0 pages hwpoisoned
[263891.317564] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[263891.317575] [  694]     0   694    13015     5172      30       3        0             0 systemd-journal
[263891.317580] [  730]     0   730    25742      135      17       3        0             0 lvmetad
[263891.317584] [ 1195]     0  1195     4030      533      12       3        0             0 dhclient
[263891.317588] [ 1309]     0  1309     1305       29       8       3        0             0 iscsid
[263891.317592] [ 1310]     0  1310     1430      877       9       3        0           -17 iscsid
[263891.317595] [ 1314]     0  1314    77160      337      20       4        0             0 lxcfs
[263891.317599] [ 1322]     0  1322     6932      258      18       3        0             0 cron
[263891.317603] [ 1325]   107  1325    10753      381      25       3        0          -900 dbus-daemon
[263891.317607] [ 1330]   104  1330    64098     1076      27       3        0             0 rsyslogd
[263891.317610] [ 1332]     0  1332    68622      590      36       3        0             0 accounts-daemon
[263891.317614] [ 1335]     0  1335     6511      383      18       3        0             0 atd
[263891.317618] [ 1342]     0  1342     1099      302       8       3        0             0 acpid
[263891.317621] [ 1344]     0  1344   150692     3683      46       6        0          -900 snapd
[263891.317625] [ 1351]     0  1351     7136      269      18       3        0             0 systemd-logind
[263891.317633] [ 1353]     0  1353    16377      576      37       3        0         -1000 sshd
[263891.317637] [ 1380]     0  1380     3343       34      11       3        0             0 mdadm
[263891.317640] [ 1384]     0  1384    69294      590      39       3        0             0 polkitd
[263891.317644] [ 1431]     0  1431     4893      412      14       3        0             0 irqbalance
[263891.317648] [ 1441]     0  1441     3618      405      12       3        0             0 agetty
[263891.317651] [ 1442]     0  1442     3664      329      12       3        0             0 agetty
[263891.317655] [ 5681]   100  5681    25081      303      20       3        0             0 systemd-timesyn
[263891.317662] [12231]     0 12231    10556      384      22       3        0         -1000 systemd-udevd
[263891.317667] [13785]     0 13785    73091     3130      45       3        0             0 fail2ban-server
[263891.317670] [19845]     0 19845    88852     4545     132       4        0             0 apache2
[263891.317690] [24491]    33 24491    88983     2059     124       3        0             0 apache2
[263891.317694] [24492]    33 24492    88870     2259     122       3        0             0 apache2
[263891.317698] [24493]    33 24493    88870     2259     122       3        0             0 apache2
[263891.317702] [24495]    33 24495    88870     2259     122       3        0             0 apache2
[263891.317705] [24496]    33 24496    88983     2310     124       3        0             0 apache2
[263891.317709] [25184]    33 25184    88870     2259     122       3        0             0 apache2
[263891.317713] [25527]     0 25527    23199      234      49       4        0             0 sshd
[263891.317716] [25529]  1001 25529    11319      351      26       3        0             0 systemd
[263891.317720] [25531]  1001 25531    52146      466      36       4        0             0 (sd-pam)
[263891.317724] [25593]  1001 25593    23234      275      48       4        0             0 sshd
[263891.317729] [25596]  1001 25596    10849      929      26       3        0             0 zsh
[263891.317733] [25896]     0 25896    13195     2750      30       3        0             0 munin-node
[263891.317737] [27479]   113 27479   580196    18552      99       4        0          -900 postgres
[263891.317741] [27481]   113 27481   580875   536815    1110       5        0             0 postgres
[263891.317745] [27482]   113 27482   580196   492026    1100       5        0             0 postgres
[263891.317749] [27483]   113 27483   580196     4430      66       4        0             0 postgres
[263891.317752] [27484]   113 27484   580298      625      65       4        0             0 postgres
[263891.317755] [27485]   113 27485    37136      644      55       3        0             0 postgres
[263891.317759] [ 9255]  1001  9255     6103      569      16       3        0             0 screen
[263891.317763] [ 9256]  1001  9256    10349      839      26       3        0             0 zsh
[263891.317767] [14032]     0 14032    12447      537      30       3        0             0 sudo
[263891.317770] [14033]  1004 14033     4782      415      15       3        0             0 script
[263891.317774] [14034]  1004 14034     4571      549      15       3        0             0 bash
[263891.317778] [14695]  1004 14695    60232     1999     108       4        0             0 setup.php
[263891.317781] [14700]   113 14700   657125     4395     188       4        0             0 postgres
[263891.317785] [14793]  1004 14793     1126      142       8       3        0             0 sh
[263891.317789] [14794]  1004 14794 17553080  6646846   18231      70        0             0 osm2pgsql
[263891.317792] [14798]   113 14798   740513   264531    1428       6        0             0 postgres
[263891.317797] [21177]  1001 21177     6035      436      17       3        0             0 screen
[263891.317800] [27179]   113 27179   582022     2864      83       5        0             0 postgres
[263891.317804] [27180]   113 27180   582022     2751      83       5        0             0 postgres
[263891.317807] [27181]   113 27181   582022     2867      82       5        0             0 postgres
[263891.317811] [27182]   113 27182   581986     1755      81       5        0             0 postgres
[263891.317815] [27185]   113 27185   608463     8222     136       5        0             0 postgres
[263891.317818] [27272]   113 27272   582551     6633     104       5        0             0 postgres
[263891.317822] [27291]     0 27291    13195     2616      30       3        0             0 /usr/sbin/munin
[263891.317826] [27292]     0 27292     4542      130      14       3        0             0 if_ens3
[263891.317829] [27293]   113 27293   580436     1569      70       4        0             0 postgres
[263891.317833] [27294]     0 27294     4543       98      13       3        0             0 if_ens3
[263891.317836] Out of memory: Kill process 14794 (osm2pgsql) score 884 or sacrifice child
[263891.322761] Killed process 14794 (osm2pgsql) total-vm:70212320kB, anon-rss:26585740kB, file-rss:1644kB
```

This is the memory graph of the crashed run:

![image](https://user-images.githubusercontent.com/302579/37951003-1ca65ea0-319b-11e8-99bb-bacbbe9f6c67.png)

Afterwards I added 8 GB swap and it went through like this:

```
time /srv/nominatim/Nominatim/build/utils/setup.php --osm-file /opt/import/europe-latest.osm.pbf --osm2pgsql-cache $((25*1024)) --threads 4 --import-data
[...]
Node stats: total(2062884192), max(5503870021) in 3305s
Way stats: total(251002077), max(572904737) in 7773s
Relation stats: total(3916024), max(8147194) in 3629s
Setting up table: planet_osm_nodes
Setting up table: planet_osm_ways
Setting up table: planet_osm_rels

Going over pending ways...
        0 ways are pending

Using 1 helper-processes
^MFinished processing 0 ways in 0 s


Going over pending relations...
        0 relations are pending

Using 1 helper-processes
^MFinished processing 0 relations in 0 s

Stopping table: planet_osm_nodes
Stopped table: planet_osm_nodes in 0s
Stopping table: planet_osm_ways
Stopped table: planet_osm_ways in 0s
Stopping table: planet_osm_rels
Building index on table: planet_osm_rels
Stopped table: planet_osm_rels in 483s
node cache: stored: 2002553530(97.08%), storage efficiency: 59.68% (dense blocks: 213590, sparse nodes: 802857374), hit rate: 97.44%

Osm2pgsql took 15194s overall
Summary of warnings:


2018-03-27 00:37:36 == Setup finished.

real    253m14.902s
user    323m58.592s
sys     10m4.312s
```

And the memory graph of that good run was:

![image](https://user-images.githubusercontent.com/302579/37951240-0cb25b4c-319c-11e8-8d1d-3e33f50e790e.png)

So it appears it had been in that memory spike in the final five minutes of a four-hour process, what a waste.

The core issue is that the user thinks they're telling the program to consume 25 GB of memory, yet the program takes a substantial amount more, and because the use case is such that it typically pushes the hardware to its limits, the software spirals out of control, and there's no recovery (the long-running process has to be repeated from the beginning, the docs I found all discourage it any attempt not to do that).

In hindsight, the machine had about 20% leeway which may or may not have been reasonable. I got that estimate from talking to Paul Norman on the OSM geocoding mailing list - cf. the thread at https://lists.openstreetmap.org/pipermail/geocoding/2018-March/001925.html

Either way, these hardware requirements and implications need to be better documented if we don't want users to keep reproducing essentially the same issues over and over (that's my impression from googling this).

Please document this. TIA.


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/openstreetmap/osm2pgsql/issues/834
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstreetmap.org/pipermail/tile-serving/attachments/20180327/599ec152/attachment-0001.html>


More information about the Tile-serving mailing list