[Talk-ca] Bodies of seawater in Canada - area definitions
John Whelan
jwhelan0112 at gmail.com
Fri Oct 22 01:47:37 UTC 2021
Really off topic, but a bit of background. We measure disk performance in
milliseconds and memory in nanoseconds, and there are one million
nanoseconds in a millisecond. Basically memory costs more per gig than a
hard disk does. Hard disks rotate, so you wait for the platter to come
around, and if the head has to move there is a cost for that as well.
Traditionally we improved performance by reading in the entire track from
the hard disk rather than waiting for just the bit we wanted, and caching
the parts we didn't immediately need in memory. Typically databases are
not CPU limited; in my experience 3% CPU utilization was quite normal.
The limiting factor was usually memory and disk access.
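
To put rough numbers on that gap, here is a back-of-the-envelope sketch in
Python; the 100 ns and 10 ms figures are only illustrative ballpark values,
not measurements of any particular hardware:

    memory_ns = 100                  # assume ~100 ns for a RAM access
    disk_ms = 10                     # assume ~10 ms for a rotating-disk seek + read
    disk_ns = disk_ms * 1_000_000    # one million nanoseconds in a millisecond
    print(disk_ns / memory_ns)       # 100000.0 -> disk ~100,000x slower here
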
SSDs have improved things; we no longer have to move a head.
TPC.org runs a series of benchmarks, and companies put together hardware
and software to compete in them, including on price/performance. They are
a bit artificial in that the log files are typically turned off and the
code is sometimes optimised for the benchmark, but if you look at the
detail they give a very good idea of the optimal machine configuration.
One of the cheapest price/performance configurations, for example, has
16 x 64 GB memory modules. I'll let you do the maths. The companies use
whatever software they need, so some are happy with Windows, some with
UNIX of one flavour or another. You'll see many different database
products being used, but PostgreSQL isn't mentioned often in the
price/performance charts. It probably lacks a particular feature that the
benchmark tests. That doesn't mean it isn't a good choice for
OpenStreetMap.
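
For anyone not inclined to do the maths by hand, that example
configuration works out to a terabyte of RAM:

    modules = 16
    gb_per_module = 64
    print(modules * gb_per_module)   # 1024 GB, i.e. roughly 1 TB of memory
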
So the secret of price/performance, or even just performance, is
shovelling data into the CPU, and if it comes from memory, so much the
better. The best databases anticipate what the CPU will want next. When
it's time to write to the hard drives they do what is called a lazy
write: the data is shoved into memory and written to the hard drives a
little later on, while the CPU gets on with the next task. The database
remembers that the data is still in memory, so if a request comes in it
reads it from memory rather than going to the hard drive. That's roughly
a million times faster.
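
Very roughly, the lazy-write idea looks like the toy Python sketch below;
the class and its methods are made up for illustration and are not how any
particular database actually implements it:

    # Toy write-back ("lazy write") cache; illustrative only.
    class LazyCache:
        def __init__(self, storage):
            self.storage = storage    # slow backing store, i.e. the "hard drive"
            self.dirty = {}           # data held in memory, not yet flushed

        def write(self, key, value):
            self.dirty[key] = value   # acknowledge now, write to disk later

        def read(self, key):
            if key in self.dirty:     # still in memory? serve it from there
                return self.dirty[key]
            return self.storage[key]  # otherwise go down to the slow store

        def flush(self):
            self.storage.update(self.dirty)   # the "little later on"
            self.dirty.clear()

Used like this, a read that arrives before the flush never touches the
slow store:

    disk = {}
    cache = LazyCache(disk)
    cache.write("node:1", "highway=residential")
    print(cache.read("node:1"))       # served straight from memory
    cache.flush()                     # only now does it reach "disk"
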
Caching works on reads as well. Parts of the database will be accessed
more often than others. For OSM we can expect fewer mappers active at
3 am local time than at, say, 5 pm. So at 3 am we leave that bit of the
database on the hard drive and keep another bit in memory; at 5 pm we
keep the local bit in memory. This is very much a simplification, but the
basic idea holds true. Typically we hope to find 80% of the data we're
after already in memory, so 80% of requests are served at memory speed,
at perhaps 5% of the cost of a pure in-memory system, and that is where
the database software pays for itself. We can also gain performance with
a cluster, where the database is spread over a number of servers, but
that adds a level of complexity which doesn't always help on the
reliability side.
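
As a toy calculation of what an 80% hit rate buys (the latencies and the
hit rate here are assumptions for illustration, not measurements):

    hit_rate = 0.80
    memory_ns = 100                   # assumed RAM access time
    ssd_ns = 100_000                  # assumed ~0.1 ms SSD read

    average_ns = hit_rate * memory_ns + (1 - hit_rate) * ssd_ns
    print(average_ns)                 # 20080.0 -- 80% of requests answered at memory speed
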
I think Oracle was one of the first relational databases, but Sybase
looked at the problems Oracle had and wrote a relational database that
was an improvement in many ways. Microsoft licensed it for Windows as
Microsoft SQL Server, but when Sybase was a bit unresponsive to bug
reports Microsoft took the database code in house. They then spent money
hiring people who knew databases and rewrote Microsoft SQL Server. So in
my opinion Microsoft SQL Server is solid and self-tuning in its caches,
but it is a relational database, and relational databases are not optimal
in all circumstances.
One aspect of computer code is that once written it can run on many
machines. In my opinion, the more people who run it, the fewer
undocumented system features you'll hit. So from a reliability point of
view it's better not to be an outlier. Again, this is just my opinion.
Overall translation: to improve database performance you just throw in
faster SSDs and more memory. It's called throwing hardware at the
problem, but if you can optimise what you put in the database that's much
cheaper. Besides, I like being kind to databases.
Cheerio John
Iain Ingram wrote on 10/21/2021 1:52 PM:
> Database performance aside as I would see this falling into regular
> issues. If we see the database straining now we will see it in a year
> regardless.
>
> Just my two cents.