[Tilesathome] Server Outage

Christopher Schmidt crschmidt at metacarta.com
Mon Nov 12 12:43:17 GMT 2007


On Mon, Nov 12, 2007 at 12:52:22AM -0500, Christopher Schmidt wrote:
> On Mon, Nov 12, 2007 at 12:48:21AM -0500, Christopher Schmidt wrote:
> > The pain of using a shared server...
> > 
> > Due to a mistaken script on the T at H server, the machine appears to have
> > been clobbered. I've sent off an email to my server admin with physical
> > access: I'm about to pass out, and the server is still down at the
> > moment. Everything should restart automatically when it comes back up. I
> > apologize for the downtime, and I'll do what I can to make it more
> > resilient for the future as soon as it's back up.
> > 
> > Apologies, and I'll send an update as soon as I hear more...
> 
> Just heard back from my local contact -- he's driving out to the box now
> to kick it. Everything should come back automatically after the
> restart: anything that doesn't, I'll fix in ~5 hours when I wake
> up.

Looking at the stats and IRC, it looks like everything came back up
okay.

So, I'll admit it: the idiot that caused this one was me.

Last night, I was working with John Grahm, the 'maintainer' of the
HyperCube machine (insofar as he's the one who scored its donation to
SDSU from Intel, and manages its resource allocation, stuff like that).
John is a pretty well known guy in the Open Source GIS community: he has
a tendency to get ridiculous amounts of resources quickly. The Katrina
imagery processing that was done right after the hurricane was done with
HJG (HyperJohnGrahm) at the head of the hardware.

John has some new imagery from San Diego county for the wildfires --
taken at 1ft/pixel, really great stuff. (It's also totally free for
reuse, but not yet orthorectified... more on that later.) I was trying
to get a TileCache set up against it, and typically, with TileCache, I
just set it up so that people can browse it, and the cache will populate
automatically.
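
For anyone curious what that looks like: a typical minimal TileCache
config is an INI file with a cache section and one section per layer.
The sketch below is hypothetical -- the layer name, WMS URL, and paths
are made up, not the actual config from the HyperCube box -- but it
shows the "just point it at a WMS and let browsing populate the cache"
pattern I mean:

```ini
[cache]
# Tiles get written here as they are first requested
type=Disk
base=/tmp/tilecache

[sandiego]
# Hypothetical layer: any tile not yet on disk is fetched
# from this WMS (e.g. a mapserver CGI) and then cached
type=WMS
url=http://example.com/cgi-bin/mapserv?map=/path/to/sandiego.map
layers=sandiego
extension=jpeg
```

With that in place, no seeding step is needed: the first person to pan
over an area pays the rendering cost, and everyone after gets the
cached tile.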

Unfortunately, I wasn't aware of the resources that would be required to
render data from JPEG2000 images -- each image was taking ~3min of
rendering, and 150MB of RAM. Since OpenLayers exacerbates such problems
tenfold (each client fires off many tile requests in parallel), we
quickly saw the machine using 10 mapserver processes... then
20... then 40...

Machine load was at 127 and rising when it stopped responding.

The problem was probably the overall memory usage. The first level of
failure, of course, was mine, in making a resource available before
determining, properly, the load that it would take. The second level of
failure is that there are currently no limits to the memory usage of
Apache that result in it being killed. I'm going to see what I can do to
change that.
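
One option I'm considering (a sketch, not yet tested on the box): since
mapserver runs as a CGI launched by Apache, Apache's RLimit directives
apply to it, so the config could cap each rendering process before it
can eat the machine. The numbers below are illustrative guesses, not
tuned values:

```apache
# Cap memory for processes forked by Apache (CGI like mapserv):
# soft and hard limit of ~200MB each, in bytes
RLimitMEM 209715200 209715200

# Cap CPU time per CGI process at 5 minutes
RLimitCPU 300 300

# Also bound how many processes can pile up at once
MaxClients 40
```

With limits like these, a runaway JPEG2000 render would be killed at
200MB instead of dragging the load to 127 and taking the whole server
down with it.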

And of course, the mapserver layer that killed the server is no longer
available :)

I'm sorry for the downtime, and I'll work on getting things set up so it
doesn't happen again.

Regards,
-- 
Christopher Schmidt
MetaCarta
