sysmap_64bit: rmap ovflo, lost

From: <joel-garry_at_home.com>
Date: 26 Apr 2006 14:21:28 -0700
Message-ID: <1146086488.370831.261210@e56g2000cwe.googlegroups.com>

helpful gurus:

An HP-UX 11.11 4-processor rpr3340 box running Oracle 9.2.0.6 crashed last night, and I'm trying to figure out how to prevent this in the future.

The uptime was a little over 2 months. Looking at syslog, I see lots (>17K lines) of:

Apr 25 20:47:34 ZEUS vmunix: sysmap_64bit: rmap ovflo, lost [68419543,68419559)

They started at the exact time my Oracle RMAN backup started. The script that does the backup does a number of things: it removes old backup files, runs the RMAN script (nocatalog), then compresses some of the backup files onto an NFS device. RMAN completed; the system crashed during the compress.

Looking at
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=70397
(neither of the links in there works for me), I now have the idea that something fragmented kernel memory. But what? I was about to write a script to periodically capture the largest processes while RMAN is running, but then I started wondering if it is not really RMAN, but something previous to RMAN that sets up the problem. Looking again at syslog, I see the rmap messages on a few days in April (April 10-15) at various times of day: once during the production day and 7 times off-hours (sometimes during RMAN, sometimes during compress), but at no other times since boot. If it were RMAN, wouldn't I see the problem whenever RMAN ran? And why did it go nuts and crash the system this time, but not the other times?
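
To line the messages up against the backup schedule, something like the following could tally them per day and hour. It is only a sketch: the function name rmap_summary is made up, it assumes every message has the layout of the sample line above, and it treats the bracketed pair as a half-open range whose width is whatever unit the kernel resource map counts in (I have not confirmed what that unit is).

```shell
# Sketch: count "rmap ovflo" messages per day and hour in a syslog file.
# Assumed layout: "Mon DD HH:MM:SS host vmunix: ... lost [start,end)".
rmap_summary() {
    # $1 = syslog file to scan
    grep 'rmap ovflo' "$1" | awk '
    {
        hour = substr($3, 1, 2)                      # HH from HH:MM:SS
        key = sprintf("%s %2d %s:00", $1, $2, hour)  # e.g. "Apr 25 20:00"
        range = $NF                                  # e.g. [68419543,68419559)
        gsub(/[^0-9,]/, "", range)                   # drop the brackets
        split(range, r, ",")
        count[key]++
        lost[key] += r[2] - r[1]                     # half-open range: end - start
    }
    END {
        for (k in count)
            printf "%s  messages=%d  units_lost=%d\n", k, count[k], lost[k]
    }' | sort
}
```

It would be run as 'rmap_summary /var/adm/syslog/syslog.log' (the usual HP-UX location; adjust if syslog is configured differently). The sort is plain text order, so it only keeps days in sequence within a single month.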

Using the 'UNIX95= ps -e -o "vsz args" |sort' command, I see that some third-party application processes get big: 132640K is the biggest just now (those are killed off nightly if the users forget to log off, but later than this backup). So I tried 'ps -efl|sort -nk10|tail -10', which shows that same process as 29527 pages (and lets me see exactly who it is). But I don't quite get what vsz and sz are telling me; I guess I need to subtract some shared memory? man ps isn't too clear.
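
To put the two numbers in the same unit, the page count can be converted to kilobytes. This is a small sketch with a made-up helper name, pages_to_kb, and it assumes a 4096-byte page unless told otherwise ('getconf PAGE_SIZE' should report the real value if it is available on the box).

```shell
# Convert a page count (the SZ column of "ps -l") to kilobytes.
# The page size is an assumption: pass it in bytes as $2, default 4096.
pages_to_kb() {
    pages=$1
    page_bytes=${2:-4096}
    echo $(( pages * page_bytes / 1024 ))
}
```

With the figures above, 'pages_to_kb 29527' gives 118108, against the 132640K that vsz reported. That leaves a gap of 14532K, which may be shared memory or may just be the two samples being taken at different moments; the conversion alone does not say which.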

I don't see how to figure out which process is fragmenting memory. Don't have glance. Should I be looking for processes that get bigger and smaller, rather than the largest? There is a transaction monitor that appears to be doing that. Or should I watch for something continually growing? I don't know of anything that has changed on this system specific to this month, and don't really see how a memory leak could come and go and come back big when users and cron do the same thing day-to-day.
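
Without glance, one rough way to watch for growth would be to save ps snapshots from cron and compare them. The sketch below is only that: the names snapshot and vsz_growth and the 1024K default threshold are invented for illustration, and it matches processes by pid alone, so a pid reused between snapshots would give a false hit.

```shell
# Take one snapshot: "pid vsz args", one process per line, header removed.
snapshot() {
    UNIX95= ps -e -o "pid vsz args" | sed 1d
}

# List processes whose vsz grew between two saved snapshots.
# $1 = older snapshot file, $2 = newer snapshot file,
# $3 = minimum growth in KB to report (default 1024, an arbitrary choice)
vsz_growth() {
    awk -v min="${3:-1024}" '
    NR == FNR { old[$1] = $2; next }     # first file: remember vsz per pid
    ($1 in old) && ($2 - old[$1] >= min) {
        printf "%d KB growth  pid=%s  %s\n", $2 - old[$1], $1, $3
    }' "$1" "$2" | sort -rn
}
```

For example, 'snapshot > /tmp/ps.before' at the start of the backup script and 'snapshot > /tmp/ps.after' at the end, then 'vsz_growth /tmp/ps.before /tmp/ps.after'. Only the first word of args is printed. Catching a process that grows and shrinks again would need more frequent snapshots than two.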

Is it really going to be necessary to reboot this thing monthly?

Any help appreciated, I'm trying to do as much as possible before the hardware folk start interrupting production.

This is pretty typical swapinfo:

# swapinfo -am

             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev        4096     633    3463   15%       0       -    1  /dev/vg00/lvol2
reserve       -    3400   -3400
memory     6320    4499    1821   71%

TIA jg

-- 
@home.com is bogus.
s/home.com/cox.net/

Received on Wed Apr 26 2006 - 16:21:28 CDT