I lost all shell access to my NetBSD 3.1 box yesterday. It's on a
secure network where only trusted users can access it. They may
connect through a wired switch, through wifi plus VPN to that switch
(locally or remotely), or through a wired remote VPN connection.
The host does very little, network-protocol-wise: internet and
mail gateway, LAN DNS, and sshd. It's running pf and pflogd (though
I'm not monitoring pflogd at the moment) with fairly simple pf rules.
Other than nat, pass and block lines, we have the following pf rules:
scrub in all fragment reassemble
no rdr on lo0 from any to any
antispoof log for wm0 inet
antispoof log for wm1 inet
antispoof log for fxp0 inet
This setup had been working without issue for about 45 days. That changed
yesterday when nobody could connect to sshd, though everything else worked.
The RAM test was perfect. The only anomaly was in the kernel log.
Our logging system rotates by log size, not time, and we keep 10 MB
of old logs. That might change, but that's what we have.
The oldest log starts like this:
2007-02-28 13:34:29.767786500 pf_normalize_ip: reass frag 64442 @ 20720-22200
2007-02-28 13:34:29.767895500 pf_normalize_ip: reass frag 64442 @ 22200-23680
2007-02-28 13:34:29.767903500 pf_normalize_ip: reass frag 64442 @ 23680-25160
2007-02-28 13:34:29.767908500 pf_normalize_ip: reass frag 64442 @ 25160-26640
2007-02-28 13:34:29.767913500 pf_normalize_ip: reass frag 64442 @ 26640-28120
2007-02-28 13:34:29.767918500 pf_normalize_ip: reass frag 64442 @ 28120-29600
2007-02-28 13:34:29.767923500 pf_normalize_ip: reass frag 64442 @ 29600-31080
2007-02-28 13:34:29.767928500 pf_normalize_ip: reass frag 64442 @ 31080-32560
2007-02-28 13:34:29.767933500 pf_normalize_ip: reass frag 64442 @ 32560-32920
2007-02-28 13:34:29.767938500 pf_reassemble: 32920 < 32920?
2007-02-28 13:34:29.767943500 pf_reassemble: complete: 0xc3eb3200(32940)
2007-02-28 13:34:29.767947500 pf_normalize_ip: reass frag 64954 @ 0-1480
2007-02-28 13:34:29.767952500 pf_normalize_ip: reass frag 64954 @ ...
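
For anyone wanting to go through a log like this in bulk, here is a
minimal sketch (not something I have run against the full logs) that
groups the "reass frag" lines by fragment id and checks whether the
offset ranges of each id join up end to end. The function name
summarize_fragments and the returned field names are made up for
illustration, and it assumes the number after "reass frag" is the
packet's IP id and that the log lines have the format shown above.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ... pf_normalize_ip: reass frag 64442 @ 20720-22200
# Lines with an elided range ("@ ...") do not match and are skipped.
FRAG_RE = re.compile(r"pf_normalize_ip: reass frag (\d+) @ (\d+)-(\d+)")


def summarize_fragments(lines):
    """Group fragment offset ranges by id and report whether they are contiguous.

    Hypothetical helper: the id is assumed to be the IP identification field.
    """
    ranges = defaultdict(list)
    for line in lines:
        m = FRAG_RE.search(line)
        if m:
            frag_id, start, end = (int(g) for g in m.groups())
            ranges[frag_id].append((start, end))

    summary = {}
    for frag_id, spans in ranges.items():
        spans.sort()
        # Contiguous means each range ends exactly where the next one begins.
        contiguous = all(a[1] == b[0] for a, b in zip(spans, spans[1:]))
        summary[frag_id] = {
            "count": len(spans),
            "first": spans[0][0],
            "last": spans[-1][1],
            "contiguous": contiguous,
        }
    return summary
```

Feeding it the lines of a rotated log file (for example, the result of
open(path).readlines()) gives one entry per fragment id, which makes it
easier to spot ids whose ranges overlap or leave gaps.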
Continued from a tech-kern thread...
The host stopped responding again after 24 or 36 hours of uptime. After
fsck completed (successfully), run from an install CD-ROM kernel,
the PS/2 keyboard no longer accepted input.
So apparently there is a hardware issue (or some whacked, compromised
EPROM...). Any ideas for identifying the fault directly? We could
do some component testing, but checking each part in known-good
hardware would probably be too time-consuming.
The odd thing is that the RAM tested okay in an all-night test Thursday
evening. So, is the second (of two) CPUs the most likely problem?
(Hmm, the keyboard failed with a single-CPU install kernel.)
George Georgalis, systems architect, administrator <IXOYE><