Making the most of a kernel panic

13.13. Making the most of a kernel panic

[This section was extracted from a mail written by Bill Paul on the freebsd-current mailing list by Dag-Erling Coïdan Smørgrav, who fixed a few typos and added the bracketed comments]

From: Bill Paul <wpaul@skynet.ctr.columbia.edu>
Subject: Re: the fs fun never stops
To: ben@rosengart.com
Date: Sun, 20 Sep 1998 15:22:50 -0400 (EDT)
Cc: current@FreeBSD.ORG

[<ben@rosengart.com> posted the following panic message]

> Fatal trap 12: page fault while in kernel mode
> fault virtual address   = 0x40
> fault code              = supervisor read, page not present
> instruction pointer     = 0x8:0xf014a7e5
                                ^^^^^^^^^^
> stack pointer           = 0x10:0xf4ed6f24
> frame pointer           = 0x10:0xf4ed6f28
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 80 (mount)
> interrupt mask          =
> trap number             = 12
> panic: page fault

[When] you see a message like this, it's not enough to just reproduce it and send it in. The instruction pointer value that I highlighted up there is important; unfortunately, it's also configuration dependent. In other words, the value varies depending on the exact kernel image that you're using. If you're using a GENERIC kernel image from one of the snapshots, then it's possible for somebody else to track down the offending function, but if you're running a custom kernel then only you can tell us where the fault occurred.

What you should do is this:

Write down the instruction pointer value. Note that the 0x8: part at the beginning is not significant in this case: it's the 0xf0xxxxxx part that we want.
When the system reboots, do the following:
```
% nm /kernel.that.caused.the.panic | grep f0xxxxxx
```
where f0xxxxxx is the instruction pointer value. The odds are you will not get an exact match since the symbols in the kernel symbol table are for the entry points of functions and the instruction pointer address will be somewhere inside a function, not at the start. If you don't get an exact match, omit the last digit from the instruction pointer value and try again, i.e.:
```
% nm /kernel.that.caused.the.panic | grep f0xxxxx
```
If that doesn't yield any results, chop off another digit. Repeat until you get some sort of output. The result will be a possible list of functions which caused the panic. This is a less than exact mechanism for tracking down the point of failure, but it's better than nothing.
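The digit-chopping search above can also be done in one pass. What follows is only a sketch, not part of Bill's original mail: the function name `nearest_symbol` is made up for illustration, and it assumes that nm(1) prints fixed-width, lowercase hexadecimal addresses, so that plain string comparison orders them correctly. It reads nm(1) output on standard input and prints the symbol with the highest address that does not exceed the instruction pointer.

```shell
#!/bin/sh
# nearest_symbol ADDR
# Hypothetical helper: reads `nm` output on stdin and prints the line of the
# symbol with the highest address that is <= ADDR.
# Assumption: addresses are fixed-width lowercase hex, so string comparison
# matches numeric order. ADDR must have the same width (a leading 0x is
# stripped).
nearest_symbol() {
    ip=${1#0x}
    LC_ALL=C awk -v ip="$ip" '
        NF == 3 {                      # skip undefined symbols (no address)
            addr = $1 ""               # force string comparison
            if (addr <= (ip "") && addr >= best) {
                best = addr
                line = $0
            }
        }
        END { if (line != "") print line }'
}
```

It would be used as `nm /kernel.that.caused.the.panic | nearest_symbol 0xf014a7e5`. The symbol it prints is still only a candidate: the faulting address may lie in a static function that follows it and has no entry in the symbol table.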

I see people constantly show panic messages like this but rarely do I see someone take the time to match up the instruction pointer with a function in the kernel symbol table.

The best way to track down the cause of a panic is by capturing a crash dump, then using gdb(1) to do a stack trace on the crash dump. Of course, this depends on gdb(1) in -current working correctly, which I can't guarantee (I recall somebody saying that the new ELF-ized gdb(1) didn't handle kernel crash dumps correctly: somebody should check this before 3.0 goes out of beta or there'll be a lot of red faces after the CDs ship).

In any case, the method I normally use is this:

Set up a kernel config file, optionally adding 'options DDB' if you think you need the kernel debugger for something. (I use this mainly for setting breakpoints if I suspect an infinite loop condition of some kind.)
Use config -g KERNELCONFIG to set up the build directory.
cd /sys/compile/KERNELCONFIG; make
Wait for kernel to finish compiling.
cp kernel kernel.debug
strip -d kernel
mv /kernel /kernel.orig
cp kernel /
reboot

[Note: Now that FreeBSD 3.x kernels are ELF by default, you should use strip -g instead of strip -d. If for some reason your kernel is still a.out, use strip -aout -d.]

Note that YOU DO NOT WANT TO ACTUALLY BOOT THE KERNEL WITH ALL THE DEBUG SYMBOLS IN IT. A kernel compiled with -g can easily be close to 10MB in size. You don't have to actually boot this massive image: you only need it later for gdb(1) (gdb(1) wants the symbol table). Instead, you want to keep a copy of the full image and create a second image with the debug symbols stripped out using strip -d. It is this second stripped image that you want to boot.
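The build steps can be collected into a small helper. This is a sketch rather than a tested procedure: the function name `debug_kernel_steps` is made up, and it only prints the commands instead of running them, so they can be reviewed first. It keeps the currently installed kernel as /kernel.orig, and it takes the strip flag as an argument because the right flag depends on whether the kernel is a.out or ELF. The reboot is left to you.

```shell
#!/bin/sh
# debug_kernel_steps KERNELCONFIG [STRIP_FLAG]
# Hypothetical helper: prints (does not execute) the commands for building a
# debug kernel and installing a stripped copy of it.
# STRIP_FLAG defaults to -d; pass -g for an ELF kernel.
# Assumption: `config -g` is run from the directory holding the config file.
debug_kernel_steps() {
    conf=$1
    strip_flag=${2:--d}
    cat <<EOF
config -g $conf
cd /sys/compile/$conf && make
cp kernel kernel.debug
strip $strip_flag kernel
mv /kernel /kernel.orig
cp kernel /
EOF
}
```

For example, `debug_kernel_steps KERNELCONFIG -g` lists the steps for an ELF kernel. The unstripped kernel.debug stays in the build directory, where gdb(1) will need it later.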

To make sure you capture a crash dump, you need to edit /etc/rc.conf and set dumpdev to point to your swap partition. This will cause the rc(8) scripts to use the dumpon(8) command to enable crash dumps. You can also run dumpon(8) manually. After a panic, the crash dump can be recovered using savecore(8); if dumpdev is set in /etc/rc.conf, the rc(8) scripts will run savecore(8) automatically and put the crash dump in /var/crash.
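As an illustration, the /etc/rc.conf entry might look like the fragment below. The device name /dev/wd0s1b is only an assumption; substitute the swap partition of your own machine. The manual equivalents appear as comments.

```shell
# /etc/rc.conf fragment (sketch).
# Assumption: the swap partition is /dev/wd0s1b; yours will likely differ.
dumpdev="/dev/wd0s1b"

# Manual equivalents, run as root:
#   dumpon /dev/wd0s1b      # enable crash dumps to the swap partition
#   savecore /var/crash     # after a panic, recover the dump into /var/crash
```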

NOTE: FreeBSD crash dumps are usually the same size as the physical RAM size of your machine. That is, if you have 64MB of RAM, you will get a 64MB crash dump. Therefore you must make sure there's enough space in /var/crash to hold the dump. Alternatively, you can run savecore(8) manually and have it recover the crash dump to another directory where you have more room. It's possible to limit the size of the crash dump by using options MAXMEM=(foo) to set the amount of memory the kernel will use to something a little more sensible. For example, if you have 128MB of RAM, you can limit the kernel's memory usage to 16MB so that your crash dump size will be 16MB instead of 128MB.
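A quick way to check the space requirement is to compare the machine's RAM size against the free space in the target directory. The sketch below is not from the original mail: the function name `has_room_for_dump` is made up, and it assumes a df(1) that accepts the POSIX -P and -k flags and reports available kilobytes in the fourth column.

```shell
#!/bin/sh
# has_room_for_dump RAM_MB [DIR]
# Hypothetical helper: succeeds if DIR (default /var/crash) has at least
# RAM_MB megabytes free, i.e. enough for a dump the size of physical RAM.
# Assumption: `df -Pk` prints one header line, then one line per filesystem
# with the available space in kilobytes as the 4th field.
has_room_for_dump() {
    need_kb=$(( $1 * 1024 ))
    dir=${2:-/var/crash}
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "${free_kb:-0}" -ge "$need_kb" ]
}
```

On the 64MB machine from the example, `has_room_for_dump 64 || echo "not enough room in /var/crash"` would warn before the dump is needed rather than after.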

Once you have recovered the crash dump, you can get a stack trace with gdb(1) as follows:

% gdb -k /sys/compile/KERNELCONFIG/kernel.debug /var/crash/vmcore.0
(gdb) where

Note that there may be several screens' worth of information; ideally you should use script(1) to capture all of them. Using the unstripped kernel image with all the debug symbols should show the exact line of kernel source code where the panic occurred. Usually you have to read the stack trace from the bottom up in order to trace the exact sequence of events that led to the crash. You can also use gdb(1) to print out the contents of various variables or structures in order to examine the system state at the time of the crash.
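As an alternative to script(1), the stack trace could be captured non-interactively. The sketch below is an untested assumption, not the method from the mail: it combines the -k flag shown above with gdb's standard -batch and -x options, and whether that combination works on a given gdb(1) build needs checking. The function name `save_backtrace` and the `GDB` variable are made up for illustration.

```shell
#!/bin/sh
# save_backtrace KERNEL_DEBUG VMCORE OUTFILE
# Hypothetical helper: runs gdb in batch mode with a one-line command file
# containing `where` and saves everything it prints to OUTFILE.
# Assumption: this gdb accepts -k together with -batch and -x.
GDB=${GDB:-gdb}

save_backtrace() {
    cmds=$(mktemp) || return 1
    echo where > "$cmds"
    "$GDB" -k -batch -x "$cmds" "$1" "$2" > "$3" 2>&1
    status=$?
    rm -f "$cmds"
    return $status
}
```

It would be called as `save_backtrace /sys/compile/KERNELCONFIG/kernel.debug /var/crash/vmcore.0 trace.txt`, leaving the full trace in trace.txt for mailing to the list.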

Now, if you're really insane and have a second computer, you can also configure gdb(1) to do remote debugging such that you can use gdb(1) on one system to debug the kernel on another system, including setting breakpoints, single-stepping through the kernel code, just like you can do with a normal user-mode program. I haven't played with this yet as I don't often have the chance to set up two machines side by side for debugging purposes.

[Bill adds: "I forgot to mention one thing: if you have DDB enabled and the kernel drops into the debugger, you can force a panic (and a crash dump) just by typing 'panic' at the ddb prompt. It may stop in the debugger again during the panic phase. If it does, type 'continue' and it will finish the crash dump." -ed]