How do I debug a Linux kernel that freezes during boot?

I have a legacy device with a binary Linux 2.6.18 kernel that boots normally into its rootfs. However, when I compile this kernel from source, the resulting kernel binary freezes during boot. I don't have the .config file that was used to build the kernel binary that currently boots normally.

The boot freezes without printing any error output. Here is the boot log:

Linux version 2.6.18-6.2 (myuser@host) (gcc version 4.2.0 20070124 (prerelease) - BRCM 10ts-20080721) #10 SMP Sun Apr 28 18:25:24 BRT 2013
Fetching vars from bootloader... OK (E,d,B,C)
Detected 512 MB on MEMC0 (strap 0x23430310)
Board strapped at 512 MB, default is 256 MB
Options: sata=1 enet=1 emac_1=1 no_mdio=0 docsis=0 ebi_war=0 pci=1 smp=1
CPU revision is: 0002a044
FPU revision is: 00130001
Primary instruction cache 32kB, physically tagged, 2-way, linesize 64 bytes.
Primary data cache 64kB, 4-way, linesize 64 bytes.
<6>Synthesized TLB refill handler (23 instructions).
<6>Synthesized TLB load handler fastpath (37 instructions).
<6>Synthesized TLB store handler fastpath (37 instructions).
<6>Synthesized TLB modify handler fastpath (36 instructions).
Determined physical RAM map:
 memory: 10000000 @ 00000000 (usable)
 memory: 10000000 @ 20000000 (usable)
Using 32MB for memory, overwrite by passing mem=xx
User-defined physical RAM map:
node [00000000, 02000000: RAM]
node [02000000, 0e000000: RSVD]
node [20000000, 10000000: RAM]
<5>Reserving 224 MB upper memory starting at 02000000
<7>On node 0 totalpages: 65536
<7>  DMA zone: 65536 pages, LIFO batch:15
<7>On node 1 totalpages: 65536
<7>  Normal zone: 65536 pages, LIFO batch:15
Built 2 zonelists.  Total pages: 131072
<5>Kernel command line: root=/dev/mtdblock3 rw rootfstype=jffs2 console=ttyS0,115200
PID hash table entries: 4096 (order: 12, 16384 bytes)
mips_counter_frequency = 202000000 from Calibration, = 202500000 from header(CPU_MHz/2)
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Memory: 286336k/524288k available (2924k kernel code, 237760k reserved, 544k data, 164k init, 0k highmem)
Mount-cache hash table entries: 512
Checking for 'wait' instruction...  available.
plat_prepare_cpus: ENABLING 2nd Thread...
TP0: prom_boot_secondary: Kick off 2nd CPU...
CPU revision is: 0002a044
FPU revision is: 00130001
Primary instruction cache 32kB, physically tagged, 2-way, linesize 64 bytes.
Primary data cache 64kB, 4-way, linesize 64 bytes.
Synthesized TLB refill handler (23 instructions).
Brought up 2 CPUs
migration_cost=1000
NET: Registered protocol family 16
registering PCI controller with io_map_base unset
registering PCI controller with io_map_base unset
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
NET: Registered protocol family 2
IP route cache hash table entries: 16384 (order: 4, 65536 bytes)
TCP established hash table entries: 65536 (order: 7, 524288 bytes)
TCP bind hash table entries: 32768 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 65536 bind 32768)
TCP reno registered
brcm-pm: disabling power to USB block
brcm-pm: disabling power to ENET block
brcm-pm: disabling power to SATA block
squashfs: version 3.2-r2 (2007/01/15) Phillip Lougher
JFFS2 version 2.2. (NAND) (SUMMARY)  (C) 2001-2006 Red Hat, Inc.
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
Serial: 8250/16550 driver $Revision: 1.1.1.1 $ 3 ports, IRQ sharing disabled
serial8250: ttyS0 at MMIO 0x0 (irq = 22) is a 16550A
serial8250: ttyS1 at MMIO 0x0 (irq = 66) is a 16550A
serial8250: ttyS2 at MMIO 0x0 (irq = 67) is a 16550A
loop: loaded (max 8 devices)
brcm-pm: enabling power to ENET block

How do I go about debugging this? Any insights on possible solutions to the freeze are welcome as well.

3 Answers

#1

One way to deal with this is to enable CONFIG_EARLY_PRINTK and add printk() statements to the kernel code that you suspect is freezing (most likely some driver configuration parameters are wrong).

Also, you might be able to recover the old kernel config from /boot/config-* or from /proc/config.gz (the latter exists only if the old kernel was built with the CONFIG_IKCONFIG_PROC option enabled).
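
As a rough sketch of the config recovery described above, the Python helper below reads /proc/config.gz on the running device and, as a fallback, looks for a config embedded in a kernel image file. The function names are hypothetical. The fallback rests on several assumptions: that the old kernel was built with CONFIG_IKCONFIG, that the embedded config is gzip data sitting between the markers "IKCFG_ST" and "IKCFG_ED" (which is what the kernel's scripts/extract-ikconfig looks for, as far as I know), and that the image passed in is already uncompressed.

```python
import gzip
import zlib

# Assumed markers around the embedded, gzip-compressed config.
MAGIC_START = b"IKCFG_ST"
MAGIC_END = b"IKCFG_ED"


def read_proc_config(path="/proc/config.gz"):
    """Return the running kernel's config as text.

    Only works if the running kernel exposes /proc/config.gz
    (CONFIG_IKCONFIG_PROC enabled).
    """
    with gzip.open(path, "rt") as f:
        return f.read()


def extract_ikconfig(image):
    """Return the config embedded in an uncompressed kernel image, or None.

    `image` is the raw bytes of the kernel binary. None means that no
    marker pair was found, e.g. because CONFIG_IKCONFIG was disabled or
    the image is still compressed.
    """
    start = image.find(MAGIC_START)
    if start < 0:
        return None
    start += len(MAGIC_START)
    end = image.find(MAGIC_END, start)
    if end < 0:
        return None
    # wbits=31 makes zlib expect a gzip header and trailer.
    decompressor = zlib.decompressobj(31)
    data = decompressor.decompress(image[start:end])
    return data.decode("ascii", "replace")
```

If either function returns text, saving it as .config in the source tree gives a starting point that matches the working kernel far more closely than a default configuration.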

#2

There are some debugger options like kdb and kgdb, but I've always found them flaky and temperamental, probably more so if you can't even get your machine to boot. I concur with the CONFIG_EARLY_PRINTK advice, and would advise you to make sure you get kernel output on boot (no "quiet" on the command line), but it seems you have this already.

The "GPIO" suggestion above could work, but it is very system-dependent and cumbersome. That said, I think you want a better answer than "start adding a lot of printk's". You can start with the Ethernet driver, since the last line in your log is brcm-pm enabling power to the ENET block, or try disabling that driver to see whether the freeze is related.

It'll take some investigation - sorry, but no "magic bullet"! :-O

#3

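Add initcall_debug to CONFIG_CMDLINE (the kernel command line). The kernel then logs each initcall as it runs, so the last one printed before the freeze points at the code that hangs. For example: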

CONFIG_CMDLINE="root=/dev/ram0 rw mem=512M@0x0 initrd=0x800000,16M console=ttyS0,38400n8 rootfstype=ext2 init=/bin/busybox init -s initcall_debug"
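
Once initcall_debug is active, the interesting part of the serial log is the last initcall reported before the freeze. A minimal Python sketch for pulling that line out of a captured log follows. The function name is hypothetical, and the exact message format differs between kernel versions, so the sketch only assumes that the relevant lines contain the standalone word "initcall" (the initcall_debug token in the echoed command line is deliberately not matched).

```python
import re

# Assumption: initcall_debug messages contain the standalone word
# "initcall". The \b after "initcall" rejects "initcall_debug",
# which appears when the kernel echoes its command line.
INITCALL_RE = re.compile(r"\binitcall\b", re.IGNORECASE)


def last_initcall(log_text):
    """Return the last log line that mentions an initcall, or None."""
    last = None
    for line in log_text.splitlines():
        if INITCALL_RE.search(line):
            last = line.strip()
    return last
```

Feeding it the text captured from the serial console returns the final initcall line, which names the function (or its address, to be looked up in System.map) that was running when the boot stopped.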
