Kernel crash if both devices in raid1 are failing

Message ID 57202777.3010402@gmail.com (mailing list archive)
State New, archived

Commit Message

Dmitry Katsubo April 27, 2016, 2:44 a.m. UTC
On 2016-04-25 09:12, Dmitry Katsubo wrote:
> I have run "btrfs check /dev/sda" two times. One time it has completed
> OK, actually showing only one error. The 2nd time it has shown many messages
> 
> "parent transid verify failed on NNN wanted AAA found BBB"
> 
> and then asserted :) But I think the 2nd run is not representative as I have
> gracefully removed one drive from btrfs array to build a new array. The
> "btrfs device remove" completed successfully, but it might have written some
> metadata to the remaining drives, which perhaps was not synchronized
> correctly.
> 
> What I am going to do next is to recompile btrfs-tools so that "-i" CLI option
> applies "(y)" to all questions and run "btrfs restore" again. Hopefully it can
> handle transid mismatch correctly...

OK, I have recompiled btrfs-progs with the necessary fix (attached). It allowed me to
capture the "btrfs restore" output, which was not possible before because of the reads
from the console, even with attempts like this:

while true; do echo y; done | btrfs restore -voxmSi /dev/sda /mnt/backup 2>&1 | tee btrfs_restore

As an experiment, I have upgraded the kernel to 4.4.6, and it still crashes
on the problematic file:

# cat /mnt/tmp/file > /dev/null
[   11.432059] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[   11.436665] ata3.00: BMDMA stat 0x25
[   11.441301] ata3.00: failed command: READ DMA
[   11.479570] ata3.00: cmd c8/00:20:40:ec:f3/00:00:00:00:00/e3 tag 0 dma 16384 in
[   11.479664]          res 51/40:1e:42:ec:f3/00:00:00:00:00/e3 Emask 0x9 (media error)
[   11.619086] ata3.00: status: { DRDY ERR }
[   11.619126] ata3.00: error: { UNC }
[   11.625750] blk_update_request: I/O error, dev sda, sector 66317378
[   11.625779] NOHZ: local_softirq_pending 40
[   70.969876] ------------[ cut here ]------------
[   70.969879] kernel BUG at /build/linux-SBJFwR/linux-4.4.6/debian/build/source_rt/fs/btrfs/volumes.c:5509!
[   70.969885] invalid opcode: 0000 [#1] PREEMPT SMP 
[   70.969954] Modules linked in: netconsole configfs bridge stp llc arc4 iTCO_wdt iTCO_vendor_support ppdev coretemp pcspkr serio_raw i2c_i801 ath5k ath mac80211 cfg80211 sr9700 evdev rfkill dm9601 usbnet lpc_ich mfd_core mii option usb_wwan usbserial rng_core sg snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm acpi_cpufreq snd_timer video 8250_fintek snd drm_kms_helper soundcore tpm_tis drm tpm parport_pc i2c_algo_bit parport shpchp button processor binfmt_misc w83627hf hwmon_vid autofs4 xfs libcrc32c hid_generic usbhid hid crc32c_generic btrfs xor raid6_pq uas usb_storage sd_mod sr_mod cdrom ata_generic firewire_ohci ata_piix libata scsi_mod firewire_core crc_itu_t ehci_pci uhci_hcd ehci_hcd usbcore usb_common e1000e ptp pps_core
[   70.969965] CPU: 0 PID: 114 Comm: kworker/u4:3 Tainted: G        W       4.4.0-1-rt-686-pae #1 Debian 4.4.6-1
[   70.969968] Hardware name: AOpen i945GMx-IF/i945GMx-IF, BIOS i945GMx-IF R1.01 Mar.02.2007 AOpen Inc. 03/02/2007
[   70.970029] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
[   70.970032] task: f3eec0c0 ti: f6a76000 task.ti: f6a76000
[   70.970036] EIP: 0060:[<f87506be>] EFLAGS: 00010217 CPU: 0
[   70.970076] EIP is at __btrfs_map_block+0x11be/0x15a0 [btrfs]

Unfortunately, I was not able to capture the whole trace, as there seems to be a
concurrent problem with netconsole: the whole system hangs at the point above.

P.S. If the Debian maintainer of btrfs-progs is on the list: packaging the project
fails for me (it happens at the very end, during installation of the binaries):

# debuild
...
dpkg-source: info: building btrfs-progs in btrfs-progs_4.4.1-1.1.debian.tar.xz
dpkg-source: info: building btrfs-progs in btrfs-progs_4.4.1-1.1.dsc
 debian/rules build
dh build --parallel
   dh_testdir -O--parallel
   debian/rules override_dh_auto_configure
make[1]: Entering directory '/home/btrfs-progs-4.4.1'
dh_auto_configure -- --bindir=/bin
make[1]: Leaving directory '/home/btrfs-progs-4.4.1'
   dh_auto_build -O--parallel
 fakeroot debian/rules binary
dh binary --parallel
   dh_testroot -O--parallel
   dh_prep -O--parallel
   debian/rules override_dh_auto_install
make[1]: Entering directory '/home/btrfs-progs-4.4.1'
dh_auto_install --destdir=debian/btrfs-progs
# Adding initramfs-tools integration
install -D -m 0755 debian/local/btrfs.hook debian/btrfs-progs/usr/share/initramfs-tools/hooks/btrfs
install -D -m 0755 debian/local/btrfs.local-premount debian/btrfs-progs/usr/share/initramfs-tools/scripts/local-premount/btrfs
make[1]: Leaving directory '/home/btrfs-progs-4.4.1'
   dh_install -O--parallel
/home/btrfs-progs-4.4.1/debian/btrfs-progs.install: 1: /home/btrfs-progs-4.4.1/debian/btrfs-progs.install: btrfs-calc-size: not found
/home/btrfs-progs-4.4.1/debian/btrfs-progs.install: 2: /home/btrfs-progs-4.4.1/debian/btrfs-progs.install: btrfs-select-super: not found
/home/btrfs-progs-4.4.1/debian/btrfs-progs.install: 3: /home/btrfs-progs-4.4.1/debian/btrfs-progs.install: ioctl.h: not found
dh_install: problem reading debian/btrfs-progs.install: 
debian/rules:16: recipe for target 'binary' failed
make: *** [binary] Error 127
dpkg-buildpackage: error: fakeroot debian/rules binary gave error exit status 2
debuild: fatal error at line 1376:
dpkg-buildpackage -rfakeroot -D -us -uc failed

Comments

Dmitry Katsubo May 2, 2016, 8:51 p.m. UTC | #1
Hello,

If somebody is interested in digging into the problem, I would be happy to provide
more information and/or do the testing.

On 2016-04-27 04:44, Dmitry Katsubo wrote:
> # cat /mnt/tmp/file > /dev/null
> [   11.432059] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> [   11.436665] ata3.00: BMDMA stat 0x25
> [   11.441301] ata3.00: failed command: READ DMA
> [   11.479570] ata3.00: cmd c8/00:20:40:ec:f3/00:00:00:00:00/e3 tag 0 dma 16384 in
> [   11.479664]          res 51/40:1e:42:ec:f3/00:00:00:00:00/e3 Emask 0x9 (media error)
> [   11.619086] ata3.00: status: { DRDY ERR }
> [   11.619126] ata3.00: error: { UNC }
> [   11.625750] blk_update_request: I/O error, dev sda, sector 66317378
> [   11.625779] NOHZ: local_softirq_pending 40
> [   70.969876] ------------[ cut here ]------------
> [   70.969879] kernel BUG at /build/linux-SBJFwR/linux-4.4.6/debian/build/source_rt/fs/btrfs/volumes.c:5509!
> [   70.969885] invalid opcode: 0000 [#1] PREEMPT SMP 
> [   70.969954] Modules linked in: netconsole configfs bridge stp llc arc4 iTCO_wdt iTCO_vendor_support ppdev coretemp pcspkr serio_raw i2c_i801 ath5k ath mac80211 cfg80211 sr9700 evdev rfkill dm9601 usbnet lpc_ich mfd_core mii option usb_wwan usbserial rng_core sg snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm_oss snd_mixer_oss snd_pcm acpi_cpufreq snd_timer video 8250_fintek snd drm_kms_helper soundcore tpm_tis drm tpm parport_pc i2c_algo_bit parport shpchp button processor binfmt_misc w83627hf hwmon_vid autofs4 xfs libcrc32c hid_generic usbhid hid crc32c_generic btrfs xor raid6_pq uas usb_storage sd_mod sr_mod cdrom ata_generic firewire_ohci ata_piix libata scsi_mod firewire_core crc_itu_t ehci_pci uhci_hcd ehci_hcd usbcore usb_common e1000e ptp pps_core
> [   70.969965] CPU: 0 PID: 114 Comm: kworker/u4:3 Tainted: G        W       4.4.0-1-rt-686-pae #1 Debian 4.4.6-1
> [   70.969968] Hardware name: AOpen i945GMx-IF/i945GMx-IF, BIOS i945GMx-IF R1.01 Mar.02.2007 AOpen Inc. 03/02/2007
> [   70.970029] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
> [   70.970032] task: f3eec0c0 ti: f6a76000 task.ti: f6a76000
> [   70.970036] EIP: 0060:[<f87506be>] EFLAGS: 00010217 CPU: 0
> [   70.970076] EIP is at __btrfs_map_block+0x11be/0x15a0 [btrfs]

Patch

Index: btrfs-progs-4.4.1/cmds-restore.c
===================================================================
--- btrfs-progs-4.4.1.orig/cmds-restore.c
+++ btrfs-progs-4.4.1/cmds-restore.c
@@ -438,6 +438,9 @@  static enum loop_response ask_to_continu
 	char buf[2];
 	char *ret;
 
+	if (ignore_errors)
+		return LOOP_CONTINUE;
+
 	printf("We seem to be looping a lot on %s, do you want to keep going "
 	       "on ? (y/N/a): ", file);
 again:
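
For readers who want to try the idea outside btrfs-progs, here is a minimal standalone sketch of the same short-circuit. It is not the actual cmds-restore.c code: the enum values, the global flag, and the extra FILE * parameter (added here only so the prompt can be driven from something other than the console) are assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Assumed stand-ins for the names that appear in the patch. */
enum loop_response { LOOP_STOP, LOOP_CONTINUE, LOOP_DONTASK };

/* Assumed to be set when the user passes -i (ignore errors). */
static int ignore_errors = 0;

/*
 * Ask whether to keep going on `file`. The `in` parameter is hypothetical;
 * it exists only so the sketch can be exercised without a terminal.
 */
static enum loop_response ask_to_continue(const char *file, FILE *in)
{
	char buf[16];

	/* The patch's idea: never prompt when errors are being ignored. */
	if (ignore_errors)
		return LOOP_CONTINUE;

	for (;;) {
		printf("We seem to be looping a lot on %s, do you want to keep "
		       "going on ? (y/N/a): ", file);
		fflush(stdout);

		if (!fgets(buf, sizeof(buf), in))
			return LOOP_STOP; /* EOF or read error: take the default "N" */

		/* Discard the rest of an over-long line before looking at it. */
		if (!strchr(buf, '\n')) {
			int c;
			while ((c = fgetc(in)) != EOF && c != '\n')
				;
		}

		switch (buf[0]) {
		case 'y': case 'Y':
			return LOOP_CONTINUE;
		case 'a': case 'A':
			return LOOP_DONTASK;
		case 'n': case 'N': case '\n':
			return LOOP_STOP;
		default:
			break; /* unrecognized answer: ask again */
		}
	}
}
```

With ignore_errors set, the function returns before touching the input stream at all, so no console interaction is needed and the output can be redirected freely.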