
[BUG] pvmove corrupting XFS filesystems (was Re: [BUG] Internal error xfs_dir2_data_reada_verify)

Message ID 20130226044039.GM5551@dastard (mailing list archive)
State Deferred, archived

Commit Message

Dave Chinner Feb. 26, 2013, 4:40 a.m. UTC
[cc'd dm-devel]

On Mon, Feb 25, 2013 at 07:47:32PM -0500, Matteo Frigo wrote:
> For some reason XFS reliably crashes for me when used in conjunction
> with LVM2's pvmove.  The error reliably appears when removing a large
> number of files from a volume that is being pvmove'd at the same time.
> I am using vanilla kernel 3.8.  A typical kernel message looks like the
> following:
> 
>    [  262.396471] ffff88001ecfb000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>    [  262.398314] XFS (dm-1): Internal error xfs_dir2_data_reada_verify at line 226 of file fs/xfs/xfs_dir2_data.c.  Caller 0xffffffffa01eb42d
>    [  262.398314] 
>    [  262.398740] Pid: 134, comm: kworker/0:1H Not tainted 3.8.0 #1
>    [  262.398742] Call Trace:
>    [  262.398767]  [<ffffffffa01ed225>] ? xfs_corruption_error+0x54/0x6f [xfs]
>    [  262.398777]  [<ffffffffa01eb42d>] ? xfs_buf_iodone_work+0x2e/0x5c [xfs]
>    [  262.398792]  [<ffffffffa02194c5>] ? xfs_dir2_data_reada_verify+0x76/0x88 [xfs]
>    [  262.398801]  [<ffffffffa01eb42d>] ? xfs_buf_iodone_work+0x2e/0x5c [xfs]
>    [  262.398809]  [<ffffffffa01eb42d>] ? xfs_buf_iodone_work+0x2e/0x5c [xfs]
>    [  262.398814]  [<ffffffff81053399>] ? process_one_work+0x16d/0x2c5
>    [  262.398831]  [<ffffffffa01eb3ff>] ? xfs_buf_relse+0x12/0x12 [xfs]
>    [  262.398834]  [<ffffffff810537b4>] ? worker_thread+0x117/0x1b1
>    [  262.398837]  [<ffffffff8105369d>] ? rescuer_thread+0x187/0x187
>    [  262.398840]  [<ffffffff81056f6c>] ? kthread+0x81/0x89
>    [  262.398842]  [<ffffffff81056eeb>] ? __kthread_parkme+0x5c/0x5c
>    [  262.398846]  [<ffffffff8137d93c>] ? ret_from_fork+0x7c/0xb0
>    [  262.398848]  [<ffffffff81056eeb>] ? __kthread_parkme+0x5c/0x5c
>    [  262.398850] XFS (dm-1): Corruption detected. Unmount and run xfs_repair
>    [  262.399089] XFS (dm-1): metadata I/O error: block 0x805060 ("xfs_trans_read_buf_map") error 117 numblks 8

That's clearly a 3.8 kernel you are seeing this on. The readahead
has returned zeros rather than data, which implies that the
readahead was cancelled. I can see a potential bug in the 3.8 code
where the above verifier is called even on an IO error. Hence a
failed readahead will trigger a corruption error like the above,
even though failed readahead is supposed to be silent.

A failed readahead is then supposed to be retried by a blocking read
when the metadata is actually read, and this means that failed
readahead can be ignored.

> I have observed the problem with vanilla kernels 3.8 and 3.4.33, and
> with CentOS 6.3 using the CentOS variant of 2.6.32.  I have observed the
> problem on various virtual machines using debian wheezy/sid and Fedora
> 18, and on at least three physical machines.  Even though the kernel
> reports "Corruption detected", xfs_repair appears to be able to fix any
> problems, so I haven't actually lost any data.  The problem goes away
> after stopping pvmove.

Which implies two things: 1) that the corruption is not limited to
readahead; and 2) that the bug is a result of whatever DM is doing
during a pvmove operation.

> HOW TO REPRODUCE: I created a virtual machine with debian sid and two
> virtual disks /dev/vdb and /dev/vdc.  Then, the following script
> reliably triggers the failure.  
> 
>     pvcreate /dev/vd[bc]
>     vgcreate test /dev/vd[bc]
>     lvcreate -L 8G -n vol test /dev/vdb
>     mkfs.xfs -f /dev/mapper/test-vol
>     mount -o noatime /dev/mapper/test-vol /mnt
>     cd /mnt
>     git clone ~/linux-stable
>     cd /
>     umount /mnt
> 
>     mount -o noatime /dev/mapper/test-vol /mnt
>     pvmove -b /dev/vdb /dev/vdc
>     sleep 2
>     rm -rf /mnt/linux-stable

Yup, I can reproduce it here on a 3.8 kernel. Note that it's not
just returning buffers with zeros in them, it's returning buffers
with stale data in them:

ffff88011cfd4000: 31 5f 53 50 41 52 45 5f 4d 41 53 4b 29 20 3e 3e 1_SPARE_MASK) >>
[98756.147971] XFS (dm-0): Internal error xfs_dir3_data_reada_verify at line 240 of file fs/xfs/xfs_dir2_data.c.  Caller 0xffffffff81457dbf

> I have observed the problem even without unmounting and re-mounting the
> file system, but it appears that the umount/mount sequence makes the
> problem 100% reproducible.

Yup, because it's got to read the metadata from disk to do the rm,
and hence triggers lots of directory block readahead to occur.

> I am not implying that this is an xfs bug---it may well be a dm bug for
> all I know.  However, the same test completes correctly using ext4
> instead of xfs, suggesting that dm is working in at least some cases.

ext4 doesn't use readahead in the same way that XFS does. XFS has a
long history of discovering readahead handling bugs in DM...

So, I added a check to ensure we don't run the verifier on buffers
that have had an IO error returned (i.e. we preserve the error and
defer verification until a blocking buffer read occurs; the patch is
appended at the end of this mail), and so errors on buffers are
passed all the way through. This occurred:

[  112.067262] XFS (dm-0): metadata I/O error: block 0x180068 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.069331] XFS (dm-0): metadata I/O error: block 0x180068 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.073369] XFS (dm-0): metadata I/O error: block 0x4800a8 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.077853] XFS (dm-0): metadata I/O error: block 0x4800a8 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.108900] XFS (dm-0): metadata I/O error: block 0x180d10 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.112003] XFS (dm-0): metadata I/O error: block 0x180d10 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.116444] XFS (dm-0): metadata I/O error: block 0x480d20 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.119678] XFS (dm-0): metadata I/O error: block 0x480d20 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.125633] XFS (dm-0): metadata I/O error: block 0x1816b8 ("xfs_trans_read_buf_map") error 11 numblks 8
[  112.129345] XFS (dm-0): metadata I/O error: block 0x1816b8 ("xfs_trans_read_buf_map") error 11 numblks 8

Error 11 is EAGAIN, which means that the IO is being completed with
an error of EAGAIN on it. This is not an XFS bug - EAGAIN is not a
recognised IO stack error on any filesystem and so whatever is
returning it is broken. There are also definitely other problems in
the lower layers of the storage stack when pvmove is executed, as
this happened next:

[  112.130211] BUG: unable to handle kernel NULL pointer dereference at 000000000000007c
[  112.130232] IP: [<ffffffff8175aa67>] do_raw_spin_lock+0x17/0x110
[  112.130235] PGD 21cb16067 PUD 21c65d067 PMD 0 
[  112.130238] Oops: 0000 [#1] SMP 
[  112.130244] Modules linked in:
[  112.130249] CPU 5 
[  112.130249] Pid: 0, comm: swapper/5 Not tainted 3.8.0-dgc+ #433 Bochs Bochs
[  112.130254] RIP: 0010:[<ffffffff8175aa67>]  [<ffffffff8175aa67>] do_raw_spin_lock+0x17/0x110
[  112.130259] RSP: 0018:ffff88011dd03c18  EFLAGS: 00010086
[  112.130260] RAX: 0000000000000092 RBX: 0000000000000078 RCX: ffff88011cdcc680
[  112.130261] RDX: ffff88007dbc5fd8 RSI: ffff88007bd6fcb8 RDI: 0000000000000078
[  112.130262] RBP: ffff88011dd03c48 R08: 0000000000000000 R09: 0000000000000000
[  112.130264] R10: 0000000000000000 R11: 0000000000000001 R12: 00000000000000a0
[  112.130265] R13: 0000000000000078 R14: 0000000000010000 R15: 0000000000010000
[  112.130267] FS:  0000000000000000(0000) GS:ffff88011dd00000(0000) knlGS:0000000000000000
[  112.130268] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  112.130269] CR2: 000000000000007c CR3: 000000021c3f9000 CR4: 00000000000006e0
[  112.130276] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  112.130282] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  112.130284] Process swapper/5 (pid: 0, threadinfo ffff88007dbc4000, task ffff88007dbc23c0)
[  112.130284] Stack:
[  112.130287]  ffff88011dd03c78 0000000000000086 ffff88011dd03c88 0000000000000092
[  112.130289]  00000000000000a0 0000000000000078 ffff88011dd03c68 ffffffff81bd7cf4
[  112.130291]  ffff88007c58a200 ffff88007bd6fcb8 ffff88011dd03c98 ffffffff81a65d2e
[  112.130292] Call Trace:
[  112.130295]  <IRQ> 
[  112.130300]  [<ffffffff81bd7cf4>] _raw_spin_lock_irqsave+0x34/0x40
[  112.130316]  [<ffffffff81a65d2e>] push+0x2e/0x60
[  112.130319]  [<ffffffff81a65e10>] ? run_pages_job+0xb0/0xb0
[  112.130322]  [<ffffffff81a65e7a>] complete_io+0x6a/0xa0
[  112.130325]  [<ffffffff81a64f33>] dec_count.part.5+0x63/0x80
[  112.130328]  [<ffffffff81a64fa5>] endio+0x55/0xa0
[  112.130340]  [<ffffffff811b8b9d>] bio_endio+0x1d/0x30
[  112.130348]  [<ffffffff8172957b>] req_bio_endio.isra.63+0x8b/0xd0
[  112.130351]  [<ffffffff817296c8>] blk_update_request+0x108/0x4d0
[  112.130354]  [<ffffffff81bd7c3e>] ? _raw_spin_unlock+0xe/0x20
[  112.130356]  [<ffffffff81729ab7>] blk_update_bidi_request+0x27/0xa0
[  112.130360]  [<ffffffff8172cae0>] __blk_end_bidi_request+0x20/0x50
[  112.130362]  [<ffffffff8172cb2f>] __blk_end_request_all+0x1f/0x30
[  112.130373]  [<ffffffff8194beb0>] virtblk_done+0x100/0x240
[  112.130385]  [<ffffffff817d4f45>] vring_interrupt+0x35/0x60
[  112.130395]  [<ffffffff810fb01c>] handle_irq_event_percpu+0x6c/0x230
[  112.130398]  [<ffffffff810fb228>] handle_irq_event+0x48/0x70
[  112.130404]  [<ffffffff810fd987>] handle_edge_irq+0x77/0x110
[  112.130408]  [<ffffffff81049192>] handle_irq+0x22/0x40
[  112.130412]  [<ffffffff81be1c9a>] do_IRQ+0x5a/0xe0
[  112.130414]  [<ffffffff81bd80ed>] common_interrupt+0x6d/0x6d
[  112.130416]  <EOI> 
[  112.130420]  [<ffffffff81074146>] ? native_safe_halt+0x6/0x10
[  112.130424]  [<ffffffff8104f82f>] default_idle+0x4f/0x220
[  112.130427]  [<ffffffff81050646>] cpu_idle+0xb6/0xf0
[  112.130442]  [<ffffffff81bc565e>] start_secondary+0x1aa/0x1ae
[  112.130466] Code: 10 89 47 08 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 30 48 89 5d e8 48 89 fb 4c 89 65 f0 4c 89 6d f8 <81> 7f 04 ad 4e  
[  112.130469] RIP  [<ffffffff8175aa67>] do_raw_spin_lock+0x17/0x110
[  112.130470]  RSP <ffff88011dd03c18>
[  112.130471] CR2: 000000000000007c
[  112.130474] ---[ end trace e4b9ec49f84b99c2 ]---
[  112.130476] Kernel panic - not syncing: Fatal exception in interrupt

Which indicates that IO completion through kcopyd died.

And finally, this corruption occurred on the third run:

[   68.309233] XFS (dm-0): Corruption detected. Unmount and run xfs_repair
[   68.310281] XFS (dm-0): bad inode magic/vsn daddr 1578640 #0 (magic=5844)
[   68.312136] XFS: Assertion failed: 0, file: fs/xfs/xfs_inode.c, line: 417
[   68.313906] ------------[ cut here ]------------
[   68.314822] kernel BUG at fs/xfs/xfs_message.c:100!
[   68.315838] invalid opcode: 0000 [#1] SMP 
[   68.316117] Modules linked in:
[   68.316117] CPU 3 
[   68.316117] Pid: 4401, comm: kworker/3:1H Not tainted 3.8.0-dgc+ #433 Bochs Bochs
[   68.316117] RIP: 0010:[<ffffffff8146a1f2>]  [<ffffffff8146a1f2>] assfail+0x22/0x30
[   68.316117] RSP: 0018:ffff88021c70dd08  EFLAGS: 00010296
[   68.316117] RAX: 000000000000003d RBX: 0000000000000000 RCX: 0000000000008f8e
[   68.316117] RDX: 0000000000008e8e RSI: 0000000000000096 RDI: 0000000000000246
[   68.316117] RBP: ffff88021c70dd08 R08: 000000000000000a R09: 0000000000000215
[   68.316117] R10: 0000000000000000 R11: 0000000000000214 R12: ffff88011d005000
[   68.316117] R13: ffff88011b71d5c0 R14: 0000000000000020 R15: ffff88011b5d1000
[   68.316117] FS:  0000000000000000(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
[   68.316117] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   68.316117] CR2: 0000000000757000 CR3: 000000019a1ac000 CR4: 00000000000006e0
[   68.316117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   68.316117] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   68.316117] Process kworker/3:1H (pid: 4401, threadinfo ffff88021c70c000, task ffff88021c97c280)
[   68.316117] Stack:
[   68.316117]  ffff88021c70dd68 ffffffff814b2c37 ffffffff814b2cee 0000000000000003
[   68.316117]  ffff88021c70dd48 ffff88011d0052a0 ffff88021fc12c40 ffff88011b71d688
[   68.316117]  ffff88011b71d5c0 0000000000000001 0000000000000000 ffff88011b71d688
[   68.316117] Call Trace:
[   68.316117]  [<ffffffff814b2c37>] xfs_inode_buf_verify+0xf7/0x190
[   68.316117]  [<ffffffff814b2cee>] ? xfs_inode_buf_read_verify+0xe/0x10
[   68.316117]  [<ffffffff814b2cee>] xfs_inode_buf_read_verify+0xe/0x10
[   68.316117]  [<ffffffff81457e25>] xfs_buf_iodone_work+0xc5/0xf0
[   68.316117]  [<ffffffff810a0b42>] process_one_work+0x132/0x4e0
[   68.316117]  [<ffffffff81457d60>] ? xfs_buf_unlock+0xa0/0xa0
[   68.316117]  [<ffffffff810a244d>] worker_thread+0x15d/0x450
[   68.316117]  [<ffffffff810a22f0>] ? __next_gcwq_cpu+0x60/0x60
[   68.316117]  [<ffffffff810a84d8>] kthread+0xd8/0xe0
[   68.316117]  [<ffffffff810a8400>] ? flush_kthread_worker+0xa0/0xa0
[   68.316117]  [<ffffffff81be03ec>] ret_from_fork+0x7c/0xb0
[   68.316117]  [<ffffffff810a8400>] ? flush_kthread_worker+0xa0/0xa0

Which is on a buffer that has never had read-ahead issued on it.
It's an inode buffer, but the partial magic number indicates that
the block contains directory data of some kind. IOWs, it looks to me
like data from the wrong blocks was returned to XFS....

[Ben: now do you see why I want every single piece of XFS metadata
to be completely self describing, with embedded UUIDs, block/inode
numbers and owners? Having a block/inode number encoded into the
metadata will detect these sorts of storage layer bugs immediately
and tell us what class of error is occurring without having to
guess...]

All of the evidence points to a problem caused by the pvmove
operation. I'd suggest that you don't use it until we get to the
bottom of the problem (i.e. where the EAGAIN is coming from and
why)....

Cheers,

Dave.

Comments

Matteo Frigo Feb. 26, 2013, 11:29 a.m. UTC | #1
Dave Chinner <david@fromorbit.com> writes:

> All of the evidence points of a problem caused by the pvmove
> operation. I'd suggest that you don't use it until we get to the
> bottom of the problem (i.e. where the EAGAIN is coming from and
> why)....

Dave,

thanks for your quick reply and detailed analysis.

For the benefit of other readers, I wish to add that the same problem
occurs when creating lvm mirrors (e.g. "lvconvert -m1 vg/lvol"), which
is not surprising since pvmove works itself by creating a mirror.
Consequently, mirror creation should also be avoided until the root
cause of the problem is eliminated.

Cheers,
Matteo

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Alasdair G Kergon Feb. 27, 2013, 1:04 a.m. UTC | #2
(Quick pointers that might be relevant)

EAGAIN, I'm not aware of dm itself returning that on the i/o path.

For 3.8 issues, read dm-devel around https://www.redhat.com/archives/dm-devel/2013-February/msg00086.html
(I queued the dm-side fixes for linux-next earlier today)

For pvmove, check exactly which version and whether discards are enabled: there
was a userspace bug for a short period some time ago when discards were enabled.

Alasdair

Dave Chinner Feb. 27, 2013, 1:49 a.m. UTC | #3
On Wed, Feb 27, 2013 at 01:04:14AM +0000, Alasdair G Kergon wrote:
> (Quick pointers that might be relevant)
> 
> EAGAIN, I'm not aware of dm itself returning that on the i/o path.

Neither am I, but it's coming from somewhere in the IO path...

> For 3.8 issues, read dm-devel around https://www.redhat.com/archives/dm-devel/2013-February/msg00086.html
> (I queued the dm-side fixes for linux-next earlier today)

It's reproducible on lots of different kernels, apparently - 3.8,
3.4.33, CentOS 6.3, debian sid/wheezy and Fedora 18 were mentioned
specifically by the OP - so it doesn't look like a recent
regression, nor constrained to a specific kernel.

> For pvmove, check exactly which version and whether discards are enabled: there
> was a userspace bug for a short period some time ago when discards were enabled.

The version I used to reproduce on a 3.8.0 kernel was:

$ pvmove --version
  LVM version:     2.02.95(2) (2012-03-06)
  Library version: 1.02.74 (2012-03-06)
  Driver version:  4.23.1

From Debian unstable.

Cheers,

Dave.
Matteo Frigo Feb. 27, 2013, 2:21 a.m. UTC | #4
Dave Chinner <david@fromorbit.com> writes:

> On Wed, Feb 27, 2013 at 01:04:14AM +0000, Alasdair G Kergon wrote:
>> (Quick pointers that might be relevant)
>> 
>> EAGAIN, I'm not aware of dm itself returning that on the i/o path.
>
> Neither am I, but it's coming from somewhere in the IO path...

Well, I don't really know anything about this topic, so I may be
completely off the mark, but dm-raid1.c:mirror_map() does indeed return
EWOULDBLOCK, and EWOULDBLOCK is #define'd to be EAGAIN, so it seems to
me that dm-raid1 does indeed return EAGAIN for "rw == READA" (which I
assume is read-ahead) if the "region is not in-sync":

	/*
	 * If region is not in-sync queue the bio.
	 */
	if (!r || (r == -EWOULDBLOCK)) {
		if (rw == READA)
			return -EWOULDBLOCK;

		queue_bio(ms, bio, rw);
		return DM_MAPIO_SUBMITTED;
	}

Regards,
MF

Dave Chinner Feb. 27, 2013, 2:29 a.m. UTC | #5
On Tue, Feb 26, 2013 at 09:21:44PM -0500, Matteo Frigo wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Wed, Feb 27, 2013 at 01:04:14AM +0000, Alasdair G Kergon wrote:
> >> (Quick pointers that might be relevant)
> >> 
> >> EAGAIN, I'm not aware of dm itself returning that on the i/o path.
> >
> > Neither am I, but it's coming from somewhere in the IO path...
> 
> Well, I don't really know anything about this topic, so I may be
> completely off the mark, but dm-raid1.c:mirror_map() does indeed return
> EWOULDBLOCK, and EWOULDBLOCK is #define'd to be EAGAIN, so it seems to
> me that dm-raid1 does indeed return EAGAIN for "rw == READA" (which I
> assume is read-ahead) if the "region is not in-sync":
> 
> 	/*
> 	 * If region is not in-sync queue the bio.
> 	 */
> 	if (!r || (r == -EWOULDBLOCK)) {
> 		if (rw == READA)
> 			return -EWOULDBLOCK;
> 
> 		queue_bio(ms, bio, rw);
> 		return DM_MAPIO_SUBMITTED;
> 	}

Trees, forest....

Thanks for pointing out the obvious, Matteo. :)

Cheers,

Dave.
Mike Snitzer Feb. 27, 2013, 3:07 p.m. UTC | #6
On Tue, Feb 26 2013 at  8:49pm -0500,
Dave Chinner <david@fromorbit.com> wrote:

> On Wed, Feb 27, 2013 at 01:04:14AM +0000, Alasdair G Kergon wrote:
> > (Quick pointers that might be relevant)
> > 
> > EAGAIN, I'm not aware of dm itself returning that on the i/o path.
> 
> Neither am I, but it's coming from somewhere in the IO path...
> 
> > For 3.8 issues, read dm-devel around https://www.redhat.com/archives/dm-devel/2013-February/msg00086.html
> > (I queued the dm-side fixes for linux-next earlier today)
> 
> It's reproducable on lots of different kernels, apparently - 3.8,
> 3.4.33, CentOS 6.3, debian sid/wheezy and Fedora 18 were mentioned
> specifically by the OP - so it doesn't look like a recent
> regression or constrained to a specific kernel.
> 
> > For pvmove, check exactly which version and whether discards are enabled: there
> > was a userspace bug for a short period some time ago when discards were enabled.
> 
> The version I used to reproduce on a 3.8.0 kernel was:
> 
> $ pvmove --version
>   LVM version:     2.02.95(2) (2012-03-06)
>   Library version: 1.02.74 (2012-03-06)
>   Driver version:  4.23.1

Was issue_discards enabled in lvm.conf?

If so, as Alasdair said, this lvm2 2.02.97 fix is needed:
http://git.fedorahosted.org/cgit/lvm2.git/commit/?id=07a25c249b3e

Matteo Frigo Feb. 27, 2013, 3:10 p.m. UTC | #7
I have "issue_discards = 0" in lvm.conf, so this does not appear to
be the problem.
Dave Chinner Feb. 27, 2013, 11:07 p.m. UTC | #8
On Wed, Feb 27, 2013 at 10:07:15AM -0500, Mike Snitzer wrote:
> On Tue, Feb 26 2013 at  8:49pm -0500,
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Wed, Feb 27, 2013 at 01:04:14AM +0000, Alasdair G Kergon wrote:
> > > (Quick pointers that might be relevant)
> > > 
> > > EAGAIN, I'm not aware of dm itself returning that on the i/o path.
> > 
> > Neither am I, but it's coming from somewhere in the IO path...
> > 
> > > For 3.8 issues, read dm-devel around https://www.redhat.com/archives/dm-devel/2013-February/msg00086.html
> > > (I queued the dm-side fixes for linux-next earlier today)
> > 
> > It's reproducable on lots of different kernels, apparently - 3.8,
> > 3.4.33, CentOS 6.3, debian sid/wheezy and Fedora 18 were mentioned
> > specifically by the OP - so it doesn't look like a recent
> > regression or constrained to a specific kernel.
> > 
> > > For pvmove, check exactly which version and whether discards are enabled: there
> > > was a userspace bug for a short period some time ago when discards were enabled.
> > 
> > The version I used to reproduce on a 3.8.0 kernel was:
> > 
> > $ pvmove --version
> >   LVM version:     2.02.95(2) (2012-03-06)
> >   Library version: 1.02.74 (2012-03-06)
> >   Driver version:  4.23.1
> 
> Was issue_discards enabled in lvm.conf?

$ grep issue_discards /etc/lvm/lvm.conf
    issue_discards = 0
$

Cheers,

Dave.
Matteo Frigo March 7, 2013, 12:13 p.m. UTC | #9
Matteo Frigo <athena@fftw.org> writes:

> Well, I don't really know anything about this topic, so I may be
> completely off the mark, but dm-raid1.c:mirror_map() does indeed return
> EWOULDBLOCK, and EWOULDBLOCK is #define'd to be EAGAIN, so it seems to
> me that dm-raid1 does indeed return EAGAIN for "rw == READA" (which I
> assume is read-ahead) if the "region is not in-sync":
>
> 	/*
> 	 * If region is not in-sync queue the bio.
> 	 */
> 	if (!r || (r == -EWOULDBLOCK)) {
> 		if (rw == READA)
> 			return -EWOULDBLOCK;
>
> 		queue_bio(ms, bio, rw);
> 		return DM_MAPIO_SUBMITTED;
> 	}

Dave (and others),

do you have any suggestion on what should be done to fix this bug?

I have tried returning -EIO instead of -EWOULDBLOCK, but xfs does not
like that.  dm-zero.c:zero_map() appears to return -EIO too, so this is
another potential issue.

I have verified that removing the READA special case, treating READA
like READ, fixes the problem:

 		if(0) if (rw == READA)
 			return -EWOULDBLOCK;

Of course this "fix" throws the baby out with the bathwater.

I am willing to write and submit a patch, but I would appreciate
directions as to what the correct protocol between xfs and dm is
supposed to be.

Regards,
MF

Dave Chinner March 7, 2013, 10:31 p.m. UTC | #10
On Thu, Mar 07, 2013 at 07:13:27AM -0500, Matteo Frigo wrote:
> Matteo Frigo <athena@fftw.org> writes:
> 
> > Well, I don't really know anything about this topic, so I may be
> > completely off the mark, but dm-raid1.c:mirror_map() does indeed return
> > EWOULDBLOCK, and EWOULDBLOCK is #define'd to be EAGAIN, so it seems to
> > me that dm-raid1 does indeed return EAGAIN for "rw == READA" (which I
> > assume is read-ahead) if the "region is not in-sync":
> >
> > 	/*
> > 	 * If region is not in-sync queue the bio.
> > 	 */
> > 	if (!r || (r == -EWOULDBLOCK)) {
> > 		if (rw == READA)
> > 			return -EWOULDBLOCK;
> >
> > 		queue_bio(ms, bio, rw);
> > 		return DM_MAPIO_SUBMITTED;
> > 	}
> 
> Dave (and others),
> 
> do you have any suggestion on what should be done to fix this bug?
> 
> I have tried returning -EIO instead of -EWOULDBLOCK, but xfs does not
> like that.  dm-zero.c:zero_map() appears to return -EIO too, so this is
> another potential issue.

You need the XFS patch I posted so that readahead buffer
verification is avoided in the case of an error being returned from
the readahead.

Cheers,

Dave.
Dave Chinner March 7, 2013, 10:50 p.m. UTC | #11
On Fri, Mar 08, 2013 at 09:31:40AM +1100, Dave Chinner wrote:
> On Thu, Mar 07, 2013 at 07:13:27AM -0500, Matteo Frigo wrote:
> > Matteo Frigo <athena@fftw.org> writes:
> > 
> > > Well, I don't really know anything about this topic, so I may be
> > > completely off the mark, but dm-raid1.c:mirror_map() does indeed return
> > > EWOULDBLOCK, and EWOULDBLOCK is #define'd to be EAGAIN, so it seems to
> > > me that dm-raid1 does indeed return EAGAIN for "rw == READA" (which I
> > > assume is read-ahead) if the "region is not in-sync":
> > >
> > > 	/*
> > > 	 * If region is not in-sync queue the bio.
> > > 	 */
> > > 	if (!r || (r == -EWOULDBLOCK)) {
> > > 		if (rw == READA)
> > > 			return -EWOULDBLOCK;
> > >
> > > 		queue_bio(ms, bio, rw);
> > > 		return DM_MAPIO_SUBMITTED;
> > > 	}
> > 
> > Dave (and others),
> > 
> > do you have any suggestion on what should be done to fix this bug?
> > 
> > I have tried returning -EIO instead of -EWOULDBLOCK, but xfs does not
> > like that.  dm-zero.c:zero_map() appears to return -EIO too, so this is
> > another potential issue.
> 
> You need the XFS patch I posted so that readahead buffer
> verification is avoided in the case of an error being returned from
> the readahead.

I don't recall if that patch was sent to this thread, so here it is:

http://oss.sgi.com/archives/xfs/2013-02/msg00516.html

Cheers,

Dave.

Patch

--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1022,7 +1022,9 @@  xfs_buf_iodone_work(
 	bool                    read = !!(bp->b_flags & XBF_READ);
 
 	bp->b_flags &= ~(XBF_READ | XBF_WRITE | XBF_READ_AHEAD);
-	if (read && bp->b_ops)
+
+	/* validate buffers that were read without errors */
+	if (read && bp->b_ops && !bp->b_error && (bp->b_flags & XBF_DONE))
 		bp->b_ops->verify_read(bp);
 
 	if (bp->b_iodone)