diff mbox

dm-io: reject unsupported DISCARD/WRITE SAME requests with EOPNOTSUPP

Message ID 20150213092432.GC11034@birch.djwong.org (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

Darrick J. Wong Feb. 13, 2015, 9:24 a.m. UTC
I created a dm-raid1 device backed by a device that supports DISCARD
and another device that does NOT support DISCARD with the following
dm configuration:

# echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
# lsblk -D
NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                 0        4K       1G         0
`-moo (dm-0)        0        4K       1G         0
sdb                 0        0B       0B         0
`-moo (dm-0)        0        4K       1G         0

Notice that the mirror device /dev/mapper/moo advertises DISCARD
support even though one of the mirror halves doesn't.

If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
loop in do_region() when it tries to issue a DISCARD request to sdb.
The problem is that when we call do_region() against sdb, num_sectors
is set to zero because q->limits.max_discard_sectors is zero.
Therefore, "remaining" never decreases and the loop never terminates.

Before entering the loop, check for the combination of REQ_DISCARD and
no discard and return -EOPNOTSUPP to avoid hanging up the mirror
device.  Fix the same problem with WRITE_DISCARD while we're at it.

This bug was found by the unfortunate coincidence of pvmove and a
discard operation in the RHEL 6.5 kernel; 3.19 is also affected.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
---
 drivers/md/dm-io.c |    7 +++++++
 1 file changed, 7 insertions(+)




--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Mike Snitzer Feb. 13, 2015, 3:55 p.m. UTC | #1
On Fri, Feb 13 2015 at  4:24am -0500,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> I created a dm-raid1 device backed by a device that supports DISCARD
> and another device that does NOT support DISCARD with the following
> dm configuration:
> 
> # echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
> # lsblk -D
> NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> sda                 0        4K       1G         0
> `-moo (dm-0)        0        4K       1G         0
> sdb                 0        0B       0B         0
> `-moo (dm-0)        0        4K       1G         0
> 
> Notice that the mirror device /dev/mapper/moo advertises DISCARD
> support even though one of the mirror halves doesn't.
> 
> If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
> BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
> loop in do_region() when it tries to issue a DISCARD request to sdb.
> The problem is that when we call do_region() against sdb, num_sectors
> is set to zero because q->limits.max_discard_sectors is zero.
> Therefore, "remaining" never decreases and the loop never terminates.
> 
> Before entering the loop, check for the combination of REQ_DISCARD and
> no discard and return -EOPNOTSUPP to avoid hanging up the mirror
> device.  Fix the same problem with WRITE_DISCARD while we're at it.
> 
> This bug was found by the unfortunate coincidence of pvmove and a
> discard operation in the RHEL 6.5 kernel; 3.19 is also affected.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Cc: Srinivas Eeda <srinivas.eeda@oracle.com>

Your patch looks fine but it is laser focused on dm-io.  Again, that is
fine (fixes a real problem).  But I'm wondering how other targets will
respond in the face of partial discard support across the logical
address space of the DM device.

When I implemented dm_table_supports_discards() I consciously allowed a
DM table to contain a mix of discard support.  I'm now wondering where
it is we benefit from that?  Seems like more of a liability than
anything -- so a bigger hammer approach to fixing this would be to
require all targets and all devices in a DM table support discard.
Which amounts to changing dm_table_supports_discards() to be like
dm_table_supports_write_same().

BTW, given dm_table_supports_write_same(), your patch shouldn't need to
worry about WRITE SAME.  Did you experience issues with WRITE SAME too
or were you just being proactive?

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Darrick J. Wong Feb. 13, 2015, 5:01 p.m. UTC | #2
On Fri, Feb 13, 2015 at 10:55:50AM -0500, Mike Snitzer wrote:
> On Fri, Feb 13 2015 at  4:24am -0500,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > I created a dm-raid1 device backed by a device that supports DISCARD
> > and another device that does NOT support DISCARD with the following
> > dm configuration:
> > 
> > # echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
> > # lsblk -D
> > NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> > sda                 0        4K       1G         0
> > `-moo (dm-0)        0        4K       1G         0
> > sdb                 0        0B       0B         0
> > `-moo (dm-0)        0        4K       1G         0
> > 
> > Notice that the mirror device /dev/mapper/moo advertises DISCARD
> > support even though one of the mirror halves doesn't.
> > 
> > If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
> > BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
> > loop in do_region() when it tries to issue a DISCARD request to sdb.
> > The problem is that when we call do_region() against sdb, num_sectors
> > is set to zero because q->limits.max_discard_sectors is zero.
> > Therefore, "remaining" never decreases and the loop never terminates.
> > 
> > Before entering the loop, check for the combination of REQ_DISCARD and
> > no discard and return -EOPNOTSUPP to avoid hanging up the mirror
> > device.  Fix the same problem with WRITE_DISCARD while we're at it.
> > 
> > This bug was found by the unfortunate coincidence of pvmove and a
> > discard operation in the RHEL 6.5 kernel; 3.19 is also affected.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
> 
> Your patch looks fine but it is laser focused on dm-io.  Again, that is
> fine (fixes a real problem).  But I'm wondering how other targets will
> respond in the face of partial discard support across the logical
> address space of the DM device.

I'm not sure.  As far as I could tell, the simple dm targets seem happy enough
to pass along the -EOPNOTSUPP code to whomever gave it the bio.  I didn't look
at complicated things like dm-thin*, dm-cache, or dm-raid456.  It might be
worth having a regression test that can be applied to all the dm* devices in
rapid succession so we can find out.

Of the callers that I audited, ext4, xfs, btrfs, and f2fs silently ignore the
EOPNOTSUPP code.  jfs and swapfiles printk about the error but take no other
action.  BLKDISCARD (the ioctl) simply returns the error code to userspace.
gfs2 and nilfs2 will disable discards; and fat doesn't check error codes at
all.  None of them panic or otherwise go bonkers.

It seems that most FSes expect transient discard failures and/or partial
support.

> When I implemented dm_table_supports_discards() I consciously allowed a
> DM table to contain a mix of discard support.  I'm now wondering where
> it is we benefit from that?  Seems like more of a liability than

<shrug> Given the result above, I think we can leave it this way if we perform
a code audit, though certainly this cautious approach would mask any other bugs
that could be lurking.

> anything -- so a bigger hammer approach to fixing this would be to
> require all targets and all devices in a DM table support discard.
> Which amounts to changing dm_table_supports_discards() to be like
> dm_table_supports_write_same().
> 
> BTW, given dm_table_supports_write_same(), your patch shouldn't need to
> worry about WRITE SAME.  Did you experience issues with WRITE SAME too
> or were you just being proactive?

I was being (perhaps overly) proactive, at 1am. :)

That part of the if statement can go away.  I'll send a v2.

--D

> 
> Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Martin K. Petersen Feb. 13, 2015, 7:21 p.m. UTC | #3
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> When I implemented dm_table_supports_discards() I consciously
Mike> allowed a DM table to contain a mix of discard support.  I'm now
Mike> wondering where it is we benefit from that?  Seems like more of a
Mike> liability than anything -- so a bigger hammer approach to fixing
Mike> this would be to require all targets and all devices in a DM table
Mike> support discard.

I think our original rationale was that since discard is only a hint it
would be fine to mix and match. And at the time there seemed to be value
in supporting a heterogeneous setups with say a disk drive and an SSD.

Back then the SSD vendors were all busy telling us how crucial discard
would be going forward. However, that turned out not to be the case and
discard often causes more problems than it solves. So I'm perfectly OK
with requiring all devices in a table to have the same capabilities. In
many ways I think that's a cleaner approach.
Darrick J. Wong Feb. 13, 2015, 8:07 p.m. UTC | #4
On Fri, Feb 13, 2015 at 02:21:01PM -0500, Martin K. Petersen wrote:
> >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> 
> Mike> When I implemented dm_table_supports_discards() I consciously
> Mike> allowed a DM table to contain a mix of discard support.  I'm now
> Mike> wondering where it is we benefit from that?  Seems like more of a
> Mike> liability than anything -- so a bigger hammer approach to fixing
> Mike> this would be to require all targets and all devices in a DM table
> Mike> support discard.
> 
> I think our original rationale was that since discard is only a hint it
> would be fine to mix and match. And at the time there seemed to be value
> in supporting a heterogeneous setups with say a disk drive and an SSD.
> 
> Back then the SSD vendors were all busy telling us how crucial discard
> would be going forward. However, that turned out not to be the case and
> discard often causes more problems than it solves. So I'm perfectly OK
> with requiring all devices in a table to have the same capabilities. In
> many ways I think that's a cleaner approach.

I agree that imposing that extra requirement would be cleaner from a software
engineering POV.

That said, given that discard is advisory and most of the callers seem to
anticipate that it might not work, why not allow heterogeneous mixes?  It seems
unfortunate to remove that ability just because there are kernel bugs.  If
you're implementing thin provisioning and are unmapping storage when discard
requests come through, wouldn't it be an antifeature that this suddenly stops
just because something got in the way?  fstrim ought to be able to talk to
the parts of the compound device that can be trimmed.

Second question: Can a dm device detect that q->limits.max_discard_sectors has
changed in one of the devices it's sitting on?  Say I have this:

X -> Y -> SSD1
 \-> Z -> SSD2

SSD*, Z, Y, and X all advertise discard.

Then I change Y to point to a spinny disk:

X -> Y -> HDD
 \-> Z -> SSD2

Even if Y notices that it no longer supports discard, will X figure that out
too?

--D

> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer Feb. 13, 2015, 8:58 p.m. UTC | #5
On Fri, Feb 13 2015 at  3:07P -0500,
Darrick J. Wong <darrick.wong@oracle.com> wrote:

> On Fri, Feb 13, 2015 at 02:21:01PM -0500, Martin K. Petersen wrote:
> > >>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:
> > 
> > Mike> When I implemented dm_table_supports_discards() I consciously
> > Mike> allowed a DM table to contain a mix of discard support.  I'm now
> > Mike> wondering where it is we benefit from that?  Seems like more of a
> > Mike> liability than anything -- so a bigger hammer approach to fixing
> > Mike> this would be to require all targets and all devices in a DM table
> > Mike> support discard.
> > 
> > I think our original rationale was that since discard is only a hint it
> > would be fine to mix and match. And at the time there seemed to be value
> > in supporting a heterogeneous setups with say a disk drive and an SSD.
> > 
> > Back then the SSD vendors were all busy telling us how crucial discard
> > would be going forward. However, that turned out not to be the case and
> > discard often causes more problems than it solves. So I'm perfectly OK
> > with requiring all devices in a table to have the same capabilities. In
> > many ways I think that's a cleaner approach.
> 
> I agree that imposing that extra requirement would be cleaner from a software
> engineering POV.
> 
> That said, given that discard is advisory and most of the callers seem to
> anticipate that it might not work, why not allow heterogeneous mixes?  It seems
> unfortunate to remove that ability just because there are kernel bugs.  If
> you're implementing thin provisioning and are unmapping storage when discard
> requests come through, wouldn't it be an antifeature that this suddenly stops
> just because something got in the way?  fstrim ought to be able to talk to
> the parts of the compound device that can be trimmed.

Yeah, I'm OK with leaving it as is for now (allowing a mix).  So I'll be
picking your dm-io patch up for 3.20-rc.  But if other similar bugs
manifest then we can revisit disallowing a mix.

As for your question on thin provisioning.  The DM thinp target
advertises support for discards even if the underlying data device
doesn't (unless the user explicitly asked for discards to be disabled on
the thin-pool).  This is accomplished with the target's
'discards_supported' override.  So even if we did make the change to
disallowing a mix: the 'discards_supported' override would still enable
discards.

> Second question: Can a dm device detect that q->limits.max_discard_sectors has
> changed in one of the devices it's sitting on?  Say I have this:
> 
> X -> Y -> SSD1
>  \-> Z -> SSD2
> 
> SSD*, Z, Y, and X all advertise discard.
> 
> Then I change Y to point to a spinny disk:
> 
> X -> Y -> HDD
>  \-> Z -> SSD2
> 
> Even if Y notices that it no longer supports discard, will X figure that out
> too?

No, the DM table reload for Y would alter the discard capability of Y
but it will _not_ bubble up to X.

(would be nice if such limit stacking with refresh but...)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Martin K. Petersen Feb. 13, 2015, 9:12 p.m. UTC | #6
>>>>> "Mike" == Mike Snitzer <snitzer@redhat.com> writes:

Mike> Yeah, I'm OK with leaving it as is for now (allowing a mix).

That's fine with me as long as you're comfortable that we're handling
the corner cases correctly.

Mike> So I'll be picking your dm-io patch up for 3.20-rc.

Should also go to stable.

Mike> No, the DM table reload for Y would alter the discard capability
Mike> of Y but it will _not_ bubble up to X.

Mike> (would be nice if such limit stacking with refresh but...)

Yeah. We keep talking about this. Maybe we can have a quick brain storm
at LSF/MM?
Mikulas Patocka Feb. 26, 2015, 7:56 p.m. UTC | #7
On Fri, 13 Feb 2015, Mike Snitzer wrote:

> On Fri, Feb 13 2015 at  4:24am -0500,
> Darrick J. Wong <darrick.wong@oracle.com> wrote:
> 
> > I created a dm-raid1 device backed by a device that supports DISCARD
> > and another device that does NOT support DISCARD with the following
> > dm configuration:
> > 
> > # echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
> > # lsblk -D
> > NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> > sda                 0        4K       1G         0
> > `-moo (dm-0)        0        4K       1G         0
> > sdb                 0        0B       0B         0
> > `-moo (dm-0)        0        4K       1G         0
> > 
> > Notice that the mirror device /dev/mapper/moo advertises DISCARD
> > support even though one of the mirror halves doesn't.
> > 
> > If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
> > BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
> > loop in do_region() when it tries to issue a DISCARD request to sdb.
> > The problem is that when we call do_region() against sdb, num_sectors
> > is set to zero because q->limits.max_discard_sectors is zero.
> > Therefore, "remaining" never decreases and the loop never terminates.
> > 
> > Before entering the loop, check for the combination of REQ_DISCARD and
> > no discard and return -EOPNOTSUPP to avoid hanging up the mirror
> > device.  Fix the same problem with WRITE_DISCARD while we're at it.
> > 
> > This bug was found by the unfortunate coincidence of pvmove and a
> > discard operation in the RHEL 6.5 kernel; 3.19 is also affected.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
> 
> Your patch looks fine but it is laser focused on dm-io.  Again, that is
> fine (fixes a real problem).  But I'm wondering how other targets will
> respond in the face of partial discard support across the logical
> address space of the DM device.
> 
> When I implemented dm_table_supports_discards() I consciously allowed a
> DM table to contain a mix of discard support.  I'm now wondering where
> it is we benefit from that?  Seems like more of a liability than
> anything -- so a bigger hammer approach to fixing this would be to
> require all targets and all devices in a DM table support discard.
> Which amounts to changing dm_table_supports_discards() to be like
> dm_table_supports_write_same().
> 
> BTW, given dm_table_supports_write_same(), your patch shouldn't need to
> worry about WRITE SAME.  Did you experience issues with WRITE SAME too
> or were you just being proactive?
> 
> Mike

I think that Darrick's patch is needed even for WRITE SAME.

Note that queue limits and flags can't be reliably prevent bios from 
coming in.

For example:

1. Some piece of code tests queue flags and sees that 
max_write_same_sectors is non-zero, it constructs WRITE_SAME bio and sends 
it with submit_bio.

2. Meanwhile, the device is reconfigured so that it doesn't support 
WRITE_SAME. q->limits.max_write_same_sectors is set to zero.

3. The bio submitted at step 1 can't be reverted, so it arrives at the 
device mapper even if it advertises that it doesn't support write same - 
now, it causes the lockup that Darrick observed.


Another problem is that queue flags are not propagated up when you reload 
a single device - someone could reload a mirror leg with a different dm 
table that doesn't support write_same, and even after the reload, the 
mirror keeps advertising that it does support WRITE_SAME.


Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mikulas Patocka Feb. 26, 2015, 8:10 p.m. UTC | #8
On Thu, 26 Feb 2015, Mikulas Patocka wrote:

> 
> 
> On Fri, 13 Feb 2015, Mike Snitzer wrote:
> 
> > On Fri, Feb 13 2015 at  4:24am -0500,
> > Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > 
> > > I created a dm-raid1 device backed by a device that supports DISCARD
> > > and another device that does NOT support DISCARD with the following
> > > dm configuration:
> > > 
> > > # echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
> > > # lsblk -D
> > > NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> > > sda                 0        4K       1G         0
> > > `-moo (dm-0)        0        4K       1G         0
> > > sdb                 0        0B       0B         0
> > > `-moo (dm-0)        0        4K       1G         0
> > > 
> > > Notice that the mirror device /dev/mapper/moo advertises DISCARD
> > > support even though one of the mirror halves doesn't.
> > > 
> > > If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
> > > BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
> > > loop in do_region() when it tries to issue a DISCARD request to sdb.
> > > The problem is that when we call do_region() against sdb, num_sectors
> > > is set to zero because q->limits.max_discard_sectors is zero.
> > > Therefore, "remaining" never decreases and the loop never terminates.
> > > 
> > > Before entering the loop, check for the combination of REQ_DISCARD and
> > > no discard and return -EOPNOTSUPP to avoid hanging up the mirror
> > > device.  Fix the same problem with WRITE_DISCARD while we're at it.
> > > 
> > > This bug was found by the unfortunate coincidence of pvmove and a
> > > discard operation in the RHEL 6.5 kernel; 3.19 is also affected.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > > Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
> > 
> > Your patch looks fine but it is laser focused on dm-io.  Again, that is
> > fine (fixes a real problem).  But I'm wondering how other targets will
> > respond in the face of partial discard support across the logical
> > address space of the DM device.
> > 
> > When I implemented dm_table_supports_discards() I consciously allowed a
> > DM table to contain a mix of discard support.  I'm now wondering where
> > it is we benefit from that?  Seems like more of a liability than
> > anything -- so a bigger hammer approach to fixing this would be to
> > require all targets and all devices in a DM table support discard.
> > Which amounts to changing dm_table_supports_discards() to be like
> > dm_table_supports_write_same().
> > 
> > BTW, given dm_table_supports_write_same(), your patch shouldn't need to
> > worry about WRITE SAME.  Did you experience issues with WRITE SAME too
> > or were you just being proactive?
> > 
> > Mike
> 
> I think that Darrick's patch is needed even for WRITE SAME.
> 
> Note that queue limits and flags can't be reliably prevent bios from 
> coming in.
> 
> For example:
> 
> 1. Some piece of code tests queue flags and sees that 
> max_write_same_sectors is non-zero, it constructs WRITE_SAME bio and sends 
> it with submit_bio.
> 
> 2. Meanwhile, the device is reconfigured so that it doesn't support 
> WRITE_SAME. q->limits.max_write_same_sectors is set to zero.
> 
> 3. The bio submitted at step 1 can't be reverted, so it arrives at the 
> device mapper even if it advertises that it doesn't support write same - 
> now, it causes the lockup that Darrick observed.
> 
> 
> Another problem is that queue flags are not propagated up when you reload 
> a single device - someone could reload a mirror leg with a different dm 
> table that doesn't support write_same, and even after the reload, the 
> mirror keeps advertising that it does support WRITE_SAME.

It comes to another idea - if the limits change while the do-while loop is 
in progress, even the original Darrick's patch is wrong and fails to 
prevent the lockup. So - we need to read the limits in advance, test them 
and never re-read them.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Darrick J. Wong Feb. 27, 2015, 6:42 p.m. UTC | #9
On Thu, Feb 26, 2015 at 02:56:20PM -0500, Mikulas Patocka wrote:
> 
> 
> On Fri, 13 Feb 2015, Mike Snitzer wrote:
> 
> > On Fri, Feb 13 2015 at  4:24am -0500,
> > Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > 
> > > I created a dm-raid1 device backed by a device that supports DISCARD
> > > and another device that does NOT support DISCARD with the following
> > > dm configuration:
> > > 
> > > # echo '0 2048 mirror core 1 512 2 /dev/sda 0 /dev/sdb 0' | dmsetup create moo
> > > # lsblk -D
> > > NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
> > > sda                 0        4K       1G         0
> > > `-moo (dm-0)        0        4K       1G         0
> > > sdb                 0        0B       0B         0
> > > `-moo (dm-0)        0        4K       1G         0
> > > 
> > > Notice that the mirror device /dev/mapper/moo advertises DISCARD
> > > support even though one of the mirror halves doesn't.
> > > 
> > > If I issue a DISCARD request (via fstrim, mount -o discard, or ioctl
> > > BLKDISCARD) through the mirror, kmirrord gets stuck in an infinite
> > > loop in do_region() when it tries to issue a DISCARD request to sdb.
> > > The problem is that when we call do_region() against sdb, num_sectors
> > > is set to zero because q->limits.max_discard_sectors is zero.
> > > Therefore, "remaining" never decreases and the loop never terminates.
> > > 
> > > Before entering the loop, check for the combination of REQ_DISCARD and
> > > no discard and return -EOPNOTSUPP to avoid hanging up the mirror
> > > device.  Fix the same problem with WRITE_DISCARD while we're at it.
> > > 
> > > This bug was found by the unfortunate coincidence of pvmove and a
> > > discard operation in the RHEL 6.5 kernel; 3.19 is also affected.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > > Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
> > 
> > Your patch looks fine but it is laser focused on dm-io.  Again, that is
> > fine (fixes a real problem).  But I'm wondering how other targets will
> > respond in the face of partial discard support across the logical
> > address space of the DM device.
> > 
> > When I implemented dm_table_supports_discards() I consciously allowed a
> > DM table to contain a mix of discard support.  I'm now wondering where
> > it is we benefit from that?  Seems like more of a liability than
> > anything -- so a bigger hammer approach to fixing this would be to
> > require all targets and all devices in a DM table support discard.
> > Which amounts to changing dm_table_supports_discards() to be like
> > dm_table_supports_write_same().
> > 
> > BTW, given dm_table_supports_write_same(), your patch shouldn't need to
> > worry about WRITE SAME.  Did you experience issues with WRITE SAME too
> > or were you just being proactive?
> > 
> > Mike
> 
> I think that Darrick's patch is needed even for WRITE SAME.
> 
> Note that queue limits and flags can't be reliably prevent bios from 
> coming in.
> 
> For example:
> 
> 1. Some piece of code tests queue flags and sees that 
> max_write_same_sectors is non-zero, it constructs WRITE_SAME bio and sends 
> it with submit_bio.
> 
> 2. Meanwhile, the device is reconfigured so that it doesn't support 
> WRITE_SAME. q->limits.max_write_same_sectors is set to zero.
> 
> 3. The bio submitted at step 1 can't be reverted, so it arrives at the 
> device mapper even if it advertises that it doesn't support write same - 
> now, it causes the lockup that Darrick observed.

I'd pondered patching the WRITE SAME case too.

> Another problem is that queue flags are not propagated up when you reload 
> a single device - someone could reload a mirror leg with a different dm 
> table that doesn't support write_same, and even after the reload, the 
> mirror keeps advertising that it does support WRITE_SAME.

Not sure how to deal with that -- I suppose we could save the 'last recorded q
limits' and watch for changes, but I suppose that depends on how resilient
callers are against DISCARD and WRITE SAME returning -EOPNOTSUPP.  So far as I
can tell the in-kernel caller handles it gracefully enough, and I'd hope that
any sane userland program would know to follow up with a regular write (if
applicable), but who knows...

> It comes to another idea - if the limits change while the do-while loop is 
> in progress, even the original Darrick's patch is wrong and fails to 
> prevent the lockup. So - we need to read the limits in advance, test them 
> and never re-read them.

Agreed.  I /think/ issuing discard/write same to a device that no longer
supports it will return an error... or at least in my crummy 5 minute test it
seemed to work.

--D

> 
> 
> Mikulas
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index c09359d..32d330a 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -290,6 +290,13 @@  static void do_region(int rw, unsigned region, struct dm_io_region *where,
 	unsigned short logical_block_size = queue_logical_block_size(q);
 	sector_t num_sectors;
 
+	/* Reject unsupported discard/write same requests */
+	if (((rw & REQ_DISCARD) && !blk_queue_discard(q)) ||
+	    ((rw & REQ_WRITE_SAME) && !bdev_write_same(where->bdev))) {
+		dec_count(io, region, -EOPNOTSUPP);
+		return;
+	}
+
 	/*
 	 * where->count may be zero if rw holds a flush and we need to
 	 * send a zero-sized flush.