diff mbox

Revert "dm mpath: fix stalls when handling invalid ioctls"

Message ID 1446121463-17828-1-git-send-email-mauricfo@linux.vnet.ibm.com (mailing list archive)
State Accepted, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

Mauricio Faria de Oliveira Oct. 29, 2015, 12:24 p.m. UTC
This reverts commit a1989b330093578ea5470bea0a00f940c444c466.

That commit introduced a regression at least for the case of the SG_IO ioctl()
running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
than blocking due to queue_if_no_path until a path becomes active, for example.

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
from multipath devices; which leads to SCSI/filesystem errors in such a guest.

More general scenarios can hit that regression too. The following demonstration
employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
(some output & user changes omitted for brevity and comments added for clarity).

Reverting that commit restores normal operation (queueing) in failing scenarios;
tested on linux-next (next-20151022).

1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)

    $ cat sg_simple0.c
    ... see [3] ...
    $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
    $ gcc sgio_inquiry.c -o sgio_inquiry

2) The ioctl() works fine with active paths present.

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=active
    | |- 8:0:11:0  sdz  65:144  active undef running
    | `- 9:0:9:0   sdbf 67:144  active undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  active undef running
      `- 9:0:12:0  sdbo 68:32   active undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    Some of the INQUIRY command's response:
        IBM       2145              0000
    INQUIRY duration=0 millisecs, resid=0

3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
   for unprivileged users (rather than blocking due to queue_if_no_path).

    # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
          do multipathd -k"fail path $path"; done

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=enabled
    | |- 8:0:11:0  sdz  65:144  failed undef running
    | `- 9:0:9:0   sdbf 67:144  failed undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  failed undef running
      `- 9:0:12:0  sdbo 68:32   failed undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
   it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().

    $ dmesg
    <...>
    [] device-mapper: multipath: Failing path 65:144.
    [] device-mapper: multipath: Failing path 67:144.
    [] device-mapper: multipath: Failing path 65:224.
    [] device-mapper: multipath: Failing path 68:32.
    [] sgio_inquiry: sending ioctl 2285 to a partition!

5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
   (then queueing happens -- in this example, queue_if_no_path is set);
   this is due to a conditional check in scsi_verify_blk_ioctl().

    # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

    # ./sgio_inquiry /dev/mapper/85ag56 &
    [1] 72830

    # cat /proc/72830/stack
    [<c00000171c0df700>] 0xc00000171c0df700
    [<c000000000015934>] __switch_to+0x204/0x350
    [<c000000000152d4c>] msleep+0x5c/0x80
    [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
    [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
    [<c0000000003128e4>] block_ioctl+0x64/0xd0
    [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
    [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
    [<c000000000009358>] system_call+0x38/0xd0

6) This is the function call chain exercised in this analysis:

SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
    -> do_vfs_ioctl()
        -> vfs_ioctl()
            ...
            error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
            ...
                -> dm_blk_ioctl() @ drivers/md/dm.c
                    -> multipath_ioctl() @ drivers/md/dm-mpath.c
                        ...
                        (bdev = NULL, due to no active paths)
                        ...
                        if (!bdev || <...>) {
                            int err = scsi_verify_blk_ioctl(NULL, cmd);
                            if (err)
                                r = err;
                        }
                        ...
                            -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
                                ...
                                if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
                                    return 0;
                                ...
                                if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
                                    return 0;
                                ...
                                printk_ratelimited(KERN_WARNING
                                           "%s: sending ioctl %x to a partition!\n" <...>);

                                return -ENOIOCTLCMD;
                            <-
                        ...
                        return r ? : <...>
                    <-
            ...
            if (error == -ENOIOCTLCMD)
                error = -ENOTTY;
             out:
                return error;
            ...

Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
[3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
---
 drivers/md/dm-mpath.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

Comments

Mauricio Faria de Oliveira Oct. 29, 2015, 12:33 p.m. UTC | #1
On 10/29/2015 10:24 AM, Mauricio Faria de Oliveira wrote:
> This reverts commit a1989b330093578ea5470bea0a00f940c444c466.

Cc'ing Hannes manually (-cc failed).
Mike Snitzer Oct. 29, 2015, 1:18 p.m. UTC | #2
On Thu, Oct 29 2015 at  8:24am -0400,
Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> wrote:

> This reverts commit a1989b330093578ea5470bea0a00f940c444c466.
> 
> That commit introduced a regression at least for the case of the SG_IO ioctl()
> running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
> are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
> than blocking due to queue_if_no_path until a path becomes active, for example.
> 
> That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
> (qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
> from multipath devices; which leads to SCSI/filesystem errors in such a guest.
> 
> More general scenarios can hit that regression too. The following demonstration
> employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
> (some output & user changes omitted for brevity and comments added for clarity).
> 
> Reverting that commit restores normal operation (queueing) in failing scenarios;
> tested on linux-next (next-20151022).

Obviously !bdev is causing you to take the branch you'd like to change
via revert.  Once in that branch scsi_verify_blk_ioctl() is called.  If
the ioctl is valid the ioctl is allowed to carry on (regardless of
whether there are paths or not).
 
> 1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)
> 
>     $ cat sg_simple0.c
>     ... see [3] ...
>     $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
>     $ gcc sgio_inquiry.c -o sgio_inquiry
> 
> 2) The ioctl() works fine with active paths present.
> 
>     # multipath -l 85ag56
>     85ag56 (...) dm-19 IBM     ,2145
>     size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
>     |-+- policy='service-time 0' prio=0 status=active
>     | |- 8:0:11:0  sdz  65:144  active undef running
>     | `- 9:0:9:0   sdbf 67:144  active undef running
>     `-+- policy='service-time 0' prio=0 status=enabled
>       |- 8:0:12:0  sdae 65:224  active undef running
>       `- 9:0:12:0  sdbo 68:32   active undef running
> 
>     $ ./sgio_inquiry /dev/mapper/85ag56
>     Some of the INQUIRY command's response:
>         IBM       2145              0000
>     INQUIRY duration=0 millisecs, resid=0
> 
> 3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
>    for unprivileged users (rather than blocking due to queue_if_no_path).
> 
>     # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
>           do multipathd -k"fail path $path"; done
> 
>     # multipath -l 85ag56
>     85ag56 (...) dm-19 IBM     ,2145
>     size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
>     |-+- policy='service-time 0' prio=0 status=enabled
>     | |- 8:0:11:0  sdz  65:144  failed undef running
>     | `- 9:0:9:0   sdbf 67:144  failed undef running
>     `-+- policy='service-time 0' prio=0 status=enabled
>       |- 8:0:12:0  sdae 65:224  failed undef running
>       `- 9:0:12:0  sdbo 68:32   failed undef running
> 
>     $ ./sgio_inquiry /dev/mapper/85ag56
>     sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
> 
> 4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
>    it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().
> 
>     $ dmesg
>     <...>
>     [] device-mapper: multipath: Failing path 65:144.
>     [] device-mapper: multipath: Failing path 67:144.
>     [] device-mapper: multipath: Failing path 65:224.
>     [] device-mapper: multipath: Failing path 68:32.
>     [] sgio_inquiry: sending ioctl 2285 to a partition!

So scsi_verify_blk_ioctl() considers the ioctl invalid.

> 5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
>    (then queueing happens -- in this example, queue_if_no_path is set);
>    this is due to a conditional check in scsi_verify_blk_ioctl().
> 
>     # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
>     sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
> 
>     # ./sgio_inquiry /dev/mapper/85ag56 &
>     [1] 72830

Sorry but I fail to see why your request to revert is viable.
Commit a1989b330093578ea5470bea0a00f940c444c466 fixes issues seen with
udevd (namely "udevd[]: worker [] unexpectedly returned with status 0x0100")

Wouldn't the correct fix be to train scsi_verify_blk_ioctl() to work
even without SYS_CAP_RAWIO?

I'm not doubting that the commit caused problems for the case you care
about (unprivledged users issueing ioctls) but that is an entirely
different issue that needs to be discussed more directly (with the
broader linux-scsi community, now cc'd).

Mike


p.s. leaving full context for linux-scsi so they don't need to track
down other mails (btw, thanks for the detailed patch header but it
enabled me to be skeptical of your request to revert):

> 6) This is the function call chain exercised in this analysis:
> 
> SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
>     -> do_vfs_ioctl()
>         -> vfs_ioctl()
>             ...
>             error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
>             ...
>                 -> dm_blk_ioctl() @ drivers/md/dm.c
>                     -> multipath_ioctl() @ drivers/md/dm-mpath.c
>                         ...
>                         (bdev = NULL, due to no active paths)
>                         ...
>                         if (!bdev || <...>) {
>                             int err = scsi_verify_blk_ioctl(NULL, cmd);
>                             if (err)
>                                 r = err;
>                         }
>                         ...
>                             -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
>                                 ...
>                                 if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
>                                     return 0;
>                                 ...
>                                 if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
>                                     return 0;
>                                 ...
>                                 printk_ratelimited(KERN_WARNING
>                                            "%s: sending ioctl %x to a partition!\n" <...>);
> 
>                                 return -ENOIOCTLCMD;
>                             <-
>                         ...
>                         return r ? : <...>
>                     <-
>             ...
>             if (error == -ENOIOCTLCMD)
>                 error = -ENOTTY;
>              out:
>                 return error;
>             ...
> 
> Links:
> [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
> [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
> [3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)
> 
> Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
> ---
>  drivers/md/dm-mpath.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 5a67671..bdc96cd 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -1569,11 +1569,8 @@ static int multipath_ioctl(struct dm_target *ti, unsigned int cmd,
>  	/*
>  	 * Only pass ioctls through if the device sizes match exactly.
>  	 */
> -	if (!bdev || ti->len != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT) {
> -		int err = scsi_verify_blk_ioctl(NULL, cmd);
> -		if (err)
> -			r = err;
> -	}
> +	if (!r && ti->len != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT)
> +		r = scsi_verify_blk_ioctl(NULL, cmd);
>  
>  	if (r == -ENOTCONN && !fatal_signal_pending(current)) {
>  		spin_lock_irqsave(&m->lock, flags);
> -- 
> 1.9.1
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mauricio Faria de Oliveira Oct. 29, 2015, 2:47 p.m. UTC | #3
Hi Mike,

On 10/29/2015 11:18 AM, Mike Snitzer wrote:
> Sorry but I fail to see why your request to revert is viable.

No problem. Thanks for moving this for a discussion on a proper fix.

I'm somewhat new to kernel and SCSI workings and its community process,
so that's certainly appreciated.

> Wouldn't the correct fix be to train scsi_verify_blk_ioctl() to work
> even without SYS_CAP_RAWIO?

I see it would fix the problem as well, but I don't happen to know
all of its usages, so I'd have to defer to question to someone who
knows more of it.

> I'm not doubting that the commit caused problems for the case you care
> about (unprivledged users issueing ioctls) but that is an entirely
> different issue that needs to be discussed more directly (with the
> broader linux-scsi community, now cc'd).

Sure. Thanks for opening the discussion.

> p.s. leaving full context for linux-scsi so they don't need to track
> down other mails (btw, thanks for the detailed patch header but it
> enabled me to be skeptical of your request to revert):

You're welcome. If it's been useful for rejecting this patch and
getting a better one later, it's worth it. :)

Kind regards,
Paolo Bonzini Oct. 31, 2015, 3:33 p.m. UTC | #4
On 29/10/2015 14:18, Mike Snitzer wrote:
> > 4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
> >    it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().
> > 
> >     $ dmesg
> >     <...>
> >     [] device-mapper: multipath: Failing path 65:144.
> >     [] device-mapper: multipath: Failing path 67:144.
> >     [] device-mapper: multipath: Failing path 65:224.
> >     [] device-mapper: multipath: Failing path 68:32.
> >     [] sgio_inquiry: sending ioctl 2285 to a partition!
> 
> So scsi_verify_blk_ioctl() considers the ioctl invalid.

But that's wrong, I think.  It's a false positive in
scsi_verify_blk_ioctl().

If the ioctl is valid when bdev becomes non-NULL (and it will be if
ti->len becomes equal to i_size_read(bdev->bd_inode) >> SECTOR_SHIFT),
you should not return -ENOIOCTLCMD aka ENOTTY, because userspace doesn't
think the ioctls can go away and come back.  So Hannes's patch broke the
userspace ABI. :(

Paolo

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer Oct. 31, 2015, 6:13 p.m. UTC | #5
On Sat, Oct 31 2015 at 11:33am -0400,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 29/10/2015 14:18, Mike Snitzer wrote:
> > > 4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
> > >    it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().
> > > 
> > >     $ dmesg
> > >     <...>
> > >     [] device-mapper: multipath: Failing path 65:144.
> > >     [] device-mapper: multipath: Failing path 67:144.
> > >     [] device-mapper: multipath: Failing path 65:224.
> > >     [] device-mapper: multipath: Failing path 68:32.
> > >     [] sgio_inquiry: sending ioctl 2285 to a partition!
> > 
> > So scsi_verify_blk_ioctl() considers the ioctl invalid.
> 
> But that's wrong, I think.  It's a false positive in
> scsi_verify_blk_ioctl().
> 
> If the ioctl is valid when bdev becomes non-NULL (and it will be if
> ti->len becomes equal to i_size_read(bdev->bd_inode) >> SECTOR_SHIFT),
> you should not return -ENOIOCTLCMD aka ENOTTY, because userspace doesn't
> think the ioctls can go away and come back.  So Hannes's patch broke the
> userspace ABI. :(

Huh?  All that Hannes' patch did was add early verification of the ioctl
if there are no paths, since: there is no point queueing an ioctl that
is invalid.

But looking just now, Christoph's recent ioctl refactoring that I staged
for 4.4 does seem to subtley revert Hannes' change:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.4&id=40cf639be1db8cc2b8183fe2ccd390ca77b90396
With hch's change multipath_prepare_ioctl() will _not_ do early
verification of the ioctl if no paths are available (and
queue_if_no_path is configured).  Because the call to
scsi_verify_blk_ioctl() was moved to dm_blk_ioctl() and is only called
if the return is > 0 (again -ENOTCONN is being returned).

Not to mention hch's lifting of scsi_verify_blk_ioctl() into DM core's
dm_blk_ioctl() -- likely motivated by not requiring all targets to do
the call like they were doing -- should really be done as part of the
new DM target .prepare_ioctl hook.

Christoph, I think your DM ioctl changes need more review/work.. which
implies they'll very likely miss 4.4.. sorry.

Anyway, all DM specifics aside:

The point is scsi_verify_blk_ioctl() is saying the ioctl isn't valid.
It has nothing to do with the existance of a bdev or not; but everything
to do with the unprivledged user's request to issue an ioctl.

Paolo, AFAIK unprivledged ioctls is one of your pet-projects so your
insight on what, if anything, needs changing to support them is the
insight I think we need.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paolo Bonzini Oct. 31, 2015, 7:07 p.m. UTC | #6
On 31/10/2015 19:13, Mike Snitzer wrote:
> > But that's wrong, I think.  It's a false positive in
> > scsi_verify_blk_ioctl().
> > 
> > If the ioctl is valid when bdev becomes non-NULL (and it will be if
> > ti->len becomes equal to i_size_read(bdev->bd_inode) >> SECTOR_SHIFT),
> > you should not return -ENOIOCTLCMD aka ENOTTY, because userspace doesn't
> > think the ioctls can go away and come back.  So Hannes's patch broke the
> > userspace ABI. :(
> 
> Huh?  All that Hannes' patch did was add early verification of the ioctl
> if there are no paths, since: there is no point queueing an ioctl that
> is invalid.
> 
> [snip discussion of Christoph's patches]
> 
> The point is scsi_verify_blk_ioctl() is saying the ioctl isn't valid.
> It has nothing to do with the existance of a bdev or not; but everything
> to do with the unprivledged user's request to issue an ioctl.

... but the call is skipped (and all ioctls are valid) if ti->len ==
i_size_read(bdev->bd_inode) >> SECTOR_SHIFT.  Therefore, until you have
the bdev you don't know which ioctls are valid, and you must assume all
of them are.  You can't do anything unsafe anyway until you have the
bdev.  This is the reasoning prior to Hannes's change.

Afterwards, you end up calling scsi_verify_blk_ioctl() when bdev ==
NULL.  If the future bdev satisfies the above condition on ti->len, this
means that ioctl(SG_IO) switches from ENOTTY to available.  Userspace is
clearly not expecting that.

> Paolo, AFAIK unprivledged ioctls is one of your pet-projects so your
> insight on what, if anything, needs changing to support them is the
> insight I think we need.

I hope the above provides some extra information.

Paolo

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer Oct. 31, 2015, 10:47 p.m. UTC | #7
On Sat, Oct 31 2015 at  3:07pm -0400,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 31/10/2015 19:13, Mike Snitzer wrote:
> > > But that's wrong, I think.  It's a false positive in
> > > scsi_verify_blk_ioctl().
> > > 
> > > If the ioctl is valid when bdev becomes non-NULL (and it will be if
> > > ti->len becomes equal to i_size_read(bdev->bd_inode) >> SECTOR_SHIFT),
> > > you should not return -ENOIOCTLCMD aka ENOTTY, because userspace doesn't
> > > think the ioctls can go away and come back.  So Hannes's patch broke the
> > > userspace ABI. :(
> > 
> > Huh?  All that Hannes' patch did was add early verification of the ioctl
> > if there are no paths, since: there is no point queueing an ioctl that
> > is invalid.
> > 
> > [snip discussion of Christoph's patches]
> > 
> > The point is scsi_verify_blk_ioctl() is saying the ioctl isn't valid.
> > It has nothing to do with the existance of a bdev or not; but everything
> > to do with the unprivledged user's request to issue an ioctl.
> 
> ... but the call is skipped (and all ioctls are valid) if ti->len ==
> i_size_read(bdev->bd_inode) >> SECTOR_SHIFT.  Therefore, until you have
> the bdev you don't know which ioctls are valid, and you must assume all
> of them are.  You can't do anything unsafe anyway until you have the
> bdev.  This is the reasoning prior to Hannes's change.

Yes, with your commit ec8013be ("dm: do not forward ioctls from logical
volumes to the underlying device") you added protections to disallow
issuing ioctls to a partition that could impact the rest of the device.

Given that I can see why you're seizing on the "ti->len !=
i_size_read(bdev->bd_inode) >> SECTOR_SHIFT" negative checks that gate
the call to scsi_verify_blk_ioctl().

> Afterwards, you end up calling scsi_verify_blk_ioctl() when bdev ==
> NULL.  If the future bdev satisfies the above condition on ti->len, this
> means that ioctl(SG_IO) switches from ENOTTY to available.  Userspace is
> clearly not expecting that.

For Hannes, and in my head, it didn't matter if a future bdev satisfies
the length condition.  I don't think Hannes was trying to guard against
dangerous partition ioctls being issues by udev... but now I _do_
question what exactly _is_ the point of his commit 21a2807bc3f.  Which
invalid ioctls, from udev, did Hannes' change actually disallow?

I obviously should've asked more questions before now!

At the time, I thought scsi_verify_blk_ioctl() was generally useful.
But it clearly only makes sense in the narrow context of guarding
against partition ioctls expanding their target to the parent block
device.

(scsi_verify_blk_ioctl should be renamed to something like
'scsi_verify_blk_ioctl_partition_safe' but I digress)
 
> > Paolo, AFAIK unprivledged ioctls is one of your pet-projects so your
> > insight on what, if anything, needs changing to support them is the
> > insight I think we need.
> 
> I hope the above provides some extra information.

It did, but you made me work for it ;)

I could've sworn that unprivledged users (without CAP_SYS_RAWIO)
wouldn't be allowed to issue ioctls.  Am I completely mistaken?  Or is
it still contentious and DM-mpath removing the ability to allow these
unprivledged ioctls (as a side-effect of Hannes' commit ec8013be) makes
your life, and other virt users' lives, harder?

Apologies for being out of touch on what the latest is on this
unprivledged ioctl issue.

Given Hannes' commit ec8013be was marked for stable and went in some
time ago we really need to get to resolve this ASAP.

To that end, I'm going to prep a tree that first applies Mauricio's
patch: https://patchwork.kernel.org/patch/7518601/

And I'll then rebase Christoph's DM ioctl changes ontop of it.

Hannes will need to revalidate his desire to verify ioctls in the
context of a dm-multipath device without paths. (sorry Hannes)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Hannes Reinecke Nov. 2, 2015, 7:28 a.m. UTC | #8
On 10/31/2015 11:47 PM, Mike Snitzer wrote:
> On Sat, Oct 31 2015 at  3:07pm -0400,
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>>
>>
>> On 31/10/2015 19:13, Mike Snitzer wrote:
>>>> But that's wrong, I think.  It's a false positive in
>>>> scsi_verify_blk_ioctl().
>>>>
>>>> If the ioctl is valid when bdev becomes non-NULL (and it will be if
>>>> ti->len becomes equal to i_size_read(bdev->bd_inode) >> SECTOR_SHIFT),
>>>> you should not return -ENOIOCTLCMD aka ENOTTY, because userspace doesn't
>>>> think the ioctls can go away and come back.  So Hannes's patch broke the
>>>> userspace ABI. :(
>>>
>>> Huh?  All that Hannes' patch did was add early verification of the ioctl
>>> if there are no paths, since: there is no point queueing an ioctl that
>>> is invalid.
>>>
>>> [snip discussion of Christoph's patches]
>>>
>>> The point is scsi_verify_blk_ioctl() is saying the ioctl isn't valid.
>>> It has nothing to do with the existance of a bdev or not; but everything
>>> to do with the unprivledged user's request to issue an ioctl.
>>
>> ... but the call is skipped (and all ioctls are valid) if ti->len ==
>> i_size_read(bdev->bd_inode) >> SECTOR_SHIFT.  Therefore, until you have
>> the bdev you don't know which ioctls are valid, and you must assume all
>> of them are.  You can't do anything unsafe anyway until you have the
>> bdev.  This is the reasoning prior to Hannes's change.
> 
> Yes, with your commit ec8013be ("dm: do not forward ioctls from logical
> volumes to the underlying device") you added protections to disallow
> issuing ioctls to a partition that could impact the rest of the device.
> 
> Given that I can see why you're seizing on the "ti->len !=
> i_size_read(bdev->bd_inode) >> SECTOR_SHIFT" negative checks that gate
> the call to scsi_verify_blk_ioctl().
> 
>> Afterwards, you end up calling scsi_verify_blk_ioctl() when bdev ==
>> NULL.  If the future bdev satisfies the above condition on ti->len, this
>> means that ioctl(SG_IO) switches from ENOTTY to available.  Userspace is
>> clearly not expecting that.
> 
> For Hannes, and in my head, it didn't matter if a future bdev satisfies
> the length condition.  I don't think Hannes was trying to guard against
> dangerous partition ioctls being issues by udev... but now I _do_
> question what exactly _is_ the point of his commit 21a2807bc3f.  Which
> invalid ioctls, from udev, did Hannes' change actually disallow?
> 
The reasoning is thus:

With the original code we would need to wait for path activation
before we would be able to figure out if the ioctl is valid.
However, the callback to verify the ioctl is sd_ioctl(), which as
a first step calls scsi_verify_ioctl().
So my reasoning was that we can as well call scsi_verify_ioctl()
early, and allow it to filter out known invalid ioctls.
Then we wouldn't need to wait for path activation and return
immediately.

Incidentally, in the 'r == -ENOTCONN' case, we're waiting
for path activation, but then just bail out with -ENOTCONN.
As we're not resetting -ENOTCONN, where's the point in activate the
paths here?
Shouldn't we retry to figure out if -ENOTCONN is still true?

Cheers,

Hannes
Paolo Bonzini Nov. 2, 2015, 9:55 a.m. UTC | #9
On 31/10/2015 23:47, Mike Snitzer wrote:
> Yes, with your commit ec8013be ("dm: do not forward ioctls from logical
> volumes to the underlying device") you added protections to disallow
> issuing ioctls to a partition that could impact the rest of the device.
> 
> Given that I can see why you're seizing on the "ti->len !=
> i_size_read(bdev->bd_inode) >> SECTOR_SHIFT" negative checks that gate
> the call to scsi_verify_blk_ioctl().

Right.

> For Hannes, and in my head, it didn't matter if a future bdev satisfies
> the length condition.

I agree actually.  The only problem is that the returned errno value is
ENOTTY, and to userspace that "sounds like" a future bdev will not make
the ioctl valid.

> I could've sworn that unprivledged users (without CAP_SYS_RAWIO)
> wouldn't be allowed to issue ioctls.  Am I completely mistaken?

They are allowed to issue ioctls.

CAP_SYS_RAWIO changes that to also allow issuing of ioctls to
partitions.  That was required by Linus for backwards compatibility.

> Or is
> it still contentious and DM-mpath removing the ability to allow these
> unprivledged ioctls (as a side-effect of Hannes' commit ec8013be) makes
> your life, and other virt users' lives, harder?

Yes, it would.  virt runs as an unprivileged user (so does CD burning,
which was the original reason to let SG_IO run by unprivileged users;
there are probably other use cases).

Paolo

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paolo Bonzini Nov. 2, 2015, 9:57 a.m. UTC | #10
On 02/11/2015 08:28, Hannes Reinecke wrote:
> 
> With the original code we would need to wait for path activation
> before we would be able to figure out if the ioctl is valid.
> However, the callback to verify the ioctl is sd_ioctl(), which as
> a first step calls scsi_verify_ioctl().
> So my reasoning was that we can as well call scsi_verify_ioctl()
> early, and allow it to filter out known invalid ioctls.
> Then we wouldn't need to wait for path activation and return
> immediately.

That in principle makes sense.  Unfortunately, before path activation
you can only find out if a ioctl is valid.  You cannot find out which
ioctls are _in_valid, because path activation sets the bdev and that
might make all ioctls valid.

> Incidentally, in the 'r == -ENOTCONN' case, we're waiting
> for path activation, but then just bail out with -ENOTCONN.
> As we're not resetting -ENOTCONN, where's the point in activate the
> paths here?
> Shouldn't we retry to figure out if -ENOTCONN is still true?

That makes sense too.  You've done the work, might as well reap the
benefits...

Paolo

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer Nov. 2, 2015, 1:31 p.m. UTC | #11
On Mon, Nov 02 2015 at  2:28am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 10/31/2015 11:47 PM, Mike Snitzer wrote:
> > 
> > For Hannes, and in my head, it didn't matter if a future bdev satisfies
> > the length condition.  I don't think Hannes was trying to guard against
> > dangerous partition ioctls being issued by udev... but now I _do_
> > question what exactly _is_ the point of his commit 21a2807bc3f.  Which
> > invalid ioctls, from udev, did Hannes' change actually disallow?
> > 

FYI, I meant s/21a2807bc3f/a1989b33/

> The reasoning is thus:
> 
> With the original code we would need to wait for path activation
> before we would be able to figure out if the ioctl is valid.
> However, the callback to verify the ioctl is sd_ioctl(), which as
> a first step calls scsi_verify_ioctl().
> So my reasoning was that we can as well call scsi_verify_ioctl()
> early, and allow it to filter out known invalid ioctls.
> Then we wouldn't need to wait for path activation and return
> immediately.

Right, I understood that to be your intent.  And it seemed reasonible.
Unfortuntely, as we now know, scsi_verify_blk_ioctl() only really cares
to filter out ioctls that are invalid for partitions.

I'm still curious: which ioctls were being issued by udev that your
commit a1989b33 "fixed" (so udev workers didn't hang)?  Is the right fix
to modify udev to not issue such ioctls?  What was invalid about the
udev issued ioctls?

> Incidentally, in the 'r == -ENOTCONN' case, we're waiting
> for path activation, but then just bail out with -ENOTCONN.
> As we're not resetting -ENOTCONN, where's the point in activate the
> paths here?
> Shouldn't we retry to figure out if -ENOTCONN is still true?

We do, in DM core, see _your_ commit that added this ;)
6c182cd88d ("dm mpath: fix ioctl deadlock when no paths")

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Hannes Reinecke Nov. 2, 2015, 1:56 p.m. UTC | #12
On 11/02/2015 02:31 PM, Mike Snitzer wrote:
> On Mon, Nov 02 2015 at  2:28am -0500,
> Hannes Reinecke <hare@suse.de> wrote:
> 
>> On 10/31/2015 11:47 PM, Mike Snitzer wrote:
>>>
>>> For Hannes, and in my head, it didn't matter if a future bdev satisfies
>>> the length condition.  I don't think Hannes was trying to guard against
>>> dangerous partition ioctls being issued by udev... but now I _do_
>>> question what exactly _is_ the point of his commit 21a2807bc3f.  Which
>>> invalid ioctls, from udev, did Hannes' change actually disallow?
>>>
> 
> FYI, I meant s/21a2807bc3f/a1989b33/
> 
>> The reasoning is thus:
>>
>> With the original code we would need to wait for path activation
>> before we would be able to figure out if the ioctl is valid.
>> However, the callback to verify the ioctl is sd_ioctl(), which as
>> a first step calls scsi_verify_ioctl().
>> So my reasoning was that we can as well call scsi_verify_ioctl()
>> early, and allow it to filter out known invalid ioctls.
>> Then we wouldn't need to wait for path activation and return
>> immediately.
> 
> Right, I understood that to be your intent.  And it seemed reasonible.
> Unfortuntely, as we now know, scsi_verify_blk_ioctl() only really cares
> to filter out ioctls that are invalid for partitions.
> 
> I'm still curious: which ioctls were being issued by udev that your
> commit a1989b33 "fixed" (so udev workers didn't hang)?  Is the right fix
> to modify udev to not issue such ioctls?  What was invalid about the
> udev issued ioctls?
> 
Thing is, it's not udev itself which issues ioctl. It's the programs
started by udev via the various rules (PROGRAM or IMPORT keywords).
They might (and occasionally, will) issue ioctls, and we have no
means of controlling them.

>> Incidentally, in the 'r == -ENOTCONN' case, we're waiting
>> for path activation, but then just bail out with -ENOTCONN.
>> As we're not resetting -ENOTCONN, where's the point in activate the
>> paths here?
>> Shouldn't we retry to figure out if -ENOTCONN is still true?
> 
> We do, in DM core, see _your_ commit that added this ;)
> 6c182cd88d ("dm mpath: fix ioctl deadlock when no paths")
> 
Indeed.

But then the real question remains:

What is the 'correct' behaviour for ioctls when no path retry
is active (or when no paths are present)?

Should we start path activation?
If so, should we wait for path activation to finish, risking udev
killing the worker for that event (udev has a built-in timeout of
120 seconds, which we might easily exceed when we need to activate
paths for large installations or slow path activation ... just
thinking of NetApp takeover/giveback cycle)?
If we're not waiting for path activation, where would be the point
in starting it, seeing that we're not actually interested in the result?
And if we shouldn't start a path activation, what is the point of
having code for it in the first place?
Couldn't we just rip it out altogether, and avoid much of the
current unpleasantness?

Cheers,

Hannes
Mike Snitzer Nov. 2, 2015, 2:12 p.m. UTC | #13
On Mon, Nov 02 2015 at  8:56am -0500,
Hannes Reinecke <hare@suse.de> wrote:

> On 11/02/2015 02:31 PM, Mike Snitzer wrote:
> > On Mon, Nov 02 2015 at  2:28am -0500,
> > Hannes Reinecke <hare@suse.de> wrote:
> > 
> >> On 10/31/2015 11:47 PM, Mike Snitzer wrote:
> >>>
> >>> For Hannes, and in my head, it didn't matter if a future bdev satisfies
> >>> the length condition.  I don't think Hannes was trying to guard against
> >>> dangerous partition ioctls being issued by udev... but now I _do_
> >>> question what exactly _is_ the point of his commit 21a2807bc3f.  Which
> >>> invalid ioctls, from udev, did Hannes' change actually disallow?
> >>>
> > 
> > FYI, I meant s/21a2807bc3f/a1989b33/
> > 
> >> The reasoning is thus:
> >>
> >> With the original code we would need to wait for path activation
> >> before we would be able to figure out if the ioctl is valid.
> >> However, the callback to verify the ioctl is sd_ioctl(), which as
> >> a first step calls scsi_verify_ioctl().
> >> So my reasoning was that we can as well call scsi_verify_ioctl()
> >> early, and allow it to filter out known invalid ioctls.
> >> Then we wouldn't need to wait for path activation and return
> >> immediately.
> > 
> > Right, I understood that to be your intent.  And it seemed reasonible.
> > Unfortuntely, as we now know, scsi_verify_blk_ioctl() only really cares
> > to filter out ioctls that are invalid for partitions.
> > 
> > I'm still curious: which ioctls were being issued by udev that your
> > commit a1989b33 "fixed" (so udev workers didn't hang)?  Is the right fix
> > to modify udev to not issue such ioctls?  What was invalid about the
> > udev issued ioctls?
> > 
> Thing is, it's not udev itself which issues ioctl. It's the programs
> started by udev via the various rules (PROGRAM or IMPORT keywords).
> They might (and occasionally, will) issue ioctls, and we have no
> means of controlling them.
> 
> >> Incidentally, in the 'r == -ENOTCONN' case, we're waiting
> >> for path activation, but then just bail out with -ENOTCONN.
> >> As we're not resetting -ENOTCONN, where's the point in activate the
> >> paths here?
> >> Shouldn't we retry to figure out if -ENOTCONN is still true?
> > 
> > We do, in DM core, see _your_ commit that added this ;)
> > 6c182cd88d ("dm mpath: fix ioctl deadlock when no paths")
> > 
> Indeed.
> 
> But then the real question remains:
> 
> What is the 'correct' behaviour for ioctls when no path retry
> is active (or when no paths are present)?
> 
> Should we start path activation?
> If so, should we wait for path activation to finish, risking udev
> killing the worker for that event (udev has a built-in timeout of
> 120 seconds, which we might easily exceed when we need to activate
> paths for large installations or slow path activation ... just
> thinking of NetApp takeover/giveback cycle)?
> If we're not waiting for path activation, where would be the point
> in starting it, seeing that we're not actually interested in the result?

We only start it: if (r == -ENOTCONN && !fatal_signal_pending(current))

-ENOTCONN is set with:
if ((pgpath && m->queue_io) || (!pgpath && m->queue_if_no_path))

> And if we shouldn't start a path activation, what is the point of
> having code for it in the first place?

The point is we have reason to believe paths will be coming back or that
the user wants to queue_if_no_path.

> Couldn't we just rip it out altogether, and avoid much of the
> current unpleasantness?

Sorry, not seeing how you're getting there.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Paolo Bonzini Nov. 2, 2015, 2:52 p.m. UTC | #14
On 02/11/2015 14:56, Hannes Reinecke wrote:
> But then the real question remains:
> 
> What is the 'correct' behaviour for ioctls when no path retry
> is active (or when no paths are present)?
> 
> Should we start path activation?
> If so, should we wait for path activation to finish, risking udev
> killing the worker for that event (udev has a built-in timeout of
> 120 seconds, which we might easily exceed when we need to activate
> paths for large installations or slow path activation ... just
> thinking of NetApp takeover/giveback cycle)?
> If we're not waiting for path activation, where would be the point
> in starting it, seeing that we're not actually interested in the result?
> And if we shouldn't start a path activation, what is the point of
> having code for it in the first place?

That's a fair question, and it depends on what said udev worker actually
does.

In any case, if we don't start path activation we should return
ENOTCONN, not ENOTTY.

Paolo

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer Nov. 2, 2015, 3:05 p.m. UTC | #15
On Mon, Nov 02 2015 at  9:52am -0500,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 02/11/2015 14:56, Hannes Reinecke wrote:
> > But then the real question remains:
> > 
> > What is the 'correct' behaviour for ioctls when no path retry
> > is active (or when no paths are present)?
> > 
> > Should we start path activation?
> > If so, should we wait for path activation to finish, risking udev
> > killing the worker for that event (udev has a built-in timeout of
> > 120 seconds, which we might easily exceed when we need to activate
> > paths for large installations or slow path activation ... just
> > thinking of NetApp takeover/giveback cycle)?
> > If we're not waiting for path activation, where would be the point
> > in starting it, seeing that we're not actually interested in the result?
> > And if we shouldn't start a path activation, what is the point of
> > having code for it in the first place?
> 
> That's a fair question, and it depends on what said udev worker actually
> does.
> 
> In any case, if we don't start path activation we should return
> ENOTCONN, not ENOTTY.

Currently, if we don't start path activation we're returning EIO.
ENOTCONN is used for when we do start path activation (and ENOTCONN is
the means for DM core to retry)

We _could_ change the ENOTCONN to be EAGAIN and EIO to ENOTCONN...

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Hannes Reinecke Nov. 2, 2015, 3:32 p.m. UTC | #16
On 11/02/2015 03:52 PM, Paolo Bonzini wrote:
> 
> 
> On 02/11/2015 14:56, Hannes Reinecke wrote:
>> But then the real question remains:
>>
>> What is the 'correct' behaviour for ioctls when no path retry
>> is active (or when no paths are present)?
>>
>> Should we start path activation?
>> If so, should we wait for path activation to finish, risking udev
>> killing the worker for that event (udev has a built-in timeout of
>> 120 seconds, which we might easily exceed when we need to activate
>> paths for large installations or slow path activation ... just
>> thinking of NetApp takeover/giveback cycle)?
>> If we're not waiting for path activation, where would be the point
>> in starting it, seeing that we're not actually interested in the result?
>> And if we shouldn't start a path activation, what is the point of
>> having code for it in the first place?
> 
> That's a fair question, and it depends on what said udev worker actually
> does.
> 
Point is, we have no idea whatsoever what udev might be doing.

And experience shows that while most admins/distros etc have a fair
knowledge of writing valid udev rules, more often than not the
'queue_if_no_path' feature from multipath totally escapes them.
So we will be finding ourselves in positions where programs from
udev rules issue ioctls to devices where queue_if_no_path is active.

Plus it's a really daunting task to validate all udev rules to check
if any of those programs might issue ioctls and blank them out for
the queue_if_no_path case.
I did a review once for SUSE, but even that might be obsoleted now
as rules are being changed all the time.

> In any case, if we don't start path activation we should return
> ENOTCONN, not ENOTTY.
> 
No problem with that.

Cheers,

Hannes
Paolo Bonzini Nov. 2, 2015, 3:45 p.m. UTC | #17
On 02/11/2015 16:05, Mike Snitzer wrote:
> > In any case, if we don't start path activation we should return
> > ENOTCONN, not ENOTTY.
> 
> Currently, if we don't start path activation we're returning EIO.
> ENOTCONN is used for when we do start path activation (and ENOTCONN is
> the means for DM core to retry)
> 
> We _could_ change the ENOTCONN to be EAGAIN and EIO to ENOTCONN...

This makes sense... though of course testing the impact of this on
userspace is going to be hard. :(  Chances are that userspace is not
expecting EAGAIN either.

Even if they did, how would someone know that they can now retry the
ioctl after getting EAGAIN?  Should they just do it in a loop?

Paolo

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Mike Snitzer Nov. 2, 2015, 3:49 p.m. UTC | #18
On Mon, Nov 02 2015 at 10:45am -0500,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> 
> On 02/11/2015 16:05, Mike Snitzer wrote:
> > > In any case, if we don't start path activation we should return
> > > ENOTCONN, not ENOTTY.
> > 
> > Currently, if we don't start path activation we're returning EIO.
> > ENOTCONN is used for when we do start path activation (and ENOTCONN is
> > the means for DM core to retry)
> > 
> > We _could_ change the ENOTCONN to be EAGAIN and EIO to ENOTCONN...
> 
> This makes sense... though of course testing the impact of this on
> userspace is going to be hard. :(  Chances are that userspace is not
> expecting EAGAIN either.
> 
> Even if they did, how would someone know that they can now retry the
> ioctl after getting EAGAIN?  Should they just do it in a loop?

Turns out multipath (userspace) has a udev rule for this now (prajnoha
pointed this out):
http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=blob;f=multipath/11-dm-mpath.rules

So now I'm wondering if we _need_ to do any retries in kernel (aside
from while activation is active)?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 5a67671..bdc96cd 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1569,11 +1569,8 @@  static int multipath_ioctl(struct dm_target *ti, unsigned int cmd,
 	/*
 	 * Only pass ioctls through if the device sizes match exactly.
 	 */
-	if (!bdev || ti->len != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT) {
-		int err = scsi_verify_blk_ioctl(NULL, cmd);
-		if (err)
-			r = err;
-	}
+	if (!r && ti->len != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT)
+		r = scsi_verify_blk_ioctl(NULL, cmd);
 
 	if (r == -ENOTCONN && !fatal_signal_pending(current)) {
 		spin_lock_irqsave(&m->lock, flags);