scsi: BUG in scsi_init_io

Message ID	20170131092048.GB3687@linux-x5ow.site (mailing list archive)
State	Changes Requested, archived
Headers	show Return-Path: <linux-scsi-owner@kernel.org> Date: Tue, 31 Jan 2017 10:20:48 +0100 From: Johannes Thumshirn <jthumshirn@suse.de> To: Dmitry Vyukov <dvyukov@google.com> Cc: jejb@linux.vnet.ibm.com, "Martin K. Petersen" <martin.petersen@oracle.com>, linux-scsi <linux-scsi@vger.kernel.org>, LKML <linux-kernel@vger.kernel.org>, Al Viro <viro@zeniv.linux.org.uk>, syzkaller <syzkaller@googlegroups.com>, Hannes Reinecke <hare@suse.de> Subject: Re: scsi: BUG in scsi_init_io Message-ID: <20170131092048.GB3687@linux-x5ow.site> References: <CACT4Y+YAX_SmwzsHenzkswM0kp=om5NivmLW8d9Js6EN3zKmGQ@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <CACT4Y+YAX_SmwzsHenzkswM0kp=om5NivmLW8d9Js6EN3zKmGQ@mail.gmail.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk

Johannes Thumshirn Jan. 31, 2017, 9:20 a.m. UTC

On Tue, Jan 31, 2017 at 09:55:52AM +0100, Dmitry Vyukov wrote:
> Hello,
> 
> The following program triggers BUG in scsi_init_io:

Well crashing a machine just because of an empty dma transfer is a bit harsh,
isn't it?

From 86e6fa5f618fe588b98e923e032f33e075fcd4f4 Mon Sep 17 00:00:00 2001
From: Johannes Thumshirn <jthumshirn@suse.de>
Date: Tue, 31 Jan 2017 10:16:00 +0100
Subject: [PATCH] scsi: don't BUG_ON() empty DMA transfers

Don't crash the machine just because of an empty transfer. Use WARN_ON()
combined with returning an error.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
---
 drivers/scsi/scsi_lib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Dmitry Vyukov Jan. 31, 2017, 9:50 a.m. UTC | #1

On Tue, Jan 31, 2017 at 10:20 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> On Tue, Jan 31, 2017 at 09:55:52AM +0100, Dmitry Vyukov wrote:
>> Hello,
>>
>> The following program triggers BUG in scsi_init_io:
>
> Well crashing a machine just because of an empty dma transfer is a bit harsh,
> isn't it?
>
> From 86e6fa5f618fe588b98e923e032f33e075fcd4f4 Mon Sep 17 00:00:00 2001
> From: Johannes Thumshirn <jthumshirn@suse.de>
> Date: Tue, 31 Jan 2017 10:16:00 +0100
> Subject: [PATCH] scsi: don't BUG_ON() empty DMA transfers
>
> Don't crash the machine just because of an empty transfer. Use WARN_ON()
> combined with returning an error.
>
> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  drivers/scsi/scsi_lib.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index e9e1e14..414588a 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1040,7 +1040,8 @@ int scsi_init_io(struct scsi_cmnd *cmd)
>         bool is_mq = (rq->mq_ctx != NULL);
>         int error;
>
> -       BUG_ON(!blk_rq_nr_phys_segments(rq));
> +       if (WARN_ON(!blk_rq_nr_phys_segments(rq)))
> +               return -EINVAL;


Please-please-please, let's not use WARN for something that is not a
kernel bug and is user-triggerable. This makes it impossible to
automate kernel testing and requires hiring an army of people doing
mechanical job of sorting out WARNING reports into kernel-bugs and
non-kernel-bugs.
If the message is absolutely necessary (while kernel does not
generally explain every EINVAL on console), the following will do:

      if (!blk_rq_nr_phys_segments(rq)) {
              pr_err("you are doing something wrong\n");
              return -EINVAL;
      }


>         error = scsi_init_sgtable(rq, &cmd->sdb);
>         if (error)
> --
> 2.10.2
>
>
> --
> Johannes Thumshirn                                          Storage
> jthumshirn@suse.de                                +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Johannes Thumshirn Jan. 31, 2017, 9:58 a.m. UTC | #2

On Tue, Jan 31, 2017 at 10:50:49AM +0100, Dmitry Vyukov wrote:
> On Tue, Jan 31, 2017 at 10:20 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
> > On Tue, Jan 31, 2017 at 09:55:52AM +0100, Dmitry Vyukov wrote:

[...]

> Please-please-please, let's not use WARN for something that is not a
> kernel bug and is user-triggerable. This makes it impossible to
> automate kernel testing and requires hiring an army of people doing
> mechanical job of sorting out WARNING reports into kernel-bugs and
> non-kernel-bugs.
> If the message is absolutely necessary (while kernel does not
> generally explain every EINVAL on console), the following will do:
> 
>       if (!blk_rq_nr_phys_segments(rq)) {
>               pr_err("you are doing something wrong\n");
>               return -EINVAL;
>       }

Yes I understand that. OTOH having the WARN helps you finding the caller because
of to the stack trace. But arguably that could be accomplished with function
graph tracing as well. I'll re-send a v2 as a proper patch.

Thanks,
	Johannes

Dmitry Vyukov Jan. 31, 2017, 10:06 a.m. UTC | #3

On Tue, Jan 31, 2017 at 10:58 AM, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> [...]
>
>> Please-please-please, let's not use WARN for something that is not a
>> kernel bug and is user-triggerable. This makes it impossible to
>> automate kernel testing and requires hiring an army of people doing
>> mechanical job of sorting out WARNING reports into kernel-bugs and
>> non-kernel-bugs.
>> If the message is absolutely necessary (while kernel does not
>> generally explain every EINVAL on console), the following will do:
>>
>>       if (!blk_rq_nr_phys_segments(rq)) {
>>               pr_err("you are doing something wrong\n");
>>               return -EINVAL;
>>       }
>
> Yes I understand that. OTOH having the WARN helps you finding the caller because
> of to the stack trace. But arguably that could be accomplished with function
> graph tracing as well. I'll re-send a v2 as a proper patch.


Thank you very much.
Stack trace can be done with dump_stack() if necessary, e.g.

               pr_err("you are doing something wrong here:\n");
               dump_stack();
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

James Bottomley Jan. 31, 2017, 3:41 p.m. UTC | #4

On Tue, 2017-01-31 at 10:50 +0100, Dmitry Vyukov wrote:
> On Tue, Jan 31, 2017 at 10:20 AM, Johannes Thumshirn <
> jthumshirn@suse.de> wrote:
> > On Tue, Jan 31, 2017 at 09:55:52AM +0100, Dmitry Vyukov wrote:
> > > Hello,
> > > 
> > > The following program triggers BUG in scsi_init_io:
> > 
> > Well crashing a machine just because of an empty dma transfer is a
> > bit harsh,
> > isn't it?
> > 
> > From 86e6fa5f618fe588b98e923e032f33e075fcd4f4 Mon Sep 17 00:00:00
> > 2001
> > From: Johannes Thumshirn <jthumshirn@suse.de>
> > Date: Tue, 31 Jan 2017 10:16:00 +0100
> > Subject: [PATCH] scsi: don't BUG_ON() empty DMA transfers
> > 
> > Don't crash the machine just because of an empty transfer. Use
> > WARN_ON()
> > combined with returning an error.
> > 
> > Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
> > ---
> >  drivers/scsi/scsi_lib.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index e9e1e14..414588a 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1040,7 +1040,8 @@ int scsi_init_io(struct scsi_cmnd *cmd)
> >         bool is_mq = (rq->mq_ctx != NULL);
> >         int error;
> > 
> > -       BUG_ON(!blk_rq_nr_phys_segments(rq));
> > +       if (WARN_ON(!blk_rq_nr_phys_segments(rq)))
> > +               return -EINVAL;
> 
> 
> Please-please-please, let's not use WARN for something that is not a
> kernel bug and is user-triggerable.

It is a kernel bug and it should not be user triggerable, so it should
have a warn_on or bug_on.  It means something called a data setup
function with no data.  There's actually a root cause that patches like
this won't fix, can we find it?

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds Feb. 19, 2017, 5:38 p.m. UTC | #5

On Tue, Jan 31, 2017 at 7:41 AM, James Bottomley
<jejb@linux.vnet.ibm.com> wrote:
>
> It is a kernel bug and it should not be user triggerable, so it should
> have a warn_on or bug_on.

Hell NO.

Christ, James, listen to yourself. What you are basically saying when
you say it should be a BUG_ON() is

 "This shouldn't happen, but if it ever does happen, let's just turn
our mistaken assumptions into a dead machine that is really hard to
debug".

Because a BUG_ON() effectively kills the machine if the call chain has
some locks held. In the SCSI layer, that generally means that there
will be no logged oops either, because any locks held likely just
killed your filesystem or disk subsystem, so now that oops is
basically not even likely to be reported by most normal users.

So stop this "should have a bug_on". In fact, since this apparently
_is_ easily user-triggerable, it damn well shouldn't have a warn_on
either. At most, a WARN_ON_ONCE(), so that we might get reports of
_what_ the bad call chain is, but we will never kill the machine and
we will *not* give people the ability to randomly spam the system
logs.

BUG_ON() needs to die. People need to realize that it is a _problem_,
and that it makes any bugs _worse_. Don't do it.

The only valid reason for BUG_ON() is when some very core data
structure is _so_ corrupt that you can't even continue, because you
simply can't even return an error and there's no way for you to just
say "log it once and continue".

And by that I don't mean some random value you have in a request. I
mean literally "this is a really core data structure, and I simply
_cannot_ continue" (where that "cannot" is about an actual physical
impossibility, not a "I could continue but I think this is serious").

Anything else is a "return error, possibly with a WARN_ON() to let
people know that bad things are going on".

Basically, BUG_ON() should be in core kernel code. Not in drivers. And
even in core kernel code, it's likely wrong.

             Linus

scsi: BUG in scsi_init_io

Commit Message

Comments

Patch