diff mbox series

scsi: avoid repetitive logging of device offline messages

Message ID 20200309181416.10665-1-emilne@redhat.com (mailing list archive)
State Superseded
Headers show
Series scsi: avoid repetitive logging of device offline messages | expand

Commit Message

Ewan Milne March 9, 2020, 6:14 p.m. UTC
Large queues of I/O to offline devices that are eventually
submitted when devices are unblocked result in a many repeated
"rejecting I/O to offline device" messages.  These messages
can fill up the dmesg buffer in crash dumps so no useful
prior messages remain.  In addition, if a serial console
is used, the flood of messages can cause a hard lockup in
the console code.

Introduce a flag indicating the message has already been logged
for the device, and reset the flag when scsi_device_set_state()
changes the device state.

Signed-off-by: Ewan D. Milne <emilne@redhat.com>
---
 drivers/scsi/scsi_lib.c    | 8 ++++++--
 include/scsi/scsi_device.h | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

Comments

James Bottomley March 9, 2020, 7:05 p.m. UTC | #1
On Mon, 2020-03-09 at 14:14 -0400, Ewan D. Milne wrote:
> Large queues of I/O to offline devices that are eventually
> submitted when devices are unblocked result in a many repeated
> "rejecting I/O to offline device" messages.  These messages
> can fill up the dmesg buffer in crash dumps so no useful
> prior messages remain.  In addition, if a serial console
> is used, the flood of messages can cause a hard lockup in
> the console code.
> 
> Introduce a flag indicating the message has already been logged
> for the device, and reset the flag when scsi_device_set_state()
> changes the device state.
> 
> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c    | 8 ++++++--
>  include/scsi/scsi_device.h | 2 ++
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 610ee41..d3a6d97 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1240,8 +1240,11 @@ scsi_prep_state_check(struct scsi_device
> *sdev, struct request *req)
>  		 * commands.  The device must be brought online
>  		 * before trying any recovery commands.
>  		 */
> -		sdev_printk(KERN_ERR, sdev,
> -			    "rejecting I/O to offline device\n");
> +		if (!sdev->offline_already) {
> +			sdev->offline_already = 1;
> +			sdev_printk(KERN_ERR, sdev,
> +				    "rejecting I/O to offline
> device\n");
> +		}

Offline->online is a legal transition, so you'll need to clear this
flag when that happens so we get another offline message if it goes
offline again.

James
Bart Van Assche March 9, 2020, 7:36 p.m. UTC | #2
On 3/9/20 11:14 AM, Ewan D. Milne wrote:
> Large queues of I/O to offline devices that are eventually
> submitted when devices are unblocked result in a many repeated
> "rejecting I/O to offline device" messages.  These messages
> can fill up the dmesg buffer in crash dumps so no useful
> prior messages remain.  In addition, if a serial console
> is used, the flood of messages can cause a hard lockup in
> the console code.
> 
> Introduce a flag indicating the message has already been logged
> for the device, and reset the flag when scsi_device_set_state()
> changes the device state.
> 
> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> ---
>   drivers/scsi/scsi_lib.c    | 8 ++++++--
>   include/scsi/scsi_device.h | 2 ++
>   2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 610ee41..d3a6d97 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1240,8 +1240,11 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
>   		 * commands.  The device must be brought online
>   		 * before trying any recovery commands.
>   		 */
> -		sdev_printk(KERN_ERR, sdev,
> -			    "rejecting I/O to offline device\n");
> +		if (!sdev->offline_already) {
> +			sdev->offline_already = 1;
> +			sdev_printk(KERN_ERR, sdev,
> +				    "rejecting I/O to offline device\n");
> +		}
>   		return BLK_STS_IOERR;
>   	case SDEV_DEL:
>   		/*
> @@ -2340,6 +2343,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state)
>   		break;
>   
>   	}
> +	sdev->offline_already = 0;
>   	sdev->sdev_state = state;
>   	return 0;
>   
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index f8312a3..72987a0 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -204,6 +204,8 @@ struct scsi_device {
>   	unsigned unmap_limit_for_ws:1;	/* Use the UNMAP limit for WRITE SAME */
>   	unsigned rpm_autosuspend:1;	/* Enable runtime autosuspend at device
>   					 * creation time */
> +	unsigned offline_already:1;	/* Device offline message logged */
> +
>   	atomic_t disk_events_disable_depth; /* disable depth for disk events */
>   
>   	DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */

Bitfields are troublesome in multithreaded software. Has it been 
considered to use rate-limiting instead of introducing a new bitfield 
member?

Thanks,

Bart.
Ewan Milne March 9, 2020, 8:49 p.m. UTC | #3
On Mon, 2020-03-09 at 12:05 -0700, James Bottomley wrote:
> On Mon, 2020-03-09 at 14:14 -0400, Ewan D. Milne wrote:
> > Large queues of I/O to offline devices that are eventually
> > submitted when devices are unblocked result in a many repeated
> > "rejecting I/O to offline device" messages.  These messages
> > can fill up the dmesg buffer in crash dumps so no useful
> > prior messages remain.  In addition, if a serial console
> > is used, the flood of messages can cause a hard lockup in
> > the console code.
> > 
> > Introduce a flag indicating the message has already been logged
> > for the device, and reset the flag when scsi_device_set_state()
> > changes the device state.
> > 
> > Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> > ---
> >  drivers/scsi/scsi_lib.c    | 8 ++++++--
> >  include/scsi/scsi_device.h | 2 ++
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 610ee41..d3a6d97 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1240,8 +1240,11 @@ scsi_prep_state_check(struct scsi_device
> > *sdev, struct request *req)
> >  		 * commands.  The device must be brought online
> >  		 * before trying any recovery commands.
> >  		 */
> > -		sdev_printk(KERN_ERR, sdev,
> > -			    "rejecting I/O to offline device\n");
> > +		if (!sdev->offline_already) {
> > +			sdev->offline_already = 1;
> > +			sdev_printk(KERN_ERR, sdev,
> > +				    "rejecting I/O to offline
> > device\n");
> > +		}
> 
> Offline->online is a legal transition, so you'll need to clear this
> flag when that happens so we get another offline message if it goes
> offline again.
> 
> James
> 

That's what resetting the flag in scsi_device_set_state() does.
The only thing that worries me is if some driver manipulates the
sdev->sdev_state value directly.

-Ewan
Ewan Milne March 9, 2020, 8:54 p.m. UTC | #4
On Mon, 2020-03-09 at 12:36 -0700, Bart Van Assche wrote:
> On 3/9/20 11:14 AM, Ewan D. Milne wrote:
> > Large queues of I/O to offline devices that are eventually
> > submitted when devices are unblocked result in a many repeated
> > "rejecting I/O to offline device" messages.  These messages
> > can fill up the dmesg buffer in crash dumps so no useful
> > prior messages remain.  In addition, if a serial console
> > is used, the flood of messages can cause a hard lockup in
> > the console code.
> > 
> > Introduce a flag indicating the message has already been logged
> > for the device, and reset the flag when scsi_device_set_state()
> > changes the device state.
> > 
> > Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> > ---
> >   drivers/scsi/scsi_lib.c    | 8 ++++++--
> >   include/scsi/scsi_device.h | 2 ++
> >   2 files changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 610ee41..d3a6d97 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1240,8 +1240,11 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
> >   		 * commands.  The device must be brought online
> >   		 * before trying any recovery commands.
> >   		 */
> > -		sdev_printk(KERN_ERR, sdev,
> > -			    "rejecting I/O to offline device\n");
> > +		if (!sdev->offline_already) {
> > +			sdev->offline_already = 1;
> > +			sdev_printk(KERN_ERR, sdev,
> > +				    "rejecting I/O to offline device\n");
> > +		}
> >   		return BLK_STS_IOERR;
> >   	case SDEV_DEL:
> >   		/*
> > @@ -2340,6 +2343,7 @@ scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state)
> >   		break;
> >   
> >   	}
> > +	sdev->offline_already = 0;
> >   	sdev->sdev_state = state;
> >   	return 0;
> >   
> > diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> > index f8312a3..72987a0 100644
> > --- a/include/scsi/scsi_device.h
> > +++ b/include/scsi/scsi_device.h
> > @@ -204,6 +204,8 @@ struct scsi_device {
> >   	unsigned unmap_limit_for_ws:1;	/* Use the UNMAP limit for WRITE SAME */
> >   	unsigned rpm_autosuspend:1;	/* Enable runtime autosuspend at device
> >   					 * creation time */
> > +	unsigned offline_already:1;	/* Device offline message logged */
> > +
> >   	atomic_t disk_events_disable_depth; /* disable depth for disk events */
> >   
> >   	DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */
> 
> Bitfields are troublesome in multithreaded software. Has it been 
> considered to use rate-limiting instead of introducing a new bitfield 
> member?
> 
> Thanks,
> 
> Bart.
> 

I did but printk_ratelimited() does not do what is desired here.
What we want is only a single message per-device.  If we ratelimit
the message instance itself we lose information in the log about which
devices were affected (which makes debugging issues with multipath I/O
much harder).

The only purpose of the flag is to try to suppress duplicate messages,
in the common case it is a single thread submitting the queued I/O which
is going to get rejected.  If multiple threads submit I/O there might
be duplicated messages but that is not all that critical.  Hence the
lack of locking on the flag.

-Ewan
Laurence Oberman March 9, 2020, 10:36 p.m. UTC | #5
On Mon, 2020-03-09 at 14:14 -0400, Ewan D. Milne wrote:
> Large queues of I/O to offline devices that are eventually
> submitted when devices are unblocked result in a many repeated
> "rejecting I/O to offline device" messages.  These messages
> can fill up the dmesg buffer in crash dumps so no useful
> prior messages remain.  In addition, if a serial console
> is used, the flood of messages can cause a hard lockup in
> the console code.
> 
> Introduce a flag indicating the message has already been logged
> for the device, and reset the flag when scsi_device_set_state()
> changes the device state.
> 
> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c    | 8 ++++++--
>  include/scsi/scsi_device.h | 2 ++
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 610ee41..d3a6d97 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1240,8 +1240,11 @@ scsi_prep_state_check(struct scsi_device
> *sdev, struct request *req)
>  		 * commands.  The device must be brought online
>  		 * before trying any recovery commands.
>  		 */
> -		sdev_printk(KERN_ERR, sdev,
> -			    "rejecting I/O to offline device\n");
> +		if (!sdev->offline_already) {
> +			sdev->offline_already = 1;
> +			sdev_printk(KERN_ERR, sdev,
> +				    "rejecting I/O to offline
> device\n");
> +		}
>  		return BLK_STS_IOERR;
>  	case SDEV_DEL:
>  		/*
> @@ -2340,6 +2343,7 @@ scsi_device_set_state(struct scsi_device *sdev,
> enum scsi_device_state state)
>  		break;
>  
>  	}
> +	sdev->offline_already = 0;
>  	sdev->sdev_state = state;
>  	return 0;
>  
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index f8312a3..72987a0 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -204,6 +204,8 @@ struct scsi_device {
>  	unsigned unmap_limit_for_ws:1;	/* Use the UNMAP limit for
> WRITE SAME */
>  	unsigned rpm_autosuspend:1;	/* Enable runtime autosuspend
> at device
>  					 * creation time */
> +	unsigned offline_already:1;	/* Device offline message
> logged */
> +
>  	atomic_t disk_events_disable_depth; /* disable depth for disk
> events */
>  
>  	DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /*
> supported events */

I tested this on a customer issue.
If its accepted please use

Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Bart Van Assche March 10, 2020, 2:02 a.m. UTC | #6
On 2020-03-09 13:54, Ewan D. Milne wrote:
> The only purpose of the flag is to try to suppress duplicate messages,
> in the common case it is a single thread submitting the queued I/O which
> is going to get rejected.  If multiple threads submit I/O there might
> be duplicated messages but that is not all that critical.  Hence the
> lack of locking on the flag.

Hi Ewan,

My concern is that scsi_prep_state_check() may be called from more than
one thread at the same time and that bitfield changes are not thread-safe.

Thanks,

Bart.
Douglas Gilbert March 10, 2020, 3:58 a.m. UTC | #7
On 2020-03-09 4:49 p.m., Ewan D. Milne wrote:
> On Mon, 2020-03-09 at 12:05 -0700, James Bottomley wrote:
>> On Mon, 2020-03-09 at 14:14 -0400, Ewan D. Milne wrote:
>>> Large queues of I/O to offline devices that are eventually
>>> submitted when devices are unblocked result in a many repeated
>>> "rejecting I/O to offline device" messages.  These messages
>>> can fill up the dmesg buffer in crash dumps so no useful
>>> prior messages remain.  In addition, if a serial console
>>> is used, the flood of messages can cause a hard lockup in
>>> the console code.
>>>
>>> Introduce a flag indicating the message has already been logged
>>> for the device, and reset the flag when scsi_device_set_state()
>>> changes the device state.
>>>
>>> Signed-off-by: Ewan D. Milne <emilne@redhat.com>
>>> ---
>>>   drivers/scsi/scsi_lib.c    | 8 ++++++--
>>>   include/scsi/scsi_device.h | 2 ++
>>>   2 files changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index 610ee41..d3a6d97 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -1240,8 +1240,11 @@ scsi_prep_state_check(struct scsi_device
>>> *sdev, struct request *req)
>>>   		 * commands.  The device must be brought online
>>>   		 * before trying any recovery commands.
>>>   		 */
>>> -		sdev_printk(KERN_ERR, sdev,
>>> -			    "rejecting I/O to offline device\n");
>>> +		if (!sdev->offline_already) {
>>> +			sdev->offline_already = 1;
>>> +			sdev_printk(KERN_ERR, sdev,
>>> +				    "rejecting I/O to offline
>>> device\n");
>>> +		}
>>
>> Offline->online is a legal transition, so you'll need to clear this
>> flag when that happens so we get another offline message if it goes
>> offline again.
>>
>> James
>>
> 
> That's what resetting the flag in scsi_device_set_state() does.
> The only thing that worries me is if some driver manipulates the
> sdev->sdev_state value directly.

Perhaps they could sneak in some c++isms and allow the definition
of struct fields to be preceded by [[protected]]. The idea would be
to complain loudly if that field is directly accessed. The accessor
functions might need something similar (or the same decoration) to
defeat that noise.

In the meantime:

Reviewed-by: Douglas Gilbert <dgilbert@interlog.com>
Ewan Milne March 10, 2020, 4:44 p.m. UTC | #8
On Mon, 2020-03-09 at 19:02 -0700, Bart Van Assche wrote:
> On 2020-03-09 13:54, Ewan D. Milne wrote:
> > The only purpose of the flag is to try to suppress duplicate messages,
> > in the common case it is a single thread submitting the queued I/O which
> > is going to get rejected.  If multiple threads submit I/O there might
> > be duplicated messages but that is not all that critical.  Hence the
> > lack of locking on the flag.
> 
> Hi Ewan,
> 
> My concern is that scsi_prep_state_check() may be called from more than
> one thread at the same time and that bitfield changes are not thread-safe.
> 
> Thanks,
> 
> Bart.

Yes, I agree that the flag is not thread-safe, but I don't think that
is a concern.  Because if we get multiple rejecting I/O messages until
the other threads see the flag change we are no worse off than before,
and once the sdev state changes back to SDEV_RUNNING we won't call
scsi_prep_state_check() and examine the flag.

-Ewan
Bart Van Assche March 11, 2020, 4:01 a.m. UTC | #9
On 2020-03-10 09:44, Ewan D. Milne wrote:
> On Mon, 2020-03-09 at 19:02 -0700, Bart Van Assche wrote:
>> On 2020-03-09 13:54, Ewan D. Milne wrote:
>>> The only purpose of the flag is to try to suppress duplicate messages,
>>> in the common case it is a single thread submitting the queued I/O which
>>> is going to get rejected.  If multiple threads submit I/O there might
>>> be duplicated messages but that is not all that critical.  Hence the
>>> lack of locking on the flag.
>>
>> My concern is that scsi_prep_state_check() may be called from more than
>> one thread at the same time and that bitfield changes are not thread-safe.
> 
> Yes, I agree that the flag is not thread-safe, but I don't think that
> is a concern.  Because if we get multiple rejecting I/O messages until
> the other threads see the flag change we are no worse off than before,
> and once the sdev state changes back to SDEV_RUNNING we won't call
> scsi_prep_state_check() and examine the flag.

Hi Ewan,

This is the entire list of bitfields in struct scsi_device:

	unsigned removable:1;
	unsigned changed:1;
	unsigned busy:1;
	unsigned lockable:1;
	unsigned locked:1;
	unsigned borken:1;
	unsigned disconnect:1;
	unsigned soft_reset:1;
	unsigned sdtr:1;
	unsigned wdtr:1;
	unsigned ppr:1;
	unsigned tagged_supported:1;
	unsigned simple_tags:1;
	unsigned was_reset:1;
	unsigned expecting_cc_ua:1;
	unsigned use_10_for_rw:1;
	unsigned use_10_for_ms:1;
	unsigned set_dbd_for_ms:1;
	unsigned no_report_opcodes:1;
	unsigned no_write_same:1;
	unsigned use_16_for_rw:1;
	unsigned skip_ms_page_8:1;
	unsigned skip_ms_page_3f:1;
	unsigned skip_vpd_pages:1;
	unsigned try_vpd_pages:1;
	unsigned use_192_bytes_for_3f:1;
	unsigned no_start_on_add:1;
	unsigned allow_restart:1;
	unsigned manage_start_stop:1;
	unsigned start_stop_pwr_cond:1;
	unsigned no_uld_attach:1;
	unsigned select_no_atn:1;
	unsigned fix_capacity:1;
	unsigned guess_capacity:1;
	unsigned retry_hwerror:1;
	unsigned last_sector_bug:1;
	unsigned no_read_disc_info:1;
	unsigned no_read_capacity_16:1;
	unsigned try_rc_10_first:1;
	unsigned security_supported:1;
	unsigned is_visible:1;
	unsigned wce_default_on:1;
	unsigned no_dif:1;
	unsigned broken_fua:1;
	unsigned lun_in_cdb:1;
	unsigned unmap_limit_for_ws:1;
	unsigned rpm_autosuspend:1;

If any thread modifies any of these existing bitfields concurrently with
scsi_prep_state_check(), one of the two modifications will be lost. That
is because compilers implement bitfield changes as follows:

new_value = (old_value & ~(1 << ...)) | (1 << ...);

If two such assignment statements are executed concurrently, both start
from the same 'old_value' and only one of the two changes will happen.
I'm concerned that this patch introduces a maintenance problem in the
long term. Has it been considered to make 'offline_already' a 32-bits
integer variable or a boolean instead of a bitfield? I think such
variables can be modified without discarding values written from another
thread.

Thanks,

Bart.
Douglas Gilbert March 11, 2020, 5:12 a.m. UTC | #10
On 2020-03-11 12:01 a.m., Bart Van Assche wrote:
> On 2020-03-10 09:44, Ewan D. Milne wrote:
>> On Mon, 2020-03-09 at 19:02 -0700, Bart Van Assche wrote:
>>> On 2020-03-09 13:54, Ewan D. Milne wrote:
>>>> The only purpose of the flag is to try to suppress duplicate messages,
>>>> in the common case it is a single thread submitting the queued I/O which
>>>> is going to get rejected.  If multiple threads submit I/O there might
>>>> be duplicated messages but that is not all that critical.  Hence the
>>>> lack of locking on the flag.
>>>
>>> My concern is that scsi_prep_state_check() may be called from more than
>>> one thread at the same time and that bitfield changes are not thread-safe.
>>
>> Yes, I agree that the flag is not thread-safe, but I don't think that
>> is a concern.  Because if we get multiple rejecting I/O messages until
>> the other threads see the flag change we are no worse off than before,
>> and once the sdev state changes back to SDEV_RUNNING we won't call
>> scsi_prep_state_check() and examine the flag.
> 
> Hi Ewan,
> 
> This is the entire list of bitfields in struct scsi_device:
> 
> 	unsigned removable:1;
> 	unsigned changed:1;
> 	unsigned busy:1;
> 	unsigned lockable:1;
> 	unsigned locked:1;
> 	unsigned borken:1;
> 	unsigned disconnect:1;
> 	unsigned soft_reset:1;
> 	unsigned sdtr:1;
> 	unsigned wdtr:1;
> 	unsigned ppr:1;
> 	unsigned tagged_supported:1;
> 	unsigned simple_tags:1;
> 	unsigned was_reset:1;
> 	unsigned expecting_cc_ua:1;
> 	unsigned use_10_for_rw:1;
> 	unsigned use_10_for_ms:1;
> 	unsigned set_dbd_for_ms:1;
> 	unsigned no_report_opcodes:1;
> 	unsigned no_write_same:1;
> 	unsigned use_16_for_rw:1;
> 	unsigned skip_ms_page_8:1;
> 	unsigned skip_ms_page_3f:1;
> 	unsigned skip_vpd_pages:1;
> 	unsigned try_vpd_pages:1;
> 	unsigned use_192_bytes_for_3f:1;
> 	unsigned no_start_on_add:1;
> 	unsigned allow_restart:1;
> 	unsigned manage_start_stop:1;
> 	unsigned start_stop_pwr_cond:1;
> 	unsigned no_uld_attach:1;
> 	unsigned select_no_atn:1;
> 	unsigned fix_capacity:1;
> 	unsigned guess_capacity:1;
> 	unsigned retry_hwerror:1;
> 	unsigned last_sector_bug:1;
> 	unsigned no_read_disc_info:1;
> 	unsigned no_read_capacity_16:1;
> 	unsigned try_rc_10_first:1;
> 	unsigned security_supported:1;
> 	unsigned is_visible:1;
> 	unsigned wce_default_on:1;
> 	unsigned no_dif:1;
> 	unsigned broken_fua:1;
> 	unsigned lun_in_cdb:1;
> 	unsigned unmap_limit_for_ws:1;
> 	unsigned rpm_autosuspend:1;
> 
> If any thread modifies any of these existing bitfields concurrently with
> scsi_prep_state_check(), one of the two modifications will be lost. That
> is because compilers implement bitfield changes as follows:
> 
> new_value = (old_value & ~(1 << ...)) | (1 << ...);
> 
> If two such assignment statements are executed concurrently, both start
> from the same 'old_value' and only one of the two changes will happen.
> I'm concerned that this patch introduces a maintenance problem in the
> long term. Has it been considered to make 'offline_already' a 32-bits
> integer variable or a boolean instead of a bitfield? I think such
> variables can be modified without discarding values written from another
> thread.

Most of the stuff there is slow moving data, typically set once at device
creation time. The design predates atomic bitops. I can't see why the
addition of an extra bitfield, whose overwriting is not going to cause
a calamity ***, warrants the proposer having the rewrite this patch.

Doug Gilbert


*** setting offline_already effectively invalidates a lot of other bitfields.
     Still, if the offline_already fails to be set, due to a clash
     (with what?), then at worst the warning message is repeated and the
     patch code tries again to set that bitfield.
Ewan Milne March 11, 2020, 2:38 p.m. UTC | #11
On Tue, 2020-03-10 at 21:01 -0700, Bart Van Assche wrote:
> 

...

> If any thread modifies any of these existing bitfields concurrently with
> scsi_prep_state_check(), one of the two modifications will be lost. That
> is because compilers implement bitfield changes as follows:
> 
> new_value = (old_value & ~(1 << ...)) | (1 << ...);
> 
> If two such assignment statements are executed concurrently, both start
> from the same 'old_value' and only one of the two changes will happen.
> I'm concerned that this patch introduces a maintenance problem in the
> long term. Has it been considered to make 'offline_already' a 32-bits
> integer variable or a boolean instead of a bitfield? I think such
> variables can be modified without discarding values written from another
> thread.
> 
> Thanks,
> 
> Bart.
> 

I understand your concern about long-term maintenance, and introducing
a pattern that could be used by other code that needs more accurate
state.  (Although, it is likely that a lock or an atomic operation might
be required in other cases).  I'll post a v2 changing this to a boolean.

-Ewan
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41..d3a6d97 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1240,8 +1240,11 @@  scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 		 * commands.  The device must be brought online
 		 * before trying any recovery commands.
 		 */
-		sdev_printk(KERN_ERR, sdev,
-			    "rejecting I/O to offline device\n");
+		if (!sdev->offline_already) {
+			sdev->offline_already = 1;
+			sdev_printk(KERN_ERR, sdev,
+				    "rejecting I/O to offline device\n");
+		}
 		return BLK_STS_IOERR;
 	case SDEV_DEL:
 		/*
@@ -2340,6 +2343,7 @@  scsi_device_set_state(struct scsi_device *sdev, enum scsi_device_state state)
 		break;
 
 	}
+	sdev->offline_already = 0;
 	sdev->sdev_state = state;
 	return 0;
 
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index f8312a3..72987a0 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -204,6 +204,8 @@  struct scsi_device {
 	unsigned unmap_limit_for_ws:1;	/* Use the UNMAP limit for WRITE SAME */
 	unsigned rpm_autosuspend:1;	/* Enable runtime autosuspend at device
 					 * creation time */
+	unsigned offline_already:1;	/* Device offline message logged */
+
 	atomic_t disk_events_disable_depth; /* disable depth for disk events */
 
 	DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */