[v4,00/11] Add persistent durable identifier to storage log messages

Message ID: 20200724171706.1550403-1-tasleson@redhat.com

Message

Tony Asleson July 24, 2020, 5:16 p.m. UTC
Today users have no easy way to correlate kernel log messages for storage
devices across reboots, dynamic device add/remove, or when the device is
physically or logically moved from system to system.  This is because the
existing log IDs identify how the device is attached, not a unique ID of
what is attached.  Additionally, even when the attachment hasn't changed,
it's not always obvious which messages belong to the device, as the
different areas in the storage stack use different identifiers, e.g. sda,
ata1.00, sd 0:0:0:0.

This series addresses the problem by adding a unique ID to each log
message.  It couples the existing structured key/value logging capability
with VPD page 0x83 device identification.
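
The ID is the same one sysfs already exposes through the device's wwid
attribute (the value shown is illustrative):

    $ cat /sys/block/sdb/device/wwid
    naa.5000c5003011cb2b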


Some examples of logs filtered for a specific device using this patch
series:

$ journalctl -b _KERNEL_DURABLE_NAME="`cat /sys/block/sdb/device/wwid`" \
    | cut -c 25- | fmt -t
l: scsi 1:0:0:0: Attached scsi generic sg1 type 0
l: sd 1:0:0:0: [sdb] 209715200 512-byte logical blocks: (107 GB/100 GiB)
l: sd 1:0:0:0: [sdb] Write Protect is off
l: sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
l: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't
   support DPO or FUA
l: sd 1:0:0:0: [sdb] Attached SCSI disk
l: ata2.00: exception Emask 0x0 SAct 0x8 SErr 0x8 action 0x6 frozen
l: ata2.00: failed command: READ FPDMA QUEUED
l: ata2.00: cmd 60/01:18:10:27:00/00:00:00:00:00/40 tag 3 ncq dma 512
            in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
l: ata2.00: status: { DRDY }
l: ata2.00: configured for UDMA/100
l: ata2.00: device reported invalid CHS sector 0
l: ata2.00: exception Emask 0x0 SAct 0x4000 SErr 0x4000 action 0x6 frozen
l: ata2.00: failed command: READ FPDMA QUEUED
l: ata2.00: cmd 60/01:70:10:27:00/00:00:00:00:00/40 tag 14 ncq dma 512
            in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
l: ata2.00: status: { DRDY }
l: ata2.00: configured for UDMA/100
l: ata2.00: device reported invalid CHS sector 0
l: ata2.00: exception Emask 0x0 SAct 0x80000000 SErr 0x80000000 action
            0x6 frozen
l: ata2.00: failed command: READ FPDMA QUEUED
l: ata2.00: cmd 60/01:f8:10:27:00/00:00:00:00:00/40 tag 31 ncq dma 512
            in res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
l: ata2.00: status: { DRDY }
l: ata2.00: configured for UDMA/100
l: ata2.00: NCQ disabled due to excessive errors
l: ata2.00: exception Emask 0x0 SAct 0x40000 SErr 0x40000 action 0x6
            frozen
l: ata2.00: failed command: READ FPDMA QUEUED
l: ata2.00: cmd 60/01:90:10:27:00/00:00:00:00:00/40 tag 18 ncq dma 512
            in res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
l: ata2.00: status: { DRDY }
l: ata2.00: configured for UDMA/100

$ journalctl -b _KERNEL_DURABLE_NAME="`cat /sys/block/nvme0n1/wwid`" \
    | cut -c 25- | fmt -t
l: blk_update_request: critical medium error, dev nvme0n1, sector 10000
   op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
l: blk_update_request: critical medium error, dev nvme0n1, sector 10000
   op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
l: Buffer I/O error on dev nvme0n1, logical block 1250, async page read

$ journalctl -b _KERNEL_DURABLE_NAME="`cat /sys/block/sdc/device/wwid`" \
    | cut -c 25- | fmt -t
l: sd 8:0:0:0: Power-on or device reset occurred
l: sd 8:0:0:0: [sdc] 16777216 512-byte logical blocks: (8.59 GB/8.00 GiB)
l: sd 8:0:0:0: Attached scsi generic sg2 type 0
l: sd 8:0:0:0: [sdc] Write Protect is off
l: sd 8:0:0:0: [sdc] Mode Sense: 63 00 00 08
l: sd 8:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't
   support DPO or FUA
l: sd 8:0:0:0: [sdc] Attached SCSI disk
l: sd 8:0:0:0: [sdc] tag#255 FAILED Result: hostbyte=DID_OK
   driverbyte=DRIVER_SENSE cmd_age=0s
l: sd 8:0:0:0: [sdc] tag#255 Sense Key : Medium Error [current]
l: sd 8:0:0:0: [sdc] tag#255 Add. Sense: Unrecovered read error
l: sd 8:0:0:0: [sdc] tag#255 CDB: Read(10) 28 00 00 00 27 10 00 00 01 00
l: blk_update_request: critical medium error, dev sdc, sector 10000 op
   0x0:(READ) flags 0x0 phys_seg 1 prio class 0

There should be no changes to log message content with this patch series.
To verify, I ran a release kernel and this patch series, forced the kernel
through the same error paths, and compared the output.

The first 5 commits in the series make the changes needed for the
dev_printk code path.  The last 6 commits add the changes needed to use
durable_name_printk.  The function durable_name_printk is nothing more
than a printk that adds a structured key/value durable name to otherwise
unmodified printk output.  I structured it this way so that a subset of
the series could be applied if we cannot get agreement on the complete
series.
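
As a rough sketch of the idea (illustrative only, not the code in these
patches; the series adds a durable_name callback to struct device, but
the signature below is my assumption), durable_name_printk amounts to:

    /*
     * Sketch, NOT the implementation from this series.  Assumed
     * callback signature:
     *   int (*durable_name)(const struct device *dev, char *buf, size_t len);
     * The durable name rides in the printk dictionary, alongside the
     * existing SUBSYSTEM=/DEVICE= pairs from create_syslog_header(), so
     * journald can store it (as _KERNEL_DURABLE_NAME) while the visible
     * message text stays exactly what a plain printk would produce.
     */
    static void durable_name_printk_sketch(const struct device *dev,
                                           const char *fmt, ...)
    {
            char dict[256];
            size_t dictlen = 0;
            va_list args;

            if (dev && dev->durable_name) {
                    int len = scnprintf(dict, sizeof(dict), "DURABLE_NAME=");

                    len += dev->durable_name(dev, dict + len,
                                             sizeof(dict) - len);
                    dictlen = len;
            }

            va_start(args, fmt);
            /* consoles ignore the dictionary; journald reads it */
            vprintk_emit(0, LOGLEVEL_DEFAULT, dictlen ? dict : NULL,
                         dictlen, fmt, args);
            va_end(args);
    }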

v2:
- Incorporated changes suggested by James Bottomley
- Removed the string function that stripped leading/trailing/duplicate
  adjacent spaces from the generated ID; the value now matches
  /sys/block/<device>/device/wwid
- Removed the xfs patch, limiting changes to the lower block layers
- Moved the callback from struct device_type to struct device.  struct
  device_type is typically static const, and because a number of different
  areas share the genhd implementation, it cannot be modified for each of
  them separately.

v3:
- Increased the size of the buffers for NVMe ID generation and
  dev_vprintk_emit

v4:
- Backed out dev_printk in the locations that weren't using it before, so
  that using durable_name_printk does not change the content of the
  user-visible log messages.
- Removed RFC from the patch series.

Tony Asleson (11):
  struct device: Add function callback durable_name
  create_syslog_header: Add durable name
  dev_vprintk_emit: Increase hdr size
  scsi: Add durable_name for dev_printk
  nvme: Add durable name for dev_printk
  libata: Add ata_scsi_durable_name
  Add durable_name_printk
  libata: use durable_name_printk
  Add durable_name_printk_ratelimited
  print_req_error: Use durable_name_printk_ratelimited
  buffer_io_error: Use durable_name_printk_ratelimited

 block/blk-core.c           |  5 ++++-
 drivers/ata/libata-core.c  | 17 +++++++-------
 drivers/ata/libata-scsi.c  | 13 ++++++++++-
 drivers/base/core.c        | 46 +++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/core.c   | 18 +++++++++++++++
 drivers/scsi/scsi_lib.c    | 14 ++++++++++++
 drivers/scsi/scsi_sysfs.c  | 23 +++++++++++++++++++
 drivers/scsi/sd.c          |  2 ++
 fs/buffer.c                | 10 +++++++--
 include/linux/dev_printk.h | 14 ++++++++++++
 include/linux/device.h     |  4 ++++
 include/scsi/scsi_device.h |  3 +++
 12 files changed, 156 insertions(+), 13 deletions(-)


base-commit: 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162

Comments

Christoph Hellwig July 26, 2020, 3:10 p.m. UTC | #1
FYI, I think these identifiers are absolutely horrible and have no
business in dmesg:

Nacked-by: Christoph Hellwig <hch@lst.de>
Tony Asleson July 27, 2020, 3:45 p.m. UTC | #2
On 7/26/20 10:10 AM, Christoph Hellwig wrote:
> FYI, I think these identifiers are absolutely horrible and have no
> business in dmesg:

The identifiers are structured data, they're not visible unless you go
looking for them.
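
For instance (illustrative; the value shown is made up), nothing changes
in the default journal view, but the field is there if you ask for it:

    $ journalctl -k -o verbose | grep -m1 _KERNEL_DURABLE_NAME
        _KERNEL_DURABLE_NAME=naa.5000cca02b38d0b8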

I'm open to other suggestions on how we can positively identify storage
devices over time, across reboots, replacement, and dynamic reconfiguration.

My home system has 4 disks, 2 are identical except for serial number.
Even with this simple configuration, it's not trivial to identify which
message goes with which disk across reboots.

-Tony
Hannes Reinecke July 27, 2020, 4:46 p.m. UTC | #3
On 7/27/20 5:45 PM, Tony Asleson wrote:
> On 7/26/20 10:10 AM, Christoph Hellwig wrote:
>> FYI, I think these identifiers are absolutely horrible and have no
>> business in dmesg:
> 
> The identifiers are structured data, they're not visible unless you go
> looking for them.
> 
> I'm open to other suggestions on how we can positively identify storage
> devices over time, across reboots, replacement, and dynamic reconfiguration.
> 
> My home system has 4 disks, 2 are identical except for serial number.
> Even with this simple configuration, it's not trivial to identify which
> message goes with which disk across reboots.
> 
Well; the more important bits would be to identify the physical location 
where these disks reside.
If one goes bad it doesn't really help you if you have a persistent 
identification in the OS; what you really need is a physical indicator 
or a physical location allowing you to identify which disk to pull.

Which isn't addressed at all with this patchset (nor should it be; the 
physical location is typically found via other means).

And for the other use-cases: We do have persistent device links, do we 
not? They provide even more information than you'd extract with this 
patchset, and don't require any kernel changes whatsoever.

Cheers,

Hannes
Tony Asleson July 27, 2020, 5:42 p.m. UTC | #4
On 7/27/20 11:46 AM, Hannes Reinecke wrote:
> On 7/27/20 5:45 PM, Tony Asleson wrote:
>> On 7/26/20 10:10 AM, Christoph Hellwig wrote:
>>> FYI, I think these identifiers are absolutely horrible and have no
>>> business in dmesg:
>>
>> The identifiers are structured data, they're not visible unless you go
>> looking for them.
>>
>> I'm open to other suggestions on how we can positively identify storage
>> devices over time, across reboots, replacement, and dynamic
>> reconfiguration.
>>
>> My home system has 4 disks, 2 are identical except for serial number.
>> Even with this simple configuration, it's not trivial to identify which
>> message goes with which disk across reboots.
>>
> Well; the more important bits would be to identify the physical location
> where these disks reside.
> If one goes bad it doesn't really help you if you have a persistent
> identification in the OS; what you really need is a physical indicator
> or a physical location allowing you to identify which disk to pull.

In my use case I have no slot information.  I have no SCSI enclosure
services to toggle identification LEDs or fault LEDs for the drive sled.
For some users the device might be a virtual one in a storage server;
VPD helps.

In my case the in-kernel VPD (WWN) data can be used to correlate with
the sticker on the disk, as the disks have the WWN printed on them.  I
would think this is true for most disks/storage devices, but obviously I
can't make that statement with 100% certainty, as I have a small sample
size.

> Which isn't addressed at all with this patchset (nor should it be; the
> physical location is typically found via other means).
> 
> And for the other use-cases: We do have persistent device links, do we
> not?

How does /dev/disk/by-* help when you are looking at the journal from 1
or more reboots ago and the only thing you have in your journal is
something like:

blk_update_request: critical medium error, dev sde, sector 43578 op
0x0:(READ) flags 0x0 phys_seg 1 prio class 0

The links are only valid for right now.
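
For example (illustrative listing; the names and date are made up), a
by-id symlink tells you what sde maps to at this moment, not what it
mapped to when an older journal entry was written:

    $ ls -l /dev/disk/by-id/ | grep sde
    lrwxrwxrwx. 1 root root 9 Jul 27 12:00 wwn-0x5000c5003011cb2b -> ../../sde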

-Tony
Douglas Gilbert July 27, 2020, 7:17 p.m. UTC | #5
On 2020-07-27 1:42 p.m., Tony Asleson wrote:
> On 7/27/20 11:46 AM, Hannes Reinecke wrote:
>> On 7/27/20 5:45 PM, Tony Asleson wrote:
>>> On 7/26/20 10:10 AM, Christoph Hellwig wrote:
>>>> FYI, I think these identifiers are absolutely horrible and have no
>>>> business in dmesg:
>>>
>>> The identifiers are structured data, they're not visible unless you go
>>> looking for them.
>>>
>>> I'm open to other suggestions on how we can positively identify storage
>>> devices over time, across reboots, replacement, and dynamic
>>> reconfiguration.
>>>
>>> My home system has 4 disks, 2 are identical except for serial number.
>>> Even with this simple configuration, it's not trivial to identify which
>>> message goes with which disk across reboots.
>>>
>> Well; the more important bits would be to identify the physical location
>> where these disks reside.
>> If one goes bad it doesn't really help you if you have a persistent
>> identification in the OS; what you really need is a physical indicator
>> or a physical location allowing you to identify which disk to pull.
> 
> In my use case I have no slot information.  I have no SCSI enclosure
> services to toggle identification LEDs or fault LEDs for the drive sled.
>   For some users the device might be a virtual one in a storage server,
> vpd helps.
> 
> In my case the in kernel vpd (WWN) data can be used to correlate with
> the sticker on the disk as the disks have the WWN printed on them.  I
> would think this is true for most disks/storage devices, but obviously I
> can't make that statement with 100% certainty as I have a small sample size.
> 
>> Which isn't addressed at all with this patchset (nor should it be; the
>> physical location is typically found via other means).
>>
>> And for the other use-cases: We do have persistent device links, do we
>> not?
> 
> How does /dev/disk/by-* help when you are looking at the journal from 1
> or more reboots ago and the only thing you have in your journal is
> something like:
> 
> blk_update_request: critical medium error, dev sde, sector 43578 op
> 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> 
> The links are only valid for right now.

Does:
    lsscsi -U
or
    lsscsi -UU

solve your problem, or come close?

Example:
# lsscsi -UU
[1:0:0:0]    disk    naa.5000cca02b38d0b8  /dev/sda
[1:0:1:0]    disk    naa.5000c5003011cb2b  /dev/sdb
[1:0:2:0]    enclosu naa.5001b4d516ecc03f  -
[N:0:1:1]    disk    eui.e8238fa6bf530001001b448b46bd5525    /dev/nvme0n1

The first two (SAS SSDs) NAAs are printed on the disk labels. I don't
think either the enclosure or the M.2 NVMe SSD has its number visible
(i.e. the last two lines of output).

If it is what you want, then perhaps you could arrange for its output
to be sent to the log when the system has stabilized after a reboot. That
would only leave disk hotplug events exposed.

Faced with the above medium error I would try:
    dd if=<all_possibles> bs=512 skip=43578 iflag=direct of=/dev/null count=1
and look for noise in the logs. Change 'bs=512' up to 4096 if that is
the logical block size. For <all_possibles> use /dev/sde (and /dev/sdf and
/dev/sdg or whatever), IOW the _whole_ disk device name.
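
A quick loop along those lines (a sketch; adjust bs to each device's
logical block size):

    for d in /dev/sd?; do
        echo "probing $d"
        dd if="$d" bs=512 skip=43578 iflag=direct of=/dev/null count=1
    done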

Doug Gilbert
Tony Asleson July 27, 2020, 8:27 p.m. UTC | #6
On 7/27/20 2:17 PM, Douglas Gilbert wrote:
> On 2020-07-27 1:42 p.m., Tony Asleson wrote:
>> On 7/27/20 11:46 AM, Hannes Reinecke wrote:
>>> On 7/27/20 5:45 PM, Tony Asleson wrote:
>>>> On 7/26/20 10:10 AM, Christoph Hellwig wrote:
>>>>> FYI, I think these identifiers are absolutely horrible and have no
>>>>> business in dmesg:
>>>>
>>>> The identifiers are structured data, they're not visible unless you go
>>>> looking for them.
>>>>
>>>> I'm open to other suggestions on how we can positively identify storage
>>>> devices over time, across reboots, replacement, and dynamic
>>>> reconfiguration.
>>>>
>>>> My home system has 4 disks, 2 are identical except for serial number.
>>>> Even with this simple configuration, it's not trivial to identify which
>>>> message goes with which disk across reboots.
>>>>
>>> Well; the more important bits would be to identify the physical location
>>> where these disks reside.
>>> If one goes bad it doesn't really help you if you have a persistent
>>> identification in the OS; what you really need is a physical indicator
>>> or a physical location allowing you to identify which disk to pull.
>>
>> In my use case I have no slot information.  I have no SCSI enclosure
>> services to toggle identification LEDs or fault LEDs for the drive sled.
>>   For some users the device might be a virtual one in a storage server,
>> vpd helps.
>>
>> In my case the in kernel vpd (WWN) data can be used to correlate with
>> the sticker on the disk as the disks have the WWN printed on them.  I
>> would think this is true for most disks/storage devices, but obviously I
>> can't make that statement with 100% certainty as I have a small sample
>> size.
>>
>>> Which isn't addressed at all with this patchset (nor should it be; the
>>> physical location is typically found via other means).
>>>
>>> And for the other use-cases: We do have persistent device links, do we
>>> not?
>>
>> How does /dev/disk/by-* help when you are looking at the journal from 1
>> or more reboots ago and the only thing you have in your journal is
>> something like:
>>
>> blk_update_request: critical medium error, dev sde, sector 43578 op
>> 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>>
>> The links are only valid for right now.
> 
> Does:
>    lsscsi -U
> or
>    lsscsi -UU
> 
> solve your problem, or come close?
> 
> Example:
> # lsscsi -UU
> [1:0:0:0]    disk    naa.5000cca02b38d0b8  /dev/sda
> [1:0:1:0]    disk    naa.5000c5003011cb2b  /dev/sdb
> [1:0:2:0]    enclosu naa.5001b4d516ecc03f  -
> [N:0:1:1]    disk    eui.e8238fa6bf530001001b448b46bd5525    /dev/nvme0n1
> 
> The first two (SAS SSDs) NAAs are printed on the disk labels. I don't
> think either that enclosure or the M2 NVMe SSD have their numbers
> visible (i.e. the last two lines of output).
> 
> If it is what you want, then perhaps you could arrange for its output
> to be sent to the log when the system has stabilized after a reboot. That
> would only leave disk hotplug events exposed.

Yes, if we write a new udev rule or script we could place breadcrumbs
in the journal so we can correlate
sda == naa.5000cca02b38d0b8 at the time of the error.  However, none of
the existing tooling will allow you to easily see all the log messages
that pertain to the device.  The user is still left with a bunch of
log messages that use different ways to refer to the same device
attachment, e.g. sda, ata1.00, sd 0:0:0:0.  For them to understand which
messages go with which device is not trivial.  Even if someone writes a
tool to parse the messages, looking for the string that contains the ID,
and that has the needed decoder information to associate it with the
correct piece of hardware, it's only good until the message changes in
the kernel.
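
(A hypothetical rule, just to make the comparison concrete; nothing like
this exists today:

    # /etc/udev/rules.d/99-wwid-breadcrumb.rules
    ACTION=="add", SUBSYSTEM=="block", ENV{ID_WWN}=="?*", RUN+="/usr/bin/logger -t wwid-map '%k -> $env{ID_WWN}'"

Even with that breadcrumb in place, you still have to join it against
every later message for the device by hand.)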

If we stuff the de facto ID into the message itself when it occurs, the
ambiguity of which device is associated with a message is removed.

I would like to know why this is so horrible.  Is it processing time in
an error path?  Stack usage holding the data in flight?  Wasted space on
disk?  Are unique identifiers just too long and terse?

The only valid reason I can think of is someone working with very
sensitive data who doesn't want the unique ID of a removable or network
storage device to be in their logs.  Of course we could add a boot-time
option to disable it, or make the default off for those that don't
want/care.

> Faced with the above medium error I would try:
>    dd if=<all_possibles> bs=512 skip=43578 iflag=direct of=/dev/null
> count=1
> and look for noise in the logs. Change 'bs=512' up to 4096 if that is
> the logical block size. For <all_possibles> use /dev/sde (and /dev/sdf and
> /dev/sdg or whatever), IOW the _whole_ disk device name.

Assuming the error reproduces, this would work.  However, I think this
workaround speaks volumes about how difficult it is simply to identify
which device an error originated from.

-Tony