Unreliable disk detection order in 5.x

Message ID 20211105064623.GD32560@hostway.ca (mailing list archive)
State New, archived

Commit Message

Simon Kirby Nov. 5, 2021, 6:46 a.m. UTC
I'm seeing disk detection order changing across reboots on 5.x kernels
(5.4, 5.10, 5.14), but not 4.9, 4.14, 4.19, with megaraid_sas (Dell
PERC_H700). With 13 disks and 5.14.14, the order changes almost always.

I did initially try to bisect this issue, but it seems to become more
rare in earlier kernels, and there are some non-booting problems between
4.x and 5.x.

The most common effect is swapping of sda with sdb, or two neighboring
devices in the list; for example:

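# diff -u lsblk-S-5.10.0 lsblk-S-5.10.0-2
--- lsblk-S-5.10.0      2021-11-04 15:23:23.767008360 -0400
+++ lsblk-S-5.10.0-2    2021-11-04 17:34:37.748310196 -0400
@@ -1,6 +1,6 @@
 NAME HCTL       TYPE VENDOR   MODEL      REV TRAN
-sda  0:2:0:0    disk DELL     PERC_H700 2.10
-sdb  0:2:2:0    disk DELL     PERC_H700 2.10
+sda  0:2:2:0    disk DELL     PERC_H700 2.10
+sdb  0:2:0:0    disk DELL     PERC_H700 2.10
 sdc  0:2:3:0    disk DELL     PERC_H700 2.10
 sdd  0:2:4:0    disk DELL     PERC_H700 2.10
 sde  0:2:5:0    disk DELL     PERC_H700 2.10
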
This is happening on vendor (Debian 5.10.0) and home-built kernels, and
on a variety of hosts. On all kernels, the detection printks come up in
an interesting order, but in older kernels, it always ends up with an
sd-name that is ordered by SCSI ID ascending:

[    2.289776] sd 0:2:0:0: [sda] 999030784 512-byte logical blocks: (512 GB/476 GiB)
[    2.289918] sd 0:2:4:0: [sdd] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
[    2.289947] sd 0:2:3:0: [sdc] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
[    2.290032] sd 0:2:6:0: [sdf] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
[    2.290210] sd 0:2:7:0: [sdg] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
[    2.290248] sd 0:2:9:0: [sdi] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
[    2.290323] sd 0:2:2:0: [sdb] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
[    2.290461] sd 0:2:5:0: [sde] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
[    2.290476] sd 0:2:8:0: [sdh] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
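
To check a saved dmesg for this property, something like the following could be used. This is a minimal Python sketch, not part of the original message; the function names are hypothetical, and the regular expression assumes the "sd H:C:T:L: [sdX]" message format shown above.

```python
import re

# Matches probe messages such as:
#   [    2.289918] sd 0:2:4:0: [sdd] 11719933952 512-byte logical blocks: ...
SD_LINE = re.compile(r"sd (\d+):(\d+):(\d+):(\d+): \[(sd[a-z]+)\]")


def sd_names_by_hctl(dmesg_text):
    """Return {(host, channel, target, lun): 'sdX'} from sd probe messages."""
    mapping = {}
    for line in dmesg_text.splitlines():
        m = SD_LINE.search(line)
        if m:
            hctl = tuple(int(g) for g in m.groups()[:4])
            # Keep the first name seen for each address.
            mapping.setdefault(hctl, m.group(5))
    return mapping


def sd_sort_key(name):
    # sdz must sort before sdaa: compare by length first, then alphabetically.
    return (len(name), name)


def names_follow_hctl_order(mapping):
    """True if the sd names ascend in the same order as the H:C:T:L tuples."""
    names = [mapping[hctl] for hctl in sorted(mapping)]
    return names == sorted(names, key=sd_sort_key)
```

Fed the log lines above, the mapping comes out in ascending order (0:2:0:0 is sda, 0:2:2:0 is sdb, and so on) even though the messages were printed out of order.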

Full "dmesg" is saved here: https://0x.ca/sim/ref/5.10.0/dmesg

Any ideas or suggestions on what I could use to track down what changed
here, or ideas on what might have influenced it?

Simon-

Comments

Damien Le Moal Nov. 5, 2021, 7:45 a.m. UTC | #1
On 2021/11/05 15:46, Simon Kirby wrote:
> I'm seeing disk detection order changing across reboots on 5.x kernels
> (5.4, 5.10, 5.14), but not 4.9, 4.14, 4.19, with megaraid_sas (Dell
> PERC_H700). With 13 disks and 5.14.14, the order changes almost always.
> 
> I did initially try to bisect this issue, but it seems to become more
> rare in earlier kernels, and there are some non-booting problems between
> 4.x and 5.x.
> 
> The most common effect is swapping of sda with sdb, or two neighboring
> devices in the list; for example:
> 
> # diff -u lsblk-S-5.10.0 lsblk-S-5.10.0-2
> --- lsblk-S-5.10.0      2021-11-04 15:23:23.767008360 -0400
> +++ lsblk-S-5.10.0-2    2021-11-04 17:34:37.748310196 -0400
> @@ -1,6 +1,6 @@
>  NAME HCTL       TYPE VENDOR   MODEL      REV TRAN
> -sda  0:2:0:0    disk DELL     PERC_H700 2.10
> -sdb  0:2:2:0    disk DELL     PERC_H700 2.10
> +sda  0:2:2:0    disk DELL     PERC_H700 2.10
> +sdb  0:2:0:0    disk DELL     PERC_H700 2.10
>  sdc  0:2:3:0    disk DELL     PERC_H700 2.10
>  sdd  0:2:4:0    disk DELL     PERC_H700 2.10
>  sde  0:2:5:0    disk DELL     PERC_H700 2.10
> 
> This is happening on vendor (Debian 5.10.0) and home-built kernels, and
> on a variety of hosts. On all kernels, the detection printks come up in
> an interesting order, but in older kernels, it always ends up with an
> sd-name that is ordered by SCSI ID ascending:
> 
> [    2.289776] sd 0:2:0:0: [sda] 999030784 512-byte logical blocks: (512 GB/476 GiB)
> [    2.289918] sd 0:2:4:0: [sdd] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [    2.289947] sd 0:2:3:0: [sdc] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [    2.290032] sd 0:2:6:0: [sdf] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [    2.290210] sd 0:2:7:0: [sdg] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [    2.290248] sd 0:2:9:0: [sdi] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [    2.290323] sd 0:2:2:0: [sdb] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [    2.290461] sd 0:2:5:0: [sde] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [    2.290476] sd 0:2:8:0: [sdh] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> 
> Full "dmesg" is saved here: https://0x.ca/sim/ref/5.10.0/dmesg
> 
> Any ideas or suggestions on what I could use to track down what changed
> here, or ideas on what might have influenced it?

Most distro kernels are now compiled with asynchronous device scan enabled to
speed up the boot process. This can result in the device names changing
across reboots. Reliable device names are provided by udev under
/dev/disk/by-id, by-uuid etc.

You can turn off scsi asynchronous device scan using the scsi_mod.scan=sync
kernel boot argument, or disable the CONFIG_SCSI_SCAN_ASYNC option for your
kernel (device drivers -> scsi device support -> asynchronous scsi scanning).

But even with synchronous scanning, device names are not reliable and there are
no guarantees that one particular device will always have the same name.
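
To confirm which scan mode a booted system is actually using, a minimal Python sketch follows (not part of the original message). The helper names are hypothetical; the sysfs path is assumed to be where the scsi_mod "scan" parameter is exposed, and treating the last occurrence on the command line as the effective one is also an assumption.

```python
from pathlib import Path


def scan_mode_from_cmdline(cmdline):
    """Return the value of scsi_mod.scan= on a kernel command line, or None."""
    mode = None
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if sep and key == "scsi_mod.scan":
            # Assumption: a later occurrence overrides an earlier one.
            mode = value
    return mode


def current_scan_mode(sysfs_param="/sys/module/scsi_mod/parameters/scan"):
    """Read the active scan mode from sysfs; None if the file cannot be read."""
    try:
        return Path(sysfs_param).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print("cmdline:", scan_mode_from_cmdline(Path("/proc/cmdline").read_text()))
    print("sysfs:  ", current_scan_mode())
```

Comparing the two values shows whether the boot argument was really picked up, which is worth verifying before concluding that the setting has no effect.
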
Simon Kirby Nov. 7, 2021, 2:24 a.m. UTC | #2
On Fri, Nov 05, 2021 at 04:45:53PM +0900, Damien Le Moal wrote:

> > Any ideas or suggestions on what I could use to track down what changed
> > here, or ideas on what might have influenced it?
> 
> Most distro kernels are now compiled with asynchronous device scan enabled to
> speed up the boot process. This can result in the device names changing
> across reboots. Reliable device names are provided by udev under
> /dev/disk/by-id, by-uuid etc.
> 
> You can turn off scsi asynchronous device scan using the scsi_mod.scan=sync
> kernel boot argument, or disable the CONFIG_SCSI_SCAN_ASYNC option for your
> kernel (device drivers -> scsi device support -> asynchronous scsi scanning).
> 
> But even with synchronous scanning, device names are not reliable and there are
> no guarantees that one particular device will always have the same name.

This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and also
with scsi_mod.scan=sync on vendor kernels. All of these disks are coming
from the same driver and card.

I understand that using UUIDs, by-id, etc., is an option to work around
this, but then we would have to push IDs for disks in every server to our
configuration management. It does not seem that this change is really
intentional.

Simon-
Bart Van Assche Nov. 7, 2021, 7:51 p.m. UTC | #3
On 11/6/21 19:24, Simon Kirby wrote:
> This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
> also with scsi_mod.scan=sync on vendor kernels. All of these disks
> are coming from the same driver and card.
> 
> I understand that using UUIDs, by-id, etc., is an option to work
> around this, but then we would have to push IDs for disks in every
> server to our configuration management. It does not seem that this
> change is really intentional.

SCSI disk detection has been asynchronous on purpose for a long time. The
most recent commit I know of that changed SCSI disk scanning
behavior is commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for
asynchronous probing").

Please use one of the /dev/disk/by-*/* identifiers as Damien requested.

Thanks,

Bart.
Simon Kirby Nov. 11, 2021, 1:01 a.m. UTC | #4
On Sun, Nov 07, 2021 at 11:51:45AM -0800, Bart Van Assche wrote:

> On 11/6/21 19:24, Simon Kirby wrote:
> > This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
> > also with scsi_mod.scan=sync on vendor kernels. All of these disks
> > are coming from the same driver and card.
> > 
> > I understand that using UUIDs, by-id, etc., is an option to work
> > around this, but then we would have to push IDs for disks in every
> > server to our configuration management. It does not seem that this
> > change is really intentional.
> 
> SCSI disk detection has been asynchronous on purpose for a long time. The most
> recent commit I know of that changed SCSI disk scanning
> behavior is commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for
> asynchronous probing").
> 
> Please use one of the /dev/disk/by-*/* identifiers as Damien requested.

Hi Bart,

So, we're using DRBD on top of these, which means by-uuid is not
available; we can use only by-id and by-path. by-id is dependent on disk
models and serial numbers, and by-path is dependent on PCI bus details.
Both are going to be a good deal more work to maintain, since they're
both not just a simple enumeration.
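
For reference, the enumeration by SCSI address that the old sd* ordering provided can be approximated in user space from "lsblk -S" output. This is a minimal Python sketch, not part of the original message, with hypothetical helper names. It assumes that NAME and HCTL are the first two columns, as in the listing earlier in the thread, and that host numbers themselves stay stable across boots, which is not guaranteed.

```python
def parse_lsblk_scsi(text):
    """Map 'H:C:T:L' -> kernel name from `lsblk -S` style output."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that does not look like "name h:c:t:l".
        if len(fields) < 2 or fields[1].count(":") != 3:
            continue
        mapping[fields[1]] = fields[0]
    return mapping


def hctl_key(hctl):
    # Sort numerically so that 0:2:10:0 comes after 0:2:9:0.
    return tuple(int(part) for part in hctl.split(":"))


def enumerate_by_hctl(mapping):
    """Return [(index, hctl, name)] ordered by H:C:T:L, not by sd name."""
    ordered = sorted(mapping, key=hctl_key)
    return [(i, hctl, mapping[hctl]) for i, hctl in enumerate(ordered)]
```

With the two lsblk snapshots from the first message, index 0 refers to 0:2:0:0 in both, whether the kernel happened to call it sda or sdb on that boot.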

I did try 5.14.17 with f049cf1a7b67 (and a065c0faacb1) reverted, and it
does indeed restore the behaviour where sd* order appears to be reliable.
Scan time (time until systemd starts) is within 4ms across 3 boots with
and without the revert, but this is just our particular case.

I don't fully understand the scan process here, but I can understand the
challenges in trying to parallelize it and still end up with a consistent
enumerated list.

I guess you would agree that removing sd* entirely would not be an option
because they've existed forever historically, but at the same time, the
only times they really "work" now are as symlink targets for by-*, and in
the case where only one disk exists at boot time. Do I have this right?

Simon-
Damien Le Moal Nov. 11, 2021, 1:16 a.m. UTC | #5
On 2021/11/11 10:01, Simon Kirby wrote:
> On Sun, Nov 07, 2021 at 11:51:45AM -0800, Bart Van Assche wrote:
> 
>> On 11/6/21 19:24, Simon Kirby wrote:
>>> This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
>>> also with scsi_mod.scan=sync on vendor kernels. All of these disks
>>> are coming from the same driver and card.
>>>
>>> I understand that using UUIDs, by-id, etc., is an option to work
>>> around this, but then we would have to push IDs for disks in every
>>> server to our configuration management. It does not seem that this
>>> change is really intentional.
>>
>> SCSI disk detection has been asynchronous on purpose for a long time. The most
>> recent commit I know of that changed SCSI disk scanning
>> behavior is commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for
>> asynchronous probing").
>>
>> Please use one of the /dev/disk/by-*/* identifiers as Damien requested.
> 
> Hi Bart,
> 
> So, we're using DRBD on top of these, which means by-uuid is not
> available; we can use only by-id and by-path. by-id is dependent on disk
> models and serial numbers, and by-path is dependent on PCI bus details.
> Both are going to be a good deal more work to maintain, since they're
> both not just a simple enumeration.
> 
> I did try 5.14.17 with f049cf1a7b67 (and a065c0faacb1) reverted, and it
> does indeed restore the behaviour where sd* order appears to be reliable.
> Scan time (time until systemd starts) is within 4ms across 3 boots with
> and without the revert, but this is just our particular case.
> 
> I don't fully understand the scan process here, but I can understand the
> challenges in trying to parallelize it and still end up with a consistent
> enumerated list.

Even if parallel disk scanning is disabled at boot to get drive names that
follow some port or LUN order, any run-time event that causes a drive to "go
away" and come back (e.g. a topology change event) can result in the drive name
changing. The order itself also depends on the LLD code: a driver change can
result in a different probe order, and therefore in different names. The same
applies if, say, you create or delete LUNs on a RAID system: you will get some
drive names when doing it, but after a reboot and rescan the LUNs may be
presented with different names. /dev/sdX names are simply not reliable. For
consistent, reliable drive configurations, applications must use the
/dev/disk/by-*/* IDs.
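
To see which persistent names currently point at a given sdX device, a minimal Python sketch follows (not part of the original message). The function names are hypothetical; the only assumption about the system is that udev populates the by-* directory with symlinks to the kernel device nodes.

```python
import os


def stable_links(by_dir="/dev/disk/by-id"):
    """Return {link_name: resolved_device_path} for every symlink in by_dir."""
    links = {}
    for entry in sorted(os.listdir(by_dir)):
        path = os.path.join(by_dir, entry)
        if os.path.islink(path):
            links[entry] = os.path.realpath(path)
    return links


def links_for_device(device, by_dir="/dev/disk/by-id"):
    """List the persistent names that currently resolve to a device node."""
    target = os.path.realpath(device)
    return [name for name, dev in stable_links(by_dir).items() if dev == target]


if __name__ == "__main__":
    # Example: show the by-id names behind whatever is /dev/sda on this boot.
    for name in links_for_device("/dev/sda"):
        print(name)
```

A configuration would then record the by-id (or by-path) name and resolve it at open time, instead of recording the sdX name that the link happened to point to.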

> 
> I guess you would agree that removing sd* entirely would not be an option
> because they've existed forever historically, but at the same time, the
> only times they really "work" now are as symlink targets for by-*, and in
> the case where only one disk exists at boot time. Do I have this right?
> 
> Simon-
>
Hannes Reinecke Nov. 11, 2021, 6:57 a.m. UTC | #6
On 11/11/21 2:01 AM, Simon Kirby wrote:
> On Sun, Nov 07, 2021 at 11:51:45AM -0800, Bart Van Assche wrote:
> 
>> On 11/6/21 19:24, Simon Kirby wrote:
>>> This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
>>> also with scsi_mod.scan=sync on vendor kernels. All of these disks
>>> are coming from the same driver and card.
>>>
>>> I understand that using UUIDs, by-id, etc., is an option to work
>>> around this, but then we would have to push IDs for disks in every
>>> server to our configuration management. It does not seem that this
>>> change is really intentional.
>>
>> SCSI disk detection has been asynchronous on purpose for a long time. The most
>> recent commit I know of that changed SCSI disk scanning
>> behavior is commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for
>> asynchronous probing").
>>
>> Please use one of the /dev/disk/by-*/* identifiers as Damien requested.
> 
> Hi Bart,
> 
> So, we're using DRBD on top of these, which means by-uuid is not
> available; we can use only by-id and by-path. by-id is dependent on disk
> models and serial numbers, and by-path is dependent on PCI bus details.
> Both are going to be a good deal more work to maintain, since they're
> both not just a simple enumeration.
> 
Why is by-uuid not available?
The uuid is the disk-internal unique identification, and to my knowledge 
all recent SCSI and SATA drives implement them.
So where is the problem here?

Cheers,

Hannes
Phillip Susi Nov. 12, 2021, 12:11 a.m. UTC | #7
Hannes Reinecke <hare@suse.de> writes:

> Why is by-uuid not available?
> The uuid is the disk-internal unique identification, and to my
> knowledge all recent SCSI and SATA drives implement them.
> So where is the problem here?

It is probably just an oversight by the udev rules that create the
by-uuid links.
Hannes Reinecke Nov. 12, 2021, 6:38 a.m. UTC | #8
On 11/12/21 1:11 AM, Phillip Susi wrote:
> 
> Hannes Reinecke <hare@suse.de> writes:
> 
>> Why is by-uuid not available?
>> The uuid is the disk-internal unique identification, and to my
>> knowledge all recent SCSI and SATA drives implement them.
>> So where is the problem here?
> 
> It is probably just an oversight by the udev rules that create the
> by-uuid links.
> 
So shouldn't we rather fix this?

Putting in some udev rules is always easier than trying to 'fix' things 
in the kernel.

Cheers,

Hannes