Message ID | 20211105064623.GD32560@hostway.ca (mailing list archive) |
---|---|
State | New, archived |
Series | Unreliable disk detection order in 5.x |
On 2021/11/05 15:46, Simon Kirby wrote:
> I'm seeing disk detection order changing across reboots on 5.x kernels
> (5.4, 5.10, 5.14), but not 4.9, 4.14, 4.19, with megaraid_sas (Dell
> PERC_H700). With 13 disks and 5.14.14, the order changes almost always.
>
> I did initially try to bisect this issue, but it seems to become more
> rare in earlier kernels, and there are some non-booting problems between
> 4.x and 5.x.
>
> The most common effect is swapping of sda with sdb, or two neighboring
> devices in the list; for example:
>
> # diff -u lsblk-S-5.10.0 lsblk-S-5.10.0-2
> --- lsblk-S-5.10.0 2021-11-04 15:23:23.767008360 -0400
> +++ lsblk-S-5.10.0-2 2021-11-04 17:34:37.748310196 -0400
> @@ -1,6 +1,6 @@
> NAME HCTL TYPE VENDOR MODEL REV TRAN
> -sda 0:2:0:0 disk DELL PERC_H700 2.10
> -sdb 0:2:2:0 disk DELL PERC_H700 2.10
> +sda 0:2:2:0 disk DELL PERC_H700 2.10
> +sdb 0:2:0:0 disk DELL PERC_H700 2.10
> sdc 0:2:3:0 disk DELL PERC_H700 2.10
> sdd 0:2:4:0 disk DELL PERC_H700 2.10
> sde 0:2:5:0 disk DELL PERC_H700 2.10
>
> This is happening on vendor (Debian 5.10.0) and home-built kernels, and
> on a variety of hosts. On all kernels, the detection printks come up in
> an interesting order, but in older kernels, it always ends up with an
> sd-name that is ordered by SCSI ID ascending:
>
> [ 2.289776] sd 0:2:0:0: [sda] 999030784 512-byte logical blocks: (512 GB/476 GiB)
> [ 2.289918] sd 0:2:4:0: [sdd] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [ 2.289947] sd 0:2:3:0: [sdc] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [ 2.290032] sd 0:2:6:0: [sdf] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [ 2.290210] sd 0:2:7:0: [sdg] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [ 2.290248] sd 0:2:9:0: [sdi] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [ 2.290323] sd 0:2:2:0: [sdb] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [ 2.290461] sd 0:2:5:0: [sde] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
> [ 2.290476] sd 0:2:8:0: [sdh] 11719933952 512-byte logical blocks: (6.00 TB/5.46 TiB)
>
> Full "dmesg" is saved here: https://0x.ca/sim/ref/5.10.0/dmesg
>
> Any ideas or suggestions on what I could use to track down what changed
> here, or ideas on what might have influenced it?

Most distro kernels are now compiled with asynchronous device scan enabled to
speed up the boot process. This can result in the device names changing
across reboots. Reliable device names are provided by udev under
/dev/disk/by-id, by-uuid etc.

You can turn off SCSI asynchronous device scan using the scsi_mod.scan=sync
kernel boot argument, or disable the CONFIG_SCSI_SCAN_ASYNC option for your
kernel (device drivers -> scsi device support -> asynchronous scsi scanning).

But even with synchronous scanning, device names are not reliable and there is
no guarantee that one particular device will always have the same name.
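As a rough illustration of the advice above, the persistent names that udev maintains can be resolved to whatever sdX name the kernel assigned on the current boot. The Python sketch below is not from the thread: the function name `persistent_names` is hypothetical, and the only assumption is that the chosen directory (for example `/dev/disk/by-path` or `/dev/disk/by-id`) contains symlinks to the device nodes.

```python
import os


def persistent_names(link_dir="/dev/disk/by-path"):
    """Map each symlink name in link_dir to the device node it points at now.

    Returns an empty dict if link_dir does not exist (hypothetical helper).
    """
    if not os.path.isdir(link_dir):
        return {}
    mapping = {}
    for name in sorted(os.listdir(link_dir)):
        link = os.path.join(link_dir, name)
        # Only symlinks are of interest; skip anything else in the directory.
        if os.path.islink(link):
            mapping[name] = os.path.realpath(link)
    return mapping
```

Calling `persistent_names("/dev/disk/by-id")` on two boots and comparing the results should show the same keys even when the sdX targets have swapped.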
On Fri, Nov 05, 2021 at 04:45:53PM +0900, Damien Le Moal wrote:
> > Any ideas or suggestions on what I could use to track down what changed
> > here, or ideas on what might have influenced it?
>
> Most distro kernels are now compiled with asynchronous device scan enabled to
> speed up the boot process. This can result in the device names changing
> across reboots. Reliable device names are provided by udev under
> /dev/disk/by-id, by-uuid etc.
>
> You can turn off SCSI asynchronous device scan using the scsi_mod.scan=sync
> kernel boot argument, or disable the CONFIG_SCSI_SCAN_ASYNC option for your
> kernel (device drivers -> scsi device support -> asynchronous scsi scanning).
>
> But even with synchronous scanning, device names are not reliable and there is
> no guarantee that one particular device will always have the same name.

This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
also with scsi_mod.scan=sync on vendor kernels. All of these disks
are coming from the same driver and card.

I understand that using UUIDs, by-id, etc., is an option to work
around this, but then we would have to push IDs for disks in every
server to our configuration management. It does not seem that this
change is really intentional.

Simon-
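When comparing boots with and without scsi_mod.scan=sync, it can help to confirm which value was actually passed to the running kernel. The following Python sketch is an illustration only: the helper name `scsi_scan_mode` is hypothetical, and it simply looks for the `scsi_mod.scan=` token in a command-line string such as the contents of `/proc/cmdline`. It says nothing about the compiled-in default when the argument is absent.

```python
def scsi_scan_mode(cmdline):
    """Return the value given for scsi_mod.scan= in cmdline, or None if absent.

    If the argument appears more than once, the last value seen is returned.
    """
    mode = None
    for token in cmdline.split():
        if token.startswith("scsi_mod.scan="):
            mode = token.split("=", 1)[1]
    return mode
```

On a live system this could be used as `scsi_scan_mode(open("/proc/cmdline").read())`.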
On 11/6/21 19:24, Simon Kirby wrote:
> This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
> also with scsi_mod.scan=sync on vendor kernels. All of these disks
> are coming from the same driver and card.
>
> I understand that using UUIDs, by-id, etc., is an option to work
> around this, but then we would have to push IDs for disks in every
> server to our configuration management. It does not seem that this
> change is really intentional.

SCSI disk detection has been asynchronous on purpose for a long time. The most
recent commit I know of that changed SCSI disk scanning
behavior is commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for
asynchronous probing").

Please use one of the /dev/disk/by-*/* identifiers as Damien requested.

Thanks,

Bart.
On Sun, Nov 07, 2021 at 11:51:45AM -0800, Bart Van Assche wrote:
> On 11/6/21 19:24, Simon Kirby wrote:
> > This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
> > also with scsi_mod.scan=sync on vendor kernels. All of these disks
> > are coming from the same driver and card.
> >
> > I understand that using UUIDs, by-id, etc., is an option to work
> > around this, but then we would have to push IDs for disks in every
> > server to our configuration management. It does not seem that this
> > change is really intentional.
>
> SCSI disk detection has been asynchronous on purpose for a long time. The most
> recent commit I know of that changed SCSI disk scanning
> behavior is commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for
> asynchronous probing").
>
> Please use one of the /dev/disk/by-*/* identifiers as Damien requested.

Hi Bart,

So, we're using DRBD on top of these, which means by-uuid is not
available; we can use only by-id and by-path. by-id is dependent on disk
models and serial numbers, and by-path is dependent on PCI bus details.
Both are going to be a good deal more work to maintain, since they're
both not just a simple enumeration.

I did try 5.14.17 with f049cf1a7b67 (and a065c0faacb1) reverted, and it
does indeed restore the behaviour where sd* order appears to be reliable.
Scan time (time until systemd starts) is within 4ms across 3 boots with
and without the revert, but this is just our particular case.

I don't fully understand the scan process here, but I can understand the
challenges in trying to parallelize it and still end up with a consistent
enumerated list.

I guess you would agree that removing sd* entirely would not be an option
because they've existed forever historically, but at the same time, the
only time they really "work" now is as symlink targets for by-*, and in
the case where only one disk exists at boot time. Do I have this right?

Simon-
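One way to recover a "simple enumeration" without relying on sdX names is to order the disks by their HCTL address, which in the lsblk output earlier in the thread stays the same across boots while the names swap. The Python sketch below is only an illustration under that assumption: the function name `by_hctl` is hypothetical, and it parses text in the `lsblk -S` layout shown in the thread (NAME in the first column, HCTL in the second).

```python
def by_hctl(lsblk_output):
    """Return (hctl, name) pairs from `lsblk -S` text, sorted numerically by H:C:T:L."""
    rows = []
    for line in lsblk_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, hctl = fields[0], fields[1]
        try:
            key = tuple(int(part) for part in hctl.split(":"))
        except ValueError:
            continue  # second column is not an H:C:T:L address
        rows.append((key, name))
    rows.sort()
    return [(":".join(str(n) for n in key), name) for key, name in rows]
```

The position of a disk in the returned list depends only on its HCTL address, so the same index refers to the same controller slot on every boot, as long as the HCTL assignment itself is stable.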
On 2021/11/11 10:01, Simon Kirby wrote:
> On Sun, Nov 07, 2021 at 11:51:45AM -0800, Bart Van Assche wrote:
>
>> On 11/6/21 19:24, Simon Kirby wrote:
>>> This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
>>> also with scsi_mod.scan=sync on vendor kernels. All of these disks
>>> are coming from the same driver and card.
>>>
>>> I understand that using UUIDs, by-id, etc., is an option to work
>>> around this, but then we would have to push IDs for disks in every
>>> server to our configuration management. It does not seem that this
>>> change is really intentional.
>>
>> SCSI disk detection has been asynchronous on purpose for a long time. The most
>> recent commit I know of that changed SCSI disk scanning
>> behavior is commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for
>> asynchronous probing").
>>
>> Please use one of the /dev/disk/by-*/* identifiers as Damien requested.
>
> Hi Bart,
>
> So, we're using DRBD on top of these, which means by-uuid is not
> available; we can use only by-id and by-path. by-id is dependent on disk
> models and serial numbers, and by-path is dependent on PCI bus details.
> Both are going to be a good deal more work to maintain, since they're
> both not just a simple enumeration.
>
> I did try 5.14.17 with f049cf1a7b67 (and a065c0faacb1) reverted, and it
> does indeed restore the behaviour where sd* order appears to be reliable.
> Scan time (time until systemd starts) is within 4ms across 3 boots with
> and without the revert, but this is just our particular case.
>
> I don't fully understand the scan process here, but I can understand the
> challenges in trying to parallelize it and still end up with a consistent
> enumerated list.

Even if parallel disk scanning at boot is disabled to get drive names that
consistently follow some port or LUN order, any run-time event that causes a
drive to "go away" and come back (e.g. a topology change event) can result in
the drive name changing. The order itself depends on the LLD code too. A driver
change can result in a different probe order, and thus in different names. The
same applies if, say, you create or delete LUNs on a RAID system: when doing it,
you will get some drive names, but after a reboot and scan, the LUNs may be
presented with different names.

/dev/sdX names are simply not reliable. For consistent, reliable drive
configurations, applications must use the /dev/disk/by-*/* IDs.

>
> I guess you would agree that removing sd* entirely would not be an option
> because they've existed forever historically, but at the same time, the
> only time they really "work" now is as symlink targets for by-*, and in
> the case where only one disk exists at boot time. Do I have this right?
>
> Simon-
>
On 11/11/21 2:01 AM, Simon Kirby wrote:
> On Sun, Nov 07, 2021 at 11:51:45AM -0800, Bart Van Assche wrote:
>
>> On 11/6/21 19:24, Simon Kirby wrote:
>>> This occurs regardless of the CONFIG_SCSI_SCAN_ASYNC setting, and
>>> also with scsi_mod.scan=sync on vendor kernels. All of these disks
>>> are coming from the same driver and card.
>>>
>>> I understand that using UUIDs, by-id, etc., is an option to work
>>> around this, but then we would have to push IDs for disks in every
>>> server to our configuration management. It does not seem that this
>>> change is really intentional.
>>
>> SCSI disk detection has been asynchronous on purpose for a long time. The most
>> recent commit I know of that changed SCSI disk scanning
>> behavior is commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for
>> asynchronous probing").
>>
>> Please use one of the /dev/disk/by-*/* identifiers as Damien requested.
>
> Hi Bart,
>
> So, we're using DRBD on top of these, which means by-uuid is not
> available; we can use only by-id and by-path. by-id is dependent on disk
> models and serial numbers, and by-path is dependent on PCI bus details.
> Both are going to be a good deal more work to maintain, since they're
> both not just a simple enumeration.
>

Why is by-uuid not available?
The uuid is the disk-internal unique identification, and to my
knowledge all recent SCSI and SATA drives implement it.
So where is the problem here?

Cheers,

Hannes
Hannes Reinecke <hare@suse.de> writes:

> Why is by-uuid not available?
> The uuid is the disk-internal unique identification, and to my
> knowledge all recent SCSI and SATA drives implement it.
> So where is the problem here?

It is probably just an oversight in the udev rules that create the
by-uuid links.
On 11/12/21 1:11 AM, Phillip Susi wrote:
>
> Hannes Reinecke <hare@suse.de> writes:
>
>> Why is by-uuid not available?
>> The uuid is the disk-internal unique identification, and to my
>> knowledge all recent SCSI and SATA drives implement it.
>> So where is the problem here?
>
> It is probably just an oversight in the udev rules that create the
> by-uuid links.
>

So shouldn't we rather fix this?
Putting in some udev rules is always easier than trying to 'fix' things in the kernel.

Cheers,

Hannes