Message ID | 1399671097-32015-1-git-send-email-sage@inktank.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote: > The Ceph OSD initialization relies on identifying GPT partitions by type > in order to mount data volumes and start daemons. Currently we ship this > rule separately, but it is awkward to duplicate the conditional logic that > precedes this block and it would be much simpler if it were simply included > in the upstream rules. Types are by definition not unique. The symlinks in /dev/disk/by-*/ are *expected* to be unique. We handle duplicated labels, but they are specified by humans, multiple partitions with the same GPT types are just normal expected behavior; and they would have no order or priority, they just overwrite each other depending on probing order. We should not add such things, the logic to find these volumes at bootup are better handled by a specific program like systemd's systemd-gpt-auto-generator, without putting unreliable and unpredictable content into /dev. http://www.freedesktop.org/software/systemd/man/systemd-gpt-auto-generator.html http://www.freedesktop.org/wiki/Specifications/DiscoverablePartitionsSpec/ Thanks, Kay -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Kay, On Fri, 9 May 2014, Kay Sievers wrote: > On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote: > > The Ceph OSD initialization relies on identifying GPT partitions by type > > in order to mount data volumes and start daemons. Currently we ship this > > rule separately, but it is awkward to duplicate the conditional logic that > > precedes this block and it would be much simpler if it were simply included > > in the upstream rules. > > Types are by definition not unique. The symlinks in /dev/disk/by-*/ > are *expected* to be unique. > > We handle duplicated labels, but they are specified by humans, > multiple partitions with the same GPT types are just normal expected > behavior; and they would have no order or priority, they just > overwrite each other depending on probing order. This is why the label has both the type (fixed, to identify this as a ceph partition) and the label (random): /dev/disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID} > We should not add such things, the logic to find these volumes at > bootup are better handled by a specific program like systemd's > systemd-gpt-auto-generator, without putting unreliable and > unpredictable content into /dev. I think this is what we're trying to accomplish with the ceph-disk tool, which relies on these (reliable and predictable) symlinks. The labels alone (by-partuuid) aren't sufficient since we want to be able to scan for partitions of a given type without re-running blkid on every volume. Right now sysvinit, upstart, and udev may trigger ceph-disk to activate volumes. Hopefully we'll have native systemd support sometime soon. But first I want to make sure the approach has been validated, and hopefully avoid shipping custom rules... Thanks! sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, May 10, 2014 at 12:00 AM, Sage Weil <sage@inktank.com> wrote: > On Fri, 9 May 2014, Kay Sievers wrote: >> On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote: >> > The Ceph OSD initialization relies on identifying GPT partitions by type >> > in order to mount data volumes and start daemons. Currently we ship this >> > rule separately, but it is awkward to duplicate the conditional logic that >> > precedes this block and it would be much simpler if it were simply included >> > in the upstream rules. >> >> Types are by definition not unique. The symlinks in /dev/disk/by-*/ >> are *expected* to be unique. >> >> We handle duplicated labels, but they are specified by humans, >> multiple partitions with the same GPT types are just normal expected >> behavior; and they would have no order or priority, they just >> overwrite each other depending on probing order. > > This is why the label has both the type (fixed, to identify this as a ceph > partition) and the label (random): > > /dev/disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID} > >> We should not add such things, the logic to find these volumes at >> bootup are better handled by a specific program like systemd's >> systemd-gpt-auto-generator, without putting unreliable and >> unpredictable content into /dev. > > I think this is what we're trying to accomplish with the ceph-disk tool, > which relies on these (reliable and predictable) symlinks. The labels > alone (by-partuuid) aren't sufficient since we want to be able to scan for > partitions of a given type without re-running blkid on every volume. /dev is an API which should by default not contain custom links which are not generally useful, and these links are not useful for other tools. These links are not even recognizable by type without doing readdir() over it and string match operations to find the types, we really should not add such stuff to the default rules set. We have to be careful here, it seems like the wrong approach to put that in the public visible /dev API. Tools can get all this information programatically out of the udev database, there is no create symlinks or to run blkid. Ultimately, there is nothing wrong with tools shipping their own rules, but please do not use generic names like /dev/disk/by-parttypeuuid/. The name does even sound misleading because it combines two different things in one name, with a '.' as a separator. Why don't you just use a leading directory for the type instead of stuffing that into one name? Kay -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 10 May 2014, Kay Sievers wrote: > On Sat, May 10, 2014 at 12:00 AM, Sage Weil <sage@inktank.com> wrote: > > On Fri, 9 May 2014, Kay Sievers wrote: > >> On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote: > >> > The Ceph OSD initialization relies on identifying GPT partitions by type > >> > in order to mount data volumes and start daemons. Currently we ship this > >> > rule separately, but it is awkward to duplicate the conditional logic that > >> > precedes this block and it would be much simpler if it were simply included > >> > in the upstream rules. > >> > >> Types are by definition not unique. The symlinks in /dev/disk/by-*/ > >> are *expected* to be unique. > >> > >> We handle duplicated labels, but they are specified by humans, > >> multiple partitions with the same GPT types are just normal expected > >> behavior; and they would have no order or priority, they just > >> overwrite each other depending on probing order. > > > > This is why the label has both the type (fixed, to identify this as a ceph > > partition) and the label (random): > > > > /dev/disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID} > > > >> We should not add such things, the logic to find these volumes at > >> bootup are better handled by a specific program like systemd's > >> systemd-gpt-auto-generator, without putting unreliable and > >> unpredictable content into /dev. > > > > I think this is what we're trying to accomplish with the ceph-disk tool, > > which relies on these (reliable and predictable) symlinks. The labels > > alone (by-partuuid) aren't sufficient since we want to be able to scan for > > partitions of a given type without re-running blkid on every volume. > > /dev is an API which should by default not contain custom links which > are not generally useful, and these links are not useful for other > tools. FWIW I was surprised that there wasn't already a way to find partitions by type in /dev, but I assume you know better than I how other tools are using udev. It seems at least as useful as by-partuuid to me. > These links are not even recognizable by type without doing readdir() > over it and string match operations to find the types, we really > should not add such stuff to the default rules set. We have to be > careful here, it seems like the wrong approach to put that in the > public visible /dev API. > > Tools can get all this information programatically out of the udev > database, there is no create symlinks or to run blkid. I just looked up libudev and it looks like there is even a pyudev wrapper, so that could indeed work better. I take it that queries via udev_enumerate for (say) ID_PART_ENTRY_TYPE=x are efficient? > Ultimately, there is nothing wrong with tools shipping their own > rules, but please do not use generic names like > /dev/disk/by-parttypeuuid/. The name does even sound misleading > because it combines two different things in one name, with a '.' as a > separator. > > Why don't you just use a leading directory for the type instead of > stuffing that into one name? I'd happily replace the . with /. I suspect I wasn't sure at the time if nested directories were handled properly. Anyway, libudev + pyudev looks like a better solution for ceph-disk. Thanks! sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, May 10, 2014 at 12:34 AM, Sage Weil <sage@inktank.com> wrote: > On Sat, 10 May 2014, Kay Sievers wrote: >> On Sat, May 10, 2014 at 12:00 AM, Sage Weil <sage@inktank.com> wrote: >> > On Fri, 9 May 2014, Kay Sievers wrote: >> >> On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote: >> >> > The Ceph OSD initialization relies on identifying GPT partitions by type >> >> > in order to mount data volumes and start daemons. Currently we ship this >> >> > rule separately, but it is awkward to duplicate the conditional logic that >> >> > precedes this block and it would be much simpler if it were simply included >> >> > in the upstream rules. >> >> >> >> Types are by definition not unique. The symlinks in /dev/disk/by-*/ >> >> are *expected* to be unique. >> >> >> >> We handle duplicated labels, but they are specified by humans, >> >> multiple partitions with the same GPT types are just normal expected >> >> behavior; and they would have no order or priority, they just >> >> overwrite each other depending on probing order. >> > >> > This is why the label has both the type (fixed, to identify this as a ceph >> > partition) and the label (random): >> > >> > /dev/disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID} >> > >> >> We should not add such things, the logic to find these volumes at >> >> bootup are better handled by a specific program like systemd's >> >> systemd-gpt-auto-generator, without putting unreliable and >> >> unpredictable content into /dev. >> > >> > I think this is what we're trying to accomplish with the ceph-disk tool, >> > which relies on these (reliable and predictable) symlinks. The labels >> > alone (by-partuuid) aren't sufficient since we want to be able to scan for >> > partitions of a given type without re-running blkid on every volume. >> >> /dev is an API which should by default not contain custom links which >> are not generally useful, and these links are not useful for other >> tools. > > FWIW I was surprised that there wasn't already a way to find partitions by > type in /dev, but I assume you know better than I how other tools are > using udev. It seems at least as useful as by-partuuid to me. > >> These links are not even recognizable by type without doing readdir() >> over it and string match operations to find the types, we really >> should not add such stuff to the default rules set. We have to be >> careful here, it seems like the wrong approach to put that in the >> public visible /dev API. >> >> Tools can get all this information programatically out of the udev >> database, there is no create symlinks or to run blkid. > > I just looked up libudev and it looks like there is even a pyudev wrapper, > so that could indeed work better. I take it that queries via > udev_enumerate for (say) ID_PART_ENTRY_TYPE=x are efficient? Sure, filter for "block" devices and this or other GPT properties. The libudev API will just find the devices is /sys and read the database files in tmpfs /run and, it will not talk to any devices, so it should perform pretty well. Kay -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, May 12, 2014 at 05:18:56PM +0200, Kay Sievers wrote: > > I just looked up libudev and it looks like there is even a pyudev wrapper, > > so that could indeed work better. I take it that queries via > > udev_enumerate for (say) ID_PART_ENTRY_TYPE=x are efficient? > > Sure, filter for "block" devices and this or other GPT properties. The > libudev API will just find the devices is /sys and read the database > files in tmpfs /run and, it will not talk to any devices, so it should > perform pretty well. and on command line lsblk(8) reads the info from udev db, so you can use: $ lsblk -o+PARTTYPE NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT PARTTYPE sda 8:0 0 149.1G 0 disk ??sda1 8:1 0 1000M 0 part /boot c12a7328-f81f-11d2-ba4b-00a0c93ec93b ??sda2 8:2 0 2G 0 part ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 ??sda3 8:3 0 9.7G 0 part [SWAP] 0657fd6d-a4ab-43c4-84e5-0933c84b4f4f ??sda4 8:4 0 34.2G 0 part / ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 ??sda5 8:5 0 63.2G 0 part /home ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 ??sda6 8:6 0 39.1G 0 part ebd0a0a2-b9e5-4433-87c0-68b6b72699c7 (well, this is from git tree, will be in v2.25 :-) Karel
diff --git a/rules/60-persistent-storage.rules b/rules/60-persistent-storage.rules index 475b151..42f5edb 100644 --- a/rules/60-persistent-storage.rules +++ b/rules/60-persistent-storage.rules @@ -80,9 +80,10 @@ ENV{ID_FS_USAGE}=="filesystem|other", ENV{ID_FS_LABEL_ENC}=="?*", SYMLINK+="disk ENV{DEVTYPE}=="disk", ENV{ID_WWN_WITH_EXTENSION}=="?*", SYMLINK+="disk/by-id/wwn-$env{ID_WWN_WITH_EXTENSION}" ENV{DEVTYPE}=="partition", ENV{ID_WWN_WITH_EXTENSION}=="?*", SYMLINK+="disk/by-id/wwn-$env{ID_WWN_WITH_EXTENSION}-part%n" -# by-partlabel/by-partuuid links (partition metadata) +# by-partlabel/by-partuuid/by-parttypeuuid links (partition metadata) ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_UUID}=="?*", SYMLINK+="disk/by-partuuid/$env{ID_PART_ENTRY_UUID}" ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_NAME}=="?*", SYMLINK+="disk/by-partlabel/$env{ID_PART_ENTRY_NAME}" +ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_TYPE}=="?*", ENV{ID_PART_ENTRY_UUID}=="?*", SYMLINK+="disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}" # add symlink to GPT root disk ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_GPT_AUTO_ROOT}=="1", SYMLINK+="gpt-auto-root"
The Ceph OSD initialization relies on identifying GPT partitions by type in order to mount data volumes and start daemons. Currently we ship this rule separately, but it is awkward to duplicate the conditional logic that precedes this block and it would be much simpler if it were simply included in the upstream rules. Signed-off-by: Sage Weil <sage@inktank.com> --- rules/60-persistent-storage.rules | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)