rules: add by-parttypeuuid rule for GPT labeled partitions

Message ID	1399671097-32015-1-git-send-email-sage@inktank.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <ceph-devel-owner@kernel.org> From: Sage Weil <sage@inktank.com> To: systemd-devel@lists.freedesktop.org Cc: ceph-devel@vger.kernel.org, Sage Weil <sage@inktank.com> Subject: [PATCH] rules: add by-parttypeuuid rule for GPT labeled partitions Date: Fri, 9 May 2014 14:31:37 -0700 Message-Id: <1399671097-32015-1-git-send-email-sage@inktank.com> Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk

Sage Weil May 9, 2014, 9:31 p.m. UTC

The Ceph OSD initialization relies on identifying GPT partitions by type
in order to mount data volumes and start daemons.  Currently we ship this
rule separately, but it is awkward to duplicate the conditional logic that
precedes this block and it would be much simpler if it were simply included
in the upstream rules.

Signed-off-by: Sage Weil <sage@inktank.com>
---
 rules/60-persistent-storage.rules |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Kay Sievers May 9, 2014, 9:51 p.m. UTC | #1

On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote:
> The Ceph OSD initialization relies on identifying GPT partitions by type
> in order to mount data volumes and start daemons.  Currently we ship this
> rule separately, but it is awkward to duplicate the conditional logic that
> precedes this block and it would be much simpler if it were simply included
> in the upstream rules.

Types are by definition not unique. The symlinks in /dev/disk/by-*/
are *expected* to be unique.

We handle duplicated labels, but they are specified by humans,
multiple partitions with the same GPT types are just normal expected
behavior; and they would have no order or priority, they just
overwrite each other depending on probing order.

We should not add such things, the logic to find these volumes at
bootup are better handled by a specific program like systemd's
systemd-gpt-auto-generator, without putting unreliable and
unpredictable content into /dev.

  http://www.freedesktop.org/software/systemd/man/systemd-gpt-auto-generator.html
  http://www.freedesktop.org/wiki/Specifications/DiscoverablePartitionsSpec/

Thanks,
Kay
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil May 9, 2014, 10 p.m. UTC | #2

Hi Kay,

On Fri, 9 May 2014, Kay Sievers wrote:
> On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote:
> > The Ceph OSD initialization relies on identifying GPT partitions by type
> > in order to mount data volumes and start daemons.  Currently we ship this
> > rule separately, but it is awkward to duplicate the conditional logic that
> > precedes this block and it would be much simpler if it were simply included
> > in the upstream rules.
> 
> Types are by definition not unique. The symlinks in /dev/disk/by-*/
> are *expected* to be unique.
> 
> We handle duplicated labels, but they are specified by humans,
> multiple partitions with the same GPT types are just normal expected
> behavior; and they would have no order or priority, they just
> overwrite each other depending on probing order.

This is why the label has both the type (fixed, to identify this as a ceph 
partition) and the label (random):

 /dev/disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}

> We should not add such things, the logic to find these volumes at
> bootup are better handled by a specific program like systemd's
> systemd-gpt-auto-generator, without putting unreliable and
> unpredictable content into /dev.

I think this is what we're trying to accomplish with the ceph-disk tool, 
which relies on these (reliable and predictable) symlinks.  The labels 
alone (by-partuuid) aren't sufficient since we want to be able to scan for 
partitions of a given type without re-running blkid on every volume.

Right now sysvinit, upstart, and udev may trigger ceph-disk to activate 
volumes. Hopefully we'll have native systemd support sometime soon.  But 
first I want to make sure the approach has been validated, and hopefully 
avoid shipping custom rules...

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Kay Sievers May 9, 2014, 10:17 p.m. UTC | #3

On Sat, May 10, 2014 at 12:00 AM, Sage Weil <sage@inktank.com> wrote:
> On Fri, 9 May 2014, Kay Sievers wrote:
>> On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote:
>> > The Ceph OSD initialization relies on identifying GPT partitions by type
>> > in order to mount data volumes and start daemons.  Currently we ship this
>> > rule separately, but it is awkward to duplicate the conditional logic that
>> > precedes this block and it would be much simpler if it were simply included
>> > in the upstream rules.
>>
>> Types are by definition not unique. The symlinks in /dev/disk/by-*/
>> are *expected* to be unique.
>>
>> We handle duplicated labels, but they are specified by humans,
>> multiple partitions with the same GPT types are just normal expected
>> behavior; and they would have no order or priority, they just
>> overwrite each other depending on probing order.
>
> This is why the label has both the type (fixed, to identify this as a ceph
> partition) and the label (random):
>
>  /dev/disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}
>
>> We should not add such things, the logic to find these volumes at
>> bootup are better handled by a specific program like systemd's
>> systemd-gpt-auto-generator, without putting unreliable and
>> unpredictable content into /dev.
>
> I think this is what we're trying to accomplish with the ceph-disk tool,
> which relies on these (reliable and predictable) symlinks.  The labels
> alone (by-partuuid) aren't sufficient since we want to be able to scan for
> partitions of a given type without re-running blkid on every volume.

/dev is an API which should by default not contain custom links which
are not generally useful, and these links are not useful for other
tools.

These links are not even recognizable by type without doing readdir()
over it and string match operations to find the types, we really
should not add such stuff to the default rules set. We have to be
careful here, it seems like the wrong approach to put that in the
public visible /dev API.

Tools can get all this information programatically out of the udev
database, there is no create symlinks or to run blkid.

Ultimately, there is nothing wrong with tools shipping their own
rules, but please do not use generic names like
/dev/disk/by-parttypeuuid/. The name does even sound misleading
because it combines two different things in one name, with a '.' as a
separator.

Why don't you just use a leading directory for the type instead of
stuffing that into one name?

Kay
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sage Weil May 9, 2014, 10:34 p.m. UTC | #4

On Sat, 10 May 2014, Kay Sievers wrote:
> On Sat, May 10, 2014 at 12:00 AM, Sage Weil <sage@inktank.com> wrote:
> > On Fri, 9 May 2014, Kay Sievers wrote:
> >> On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote:
> >> > The Ceph OSD initialization relies on identifying GPT partitions by type
> >> > in order to mount data volumes and start daemons.  Currently we ship this
> >> > rule separately, but it is awkward to duplicate the conditional logic that
> >> > precedes this block and it would be much simpler if it were simply included
> >> > in the upstream rules.
> >>
> >> Types are by definition not unique. The symlinks in /dev/disk/by-*/
> >> are *expected* to be unique.
> >>
> >> We handle duplicated labels, but they are specified by humans,
> >> multiple partitions with the same GPT types are just normal expected
> >> behavior; and they would have no order or priority, they just
> >> overwrite each other depending on probing order.
> >
> > This is why the label has both the type (fixed, to identify this as a ceph
> > partition) and the label (random):
> >
> >  /dev/disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}
> >
> >> We should not add such things, the logic to find these volumes at
> >> bootup are better handled by a specific program like systemd's
> >> systemd-gpt-auto-generator, without putting unreliable and
> >> unpredictable content into /dev.
> >
> > I think this is what we're trying to accomplish with the ceph-disk tool,
> > which relies on these (reliable and predictable) symlinks.  The labels
> > alone (by-partuuid) aren't sufficient since we want to be able to scan for
> > partitions of a given type without re-running blkid on every volume.
> 
> /dev is an API which should by default not contain custom links which
> are not generally useful, and these links are not useful for other
> tools.

FWIW I was surprised that there wasn't already a way to find partitions by 
type in /dev, but I assume you know better than I how other tools are 
using udev.  It seems at least as useful as by-partuuid to me.

> These links are not even recognizable by type without doing readdir()
> over it and string match operations to find the types, we really
> should not add such stuff to the default rules set. We have to be
> careful here, it seems like the wrong approach to put that in the
> public visible /dev API.
> 
> Tools can get all this information programatically out of the udev
> database, there is no create symlinks or to run blkid.

I just looked up libudev and it looks like there is even a pyudev wrapper, 
so that could indeed work better.  I take it that queries via 
udev_enumerate for (say) ID_PART_ENTRY_TYPE=x are efficient?

> Ultimately, there is nothing wrong with tools shipping their own
> rules, but please do not use generic names like
> /dev/disk/by-parttypeuuid/. The name does even sound misleading
> because it combines two different things in one name, with a '.' as a
> separator.
> 
> Why don't you just use a leading directory for the type instead of
> stuffing that into one name?

I'd happily replace the . with /.  I suspect I wasn't sure at the time if 
nested directories were handled properly.

Anyway, libudev + pyudev looks like a better solution for ceph-disk.  

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Kay Sievers May 12, 2014, 3:18 p.m. UTC | #5

On Sat, May 10, 2014 at 12:34 AM, Sage Weil <sage@inktank.com> wrote:
> On Sat, 10 May 2014, Kay Sievers wrote:
>> On Sat, May 10, 2014 at 12:00 AM, Sage Weil <sage@inktank.com> wrote:
>> > On Fri, 9 May 2014, Kay Sievers wrote:
>> >> On Fri, May 9, 2014 at 11:31 PM, Sage Weil <sage@inktank.com> wrote:
>> >> > The Ceph OSD initialization relies on identifying GPT partitions by type
>> >> > in order to mount data volumes and start daemons.  Currently we ship this
>> >> > rule separately, but it is awkward to duplicate the conditional logic that
>> >> > precedes this block and it would be much simpler if it were simply included
>> >> > in the upstream rules.
>> >>
>> >> Types are by definition not unique. The symlinks in /dev/disk/by-*/
>> >> are *expected* to be unique.
>> >>
>> >> We handle duplicated labels, but they are specified by humans,
>> >> multiple partitions with the same GPT types are just normal expected
>> >> behavior; and they would have no order or priority, they just
>> >> overwrite each other depending on probing order.
>> >
>> > This is why the label has both the type (fixed, to identify this as a ceph
>> > partition) and the label (random):
>> >
>> >  /dev/disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}
>> >
>> >> We should not add such things, the logic to find these volumes at
>> >> bootup are better handled by a specific program like systemd's
>> >> systemd-gpt-auto-generator, without putting unreliable and
>> >> unpredictable content into /dev.
>> >
>> > I think this is what we're trying to accomplish with the ceph-disk tool,
>> > which relies on these (reliable and predictable) symlinks.  The labels
>> > alone (by-partuuid) aren't sufficient since we want to be able to scan for
>> > partitions of a given type without re-running blkid on every volume.
>>
>> /dev is an API which should by default not contain custom links which
>> are not generally useful, and these links are not useful for other
>> tools.
>
> FWIW I was surprised that there wasn't already a way to find partitions by
> type in /dev, but I assume you know better than I how other tools are
> using udev.  It seems at least as useful as by-partuuid to me.
>
>> These links are not even recognizable by type without doing readdir()
>> over it and string match operations to find the types, we really
>> should not add such stuff to the default rules set. We have to be
>> careful here, it seems like the wrong approach to put that in the
>> public visible /dev API.
>>
>> Tools can get all this information programatically out of the udev
>> database, there is no create symlinks or to run blkid.
>
> I just looked up libudev and it looks like there is even a pyudev wrapper,
> so that could indeed work better.  I take it that queries via
> udev_enumerate for (say) ID_PART_ENTRY_TYPE=x are efficient?

Sure, filter for "block" devices and this or other GPT properties. The
libudev API will just find the devices is /sys and read the database
files in tmpfs /run and, it will not talk to any devices, so it should
perform pretty well.

Kay
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Karel Zak May 13, 2014, 9:53 a.m. UTC | #6

On Mon, May 12, 2014 at 05:18:56PM +0200, Kay Sievers wrote:
> > I just looked up libudev and it looks like there is even a pyudev wrapper,
> > so that could indeed work better.  I take it that queries via
> > udev_enumerate for (say) ID_PART_ENTRY_TYPE=x are efficient?
> 
> Sure, filter for "block" devices and this or other GPT properties. The
> libudev API will just find the devices is /sys and read the database
> files in tmpfs /run and, it will not talk to any devices, so it should
> perform pretty well.

 and on command line lsblk(8) reads the info from udev db, so you can
 use:

 $ lsblk -o+PARTTYPE
 NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT PARTTYPE
 sda      8:0    0 149.1G  0 disk
 ??sda1   8:1    0  1000M  0 part  /boot      c12a7328-f81f-11d2-ba4b-00a0c93ec93b
 ??sda2   8:2    0     2G  0 part             ebd0a0a2-b9e5-4433-87c0-68b6b72699c7
 ??sda3   8:3    0   9.7G  0 part  [SWAP]     0657fd6d-a4ab-43c4-84e5-0933c84b4f4f
 ??sda4   8:4    0  34.2G  0 part  /          ebd0a0a2-b9e5-4433-87c0-68b6b72699c7
 ??sda5   8:5    0  63.2G  0 part  /home      ebd0a0a2-b9e5-4433-87c0-68b6b72699c7
 ??sda6   8:6    0  39.1G  0 part             ebd0a0a2-b9e5-4433-87c0-68b6b72699c7


(well, this is from git tree, will be in v2.25 :-)

    Karel

rules: add by-parttypeuuid rule for GPT labeled partitions

Commit Message

Comments

Patch