
dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

Message ID 20180614001147.1545-1-mcgrof@kernel.org (mailing list archive)
State New, archived

Commit Message

Luis Chamberlain June 14, 2018, 12:11 a.m. UTC
Setting up a zoned disk in a generic way is not so trivial. There
is also quite a bit of tribal knowledge around these devices which is not
easy to find.

The currently supplied demo script works, but it is not generic enough to be
practical for Linux distributions or even for developers who often move
from one kernel to another.

This tries to put a bit of this tribal knowledge into an initial udev
rule for development, with the hope that Linux distributions can later
deploy it. Three rules are added. One rule is optional for now; it should be
extended later to be more distribution-friendly, and then I think this
may be ready for consideration for integration in distributions.

1) scheduler setup
2) blacklist f2fs devices
3) run dmsetup for the rest of the devices

Note that this udev rule will not work well if you want to use a disk
with f2fs on one part of the disk and another filesystem on another part of
the disk. That setup will require manual love, though such setups can reuse
the same blacklist in rule 2).

It's not widely known, for instance, that as of v4.16 it is mandated to use
either the deadline or the mq-deadline scheduler for *all* SMR drives. It has
also been determined that the Linux kernel is not the place to set this up,
so a udev rule *is required* as per the latest discussions. This is the
first rule we add.

Furthermore, if you are *not* using f2fs you always have to run dmsetup.
dmsetup mappings do not persist, so you currently *always* have to run a
custom script of some sort, which is not ideal for Linux distributions. We
can invert this logic into a udev rule and let users blacklist the disks
they know they want to use f2fs on. This is the second, optional rule. This
blacklisting can be generalized further in the future with an exception
list file, for instance using IMPORT{db} or the like.

The third and final rule then runs dmsetup for the rest of the disks,
using the disk serial number for the new device-mapper name.

Note that it is currently easy for users to make the mistake of running mkfs
on the original disk rather than on the /dev/mapper/ device in non-f2fs
arrangements. If that is done, experience shows things can easily fall
apart with alignment *eventually*. We have no generic way today to
error out on this condition and proactively prevent it.
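For reference, the manual flow the rules automate looks roughly like this.
This is only a sketch: the device node, serial and filesystem below are
placeholders, and the helper just builds the same '0 <sectors> zoned <dev>'
table line the dmsetup rule generates:

```shell
# Build the dm-zoned table line the udev rule passes to dmsetup:
# "0 <device size in 512-byte sectors> zoned <device node>"
build_zoned_table() {
    size_sectors="$1"
    devnode="$2"
    printf '0 %s zoned %s' "$size_sectors" "$devnode"
}

# Manual setup on a real host-managed disk (run as root; /dev/sdX and
# $SERIAL are hypothetical):
#   dmzadm --format /dev/sdX
#   dmsetup create "zoned-$SERIAL" \
#       --table "$(build_zoned_table "$(blockdev --getsz /dev/sdX)" /dev/sdX)"
#   mkfs.xfs /dev/mapper/zoned-$SERIAL   # mkfs the mapper node, NOT /dev/sdX
```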

Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
---
 README                    | 10 +++++-
 udev/99-zoned-disks.rules | 78 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+), 1 deletion(-)
 create mode 100644 udev/99-zoned-disks.rules

Comments

Damien Le Moal June 14, 2018, 10:01 a.m. UTC | #1
On 6/14/18 09:11, Luis R. Rodriguez wrote:
> Setting up a zoned disks in a generic form is not so trivial. There
> is also quite a bit of tribal knowledge with these devices which is not
> easy to find.
> 
> The currently supplied demo script works but it is not generic enough to be
> practical for Linux distributions or even developers which often move
> from one kernel to another.
> 
> This tries to put a bit of this tribal knowledge into an initial udev
> rule for development with the hopes Linux distributions can later
> deploy. Three rule are added. One rule is optional for now, it should be
> extended later to be more distribution-friendly and then I think this
> may be ready for consideration for integration on distributions.
> 
> 1) scheduler setup
> 2) backlist f2fs devices
> 3) run dmsetup for the rest of devices
> 
> Note that this udev rule will not work well if you want to use a disk
> with f2fs on part of the disk and another filesystem on another part of
> the disk. That setup will require manual love so these setups can use
> the same backlist on rule 2).
> 
> Its not widely known for instance that as of v4.16 it is mandated to use
> either deadline or the mq-deadline scheduler for *all* SMR drivers. Its
> also been determined that the Linux kernel is not the place to set this up,
> so a udev rule *is required* as per latest discussions. This is the
> first rule we add.
> 
> Furthermore if you are *not* using f2fs you always have to run dmsetup.
> dmsetups do not persist, so you currently *always* have to run a custom
> sort of script, which is not ideal for Linux distributions. We can invert
> this logic into a udev rule to enable users to blacklist disks they know they
> want to use f2fs for. This the second optional rule. This blacklisting
> can be generalized further in the future with an exception list file, for
> instance using INPUT{db} or the like.
> 
> The third and final rule added then runs dmsetup for the rest of the disks
> using the disk serial number for the new device mapper name.
> 
> Note that it is currently easy for users to make a mistake and run mkfs
> on the the original disk, not the /dev/mapper/ device for non f2fs
> arrangements. If that is done experience shows things can easily fall
> apart with alignment *eventually*. We have no generic way today to
> error out on this condition and proactively prevent this.
> 
> Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
> ---
>  README                    | 10 +++++-
>  udev/99-zoned-disks.rules | 78 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 87 insertions(+), 1 deletion(-)
>  create mode 100644 udev/99-zoned-disks.rules
> 
> diff --git a/README b/README
> index 65e96c34fd04..f49541eaabc8 100644
> --- a/README
> +++ b/README
> @@ -168,7 +168,15 @@ Options:
>                       reclaiming random zones if the percentage of
>                       free random data zones falls below <perc>.
>  
> -V. Example scripts
> +V. Udev zone disk deployment
> +============================
> +
> +A udev rule is provided which enables you to set the IO scheduler, blacklist
> +driver to run dmsetup, and runs dmsetup for the rest of the zone drivers.
> +If you use this udev rule the below script is not needed. Be sure to mkfs only
> +on the resulting /dev/mapper/zone-$serial device you end up with.
> +
> +VI. Example scripts
>  ==================
>  
>  [[
> diff --git a/udev/99-zoned-disks.rules b/udev/99-zoned-disks.rules
> new file mode 100644
> index 000000000000..e19b738dcc0e
> --- /dev/null
> +++ b/udev/99-zoned-disks.rules
> @@ -0,0 +1,78 @@
> +# To use a zone disks first thing you need to:
> +#
> +# 1) Enable zone disk support in your kernel
> +# 2) Use the deadline or mq-deadline scheduler for it - mandated as of v4.16
> +# 3) Blacklist devices dedicated for f2fs as of v4.10
> +# 4) Run dmsetup other disks
> +# 5) Create the filesystem -- NOTE: use mkfs /dev/mapper/zone-serial if
> +#    you enabled use dmsetup on the disk.
> +# 6) Consider using nofail mount option in case you run an supported kernel
> +#
> +# You can use this udev rules file for 2) 3) and 4). Further details below.
> +#
> +# 1) Enable zone disk support in your kernel
> +#
> +#    o CONFIG_BLK_DEV_ZONED
> +#    o CONFIG_DM_ZONED
> +#
> +# This will let the kernel actually see these devices, ie, via fdisk /dev/sda
> +# for instance. Run:
> +#
> +# 	dmzadm --format /dev/sda
> +
> +# 2) Set deadline or mq-deadline for all disks which are zoned
> +#
> +# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
> +# mandated for all SMR drives since v4.16. It has been determined this must be
> +# done through a udev rule, and the kernel should not set this up for disks.
> +# This magic will have to live for *all* zoned disks.
> +# XXX: what about distributions that want mq-deadline ? Probably easy for now
> +#      to assume deadline and later have a mapping file to enable
> +#      mq-deadline for specific serial devices?
> +ACTION=="add|change", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
> +	ATTR{queue/scheduler}="deadline"
> +
> +# 3) Blacklist f2fs devices as of v4.10
> +# We don't have to run dmsetup on on disks where you want to use f2fs, so you
> +# can use this rule to skip dmsetup for it. First get the serial short number.
> +#
> +#	udevadm info --name=/dev/sda  | grep -i serial_shor
> +# XXX: To generalize this for distributions consider using INPUT{db} to or so
> +# and then use that to check if the serial number matches one on the database.
> +#ACTION=="add", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="XXA1ZFFF", GOTO="zone_disk_group_end"
> +
> +# 4) We need to run dmsetup if you want to use other filesystems
> +#
> +# dmsetup is not persistent, so it needs to be run on upon every boot.  We use
> +# the device serial number for the /dev/mapper/ name.
> +ACTION=="add", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
> +	RUN+="/sbin/dmsetup create zoned-$env{ID_SERIAL_SHORT} --table '0 %s{size} zoned $devnode'", $attr{size}
> +
> +# 4) Create a filesystem for the device
> +#
> +# Be 100% sure you use /dev/mapper/zone-$YOUR_DEVICE_SERIAL for the mkfs
> +# command as otherwise things can break.
> +#
> +# XXX: preventing the above proactively in the kernel would be ideal however
> +# this may be hard.
> +#
> +# Once you create the filesystem it will get a UUID.
> +#
> +# Find out what UUID is, you can do this for instance if your zoned disk is
> +# your second device-mapper device, ie dm-1 by:
> +#
> +# 	ls -l /dev/disk/by-uuid/dm-1
> +#
> +# To figure out which dm-$number it is, use dmsetup info, the minor number
> +# is the $number.
> +#
> +# 5) Add an etry in /etc/fstab with nofail for example:
> +#
> +# UUID=99999999-aaaa-bbbb-c1234aaaabbb33456 /media/monster xfs nofail 0 0
> +#
> +# nofail will ensure system boots fine even if you boot into a kernel which
> +# lacks support for the device and so it is not found. Since the UUID will
> +# always match the device we don't care if the device moves around the bus
> +# on the system. We just need to get the UUID once.
> +
> +LABEL="zone_disk_group_end"

Applied. Thanks Luis !

-- 
Damien Le Moal,
Western Digital
Mike Snitzer June 14, 2018, 12:38 p.m. UTC | #2
On Wed, Jun 13 2018 at  8:11pm -0400,
Luis R. Rodriguez <mcgrof@kernel.org> wrote:

> Setting up a zoned disks in a generic form is not so trivial. There
> is also quite a bit of tribal knowledge with these devices which is not
> easy to find.
> 
> The currently supplied demo script works but it is not generic enough to be
> practical for Linux distributions or even developers which often move
> from one kernel to another.
> 
> This tries to put a bit of this tribal knowledge into an initial udev
> rule for development with the hopes Linux distributions can later
> deploy. Three rule are added. One rule is optional for now, it should be
> extended later to be more distribution-friendly and then I think this
> may be ready for consideration for integration on distributions.
> 
> 1) scheduler setup

This is wrong.. if zoned devices are so dependent on deadline or
mq-deadline then the kernel should allow them to be hardcoded.  I know
Jens removed the API to do so but the fact that drivers need to rely on
hacks like this udev rule to get a functional device is proof we need to
allow drivers to impose the scheduler used.

> 2) backlist f2fs devices

There should probably be support in dm-zoned for detecting whether a
zoned device was formatted with f2fs (assuming there is a known f2fs
superblock)?
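Such a check is cheap to sketch. f2fs stores its superblock at byte offset
1024 with the little-endian magic 0xF2F52010, so a tool (or dmzadm itself)
could bail out along these lines; this is only an illustration, not actual
dm-zoned code, and /dev/sdX is a placeholder:

```shell
# Return success if the target disk already carries an f2fs superblock.
# f2fs writes the magic 0xF2F52010 (bytes 10 20 f5 f2 on disk) at offset 1024.
has_f2fs() {
    magic="$(dd if="$1" bs=1 skip=1024 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')"
    [ "$magic" = "1020f5f2" ]
}

# Example guard before creating a dm-zoned target:
#   has_f2fs /dev/sdX && { echo "f2fs detected, refusing" >&2; exit 1; }
```

In practice blkid already knows the f2fs signature, so shelling out to it
would be the less hand-rolled route; the point is only that the probe is
trivial.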

> 3) run dmsetup for the rest of devices

automagically running dmsetup directly from udev to create a dm-zoned
target is very much wrong.  It just gets in the way of proper support
that should be added to appropriate tools that admins use to set up their
zoned devices.  For instance, persistent use of dm-zoned target should
be made reliable with a volume manager..

In general this udev script is unwelcome and makes things way worse for
the long-term success of zoned devices.

I don't dispute there is an obvious void for how to properly setup zoned
devices, but this script is _not_ what should fill that void.

So a heartfelt:

Nacked-by: Mike Snitzer <snitzer@redhat.com>
Bart Van Assche June 14, 2018, 1:39 p.m. UTC | #3
On Thu, 2018-06-14 at 10:01 +0000, Damien Le Moal wrote:
> Applied. Thanks Luis !

Hello Damien,

Can this still be undone? I agree with Mike that it's wrong to invoke
"/sbin/dmsetup create ... zoned ..." from a udev rule.

Thanks,

Bart.
Christoph Hellwig June 14, 2018, 1:42 p.m. UTC | #4
On Thu, Jun 14, 2018 at 01:39:50PM +0000, Bart Van Assche wrote:
> On Thu, 2018-06-14 at 10:01 +0000, Damien Le Moal wrote:
> > Applied. Thanks Luis !
> 
> Hello Damien,
> 
> Can this still be undone? I agree with Mike that it's wrong to invoke
> "/sbin/dmsetup create ... zoned ..." from a udev rule.

Yes.  We'll really need to verify the device has dm-zoned metadata
first.  Preferably including a uuid for stable device naming.
Bart Van Assche June 14, 2018, 4:19 p.m. UTC | #5
On Wed, 2018-06-13 at 17:11 -0700, Luis R. Rodriguez wrote:
> This tries to put a bit of this tribal knowledge into an initial udev
> rule for development with the hopes Linux distributions can later
> deploy. Three rule are added. One rule is optional for now, it should be
> extended later to be more distribution-friendly and then I think this
> may be ready for consideration for integration on distributions.
> 
> 1) scheduler setup
> 2) backlist f2fs devices
> 3) run dmsetup for the rest of devices

Hello Luis,

I think it is wrong to package the zoned block device scheduler rule in the
dm-zoned-tools package. That udev rule should be activated whether or not the
dm-zoned-tools package has been installed. Have you considered to submit the
zoned block device scheduler rule to the systemd project since today that
project includes all base udev rules?

> +# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
> +# mandated for all SMR drives since v4.16. It has been determined this must be
> +# done through a udev rule, and the kernel should not set this up for disks.
> +# This magic will have to live for *all* zoned disks.
> +# XXX: what about distributions that want mq-deadline ? Probably easy for now
> +#      to assume deadline and later have a mapping file to enable
> +#      mq-deadline for specific serial devices?
> +ACTION=="add|change", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
> +	ATTR{queue/scheduler}="deadline"

I think it is wrong to limit this rule to SCSI disks only. Work is ongoing to
add zoned block device support to the null_blk driver. That is a block driver
and not a SCSI driver. I think the above udev rule should apply to that block
driver too.

Regarding blk-mq, from the mq-deadline source code:
	.elevator_alias = "deadline",

In other words, the name "deadline" should work both for legacy and for blk-mq
block devices.
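So a rule or an admin can always write the alias and then read back what was
actually selected. A small sketch of checking this from the shell (the device
name is a placeholder; the helper just parses the bracketed active entry in
the sysfs scheduler file):

```shell
# The active scheduler is the bracketed entry in
# /sys/block/<dev>/queue/scheduler, e.g. "none [mq-deadline] kyber".
active_sched() {
    sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# On a real system (as root):
#   echo deadline > /sys/block/sdX/queue/scheduler  # alias -> mq-deadline on blk-mq
#   active_sched < /sys/block/sdX/queue/scheduler
```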

Thanks,

Bart.
Bart Van Assche June 14, 2018, 4:23 p.m. UTC | #6
On Thu, 2018-06-14 at 08:38 -0400, Mike Snitzer wrote:
> On Wed, Jun 13 2018 at  8:11pm -0400,
> Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> > 1) scheduler setup
> 
> This is wrong.. if zoned devices are so dependent on deadline or
> mq-deadline then the kernel should allow them to be hardcoded.  I know
> Jens removed the API to do so but the fact that drivers need to rely on
> hacks like this udev rule to get a functional device is proof we need to
> allow drivers to impose the scheduler used.

Hello Mike,

As you know the Linux kernel block layer stack can reorder requests. However,
for zoned block devices it is essential that the block device receives write
requests in the same order as these were submitted by the (user space)
application or by the (kernel) filesystem. After a long debate the choice was
made to make I/O schedulers responsible for guaranteeing to preserve the write
order. Today only the deadline scheduler guarantees that the write order is
preserved. Hence the udev rule that sets the deadline scheduler for zoned
block devices.

Bart.
Luis Chamberlain June 14, 2018, 5:37 p.m. UTC | #7
On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
> On Wed, Jun 13 2018 at  8:11pm -0400,
> Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> 
> > Setting up a zoned disks in a generic form is not so trivial. There
> > is also quite a bit of tribal knowledge with these devices which is not
> > easy to find.
> > 
> > The currently supplied demo script works but it is not generic enough to be
> > practical for Linux distributions or even developers which often move
> > from one kernel to another.
> > 
> > This tries to put a bit of this tribal knowledge into an initial udev
> > rule for development with the hopes Linux distributions can later
> > deploy. Three rule are added. One rule is optional for now, it should be
> > extended later to be more distribution-friendly and then I think this
> > may be ready for consideration for integration on distributions.
> > 
> > 1) scheduler setup
> 
> This is wrong.. if zoned devices are so dependent on deadline or
> mq-deadline then the kernel should allow them to be hardcoded.  I know
> Jens removed the API to do so but the fact that drivers need to rely on
> hacks like this udev rule to get a functional device is proof we need to
> allow drivers to impose the scheduler used.

This is the point of the patch as well. I actually tend to agree with you,
and I had tried to draw up a patch to do just that; however it's *not* possible
today to do this, and it would require some consensus. So from what I can tell
we *have* to live with this one or a form of it, i.e. a file describing which
disk serial gets deadline and which one gets mq-deadline.

Jens?

Anyway, let's assume this is done in the kernel: which disks would use
deadline, and which would use mq-deadline?

> > 2) backlist f2fs devices
> 
> There should porbably be support in dm-zoned for detecting whether a
> zoned device was formatted with f2fs (assuming there is a known f2fs
> superblock)?

Not sure what you mean. Are you suggesting we always set up dm-zoned for
all zoned disks and just make an exception in the dm-zoned code to somehow
use the disk directly if a filesystem supports zoned disks natively?

f2fs does not require dm-zoned. What would be required is a bit more complex
given one could dedicate portions of the disk to f2fs and other portions to
another filesystem, which would require dm-zoned.

Also, filesystems which *do not* support zoned disks should *not* be allowed
to be set up directly on them. Today that's all filesystems other than f2fs;
in the future that may change. Those are loaded guns we are leaving around
for users who are just waiting to shoot themselves in the foot.

So who's going to work on all the above?

The point of the udev script is to illustrate the pains of properly deploying
zoned disks on distributions today, and without a roadmap... this is what
at least I need on my systems today to reasonably deploy these disks for
my own development.

Consensus is indeed needed for a broader picture.

> > 3) run dmsetup for the rest of devices
> 
> automagically running dmsetup directly from udev to create a dm-zoned
> target is very much wrong.  It just gets in the way of proper support
> that should be add to appropriate tools that admins use to setup their
> zoned devices.  For instance, persistent use of dm-zoned target should
> be made reliable with a volume manager..

Ah yes, but who's working on that? How long will it take?

I agree it is odd to expect one to use dmsetup and then a volume manager on
top of it; if we can just add proper support to the volume manager, then
that's a reasonable way to go.

But *we're not there* yet, and as-is today, what is described in the udev
script is the best we can do for a generic setup.

> In general this udev script is unwelcome and makes things way worse for
> the long-term success of zoned devices.

dm-zoned-tools does not acknowledge any roadmap, and just provides
a script, which IMHO is less generic and less distribution-friendly. Having
a udev rule in place to demonstrate the current state of affairs is IMHO
more scalable and demonstrates the issues better than the script does.

If we have an agreed-upon long-term strategy let's document that. But from
what I gather we are not even in consensus with regards to the scheduler
stuff. If we have consensus on the other stuff let's document that, as
dm-zoned-tools is the only place I think folks could find to reasonably
deploy these things.

> I don't dispute there is an obvious void for how to properly setup zoned
> devices, but this script is _not_ what should fill that void.

Good to know! Again, consider it as an alternative to the script.

I'm happy to adapt the language and supply it only as an example script
developers can use, but we can't leave users hanging either. Let's at
least come up with a plan we seem to agree on and document that.

  Luis
Luis Chamberlain June 14, 2018, 5:44 p.m. UTC | #8
On Thu, Jun 14, 2018 at 04:19:23PM +0000, Bart Van Assche wrote:
> On Wed, 2018-06-13 at 17:11 -0700, Luis R. Rodriguez wrote:
> > This tries to put a bit of this tribal knowledge into an initial udev
> > rule for development with the hopes Linux distributions can later
> > deploy. Three rule are added. One rule is optional for now, it should be
> > extended later to be more distribution-friendly and then I think this
> > may be ready for consideration for integration on distributions.
> > 
> > 1) scheduler setup
> > 2) backlist f2fs devices
> > 3) run dmsetup for the rest of devices
> 
> Hello Luis,
> 
> I think it is wrong to package the zoned block device scheduler rule in the
> dm-zoned-tools package. That udev rule should be activated whether or not the
> dm-zoned-tools package has been installed. Have you considered to submit the
> zoned block device scheduler rule to the systemd project since today that
> project includes all base udev rules?

Nope, this is a udev rule intended for developers wishing to brush up on our
state of affairs. All I wanted was to get my own drive to work at home in
a reasonably reliable and generic form, one which also allowed me to hop
between kernels without a fatal issue.

Clearly there is much to discuss still and a roadmap to illustrate. We should
update the documentation to reflect this first, and based on that discussion
then enable distributions. We should have short-term and long-term plans.
That discussion may have taken place already, so forgive me for not knowing
about it; I did try to read as much as I could from the code and existing
tools and just didn't find anything to make it clear.

> > +# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
> > +# mandated for all SMR drives since v4.16. It has been determined this must be
> > +# done through a udev rule, and the kernel should not set this up for disks.
> > +# This magic will have to live for *all* zoned disks.
> > +# XXX: what about distributions that want mq-deadline ? Probably easy for now
> > +#      to assume deadline and later have a mapping file to enable
> > +#      mq-deadline for specific serial devices?
> > +ACTION=="add|change", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
> > +	ATTR{queue/scheduler}="deadline"
> 
> I think it is wrong to limit this rule to SCSI disks only. Work is ongoing to
> add zoned block device support to the null_blk driver. That is a block driver
> and not a SCSI driver. I think the above udev rule should apply to that block
> driver too.

Sure, patches welcome :)

> Regarding blk-mq, from the mq-deadline source code:
> 	.elevator_alias = "deadline",
> 
> In other words, the name "deadline" should work both for legacy and for blk-mq
> block devices.

Groovy that helps.

  Luis
Luis Chamberlain June 14, 2018, 5:46 p.m. UTC | #9
On Thu, Jun 14, 2018 at 07:37:19PM +0200, Luis R. Rodriguez wrote:
> 
> Ie a file describing which
> disk serial gets deadline and which one gets mq-deadline.
> Anyway, let's assume this is done in the kernel, which one would use deadline,
> which one would use mq-deadline?

Never mind this; deadline will work, Bart already stated why.

 Luis
Mike Snitzer June 14, 2018, 5:58 p.m. UTC | #10
On Thu, Jun 14 2018 at  1:37pm -0400,
Luis R. Rodriguez <mcgrof@kernel.org> wrote:

> On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
> > On Wed, Jun 13 2018 at  8:11pm -0400,
> > Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> > 
> > > Setting up a zoned disks in a generic form is not so trivial. There
> > > is also quite a bit of tribal knowledge with these devices which is not
> > > easy to find.
> > > 
> > > The currently supplied demo script works but it is not generic enough to be
> > > practical for Linux distributions or even developers which often move
> > > from one kernel to another.
> > > 
> > > This tries to put a bit of this tribal knowledge into an initial udev
> > > rule for development with the hopes Linux distributions can later
> > > deploy. Three rule are added. One rule is optional for now, it should be
> > > extended later to be more distribution-friendly and then I think this
> > > may be ready for consideration for integration on distributions.
> > > 
> > > 1) scheduler setup
> > 
> > This is wrong.. if zoned devices are so dependent on deadline or
> > mq-deadline then the kernel should allow them to be hardcoded.  I know
> > Jens removed the API to do so but the fact that drivers need to rely on
> > hacks like this udev rule to get a functional device is proof we need to
> > allow drivers to impose the scheduler used.
> 
> This is the point to the patch as well, I actually tend to agree with you,
> and I had tried to draw up a patch to do just that, however its *not* possible
> today to do this and would require some consensus. So from what I can tell
> we *have* to live with this one or a form of it. Ie a file describing which
> disk serial gets deadline and which one gets mq-deadline.
> 
> Jens?
> 
> Anyway, let's assume this is done in the kernel, which one would use deadline,
> which one would use mq-deadline?

The zoned storage driver needs to make that call based on what mode it
is in.  If it is using blk-mq then it selects mq-deadline, otherwise
deadline.
 
> > > 2) backlist f2fs devices
> > 
> > There should porbably be support in dm-zoned for detecting whether a
> > zoned device was formatted with f2fs (assuming there is a known f2fs
> > superblock)?
> 
> Not sure what you mean. Are you suggesting we always setup dm-zoned for
> all zoned disks and just make an excemption on dm-zone code to somehow
> use the disk directly if a filesystem supports zoned disks directly somehow?

No, I'm saying that a udev rule wouldn't be needed if dm-zoned just
errored out when asked to consume disks that already have an f2fs
superblock.  And existing filesystems would get conflicting-superblock
awareness "for free" if blkid or whatever is trained to be aware of
f2fs's superblock.
 
> f2fs does not require dm-zoned. What would be required is a bit more complex
> given one could dedicate portions of the disk to f2fs and other portions to
> another filesystem, which would require dm-zoned.
> 
> Also filesystems which *do not* support zoned disks should *not* be allowing
> direct setup. Today that's all filesystems other than f2fs, in the future
> that may change. Those are bullets we are allowing to trigger for users
> just waiting to shot themselves on the foot with.
> 
> So who's going to work on all the above?

It should take care of itself if existing tools are trained to be aware
of new signatures.  E.g. ext4 and xfs are already aware of one another,
so you cannot reformat a device with the other unless force is
given.

Same kind of mutual exclusion needs to happen for zoned devices.

So the zoned device tools, dm-zoned, f2fs, whatever.. they need to be
updated to not step on each other's toes.  And other filesystems' tools
need to be updated to be zoned device aware.
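Most of that awareness already exists in the common probing machinery:
`wipefs -n` from util-linux probes for signatures without erasing anything,
so a formatting tool can refuse up front. A sketch under that assumption
(this is illustrative, not dm-zoned-tools code):

```shell
# Refuse to format unless the device carries no known signature.
# wipefs -n (--no-act) only probes via libblkid; it never erases anything,
# and prints nothing when no signature is found.
is_unclaimed() {
    [ -z "$(wipefs -n "$1" 2>/dev/null)" ]
}

# Example (device is a placeholder):
#   is_unclaimed /dev/sdX || { echo "existing signature, use --force" >&2; exit 1; }
```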

> The point of the udev script is to illustrate the pains to properly deploy
> zoned disks on distributions today and without a roadmap... this is what
> at least I need on my systems today to reasonably deploy these disks for
> my own development.
> 
> Consensus is indeed needed for a broader picture.

Yeap.

> > > 3) run dmsetup for the rest of devices
> > 
> > automagically running dmsetup directly from udev to create a dm-zoned
> > target is very much wrong.  It just gets in the way of proper support
> > that should be add to appropriate tools that admins use to setup their
> > zoned devices.  For instance, persistent use of dm-zoned target should
> > be made reliable with a volume manager..
> 
> Ah yes, but who's working on that? How long will it take?

No idea, as is (from my vantage point) there is close to zero demand for
zoned devices.  It won't be a priority until enough customers are asking
for it.

> I agree it is odd to expect one to use dmsetup and then use a volume manager on
> top of it, if we can just add proper support onto the volume manager... then
> that's a reasonable way to go.
> 
> But *we're not there* yet, and as-is today, what is described in the udev
> script is the best we can do for a generic setup.

Just because doing things right takes work doesn't mean it makes sense
to elevate this udev script to be packaged in some upstream project like
udev or whatever.

But if SUSE or some other distro wants to ship it that is fine.

> > In general this udev script is unwelcome and makes things way worse for
> > the long-term success of zoned devices.
> 
> dm-zoned-tools does not acknowledge in any way a roadmap, and just provides
> a script, which IMHO is less generic and less distribution friendly. Having
> a udev rule in place to demonstrate the current state of affairs IMHO is
> more scalable demonstrates the issues better than the script.
> 
> If we have an agreed upon long term strategy lets document that. But from
> what I gather we are not even in consensus with regards to the scheduler
> stuff. If we have consensus on the other stuff lets document that as
> dm-zoned-tools is the only place I think folks could find to reasonably
> deploy these things.

I'm sure Damien and others will have something to say here.

> > I don't dispute there is an obvious void for how to properly setup zoned
> > devices, but this script is _not_ what should fill that void.
> 
> Good to know! Again, consider it as an alternative to the script.
> 
> I'm happy to adapt the language and supply it only as an example script
> developers can use, but we can't leave users hanging as well. Let's at
> least come up with a plan which we seem to agree on and document that.

Best to try to get Damien and others more invested in zoned devices to
help you take up your cause.  I think it is worthwhile to develop a
strategy.  But it needs to be done in terms of the norms of the existing
infrastructure we all make use of today.  So first step is making
existing tools zoned device aware (even if to reject such devices).

Mike
Damien Le Moal June 15, 2018, 9 a.m. UTC | #11
Mike,

On 6/14/18 21:38, Mike Snitzer wrote:
> On Wed, Jun 13 2018 at  8:11pm -0400,
> Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> 
>> Setting up a zoned disks in a generic form is not so trivial. There
>> is also quite a bit of tribal knowledge with these devices which is not
>> easy to find.
>>
>> The currently supplied demo script works but it is not generic enough to be
>> practical for Linux distributions or even developers which often move
>> from one kernel to another.
>>
>> This tries to put a bit of this tribal knowledge into an initial udev
>> rule for development with the hopes Linux distributions can later
>> deploy. Three rule are added. One rule is optional for now, it should be
>> extended later to be more distribution-friendly and then I think this
>> may be ready for consideration for integration on distributions.
>>
>> 1) scheduler setup
> 
> This is wrong.. if zoned devices are so dependent on deadline or
> mq-deadline then the kernel should allow them to be hardcoded.  I know
> Jens removed the API to do so but the fact that drivers need to rely on
> hacks like this udev rule to get a functional device is proof we need to
> allow drivers to impose the scheduler used.

I agree. Switching scheduler in the kernel during device probe/bring-up
would be my preferred choice. But until we come to a consensus on the
best way to do this, I think that this udev rule is useful since the
"scheduler=" kernel parameter is rather heavy handed and applies to all
single queue devices. Not to mention that adding this parameter to the
kernel arguments is in essence similar to the udev rule addition: action
from the system user is necessary to achieve a correct configuration
even though that could easily be done automatically from within the
block layer.

In the meantime, documenting properly that the deadline scheduler has a
special relationship with zoned block devices is still a nice thing to
do. But yes, dm-zoned-tools may not be the best place to do that. A
simple text file under Documentation/block may be better. At the very
least, Documentation/block/deadline-iosched.txt should have some mention
of its almost mandatory use with zoned block devices. I will work on that.

>> 2) backlist f2fs devices
> 
> There should porbably be support in dm-zoned for detecting whether a
> zoned device was formatted with f2fs (assuming there is a known f2fs
> superblock)?

That would certainly be nice to have and would not be too hard to code
in dmzadm. I can add that and do what for instance mkfs.ext4 or mkfs.xfs
do, which is ask for confirmation or bail out if an existing valid FS
format is detected on the disk.

That said, such tests are far from common. mdadm for instance will
happily format and start an array using unmounted disks with valid file
systems on them.

>> 3) run dmsetup for the rest of devices
> 
> automagically running dmsetup directly from udev to create a dm-zoned
> target is very much wrong.  It just gets in the way of proper support
> that should be add to appropriate tools that admins use to setup their
> zoned devices.  For instance, persistent use of dm-zoned target should
> be made reliable with a volume manager..
> 
> In general this udev script is unwelcome and makes things way worse for
> the long-term success of zoned devices.
> 
> I don't dispute there is an obvious void for how to properly setup zoned
> devices, but this script is _not_ what should fill that void.

Fair points. I agree that it is hackish. The intent was as you say to
temporarily fill a void. But granted, temporary hacks tend to stick
around (too) long and can in the end get in the way of clean solutions. I will
start looking into a proper fix for dm-zoned setup persistence.

> So a heartfelt:
> 
> Nacked-by: Mike Snitzer <snitzer@redhat.com>

Understood. I will revert and not document this udev rule in
dm-zoned-tools. I was a little too quick in applying this patch and did
not wait to see comments first. My bad.

Thank you for your comments.

Best regards.

-- 
Damien Le Moal,
Western Digital
Damien Le Moal June 15, 2018, 9:59 a.m. UTC | #12
Mike,

On 6/15/18 02:58, Mike Snitzer wrote:
> On Thu, Jun 14 2018 at  1:37pm -0400,
> Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> 
>> On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
>>> On Wed, Jun 13 2018 at  8:11pm -0400,
>>> Luis R. Rodriguez <mcgrof@kernel.org> wrote:
>>>
>>>> Setting up a zoned disks in a generic form is not so trivial. There
>>>> is also quite a bit of tribal knowledge with these devices which is not
>>>> easy to find.
>>>>
>>>> The currently supplied demo script works but it is not generic enough to be
>>>> practical for Linux distributions or even developers which often move
>>>> from one kernel to another.
>>>>
>>>> This tries to put a bit of this tribal knowledge into an initial udev
>>>> rule for development with the hopes Linux distributions can later
>>>> deploy. Three rule are added. One rule is optional for now, it should be
>>>> extended later to be more distribution-friendly and then I think this
>>>> may be ready for consideration for integration on distributions.
>>>>
>>>> 1) scheduler setup
>>>
>>> This is wrong.. if zoned devices are so dependent on deadline or
>>> mq-deadline then the kernel should allow them to be hardcoded.  I know
>>> Jens removed the API to do so but the fact that drivers need to rely on
>>> hacks like this udev rule to get a functional device is proof we need to
>>> allow drivers to impose the scheduler used.
>>
>> This is the point to the patch as well, I actually tend to agree with you,
>> and I had tried to draw up a patch to do just that, however its *not* possible
>> today to do this and would require some consensus. So from what I can tell
>> we *have* to live with this one or a form of it. Ie a file describing which
>> disk serial gets deadline and which one gets mq-deadline.
>>
>> Jens?
>>
>> Anyway, let's assume this is done in the kernel, which one would use deadline,
>> which one would use mq-deadline?
> 
> The zoned storage driver needs to make that call based on what mode it
> is in.  If it is using blk-mq then it selects mq-deadline, otherwise
> deadline.

As Bart pointed out, deadline is an alias of mq-deadline. So using
"deadline" as the scheduler name works in both legacy and mq cases.
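The alias behavior can be verified from userspace by looking at the scheduler list the kernel exposes in sysfs, where the active entry is shown in brackets. A small sketch (the `active_sched` helper is illustrative, not part of any existing tool):

```shell
# The active scheduler is shown in brackets in
# /sys/block/<disk>/queue/scheduler, e.g. "noop [deadline] cfq".
# Extract the bracketed (active) entry from such a string.
active_sched() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^\[\(.*\)\]$/\1/p'
}

active_sched "noop [deadline] cfq"        # legacy path -> deadline
active_sched "[mq-deadline] kyber none"   # blk-mq path -> mq-deadline
```

Writing the name "deadline" to `queue/scheduler` then resolves to whichever variant the device's I/O path provides.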

>>>> 2) backlist f2fs devices
>>>
>>> There should porbably be support in dm-zoned for detecting whether a
>>> zoned device was formatted with f2fs (assuming there is a known f2fs
>>> superblock)?
>>
>> Not sure what you mean. Are you suggesting we always setup dm-zoned for
>> all zoned disks and just make an excemption on dm-zone code to somehow
>> use the disk directly if a filesystem supports zoned disks directly somehow?
> 
> No, I'm saying that a udev rule wouldn't be needed if dm-zoned just
> errored out if asked to consume disks that already have an f2fs
> superblock.  And existing filesystems should get conflicting superblock
> awareness "for free" if blkid or whatever is trained to be aware of
> f2fs's superblock.

Well that is the case already: on startup, dm-zoned will read its own
metadata from sector 0, same as f2fs would do with its super-block. If
the format/magic does not match expected values, dm-zoned will bail out
and return an error. dm-zoned metadata and f2fs metadata reside in the
same place and overwrite each other. There is no way to get one working
on top of the other. I do not see any possibility of a problem on startup.

But definitely, the user land format tools can step on each other's toes.
That needs fixing.
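At its core such a check is just reading the first bytes of the device and comparing against a known magic. A minimal sketch of the idea; the magic value, offset, and helper name below are made up for illustration, real tools would compare against the actual dm-zoned and f2fs on-disk magics:

```shell
# Check whether a device (or file) carries a given 4-byte magic at a
# given byte offset. Magic value and offset are illustrative only.
has_magic() {  # usage: has_magic <path> <byte-offset> <hex-string>
    got=$(dd if="$1" bs=1 skip="$2" count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$got" = "$3" ]
}

# Example guard before formatting (hypothetical magic "deadbeef"):
#   has_magic /dev/sda 0 deadbeef && { echo "existing metadata found" >&2; exit 1; }
```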

>> f2fs does not require dm-zoned. What would be required is a bit more complex
>> given one could dedicate portions of the disk to f2fs and other portions to
>> another filesystem, which would require dm-zoned.
>>
>> Also filesystems which *do not* support zoned disks should *not* be allowing
>> direct setup. Today that's all filesystems other than f2fs, in the future
>> that may change. Those are bullets we are allowing to trigger for users
>> just waiting to shot themselves on the foot with.
>>
>> So who's going to work on all the above?
> 
> It should take care of itself if existing tools are trained to be aware
> of new signatures.  E.g. ext4 and xfs already are aware of one another
> so that you cannot reformat a device with the other unless force is
> given.
> 
> Same kind of mutual exclussion needs to happen for zoned devices.

Yes.

> So the zoned device tools, dm-zoned, f2fs, whatever.. they need to be
> updated to not step on each others toes.  And other filesystems' tools
> need to be updated to be zoned device aware.

I will update dm-zoned tools to check for known FS superblocks,
similarly to what mkfs.ext4 and mkfs.xfs do.
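The mkfs-style behavior boils down to probing for an existing signature and requiring explicit confirmation before overwriting. A sketch of just the decision logic (the `may_format` helper is hypothetical; the commented wiring assumes the standard util-linux `blkid` probe options):

```shell
# mkfs-style guard: formatting may proceed only when no existing
# filesystem signature was detected, or the user explicitly confirmed.
may_format() {  # usage: may_format <detected-fstype-or-empty> <answer>
    [ -z "$1" ] || [ "$2" = "y" ]
}

# Typical wiring, probing with blkid for a known superblock:
#   fstype=$(blkid -p -o value -s TYPE "$dev" 2>/dev/null)
#   printf '%s contains a %s signature, proceed? [y/N] ' "$dev" "$fstype"
#   read -r answer
#   may_format "$fstype" "$answer" || exit 1
```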

>>>> 3) run dmsetup for the rest of devices
>>>
>>> automagically running dmsetup directly from udev to create a dm-zoned
>>> target is very much wrong.  It just gets in the way of proper support
>>> that should be add to appropriate tools that admins use to setup their
>>> zoned devices.  For instance, persistent use of dm-zoned target should
>>> be made reliable with a volume manager..
>>
>> Ah yes, but who's working on that? How long will it take?
> 
> No idea, as is (from my vantage point) there is close to zero demand for
> zoned devices.  It won't be a priority until enough customers are asking
> for it.

From my point of view (drive vendor), things are different. We do see an
increasing interest for these drives. However, most use cases are still
limited to application based direct disk access with minimal involvement
from the kernel and so few "support" requests. Many reasons for this, but
one is to some extent the current lack of extended support by the
kernel. Despite all the recent work done, as Luis experienced, zoned
drives are still far harder to easily setup than regular disks. Chicken
and egg situation...

>> I agree it is odd to expect one to use dmsetup and then use a volume manager on
>> top of it, if we can just add proper support onto the volume manager... then
>> that's a reasonable way to go.
>>
>> But *we're not there* yet, and as-is today, what is described in the udev
>> script is the best we can do for a generic setup.
> 
> Just because doing things right takes work doesn't mean it makes sense
> to elevate this udev script to be packaged in some upstream project like
> udev or whatever.

Agree. Will start looking into better solutions now that at least one
user (Luis) complained. The customer is king.

>>> In general this udev script is unwelcome and makes things way worse for
>>> the long-term success of zoned devices.
>>
>> dm-zoned-tools does not acknowledge in any way a roadmap, and just provides
>> a script, which IMHO is less generic and less distribution friendly. Having
>> a udev rule in place to demonstrate the current state of affairs IMHO is
>> more scalable demonstrates the issues better than the script.
>>
>> If we have an agreed upon long term strategy lets document that. But from
>> what I gather we are not even in consensus with regards to the scheduler
>> stuff. If we have consensus on the other stuff lets document that as
>> dm-zoned-tools is the only place I think folks could find to reasonably
>> deploy these things.
> 
> I'm sure Damien and others will have something to say here.

Yes. The scheduler setup pain is real. Jens made it clear that he
prefers a udev rule. I fully understand his point of view, yet, I think
an automatic switch in the block layer would be far easier and generate
far fewer problems for users, and likely fewer "bug reports" to
distribution vendors (and to myself too).

That said, I also like to see the current dependency of zoned devices on
the deadline scheduler as temporary until a better solution for ensuring
write ordering is found. After all, requiring deadline as the disk
scheduler does impose other limitations on the user. Lack of I/O
priority support and no cgroup based fairness are two examples of what
other schedulers provide but are lost when forcing deadline.

The obvious fix is of course to make all disk schedulers zone device
aware. A little heavy handed, probably lots of duplicated/similar code,
and many more test cases to cover. This approach does not seem
sustainable to me.

We discussed other possibilities at LSF/MM (specialized write queue in
multi-queue path). One could also think of more invasive changes to the
block layer (e.g. adding an optional "dispatcher" layer to tightly
control command ordering?). And probably a lot more options, but I am
not yet sure what an appropriate replacement to deadline would be.

Eventually, the removal of the legacy I/O path may also be the trigger
to introduce some deeper design changes to blk-mq to accommodate more
easily zoned block devices or other non-standard block devices (open
channel SSDs for instance).

As you can see from the above, working with these drives all day long
does not make for a clear strategy. Inputs from others here are more than
welcome. I would be happy to write up all the ideas I have to start a
discussion so that we can come to a consensus and have a plan.

>>> I don't dispute there is an obvious void for how to properly setup zoned
>>> devices, but this script is _not_ what should fill that void.
>>
>> Good to know! Again, consider it as an alternative to the script.
>>
>> I'm happy to adapt the language and supply it only as an example script
>> developers can use, but we can't leave users hanging as well. Let's at
>> least come up with a plan which we seem to agree on and document that.
> 
> Best to try to get Damien and others more invested in zoned devices to
> help you take up your cause.  I think it is worthwhile to develop a
> strategy.  But it needs to be done in terms of the norms of the existing
> infrastructure we all make use of today.  So first step is making
> existing tools zoned device aware (even if to reject such devices).

Rest assured that I am fully invested in improving the existing
infrastructure for zoned block devices. As mentioned above, application-based
use of zoned block devices still prevails today. So I do tend to
work more on that side of things (libzbc, tcmu, sysutils for instance)
rather than on a better integration with more advanced tools (such as
LVM) relying on kernel features. I am however seeing rising interest in
file systems and also in dm-zoned. So definitely it is time to step up
work in that area to further simplify using these drives.

Thank you for the feedback.

Best regards.

-- 
Damien Le Moal,
Western Digital
Martin Wilck June 15, 2018, 11:07 a.m. UTC | #13
On Thu, 2018-06-14 at 06:42 -0700, Christoph Hellwig wrote:
> On Thu, Jun 14, 2018 at 01:39:50PM +0000, Bart Van Assche wrote:
> > On Thu, 2018-06-14 at 10:01 +0000, Damien Le Moal wrote:
> > > Applied. Thanks Luis !
> > 
> > Hello Damien,
> > 
> > Can this still be undone? I agree with Mike that it's wrong to
> > invoke
> > "/sbin/dmsetup create ... zoned ..." from a udev rule.
> 
> Yes.  We'll really need to verfify the device has dm-zoned metadata
> first.  Preferably including a uuid for stable device naming.

libblkid would be the central hub for metadata discovery, so perhaps a
patch should be made to make libblkid dm-zoned-aware.

Anyway, as Damien explained, dmzoned bails out if it doesn't find
matching metadata, so AFAICS, little harm is done by calling it for a
SMR device in host-managed mode. I fail to get the point why this would
be wrong in general - what's the difference to e.g. calling "mdadm -I"?

Regards
Martin
Mike Snitzer June 15, 2018, 2:50 p.m. UTC | #14
On Fri, Jun 15 2018 at  5:59am -0400,
Damien Le Moal <Damien.LeMoal@wdc.com> wrote:

> Mike,
> 
> On 6/15/18 02:58, Mike Snitzer wrote:
> > On Thu, Jun 14 2018 at  1:37pm -0400,
> > Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> > 
> >> On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
> >>> On Wed, Jun 13 2018 at  8:11pm -0400,
> >>> Luis R. Rodriguez <mcgrof@kernel.org> wrote:
> >>>
> >>>> Setting up a zoned disks in a generic form is not so trivial. There
> >>>> is also quite a bit of tribal knowledge with these devices which is not
> >>>> easy to find.
> >>>>
> >>>> The currently supplied demo script works but it is not generic enough to be
> >>>> practical for Linux distributions or even developers which often move
> >>>> from one kernel to another.
> >>>>
> >>>> This tries to put a bit of this tribal knowledge into an initial udev
> >>>> rule for development with the hopes Linux distributions can later
> >>>> deploy. Three rule are added. One rule is optional for now, it should be
> >>>> extended later to be more distribution-friendly and then I think this
> >>>> may be ready for consideration for integration on distributions.
> >>>>
> >>>> 1) scheduler setup
> >>>
> >>> This is wrong.. if zoned devices are so dependent on deadline or
> >>> mq-deadline then the kernel should allow them to be hardcoded.  I know
> >>> Jens removed the API to do so but the fact that drivers need to rely on
> >>> hacks like this udev rule to get a functional device is proof we need to
> >>> allow drivers to impose the scheduler used.
> >>
> >> This is the point to the patch as well, I actually tend to agree with you,
> >> and I had tried to draw up a patch to do just that, however its *not* possible
> >> today to do this and would require some consensus. So from what I can tell
> >> we *have* to live with this one or a form of it. Ie a file describing which
> >> disk serial gets deadline and which one gets mq-deadline.
> >>
> >> Jens?
> >>
> >> Anyway, let's assume this is done in the kernel, which one would use deadline,
> >> which one would use mq-deadline?
> > 
> > The zoned storage driver needs to make that call based on what mode it
> > is in.  If it is using blk-mq then it selects mq-deadline, otherwise
> > deadline.
> 
> As Bart pointed out, deadline is an alias of mq-deadline. So using
> "deadline" as the scheduler name works in both legacy and mq cases.
> 
> >>>> 2) backlist f2fs devices
> >>>
> >>> There should porbably be support in dm-zoned for detecting whether a
> >>> zoned device was formatted with f2fs (assuming there is a known f2fs
> >>> superblock)?
> >>
> >> Not sure what you mean. Are you suggesting we always setup dm-zoned for
> >> all zoned disks and just make an excemption on dm-zone code to somehow
> >> use the disk directly if a filesystem supports zoned disks directly somehow?
> > 
> > No, I'm saying that a udev rule wouldn't be needed if dm-zoned just
> > errored out if asked to consume disks that already have an f2fs
> > superblock.  And existing filesystems should get conflicting superblock
> > awareness "for free" if blkid or whatever is trained to be aware of
> > f2fs's superblock.
> 
> Well that is the case already: on startup, dm-zoned will read its own
> metadata from sector 0, same as f2fs would do with its super-block. If
> the format/magic does not match expected values, dm-zoned will bail out
> and return an error. dm-zoned metadata and f2fs metadata reside in the
> same place and overwrite each other. There is no way to get one working
> on top of the other. I do not see any possibility of a problem on startup.
> 
> But definitely, the user land format tools can step on each other toes.
> That needs fixing.

Right, I was talking about the .ctr path for initial device creation,
not activation of a previously created dm-zoned device.

But I agree it makes most sense to do this check in userspace.

> >> f2fs does not require dm-zoned. What would be required is a bit more complex
> >> given one could dedicate portions of the disk to f2fs and other portions to
> >> another filesystem, which would require dm-zoned.
> >>
> >> Also filesystems which *do not* support zoned disks should *not* be allowing
> >> direct setup. Today that's all filesystems other than f2fs, in the future
> >> that may change. Those are bullets we are allowing to trigger for users
> >> just waiting to shot themselves on the foot with.
> >>
> >> So who's going to work on all the above?
> > 
> > It should take care of itself if existing tools are trained to be aware
> > of new signatures.  E.g. ext4 and xfs already are aware of one another
> > so that you cannot reformat a device with the other unless force is
> > given.
> > 
> > Same kind of mutual exclussion needs to happen for zoned devices.
> 
> Yes.
> 
> > So the zoned device tools, dm-zoned, f2fs, whatever.. they need to be
> > updated to not step on each others toes.  And other filesystems' tools
> > need to be updated to be zoned device aware.
> 
> I will update dm-zoned tools to check for known FS superblocks,
> similarly to what mkfs.ext4 and mkfs.xfs do.

Thanks.

> >>>> 3) run dmsetup for the rest of devices
> >>>
> >>> automagically running dmsetup directly from udev to create a dm-zoned
> >>> target is very much wrong.  It just gets in the way of proper support
> >>> that should be add to appropriate tools that admins use to setup their
> >>> zoned devices.  For instance, persistent use of dm-zoned target should
> >>> be made reliable with a volume manager..
> >>
> >> Ah yes, but who's working on that? How long will it take?
> > 
> > No idea, as is (from my vantage point) there is close to zero demand for
> > zoned devices.  It won't be a priority until enough customers are asking
> > for it.
> 
> From my point of view (drive vendor), things are different. We do see an
> increasing interest for these drives. However, most use cases are still
> limited to application based direct disk access with minimal involvement
> from the kernel and so few "support" requests. Many reasons to this, but
> one is to some extent the current lack of extended support by the
> kernel. Despite all the recent work done, as Luis experienced, zoned
> drives are still far harder to easily setup than regular disks. Chicken
> and egg situation...
> 
> >> I agree it is odd to expect one to use dmsetup and then use a volume manager on
> >> top of it, if we can just add proper support onto the volume manager... then
> >> that's a reasonable way to go.
> >>
> >> But *we're not there* yet, and as-is today, what is described in the udev
> >> script is the best we can do for a generic setup.
> > 
> > Just because doing things right takes work doesn't mean it makes sense
> > to elevate this udev script to be packaged in some upstream project like
> > udev or whatever.
> 
> Agree. Will start looking into better solutions now that at least one
> user (Luis) complained. The customer is king.
> 
> >>> In general this udev script is unwelcome and makes things way worse for
> >>> the long-term success of zoned devices.
> >>
> >> dm-zoned-tools does not acknowledge in any way a roadmap, and just provides
> >> a script, which IMHO is less generic and less distribution friendly. Having
> >> a udev rule in place to demonstrate the current state of affairs IMHO is
> >> more scalable demonstrates the issues better than the script.
> >>
> >> If we have an agreed upon long term strategy lets document that. But from
> >> what I gather we are not even in consensus with regards to the scheduler
> >> stuff. If we have consensus on the other stuff lets document that as
> >> dm-zoned-tools is the only place I think folks could find to reasonably
> >> deploy these things.
> > 
> > I'm sure Damien and others will have something to say here.
> 
> Yes. The scheduler setup pain is real. Jens made it clear that he
> prefers a udev rule. I fully understand his point of view, yet, I think
> an automatic switch in the block layer would be far easier and generate
> a lot less problem for users, and likely less "bug report" to
> distributions vendors (and to myself too).

Yeap, Jens would say that ;)  Unfortunately using udev to get this
critical configuration correct is a real leap of faith that will prove
to be a whack-a-mole across distributions.

> That said, I also like to see the current dependency of zoned devices on
> the deadline scheduler as temporary until a better solution for ensuring
> write ordering is found. After all, requiring deadline as the disk
> scheduler does impose other limitations on the user. Lack of I/O
> priority support and no cgroup based fairness are two examples of what
> other schedulers provide but is lost with forcing deadline.
> 
> The obvious fix is of course to make all disk schedulers zone device
> aware. A little heavy handed, probably lots of duplicated/similar code,
> and many more test cases to cover. This approach does not seem
> sustainable to me.

Right, it isn't sustainable.  There isn't enough zoned device developer
expertise to go around.

> We discussed other possibilities at LSF/MM (specialized write queue in
> multi-queue path). One could also think of more invasive changes to the
> block layer (e.g. adding an optional "dispatcher" layer to tightly
> control command ordering ?). And probably a lot more options, But I am
> not yet sure what an appropriate replacement to deadline would be.
> 
> Eventually, the removal of the legacy I/O path may also be the trigger
> to introduce some deeper design changes to blk-mq to accommodate more
> easily zoned block devices or other non-standard block devices (open
> channel SSDs for instance).
> 
> As you can see from the above, working with these drives all day long
> does not make for a clear strategy. Inputs from other here are more than
> welcome. I would be happy to write up all the ideas I have to start a
> discussion so that we can come to a consensus and have a plan.

Doesn't hurt to establish a future plan(s) but we need to deal with the
reality of what we have.  And all we have for this particular issue is
"deadline".  Setting anything else is a bug.

Short of the block layer reinstating the ability for a driver to specify
an elevator: should the zoned driver put a check in place that errors
out if anything other than deadline is configured?

That'd at least save users from a very cutthroat learning curve.
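Until the kernel enforces it, a userspace guard along these lines could catch the misconfiguration before any dmsetup or mkfs step. A sketch; the sysfs path in the comment is the standard one, but the `is_deadline` helper is made up for illustration:

```shell
# Succeed only when the sysfs scheduler string shows deadline (or its
# blk-mq alias mq-deadline) as the active scheduler.
is_deadline() {  # usage: is_deadline "<contents of queue/scheduler>"
    case "$1" in
        *"[deadline]"*|*"[mq-deadline]"*) return 0 ;;
        *) return 1 ;;
    esac
}

# e.g. before running dmsetup/mkfs on a zoned disk:
#   is_deadline "$(cat /sys/block/sda/queue/scheduler)" ||
#       { echo "sda: zoned devices require the deadline scheduler" >&2; exit 1; }
```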

> >>> I don't dispute there is an obvious void for how to properly setup zoned
> >>> devices, but this script is _not_ what should fill that void.
> >>
> >> Good to know! Again, consider it as an alternative to the script.
> >>
> >> I'm happy to adapt the language and supply it only as an example script
> >> developers can use, but we can't leave users hanging as well. Let's at
> >> least come up with a plan which we seem to agree on and document that.
> > 
> > Best to try to get Damien and others more invested in zoned devices to
> > help you take up your cause.  I think it is worthwhile to develop a
> > strategy.  But it needs to be done in terms of the norms of the existing
> > infrastructure we all make use of today.  So first step is making
> > existing tools zoned device aware (even if to reject such devices).
> 
> Rest assured that I am fully invested in improving the existing
> infrastructure for zoned block devices. As mentioned above, applications
> based use of zoned block devices still prevails today. So I do tend to
> work more on that side of things (libzbc, tcmu, sysutils for instance)
> rather than on a better integration with more advanced tools (such as
> LVM) relying on kernel features. I am however seeing rising interest in
> file systems and also in dm-zoned. So definitely it is time to step up
> work in that area to further simplify using these drives.
> 
> Thank you for the feedback.

Thanks for your insight.  Sounds like you're ontop of it.

Mike

Patch

diff --git a/README b/README
index 65e96c34fd04..f49541eaabc8 100644
--- a/README
+++ b/README
@@ -168,7 +168,15 @@  Options:
                      reclaiming random zones if the percentage of
                      free random data zones falls below <perc>.
 
-V. Example scripts
+V. Udev zone disk deployment
+============================
+
+A udev rule is provided which sets the IO scheduler, blacklists drives
+dedicated to f2fs so dmsetup is not run on them, and runs dmsetup for the
+remaining zoned drives. If you use this udev rule the below script is not
+needed. Be sure to run mkfs only on the resulting /dev/mapper/zoned-$serial device.
+
+VI. Example scripts
 ==================
 
 [[
diff --git a/udev/99-zoned-disks.rules b/udev/99-zoned-disks.rules
new file mode 100644
index 000000000000..e19b738dcc0e
--- /dev/null
+++ b/udev/99-zoned-disks.rules
@@ -0,0 +1,78 @@ 
+# To use a zoned disk, the first thing you need to do is:
+#
+# 1) Enable zone disk support in your kernel
+# 2) Use the deadline or mq-deadline scheduler for it - mandated as of v4.16
+# 3) Blacklist devices dedicated for f2fs as of v4.10
+# 4) Run dmsetup on the other disks
+# 5) Create the filesystem -- NOTE: use mkfs on /dev/mapper/zoned-$serial if
+#    you used dmsetup on the disk.
+# 6) Consider using the nofail mount option in case you boot an unsupported kernel
+#
+# You can use this udev rules file for 2) 3) and 4). Further details below.
+#
+# 1) Enable zone disk support in your kernel
+#
+#    o CONFIG_BLK_DEV_ZONED
+#    o CONFIG_DM_ZONED
+#
+# This will let the kernel actually see these devices, ie, via fdisk /dev/sda
+# for instance. Run:
+#
+# 	dmzadm --format /dev/sda
+
+# 2) Set deadline or mq-deadline for all disks which are zoned
+#
+# Zoned disks can only work with the deadline or mq-deadline scheduler. This is
+# mandated for all SMR drives since v4.16. It has been determined this must be
+# done through a udev rule, and the kernel should not set this up for disks.
+# This magic will have to live for *all* zoned disks.
+# XXX: what about distributions that want mq-deadline ? Probably easy for now
+#      to assume deadline and later have a mapping file to enable
+#      mq-deadline for specific serial devices?
+ACTION=="add|change", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
+	ATTR{queue/scheduler}="deadline"
+
+# 3) Blacklist f2fs devices as of v4.10
+# We don't have to run dmsetup on disks where you want to use f2fs, so you
+# can use this rule to skip dmsetup for it. First get the serial short number.
+#
+#	udevadm info --name=/dev/sda  | grep -i serial_short
+# XXX: To generalize this for distributions consider using IMPORT{db} or so
+# and then use that to check if the serial number matches one in the database.
+#ACTION=="add", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="XXA1ZFFF", GOTO="zone_disk_group_end"
+
+# 4) We need to run dmsetup if you want to use other filesystems
+#
+# dmsetup is not persistent, so it needs to be run on upon every boot.  We use
+# the device serial number for the /dev/mapper/ name.
+ACTION=="add", KERNEL=="sd*[!0-9]", ATTRS{queue/zoned}=="host-managed", \
+	RUN+="/sbin/dmsetup create zoned-$env{ID_SERIAL_SHORT} --table '0 %s{size} zoned $devnode'"
+
+# 5) Create a filesystem for the device
+#
+# Be 100% sure you use /dev/mapper/zone-$YOUR_DEVICE_SERIAL for the mkfs
+# command as otherwise things can break.
+#
+# XXX: preventing the above proactively in the kernel would be ideal however
+# this may be hard.
+#
+# Once you create the filesystem it will get a UUID.
+#
+# To find out what the UUID is, if your zoned disk is for instance your
+# second device-mapper device, ie dm-1, run:
+#
+# 	ls -l /dev/disk/by-uuid/ | grep dm-1
+#
+# To figure out which dm-$number it is, use dmsetup info, the minor number
+# is the $number.
+#
+# 6) Add an entry in /etc/fstab with nofail, for example:
+#
+# UUID=99999999-aaaa-bbbb-c1234aaaabbb33456 /media/monster xfs nofail 0 0
+#
+# nofail will ensure system boots fine even if you boot into a kernel which
+# lacks support for the device and so it is not found. Since the UUID will
+# always match the device we don't care if the device moves around the bus
+# on the system. We just need to get the UUID once.
+
+LABEL="zone_disk_group_end"