mbox series

[0/6] xfsprogs: blockdev dax detection and warnings

Message ID 20200824203724.13477-1-ailiop@suse.com
Headers show
Series xfsprogs: blockdev dax detection and warnings | expand

Message

Anthony Iliopoulos Aug. 24, 2020, 8:37 p.m. UTC
Hi all,

This short series adds blockdev dax capability detection via libblkid,
and subsequently uses this bit to warn on a couple of incompatible
configurations during mkfs.

The first patch adds the detection capability to libtopology, and the
following two patches add mkfs warnings that are issued when the fs
block size is not matching the page size, and when reflink is being
enabled in conjunction with dax.

The next patch adds a new cli option that can be used to override
warnings, and converts all warnings that can be forced to this option
instead the overloaded -f option. This avoids cases where forcing a
warning may also be implicitly forcing overwriting an existing
partition.

The last two patches are small cleanups that remove redundant code.

Anthony Iliopoulos (6):
  libfrog: add dax capability detection in topology probing
  mkfs: warn if blocksize doesn't match pagesize on dax devices
  mkfs: warn if reflink option is enabled on dax-capable devices
  mkfs: introduce -x option to force incompat config combinations
  mkfs: remove redundant assignment of cli sb options on failure
  mkfs: remove a couple of unused function parameters

 include/builddefs.in |  1 +
 libfrog/Makefile     |  4 ++++
 libfrog/topology.c   | 21 +++++++++++------
 libfrog/topology.h   |  1 +
 m4/package_blkid.m4  |  5 ++++
 man/man8/mkfs.xfs.8  |  6 +++++
 mkfs/xfs_mkfs.c      | 55 +++++++++++++++++++++++++++++---------------
 7 files changed, 68 insertions(+), 25 deletions(-)

--
2.28.0

Comments

Dave Chinner Aug. 24, 2020, 10:55 p.m. UTC | #1
On Mon, Aug 24, 2020 at 10:37:18PM +0200, Anthony Iliopoulos wrote:
> Hi all,
> 
> This short series adds blockdev dax capability detection via libblkid,
> and subsequently uses this bit to warn on a couple of incompatible
> configurations during mkfs.
> 
> The first patch adds the detection capability to libtopology, and the
> following two patches add mkfs warnings that are issued when the fs
> block size is not matching the page size, and when reflink is being
> enabled in conjunction with dax.

This makes the assumption that anyone running mkfs on a dax capable
device is going to use DAX, and prevents mkfs from running if the
config is not DAX compatible.

The issue here is that making a filesystem that is not DAX
compatible on a DAX capable device is not a fatal error. The
filesystem will work just fine using buffered and direct IO, and
there are definitely workloads where we want to use buffered IO on
pmem and not DAX. Why? Because existing pmem is terribly slow for
write intensive applications compared to page cache based mmap().
And even buffered writes are faster than DAX direct writes because
the slow writeback is done in the background via async writeback.

Also, what happens if you have a 64kB page size? mkfs defaults to
4kB block size, so with these changes mkfs will refuse to run on a
dax capable device unless the user specifically directs it to do
something different. That's not a good behaviour for the default
config to have....

Hence I don't think that preventing mkfs from running unless the config
is exactly waht DAX requires or the "force" option is set is the
right policy here.

I agree that mkfs needs to be aware of DAX capability of the block
device, but that capability existing should not cause mkfs to fail.
If we want users to be able to direct mkfs to to create a DAX
capable filesystem then adding a -d dax option would be a better
idea. This would direct mkfs to align/size all the data options to
use a DAX compatible topology if blkid supports reporting the DAX
topology. It would also do things like turn off reflink (until that
is supported w/ DAX), etc.

i.e. if the user knows they are going to use DAX (and they will)
then they can tell mkfs to make a DAX compatible filesystem.

> The next patch adds a new cli option that can be used to override
> warnings, and converts all warnings that can be forced to this option
> instead the overloaded -f option. This avoids cases where forcing a
> warning may also be implicitly forcing overwriting an existing
> partition.

I don't want both an "ignore warnings" and a "force" CLI option.
They both do the same thing - allow the user to override things that
are potentially destructive or result in an unusable config - so why
should we add the complexity of having a different "force" options
for every different possible thing that can be overridden?

Then we'll grow a "--force-all" option to override everything.
i.e.  we end up with this insanity:

$ dpkg --help |grep force
  --force-help                     Show help on forcing.
  --[no-]triggers            Skip or force consequential trigger processing.
  --force-<thing>[,...]      Override problems (see --force-help).
  --no-force-<thing>[,...]   Stop when problems encountered.

and "--force-help" gives you a list of about 30 different things you
can force the behaviour of, including "--force-all". This is simply
not a good UI model...

> The last two patches are small cleanups that remove redundant code.

They should be first.

Cheers,

Dave.
Anthony Iliopoulos Aug. 25, 2020, 8:48 a.m. UTC | #2
On Tue, Aug 25, 2020 at 08:55:33AM +1000, Dave Chinner wrote:
> On Mon, Aug 24, 2020 at 10:37:18PM +0200, Anthony Iliopoulos wrote:
> > Hi all,
> > 
> > This short series adds blockdev dax capability detection via libblkid,
> > and subsequently uses this bit to warn on a couple of incompatible
> > configurations during mkfs.
> > 
> > The first patch adds the detection capability to libtopology, and the
> > following two patches add mkfs warnings that are issued when the fs
> > block size is not matching the page size, and when reflink is being
> > enabled in conjunction with dax.
> 
> This makes the assumption that anyone running mkfs on a dax capable
> device is going to use DAX, and prevents mkfs from running if the
> config is not DAX compatible.
> 
> The issue here is that making a filesystem that is not DAX
> compatible on a DAX capable device is not a fatal error. The
> filesystem will work just fine using buffered and direct IO, and
> there are definitely workloads where we want to use buffered IO on
> pmem and not DAX. Why? Because existing pmem is terribly slow for
> write intensive applications compared to page cache based mmap().
> And even buffered writes are faster than DAX direct writes because
> the slow writeback is done in the background via async writeback.
> 
> Also, what happens if you have a 64kB page size? mkfs defaults to
> 4kB block size, so with these changes mkfs will refuse to run on a
> dax capable device unless the user specifically directs it to do
> something different. That's not a good behaviour for the default
> config to have....
> 
> Hence I don't think that preventing mkfs from running unless the config
> is exactly waht DAX requires or the "force" option is set is the
> right policy here.
> 
> I agree that mkfs needs to be aware of DAX capability of the block
> device, but that capability existing should not cause mkfs to fail.
> If we want users to be able to direct mkfs to to create a DAX
> capable filesystem then adding a -d dax option would be a better
> idea. This would direct mkfs to align/size all the data options to
> use a DAX compatible topology if blkid supports reporting the DAX
> topology. It would also do things like turn off reflink (until that
> is supported w/ DAX), etc.

I do like the idea of adding an explicit dax option, but I'm not sure
what the right policy would be:

1. -d dax option specified, set dax-compatible parameters irrespective
   of dax capability (blkid detection not strictly required)

2. -d dax option specified, set dax-compatible parameters only if
   blockdev is dax capable; fallback to default params otherwise

3. autodetect dax capability and automatically set dax-compatible
   params, irrespective of if the option is specified or not
   (potentially also needs an explicit override option e.g. -d dax=0).

I'd be inclined to go with the second option which should lead to the
least amount of surprises for users.

Still, this doesn't address the exact thing I was trying to do (see
below).

> i.e. if the user knows they are going to use DAX (and they will)
> then they can tell mkfs to make a DAX compatible filesystem.

So I've been trying to prevent cases where users create a filesystem
with the default params, go on to populate it, and at some later point
of time find themselves wanting to mount it with dax only to realize
that this is not possible without mkfs (most commonly due to the
mismatch of the 64kB page size on ppc64). We could potentially just
issue the warning and not force mkfs to bail out, but I'm afraid that
warnings aren't very discernible and are easily missed. I do agree
though that requiring an override may not be the best model here.

Would it make sense to simply emit the warnings and drop the bail-out
and override logic altogether?

Alternatively the third option above (autodetection), would take care of
those cases at the expense of overriding otherwise potentially desirable
options (e.g. switching off reflink), which will come as a surprise to
users that don't intent to use dax. I don't think this would be a good
default policy.

> > The next patch adds a new cli option that can be used to override
> > warnings, and converts all warnings that can be forced to this option
> > instead the overloaded -f option. This avoids cases where forcing a
> > warning may also be implicitly forcing overwriting an existing
> > partition.
>
> I don't want both an "ignore warnings" and a "force" CLI option.
> They both do the same thing - allow the user to override things that
> are potentially destructive or result in an unusable config - so why
> should we add the complexity of having a different "force" options
> for every different possible thing that can be overridden?

The rationale here was to only make a distinction between destructive
and (conditionally) unusable, otherwise we would indeed need an override
toggle per warning - which I totally agree is an overkill. If implicitly
suppressing the destructive operation confirmation isn't a concern, then
I'd definitely drop this patch.

Thanks for the feedback!

Anthony
Eric Sandeen Aug. 25, 2020, 1:59 p.m. UTC | #3
On 8/24/20 5:55 PM, Dave Chinner wrote:
> I agree that mkfs needs to be aware of DAX capability of the block
> device, but that capability existing should not cause mkfs to fail.
> If we want users to be able to direct mkfs to to create a DAX
> capable filesystem then adding a -d dax option would be a better
> idea. This would direct mkfs to align/size all the data options to
> use a DAX compatible topology if blkid supports reporting the DAX
> topology. It would also do things like turn off reflink (until that
> is supported w/ DAX), etc.
> 
> i.e. if the user knows they are going to use DAX (and they will)
> then they can tell mkfs to make a DAX compatible filesystem.

FWIW, Darrick /just/ added a -d daxinherit option, though all it does
now is set the inheritable dax flag on the root dir, it doesn't enforce
things like page vs block size, etc.

That change is currently staged in my local tree.

I suppose we could condition that on other requirements, although we've
always had the ability to mkfs a filesystem that can't necessarily be
used on the current machine - i.e. you can make a 64k block size filesystem
on a 4k page machine, etc.  So I'm not sure we want to tie mkfs abilities
to the current mkfs environment....

Still, I wonder if I should hold off on "-d daxinherit" patch until we
have thought through things like reflink conflicts, for now.

(though again, mkfs is "perfectly capapable" of making a consistent
reflink+dax filesystem, it's just that no kernel can mount it today...)

-Eric
Darrick J. Wong Aug. 25, 2020, 2:40 p.m. UTC | #4
On Tue, Aug 25, 2020 at 08:59:39AM -0500, Eric Sandeen wrote:
> On 8/24/20 5:55 PM, Dave Chinner wrote:
> > I agree that mkfs needs to be aware of DAX capability of the block
> > device, but that capability existing should not cause mkfs to fail.
> > If we want users to be able to direct mkfs to to create a DAX
> > capable filesystem then adding a -d dax option would be a better
> > idea. This would direct mkfs to align/size all the data options to
> > use a DAX compatible topology if blkid supports reporting the DAX
> > topology. It would also do things like turn off reflink (until that
> > is supported w/ DAX), etc.
> > 
> > i.e. if the user knows they are going to use DAX (and they will)
> > then they can tell mkfs to make a DAX compatible filesystem.
> 
> FWIW, Darrick /just/ added a -d daxinherit option, though all it does
> now is set the inheritable dax flag on the root dir, it doesn't enforce
> things like page vs block size, etc.
> 
> That change is currently staged in my local tree.
> 
> I suppose we could condition that on other requirements, although we've
> always had the ability to mkfs a filesystem that can't necessarily be
> used on the current machine - i.e. you can make a 64k block size filesystem
> on a 4k page machine, etc.  So I'm not sure we want to tie mkfs abilities
> to the current mkfs environment....
> 
> Still, I wonder if I should hold off on "-d daxinherit" patch until we
> have thought through things like reflink conflicts, for now.
> 
> (though again, mkfs is "perfectly capapable" of making a consistent
> reflink+dax filesystem, it's just that no kernel can mount it today...)

No, please don't layer additional meanings onto daxinherit=1.

I actually /do/ want to have a -d dax=1 option for "set up this
filesystem for DAX" that will configure the geometry for that device to
play nicely with the things that (some) DAX users want.

IOWs, you say "-d dax=1" and that means that mkfs sniffs out the
DAXiness of the underlying device and the PMD size.  Then it turns off
reflink by default, sets the daxinherit=1 hint, and configures the
extent size and su/sw hints to match the PMD size.

Or, you say "-r dax=1" for the realtime device, and now it sets the
allocation unit to the PMD size for people running huge databases and
want only huge pages to back their table data<cough>.

Zooming out a bit, maybe we should instead introduce a new "tuning"
parameter for -d and -r so that administrators could tune the filesystem
for specific purposes:

	-d tune=dax: Reject if device not dax, set daxinherit=1, set
	extsize/su/sw to match PMD

	-d tune=ssd: Set agcount to match the number of CPUs if
	possible, make the log larger to support a large number of
	threads and iops.

	-d tune=rotational: Probably does nothing. ;)

	-d tune=auto: Query blkid to guess which of the above three
	profiles we should use.

	-d tune=none: No tuning.

And then you'd do the same for the realtime device.

This would help us get rid of the seeeekret mkfs wrapper that we use to
make it easier for our internal customers to use DAX since mkfs.xfs
doesn't support config files.

--D

> -Eric
Anthony Iliopoulos Aug. 25, 2020, 3:09 p.m. UTC | #5
On Tue, Aug 25, 2020 at 08:59:39AM -0500, Eric Sandeen wrote:
> On 8/24/20 5:55 PM, Dave Chinner wrote:
> > I agree that mkfs needs to be aware of DAX capability of the block
> > device, but that capability existing should not cause mkfs to fail.
> > If we want users to be able to direct mkfs to to create a DAX
> > capable filesystem then adding a -d dax option would be a better
> > idea. This would direct mkfs to align/size all the data options to
> > use a DAX compatible topology if blkid supports reporting the DAX
> > topology. It would also do things like turn off reflink (until that
> > is supported w/ DAX), etc.
> > 
> > i.e. if the user knows they are going to use DAX (and they will)
> > then they can tell mkfs to make a DAX compatible filesystem.
> 
> FWIW, Darrick /just/ added a -d daxinherit option, though all it does
> now is set the inheritable dax flag on the root dir, it doesn't enforce
> things like page vs block size, etc.

I am aware of that patch, but I considered the option to be somewhat
orthogonal, given that FS_XFLAG_DAX can be set (and inherited)
irrespective of dax support in the block device (and overridden via
mount opts if need be), so I didn't want to overload daxinherit.

> That change is currently staged in my local tree.
> 
> I suppose we could condition that on other requirements, although we've
> always had the ability to mkfs a filesystem that can't necessarily be
> used on the current machine - i.e. you can make a 64k block size filesystem
> on a 4k page machine, etc.  So I'm not sure we want to tie mkfs abilities
> to the current mkfs environment....

Agreed, so I suppose any dax option should be an opt-in, e.g. similar to
the -d dax=1 proposal. That won't prevent users from neglecting it and
creating a fs which will be later incompatible with -o dax, but that's a
different story I guess..

- Anthony
Eric Sandeen Aug. 25, 2020, 5:32 p.m. UTC | #6
On 8/25/20 10:09 AM, Anthony Iliopoulos wrote:
> On Tue, Aug 25, 2020 at 08:59:39AM -0500, Eric Sandeen wrote:
>> On 8/24/20 5:55 PM, Dave Chinner wrote:
>>> I agree that mkfs needs to be aware of DAX capability of the block
>>> device, but that capability existing should not cause mkfs to fail.
>>> If we want users to be able to direct mkfs to to create a DAX
>>> capable filesystem then adding a -d dax option would be a better
>>> idea. This would direct mkfs to align/size all the data options to
>>> use a DAX compatible topology if blkid supports reporting the DAX
>>> topology. It would also do things like turn off reflink (until that
>>> is supported w/ DAX), etc.
>>>
>>> i.e. if the user knows they are going to use DAX (and they will)
>>> then they can tell mkfs to make a DAX compatible filesystem.
>>
>> FWIW, Darrick /just/ added a -d daxinherit option, though all it does
>> now is set the inheritable dax flag on the root dir, it doesn't enforce
>> things like page vs block size, etc.
> 
> I am aware of that patch, but I considered the option to be somewhat
> orthogonal, given that FS_XFLAG_DAX can be set (and inherited)
> irrespective of dax support in the block device (and overridden via
> mount opts if need be), so I didn't want to overload daxinherit.

OK fair enough.

>> That change is currently staged in my local tree.
>>
>> I suppose we could condition that on other requirements, although we've
>> always had the ability to mkfs a filesystem that can't necessarily be
>> used on the current machine - i.e. you can make a 64k block size filesystem
>> on a 4k page machine, etc.  So I'm not sure we want to tie mkfs abilities
>> to the current mkfs environment....
> 
> Agreed, so I suppose any dax option should be an opt-in, e.g. similar to
> the -d dax=1 proposal. That won't prevent users from neglecting it and
> creating a fs which will be later incompatible with -o dax, but that's a
> different story I guess..

I guess my overarching desire here is to not try to predict too much, or to
make predictions that won't stand the test of time. Inferring what the admin
wants can be tricky.  :)  

I also want to be consistent; i.e. today we can mkfs a 64k block filesystem
on a 4k block machine, and no warning is emitted. If we mkfs a V5/CRC filesystem
or a reflink filesystem on an old kernel that doesn't support it, no warning is
emitted.

If we're going to warn every time when "this filesystem can't be used under the
currently running kernel" then there are quite a lot of cases to handle... I
don't want to warn about some things and not others, so we need to decide if
this is actually something we want to do in general, not just for dax.  I'm 
somewhat disinclined, TBH, and would rather rely on a mount failure to alert
the admin to the problem.  (It's not like there are a lot of resources invested
between mkfs & mount).

I'll reread this thread & Dave's responses and give this some more thought.

Thanks,
-Eric

> - Anthony
>
Dave Chinner Aug. 25, 2020, 10:31 p.m. UTC | #7
On Tue, Aug 25, 2020 at 07:40:15AM -0700, Darrick J. Wong wrote:
> Zooming out a bit, maybe we should instead introduce a new "tuning"
> parameter for -d and -r so that administrators could tune the filesystem
> for specific purposes:
> 
> 	-d tune=dax: Reject if device not dax, set daxinherit=1, set
> 	extsize/su/sw to match PMD
> 
> 	-d tune=ssd: Set agcount to match the number of CPUs if
> 	possible, make the log larger to support a large number of
> 	threads and iops.
> 
> 	-d tune=rotational: Probably does nothing. ;)
> 
> 	-d tune=auto: Query blkid to guess which of the above three
> 	profiles we should use.
> 
> 	-d tune=none: No tuning.
> 
> And then you'd do the same for the realtime device.

Please, no.

The problem with this is that a specific "tune" will need to vary
over time (e.g. when reflink is supported with DAX) and so now we
back to the same situation where the definition of "tune=dax"
changes depending on what version of mkfs you use. Hence you can
still make a filesystem that the kernel won't mount because you have
a xfsprogs that supports DAX+reflink and a kernel that doesn't.

I just don't see this as a viable way to produce filesystems that
work for specific situations because it doesn't solve the kernel vs
xfsprogs version support issue that requires tweaking mkfs
parameters manually to avoid...

> This would help us get rid of the seeeekret mkfs wrapper that we use to
> make it easier for our internal customers to use DAX since mkfs.xfs
> doesn't support config files.

Let's fix that, then. I've written a bunch of stuff in the past
couple of years that uses basic ini config files via a simple
library and it just works. If people are happy with ini format
config files via a library, then I'll just go do that, eh?

Cheers,

Dave.