diff mbox series

[RFC] xfs: completely disable toggling DAX flag via ioctl on reg files

Message ID 405e182f-0c24-81fe-9d04-1031a8bbac17@redhat.com (mailing list archive)
State New, archived
Headers show
Series [RFC] xfs: completely disable toggling DAX flag via ioctl on reg files | expand

Commit Message

Eric Sandeen July 25, 2018, 9:20 p.m. UTC
742d842 xfs: disable per-inode DAX flag was, I think, intended
as a short-term workaround to avoid races when toggling DAX on
and off of active inodes until mm/ sorted that out.

(It's also a confusing title, as it didn't really disable
per-inode DAX at all.)

However, it has the surprising (to me, at least) result that while
the ioctl succeeds, no behavior changes until the inode is cycled
out of cache and re-read from disk at some unknown later time.
This seems to badly violate the principle of least surprise.

Until said races are properly resolved, it seems most prudent to
disallow modification of the flag on regular files altogether.
We can still allow per-inode DAX flagging via directory inheritance.

Since DAX is still flagged as experimental (in part due to these
concerns) I don't think it's a problem to (temporarily?) break
this interface further.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---



--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Brian Foster July 26, 2018, 12:08 p.m. UTC | #1
On Wed, Jul 25, 2018 at 02:20:54PM -0700, Eric Sandeen wrote:
> 742d842 xfs: disable per-inode DAX flag was, I think, intended
> as a short-term workaround to avoid races when toggling DAX on
> and off of active inodes until mm/ sorted that out.
> 
> (It's also a confusing title, as it didn't really disable
> per-inode DAX at all.)
> 
> However, it has the surprising (to me, at least) result that while
> the ioctl succeeds, no behavior changes until the inode is cycled
> out of cache and re-read from disk at some unknown later time.
> This seems to badly violate the principle of least surprise.
> 
> Until said races are properly resolved, it seems most prudent to
> disallow modification of the flag on regular files altogether.
> We can still allow per-inode DAX flagging via directory inheritance.
> 
> Since DAX is still flagged as experimental (in part due to these
> concerns) I don't think it's a problem to (temporarily?) break
> this interface further.
> 
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---

I'm not in tune with the latest state of dax, but if the situation is
that we don't currently have a means to correctly switch the per-inode
state for an active inode (and thus have simply skipped changing the
online flag while carrying on with the on-disk flag, leading to this
inode cache cycling requirement), then I think this makes sense. The
current interface is essentially incomplete, I don't see any reason to
allow unless/until it actually works sanely.

BTW, what bits are actually missing to make that happen? Why is the
flush/inval currently in this function not sufficient?

Implementation wise, I'm a little curious why we're piling on hacks
(such as the return short-circuit and the previous #if 0) as opposed to
just removing the code. The code can always be restored directly from
the git history, right?

Brian

> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 0ef5ece5634c..94e9e25883cc 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1102,6 +1102,16 @@ xfs_ioctl_setattr_dax_invalidate(
>  
>  	if (S_ISDIR(inode->i_mode))
>  		return 0;
> +	/*
> +	 * For now, don't allow changing DAX mode on files via ioctl at all.
> +	 * See 742d842 xfs: disable per-inode DAX flag, which only disabled
> +	 * switching it on "live" inodes, but still set it on disk for the
> +	 * next read - which leads to changed behavior only after cycling
> +	 * out of cache.  Eliminate this surprising behavior.
> +	 * We do still allow setting it as an inheritance flag on directories
> +	 * (above) which will then affect newly-created files in the dir.
> +	 */
> +	return -EINVAL;
>  
>  	/* lock, flush and invalidate mapping in preparation for flag change */
>  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> @@ -1359,6 +1369,9 @@ xfs_ioctl_setattr(
>  	 * or cancel time, so need to be passed through to
>  	 * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
>  	 * appropriately.
> +	 *
> +	 * Note, changing DAX flags on regular files via ioctl is currently
> +	 * disallowed by this function, see comments within.
>  	 */
>  	code = xfs_ioctl_setattr_dax_invalidate(ip, fa, &join_flags);
>  	if (code)
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen July 26, 2018, 1:23 p.m. UTC | #2
On 7/26/18 5:08 AM, Brian Foster wrote:
> On Wed, Jul 25, 2018 at 02:20:54PM -0700, Eric Sandeen wrote:
>> 742d842 xfs: disable per-inode DAX flag was, I think, intended
>> as a short-term workaround to avoid races when toggling DAX on
>> and off of active inodes until mm/ sorted that out.
>>
>> (It's also a confusing title, as it didn't really disable
>> per-inode DAX at all.)
>>
>> However, it has the surprising (to me, at least) result that while
>> the ioctl succeeds, no behavior changes until the inode is cycled
>> out of cache and re-read from disk at some unknown later time.
>> This seems to badly violate the principle of least surprise.
>>
>> Until said races are properly resolved, it seems most prudent to
>> disallow modification of the flag on regular files altogether.
>> We can still allow per-inode DAX flagging via directory inheritance.
>>
>> Since DAX is still flagged as experimental (in part due to these
>> concerns) I don't think it's a problem to (temporarily?) break
>> this interface further.
>>
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> ---
> 
> I'm not in tune with the latest state of dax, but if the situation is
> that we don't currently have a means to correctly switch the per-inode
> state for an active inode (and thus have simply skipped changing the
> online flag while carrying on with the on-disk flag, leading to this
> inode cache cycling requirement), then I think this makes sense. The
> current interface is essentially incomplete, I don't see any reason to
> allow unless/until it actually works sanely.
> 
> BTW, what bits are actually missing to make that happen? Why is the
> flush/inval currently in this function not sufficient?

TBH I don't actually know the low-level details. :/

> Implementation wise, I'm a little curious why we're piling on hacks
> (such as the return short-circuit and the previous #if 0) as opposed to
> just removing the code. The code can always be restored directly from
> the git history, right?

Well, fair point.  If this goes in it should probably at least remove
the other #if 0 so it's not stacking hacks on hacks, but if the issue
is in mm/ it might be a big ask for anyone eventually working on that
to dig supporting code back out of xfs git history ...

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster July 26, 2018, 2:15 p.m. UTC | #3
On Thu, Jul 26, 2018 at 06:23:58AM -0700, Eric Sandeen wrote:
> On 7/26/18 5:08 AM, Brian Foster wrote:
> > On Wed, Jul 25, 2018 at 02:20:54PM -0700, Eric Sandeen wrote:
> >> 742d842 xfs: disable per-inode DAX flag was, I think, intended
> >> as a short-term workaround to avoid races when toggling DAX on
> >> and off of active inodes until mm/ sorted that out.
> >>
> >> (It's also a confusing title, as it didn't really disable
> >> per-inode DAX at all.)
> >>
> >> However, it has the surprising (to me, at least) result that while
> >> the ioctl succeeds, no behavior changes until the inode is cycled
> >> out of cache and re-read from disk at some unknown later time.
> >> This seems to badly violate the principle of least surprise.
> >>
> >> Until said races are properly resolved, it seems most prudent to
> >> disallow modification of the flag on regular files altogether.
> >> We can still allow per-inode DAX flagging via directory inheritance.
> >>
> >> Since DAX is still flagged as experimental (in part due to these
> >> concerns) I don't think it's a problem to (temporarily?) break
> >> this interface further.
> >>
> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> >> ---
> > 
> > I'm not in tune with the latest state of dax, but if the situation is
> > that we don't currently have a means to correctly switch the per-inode
> > state for an active inode (and thus have simply skipped changing the
> > online flag while carrying on with the on-disk flag, leading to this
> > inode cache cycling requirement), then I think this makes sense. The
> > current interface is essentially incomplete, I don't see any reason to
> > allow unless/until it actually works sanely.
> > 
> > BTW, what bits are actually missing to make that happen? Why is the
> > flush/inval currently in this function not sufficient?
> 
> TBH I don't actually know the low-level details. :/
> 

Ok, it would be nice if somebody in the know could chime in on that. ;)

> > Implementation wise, I'm a little curious why we're piling on hacks
> > (such as the return short-circuit and the previous #if 0) as opposed to
> > just removing the code. The code can always be restored directly from
> > the git history, right?
> 
> Well, fair point.  If this goes in it should probably at least remove
> the other #if 0 so it's not stacking hacks on hacks, but if the issue
> is in mm/ it might be a big ask for anyone eventually working on that
> to dig supporting code back out of xfs git history ...
> 

If that's the case, shouldn't it essentially just be a revert of this
patch that rips the code out? Even if the surrounding code changes in
the meantime, we should be able to add it back fairly easily. If that's
not the case, then it sounds like that dead code would need to change
anyways. 

Brian

> -Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner July 26, 2018, 11:17 p.m. UTC | #4
On Thu, Jul 26, 2018 at 06:23:58AM -0700, Eric Sandeen wrote:
> On 7/26/18 5:08 AM, Brian Foster wrote:
> > On Wed, Jul 25, 2018 at 02:20:54PM -0700, Eric Sandeen wrote:
> >> 742d842 xfs: disable per-inode DAX flag was, I think, intended
> >> as a short-term workaround to avoid races when toggling DAX on
> >> and off of active inodes until mm/ sorted that out.
> >>
> >> (It's also a confusing title, as it didn't really disable
> >> per-inode DAX at all.)
> >>
> >> However, it has the surprising (to me, at least) result that while
> >> the ioctl succeeds, no behavior changes until the inode is cycled
> >> out of cache and re-read from disk at some unknown later time.
> >> This seems to badly violate the principle of least surprise.
> >>
> >> Until said races are properly resolved, it seems most prudent to
> >> disallow modification of the flag on regular files altogether.
> >> We can still allow per-inode DAX flagging via directory inheritance.
> >>
> >> Since DAX is still flagged as experimental (in part due to these
> >> concerns) I don't think it's a problem to (temporarily?) break
> >> this interface further.
> >>
> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> >> ---
> > 
> > I'm not in tune with the latest state of dax, but if the situation is
> > that we don't currently have a means to correctly switch the per-inode
> > state for an active inode (and thus have simply skipped changing the
> > online flag while carrying on with the on-disk flag, leading to this
> > inode cache cycling requirement), then I think this makes sense. The
> > current interface is essentially incomplete, I don't see any reason to
> > allow unless/until it actually works sanely.
> > 
> > BTW, what bits are actually missing to make that happen? Why is the
> > flush/inval currently in this function not sufficient?
> 
> TBH I don't actually know the low-level details. :/

page faults aren't synchronised with filesystem locks, so we can
change the aops callout behaviour half way through a page fault.
i.e. the first half of the page fault sees the S_DAX flag and does
prep work based on that, the second half of the page fault doesn't
see the S_DAX flag and assumes it's working on a page cache page
that doesn't exist and things go bang...

As it is, I don't think we can remove this now - people are using
the on-disk flags already, and the inherit flag from the directory
has none of the problems of changing S_DAX dynamically. Hence just
disabling it is the wrong thing to do because it removes the ability
for people to manage the flags that are already on disk....

I'd much prefer we fix the page fault synchronisation problem than
break stuff that /isn't actually broken/. Yes, it's current
behaviour is suboptimal, but that is only supposed to be /temporary/
until the aops callout problem is fixed.

Cheers,

Dave.
Eric Sandeen July 27, 2018, 1:20 a.m. UTC | #5
On 7/26/18 4:17 PM, Dave Chinner wrote:

...

> As it is, I don't think we can remove this now - people are using
> the on-disk flags already, and the inherit flag from the directory
> has none of the problems of changing S_DAX dynamically.

Which is why I left the inheritance part ...

> Hence just
> disabling it is the wrong thing to do because it removes the ability
> for people to manage the flags that are already on disk....

Well, it's EXPERIMENTAL... o_O

But yeah, this does remove the ability to turn off the flag, at least
w/o making a file copy or some other gross thing.

> I'd much prefer we fix the page fault synchronisation problem than
> break stuff that /isn't actually broken/. Yes, it's current
> behaviour is suboptimal, but that is only supposed to be /temporary/
> until the aops callout problem is fixed.

Well, it's super confusing and from the user POV more or less
nondeterministic, at this point.  That borders on broken/unusable IMHO.
"Change the flag and some time later, not really within your control, the
behavior will change."

It's been broken for a year.  TBH I'm not sure I personally have
the knowledge/skill (and I almost certainly don't have the time)
to fix it.  Just trying to make the best of a sticky situation.
This seemed better to me, modulo the inability to remove the
flag.  :/  And I figured more permissive flag manipulation could
be added back later if it ever does get fixed.

Thanks,
-Eric
 
> Cheers,
> 
> Dave.
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster July 27, 2018, 6:51 p.m. UTC | #6
On Thu, Jul 26, 2018 at 06:20:01PM -0700, Eric Sandeen wrote:
> On 7/26/18 4:17 PM, Dave Chinner wrote:
> 
> ...
> 
> > As it is, I don't think we can remove this now - people are using
> > the on-disk flags already, and the inherit flag from the directory
> > has none of the problems of changing S_DAX dynamically.
> 
> Which is why I left the inheritance part ...
> 
> > Hence just
> > disabling it is the wrong thing to do because it removes the ability
> > for people to manage the flags that are already on disk....
> 
> Well, it's EXPERIMENTAL... o_O
> 
> But yeah, this does remove the ability to turn off the flag, at least
> w/o making a file copy or some other gross thing.
> 

You could retain the ability to change the flag on existing files when
-odax is used, right? IIUC, that means the runtime change is essentially
a no-op and so the state change problem doesn't exist.

That's obviously tedious, but is there any other truly predictable way
to toggle the behavior change other than a mount cycle anyways?

> > I'd much prefer we fix the page fault synchronisation problem than
> > break stuff that /isn't actually broken/. Yes, it's current
> > behaviour is suboptimal, but that is only supposed to be /temporary/
> > until the aops callout problem is fixed.
> 

I agree, but it sounds like nobody has a solution for that (let alone a
patch). :/

> Well, it's super confusing and from the user POV more or less
> nondeterministic, at this point.  That borders on broken/unusable IMHO.
> "Change the flag and some time later, not really within your control, the
> behavior will change."
> 

Perhaps broken is strong if whatever crash problem existed was worked
around by Christoph's change. I agree that this is certainly not usable.
You can literally create a new/empty file, set the DAX flag, start
writing data and see the exact opposite behavior from what is expected.
I'm not sure suboptimal really describes that either. ;P

I guess the more important question is does anything beyond this inode
flag switching problem hold DAX+XFS in experimental status? The purpose
of this rfc is to essentially probe whether we can lift some portion of
dax out of experimental status, right? If there are other concerns that
would prevent that, then perhaps this is moot until that changes.

Brian

> It's been broken for a year.  TBH I'm not sure I personally have
> the knowledge/skill (and I almost certainly don't have the time)
> to fix it.  Just trying to make the best of a sticky situation.
> This seemed better to me, modulo the inability to remove the
> flag.  :/  And I figured more permissive flag manipulation could
> be added back later if it ever does get fixed.
> 
> Thanks,
> -Eric
>  
> > Cheers,
> > 
> > Dave.
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Darrick J. Wong July 30, 2018, 4:09 p.m. UTC | #7
On Fri, Jul 27, 2018 at 02:51:25PM -0400, Brian Foster wrote:
> On Thu, Jul 26, 2018 at 06:20:01PM -0700, Eric Sandeen wrote:
> > On 7/26/18 4:17 PM, Dave Chinner wrote:
> > 
> > ...
> > 
> > > As it is, I don't think we can remove this now - people are using
> > > the on-disk flags already, and the inherit flag from the directory
> > > has none of the problems of changing S_DAX dynamically.
> > 
> > Which is why I left the inheritance part ...
> > 
> > > Hence just
> > > disabling it is the wrong thing to do because it removes the ability
> > > for people to manage the flags that are already on disk....
> > 
> > Well, it's EXPERIMENTAL... o_O
> > 
> > But yeah, this does remove the ability to turn off the flag, at least
> > w/o making a file copy or some other gross thing.
> > 
> 
> You could retain the ability to change the flag on existing files when
> -odax is used, right? IIUC, that means the runtime change is essentially
> a no-op and so the state change problem doesn't exist.
> 
> That's obviously tedious, but is there any other truly predictable way
> to toggle the behavior change other than a mount cycle anyways?
> 
> > > I'd much prefer we fix the page fault synchronisation problem than
> > > break stuff that /isn't actually broken/. Yes, it's current
> > > behaviour is suboptimal, but that is only supposed to be /temporary/
> > > until the aops callout problem is fixed.

I thought Jan Kara proposed checking the S_DAX state in the DAX/non-DAX
code paths and bailing out with some sort of -EAGAIN type status (or
whatever the vm_fault_t equivalent of that is) if it's inconsistent,
though the same discussion also proposed restricting DAX flag changes to
directories and files with zero size, zero i_blocks, and zero
i_delayed_blks.

> > 
> 
> I agree, but it sounds like nobody has a solution for that (let alone a
> patch). :/
> 
> > Well, it's super confusing and from the user POV more or less
> > nondeterministic, at this point.  That borders on broken/unusable IMHO.
> > "Change the flag and some time later, not really within your control, the
> > behavior will change."
> > 
> 
> Perhaps broken is strong if whatever crash problem existed was worked
> around by Christoph's change. I agree that this is certainly not usable.

He didn't work around it so much as disabled the in-core state changes
that were supposed to accompany the on-disk state changes.

> You can literally create a new/empty file, set the DAX flag, start
> writing data and see the exact opposite behavior from what is expected.
> I'm not sure suboptimal really describes that either. ;P
> 
> I guess the more important question is does anything beyond this inode
> flag switching problem hold DAX+XFS in experimental status?

In order of increasing complexity (as perceived by me):

Lack of reflink support, maintainer's continuing unease about how all
this interacts with the memory manager / "we can't quite fix this right
now", unresolved debate over what role inode flags have vs. kernel
autoconfiguration...

> The purpose of this rfc is to essentially probe whether we can lift
> some portion of dax out of experimental status, right? If there are
> other concerns that would prevent that, then perhaps this is moot
> until that changes.

I thought the point of the RFC was to turn off the parts of the
interface that have imprecisely defined outcomes until a later time when
we can deliver a precisely defined outcome.

--D

> Brian
> 
> > It's been broken for a year.  TBH I'm not sure I personally have
> > the knowledge/skill (and I almost certainly don't have the time)
> > to fix it.  Just trying to make the best of a sticky situation.
> > This seemed better to me, modulo the inability to remove the
> > flag.  :/  And I figured more permissive flag manipulation could
> > be added back later if it ever does get fixed.
> > 
> > Thanks,
> > -Eric
> >  
> > > Cheers,
> > > 
> > > Dave.
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox series

Patch

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0ef5ece5634c..94e9e25883cc 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1102,6 +1102,16 @@  xfs_ioctl_setattr_dax_invalidate(
 
 	if (S_ISDIR(inode->i_mode))
 		return 0;
+	/*
+	 * For now, don't allow changing DAX mode on files via ioctl at all.
+	 * See 742d842 xfs: disable per-inode DAX flag, which only disabled
+	 * switching it on "live" inodes, but still set it on disk for the
+	 * next read - which leads to changed behavior only after cycling
+	 * out of cache.  Eliminate this surprising behavior.
+	 * We do still allow setting it as an inheritance flag on directories
+	 * (above) which will then affect newly-created files in the dir.
+	 */
+	return -EINVAL;
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
@@ -1359,6 +1369,9 @@  xfs_ioctl_setattr(
 	 * or cancel time, so need to be passed through to
 	 * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
 	 * appropriately.
+	 *
+	 * Note, changing DAX flags on regular files via ioctl is currently
+	 * disallowed by this function, see comments within.
 	 */
 	code = xfs_ioctl_setattr_dax_invalidate(ip, fa, &join_flags);
 	if (code)