diff mbox series

[14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque

Message ID 20240227174649.GL6184@frogsfrogsfrogs (mailing list archive)
State New
Headers show
Series None | expand

Commit Message

Darrick J. Wong Feb. 27, 2024, 5:46 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

To head off bikeshedding about the fields in xfs_commit_range, let's
make it an opaque u64 array and require the userspace program to call
a third ioctl to sample the freshness data for us.  If we ever converge
on a definition for i_version then we can use that; for now we'll just
use mtime/ctime like the old swapext ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   13 +++--------
 fs/xfs/xfs_exchrange.c |   15 ++++++++++++
 fs/xfs/xfs_exchrange.h |    1 +
 fs/xfs/xfs_ioctl.c     |   58 +++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 72 insertions(+), 15 deletions(-)

Comments

Amir Goldstein Feb. 27, 2024, 6:52 p.m. UTC | #1
On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> To head off bikeshedding about the fields in xfs_commit_range, let's
> make it an opaque u64 array and require the userspace program to call
> a third ioctl to sample the freshness data for us.  If we ever converge
> on a definition for i_version then we can use that; for now we'll just
> use mtime/ctime like the old swapext ioctl.

This addresses my concerns about using mtime/ctime.

I have to say, Darrick, that I think that referring to this concern as
bikeshedding is not being honest.

I do hate nit picking reviews and I do hate "maybe also fix the world"
review comments, but I think the question about using mtime/ctime in
this new API was not out of place and I think that making the freshness
data opaque is better for everyone in the long run and hopefully, this will
help you move to the things you care about faster.

Thanks,
Amir.
Darrick J. Wong Feb. 29, 2024, 11:27 p.m. UTC | #2
On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > To head off bikeshedding about the fields in xfs_commit_range, let's
> > make it an opaque u64 array and require the userspace program to call
> > a third ioctl to sample the freshness data for us.  If we ever converge
> > on a definition for i_version then we can use that; for now we'll just
> > use mtime/ctime like the old swapext ioctl.
> 
> This addresses my concerns about using mtime/ctime.

Oh good! :)

> I have to say, Darrick, that I think that referring to this concern as
> bikeshedding is not being honest.
> 
> I do hate nit picking reviews and I do hate "maybe also fix the world"
> review comments, but I think the question about using mtime/ctime in
> this new API was not out of place

I agree, your question about mtime/ctime:

"Maybe a stupid question, but under which circumstances would mtime
change and ctime not change? Why are both needed?"

was a very good question.  But perhaps that statement referred to the
other part of that thread.

>                                   and I think that making the freshness
> data opaque is better for everyone in the long run and hopefully, this will
> help you move to the things you care about faster.

I wish you'd suggested an opaque blob that the fs can lay out however it
wants instead of suggesting specifically the change cookie.  I'm very
much ok with an opaque freshness blob that allows future flexibility in
how we define the blob's contents.

I was however very upset about the Jeff's suggestion of using i_version.
I apologize for using all caps in that reply, and snarling about it in
the commit message here.  The final version of this patch will not have
that.

That said, I don't think it is at all helpful to suggest using a file
attribute whose behavior is as yet unresolved.  Multigrain timestamps
were a clever idea, regrettably reverted.  As far as I could tell when I
wrote my reply, neither had NFS implemented a better behavior and
quietly merged it; nor have Jeff and Dave produced any sort of candidate
patchset to fix all the resulting issues in XFS.

Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
internal" made me think "OH $deity, they wants me to do that work
too???"

A better way to have woreded that might've been "How about switching
this to a fs-determined structure so that we can switch the freshness
check to i_version when that's fully working on XFS?"

The problem I have with reading patch review emails is that I can't
easily tell whether an author's suggestion is being made in a casual
offhand manner?  Or if it reflects something they feel strongly needs
change before merging.

In fairness to you, Amir, I don't know how much you've kept on top of
that i_version vs. XFS discussion.  So I have no idea if you were aware
of the status of that work.

--D

> Thanks,
> Amir.
>
Amir Goldstein March 1, 2024, 1 p.m. UTC | #3
On Fri, Mar 1, 2024 at 1:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > make it an opaque u64 array and require the userspace program to call
> > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > on a definition for i_version then we can use that; for now we'll just
> > > use mtime/ctime like the old swapext ioctl.
> >
> > This addresses my concerns about using mtime/ctime.
>
> Oh good! :)
>
> > I have to say, Darrick, that I think that referring to this concern as
> > bikeshedding is not being honest.
> >
> > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > review comments, but I think the question about using mtime/ctime in
> > this new API was not out of place
>
> I agree, your question about mtime/ctime:
>
> "Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?"
>
> was a very good question.  But perhaps that statement referred to the
> other part of that thread.
>
> >                                   and I think that making the freshness
> > data opaque is better for everyone in the long run and hopefully, this will
> > help you move to the things you care about faster.
>
> I wish you'd suggested an opaque blob that the fs can lay out however it
> wants instead of suggesting specifically the change cookie.  I'm very
> much ok with an opaque freshness blob that allows future flexibility in
> how we define the blob's contents.
>

I wish I had thought of it myself - it is a good idea - just did not
occur to me.
Using the language of i_changecounter, that is "the current xfs implementation
of i_version", I still think that using it as the content of the
opaque freshness blob
makes more sense than mtime+ctime, but it is none of my concern what
you decide to fill in the freshness blob for the first version.

I was not aware of the way xfs_fsr is currently using mtime+ctime when
I replied and I am not sure if and how it is relevant to the new API.

> I was however very upset about the Jeff's suggestion of using i_version.
> I apologize for using all caps in that reply, and snarling about it in
> the commit message here.  The final version of this patch will not have
> that.
>
> That said, I don't think it is at all helpful to suggest using a file
> attribute whose behavior is as yet unresolved.  Multigrain timestamps
> were a clever idea, regrettably reverted.  As far as I could tell when I
> wrote my reply, neither had NFS implemented a better behavior and
> quietly merged it; nor have Jeff and Dave produced any sort of candidate
> patchset to fix all the resulting issues in XFS.
>
> Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> internal" made me think "OH $deity, they wants me to do that work
> too???"
>
> A better way to have woreded that might've been "How about switching
> this to a fs-determined structure so that we can switch the freshness
> check to i_version when that's fully working on XFS?"
>

Yeh, I should have chosen my words more carefully.
I was perfectly aware of your lack of interest in doing extra work
and wasn't trying to request any.

> The problem I have with reading patch review emails is that I can't
> easily tell whether an author's suggestion is being made in a casual
> offhand manner?  Or if it reflects something they feel strongly needs
> change before merging.
>

Can't speak for everyone else, but coming from the middle east,
I have fewer politeness filters.
When I write "wouldn't it be better to use change_cookie?"
I am just asking that question.

When I am asking something to be changed before merge,
I try to be much more explicit about it and this is what I expect
others to do when reviewing my patches.

Thanks,
Amir.
Jeffrey Layton March 1, 2024, 1:31 p.m. UTC | #4
On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > 
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > make it an opaque u64 array and require the userspace program to call
> > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > on a definition for i_version then we can use that; for now we'll just
> > > use mtime/ctime like the old swapext ioctl.
> > 
> > This addresses my concerns about using mtime/ctime.
> 
> Oh good! :)
> 
> > I have to say, Darrick, that I think that referring to this concern as
> > bikeshedding is not being honest.
> > 
> > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > review comments, but I think the question about using mtime/ctime in
> > this new API was not out of place
> 
> I agree, your question about mtime/ctime:
> 
> "Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?"
> 
> was a very good question.  But perhaps that statement referred to the
> other part of that thread.
> 
> >                                   and I think that making the freshness
> > data opaque is better for everyone in the long run and hopefully, this will
> > help you move to the things you care about faster.
> 
> I wish you'd suggested an opaque blob that the fs can lay out however it
> wants instead of suggesting specifically the change cookie.  I'm very
> much ok with an opaque freshness blob that allows future flexibility in
> how we define the blob's contents.
> 
> I was however very upset about the Jeff's suggestion of using i_version.
> I apologize for using all caps in that reply, and snarling about it in
> the commit message here.  The final version of this patch will not have
> that.
> 
> That said, I don't think it is at all helpful to suggest using a file
> attribute whose behavior is as yet unresolved.  Multigrain timestamps
> were a clever idea, regrettably reverted.  As far as I could tell when I
> wrote my reply, neither had NFS implemented a better behavior and
> quietly merged it; nor have Jeff and Dave produced any sort of candidate
> patchset to fix all the resulting issues in XFS.
>
> Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> internal" made me think "OH $deity, they wants me to do that work
> too???"
> 
> A better way to have woreded that might've been "How about switching
> this to a fs-determined structure so that we can switch the freshness
> check to i_version when that's fully working on XFS?"
> 
> The problem I have with reading patch review emails is that I can't
> easily tell whether an author's suggestion is being made in a casual
> offhand manner?  Or if it reflects something they feel strongly needs
> change before merging.
> 
> In fairness to you, Amir, I don't know how much you've kept on top of
> that i_version vs. XFS discussion.  So I have no idea if you were aware
> of the status of that work.
> 

Sorry, I didn't mean to trigger anyone, but I do have real concerns
about any API that attempts to use timestamps to detect whether
something has changed.

We learned that lesson in NFS in the 90's. VFS timestamp resolution is
just not enough to show whether there was a change to a file -- full
stop.

I get the hand-wringing over i_version definitions and I don't care to
rehash that discussion here, but I'll point out that this is a
(proposed) XFS-private interface:

What you could do is expose the XFS change counter (the one that gets
bumped for everything, even atime updates, possibly via different
ioctl), and use that for your "freshness" check.

You'd unfortunately get false negative freshness checks after read
operations, but you shouldn't get any false positives (which is real
danger with timestamps).
Darrick J. Wong March 2, 2024, 2:48 a.m. UTC | #5
On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > 
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > make it an opaque u64 array and require the userspace program to call
> > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > on a definition for i_version then we can use that; for now we'll just
> > > > use mtime/ctime like the old swapext ioctl.
> > > 
> > > This addresses my concerns about using mtime/ctime.
> > 
> > Oh good! :)
> > 
> > > I have to say, Darrick, that I think that referring to this concern as
> > > bikeshedding is not being honest.
> > > 
> > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > review comments, but I think the question about using mtime/ctime in
> > > this new API was not out of place
> > 
> > I agree, your question about mtime/ctime:
> > 
> > "Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?"
> > 
> > was a very good question.  But perhaps that statement referred to the
> > other part of that thread.
> > 
> > >                                   and I think that making the freshness
> > > data opaque is better for everyone in the long run and hopefully, this will
> > > help you move to the things you care about faster.
> > 
> > I wish you'd suggested an opaque blob that the fs can lay out however it
> > wants instead of suggesting specifically the change cookie.  I'm very
> > much ok with an opaque freshness blob that allows future flexibility in
> > how we define the blob's contents.
> > 
> > I was however very upset about the Jeff's suggestion of using i_version.
> > I apologize for using all caps in that reply, and snarling about it in
> > the commit message here.  The final version of this patch will not have
> > that.
> > 
> > That said, I don't think it is at all helpful to suggest using a file
> > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > were a clever idea, regrettably reverted.  As far as I could tell when I
> > wrote my reply, neither had NFS implemented a better behavior and
> > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > patchset to fix all the resulting issues in XFS.
> >
> > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > internal" made me think "OH $deity, they wants me to do that work
> > too???"
> > 
> > A better way to have woreded that might've been "How about switching
> > this to a fs-determined structure so that we can switch the freshness
> > check to i_version when that's fully working on XFS?"
> > 
> > The problem I have with reading patch review emails is that I can't
> > easily tell whether an author's suggestion is being made in a casual
> > offhand manner?  Or if it reflects something they feel strongly needs
> > change before merging.
> > 
> > In fairness to you, Amir, I don't know how much you've kept on top of
> > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > of the status of that work.
> > 
> 
> Sorry, I didn't mean to trigger anyone, but I do have real concerns
> about any API that attempts to use timestamps to detect whether
> something has changed.
> 
> We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> just not enough to show whether there was a change to a file -- full
> stop.
> 
> I get the hand-wringing over i_version definitions and I don't care to
> rehash that discussion here, but I'll point out that this is a
> (proposed) XFS-private interface:
> 
> What you could do is expose the XFS change counter (the one that gets
> bumped for everything, even atime updates, possibly via different
> ioctl), and use that for your "freshness" check.
> 
> You'd unfortunately get false negative freshness checks after read
> operations, but you shouldn't get any false positives (which is real
> danger with timestamps).

I don't see how would that work for this usecase?  You have to sample
file2 before reflinking file2's contents to file1, writing the changes
to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
inode flag on file2 will trigger an iversion update even though it won't
change mtime or ctime.  The COMMIT then fails due to the inode flags
change.

Worse yet, applications aren't going to know if a particular access is
actually the one that will trigger an atime update.  So this will just
fail unpredictably.

If iversion was purely a write counter then I would switch the freshness
implementation to use it.  But it's not, and I know this to be true
because I tried that and could not get COMMIT_RANGE to work reliably.
I suppose the advantage of the blob thing is that we actually /can/
switch over whenever it's ready.

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
>
Jeffrey Layton March 2, 2024, 12:43 p.m. UTC | #6
On Fri, 2024-03-01 at 18:48 -0800, Darrick J. Wong wrote:
> On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> > On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > 
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > > make it an opaque u64 array and require the userspace program to call
> > > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > > on a definition for i_version then we can use that; for now we'll just
> > > > > use mtime/ctime like the old swapext ioctl.
> > > > 
> > > > This addresses my concerns about using mtime/ctime.
> > > 
> > > Oh good! :)
> > > 
> > > > I have to say, Darrick, that I think that referring to this concern as
> > > > bikeshedding is not being honest.
> > > > 
> > > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > > review comments, but I think the question about using mtime/ctime in
> > > > this new API was not out of place
> > > 
> > > I agree, your question about mtime/ctime:
> > > 
> > > "Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?"
> > > 
> > > was a very good question.  But perhaps that statement referred to the
> > > other part of that thread.
> > > 
> > > >                                   and I think that making the freshness
> > > > data opaque is better for everyone in the long run and hopefully, this will
> > > > help you move to the things you care about faster.
> > > 
> > > I wish you'd suggested an opaque blob that the fs can lay out however it
> > > wants instead of suggesting specifically the change cookie.  I'm very
> > > much ok with an opaque freshness blob that allows future flexibility in
> > > how we define the blob's contents.
> > > 
> > > I was however very upset about the Jeff's suggestion of using i_version.
> > > I apologize for using all caps in that reply, and snarling about it in
> > > the commit message here.  The final version of this patch will not have
> > > that.
> > > 
> > > That said, I don't think it is at all helpful to suggest using a file
> > > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > > were a clever idea, regrettably reverted.  As far as I could tell when I
> > > wrote my reply, neither had NFS implemented a better behavior and
> > > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > > patchset to fix all the resulting issues in XFS.
> > > 
> > > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > > internal" made me think "OH $deity, they wants me to do that work
> > > too???"
> > > 
> > > A better way to have woreded that might've been "How about switching
> > > this to a fs-determined structure so that we can switch the freshness
> > > check to i_version when that's fully working on XFS?"
> > > 
> > > The problem I have with reading patch review emails is that I can't
> > > easily tell whether an author's suggestion is being made in a casual
> > > offhand manner?  Or if it reflects something they feel strongly needs
> > > change before merging.
> > > 
> > > In fairness to you, Amir, I don't know how much you've kept on top of
> > > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > > of the status of that work.
> > > 
> > 
> > Sorry, I didn't mean to trigger anyone, but I do have real concerns
> > about any API that attempts to use timestamps to detect whether
> > something has changed.
> > 
> > We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> > just not enough to show whether there was a change to a file -- full
> > stop.
> > 
> > I get the hand-wringing over i_version definitions and I don't care to
> > rehash that discussion here, but I'll point out that this is a
> > (proposed) XFS-private interface:
> > 
> > What you could do is expose the XFS change counter (the one that gets
> > bumped for everything, even atime updates, possibly via different
> > ioctl), and use that for your "freshness" check.
> > 
> > You'd unfortunately get false negative freshness checks after read
> > operations, but you shouldn't get any false positives (which is real
> > danger with timestamps).
> 
> I don't see how would that work for this usecase?  You have to sample
> file2 before reflinking file2's contents to file1, writing the changes
> to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
> inode flag on file2 will trigger an iversion update even though it won't
> change mtime or ctime.  The COMMIT then fails due to the inode flags
> change.
> 
> Worse yet, applications aren't going to know if a particular access is
> actually the one that will trigger an atime update.  So this will just
> fail unpredictably.
> 
> If iversion was purely a write counter then I would switch the freshness
> implementation to use it.  But it's not, and I know this to be true
> because I tried that and could not get COMMIT_RANGE to work reliably.
> I suppose the advantage of the blob thing is that we actually /can/
> switch over whenever it's ready.
> 

Yeah, that's the other part -- you have to be willing to redrive the I/O
every time the freshness check fails, which can get expensive depending
on how active the file is. Again this is an XFS interface, so I don't
really have a dog in this fight. If you think timestamps are good
enough, then so be it.

All I can do is mention that it has been our experience in the NFS world
that relying on timestamps like this will eventually lead to data
corruption. The race conditions may be tight, and much of the time the
race may be benign, but if you do this enough you'll eventually get
bitten, and end up exchanging data when you shouldn't have.

All of that said, I think this is great discussion fodder for LSF this
year. I feel like the time is right to consider these sorts of
interfaces that do synchronized I/O without locking. I've already
proposed a discussion around the state of the i_version counter, so
maybe we can chat about it then?
Darrick J. Wong March 7, 2024, 11:25 p.m. UTC | #7
On Sat, Mar 02, 2024 at 07:43:53AM -0500, Jeff Layton wrote:
> On Fri, 2024-03-01 at 18:48 -0800, Darrick J. Wong wrote:
> > On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> > > On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > > > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > 
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > > 
> > > > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > > > make it an opaque u64 array and require the userspace program to call
> > > > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > > > on a definition for i_version then we can use that; for now we'll just
> > > > > > use mtime/ctime like the old swapext ioctl.
> > > > > 
> > > > > This addresses my concerns about using mtime/ctime.
> > > > 
> > > > Oh good! :)
> > > > 
> > > > > I have to say, Darrick, that I think that referring to this concern as
> > > > > bikeshedding is not being honest.
> > > > > 
> > > > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > > > review comments, but I think the question about using mtime/ctime in
> > > > > this new API was not out of place
> > > > 
> > > > I agree, your question about mtime/ctime:
> > > > 
> > > > "Maybe a stupid question, but under which circumstances would mtime
> > > > change and ctime not change? Why are both needed?"
> > > > 
> > > > was a very good question.  But perhaps that statement referred to the
> > > > other part of that thread.
> > > > 
> > > > >                                   and I think that making the freshness
> > > > > data opaque is better for everyone in the long run and hopefully, this will
> > > > > help you move to the things you care about faster.
> > > > 
> > > > I wish you'd suggested an opaque blob that the fs can lay out however it
> > > > wants instead of suggesting specifically the change cookie.  I'm very
> > > > much ok with an opaque freshness blob that allows future flexibility in
> > > > how we define the blob's contents.
> > > > 
> > > > I was however very upset about the Jeff's suggestion of using i_version.
> > > > I apologize for using all caps in that reply, and snarling about it in
> > > > the commit message here.  The final version of this patch will not have
> > > > that.
> > > > 
> > > > That said, I don't think it is at all helpful to suggest using a file
> > > > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > > > were a clever idea, regrettably reverted.  As far as I could tell when I
> > > > wrote my reply, neither had NFS implemented a better behavior and
> > > > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > > > patchset to fix all the resulting issues in XFS.
> > > > 
> > > > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > > > internal" made me think "OH $deity, they wants me to do that work
> > > > too???"
> > > > 
> > > > A better way to have woreded that might've been "How about switching
> > > > this to a fs-determined structure so that we can switch the freshness
> > > > check to i_version when that's fully working on XFS?"
> > > > 
> > > > The problem I have with reading patch review emails is that I can't
> > > > easily tell whether an author's suggestion is being made in a casual
> > > > offhand manner?  Or if it reflects something they feel strongly needs
> > > > change before merging.
> > > > 
> > > > In fairness to you, Amir, I don't know how much you've kept on top of
> > > > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > > > of the status of that work.
> > > > 
> > > 
> > > Sorry, I didn't mean to trigger anyone, but I do have real concerns
> > > about any API that attempts to use timestamps to detect whether
> > > something has changed.
> > > 
> > > We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> > > just not enough to show whether there was a change to a file -- full
> > > stop.
> > > 
> > > I get the hand-wringing over i_version definitions and I don't care to
> > > rehash that discussion here, but I'll point out that this is a
> > > (proposed) XFS-private interface:
> > > 
> > > What you could do is expose the XFS change counter (the one that gets
> > > bumped for everything, even atime updates, possibly via different
> > > ioctl), and use that for your "freshness" check.
> > > 
> > > You'd unfortunately get false negative freshness checks after read
> > > operations, but you shouldn't get any false positives (which is real
> > > danger with timestamps).
> > 
> > I don't see how would that work for this usecase?  You have to sample
> > file2 before reflinking file2's contents to file1, writing the changes
> > to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
> > inode flag on file2 will trigger an iversion update even though it won't
> > change mtime or ctime.  The COMMIT then fails due to the inode flags
> > change.
> > 
> > Worse yet, applications aren't going to know if a particular access is
> > actually the one that will trigger an atime update.  So this will just
> > fail unpredictably.
> > 
> > If iversion was purely a write counter then I would switch the freshness
> > implementation to use it.  But it's not, and I know this to be true
> > because I tried that and could not get COMMIT_RANGE to work reliably.
> > I suppose the advantage of the blob thing is that we actually /can/
> > switch over whenever it's ready.
> > 
> 
> Yeah, that's the other part -- you have to be willing to redrive the I/O
> every time the freshness check fails, which can get expensive depending
> on how active the file is. Again this is an XFS interface, so I don't
> really have a dog in this fight. If you think timestamps are good
> enough, then so be it.
> 
> All I can do is mention that it has been our experience in the NFS world
> that relying on timestamps like this will eventually lead to data
> corruption. The race conditions may be tight, and much of the time the
> race may be benign, but if you do this enough you'll eventually get
> bitten, and end up exchanging data when you shouldn't have.
> 
> All of that said, I think this is great discussion fodder for LSF this
> year. I feel like the time is right to consider these sorts of
> interfaces that do synchronized I/O without locking. I've already
> proposed a discussion around the state of the i_version counter, so
> maybe we can chat about it then?

Yes.  I've gotten an invitation, so corporate approval and dumb injuries
notwithstanding, I'll be there this year. :)

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
>
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 01b3553adfc55..4019a78ee3ea5 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -860,14 +860,8 @@  struct xfs_commit_range {
 
 	__u64		flags;		/* see XFS_EXCHRANGE_* below */
 
-	/* file2 metadata for freshness checks */
-	__u64		file2_ino;	/* inode number */
-	__s64		file2_mtime;	/* modification time */
-	__s64		file2_ctime;	/* change time */
-	__s32		file2_mtime_nsec; /* mod time, nsec */
-	__s32		file2_ctime_nsec; /* change time, nsec */
-
-	__u64		pad;		/* must be zeroes */
+	/* opaque file2 metadata for freshness checks */
+	__u64		file2_freshness[5];
 };
 
 /*
@@ -973,7 +967,8 @@  struct xfs_commit_range {
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
 #define XFS_IOC_EXCHANGE_RANGE	     _IOWR('X', 129, struct xfs_exch_range)
-#define XFS_IOC_COMMIT_RANGE	     _IOWR('X', 129, struct xfs_commit_range)
+#define XFS_IOC_START_COMMIT	     _IOWR('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE	     _IOWR('X', 131, struct xfs_commit_range)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index e55ae06f1a32c..dae855515c3c4 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -863,3 +863,18 @@  xfs_exchange_range(
 		fsnotify_modify(fxr->file2);
 	return 0;
 }
+
+/* Sample freshness data from fxr->file2 for a commit range operation. */
+void
+xfs_exchrange_freshness(
+	struct xfs_exchrange	*fxr)
+{
+	struct inode		*inode2 = file_inode(fxr->file2);
+	struct xfs_inode	*ip2 = XFS_I(inode2);
+
+	xfs_ilock(ip2, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | XFS_ILOCK_SHARED);
+	fxr->file2_ino = ip2->i_ino;
+	fxr->file2_ctime = inode_get_ctime(inode2);
+	fxr->file2_mtime = inode_get_mtime(inode2);
+	xfs_iunlock(ip2, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | XFS_ILOCK_SHARED);
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 2dd9ab7d76828..942283a7f75f5 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -36,6 +36,7 @@  struct xfs_exchrange {
 	struct timespec64	file2_ctime;
 };
 
+void xfs_exchrange_freshness(struct xfs_exchrange *fxr);
 int xfs_exchange_range(struct xfs_exchrange *fxr);
 
 /* XFS-specific parts of file exchanges */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index ee26ac2028da1..1940da72a1da7 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -2402,6 +2402,47 @@  xfs_ioc_exchange_range(
 	return error;
 }
 
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+	__u64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+	__u64		pad;		/* zero */
+};
+
+static long
+xfs_ioc_start_commit(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_exchrange		fxr = {
+		.file2			= file,
+	};
+	struct xfs_commit_range		args;
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_commit_range_fresh	__user *user_f;
+
+	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+		     sizeof(args.file2_freshness));
+
+	xfs_exchrange_freshness(&fxr);
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+	kern_f->file2_ino		= fxr.file2_ino;
+	kern_f->file2_mtime		= fxr.file2_mtime.tv_sec;
+	kern_f->file2_mtime_nsec	= fxr.file2_mtime.tv_nsec;
+	kern_f->file2_ctime		= fxr.file2_ctime.tv_sec;
+	kern_f->file2_ctime_nsec	= fxr.file2_ctime.tv_nsec;
+
+	user_f = (struct xfs_commit_range_fresh *)&argp->file2_freshness;
+	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static long
 xfs_ioc_commit_range(
 	struct file			*file,
@@ -2411,12 +2452,15 @@  xfs_ioc_commit_range(
 		.file2			= file,
 	};
 	struct xfs_commit_range		args;
+	struct xfs_commit_range_fresh	*kern_f;
 	struct fd			file1;
 	int				error;
 
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
 	if (copy_from_user(&args, argp, sizeof(args)))
 		return -EFAULT;
-	if (memchr_inv(&args.pad, 0, sizeof(args.pad)))
+	if (memchr_inv(&kern_f->pad, 0, sizeof(kern_f->pad)))
 		return -EINVAL;
 	if (args.flags & ~XFS_EXCHRANGE_ALL_FLAGS)
 		return -EINVAL;
@@ -2425,11 +2469,11 @@  xfs_ioc_commit_range(
 	fxr.file2_offset	= args.file2_offset;
 	fxr.length		= args.length;
 	fxr.flags		= args.flags | __XFS_EXCHRANGE_CHECK_FRESH2;
-	fxr.file2_ino		= args.file2_ino;
-	fxr.file2_mtime.tv_sec	= args.file2_mtime;
-	fxr.file2_mtime.tv_nsec	= args.file2_mtime_nsec;
-	fxr.file2_ctime.tv_sec	= args.file2_ctime;
-	fxr.file2_ctime.tv_nsec	= args.file2_ctime_nsec;
+	fxr.file2_ino		= kern_f->file2_ino;
+	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
+	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
+	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
+	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
 
 	file1 = fdget(args.file1_fd);
 	if (!file1.file)
@@ -2782,6 +2826,8 @@  xfs_file_ioctl(
 
 	case XFS_IOC_EXCHANGE_RANGE:
 		return xfs_ioc_exchange_range(filp, arg);
+	case XFS_IOC_START_COMMIT:
+		return xfs_ioc_start_commit(filp, arg);
 	case XFS_IOC_COMMIT_RANGE:
 		return xfs_ioc_commit_range(filp, arg);