diff mbox

[RFC,3/3] fs: detect that the i_rwsem has already been taken exclusively

Message ID 1506602373-4799-4-git-send-email-zohar@linux.vnet.ibm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mimi Zohar Sept. 28, 2017, 12:39 p.m. UTC
Don't attempt to take the i_rwsem, if it has already been taken
exclusively.

Signed-off-by:  Mimi Zohar <zohar@linux.vnet.ibm.com>
---
 fs/ext2/file.c    |  6 ++++--
 fs/ext4/file.c    |  8 +++++---
 fs/xfs/xfs_file.c | 10 ++++++----
 3 files changed, 15 insertions(+), 9 deletions(-)

Comments

Dave Chinner Sept. 28, 2017, 10:02 p.m. UTC | #1
On Thu, Sep 28, 2017 at 08:39:33AM -0400, Mimi Zohar wrote:
> Don't attempt to take the i_rwsem, if it has already been taken
> exclusively.
> 
> Signed-off-by:  Mimi Zohar <zohar@linux.vnet.ibm.com>

That's bloody awful.

The locking in filesystem IO paths is already complex enough without
adding a new IO path semantic that says "caller has already locked
the i_rwsem in some order and some dependencies that we have no idea
about". Instead of having well defined locking in a small amount of
self contained code, we've now got to search through completely
unfamiliar code to analyse any sort of filesystem lockdep report or
deadlock to determine if that somethign else has screwed up the
filesystem IO path locking.

It also seems to have an undocumented semantic of not updating
access times on the inode, which effectively makes this invisible IO
and means we're assuming that timestamp updates will be done by
correctly callers outside the filesystem IO path. That's almost
certainly going to be a source of bugs in the future.

This seems like a recipe for future disasters to me....

Cheers,

Dave.
Linus Torvalds Sept. 28, 2017, 11:39 p.m. UTC | #2
On Thu, Sep 28, 2017 at 3:02 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Sep 28, 2017 at 08:39:33AM -0400, Mimi Zohar wrote:
>> Don't attempt to take the i_rwsem, if it has already been taken
>> exclusively.
>>
>> Signed-off-by:  Mimi Zohar <zohar@linux.vnet.ibm.com>
>
> That's bloody awful.
>
> The locking in filesystem IO paths is already complex enough without
> adding a new IO path semantic that says "caller has already locked
> the i_rwsem in some order and some dependencies that we have no idea
> about".

I do have to admit that I never got a satisfactory answer on why IMA
doesn't just use its own private per-inode lock for this all.

It isn't using the i_rwsem for file consistency reasons anyway, so it
seems to be purely about serializing the actual signature generation
with the xattr writing, but since IMA does those both, why isn't IMA
just using its own lock (not the filesystem lock) to do that?

                 Linus
Mimi Zohar Sept. 29, 2017, 12:12 a.m. UTC | #3
On Thu, 2017-09-28 at 16:39 -0700, Linus Torvalds wrote:
> On Thu, Sep 28, 2017 at 3:02 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Sep 28, 2017 at 08:39:33AM -0400, Mimi Zohar wrote:
> >> Don't attempt to take the i_rwsem, if it has already been taken
> >> exclusively.
> >>
> >> Signed-off-by:  Mimi Zohar <zohar@linux.vnet.ibm.com>
> >
> > That's bloody awful.
> >
> > The locking in filesystem IO paths is already complex enough without
> > adding a new IO path semantic that says "caller has already locked
> > the i_rwsem in some order and some dependencies that we have no idea
> > about".
> 
> I do have to admit that I never got a satisfactory answer on why IMA
> doesn't just use its own private per-inode lock for this all.
> 
> It isn't using the i_rwsem for file consistency reasons anyway, so it
> seems to be purely about serializing the actual signature generation
> with the xattr writing, but since IMA does those both, why isn't IMA
> just using its own lock (not the filesystem lock) to do that?

Originally IMA did define it's own lock, prior to IMA-appraisal.  IMA-
appraisal introduced writing the file hash as an xattr, which required
taking the i_mutex.  process_measurement() and ima_file_free() took
the iint->mutex first and then the i_mutex, while setxattr, chmod and
chown took the locks in reverse order.  To resolve the potential
deadlock, the iint->mutex was eliminated.

Mimi
Linus Torvalds Sept. 29, 2017, 12:33 a.m. UTC | #4
On Thu, Sep 28, 2017 at 5:12 PM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
>
> Originally IMA did define it's own lock, prior to IMA-appraisal.  IMA-
> appraisal introduced writing the file hash as an xattr, which required
> taking the i_mutex.  process_measurement() and ima_file_free() took
> the iint->mutex first and then the i_mutex, while setxattr, chmod and
> chown took the locks in reverse order.  To resolve the potential
> deadlock, the iint->mutex was eliminated.

Umm. You already have an explicit invalidation model, where you
invalidate after a write has occurred.

But the locking of the generation count (or "invalidation status" or
whatever) can - and should be - entirely independent of the locking of
the actual appraisal.

So make the appraisal itself use a semaphore ("only one appraisal at a time").

But use a separate lock for the generation count.
So then appraisal is:

 - get appraisal semaphore
      - get generation count lock
            read generation count
      - drop generation count lock
      - do the actual appraisal
 - drop appraisal semaphore

Note that you now have a tuple of "generation count, appraisal" that
you have *not* saved off yet, but it's your stable thing.

Now you can write the xattr:

  - get exclusive inode lock (for xattr)
      - get generation count lock
          - if the appraisal generation does not match, do NOT write
the appraisal you just calculated, since it's pointless: it's already
stale.
          - otherwise write the appraisal and generation count to the xattr
      - drop generation count lock
  - release exclusive inode lock

and then for anything that does setxattr or chmod or whatever, just
use that generation count lock to invalidate the appraisal. You don't
need to actual appraisal lock for that.

So now the appraisal lock is always the outermost one, and the
generation count lock is always the innermost.

Anyway, I haven't looked at the details of what IMA does, but
something like the above really sounds like it should work and seems
pretty straightforward.

No?

               Linus
Mimi Zohar Sept. 29, 2017, 1:53 a.m. UTC | #5
On Thu, 2017-09-28 at 17:33 -0700, Linus Torvalds wrote:
> On Thu, Sep 28, 2017 at 5:12 PM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
> >
> > Originally IMA did define it's own lock, prior to IMA-appraisal.  IMA-
> > appraisal introduced writing the file hash as an xattr, which required
> > taking the i_mutex.  process_measurement() and ima_file_free() took
> > the iint->mutex first and then the i_mutex, while setxattr, chmod and
> > chown took the locks in reverse order.  To resolve the potential
> > deadlock, the iint->mutex was eliminated.
> 
> Umm. You already have an explicit invalidation model, where you
> invalidate after a write has occurred.

Invalidating after each write would be horrible performance.  Only
after all the changes are made, after the file close, is the file
integrity status invalidated and the file hash re-calculated and
written out.

At some point, we might want to go back and look at having finer grain
file integrity invalidation.

> But the locking of the generation count (or "invalidation status" or
> whatever) can - and should be - entirely independent of the locking of
> the actual appraisal.

The locking issue isn't with validating the file hash, but with the
setxattr, chmod, chown syscalls.  Each of these syscalls takes the
i_rwsem exclusively before IMA (or EVM) is called.

In ima_file_free(), the locking would be:

lock: iint->mutex
lock: i_rwsem
	write hash as xattr
unlock: i_rwsem
unlock iint->mutex


In setxattr, chmod, chown syscalls, IMA (and EVM) are called after the
i_rwsem is already taken.  So the locking would be:

lock: i_rwsem
lock: iint->mutex

unlock: iint->mutex
unlock: i_rwsem

Perhaps now the problem is clearer?

Mimi
 

> So make the appraisal itself use a semaphore ("only one appraisal at a time").
> 
> But use a separate lock for the generation count.
> So then appraisal is:
> 
>  - get appraisal semaphore
>       - get generation count lock
>             read generation count
>       - drop generation count lock
>       - do the actual appraisal
>  - drop appraisal semaphore
> 
> Note that you now have a tuple of "generation count, appraisal" that
> you have *not* saved off yet, but it's your stable thing.
> 
> Now you can write the xattr:
> 
>   - get exclusive inode lock (for xattr)
>       - get generation count lock
>           - if the appraisal generation does not match, do NOT write
> the appraisal you just calculated, since it's pointless: it's already
> stale.
>           - otherwise write the appraisal and generation count to the xattr
>       - drop generation count lock
>   - release exclusive inode lock
> 
> and then for anything that does setxattr or chmod or whatever, just
> use that generation count lock to invalidate the appraisal. You don't
> need to actual appraisal lock for that.
> 
> So now the appraisal lock is always the outermost one, and the
> generation count lock is always the innermost.
> 
> Anyway, I haven't looked at the details of what IMA does, but
> something like the above really sounds like it should work and seems
> pretty straightforward.
> 
> No?
> 
>                Linus
>
Linus Torvalds Sept. 29, 2017, 3:26 a.m. UTC | #6
On Thu, Sep 28, 2017 at 6:53 PM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
>
> The locking issue isn't with validating the file hash, but with the
> setxattr, chmod, chown syscalls.  Each of these syscalls takes the
> i_rwsem exclusively before IMA (or EVM) is called.

Read my email again.

> In setxattr, chmod, chown syscalls, IMA (and EVM) are called after the
> i_rwsem is already taken.  So the locking would be:
>
> lock: i_rwsem
> lock: iint->mutex

No.

Two locks. One inner, one outer. Only the actual ones that calculates
the hash would take the outer one. Read my email.

           Linus
Eric W. Biederman Oct. 1, 2017, 1:33 a.m. UTC | #7
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Sep 28, 2017 at 6:53 PM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
>>
>> The locking issue isn't with validating the file hash, but with the
>> setxattr, chmod, chown syscalls.  Each of these syscalls takes the
>> i_rwsem exclusively before IMA (or EVM) is called.
>
> Read my email again.
>
>> In setxattr, chmod, chown syscalls, IMA (and EVM) are called after the
>> i_rwsem is already taken.  So the locking would be:
>>
>> lock: i_rwsem
>> lock: iint->mutex
>
> No.
>
> Two locks. One inner, one outer. Only the actual ones that calculates
> the hash would take the outer one. Read my email.

That would require a task_work or another kind of work callback so that
the writes of the xattr are not synchronous with the vfs callback
correct?

Eric
Mimi Zohar Oct. 1, 2017, 12:08 p.m. UTC | #8
On Sat, 2017-09-30 at 18:56 -0700, Linus Torvalds wrote:
> On Sep 30, 2017 18:33, "Eric W. Biederman" <ebiederm@xmission.com> wrote:.
> 
> 
> That would require a task_work or another kind of work callback so that
> the writes of the xattr are not synchronous with the vfs callback
> correct?
> 
> 
> No, why?
> 
> You should just invalidate the IMA on xattr write or other operations that
> make the measurement invalid. You only need the inner lock.

Right, re-introducing the iint->mutex and a new i_generation field in
the iint struct with a separate set of locks should work.  It will be
reset if the file metadata changes (eg. setxattr, chown, chmod).

(We need i_generation for namespacing IMA as well.)

thanks,

Mimi
Linus Torvalds Oct. 1, 2017, 6:41 p.m. UTC | #9
On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
>
> Right, re-introducing the iint->mutex and a new i_generation field in
> the iint struct with a separate set of locks should work.  It will be
> reset if the file metadata changes (eg. setxattr, chown, chmod).

Note that the "inner lock" could possibly be omitted if the
invalidation can be just a single atomic instruction.

So particularly if invalidation could be just an atomic_inc() on the
generation count, there might not need to be any inner lock at all.

You'd have to serialize the actual measurement with the "read
generation count", but that should be as simple as just doing a
smp_rmb() between the "read generation count" and "do measurement on
file contents".

Of course, if you do something more complex in invalidation, you may
end up needing a real lock.

                 Linus
Eric W. Biederman Oct. 1, 2017, 10:06 p.m. UTC | #10
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sep 30, 2017 18:33, "Eric W. Biederman" <ebiederm@xmission.com> wrote:.
>
>  That would require a task_work or another kind of work callback so that
>  the writes of the xattr are not synchronous with the vfs callback
>  correct?
>
> No, why?
>
> You should just invalidate the IMA on xattr write or other operations that make the measurement invalid. You only need the inner
> lock.
>
> Why are you guys making up all these things just to make it complicated?

I am not trying to make things complicated I am just trying to
understand the conversation.

Unless I misread something it was being pointed out there are some vfs
operations today on which ima writes an ima xattr as a side effect.  And
those operations hold the i_sem.  So perhaps I am misunderstanding
things or writing the ima xattr needs to happen at some point.  Which
implies something like queued work.

But perhaps I a misunderstanding the conversation and ima.  I frequenly
misunderstand ima.

Eric
Linus Torvalds Oct. 1, 2017, 10:20 p.m. UTC | #11
On Sun, Oct 1, 2017 at 3:06 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Unless I misread something it was being pointed out there are some vfs
> operations today on which ima writes an ima xattr as a side effect.  And
> those operations hold the i_sem.  So perhaps I am misunderstanding
> things or writing the ima xattr needs to happen at some point.  Which
> implies something like queued work.

So the issue is indeed the inode semaphore, as it is used by IMA. But
all these IMA patches to work around the issue are just horribly ugly.
One adds a VFS-layer filesystem method that most filesystems end up
not really needing (it's the same as the regular read), and other
filesystems end up then having hacks with ("oh, I don't need to take
this lock because it was already taken by the caller").

The second patch attempt avoided the need for a new filesystem method,
but added a flag in an annoying place (for the same basic logic). The
advantage is that now most filesystems don't actually need to care any
more (and the filesystems that used to care now check that flag).

There was discussion about moving the flag to a mode convenient spot,
which would have made it a lot less intrusive.

But the basic issue is that almost always when you see lock
inversions, the problem can just be fixed by doing the locking
differently instead.

And that's what I was/am pushing for.

There really are two totally different issues:

 - the integrity _measurement_.

   This one wants to be serialized, so that you don't have multiple
concurrent measurements, and the serialization fundamentally has to be
around all the IO, so this lock pretty much has to be outside the
i_sem.

 - the integrity invalidation on certain operations.

   This one fundamentally had to be inside the i_sem, since some of
the operations that cause this end up already holding the i_sem at a
VFS layer.

so you had these two different requirements (inside _and_ outside),
and the IMA approach was basically to avoid the problem by making
i_sem *the* lock, and then making the IO routines aware of it already
being held. That does solve the inside/outside issue.

But the simpler way to fix it is to simply use two locks that nest
inside each other, with i_sem nesting in the middle.  That just avoids
the problem entirely, and doesn't require anybody to ever care about
i_sem semantic changes, because i_sem semantics simply didn't change
at all.

So that's the approach I'm pushing. I admittedly haven't actually
looked at the IMA details, but from a high-level standpoint you can
basically describe it (as above) without having to care too much about
exactly what IMA even wants.

The two-lock approach does require that the operations that invalidate
the integrity measurements always only invalidate it, and don't try to
re-compute it. But I suspect that would be entirely insane anyway
(imagine a world where "setxattr" would have to read the whole file
contents in order to revalidate the integrity measurement - even if
there is nobody who even *cares*).

           Linus
Dave Chinner Oct. 1, 2017, 10:34 p.m. UTC | #12
On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
> >
> > Right, re-introducing the iint->mutex and a new i_generation field in
> > the iint struct with a separate set of locks should work.  It will be
> > reset if the file metadata changes (eg. setxattr, chown, chmod).
> 
> Note that the "inner lock" could possibly be omitted if the
> invalidation can be just a single atomic instruction.
> 
> So particularly if invalidation could be just an atomic_inc() on the
> generation count, there might not need to be any inner lock at all.
> 
> You'd have to serialize the actual measurement with the "read
> generation count", but that should be as simple as just doing a
> smp_rmb() between the "read generation count" and "do measurement on
> file contents".

We already have a change counter on the inode, which is modified on
any data or metadata write (i_version) under filesystem locks.  The
i_version counter has well defined semantics - it's required by
NFSv4 to increment on any metadata or data change - so we should be
able to rely on it's behaviour to implement IMA as well. Filesystems
that support i_version are marked with [SB|MS]_I_VERSION in the
superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
ATM).

The IMA code should be able to sample that at measurement time and
either fail or be retried if i_version changes during measurement.
We can then simply make the IMA xattr write conditional on the
i_version value being unchanged from the sample the IMA code passes
into the filesystem once the filesystem holds all the locks it needs
to write the xattr...

I note that IMA already grabs the i_version in
ima_collect_measurement(), so this shouldn't be too hard to do.
Perhaps we don't need any new locks or counters at all, maybe just
the ability to feed a version cookie to the set_xattr method?

Cheers,

Dave.
Linus Torvalds Oct. 1, 2017, 11:15 p.m. UTC | #13
On Sun, Oct 1, 2017 at 3:34 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> We already have a change counter on the inode, which is modified on
> any data or metadata write (i_version) under filesystem locks.  The
> i_version counter has well defined semantics - it's required by
> NFSv4 to increment on any metadata or data change - so we should be
> able to rely on it's behaviour to implement IMA as well.

I actually think i_version has exactly the wrong semantics.

Afaik, it doesn't actually version the file _data_ at all, it only
versions "inode itself changed".

But I might have missed something obvious. The updates are hidden in
some odd places sometimes.

              Linus
Mimi Zohar Oct. 1, 2017, 11:42 p.m. UTC | #14
On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
> On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
> > >
> > > Right, re-introducing the iint->mutex and a new i_generation field in
> > > the iint struct with a separate set of locks should work.  It will be
> > > reset if the file metadata changes (eg. setxattr, chown, chmod).
> > 
> > Note that the "inner lock" could possibly be omitted if the
> > invalidation can be just a single atomic instruction.
> > 
> > So particularly if invalidation could be just an atomic_inc() on the
> > generation count, there might not need to be any inner lock at all.
> > 
> > You'd have to serialize the actual measurement with the "read
> > generation count", but that should be as simple as just doing a
> > smp_rmb() between the "read generation count" and "do measurement on
> > file contents".
> 
> We already have a change counter on the inode, which is modified on
> any data or metadata write (i_version) under filesystem locks.  The
> i_version counter has well defined semantics - it's required by
> NFSv4 to increment on any metadata or data change - so we should be
> able to rely on it's behaviour to implement IMA as well. Filesystems
> that support i_version are marked with [SB|MS]_I_VERSION in the
> superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
> can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
> ATM).

Recently I received a patch to replace i_version with mtime/atime.
 Now, even more recently, I received a patch that claims that
i_version is just a performance improvement.  For file systems that
don't support i_version, assume that the file has changed.

For file systems that don't support i_version, instead of assuming
that the file has changed, we can at least use i_generation.

With Linus' suggested changes, I think this will work nicely.

> The IMA code should be able to sample that at measurement time and
> either fail or be retried if i_version changes during measurement.
> We can then simply make the IMA xattr write conditional on the
> i_version value being unchanged from the sample the IMA code passes
> into the filesystem once the filesystem holds all the locks it needs
> to write the xattr...

> I note that IMA already grabs the i_version in
> ima_collect_measurement(), so this shouldn't be too hard to do.
> Perhaps we don't need any new locks or counterst all, maybe just
> the ability to feed a version cookie to the set_xattr method?

The security.ima xattr is normally written out in
ima_check_last_writer(), not in ima_collect_measurement().
 ima_collect_measurement() calculates the file hash for storing in the
measurement list (IMA-measurement), verifying the hash/signature (IMA-
appraisal) already stored in the xattr, and auditing (IMA-audit).

The only time that ima_collect_measurement() writes the file xattr is
in "fix" mode.  Writing the xattr will need to be deferred until after
the iint->mutex is released.

There should be no open writers in ima_check_last_writer(), so the
file shouldn't be changing.

Mimi
Mimi Zohar Oct. 1, 2017, 11:54 p.m. UTC | #15
On Sun, 2017-10-01 at 15:20 -0700, Linus Torvalds wrote:
> On Sun, Oct 1, 2017 at 3:06 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > Unless I misread something it was being pointed out there are some vfs
> > operations today on which ima writes an ima xattr as a side effect.  And
> > those operations hold the i_sem.  So perhaps I am misunderstanding
> > things or writing the ima xattr needs to happen at some point.  Which
> > implies something like queued work.
> 
> So the issue is indeed the inode semaphore, as it is used by IMA. But
> all these IMA patches to work around the issue are just horribly ugly.
> One adds a VFS-layer filesystem method that most filesystems end up
> not really needing (it's the same as the regular read), and other
> filesystems end up then having hacks with ("oh, I don't need to take
> this lock because it was already taken by the caller").
> 
> The second patch attempt avoided the need for a new filesystem method,
> but added a flag in an annoying place (for the same basic logic). The
> advantage is that now most filesystems don't actually need to care any
> more (and the filesystems that used to care now check that flag).
> 
> There was discussion about moving the flag to a mode convenient spot,
> which would have made it a lot less intrusive.
> 
> But the basic issue is that almost always when you see lock
> inversions, the problem can just be fixed by doing the locking
> differently instead.

This is what I've been missing.  Thank you for taking the time to
understand the problem and explain how!

> And that's what I was/am pushing for.

> There really are two totally different issues:
> 
>  - the integrity _measurement_.
> 
>    This one wants to be serialized, so that you don't have multiple
> concurrent measurements, and the serialization fundamentally has to be
> around all the IO, so this lock pretty much has to be outside the
> i_sem.
> 
>  - the integrity invalidation on certain operations.
> 
>    This one fundamentally had to be inside the i_sem, since some of
> the operations that cause this end up already holding the i_sem at a
> VFS layer.
> 
> so you had these two different requirements (inside _and_ outside),
> and the IMA approach was basically to avoid the problem by making
> i_sem *the* lock, and then making the IO routines aware of it already
> being held. That does solve the inside/outside issue.
> 
> But the simpler way to fix it is to simply use two locks that nest
> inside each other, with i_sem nesting in the middle.  That just avoids
> the problem entirely, and doesn't require anybody to ever care about
> i_sem semantic changes, because i_sem semantics simply didn't change
> at all.
> 
> So that's the approach I'm pushing. I admittedly haven't actually
> looked at the IMA details, but from a high-level standpoint you can
> basically describe it (as above) without having to care too much about
> exactly what IMA even wants.
> 
> The two-lock approach does require that the operations that invalidate
> the integrity measurements always only invalidate it, and don't try to
> re-compute it. But I suspect that would be entirely insane anyway
> (imagine a world where "setxattr" would have to read the whole file
> contents in order to revalidate the integrity measurement - even if
> there is nobody who even *cares*).

Right, the setxattr, chmod, chown syscalls just resets the cached
flags, which indicate whether the file needs to be re-measured, re-
validated, or re-audited.  The file hash is not re-calculated at this
point.  That happens on the next access (in policy).

Mimi
Eric W. Biederman Oct. 2, 2017, 3:25 a.m. UTC | #16
Mimi Zohar <zohar@linux.vnet.ibm.com> writes:

> On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
>> On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
>> > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
>> > >
>> > > Right, re-introducing the iint->mutex and a new i_generation field in
>> > > the iint struct with a separate set of locks should work.  It will be
>> > > reset if the file metadata changes (eg. setxattr, chown, chmod).
>> > 
>> > Note that the "inner lock" could possibly be omitted if the
>> > invalidation can be just a single atomic instruction.
>> > 
>> > So particularly if invalidation could be just an atomic_inc() on the
>> > generation count, there might not need to be any inner lock at all.
>> > 
>> > You'd have to serialize the actual measurement with the "read
>> > generation count", but that should be as simple as just doing a
>> > smp_rmb() between the "read generation count" and "do measurement on
>> > file contents".
>> 
>> We already have a change counter on the inode, which is modified on
>> any data or metadata write (i_version) under filesystem locks.  The
>> i_version counter has well defined semantics - it's required by
>> NFSv4 to increment on any metadata or data change - so we should be
>> able to rely on it's behaviour to implement IMA as well. Filesystems
>> that support i_version are marked with [SB|MS]_I_VERSION in the
>> superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
>> can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
>> ATM).
>
> Recently I received a patch to replace i_version with mtime/atime.
>  Now, even more recently, I received a patch that claims that
> i_version is just a performance improvement.  For file systems that
> don't support i_version, assume that the file has changed.
>
> For file systems that don't support i_version, instead of assuming
> that the file has changed, we can at least use i_generation.
>
> With Linus' suggested changes, I think this will work nicely.
>
>> The IMA code should be able to sample that at measurement time and
>> either fail or be retried if i_version changes during measurement.
>> We can then simply make the IMA xattr write conditional on the
>> i_version value being unchanged from the sample the IMA code passes
>> into the filesystem once the filesystem holds all the locks it needs
>> to write the xattr...
>
>> I note that IMA already grabs the i_version in
>> ima_collect_measurement(), so this shouldn't be too hard to do.
>> Perhaps we don't need any new locks or counterst all, maybe just
>> the ability to feed a version cookie to the set_xattr method?
>
> The security.ima xattr is normally written out in
> ima_check_last_writer(), not in ima_collect_measurement().
>  ima_collect_measurement() calculates the file hash for storing in the
> measurement list (IMA-measurement), verifying the hash/signature (IMA-
> appraisal) already stored in the xattr, and auditing (IMA-audit).
>
> The only time that ima_collect_measurement() writes the file xattr is
> in "fix" mode.  Writing the xattr will need to be deferred until after
> the iint->mutex is released.
>
> There should be no open writers in ima_check_last_writer(), so the
> file shouldn't be changing.

This is slightly tangential but I think important to consider.
What do you do about distributed filesystems fuse, nfs, etc that
can change the data behind the kernels back.

Do you not support such systems or do you have a sufficient way to
detect changes?

Eric
Dave Chinner Oct. 2, 2017, 3:54 a.m. UTC | #17
On Sun, Oct 01, 2017 at 04:15:07PM -0700, Linus Torvalds wrote:
> On Sun, Oct 1, 2017 at 3:34 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > We already have a change counter on the inode, which is modified on
> > any data or metadata write (i_version) under filesystem locks.  The
> > i_version counter has well defined semantics - it's required by
> > NFSv4 to increment on any metadata or data change - so we should be
> > able to rely on it's behaviour to implement IMA as well.
> 
> I actually think i_version has exactly the wrong semantics.
> 
> Afaik, it doesn't actually version the file _data_ at all, it only
> versions "inode itself changed".

No, the NFSv4 change attribute must change if either data or
metadata on the inode is changed, and be consistent and persistent
across server crashes. For data updates, they piggy back on mtime
updates ....

> But I might have missed something obvious. The updates are hidden in
> some odd places sometimes.

... which are in file_update_time().

Hence every data write or write page fault will call
file_update_time() and trigger an i_version increment, even if the
mtime doesn't change.

Cheers,

Dave.
Dave Chinner Oct. 2, 2017, 4:35 a.m. UTC | #18
On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote:
> On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
> > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
> > > >
> > > > Right, re-introducing the iint->mutex and a new i_generation field in
> > > > the iint struct with a separate set of locks should work.  It will be
> > > > reset if the file metadata changes (eg. setxattr, chown, chmod).
> > > 
> > > Note that the "inner lock" could possibly be omitted if the
> > > invalidation can be just a single atomic instruction.
> > > 
> > > So particularly if invalidation could be just an atomic_inc() on the
> > > generation count, there might not need to be any inner lock at all.
> > > 
> > > You'd have to serialize the actual measurement with the "read
> > > generation count", but that should be as simple as just doing a
> > > smp_rmb() between the "read generation count" and "do measurement on
> > > file contents".
> > 
> > We already have a change counter on the inode, which is modified on
> > any data or metadata write (i_version) under filesystem locks.  The
> > i_version counter has well defined semantics - it's required by
> > NFSv4 to increment on any metadata or data change - so we should be
> > able to rely on it's behaviour to implement IMA as well. Filesystems
> > that support i_version are marked with [SB|MS]_I_VERSION in the
> > superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
> > can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
> > ATM).
> 
> Recently I received a patch to replace i_version with mtime/atime.

mtime is not guaranteed to change on data writes - the resolution of
the filesystem timestamps may mean mtime only changes once a second
regardless of the number of writes performed to that file. That's
why NFS can't use it as a change attribute, and hence we have
i_version....

>  Now, even more recently, I received a patch that claims that
> i_version is just a performance improvement.

Did you ask them to explain/quantify the performance improvement? 

e.g. Using i_version on XFS slows down performance on small
writes by 2-3% because i_version because all data writes log a
version change rather than only logging a change when mtime updates.
We take that penalty because NFS requires specific change attribute
behaviour, otherwise we wouldn't have implemented it at all in
XFS...

>  For file systems that
> don't support i_version, assume that the file has changed.
> 
> For file systems that don't support i_version, instead of assuming
> that the file has changed, we can at least use i_generation.

I'm not sure what you mean here - the struct inode already has a
i_generation variable. It's a lifecycle indicator used to
discriminate between alloc/free cycles on the same inode number.
i.e. It only changes at inode allocation time, not whenever the data
in the inode changes...

> With Linus' suggested changes, I think this will work nicely.
> 
> > The IMA code should be able to sample that at measurement time and
> > either fail or be retried if i_version changes during measurement.
> > We can then simply make the IMA xattr write conditional on the
> > i_version value being unchanged from the sample the IMA code passes
> > into the filesystem once the filesystem holds all the locks it needs
> > to write the xattr...
> 
> > I note that IMA already grabs the i_version in
> > ima_collect_measurement(), so this shouldn't be too hard to do.
> > Perhaps we don't need any new locks or counterst all, maybe just
> > the ability to feed a version cookie to the set_xattr method?
> 
> The security.ima xattr is normally written out in
> ima_check_last_writer(), not in ima_collect_measurement().

Which, if IIUC, does this to measure and update the xattr:

ima_check_last_writer
  -> ima_update_xattr
    -> ima_collect_measurement
    -> ima_fix_xattr

>  ima_collect_measurement() calculates the file hash for storing in the
> measurement list (IMA-measurement), verifying the hash/signature (IMA-
> appraisal) already stored in the xattr, and auditing (IMA-audit).

Yup, and it samples the i_version before it calculates the hash and
stores it in the iint, which then gets passed to ima_fix_xattr().
Looks like all that is needed is to pass the i_version back to the
filesystem through the xattr call....

IOWs, sample the i_version early while we hold the inode lock and
check the writer count, then if it is the last writer drop the inode
lock and call ima_update_xattr(). The sampled i_version then tells
us if the file has changed before we write the updated xattr...

> The only time that ima_collect_measurement() writes the file xattr is
> in "fix" mode.  Writing the xattr will need to be deferred until after
> the iint->mutex is released.

ima_collect_measurement() doesn't write an xattr at all - it just
reads the file data and calculates the hash.

> There should be no open writers in ima_check_last_writer(), so the
> file shouldn't be changing.

If that code is not holding the inode i_rwsem across
ima_update_xattr(), then the writer check is racy as hell.  We're
trying to get rid of the need for this code to hold the inode lock
to stabilise the writer count for the entire operation, and it looks
to me like everything is there to use the i_version to ensure the
the IMA code doesn't need to hold the inode lock across
ima_collect_measurement() and ima_fix_xattr()...

Cheers,

Dave.
Mimi Zohar Oct. 2, 2017, 12:09 p.m. UTC | #19
On Mon, 2017-10-02 at 15:35 +1100, Dave Chinner wrote:
> On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote:
> > On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
> > > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> > > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote:
> > > > >
> > > > > Right, re-introducing the iint->mutex and a new i_generation field in
> > > > > the iint struct with a separate set of locks should work.  It will be
> > > > > reset if the file metadata changes (eg. setxattr, chown, chmod).
> > > > 
> > > > Note that the "inner lock" could possibly be omitted if the
> > > > invalidation can be just a single atomic instruction.
> > > > 
> > > > So particularly if invalidation could be just an atomic_inc() on the
> > > > generation count, there might not need to be any inner lock at all.
> > > > 
> > > > You'd have to serialize the actual measurement with the "read
> > > > generation count", but that should be as simple as just doing a
> > > > smp_rmb() between the "read generation count" and "do measurement on
> > > > file contents".
> > > 
> > > We already have a change counter on the inode, which is modified on
> > > any data or metadata write (i_version) under filesystem locks.  The
> > > i_version counter has well defined semantics - it's required by
> > > NFSv4 to increment on any metadata or data change - so we should be
> > > able to rely on it's behaviour to implement IMA as well. Filesystems
> > > that support i_version are marked with [SB|MS]_I_VERSION in the
> > > superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA
> > > can be supported on a specific filesystem (btrfs, ext4, fuse and xfs
> > > ATM).
> > 
> > Recently I received a patch to replace i_version with mtime/atime.
> 
> mtime is not guaranteed to change on data writes - the resolution of
> the filesystem timestamps may mean mtime only changes once a second
> regardless of the number of writes performed to that file. That's
> why NFS can't use it as a change attribute, and hence we have
> i_version....
> 
> >  Now, even more recently, I received a patch that claims that
> > i_version is just a performance improvement.
> 
> Did you ask them to explain/quantify the performance improvement?

Using i_version is a performance improvement as opposed to always
calculating the file hash and writing the xattr.  The patch is
intended for filesystems that don't support i_version (eg. ubifs).
 
> e.g. Using i_version on XFS slows down performance on small
> writes by 2-3% because i_version because all data writes log a
> version change rather than only logging a change when mtime updates.
> We take that penalty because NFS requires specific change attribute
> behaviour, otherwise we wouldn't have implemented it at all in
> XFS...
> 
> >  For file systems that
> > don't support i_version, assume that the file has changed.
> > 
> > For file systems that don't support i_version, instead of assuming
> > that the file has changed, we can at least use i_generation.
> 
> I'm not sure what you mean here - the struct inode already has a
> i_generation variable. It's a lifecycle indicator used to
> discriminate between alloc/free cycles on the same inode number.
> i.e. It only changes at inode allocation time, not whenever the data
> in the inode changes...

Sigh, my error.

> 
> > With Linus' suggested changes, I think this will work nicely.
> > 
> > > The IMA code should be able to sample that at measurement time and
> > > either fail or be retried if i_version changes during measurement.
> > > We can then simply make the IMA xattr write conditional on the
> > > i_version value being unchanged from the sample the IMA code passes
> > > into the filesystem once the filesystem holds all the locks it needs
> > > to write the xattr...
> > 
> > > I note that IMA already grabs the i_version in
> > > ima_collect_measurement(), so this shouldn't be too hard to do.
> > > Perhaps we don't need any new locks or counterst all, maybe just
> > > the ability to feed a version cookie to the set_xattr method?
> > 
> > The security.ima xattr is normally written out in
> > ima_check_last_writer(), not in ima_collect_measurement().
> 
> Which, if IIUC, does this to measure and update the xattr:
> 
> ima_check_last_writer
>   -> ima_update_xattr
>     -> ima_collect_measurement
>     -> ima_fix_xattr
> 
> >  ima_collect_measurement() calculates the file hash for storing in the
> > measurement list (IMA-measurement), verifying the hash/signature (IMA-
> > appraisal) already stored in the xattr, and auditing (IMA-audit).
> 
> Yup, and it samples the i_version before it calculates the hash and
> stores it in the iint, which then gets passed to ima_fix_xattr().
> Looks like all that is needed is to pass the i_version back to the
> filesystem through the xattr call....
> 
> IOWs, sample the i_version early while we hold the inode lock and
> check the writer count, then if it is the last writer drop the inode
> lock and call ima_update_xattr(). The sampled i_version then tells
> us if the file has changed before we write the updated xattr...
> 
> > The only time that ima_collect_measurement() writes the file xattr is
> > in "fix" mode.  Writing the xattr will need to be deferred until after
> > the iint->mutex is released.
> 
> ima_collect_measurement() doesn't write an xattr at all - it just
> reads the file data and calculates the hash.

There's another call to ima_fix_xattr() from ima_appraise_measurement(). 

> > There should be no open writers in ima_check_last_writer(), so the
> > file shouldn't be changing.
> 
> If that code is not holding the inode i_rwsem across
> ima_update_xattr(), then the writer check is racy as hell.  We're
> trying to get rid of the need for this code to hold the inode lock
> to stabilise the writer count for the entire operation, and it looks
> to me like everything is there to use the i_version to ensure the
> the IMA code doesn't need to hold the inode lock across
> ima_collect_measurement() and ima_fix_xattr()...

Ok

Mimi
Mimi Zohar Oct. 2, 2017, 12:25 p.m. UTC | #20
On Sun, 2017-10-01 at 22:25 -0500, Eric W. Biederman wrote:
> Mimi Zohar <zohar@linux.vnet.ibm.com> writes:

> > There should be no open writers in ima_check_last_writer(), so the
> > file shouldn't be changing.
> 
> This is slightly tangential but I think important to consider.
> What do you do about distributed filesystems fuse, nfs, etc that
> can change the data behind the kernels back.

Exactly!

> Do you not support such systems or do you have a sufficient way to
> detect changes?

Currently, only the initial file access in policy is measured,
verified, audited.  Even if there was a way of detecting the change,
since we can't trust these file systems, the performance would be
awful, but we should probably not be caching the
measurement/verification results.

Mimi
Jeff Layton Oct. 2, 2017, 12:43 p.m. UTC | #21
On Mon, 2017-10-02 at 08:09 -0400, Mimi Zohar wrote:
> On Mon, 2017-10-02 at 15:35 +1100, Dave Chinner wrote:
> > On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote:
> > > On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote:
> > > > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote:
> > > > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.
> > > > > ibm.com> wrote:
> > > > > > 
> > > > > > Right, re-introducing the iint->mutex and a new
> > > > > > i_generation field in
> > > > > > the iint struct with a separate set of locks should
> > > > > > work.  It will be
> > > > > > reset if the file metadata changes (eg. setxattr, chown,
> > > > > > chmod).
> > > > > 
> > > > > Note that the "inner lock" could possibly be omitted if the
> > > > > invalidation can be just a single atomic instruction.
> > > > > 
> > > > > So particularly if invalidation could be just an atomic_inc()
> > > > > on the
> > > > > generation count, there might not need to be any inner lock
> > > > > at all.
> > > > > 
> > > > > You'd have to serialize the actual measurement with the "read
> > > > > generation count", but that should be as simple as just doing
> > > > > a
> > > > > smp_rmb() between the "read generation count" and "do
> > > > > measurement on
> > > > > file contents".
> > > > 
> > > > We already have a change counter on the inode, which is
> > > > modified on
> > > > any data or metadata write (i_version) under filesystem
> > > > locks.  The
> > > > i_version counter has well defined semantics - it's required by
> > > > NFSv4 to increment on any metadata or data change - so we
> > > > should be
> > > > able to rely on it's behaviour to implement IMA as well.
> > > > Filesystems
> > > > that support i_version are marked with [SB|MS]_I_VERSION in the
> > > > superblock (IS_I_VERSION(inode)) so it should be easy to tell
> > > > if IMA
> > > > can be supported on a specific filesystem (btrfs, ext4, fuse
> > > > and xfs
> > > > ATM).
> > > 
> > > Recently I received a patch to replace i_version with
> > > mtime/atime.
> > 

I assume you're talking here about the patch I sent a few months ago.

I specifically do _not_ want to replace i_version with the mtime/atime.
The point there was to stop trying to use i_version on filesystems that
don't properly implement it (which is most of them).

The next best approximation on those filesystems is the mtime. It's not
perfect, but it's better than nothing (which is what you have now on
filesystems that never increment i_version on writes). IOW, it just
added a fallback for when you can't count on the i_version changing.

(BTW: atime is worthless here -- who cares if the thing was accessed?
IIUC, we only care if something changed.)

Ideally, all filesystems would implement i_version properly. In
practice, that's a tall order as that may require on-disk changes for
some of them. That's not always possible where cross-OS compatibility
is necessary (e.g. FAT or NTFS).

> > mtime is not guaranteed to change on data writes - the resolution
> > of
> > the filesystem timestamps may mean mtime only changes once a second
> > regardless of the number of writes performed to that file. That's
> > why NFS can't use it as a change attribute, and hence we have
> > i_version....
> > 
> > >  Now, even more recently, I received a patch that claims that
> > > i_version is just a performance improvement.
> > 
> > Did you ask them to explain/quantify the performance improvement?
> 
> Using i_version is a performance improvement as opposed to always
> calculating the file hash and writing the xattr.  The patch is
> intended for filesystems that don't support i_version (eg. ubifs).
>  
> > e.g. Using i_version on XFS slows down performance on small
> > writes by 2-3% because i_version because all data writes log a
> > version change rather than only logging a change when mtime
> > updates.
> > We take that penalty because NFS requires specific change attribute
> > behaviour, otherwise we wouldn't have implemented it at all in
> > XFS...
> > 
> > >  For file systems that
> > > don't support i_version, assume that the file has changed.
> > > 
> > > For file systems that don't support i_version, instead of
> > > assuming
> > > that the file has changed, we can at least use i_generation.
> > 
> > I'm not sure what you mean here - the struct inode already has a
> > i_generation variable. It's a lifecycle indicator used to
> > discriminate between alloc/free cycles on the same inode number.
> > i.e. It only changes at inode allocation time, not whenever the
> > data
> > in the inode changes...
> 
> Sigh, my error.
> 
> > 
> > > With Linus' suggested changes, I think this will work nicely.
> > > 
> > > > The IMA code should be able to sample that at measurement time
> > > > and
> > > > either fail or be retried if i_version changes during
> > > > measurement.
> > > > We can then simply make the IMA xattr write conditional on the
> > > > i_version value being unchanged from the sample the IMA code
> > > > passes
> > > > into the filesystem once the filesystem holds all the locks it
> > > > needs
> > > > to write the xattr...
> > > > I note that IMA already grabs the i_version in
> > > > ima_collect_measurement(), so this shouldn't be too hard to do.
> > > > Perhaps we don't need any new locks or counterst all, maybe
> > > > just
> > > > the ability to feed a version cookie to the set_xattr method?
> > > 
> > > The security.ima xattr is normally written out in
> > > ima_check_last_writer(), not in ima_collect_measurement().
> > 
> > Which, if IIUC, does this to measure and update the xattr:
> > 
> > ima_check_last_writer
> >   -> ima_update_xattr
> >     -> ima_collect_measurement
> >     -> ima_fix_xattr
> > 
> > >  ima_collect_measurement() calculates the file hash for storing
> > > in the
> > > measurement list (IMA-measurement), verifying the hash/signature
> > > (IMA-
> > > appraisal) already stored in the xattr, and auditing (IMA-audit).
> > 
> > Yup, and it samples the i_version before it calculates the hash and
> > stores it in the iint, which then gets passed to ima_fix_xattr().
> > Looks like all that is needed is to pass the i_version back to the
> > filesystem through the xattr call....
> > 
> > IOWs, sample the i_version early while we hold the inode lock and
> > check the writer count, then if it is the last writer drop the
> > inode
> > lock and call ima_update_xattr(). The sampled i_version then tells
> > us if the file has changed before we write the updated xattr...
> > 
> > > The only time that ima_collect_measurement() writes the file
> > > xattr is
> > > in "fix" mode.  Writing the xattr will need to be deferred until
> > > after
> > > the iint->mutex is released.
> > 
> > ima_collect_measurement() doesn't write an xattr at all - it just
> > reads the file data and calculates the hash.
> 
> There's another call to ima_fix_xattr() from
> ima_appraise_measurement(). 
> 
> > > There should be no open writers in ima_check_last_writer(), so
> > > the
> > > file shouldn't be changing.
> > 
> > If that code is not holding the inode i_rwsem across
> > ima_update_xattr(), then the writer check is racy as hell.  We're
> > trying to get rid of the need for this code to hold the inode lock
> > to stabilise the writer count for the entire operation, and it
> > looks
> > to me like everything is there to use the i_version to ensure the
> > the IMA code doesn't need to hold the inode lock across
> > ima_collect_measurement() and ima_fix_xattr()...
> 
> Ok
> 
> Mimi
>
diff mbox

Patch

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 839095f66d8d..d174b2880929 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -38,9 +38,11 @@  static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to,
 	if (!iov_iter_count(to))
 		return 0; /* skip atime */
 
-	inode_lock_shared(inode);
+	if (!rwf)
+		inode_lock_shared(inode);
 	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
-	inode_unlock_shared(inode);
+	if (!rwf)
+		inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
 	return ret;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 10789666725e..7d7c0e380add 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -38,7 +38,7 @@  static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to,
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t ret;
 
-	if (!inode_trylock_shared(inode)) {
+	if (!rwf && !inode_trylock_shared(inode)) {
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			return -EAGAIN;
 		inode_lock_shared(inode);
@@ -48,12 +48,14 @@  static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to,
 	 * change anymore
 	 */
 	if (!IS_DAX(inode)) {
-		inode_unlock_shared(inode);
+		if (!rwf)
+			inode_unlock_shared(inode);
 		/* Fallback to buffered IO in case we cannot support DAX */
 		return generic_file_read_iter(iocb, to, rwf);
 	}
 	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
-	inode_unlock_shared(inode);
+	if (!rwf)
+		inode_unlock_shared(inode);
 
 	file_accessed(iocb->ki_filp);
 	return ret;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cf1ce8961601..0cffca97ed68 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -239,7 +239,7 @@  xfs_file_dax_read(
 	if (!count)
 		return 0; /* skip atime */
 
-	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
+	if (!rwf && !xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			return -EAGAIN;
 		xfs_ilock(ip, XFS_IOLOCK_SHARED);
@@ -247,7 +247,8 @@  xfs_file_dax_read(
 	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
-	file_accessed(iocb->ki_filp);
+	if (!rwf)
+		file_accessed(iocb->ki_filp);
 	return ret;
 }
 
@@ -262,13 +263,14 @@  xfs_file_buffered_aio_read(
 
 	trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
 
-	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
+	if (!rwf && !xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			return -EAGAIN;
 		xfs_ilock(ip, XFS_IOLOCK_SHARED);
 	}
 	ret = generic_file_read_iter(iocb, to, rwf);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+	if (!rwf)
+		xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
 	return ret;
 }