Message ID | 1506602373-4799-4-git-send-email-zohar@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Sep 28, 2017 at 08:39:33AM -0400, Mimi Zohar wrote: > Don't attempt to take the i_rwsem, if it has already been taken > exclusively. > > Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> That's bloody awful. The locking in filesystem IO paths is already complex enough without adding a new IO path semantic that says "caller has already locked the i_rwsem in some order and some dependencies that we have no idea about". Instead of having well defined locking in a small amount of self contained code, we've now got to search through completely unfamiliar code to analyse any sort of filesystem lockdep report or deadlock to determine if that somethign else has screwed up the filesystem IO path locking. It also seems to have an undocumented semantic of not updating access times on the inode, which effectively makes this invisible IO and means we're assuming that timestamp updates will be done by correctly callers outside the filesystem IO path. That's almost certainly going to be a source of bugs in the future. This seems like a recipe for future disasters to me.... Cheers, Dave.
On Thu, Sep 28, 2017 at 3:02 PM, Dave Chinner <david@fromorbit.com> wrote: > On Thu, Sep 28, 2017 at 08:39:33AM -0400, Mimi Zohar wrote: >> Don't attempt to take the i_rwsem, if it has already been taken >> exclusively. >> >> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> > > That's bloody awful. > > The locking in filesystem IO paths is already complex enough without > adding a new IO path semantic that says "caller has already locked > the i_rwsem in some order and some dependencies that we have no idea > about". I do have to admit that I never got a satisfactory answer on why IMA doesn't just use its own private per-inode lock for this all. It isn't using the i_rwsem for file consistency reasons anyway, so it seems to be purely about serializing the actual signature generation with the xattr writing, but since IMA does those both, why isn't IMA just using its own lock (not the filesystem lock) to do that? Linus
On Thu, 2017-09-28 at 16:39 -0700, Linus Torvalds wrote: > On Thu, Sep 28, 2017 at 3:02 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Sep 28, 2017 at 08:39:33AM -0400, Mimi Zohar wrote: > >> Don't attempt to take the i_rwsem, if it has already been taken > >> exclusively. > >> > >> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> > > > > That's bloody awful. > > > > The locking in filesystem IO paths is already complex enough without > > adding a new IO path semantic that says "caller has already locked > > the i_rwsem in some order and some dependencies that we have no idea > > about". > > I do have to admit that I never got a satisfactory answer on why IMA > doesn't just use its own private per-inode lock for this all. > > It isn't using the i_rwsem for file consistency reasons anyway, so it > seems to be purely about serializing the actual signature generation > with the xattr writing, but since IMA does those both, why isn't IMA > just using its own lock (not the filesystem lock) to do that? Originally IMA did define it's own lock, prior to IMA-appraisal. IMA- appraisal introduced writing the file hash as an xattr, which required taking the i_mutex. process_measurement() and ima_file_free() took the iint->mutex first and then the i_mutex, while setxattr, chmod and chown took the locks in reverse order. To resolve the potential deadlock, the iint->mutex was eliminated. Mimi
On Thu, Sep 28, 2017 at 5:12 PM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: > > Originally IMA did define it's own lock, prior to IMA-appraisal. IMA- > appraisal introduced writing the file hash as an xattr, which required > taking the i_mutex. process_measurement() and ima_file_free() took > the iint->mutex first and then the i_mutex, while setxattr, chmod and > chown took the locks in reverse order. To resolve the potential > deadlock, the iint->mutex was eliminated. Umm. You already have an explicit invalidation model, where you invalidate after a write has occurred. But the locking of the generation count (or "invalidation status" or whatever) can - and should be - entirely independent of the locking of the actual appraisal. So make the appraisal itself use a semaphore ("only one appraisal at a time"). But use a separate lock for the generation count. So then appraisal is: - get appraisal semaphore - get generation count lock read generation count - drop generation count lock - do the actual appraisal - drop appraisal semaphore Note that you now have a tuple of "generation count, appraisal" that you have *not* saved off yet, but it's your stable thing. Now you can write the xattr: - get exclusive inode lock (for xattr) - get generation count lock - if the appraisal generation does not match, do NOT write the appraisal you just calculated, since it's pointless: it's already stale. - otherwise write the appraisal and generation count to the xattr - drop generation count lock - release exclusive inode lock and then for anything that does setxattr or chmod or whatever, just use that generation count lock to invalidate the appraisal. You don't need to actual appraisal lock for that. So now the appraisal lock is always the outermost one, and the generation count lock is always the innermost. Anyway, I haven't looked at the details of what IMA does, but something like the above really sounds like it should work and seems pretty straightforward. No? Linus
On Thu, 2017-09-28 at 17:33 -0700, Linus Torvalds wrote: > On Thu, Sep 28, 2017 at 5:12 PM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: > > > > Originally IMA did define it's own lock, prior to IMA-appraisal. IMA- > > appraisal introduced writing the file hash as an xattr, which required > > taking the i_mutex. process_measurement() and ima_file_free() took > > the iint->mutex first and then the i_mutex, while setxattr, chmod and > > chown took the locks in reverse order. To resolve the potential > > deadlock, the iint->mutex was eliminated. > > Umm. You already have an explicit invalidation model, where you > invalidate after a write has occurred. Invalidating after each write would be horrible performance. Only after all the changes are made, after the file close, is the file integrity status invalidated and the file hash re-calculated and written out. At some point, we might want to go back and look at having finer grain file integrity invalidation. > But the locking of the generation count (or "invalidation status" or > whatever) can - and should be - entirely independent of the locking of > the actual appraisal. The locking issue isn't with validating the file hash, but with the setxattr, chmod, chown syscalls. Each of these syscalls takes the i_rwsem exclusively before IMA (or EVM) is called. In ima_file_free(), the locking would be: lock: iint->mutex lock: i_rwsem write hash as xattr unlock: i_rwsem unlock iint->mutex In setxattr, chmod, chown syscalls, IMA (and EVM) are called after the i_rwsem is already taken. So the locking would be: lock: i_rwsem lock: iint->mutex unlock: iint->mutex unlock: i_rwsem Perhaps now the problem is clearer? Mimi > So make the appraisal itself use a semaphore ("only one appraisal at a time"). > > But use a separate lock for the generation count. > So then appraisal is: > > - get appraisal semaphore > - get generation count lock > read generation count > - drop generation count lock > - do the actual appraisal > - drop appraisal semaphore > > Note that you now have a tuple of "generation count, appraisal" that > you have *not* saved off yet, but it's your stable thing. > > Now you can write the xattr: > > - get exclusive inode lock (for xattr) > - get generation count lock > - if the appraisal generation does not match, do NOT write > the appraisal you just calculated, since it's pointless: it's already > stale. > - otherwise write the appraisal and generation count to the xattr > - drop generation count lock > - release exclusive inode lock > > and then for anything that does setxattr or chmod or whatever, just > use that generation count lock to invalidate the appraisal. You don't > need to actual appraisal lock for that. > > So now the appraisal lock is always the outermost one, and the > generation count lock is always the innermost. > > Anyway, I haven't looked at the details of what IMA does, but > something like the above really sounds like it should work and seems > pretty straightforward. > > No? > > Linus >
On Thu, Sep 28, 2017 at 6:53 PM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: > > The locking issue isn't with validating the file hash, but with the > setxattr, chmod, chown syscalls. Each of these syscalls takes the > i_rwsem exclusively before IMA (or EVM) is called. Read my email again. > In setxattr, chmod, chown syscalls, IMA (and EVM) are called after the > i_rwsem is already taken. So the locking would be: > > lock: i_rwsem > lock: iint->mutex No. Two locks. One inner, one outer. Only the actual ones that calculates the hash would take the outer one. Read my email. Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, Sep 28, 2017 at 6:53 PM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: >> >> The locking issue isn't with validating the file hash, but with the >> setxattr, chmod, chown syscalls. Each of these syscalls takes the >> i_rwsem exclusively before IMA (or EVM) is called. > > Read my email again. > >> In setxattr, chmod, chown syscalls, IMA (and EVM) are called after the >> i_rwsem is already taken. So the locking would be: >> >> lock: i_rwsem >> lock: iint->mutex > > No. > > Two locks. One inner, one outer. Only the actual ones that calculates > the hash would take the outer one. Read my email. That would require a task_work or another kind of work callback so that the writes of the xattr are not synchronous with the vfs callback correct? Eric
On Sat, 2017-09-30 at 18:56 -0700, Linus Torvalds wrote: > On Sep 30, 2017 18:33, "Eric W. Biederman" <ebiederm@xmission.com> wrote:. > > > That would require a task_work or another kind of work callback so that > the writes of the xattr are not synchronous with the vfs callback > correct? > > > No, why? > > You should just invalidate the IMA on xattr write or other operations that > make the measurement invalid. You only need the inner lock. Right, re-introducing the iint->mutex and a new i_generation field in the iint struct with a separate set of locks should work. It will be reset if the file metadata changes (eg. setxattr, chown, chmod). (We need i_generation for namespacing IMA as well.) thanks, Mimi
On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: > > Right, re-introducing the iint->mutex and a new i_generation field in > the iint struct with a separate set of locks should work. It will be > reset if the file metadata changes (eg. setxattr, chown, chmod). Note that the "inner lock" could possibly be omitted if the invalidation can be just a single atomic instruction. So particularly if invalidation could be just an atomic_inc() on the generation count, there might not need to be any inner lock at all. You'd have to serialize the actual measurement with the "read generation count", but that should be as simple as just doing a smp_rmb() between the "read generation count" and "do measurement on file contents". Of course, if you do something more complex in invalidation, you may end up needing a real lock. Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sep 30, 2017 18:33, "Eric W. Biederman" <ebiederm@xmission.com> wrote:. > > That would require a task_work or another kind of work callback so that > the writes of the xattr are not synchronous with the vfs callback > correct? > > No, why? > > You should just invalidate the IMA on xattr write or other operations that make the measurement invalid. You only need the inner > lock. > > Why are you guys making up all these things just to make it complicated? I am not trying to make things complicated I am just trying to understand the conversation. Unless I misread something it was being pointed out there are some vfs operations today on which ima writes an ima xattr as a side effect. And those operations hold the i_sem. So perhaps I am misunderstanding things or writing the ima xattr needs to happen at some point. Which implies something like queued work. But perhaps I a misunderstanding the conversation and ima. I frequenly misunderstand ima. Eric
On Sun, Oct 1, 2017 at 3:06 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > Unless I misread something it was being pointed out there are some vfs > operations today on which ima writes an ima xattr as a side effect. And > those operations hold the i_sem. So perhaps I am misunderstanding > things or writing the ima xattr needs to happen at some point. Which > implies something like queued work. So the issue is indeed the inode semaphore, as it is used by IMA. But all these IMA patches to work around the issue are just horribly ugly. One adds a VFS-layer filesystem method that most filesystems end up not really needing (it's the same as the regular read), and other filesystems end up then having hacks with ("oh, I don't need to take this lock because it was already taken by the caller"). The second patch attempt avoided the need for a new filesystem method, but added a flag in an annoying place (for the same basic logic). The advantage is that now most filesystems don't actually need to care any more (and the filesystems that used to care now check that flag). There was discussion about moving the flag to a mode convenient spot, which would have made it a lot less intrusive. But the basic issue is that almost always when you see lock inversions, the problem can just be fixed by doing the locking differently instead. And that's what I was/am pushing for. There really are two totally different issues: - the integrity _measurement_. This one wants to be serialized, so that you don't have multiple concurrent measurements, and the serialization fundamentally has to be around all the IO, so this lock pretty much has to be outside the i_sem. - the integrity invalidation on certain operations. This one fundamentally had to be inside the i_sem, since some of the operations that cause this end up already holding the i_sem at a VFS layer. so you had these two different requirements (inside _and_ outside), and the IMA approach was basically to avoid the problem by making i_sem *the* lock, and then making the IO routines aware of it already being held. That does solve the inside/outside issue. But the simpler way to fix it is to simply use two locks that nest inside each other, with i_sem nesting in the middle. That just avoids the problem entirely, and doesn't require anybody to ever care about i_sem semantic changes, because i_sem semantics simply didn't change at all. So that's the approach I'm pushing. I admittedly haven't actually looked at the IMA details, but from a high-level standpoint you can basically describe it (as above) without having to care too much about exactly what IMA even wants. The two-lock approach does require that the operations that invalidate the integrity measurements always only invalidate it, and don't try to re-compute it. But I suspect that would be entirely insane anyway (imagine a world where "setxattr" would have to read the whole file contents in order to revalidate the integrity measurement - even if there is nobody who even *cares*). Linus
On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote: > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: > > > > Right, re-introducing the iint->mutex and a new i_generation field in > > the iint struct with a separate set of locks should work. It will be > > reset if the file metadata changes (eg. setxattr, chown, chmod). > > Note that the "inner lock" could possibly be omitted if the > invalidation can be just a single atomic instruction. > > So particularly if invalidation could be just an atomic_inc() on the > generation count, there might not need to be any inner lock at all. > > You'd have to serialize the actual measurement with the "read > generation count", but that should be as simple as just doing a > smp_rmb() between the "read generation count" and "do measurement on > file contents". We already have a change counter on the inode, which is modified on any data or metadata write (i_version) under filesystem locks. The i_version counter has well defined semantics - it's required by NFSv4 to increment on any metadata or data change - so we should be able to rely on it's behaviour to implement IMA as well. Filesystems that support i_version are marked with [SB|MS]_I_VERSION in the superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA can be supported on a specific filesystem (btrfs, ext4, fuse and xfs ATM). The IMA code should be able to sample that at measurement time and either fail or be retried if i_version changes during measurement. We can then simply make the IMA xattr write conditional on the i_version value being unchanged from the sample the IMA code passes into the filesystem once the filesystem holds all the locks it needs to write the xattr... I note that IMA already grabs the i_version in ima_collect_measurement(), so this shouldn't be too hard to do. Perhaps we don't need any new locks or counters at all, maybe just the ability to feed a version cookie to the set_xattr method? Cheers, Dave.
On Sun, Oct 1, 2017 at 3:34 PM, Dave Chinner <david@fromorbit.com> wrote: > > We already have a change counter on the inode, which is modified on > any data or metadata write (i_version) under filesystem locks. The > i_version counter has well defined semantics - it's required by > NFSv4 to increment on any metadata or data change - so we should be > able to rely on it's behaviour to implement IMA as well. I actually think i_version has exactly the wrong semantics. Afaik, it doesn't actually version the file _data_ at all, it only versions "inode itself changed". But I might have missed something obvious. The updates are hidden in some odd places sometimes. Linus
On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote: > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote: > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: > > > > > > Right, re-introducing the iint->mutex and a new i_generation field in > > > the iint struct with a separate set of locks should work. It will be > > > reset if the file metadata changes (eg. setxattr, chown, chmod). > > > > Note that the "inner lock" could possibly be omitted if the > > invalidation can be just a single atomic instruction. > > > > So particularly if invalidation could be just an atomic_inc() on the > > generation count, there might not need to be any inner lock at all. > > > > You'd have to serialize the actual measurement with the "read > > generation count", but that should be as simple as just doing a > > smp_rmb() between the "read generation count" and "do measurement on > > file contents". > > We already have a change counter on the inode, which is modified on > any data or metadata write (i_version) under filesystem locks. The > i_version counter has well defined semantics - it's required by > NFSv4 to increment on any metadata or data change - so we should be > able to rely on it's behaviour to implement IMA as well. Filesystems > that support i_version are marked with [SB|MS]_I_VERSION in the > superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA > can be supported on a specific filesystem (btrfs, ext4, fuse and xfs > ATM). Recently I received a patch to replace i_version with mtime/atime. Now, even more recently, I received a patch that claims that i_version is just a performance improvement. For file systems that don't support i_version, assume that the file has changed. For file systems that don't support i_version, instead of assuming that the file has changed, we can at least use i_generation. With Linus' suggested changes, I think this will work nicely. > The IMA code should be able to sample that at measurement time and > either fail or be retried if i_version changes during measurement. > We can then simply make the IMA xattr write conditional on the > i_version value being unchanged from the sample the IMA code passes > into the filesystem once the filesystem holds all the locks it needs > to write the xattr... > I note that IMA already grabs the i_version in > ima_collect_measurement(), so this shouldn't be too hard to do. > Perhaps we don't need any new locks or counterst all, maybe just > the ability to feed a version cookie to the set_xattr method? The security.ima xattr is normally written out in ima_check_last_writer(), not in ima_collect_measurement(). ima_collect_measurement() calculates the file hash for storing in the measurement list (IMA-measurement), verifying the hash/signature (IMA- appraisal) already stored in the xattr, and auditing (IMA-audit). The only time that ima_collect_measurement() writes the file xattr is in "fix" mode. Writing the xattr will need to be deferred until after the iint->mutex is released. There should be no open writers in ima_check_last_writer(), so the file shouldn't be changing. Mimi
On Sun, 2017-10-01 at 15:20 -0700, Linus Torvalds wrote: > On Sun, Oct 1, 2017 at 3:06 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > > > > Unless I misread something it was being pointed out there are some vfs > > operations today on which ima writes an ima xattr as a side effect. And > > those operations hold the i_sem. So perhaps I am misunderstanding > > things or writing the ima xattr needs to happen at some point. Which > > implies something like queued work. > > So the issue is indeed the inode semaphore, as it is used by IMA. But > all these IMA patches to work around the issue are just horribly ugly. > One adds a VFS-layer filesystem method that most filesystems end up > not really needing (it's the same as the regular read), and other > filesystems end up then having hacks with ("oh, I don't need to take > this lock because it was already taken by the caller"). > > The second patch attempt avoided the need for a new filesystem method, > but added a flag in an annoying place (for the same basic logic). The > advantage is that now most filesystems don't actually need to care any > more (and the filesystems that used to care now check that flag). > > There was discussion about moving the flag to a mode convenient spot, > which would have made it a lot less intrusive. > > But the basic issue is that almost always when you see lock > inversions, the problem can just be fixed by doing the locking > differently instead. This is what I've been missing. Thank you for taking the time to understand the problem and explain how! > And that's what I was/am pushing for. > There really are two totally different issues: > > - the integrity _measurement_. > > This one wants to be serialized, so that you don't have multiple > concurrent measurements, and the serialization fundamentally has to be > around all the IO, so this lock pretty much has to be outside the > i_sem. > > - the integrity invalidation on certain operations. > > This one fundamentally had to be inside the i_sem, since some of > the operations that cause this end up already holding the i_sem at a > VFS layer. > > so you had these two different requirements (inside _and_ outside), > and the IMA approach was basically to avoid the problem by making > i_sem *the* lock, and then making the IO routines aware of it already > being held. That does solve the inside/outside issue. > > But the simpler way to fix it is to simply use two locks that nest > inside each other, with i_sem nesting in the middle. That just avoids > the problem entirely, and doesn't require anybody to ever care about > i_sem semantic changes, because i_sem semantics simply didn't change > at all. > > So that's the approach I'm pushing. I admittedly haven't actually > looked at the IMA details, but from a high-level standpoint you can > basically describe it (as above) without having to care too much about > exactly what IMA even wants. > > The two-lock approach does require that the operations that invalidate > the integrity measurements always only invalidate it, and don't try to > re-compute it. But I suspect that would be entirely insane anyway > (imagine a world where "setxattr" would have to read the whole file > contents in order to revalidate the integrity measurement - even if > there is nobody who even *cares*). Right, the setxattr, chmod, chown syscalls just resets the cached flags, which indicate whether the file needs to be re-measured, re- validated, or re-audited. The file hash is not re-calculated at this point. That happens on the next access (in policy). Mimi
Mimi Zohar <zohar@linux.vnet.ibm.com> writes: > On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote: >> On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote: >> > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: >> > > >> > > Right, re-introducing the iint->mutex and a new i_generation field in >> > > the iint struct with a separate set of locks should work. It will be >> > > reset if the file metadata changes (eg. setxattr, chown, chmod). >> > >> > Note that the "inner lock" could possibly be omitted if the >> > invalidation can be just a single atomic instruction. >> > >> > So particularly if invalidation could be just an atomic_inc() on the >> > generation count, there might not need to be any inner lock at all. >> > >> > You'd have to serialize the actual measurement with the "read >> > generation count", but that should be as simple as just doing a >> > smp_rmb() between the "read generation count" and "do measurement on >> > file contents". >> >> We already have a change counter on the inode, which is modified on >> any data or metadata write (i_version) under filesystem locks. The >> i_version counter has well defined semantics - it's required by >> NFSv4 to increment on any metadata or data change - so we should be >> able to rely on it's behaviour to implement IMA as well. Filesystems >> that support i_version are marked with [SB|MS]_I_VERSION in the >> superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA >> can be supported on a specific filesystem (btrfs, ext4, fuse and xfs >> ATM). > > Recently I received a patch to replace i_version with mtime/atime. > Now, even more recently, I received a patch that claims that > i_version is just a performance improvement. For file systems that > don't support i_version, assume that the file has changed. > > For file systems that don't support i_version, instead of assuming > that the file has changed, we can at least use i_generation. > > With Linus' suggested changes, I think this will work nicely. > >> The IMA code should be able to sample that at measurement time and >> either fail or be retried if i_version changes during measurement. >> We can then simply make the IMA xattr write conditional on the >> i_version value being unchanged from the sample the IMA code passes >> into the filesystem once the filesystem holds all the locks it needs >> to write the xattr... > >> I note that IMA already grabs the i_version in >> ima_collect_measurement(), so this shouldn't be too hard to do. >> Perhaps we don't need any new locks or counterst all, maybe just >> the ability to feed a version cookie to the set_xattr method? > > The security.ima xattr is normally written out in > ima_check_last_writer(), not in ima_collect_measurement(). > ima_collect_measurement() calculates the file hash for storing in the > measurement list (IMA-measurement), verifying the hash/signature (IMA- > appraisal) already stored in the xattr, and auditing (IMA-audit). > > The only time that ima_collect_measurement() writes the file xattr is > in "fix" mode. Writing the xattr will need to be deferred until after > the iint->mutex is released. > > There should be no open writers in ima_check_last_writer(), so the > file shouldn't be changing. This is slightly tangential but I think important to consider. What do you do about distributed filesystems fuse, nfs, etc that can change the data behind the kernels back. Do you not support such systems or do you have a sufficient way to detect changes? Eric
On Sun, Oct 01, 2017 at 04:15:07PM -0700, Linus Torvalds wrote: > On Sun, Oct 1, 2017 at 3:34 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > We already have a change counter on the inode, which is modified on > > any data or metadata write (i_version) under filesystem locks. The > > i_version counter has well defined semantics - it's required by > > NFSv4 to increment on any metadata or data change - so we should be > > able to rely on it's behaviour to implement IMA as well. > > I actually think i_version has exactly the wrong semantics. > > Afaik, it doesn't actually version the file _data_ at all, it only > versions "inode itself changed". No, the NFSv4 change attribute must change if either data or metadata on the inode is changed, and be consistent and persistent across server crashes. For data updates, they piggy back on mtime updates .... > But I might have missed something obvious. The updates are hidden in > some odd places sometimes. ... which are in file_update_time(). Hence every data write or write page fault will call file_update_time() and trigger an i_version increment, even if the mtime doesn't change. Cheers, Dave.
On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote: > On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote: > > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote: > > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: > > > > > > > > Right, re-introducing the iint->mutex and a new i_generation field in > > > > the iint struct with a separate set of locks should work. It will be > > > > reset if the file metadata changes (eg. setxattr, chown, chmod). > > > > > > Note that the "inner lock" could possibly be omitted if the > > > invalidation can be just a single atomic instruction. > > > > > > So particularly if invalidation could be just an atomic_inc() on the > > > generation count, there might not need to be any inner lock at all. > > > > > > You'd have to serialize the actual measurement with the "read > > > generation count", but that should be as simple as just doing a > > > smp_rmb() between the "read generation count" and "do measurement on > > > file contents". > > > > We already have a change counter on the inode, which is modified on > > any data or metadata write (i_version) under filesystem locks. The > > i_version counter has well defined semantics - it's required by > > NFSv4 to increment on any metadata or data change - so we should be > > able to rely on it's behaviour to implement IMA as well. Filesystems > > that support i_version are marked with [SB|MS]_I_VERSION in the > > superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA > > can be supported on a specific filesystem (btrfs, ext4, fuse and xfs > > ATM). > > Recently I received a patch to replace i_version with mtime/atime. mtime is not guaranteed to change on data writes - the resolution of the filesystem timestamps may mean mtime only changes once a second regardless of the number of writes performed to that file. That's why NFS can't use it as a change attribute, and hence we have i_version.... > Now, even more recently, I received a patch that claims that > i_version is just a performance improvement. Did you ask them to explain/quantify the performance improvement? e.g. Using i_version on XFS slows down performance on small writes by 2-3% because i_version because all data writes log a version change rather than only logging a change when mtime updates. We take that penalty because NFS requires specific change attribute behaviour, otherwise we wouldn't have implemented it at all in XFS... > For file systems that > don't support i_version, assume that the file has changed. > > For file systems that don't support i_version, instead of assuming > that the file has changed, we can at least use i_generation. I'm not sure what you mean here - the struct inode already has a i_generation variable. It's a lifecycle indicator used to discriminate between alloc/free cycles on the same inode number. i.e. It only changes at inode allocation time, not whenever the data in the inode changes... > With Linus' suggested changes, I think this will work nicely. > > > The IMA code should be able to sample that at measurement time and > > either fail or be retried if i_version changes during measurement. > > We can then simply make the IMA xattr write conditional on the > > i_version value being unchanged from the sample the IMA code passes > > into the filesystem once the filesystem holds all the locks it needs > > to write the xattr... > > > I note that IMA already grabs the i_version in > > ima_collect_measurement(), so this shouldn't be too hard to do. > > Perhaps we don't need any new locks or counterst all, maybe just > > the ability to feed a version cookie to the set_xattr method? > > The security.ima xattr is normally written out in > ima_check_last_writer(), not in ima_collect_measurement(). Which, if IIUC, does this to measure and update the xattr: ima_check_last_writer -> ima_update_xattr -> ima_collect_measurement -> ima_fix_xattr > ima_collect_measurement() calculates the file hash for storing in the > measurement list (IMA-measurement), verifying the hash/signature (IMA- > appraisal) already stored in the xattr, and auditing (IMA-audit). Yup, and it samples the i_version before it calculates the hash and stores it in the iint, which then gets passed to ima_fix_xattr(). Looks like all that is needed is to pass the i_version back to the filesystem through the xattr call.... IOWs, sample the i_version early while we hold the inode lock and check the writer count, then if it is the last writer drop the inode lock and call ima_update_xattr(). The sampled i_version then tells us if the file has changed before we write the updated xattr... > The only time that ima_collect_measurement() writes the file xattr is > in "fix" mode. Writing the xattr will need to be deferred until after > the iint->mutex is released. ima_collect_measurement() doesn't write an xattr at all - it just reads the file data and calculates the hash. > There should be no open writers in ima_check_last_writer(), so the > file shouldn't be changing. If that code is not holding the inode i_rwsem across ima_update_xattr(), then the writer check is racy as hell. We're trying to get rid of the need for this code to hold the inode lock to stabilise the writer count for the entire operation, and it looks to me like everything is there to use the i_version to ensure the the IMA code doesn't need to hold the inode lock across ima_collect_measurement() and ima_fix_xattr()... Cheers, Dave.
On Mon, 2017-10-02 at 15:35 +1100, Dave Chinner wrote: > On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote: > > On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote: > > > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote: > > > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet.ibm.com> wrote: > > > > > > > > > > Right, re-introducing the iint->mutex and a new i_generation field in > > > > > the iint struct with a separate set of locks should work. It will be > > > > > reset if the file metadata changes (eg. setxattr, chown, chmod). > > > > > > > > Note that the "inner lock" could possibly be omitted if the > > > > invalidation can be just a single atomic instruction. > > > > > > > > So particularly if invalidation could be just an atomic_inc() on the > > > > generation count, there might not need to be any inner lock at all. > > > > > > > > You'd have to serialize the actual measurement with the "read > > > > generation count", but that should be as simple as just doing a > > > > smp_rmb() between the "read generation count" and "do measurement on > > > > file contents". > > > > > > We already have a change counter on the inode, which is modified on > > > any data or metadata write (i_version) under filesystem locks. The > > > i_version counter has well defined semantics - it's required by > > > NFSv4 to increment on any metadata or data change - so we should be > > > able to rely on it's behaviour to implement IMA as well. Filesystems > > > that support i_version are marked with [SB|MS]_I_VERSION in the > > > superblock (IS_I_VERSION(inode)) so it should be easy to tell if IMA > > > can be supported on a specific filesystem (btrfs, ext4, fuse and xfs > > > ATM). > > > > Recently I received a patch to replace i_version with mtime/atime. > > mtime is not guaranteed to change on data writes - the resolution of > the filesystem timestamps may mean mtime only changes once a second > regardless of the number of writes performed to that file. That's > why NFS can't use it as a change attribute, and hence we have > i_version.... > > > Now, even more recently, I received a patch that claims that > > i_version is just a performance improvement. > > Did you ask them to explain/quantify the performance improvement? Using i_version is a performance improvement as opposed to always calculating the file hash and writing the xattr. The patch is intended for filesystems that don't support i_version (eg. ubifs). > e.g. Using i_version on XFS slows down performance on small > writes by 2-3% because i_version because all data writes log a > version change rather than only logging a change when mtime updates. > We take that penalty because NFS requires specific change attribute > behaviour, otherwise we wouldn't have implemented it at all in > XFS... > > > For file systems that > > don't support i_version, assume that the file has changed. > > > > For file systems that don't support i_version, instead of assuming > > that the file has changed, we can at least use i_generation. > > I'm not sure what you mean here - the struct inode already has a > i_generation variable. It's a lifecycle indicator used to > discriminate between alloc/free cycles on the same inode number. > i.e. It only changes at inode allocation time, not whenever the data > in the inode changes... Sigh, my error. > > > With Linus' suggested changes, I think this will work nicely. > > > > > The IMA code should be able to sample that at measurement time and > > > either fail or be retried if i_version changes during measurement. > > > We can then simply make the IMA xattr write conditional on the > > > i_version value being unchanged from the sample the IMA code passes > > > into the filesystem once the filesystem holds all the locks it needs > > > to write the xattr... > > > > > I note that IMA already grabs the i_version in > > > ima_collect_measurement(), so this shouldn't be too hard to do. > > > Perhaps we don't need any new locks or counterst all, maybe just > > > the ability to feed a version cookie to the set_xattr method? > > > > The security.ima xattr is normally written out in > > ima_check_last_writer(), not in ima_collect_measurement(). > > Which, if IIUC, does this to measure and update the xattr: > > ima_check_last_writer > -> ima_update_xattr > -> ima_collect_measurement > -> ima_fix_xattr > > > ima_collect_measurement() calculates the file hash for storing in the > > measurement list (IMA-measurement), verifying the hash/signature (IMA- > > appraisal) already stored in the xattr, and auditing (IMA-audit). > > Yup, and it samples the i_version before it calculates the hash and > stores it in the iint, which then gets passed to ima_fix_xattr(). > Looks like all that is needed is to pass the i_version back to the > filesystem through the xattr call.... > > IOWs, sample the i_version early while we hold the inode lock and > check the writer count, then if it is the last writer drop the inode > lock and call ima_update_xattr(). The sampled i_version then tells > us if the file has changed before we write the updated xattr... > > > The only time that ima_collect_measurement() writes the file xattr is > > in "fix" mode. Writing the xattr will need to be deferred until after > > the iint->mutex is released. > > ima_collect_measurement() doesn't write an xattr at all - it just > reads the file data and calculates the hash. There's another call to ima_fix_xattr() from ima_appraise_measurement(). > > There should be no open writers in ima_check_last_writer(), so the > > file shouldn't be changing. > > If that code is not holding the inode i_rwsem across > ima_update_xattr(), then the writer check is racy as hell. We're > trying to get rid of the need for this code to hold the inode lock > to stabilise the writer count for the entire operation, and it looks > to me like everything is there to use the i_version to ensure the > the IMA code doesn't need to hold the inode lock across > ima_collect_measurement() and ima_fix_xattr()... Ok Mimi
On Sun, 2017-10-01 at 22:25 -0500, Eric W. Biederman wrote: > Mimi Zohar <zohar@linux.vnet.ibm.com> writes: > > There should be no open writers in ima_check_last_writer(), so the > > file shouldn't be changing. > > This is slightly tangential but I think important to consider. > What do you do about distributed filesystems fuse, nfs, etc that > can change the data behind the kernels back. Exactly! > Do you not support such systems or do you have a sufficient way to > detect changes? Currently, only the initial file access in policy is measured, verified, audited. Even if there was a way of detecting the change, since we can't trust these file systems, the performance would be awful, but we should probably not be caching the measurement/verification results. Mimi
On Mon, 2017-10-02 at 08:09 -0400, Mimi Zohar wrote: > On Mon, 2017-10-02 at 15:35 +1100, Dave Chinner wrote: > > On Sun, Oct 01, 2017 at 07:42:42PM -0400, Mimi Zohar wrote: > > > On Mon, 2017-10-02 at 09:34 +1100, Dave Chinner wrote: > > > > On Sun, Oct 01, 2017 at 11:41:48AM -0700, Linus Torvalds wrote: > > > > > On Sun, Oct 1, 2017 at 5:08 AM, Mimi Zohar <zohar@linux.vnet. > > > > > ibm.com> wrote: > > > > > > > > > > > > Right, re-introducing the iint->mutex and a new > > > > > > i_generation field in > > > > > > the iint struct with a separate set of locks should > > > > > > work. It will be > > > > > > reset if the file metadata changes (eg. setxattr, chown, > > > > > > chmod). > > > > > > > > > > Note that the "inner lock" could possibly be omitted if the > > > > > invalidation can be just a single atomic instruction. > > > > > > > > > > So particularly if invalidation could be just an atomic_inc() > > > > > on the > > > > > generation count, there might not need to be any inner lock > > > > > at all. > > > > > > > > > > You'd have to serialize the actual measurement with the "read > > > > > generation count", but that should be as simple as just doing > > > > > a > > > > > smp_rmb() between the "read generation count" and "do > > > > > measurement on > > > > > file contents". > > > > > > > > We already have a change counter on the inode, which is > > > > modified on > > > > any data or metadata write (i_version) under filesystem > > > > locks. The > > > > i_version counter has well defined semantics - it's required by > > > > NFSv4 to increment on any metadata or data change - so we > > > > should be > > > > able to rely on it's behaviour to implement IMA as well. > > > > Filesystems > > > > that support i_version are marked with [SB|MS]_I_VERSION in the > > > > superblock (IS_I_VERSION(inode)) so it should be easy to tell > > > > if IMA > > > > can be supported on a specific filesystem (btrfs, ext4, fuse > > > > and xfs > > > > ATM). > > > > > > Recently I received a patch to replace i_version with > > > mtime/atime. > > I assume you're talking here about the patch I sent a few months ago. I specifically do _not_ want to replace i_version with the mtime/atime. The point there was to stop trying to use i_version on filesystems that don't properly implement it (which is most of them). The next best approximation on those filesystems is the mtime. It's not perfect, but it's better than nothing (which is what you have now on filesystems that never increment i_version on writes). IOW, it just added a fallback for when you can't count on the i_version changing. (BTW: atime is worthless here -- who cares if the thing was accessed? IIUC, we only care if something changed.) Ideally, all filesystems would implement i_version properly. In practice, that's a tall order as that may require on-disk changes for some of them. That's not always possible where cross-OS compatibility is necessary (e.g. FAT or NTFS). > > mtime is not guaranteed to change on data writes - the resolution > > of > > the filesystem timestamps may mean mtime only changes once a second > > regardless of the number of writes performed to that file. That's > > why NFS can't use it as a change attribute, and hence we have > > i_version.... > > > > > Now, even more recently, I received a patch that claims that > > > i_version is just a performance improvement. > > > > Did you ask them to explain/quantify the performance improvement? > > Using i_version is a performance improvement as opposed to always > calculating the file hash and writing the xattr. The patch is > intended for filesystems that don't support i_version (eg. ubifs). > > > e.g. Using i_version on XFS slows down performance on small > > writes by 2-3% because i_version because all data writes log a > > version change rather than only logging a change when mtime > > updates. > > We take that penalty because NFS requires specific change attribute > > behaviour, otherwise we wouldn't have implemented it at all in > > XFS... > > > > > For file systems that > > > don't support i_version, assume that the file has changed. > > > > > > For file systems that don't support i_version, instead of > > > assuming > > > that the file has changed, we can at least use i_generation. > > > > I'm not sure what you mean here - the struct inode already has a > > i_generation variable. It's a lifecycle indicator used to > > discriminate between alloc/free cycles on the same inode number. > > i.e. It only changes at inode allocation time, not whenever the > > data > > in the inode changes... > > Sigh, my error. > > > > > > With Linus' suggested changes, I think this will work nicely. > > > > > > > The IMA code should be able to sample that at measurement time > > > > and > > > > either fail or be retried if i_version changes during > > > > measurement. > > > > We can then simply make the IMA xattr write conditional on the > > > > i_version value being unchanged from the sample the IMA code > > > > passes > > > > into the filesystem once the filesystem holds all the locks it > > > > needs > > > > to write the xattr... > > > > I note that IMA already grabs the i_version in > > > > ima_collect_measurement(), so this shouldn't be too hard to do. > > > > Perhaps we don't need any new locks or counterst all, maybe > > > > just > > > > the ability to feed a version cookie to the set_xattr method? > > > > > > The security.ima xattr is normally written out in > > > ima_check_last_writer(), not in ima_collect_measurement(). > > > > Which, if IIUC, does this to measure and update the xattr: > > > > ima_check_last_writer > > -> ima_update_xattr > > -> ima_collect_measurement > > -> ima_fix_xattr > > > > > ima_collect_measurement() calculates the file hash for storing > > > in the > > > measurement list (IMA-measurement), verifying the hash/signature > > > (IMA- > > > appraisal) already stored in the xattr, and auditing (IMA-audit). > > > > Yup, and it samples the i_version before it calculates the hash and > > stores it in the iint, which then gets passed to ima_fix_xattr(). > > Looks like all that is needed is to pass the i_version back to the > > filesystem through the xattr call.... > > > > IOWs, sample the i_version early while we hold the inode lock and > > check the writer count, then if it is the last writer drop the > > inode > > lock and call ima_update_xattr(). The sampled i_version then tells > > us if the file has changed before we write the updated xattr... > > > > > The only time that ima_collect_measurement() writes the file > > > xattr is > > > in "fix" mode. Writing the xattr will need to be deferred until > > > after > > > the iint->mutex is released. > > > > ima_collect_measurement() doesn't write an xattr at all - it just > > reads the file data and calculates the hash. > > There's another call to ima_fix_xattr() from > ima_appraise_measurement(). > > > > There should be no open writers in ima_check_last_writer(), so > > > the > > > file shouldn't be changing. > > > > If that code is not holding the inode i_rwsem across > > ima_update_xattr(), then the writer check is racy as hell. We're > > trying to get rid of the need for this code to hold the inode lock > > to stabilise the writer count for the entire operation, and it > > looks > > to me like everything is there to use the i_version to ensure the > > the IMA code doesn't need to hold the inode lock across > > ima_collect_measurement() and ima_fix_xattr()... > > Ok > > Mimi >
diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 839095f66d8d..d174b2880929 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -38,9 +38,11 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to, if (!iov_iter_count(to)) return 0; /* skip atime */ - inode_lock_shared(inode); + if (!rwf) + inode_lock_shared(inode); ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops); - inode_unlock_shared(inode); + if (!rwf) + inode_unlock_shared(inode); file_accessed(iocb->ki_filp); return ret; diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 10789666725e..7d7c0e380add 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -38,7 +38,7 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to, struct inode *inode = file_inode(iocb->ki_filp); ssize_t ret; - if (!inode_trylock_shared(inode)) { + if (!rwf && !inode_trylock_shared(inode)) { if (iocb->ki_flags & IOCB_NOWAIT) return -EAGAIN; inode_lock_shared(inode); @@ -48,12 +48,14 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to, * change anymore */ if (!IS_DAX(inode)) { - inode_unlock_shared(inode); + if (!rwf) + inode_unlock_shared(inode); /* Fallback to buffered IO in case we cannot support DAX */ return generic_file_read_iter(iocb, to, rwf); } ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops); - inode_unlock_shared(inode); + if (!rwf) + inode_unlock_shared(inode); file_accessed(iocb->ki_filp); return ret; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index cf1ce8961601..0cffca97ed68 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -239,7 +239,7 @@ xfs_file_dax_read( if (!count) return 0; /* skip atime */ - if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) { + if (!rwf && !xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) { if (iocb->ki_flags & IOCB_NOWAIT) return -EAGAIN; xfs_ilock(ip, XFS_IOLOCK_SHARED); @@ -247,7 +247,8 @@ xfs_file_dax_read( ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops); xfs_iunlock(ip, XFS_IOLOCK_SHARED); - file_accessed(iocb->ki_filp); + if (!rwf) + file_accessed(iocb->ki_filp); return ret; } @@ -262,13 +263,14 @@ xfs_file_buffered_aio_read( trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos); - if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) { + if (!rwf && !xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) { if (iocb->ki_flags & IOCB_NOWAIT) return -EAGAIN; xfs_ilock(ip, XFS_IOLOCK_SHARED); } ret = generic_file_read_iter(iocb, to, rwf); - xfs_iunlock(ip, XFS_IOLOCK_SHARED); + if (!rwf) + xfs_iunlock(ip, XFS_IOLOCK_SHARED); return ret; }
Don't attempt to take the i_rwsem, if it has already been taken exclusively. Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com> --- fs/ext2/file.c | 6 ++++-- fs/ext4/file.c | 8 +++++--- fs/xfs/xfs_file.c | 10 ++++++---- 3 files changed, 15 insertions(+), 9 deletions(-)