diff mbox

[RFC] freeing unliked file indefinitely delayed

Message ID 20150708014237.GC17109@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Al Viro July 8, 2015, 1:42 a.m. UTC
Normally opening a file, unlinking it and then closing will have
the inode freed upon close() (provided that it's not otherwise busy and
has no remaining links, of course).  However, there's one case where that
does *not* happen.  Namely, if you open it by fhandle with cold dcache,
then unlink() and close().

	In normal case you get d_delete() in unlink(2) notice that dentry
is busy and unhash it; on the final dput() it will be forcibly evicted from
dcache, triggering iput() and inode removal.  In this case, though, we end
up with *two* dentries - disconnected (created by open-by-fhandle) and
regular one (used by unlink()).  The latter will have its reference to inode
dropped just fine, but the former will not - it's considered hashed (it
is on the ->s_anon list), so it will stay around until the memory pressure
will finally do it in.  As the result, we have the final iput() delayed
indefinitely.  It's trivial to reproduce -

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

void flush_dcache(void)
{
        system("mount -o remount,rw /");
}

static char buf[20 * 1024 * 1024];

main()
{
        int fd;
        union { 
                struct file_handle f;
                char buf[MAX_HANDLE_SZ];
        } x;
        int m;

        x.f.handle_bytes = sizeof(x);
        chdir("/root");
        mkdir("foo", 0700);
        fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
        close(fd);
        name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
        flush_dcache();
        fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
        unlink("foo/bar");
        write(fd, buf, sizeof(buf));
        system("df .");			/* 20Mb eaten */
        close(fd);
        system("df .");			/* should've freed those 20Mb */
        flush_dcache();
        system("df .");			/* should be the same as #2 */
}

will spit out something like
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/root         322023 303843      1131 100% /
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/root         322023 303843      1131 100% /
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/root         322023 283282     21692  93% /
- inode gets freed only when dentry is finally evicted (here we trigger
than by remount; normally it would've happened in response to memory
pressure hell knows when).

IMO it's a bug.  Between the close() and final flush_dcache() the file has
no surviving links, is *not* busy, won't show up in fuser/lsof/whatnot
output, and yet it's still not freed.  I'm not saying that it's realistically
exploitable (albeit with nfsd it might be), but it's a very unpleasant
self-LART.

FWIW, my prefered fix would be simply to have the final dput() treat
disconnected dentries as "kill on sight"; checking for i_nlink of the
inode, as Bruce suggested several years ago, will *not* work, simply
because having another link to that file and unlinking it after close
will reproduce the situation - disconnected dentry sticks around in
dcache past its final dput() and past the last unlink() of our file.
Theoretically it might cause an overhead for nfsd (no_subtree_check v3
export might see more d_alloc()/d_free(); icache retention will still
prevent constant rereading the inode from disk).  _IF_ that proves to
be noticable, we might need to come up with something more elaborate
(e.g. have unlink() and rename() kick disconnected aliases out if the link
count has reached 0), but it's more complex and needs careful ananlysis
of correctness - we need to prove that there's no way to miss the link
count reaching 0.  I would prefer to treat all disconnected as unhashed
for dcache retention purposes - it's simpler and less brittle.  Comments?
I mean something like this:

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

J. Bruce Fields July 8, 2015, 2:39 a.m. UTC | #1
On Wed, Jul 08, 2015 at 02:42:38AM +0100, Al Viro wrote:
> 	Normally opening a file, unlinking it and then closing will have
> the inode freed upon close() (provided that it's not otherwise busy and
> has no remaining links, of course).  However, there's one case where that
> does *not* happen.  Namely, if you open it by fhandle with cold dcache,
> then unlink() and close().
> 
> 	In normal case you get d_delete() in unlink(2) notice that dentry
> is busy and unhash it; on the final dput() it will be forcibly evicted from
> dcache, triggering iput() and inode removal.  In this case, though, we end
> up with *two* dentries - disconnected (created by open-by-fhandle) and
> regular one (used by unlink()).  The latter will have its reference to inode
> dropped just fine, but the former will not - it's considered hashed (it
> is on the ->s_anon list), so it will stay around until the memory pressure
> will finally do it in.  As the result, we have the final iput() delayed
> indefinitely.  It's trivial to reproduce -
> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> 
> void flush_dcache(void)
> {
>         system("mount -o remount,rw /");
> }
> 
> static char buf[20 * 1024 * 1024];
> 
> main()
> {
>         int fd;
>         union { 
>                 struct file_handle f;
>                 char buf[MAX_HANDLE_SZ];
>         } x;
>         int m;
> 
>         x.f.handle_bytes = sizeof(x);
>         chdir("/root");
>         mkdir("foo", 0700);
>         fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
>         close(fd);
>         name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
>         flush_dcache();
>         fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
>         unlink("foo/bar");
>         write(fd, buf, sizeof(buf));
>         system("df .");			/* 20Mb eaten */
>         close(fd);
>         system("df .");			/* should've freed those 20Mb */
>         flush_dcache();
>         system("df .");			/* should be the same as #2 */
> }
> 
> will spit out something like
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 303843      1131 100% /
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 303843      1131 100% /
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 283282     21692  93% /
> - inode gets freed only when dentry is finally evicted (here we trigger
> than by remount; normally it would've happened in response to memory
> pressure hell knows when).
> 
> IMO it's a bug.

Definitely agreed.

> Between the close() and final flush_dcache() the file has
> no surviving links, is *not* busy, won't show up in fuser/lsof/whatnot
> output, and yet it's still not freed.  I'm not saying that it's realistically
> exploitable (albeit with nfsd it might be)

I'd be surprised if it doesn't happen.  This, for example, is a "space
on nfs export not freed when I expected it after an unlink" report that
could easily have the same cause:

	https://www.redhat.com/archives/linux-cluster/2015-May/msg00000.html

(Not sure, I'll get back to them and see if they can confirm.)

> but it's a very unpleasant self-LART.

> FWIW, my prefered fix would be simply to have the final dput() treat
> disconnected dentries as "kill on sight"; checking for i_nlink of the
> inode, as Bruce suggested several years ago, will *not* work, simply
> because having another link to that file and unlinking it after close
> will reproduce the situation - disconnected dentry sticks around in
> dcache past its final dput() and past the last unlink() of our file.
> Theoretically it might cause an overhead for nfsd (no_subtree_check v3
> export might see more d_alloc()/d_free(); icache retention will still
> prevent constant rereading the inode from disk).  _IF_ that proves to
> be noticable, we might need to come up with something more elaborate
> (e.g. have unlink() and rename() kick disconnected aliases out if the link
> count has reached 0), but it's more complex and needs careful ananlysis
> of correctness - we need to prove that there's no way to miss the link
> count reaching 0.  I would prefer to treat all disconnected as unhashed
> for dcache retention purposes - it's simpler and less brittle.  Comments?
> I mean something like this:

ACK from me.

--b.

> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 7a3f3e5..5c8ea15 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -642,7 +642,7 @@ static inline bool fast_dput(struct dentry *dentry)
>  
>  	/*
>  	 * If we have a d_op->d_delete() operation, we sould not
> -	 * let the dentry count go to zero, so use "put__or_lock".
> +	 * let the dentry count go to zero, so use "put_or_lock".
>  	 */
>  	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE))
>  		return lockref_put_or_lock(&dentry->d_lockref);
> @@ -697,7 +697,7 @@ static inline bool fast_dput(struct dentry *dentry)
>  	 */
>  	smp_rmb();
>  	d_flags = ACCESS_ONCE(dentry->d_flags);
> -	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST;
> +	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
>  
>  	/* Nothing to do? Dropping the reference was all we needed? */
>  	if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
> @@ -776,6 +776,9 @@ repeat:
>  	if (unlikely(d_unhashed(dentry)))
>  		goto kill_it;
>  
> +	if (unlikely(dentry->d_flags & DCACHE_DISCONNECTED))
> +		goto kill_it;
> +
>  	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
>  		if (dentry->d_op->d_delete(dentry))
>  			goto kill_it;
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Myers July 8, 2015, 3:41 p.m. UTC | #2
Hey Al,

On Wed, Jul 08, 2015 at 02:42:38AM +0100, Al Viro wrote:
> 	Normally opening a file, unlinking it and then closing will have
> the inode freed upon close() (provided that it's not otherwise busy and
> has no remaining links, of course).  However, there's one case where that
> does *not* happen.  Namely, if you open it by fhandle with cold dcache,
> then unlink() and close().
> 
> 	In normal case you get d_delete() in unlink(2) notice that dentry
> is busy and unhash it; on the final dput() it will be forcibly evicted from
> dcache, triggering iput() and inode removal.  In this case, though, we end
> up with *two* dentries - disconnected (created by open-by-fhandle) and
> regular one (used by unlink()).  The latter will have its reference to inode
> dropped just fine, but the former will not - it's considered hashed (it
> is on the ->s_anon list), so it will stay around until the memory pressure
> will finally do it in.  As the result, we have the final iput() delayed
> indefinitely.  It's trivial to reproduce -
> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> 
> void flush_dcache(void)
> {
>         system("mount -o remount,rw /");
> }
> 
> static char buf[20 * 1024 * 1024];
> 
> main()
> {
>         int fd;
>         union { 
>                 struct file_handle f;
>                 char buf[MAX_HANDLE_SZ];
>         } x;
>         int m;
> 
>         x.f.handle_bytes = sizeof(x);
>         chdir("/root");
>         mkdir("foo", 0700);
>         fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
>         close(fd);
>         name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
>         flush_dcache();
>         fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
>         unlink("foo/bar");
>         write(fd, buf, sizeof(buf));
>         system("df .");			/* 20Mb eaten */
>         close(fd);
>         system("df .");			/* should've freed those 20Mb */
>         flush_dcache();
>         system("df .");			/* should be the same as #2 */
> }
> 
> will spit out something like
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 303843      1131 100% /
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 303843      1131 100% /
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 283282     21692  93% /
> - inode gets freed only when dentry is finally evicted (here we trigger
> than by remount; normally it would've happened in response to memory
> pressure hell knows when).
> 
> IMO it's a bug.  Between the close() and final flush_dcache() the file has
> no surviving links, is *not* busy, won't show up in fuser/lsof/whatnot
> output, and yet it's still not freed.  I'm not saying that it's realistically
> exploitable (albeit with nfsd it might be), but it's a very unpleasant
> self-LART.
> 
> FWIW, my prefered fix would be simply to have the final dput() treat
> disconnected dentries as "kill on sight"; checking for i_nlink of the
> inode, as Bruce suggested several years ago, will *not* work, simply
> because having another link to that file and unlinking it after close
> will reproduce the situation - disconnected dentry sticks around in
> dcache past its final dput() and past the last unlink() of our file.
> Theoretically it might cause an overhead for nfsd (no_subtree_check v3
> export might see more d_alloc()/d_free(); icache retention will still
> prevent constant rereading the inode from disk).  _IF_ that proves to
> be noticable, we might need to come up with something more elaborate
> (e.g. have unlink() and rename() kick disconnected aliases out if the link
> count has reached 0), but it's more complex and needs careful ananlysis
> of correctness - we need to prove that there's no way to miss the link
> count reaching 0.

The bug rings a bell for me so I will stick my neck out instead of
lurking.  Don't you need to sample that link count under the filesystems
internal lock in order to avoid an unlink/iget race?  I suggest creating
a helper to prune disconnected dentries which a filesystem could call in
.unlink.  That would avoid the risk of unintended side effects with the
d_alloc/d_free/icache approach and have provable link count correctness.

Thanks,
Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ian Kent July 9, 2015, 11:17 a.m. UTC | #3
On Wed, 2015-07-08 at 02:42 +0100, Al Viro wrote:
> 	Normally opening a file, unlinking it and then closing will have
> the inode freed upon close() (provided that it's not otherwise busy and
> has no remaining links, of course).  However, there's one case where that
> does *not* happen.  Namely, if you open it by fhandle with cold dcache,
> then unlink() and close().
> 
> 	In normal case you get d_delete() in unlink(2) notice that dentry
> is busy and unhash it; on the final dput() it will be forcibly evicted from
> dcache, triggering iput() and inode removal.  In this case, though, we end
> up with *two* dentries - disconnected (created by open-by-fhandle) and
> regular one (used by unlink()).  The latter will have its reference to inode
> dropped just fine, but the former will not - it's considered hashed (it
> is on the ->s_anon list), so it will stay around until the memory pressure
> will finally do it in.  As the result, we have the final iput() delayed
> indefinitely.  It's trivial to reproduce -
> 
> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> 
> void flush_dcache(void)
> {
>         system("mount -o remount,rw /");
> }
> 
> static char buf[20 * 1024 * 1024];
> 
> main()
> {
>         int fd;
>         union { 
>                 struct file_handle f;
>                 char buf[MAX_HANDLE_SZ];
>         } x;
>         int m;
> 
>         x.f.handle_bytes = sizeof(x);
>         chdir("/root");
>         mkdir("foo", 0700);
>         fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
>         close(fd);
>         name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
>         flush_dcache();
>         fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
>         unlink("foo/bar");
>         write(fd, buf, sizeof(buf));
>         system("df .");			/* 20Mb eaten */
>         close(fd);
>         system("df .");			/* should've freed those 20Mb */
>         flush_dcache();
>         system("df .");			/* should be the same as #2 */
> }
> 
> will spit out something like
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 303843      1131 100% /
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 303843      1131 100% /
> Filesystem     1K-blocks   Used Available Use% Mounted on
> /dev/root         322023 283282     21692  93% /
> - inode gets freed only when dentry is finally evicted (here we trigger
> than by remount; normally it would've happened in response to memory
> pressure hell knows when).
> 
> IMO it's a bug.  Between the close() and final flush_dcache() the file has
> no surviving links, is *not* busy, won't show up in fuser/lsof/whatnot
> output, and yet it's still not freed.  I'm not saying that it's realistically
> exploitable (albeit with nfsd it might be), but it's a very unpleasant
> self-LART.
> 
> FWIW, my prefered fix would be simply to have the final dput() treat
> disconnected dentries as "kill on sight"; checking for i_nlink of the
> inode, as Bruce suggested several years ago, will *not* work, simply
> because having another link to that file and unlinking it after close
> will reproduce the situation - disconnected dentry sticks around in
> dcache past its final dput() and past the last unlink() of our file.
> Theoretically it might cause an overhead for nfsd (no_subtree_check v3
> export might see more d_alloc()/d_free(); icache retention will still
> prevent constant rereading the inode from disk).  _IF_ that proves to
> be noticable, we might need to come up with something more elaborate
> (e.g. have unlink() and rename() kick disconnected aliases out if the link
> count has reached 0), but it's more complex and needs careful ananlysis
> of correctness - we need to prove that there's no way to miss the link
> count reaching 0.  I would prefer to treat all disconnected as unhashed
> for dcache retention purposes - it's simpler and less brittle.  Comments?
> I mean something like this:

Al, help me out here, I'm struggling to understand where these dentrys
come from (apart from your reproducer).

For example, on the heavily patched 2.6.32 kernel where this was first
seen the problem dentry is annoymous, refcount 0, and unhashed.

But the dentrys that will most likely face summary execution will be
hashed, such as was the case on that 2.6.32 kernel at dput().

Doesn't that mean that something dropped the dentry after the dput(),
that will now also free the dentry, that took the refcount to 0?

Ian

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ian Kent July 9, 2015, 11:26 a.m. UTC | #4
On Thu, 2015-07-09 at 19:17 +0800, Ian Kent wrote:
> On Wed, 2015-07-08 at 02:42 +0100, Al Viro wrote:
> > 	Normally opening a file, unlinking it and then closing will have
> > the inode freed upon close() (provided that it's not otherwise busy and
> > has no remaining links, of course).  However, there's one case where that
> > does *not* happen.  Namely, if you open it by fhandle with cold dcache,
> > then unlink() and close().
> > 
> > 	In normal case you get d_delete() in unlink(2) notice that dentry
> > is busy and unhash it; on the final dput() it will be forcibly evicted from
> > dcache, triggering iput() and inode removal.  In this case, though, we end
> > up with *two* dentries - disconnected (created by open-by-fhandle) and
> > regular one (used by unlink()).  The latter will have its reference to inode
> > dropped just fine, but the former will not - it's considered hashed (it
> > is on the ->s_anon list), so it will stay around until the memory pressure
> > will finally do it in.  As the result, we have the final iput() delayed
> > indefinitely.  It's trivial to reproduce -
> > 
> > #define _GNU_SOURCE
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > 
> > void flush_dcache(void)
> > {
> >         system("mount -o remount,rw /");
> > }
> > 
> > static char buf[20 * 1024 * 1024];
> > 
> > main()
> > {
> >         int fd;
> >         union { 
> >                 struct file_handle f;
> >                 char buf[MAX_HANDLE_SZ];
> >         } x;
> >         int m;
> > 
> >         x.f.handle_bytes = sizeof(x);
> >         chdir("/root");
> >         mkdir("foo", 0700);
> >         fd = open("foo/bar", O_CREAT | O_RDWR, 0600);
> >         close(fd);
> >         name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0);
> >         flush_dcache();
> >         fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR);
> >         unlink("foo/bar");
> >         write(fd, buf, sizeof(buf));
> >         system("df .");			/* 20Mb eaten */
> >         close(fd);
> >         system("df .");			/* should've freed those 20Mb */
> >         flush_dcache();
> >         system("df .");			/* should be the same as #2 */
> > }
> > 
> > will spit out something like
> > Filesystem     1K-blocks   Used Available Use% Mounted on
> > /dev/root         322023 303843      1131 100% /
> > Filesystem     1K-blocks   Used Available Use% Mounted on
> > /dev/root         322023 303843      1131 100% /
> > Filesystem     1K-blocks   Used Available Use% Mounted on
> > /dev/root         322023 283282     21692  93% /
> > - inode gets freed only when dentry is finally evicted (here we trigger
> > than by remount; normally it would've happened in response to memory
> > pressure hell knows when).
> > 
> > IMO it's a bug.  Between the close() and final flush_dcache() the file has
> > no surviving links, is *not* busy, won't show up in fuser/lsof/whatnot
> > output, and yet it's still not freed.  I'm not saying that it's realistically
> > exploitable (albeit with nfsd it might be), but it's a very unpleasant
> > self-LART.
> > 
> > FWIW, my prefered fix would be simply to have the final dput() treat
> > disconnected dentries as "kill on sight"; checking for i_nlink of the
> > inode, as Bruce suggested several years ago, will *not* work, simply
> > because having another link to that file and unlinking it after close
> > will reproduce the situation - disconnected dentry sticks around in
> > dcache past its final dput() and past the last unlink() of our file.
> > Theoretically it might cause an overhead for nfsd (no_subtree_check v3
> > export might see more d_alloc()/d_free(); icache retention will still
> > prevent constant rereading the inode from disk).  _IF_ that proves to
> > be noticable, we might need to come up with something more elaborate
> > (e.g. have unlink() and rename() kick disconnected aliases out if the link
> > count has reached 0), but it's more complex and needs careful ananlysis
> > of correctness - we need to prove that there's no way to miss the link
> > count reaching 0.  I would prefer to treat all disconnected as unhashed
> > for dcache retention purposes - it's simpler and less brittle.  Comments?
> > I mean something like this:
> 
> Al, help me out here, I'm struggling to understand where these dentrys
> come from (apart from your reproducer).
> 
> For example, on the heavily patched 2.6.32 kernel where this was first
> seen the problem dentry is annoymous, refcount 0, and unhashed.
> 
> But the dentrys that will most likely face summary execution will be
> hashed, such as was the case on that 2.6.32 kernel at dput().
> 
> Doesn't that mean that something dropped the dentry after the dput(),
> that will now also free the dentry, that took the refcount to 0?

Oh wait, think I get it now ... perhaps it's prune_one_dentry() doing
it ...

Ian

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 12, 2015, 3 p.m. UTC | #5
On Wed, Jul 08, 2015 at 10:41:43AM -0500, Ben Myers wrote:

> The bug rings a bell for me so I will stick my neck out instead of
> lurking.  Don't you need to sample that link count under the filesystems
> internal lock in order to avoid an unlink/iget race?  I suggest creating
> a helper to prune disconnected dentries which a filesystem could call in
> .unlink.  That would avoid the risk of unintended side effects with the
> d_alloc/d_free/icache approach and have provable link count correctness.

For one thing, this patch does *not* check for i_nlink at all.  For another,
there's no such thing as 'filesystems internal lock' for i_nlink protection -
that's handled by i_mutex...  And what does iget() have to do with any of that?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 12, 2015, 3:17 p.m. UTC | #6
On Thu, Jul 09, 2015 at 07:26:44PM +0800, Ian Kent wrote:
> > But the dentrys that will most likely face summary execution will be
> > hashed, such as was the case on that 2.6.32 kernel at dput().
> > 
> > Doesn't that mean that something dropped the dentry after the dput(),
> > that will now also free the dentry, that took the refcount to 0?
> 
> Oh wait, think I get it now ... perhaps it's prune_one_dentry() doing
> it ...

What, unhashing?  Yes, it does.

A bit of context - the breakage that had first pointed in direction of
this bug had been a deadlock with dcache shrinker run on frozen fs was
stumbling across a hashed dentry with zero refcount *and* zero link count
of its inode, triggering its eviction, final iput(), inode freeing and
deadlock on attempt to do sb_start_intwrite() there; figuring out how could
such a dentry appear in the first place had uncovered this fun.  Which
	a) is a bug in its own right and
	b) happens in mainline as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ian Kent July 13, 2015, 2:30 a.m. UTC | #7
On Sun, 2015-07-12 at 16:17 +0100, Al Viro wrote:
> On Thu, Jul 09, 2015 at 07:26:44PM +0800, Ian Kent wrote:
> > > But the dentrys that will most likely face summary execution will be
> > > hashed, such as was the case on that 2.6.32 kernel at dput().
> > > 
> > > Doesn't that mean that something dropped the dentry after the dput(),
> > > that will now also free the dentry, that took the refcount to 0?
> > 
> > Oh wait, think I get it now ... perhaps it's prune_one_dentry() doing
> > it ...
> 
> What, unhashing?  Yes, it does.

Yep, that was what I was thinking at the time.

> 
> A bit of context - the breakage that had first pointed in direction of
> this bug had been a deadlock with dcache shrinker run on frozen fs was
> stumbling across a hashed dentry with zero refcount *and* zero link count
> of its inode, triggering its eviction, final iput(), inode freeing and
> deadlock on attempt to do sb_start_intwrite() there; figuring out how could
> such a dentry appear in the first place had uncovered this fun.  Which
> 	a) is a bug in its own right and
> 	b) happens in mainline as well.

I get all of that, and it sure does look like these things should be
treated as unhashed.

My puzzle is the life cycle of DCACHE_DISCONNECTED dentrys, which is
mostly unrelated.

Not to worry, this isn't the first time I've been defeated trying to
work it out.

The only way I can see disconnected dentrys created (possibly unhashed,
and maybe not materialized) is via nfs and nfsd, beside the usage
mentioned here of course.

There must be some indirection I'm missing wrt. export_operations
usage ....

Ian

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Myers July 13, 2015, 6:17 p.m. UTC | #8
Hey Al,

On Sun, Jul 12, 2015 at 04:00:35PM +0100, Al Viro wrote:
> On Wed, Jul 08, 2015 at 10:41:43AM -0500, Ben Myers wrote:
> 
> > The bug rings a bell for me so I will stick my neck out instead of
> > lurking.  Don't you need to sample that link count under the filesystems
> > internal lock in order to avoid an unlink/iget race?  I suggest creating
> > a helper to prune disconnected dentries which a filesystem could call in
> > .unlink.  That would avoid the risk of unintended side effects with the
> > d_alloc/d_free/icache approach and have provable link count correctness.
> 
> For one thing, this patch does *not* check for i_nlink at all.

I agree that no checking of i_nlink has the advantage of brevity.
Anyone who is using dentry.d_fsdata with an open_by_handle workload (if
there are any) will be affected.

> For another, there's no such thing as 'filesystems internal lock' for
> i_nlink protection - that's handled by i_mutex...  And what does
> iget() have to do with any of that?

i_mutex is good enough only for local filesystems.
Network/clustered/distributed filesystems need to take an internal lock
to provide exclusion for this .unlink with a .link on another host.
That's where I'm coming from with iget().  

Maybe plumbing i_op.unlink with another argument to return i_nlink is
something to consider?  A helper for the few filesystems that need to do
this might be good enough in the near term.

Thanks,
Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 13, 2015, 7:56 p.m. UTC | #9
On Mon, Jul 13, 2015 at 01:17:51PM -0500, Ben Myers wrote:
> > For one thing, this patch does *not* check for i_nlink at all.
> 
> I agree that no checking of i_nlink has the advantage of brevity.
> Anyone who is using dentry.d_fsdata with an open_by_handle workload (if
> there are any) will be affected.

Translate, please.  What does d_fsdata have to anything above?

> > For another, there's no such thing as 'filesystems internal lock' for
> > i_nlink protection - that's handled by i_mutex...  And what does
> > iget() have to do with any of that?
> 
> i_mutex is good enough only for local filesystems.
> Network/clustered/distributed filesystems need to take an internal lock
> to provide exclusion for this .unlink with a .link on another host.
> That's where I'm coming from with iget().  
> 
> Maybe plumbing i_op.unlink with another argument to return i_nlink is
> something to consider?  A helper for the few filesystems that need to do
> this might be good enough in the near term.

????

a) iget() had been gone since way back
b) it never had been called by VFS - it's a filesystem's responsibility
c) again, what the hell does iget() or its replacements have to do with
dentry eviction?  It does *NOT* affect dentry refcount.  Never had.
d) checks for _inode_ retention in icache are done by filesystem code, which
is certainly free to use its locks.  Incidentally, for normal filesystems
no locks are needed at all - everything that changes ->i_nlink is holding
a referfence to in-core inode, so in a situation when its refcount is zero
and ->i_lock is held (providing an exclusion with icache lookups), ->i_nlink
is guaranteed to be stable.
e) why would VFS possibly want to know if there are links remaining after
successful ->unlink()?

I'm sorry, but you are not making any sense...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ben Myers July 14, 2015, 12:54 a.m. UTC | #10
Hey Al,

On Mon, Jul 13, 2015 at 08:56:46PM +0100, Al Viro wrote:
> On Mon, Jul 13, 2015 at 01:17:51PM -0500, Ben Myers wrote:
> > > For one thing, this patch does *not* check for i_nlink at all.
> > 
> > I agree that no checking of i_nlink has the advantage of brevity.
> > Anyone who is using dentry.d_fsdata with an open_by_handle workload (if
> > there are any) will be affected.
> 
> Translate, please.  What does d_fsdata have to anything above?

If my understanding of your patch is correct, any disconnected dentries will be
killed immediately at dput instead of being put on the dentry lru.  I have some
code that I think may rely on the old behavior with regard to
DCACHE_DISCONNECTED.  That is, the disconnected dentry sits on the lru and there
are lots of open_by_handle calls and nothing to keep a ref on the inode between
open_by_handle/close except that dentry's ref.  Maybe there are better ways to
do that and I just need to look for a workaround.

I wonder if there are others who will be affected by this change in behavior.
In particular, if there are filesystems that need to keep internal state in
d_fsdata across open_by_handle calls.  They won't be able to use d_fsdata
very well anymore for this workload.

> > > For another, there's no such thing as 'filesystems internal lock' for
> > > i_nlink protection - that's handled by i_mutex...  And what does
> > > iget() have to do with any of that?
> > 
> > i_mutex is good enough only for local filesystems.
> > Network/clustered/distributed filesystems need to take an internal lock
> > to provide exclusion for this .unlink with a .link on another host.
> > That's where I'm coming from with iget().  
> > 
> > Maybe plumbing i_op.unlink with another argument to return i_nlink is
> > something to consider?  A helper for the few filesystems that need to do
> > this might be good enough in the near term.
> 
> ????
> 
> a) iget() had been gone since way back
> b) it never had been called by VFS - it's a filesystem's responsibility
> c) again, what the hell does iget() or its replacements have to do with
> dentry eviction?  It does *NOT* affect dentry refcount.  Never had.

I misspoke with regard to iget.  It is the locking for i_nlink that I am
concerned about.

> d) checks for _inode_ retention in icache are done by filesystem code, which
> is certainly free to use its locks.  Incidentally, for normal filesystems
> no locks are needed at all - everything that changes ->i_nlink is holding
> a referfence to in-core inode, so in a situation when its refcount is zero
> and ->i_lock is held (providing an exclusion with icache lookups), ->i_nlink
> is guaranteed to be stable.

Thanks for that explanation, that's what I needed.  For filesystems which have a
distributed lock manager involved on multiple hosts, i_nlink is not stable on an
individual host even with i_lock held.  You would also need to hold whatever
finer-grained lock the filesystem is protecting i_nlink with.  So the VFS
_can't_ know when i_nlink has gone to zero, only the filesystem can.  That's
what I was trying to get at.

> e) why would VFS possibly want to know if there are links remaining after
> successful ->unlink()?

The VFS wouldn't.  I guess the argument goes that it would be nice to keep the
old behavior and still cache disconnected dentries, and leave it to the
filesystems which require it to destroy the anonymous dentries in ->unlink when
they they know that i_nlink has gone to zero, having taken whatever internal
locks they need to use to adequately protect i_nlink from other hosts.  Thats
where a helper to prune disconnected dentries would come in handy.
 
> I'm sorry, but you are not making any sense...

I'm sorry for the confusion, it took me a little awhile to digest this.  The
change in caching behavior could affect more than nfs.  Any filesystem which
expects disconnected dentries to remain cached is affected.

Thanks,
Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 7a3f3e5..5c8ea15 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -642,7 +642,7 @@  static inline bool fast_dput(struct dentry *dentry)
 
 	/*
 	 * If we have a d_op->d_delete() operation, we sould not
-	 * let the dentry count go to zero, so use "put__or_lock".
+	 * let the dentry count go to zero, so use "put_or_lock".
 	 */
 	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE))
 		return lockref_put_or_lock(&dentry->d_lockref);
@@ -697,7 +697,7 @@  static inline bool fast_dput(struct dentry *dentry)
 	 */
 	smp_rmb();
 	d_flags = ACCESS_ONCE(dentry->d_flags);
-	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST;
+	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
 
 	/* Nothing to do? Dropping the reference was all we needed? */
 	if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
@@ -776,6 +776,9 @@  repeat:
 	if (unlikely(d_unhashed(dentry)))
 		goto kill_it;
 
+	if (unlikely(dentry->d_flags & DCACHE_DISCONNECTED))
+		goto kill_it;
+
 	if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) {
 		if (dentry->d_op->d_delete(dentry))
 			goto kill_it;