netns: dangling pointer on netns bind mount point.

Message ID	20200407023512.GA25005@ubuntu (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=IGvX=5X=vger.kernel.org=linux-fsdevel-owner@kernel.org> Date: Tue, 7 Apr 2020 11:35:46 +0900 From: Levi <ppbuk5246@gmail.com> To: viro@zeniv.linux.org.uk, davem@davemloft.net, kuba@kernel.org, gnault@redhat.com, nicolas.dichtel@6wind.com, edumazet@google.com, lirongqing@baidu.com, tglx@linutronix.de, johannes.berg@intel.com, dhowells@redhat.com, daniel@iogearbox.net, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org Subject: [PATCH] netns: dangling pointer on netns bind mount point. Message-ID: <20200407023512.GA25005@ubuntu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.9.4 (2018-02-28) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	netns: dangling pointer on netns bind mount point. \| expand netns: dangling pointer on netns bind mount point.

Yun Levi April 7, 2020, 2:35 a.m. UTC

When we try to bind mount on network namespace (ex) /proc/{pid}/ns/net,
inode's private data can have dangling pointer to net_namespace that was
already freed in below case.

    1. Forking the process.
    2. [PARENT] Waiting the Child till the end.
    3. [CHILD] call unshare for creating new network namespace
    4. [CHILD] Bind mount with /proc/self/ns/net to some mount point.
    5. [CHILD] Exit child.
    6. [PARENT] Try to setns with binded mount point

In step 5, net_namespace made by child process'll be freed,
But in bind mount point, it still held the pointer to net_namespace made
by child process.
In this situation, when parent try to call "setns" systemcall with the
bind mount point, parent process try to access to freed memory, That
makes memory corruption.

This patch fix the above scenario by increaseing reference count.

Signed-off-by: Levi Yun <ppbuk5246@gmail.com>
---
 fs/namespace.c              | 31 +++++++++++++++++++++++++++++++
 include/net/net_namespace.h |  7 +++++++
 net/core/net_namespace.c    |  5 -----
 3 files changed, 38 insertions(+), 5 deletions(-)

--
2.26.0

Al Viro April 7, 2020, 3:05 a.m. UTC | #1

On Tue, Apr 07, 2020 at 11:35:46AM +0900, Levi wrote:
> When we try to bind mount on network namespace (ex) /proc/{pid}/ns/net,
> inode's private data can have dangling pointer to net_namespace that was
> already freed in below case.
> 
>     1. Forking the process.
>     2. [PARENT] Waiting the Child till the end.
>     3. [CHILD] call unshare for creating new network namespace
>     4. [CHILD] Bind mount with /proc/self/ns/net to some mount point.
>     5. [CHILD] Exit child.
>     6. [PARENT] Try to setns with binded mount point
> 
> In step 5, net_namespace made by child process'll be freed,
> But in bind mount point, it still held the pointer to net_namespace made
> by child process.
> In this situation, when parent try to call "setns" systemcall with the
> bind mount point, parent process try to access to freed memory, That
> makes memory corruption.
> 
> This patch fix the above scenario by increaseing reference count.

This can't be the right fix.

> +#ifdef CONFIG_NET_NS
> +	if (!(flag & CL_COPY_MNT_NS_FILE) && is_net_ns_file(root)) {
> +		ns = get_proc_ns(d_inode(root));
> +		if (ns == NULL || ns->ops->type != CLONE_NEWNET) {
> +			err = -EINVAL;
> +
> +			goto out_free;
> +		}
> +
> +		net = to_net_ns(ns);
> +		net = get_net(net);

No.  This is completely wrong.  You have each struct mount pointing to
that sucker to grab an extra reference on an object; you do *NOT* drop
said reference when struct mount is destroyed.  You are papering over
a dangling pointer of some sort by introducing a trivially exploitable
leak that happens to hit your scenario.

Hell, do (your step 4 + umount your mountpoint) in a loop, then watch what
happens to refcounts with that patch.

This is bollocks; the reference is *NOT* in struct mount.  At all.
It's not even in struct dentry.  What it's attached to is struct
inode and it should be pinned as long as that inode is alive -
it's dropped in nsfs_evict().  And if _that_ gets called while
dentry is still pinned (as ->mnt_root of something), you have
much worse problems.

Could you post a reproducer, preferably one that would trigger an oops
on mainline?

Al Viro April 7, 2020, 3:13 a.m. UTC | #2

On Tue, Apr 07, 2020 at 04:05:04AM +0100, Al Viro wrote:

> Could you post a reproducer, preferably one that would trigger an oops
> on mainline?

BTW, just to make sure - are we talking about analysis of existing
oops, or is it "never seen it happen, but looks like it should be
triggerable" kind of situation?

Yun Levi April 7, 2020, 3:29 a.m. UTC | #3

Actually I confirm that Kernel Panic on 4.19 version.
But I couldn't check main line not yet. Below is the one of the panic
log what i've experienced

Internal error: Oops: 96000004 [#1] SMP 2020-03-14T15:35
Modules linked in: hsl_linux_helper(O) saswifmod(O) asrim(O)
linux_bcm_knet(O) linux_user_bde(O) linux_kernel_bde(O)
ds_peripheral(O) m_vlog(O) m_scontrol(O)
CPU: 2 PID: 5966 Comm: c.EQMD_ASYNC Tainted: G W O 4.19.26 NOS
Hardware name: LS1046A RDB Board (DT)
pstate: 60000005 (nZCv daif -PAN -UAO) :44
pc : sock_prot_inuse_add+0x10/0x20
 lr : netlink_release+0x30c/0x5a8
sp : ffff0000244c3b20
x29: ffff0000244c3b20 x28: ffff8008774ea000
x27: ffff800865b093d0 x26: ffff800865b09000
x25: ffff800863580c00 x24: ffff00001892f000
x23: ffff80086514ba00 x22: ffff000018acd000
x21: ffff0000189ae000 x20: ffff00001892a000
x19: ffff0000088bfb2c x18: 0000000000000400
x16: 0000000000000000 x15: 0000000000000400
x14: 0000000000000400  x13: 0000000000000000
x12: 0000000000000001 x11: 000000000000a949
x10: 0000000000062d05 x9 : 0000000000000027
x8 : ffff800877120280 x7 : 0000000000000200
x6 : ffff800865b092e8 x5 : ffff0000244c3ae0
x4 : 0000000000000000 x3 : 0000800866e9a000
x2 : 00000000ffffffff x1 : 0000000000000000
x0 : 0000000000000000
Process c.EQMD_ASYNC (pid: 5966, stack limit = 0x0000000088e20a05)
Call trace:
sock_prot_inuse_add+0x10/0x20
__sock_release+0x44/0xc0
sock_close+0x14/0x20
__fput+0x8c/0x1b8
____fput+0xc/0x18
task_work_run+0xa8/0xd8
do_exit+0x2e4/0xa50
do_group_exit+0x34/0xc8
get_signal+0xd4/0x600
do_signal+0x174/0x268
do_notify_resume+0xcc/0x110
work_pending+0x8/0x10
Code: b940c821 f940c000 d538d083 8b010800 (b8606861) ---[ end trace
0b98c9ccbfd9f6fd ]---
Date/Time : 02-14-0120 06:35:47 Kernel panic - not syncing: Fatal
exception in interrupt SMP: stopping secondary CPUs Kernel Offset:
disabled CPU features: 0x0,21806000 Memory Limit: none

What I saw is when I try to bind on some mount point to
/proc/{pid}/ns/net which made by child process, That's doesn't
increase the netns' refcnt.
And when the child process's going to exit, it frees the netns But
Still bind mount point's inode's private data point to netns which was
freed by child when it exits.

Thank you for your reviewing But I also confirm that problem on mainline too.

And sorry to my fault.


On Tue, Apr 7, 2020 at 12:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Apr 07, 2020 at 04:05:04AM +0100, Al Viro wrote:
>
> > Could you post a reproducer, preferably one that would trigger an oops
> > on mainline?
>
> BTW, just to make sure - are we talking about analysis of existing
> oops, or is it "never seen it happen, but looks like it should be
> triggerable" kind of situation?

Al Viro April 7, 2020, 4:03 a.m. UTC | #4

On Tue, Apr 07, 2020 at 12:29:34PM +0900, Yun Levi wrote:

> sock_prot_inuse_add+0x10/0x20
> __sock_release+0x44/0xc0
> sock_close+0x14/0x20
> __fput+0x8c/0x1b8
> ____fput+0xc/0x18
> task_work_run+0xa8/0xd8
> do_exit+0x2e4/0xa50
> do_group_exit+0x34/0xc8
> get_signal+0xd4/0x600
> do_signal+0x174/0x268
> do_notify_resume+0xcc/0x110
> work_pending+0x8/0x10
> Code: b940c821 f940c000 d538d083 8b010800 (b8606861) ---[ end trace
> 0b98c9ccbfd9f6fd ]---
> Date/Time : 02-14-0120 06:35:47 Kernel panic - not syncing: Fatal
> exception in interrupt SMP: stopping secondary CPUs Kernel Offset:
> disabled CPU features: 0x0,21806000 Memory Limit: none
> 
> What I saw is when I try to bind on some mount point to
> /proc/{pid}/ns/net which made by child process, That's doesn't
> increase the netns' refcnt.

Why would it?  Increase of netns refcount should happen when you
follow /proc/*/ns/net and stay for as long as nsfs inode is alive,
not by cloning that mount.  And while we are at it, you don't need
to bind it anywhere in order to call setns() - just open the
sucker and then feed the resulting descriptor to setns(2).  No
mount --bind involved and if child exiting between open() and
setns() would free netns, we would have exact same problem.

> And when the child process's going to exit, it frees the netns But
> Still bind mount point's inode's private data point to netns which was
> freed by child when it exits.

Look: we call ns_get_path(), which calls ns_get_path_cb(), which
calls the callback passed to it (ns_get_path_task(), in this case),
which grabs a reference to ns.  Then we pass that reference to
__ns_get_path().

__ns_get_path() looks for dentry stashed in ns.  If there is one
and it's still alive, we grab a reference to that dentry and drop
the reference to ns - no new inodes had been created, so no new
namespace references have appeared.  Existing inode is pinned
by dentry and dentry is pinned by _dentry_ reference we've got.

If dentry is not there or is already in the middle of being destroyed,
we allocate a new inode, stash our namespace reference into it,
create a dentry referencing that new inode, stash it into namespace
and return that dentry.  Without dropping namespace reference we'd
obtained in ns_get_path_task() - it went into new inode.

If inode or dentry creation fails (out of memory), we drop what
we'd obtained (namespace if inode creation fails, just the inode
if dentry creation fails - namespace reference that went into
inode will be dropped by inode destructor in the latter case) and
return an error.

If somebody else manages to get through the entire thing while
we'd been allocating stuff and we see _their_ dentry already
stashed into namespace, we just drop our dentry (its destructor
will drop inode reference, which will lead to inode destructor,
which will drop namespace reference) and bugger off with -EAGAIN.
The caller (ns_get_path_cb()) will retry the entire thing.

The invariant to be preserved here is "each of those inodes
holds a reference to namespace for as long as the inode exists".
ns_get_path() increments namespace refcount if and only if it
has allocated an nsfs inode.  If it grabs an existing dentry
instead, the namespace refcount remains unchanged.

Anyway, I would really like to see .config and C (or shell)
reproducer.  Ideally - .config that could be run in a KVM.

Yun Levi April 7, 2020, 12:57 p.m. UTC | #5

---------- Forwarded message ---------
From: Yun Levi <ppbuk5246@gmail.com>
Date: Tue, Apr 7, 2020 at 9:53 PM
Subject: Re: [PATCH] netns: dangling pointer on netns bind mount point.
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: David S. Miller <davem@davemloft.net>, Jakub Kicinski
<kuba@kernel.org>, Guillaume Nault <gnault@redhat.com>, Nicolas
Dichtel <nicolas.dichtel@6wind.com>, Eric Dumazet
<edumazet@google.com>, Li RongQing <lirongqing@baidu.com>, Thomas
Gleixner <tglx@linutronix.de>, Johannes Berg
<johannes.berg@intel.com>, David Howells <dhowells@redhat.com>,
<daniel@iogearbox.net>, <linux-fsdevel@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <netdev@vger.kernel.org>


BTW, It's my question.

>Look: we call ns_get_path(), which calls ns_get_path_cb(), which
>calls the callback passed to it (ns_get_path_task(), in this case),
>which grabs a reference to ns.  Then we pass that reference to
>__ns_get_path().
>
>__ns_get_path() looks for dentry stashed in ns.  If there is one
>and it's still alive, we grab a reference to that dentry and drop
>the reference to ns - no new inodes had been created, so no new
>namespace references have appeared.  Existing inode is pinned
>by dentry and dentry is pinned by _dentry_ reference we've got.

actually ns_get_path is called in unshare(2). and it makes new dentry
and inode in __ns_get_path finally (Cuz it create new network
namespace)
In that case, when I mount with --bind option to this
proc/self/ns/net, it only increase dentry refcount on
do_loopback->clone_mnt finally (not call netns_operation->get)
That means it's not increase previous created network namespace
reference count but only increase reference count of _dentry_
In that situation, If I exit the child process it definitely frees the
net_namespace previous created at the same time it decrease
net_namespace's refcnt in exit_task_namespace().
It means it's possible that bind mount point can hold the dentry with
inode having net_namespace's dangling pointer in another process.
In above situation, parent who know that binded mount point calls
setns(2) then it sets the net_namespace which is refered by the inode
of the dentry increased by do_loopback.
That makes set the net_namespace which was freed already.

The Kernel Panic log is happend NOT in child kill but in Parent killed
after I setns(2) with the net_namespace made by child process.

Thanks you for your reviewing, But Please forgive that I couldn't
share .config right now (I try to make for x86....)
I try to understand your comments But Please forgive my fault because
of my ignorant...
If my explain is wrong, please rebuke me... and Please share your knowledge...

Thank you.

On Tue, Apr 7, 2020 at 1:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Apr 07, 2020 at 12:29:34PM +0900, Yun Levi wrote:
>
> > sock_prot_inuse_add+0x10/0x20
> > __sock_release+0x44/0xc0
> > sock_close+0x14/0x20
> > __fput+0x8c/0x1b8
> > ____fput+0xc/0x18
> > task_work_run+0xa8/0xd8
> > do_exit+0x2e4/0xa50
> > do_group_exit+0x34/0xc8
> > get_signal+0xd4/0x600
> > do_signal+0x174/0x268
> > do_notify_resume+0xcc/0x110
> > work_pending+0x8/0x10
> > Code: b940c821 f940c000 d538d083 8b010800 (b8606861) ---[ end trace
> > 0b98c9ccbfd9f6fd ]---
> > Date/Time : 02-14-0120 06:35:47 Kernel panic - not syncing: Fatal
> > exception in interrupt SMP: stopping secondary CPUs Kernel Offset:
> > disabled CPU features: 0x0,21806000 Memory Limit: none
> >
> > What I saw is when I try to bind on some mount point to
> > /proc/{pid}/ns/net which made by child process, That's doesn't
> > increase the netns' refcnt.
>
> Why would it?  Increase of netns refcount should happen when you
> follow /proc/*/ns/net and stay for as long as nsfs inode is alive,
> not by cloning that mount.  And while we are at it, you don't need
> to bind it anywhere in order to call setns() - just open the
> sucker and then feed the resulting descriptor to setns(2).  No
> mount --bind involved and if child exiting between open() and
> setns() would free netns, we would have exact same problem.
>
> > And when the child process's going to exit, it frees the netns But
> > Still bind mount point's inode's private data point to netns which was
> > freed by child when it exits.
>
> Look: we call ns_get_path(), which calls ns_get_path_cb(), which
> calls the callback passed to it (ns_get_path_task(), in this case),
> which grabs a reference to ns.  Then we pass that reference to
> __ns_get_path().
>
> __ns_get_path() looks for dentry stashed in ns.  If there is one
> and it's still alive, we grab a reference to that dentry and drop
> the reference to ns - no new inodes had been created, so no new
> namespace references have appeared.  Existing inode is pinned
> by dentry and dentry is pinned by _dentry_ reference we've got.
>
> If dentry is not there or is already in the middle of being destroyed,
> we allocate a new inode, stash our namespace reference into it,
> create a dentry referencing that new inode, stash it into namespace
> and return that dentry.  Without dropping namespace reference we'd
> obtained in ns_get_path_task() - it went into new inode.
>
> If inode or dentry creation fails (out of memory), we drop what
> we'd obtained (namespace if inode creation fails, just the inode
> if dentry creation fails - namespace reference that went into
> inode will be dropped by inode destructor in the latter case) and
> return an error.
>
> If somebody else manages to get through the entire thing while
> we'd been allocating stuff and we see _their_ dentry already
> stashed into namespace, we just drop our dentry (its destructor
> will drop inode reference, which will lead to inode destructor,
> which will drop namespace reference) and bugger off with -EAGAIN.
> The caller (ns_get_path_cb()) will retry the entire thing.
>
> The invariant to be preserved here is "each of those inodes
> holds a reference to namespace for as long as the inode exists".
> ns_get_path() increments namespace refcount if and only if it
> has allocated an nsfs inode.  If it grabs an existing dentry
> instead, the namespace refcount remains unchanged.
>
> Anyway, I would really like to see .config and C (or shell)
> reproducer.  Ideally - .config that could be run in a KVM.

Al Viro April 7, 2020, 6:26 p.m. UTC | #6

On Tue, Apr 07, 2020 at 09:53:29PM +0900, Yun Levi wrote:
> BTW, It's my question.
> 
> >Look: we call ns_get_path(), which calls ns_get_path_cb(), which
> >calls the callback passed to it (ns_get_path_task(), in this case),
> >which grabs a reference to ns.  Then we pass that reference to
> >__ns_get_path().
> >
> >__ns_get_path() looks for dentry stashed in ns.  If there is one
> >and it's still alive, we grab a reference to that dentry and drop
> >the reference to ns - no new inodes had been created, so no new
> >namespace references have appeared.  Existing inode is pinned
> >by dentry and dentry is pinned by _dentry_ reference we've got.
> 
> actually ns_get_path is called in unshare(2).

Yes, it does.  Via perf_event_namespaces(), which does
        perf_fill_ns_link_info(&ns_link_info[NET_NS_INDEX],
                               task, &netns_operations);
and there we have
        error = ns_get_path(&ns_path, task, ns_ops);
        if (!error) {
                ns_inode = ns_path.dentry->d_inode;
                ns_link_info->dev = new_encode_dev(ns_inode->i_sb->s_dev);
                ns_link_info->ino = ns_inode->i_ino;
                path_put(&ns_path);
        }
See that path_put()?  Dentry reference is dropped by it.

> and it makes new dentry and
> inode in __ns_get_path finally (Cuz it create new network namespace)
>
> In that case, when I mount with --bind option to this proc/self/ns/net, it
> only increase dentry refcount on do_loopback->clone_mnt finally (not call
> netns_operation->get)
> That means it's not increase previous created network namespace reference
> count but only increase reference count of _dentry_
>
> In that situation, If I exit the child process it definitely frees the
> net_namespace previous created at the same time it decrease net_namespace's
> refcnt in exit_task_namespace().
> It means it's possible that bind mount point can hold the dentry with inode
> having net_namespace's dangling pointer in another process.
> In above situation, parent who know that binded mount point calls setns(2)
> then it sets the net_namespace which is refered by the inode of the dentry
> increased by do_loopback.
> That makes set the net_namespace which was freed already.

How?  Netns reference in inode contributes to netns refcount.  And it's held
for as long as the _inode_ has positive refcount - we only drop it from
the inode destructor, *NOT* every time we drop a reference to inode.
In the similar fashion, the inode reference in dentry contributes to inode
refcount.  And again, that inode reference won't be dropped until the _last_
reference to dentry gets dropped.

Incrementing refcount of dentry is enough to pin the inode and thus the
netns the inode refers to.  It's a very common pattern with refcounting;
a useful way of thinking about it is to consider the refcount of e.g.
inode as sum of several components, one of them being "number of struct
dentry instances with ->d_inode pointing to our inode".  And look at e.g.
__ns_get_path() like this:
        rcu_read_lock();
        d = atomic_long_read(&ns->stashed);
        if (!d)
                goto slow;
        dentry = (struct dentry *)d;
        if (!lockref_get_not_dead(&dentry->d_lockref))
                goto slow;
other_count(dentry) got incremented by 1.
        rcu_read_unlock();
        ns->ops->put(ns);
other_count(ns) decremented by 1.
got_it:
        path->mnt = mntget(mnt);
        path->dentry = dentry;
path added to paths(dentry), other_count(dentry) decremented by 1 (getting
it back to the original value).
        return 0;
slow:   
        rcu_read_unlock();
        inode = new_inode_pseudo(mnt->mnt_sb);
        if (!inode) {
                ns->ops->put(ns);
subtract 1 from other_count(ns)
                return -ENOMEM;
        }
dentries(inode) = empty
other_count(inode) = 1
	....
	inode->i_private = ns;
add inode to inodes(ns), subtract 1 from other_count(ns); the total
is unchanged.
        dentry = d_alloc_anon(mnt->mnt_sb);
        if (!dentry) {
                iput(inode);
subtract 1 from other_count(inode).  Since now all components of
inode refcount are zero, inode gets destroyed.  Destructor calls
nsfs_evict_inode(), which removes the inode from inodes(ns).
The total effect: inode is destroyed, inodes(ns) is back to what
it used to be and other_count(ns) is left decremented by 1 compared
to what we used to have.  IOW, the balance is the same as if inode
allocation would've failed.
                return -ENOMEM;
        }
other_count(dentry) = 1
        d_instantiate(dentry, inode);
add dentry to dentries(inode), subtract 1 from other_count(inode).
The total is unchanged.  Now other_count(inode) is 0 and dentries(inode)
is {dentry}.
        d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
        if (d) {
somebody else has gotten there first
                d_delete(dentry);       /* make sure ->d_prune() does nothing */
                dput(dentry);
subtract 1 from other_count(dentry) (which will drive it to 0).  Since
no other references exist, dentry gets destroyed.  Destructor will
remove dentry from dentries(inode) and since other_count(inode) is already
zero, trigger destruction of inode.  That, in turn, will remove inode
from inodes(ns).  Total effect: dentry is destroyed, inode is destroyed,
inodes(ns) is back to what it used to be, other_count(ns) is left decremented
by 1 compared to what we used to have.
                cpu_relax();
                return -EAGAIN;
        }
        goto got_it;
got_it:
        path->mnt = mntget(mnt);
        path->dentry = dentry;
add path to paths(dentry), subtract 1 from other_count(dentry).  At that
point other_count(dentry) is back to 0, ditto for other_count(inode) and
other_count(ns) is left decremented by 1 compared to what it used to be.
inode is added to inodes(ns), dentry - to dentries(inode) and path - to
paths(dentry).
        return 0;
and we are done.

In all cases the total effect is the same as far as "other" counts go:
other_count(ns) is down by 1 and that's the only change in other_count()
of *ANY* objects.  Of course we do not keep track of the sets explicitly;
it would cost too much and we only interested in the sum of their sizes
anyway.  What we actually store is the sum, so operations like "transfer
the reference from one component to another" are not immediately obvious
to be refcounting ones - the sum is unchanged.  Conceptually, though,
they are refcounting operations.
	Up to d_instantiate() we are holding a reference to inode;
after that we are *not* - it has been transferred to dentry.  That's
why on subsequent failure exits we do not call iput() - the inode
reference is not ours to discard anymore.
	In the same way, up to inode->i_private = ns; we are holding
a reference to ns.  After that we are not - it's been transferred to
inode.  From that point on it's not ours to discard; it will be
dropped when inode gets destroyed, whenever that happens.

Yun Levi April 8, 2020, 5:59 a.m. UTC | #7

Thank you for great comments. Thanks to you I understand what i missed.

I try to generate problem on mainline But, as you explained that
situation isn't happen,

Maybe my other things which I made generate some problem (freeing
network namespace..)

Thanks for great answering and sharing.

If I meet the situation, at that time I'll share. Thank you very much!

P.S. If I have a question, Could I ask via e-mail like this?

On Wed, Apr 8, 2020 at 3:26 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Apr 07, 2020 at 09:53:29PM +0900, Yun Levi wrote:
> > BTW, It's my question.
> >
> > >Look: we call ns_get_path(), which calls ns_get_path_cb(), which
> > >calls the callback passed to it (ns_get_path_task(), in this case),
> > >which grabs a reference to ns.  Then we pass that reference to
> > >__ns_get_path().
> > >
> > >__ns_get_path() looks for dentry stashed in ns.  If there is one
> > >and it's still alive, we grab a reference to that dentry and drop
> > >the reference to ns - no new inodes had been created, so no new
> > >namespace references have appeared.  Existing inode is pinned
> > >by dentry and dentry is pinned by _dentry_ reference we've got.
> >
> > actually ns_get_path is called in unshare(2).
>
> Yes, it does.  Via perf_event_namespaces(), which does
>         perf_fill_ns_link_info(&ns_link_info[NET_NS_INDEX],
>                                task, &netns_operations);
> and there we have
>         error = ns_get_path(&ns_path, task, ns_ops);
>         if (!error) {
>                 ns_inode = ns_path.dentry->d_inode;
>                 ns_link_info->dev = new_encode_dev(ns_inode->i_sb->s_dev);
>                 ns_link_info->ino = ns_inode->i_ino;
>                 path_put(&ns_path);
>         }
> See that path_put()?  Dentry reference is dropped by it.
>
> > and it makes new dentry and
> > inode in __ns_get_path finally (Cuz it create new network namespace)
> >
> > In that case, when I mount with --bind option to this proc/self/ns/net, it
> > only increase dentry refcount on do_loopback->clone_mnt finally (not call
> > netns_operation->get)
> > That means it's not increase previous created network namespace reference
> > count but only increase reference count of _dentry_
> >
> > In that situation, If I exit the child process it definitely frees the
> > net_namespace previous created at the same time it decrease net_namespace's
> > refcnt in exit_task_namespace().
> > It means it's possible that bind mount point can hold the dentry with inode
> > having net_namespace's dangling pointer in another process.
> > In above situation, parent who know that binded mount point calls setns(2)
> > then it sets the net_namespace which is refered by the inode of the dentry
> > increased by do_loopback.
> > That makes set the net_namespace which was freed already.
>
> How?  Netns reference in inode contributes to netns refcount.  And it's held
> for as long as the _inode_ has positive refcount - we only drop it from
> the inode destructor, *NOT* every time we drop a reference to inode.
> In the similar fashion, the inode reference in dentry contributes to inode
> refcount.  And again, that inode reference won't be dropped until the _last_
> reference to dentry gets dropped.
>
> Incrementing refcount of dentry is enough to pin the inode and thus the
> netns the inode refers to.  It's a very common pattern with refcounting;
> a useful way of thinking about it is to consider the refcount of e.g.
> inode as sum of several components, one of them being "number of struct
> dentry instances with ->d_inode pointing to our inode".  And look at e.g.
> __ns_get_path() like this:
>         rcu_read_lock();
>         d = atomic_long_read(&ns->stashed);
>         if (!d)
>                 goto slow;
>         dentry = (struct dentry *)d;
>         if (!lockref_get_not_dead(&dentry->d_lockref))
>                 goto slow;
> other_count(dentry) got incremented by 1.
>         rcu_read_unlock();
>         ns->ops->put(ns);
> other_count(ns) decremented by 1.
> got_it:
>         path->mnt = mntget(mnt);
>         path->dentry = dentry;
> path added to paths(dentry), other_count(dentry) decremented by 1 (getting
> it back to the original value).
>         return 0;
> slow:
>         rcu_read_unlock();
>         inode = new_inode_pseudo(mnt->mnt_sb);
>         if (!inode) {
>                 ns->ops->put(ns);
> subtract 1 from other_count(ns)
>                 return -ENOMEM;
>         }
> dentries(inode) = empty
> other_count(inode) = 1
>         ....
>         inode->i_private = ns;
> add inode to inodes(ns), subtract 1 from other_count(ns); the total
> is unchanged.
>         dentry = d_alloc_anon(mnt->mnt_sb);
>         if (!dentry) {
>                 iput(inode);
> subtract 1 from other_count(inode).  Since now all components of
> inode refcount are zero, inode gets destroyed.  Destructor calls
> nsfs_evict_inode(), which removes the inode from inodes(ns).
> The total effect: inode is destroyed, inodes(ns) is back to what
> it used to be and other_count(ns) is left decremented by 1 compared
> to what we used to have.  IOW, the balance is the same as if inode
> allocation would've failed.
>                 return -ENOMEM;
>         }
> other_count(dentry) = 1
>         d_instantiate(dentry, inode);
> add dentry to dentries(inode), subtract 1 from other_count(inode).
> The total is unchanged.  Now other_count(inode) is 0 and dentries(inode)
> is {dentry}.
>         d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
>         if (d) {
> somebody else has gotten there first
>                 d_delete(dentry);       /* make sure ->d_prune() does nothing */
>                 dput(dentry);
> subtract 1 from other_count(dentry) (which will drive it to 0).  Since
> no other references exist, dentry gets destroyed.  Destructor will
> remove dentry from dentries(inode) and since other_count(inode) is already
> zero, trigger destruction of inode.  That, in turn, will remove inode
> from inodes(ns).  Total effect: dentry is destroyed, inode is destroyed,
> inodes(ns) is back to what it used to be, other_count(ns) is left decremented
> by 1 compared to what we used to have.
>                 cpu_relax();
>                 return -EAGAIN;
>         }
>         goto got_it;
> got_it:
>         path->mnt = mntget(mnt);
>         path->dentry = dentry;
> add path to paths(dentry), subtract 1 from other_count(dentry).  At that
> point other_count(dentry) is back to 0, ditto for other_count(inode) and
> other_count(ns) is left decremented by 1 compared to what it used to be.
> inode is added to inodes(ns), dentry - to dentries(inode) and path - to
> paths(dentry).
>         return 0;
> and we are done.
>
> In all cases the total effect is the same as far as "other" counts go:
> other_count(ns) is down by 1 and that's the only change in other_count()
> of *ANY* objects.  Of course we do not keep track of the sets explicitly;
> it would cost too much and we only interested in the sum of their sizes
> anyway.  What we actually store is the sum, so operations like "transfer
> the reference from one component to another" are not immediately obvious
> to be refcounting ones - the sum is unchanged.  Conceptually, though,
> they are refcounting operations.
>         Up to d_instantiate() we are holding a reference to inode;
> after that we are *not* - it has been transferred to dentry.  That's
> why on subsequent failure exits we do not call iput() - the inode
> reference is not ours to discard anymore.
>         In the same way, up to inode->i_private = ns; we are holding
> a reference to ns.  After that we are not - it's been transferred to
> inode.  From that point on it's not ours to discard; it will be
> dropped when inode gets destroyed, whenever that happens.

Al Viro April 8, 2020, 1:48 p.m. UTC | #8

On Wed, Apr 08, 2020 at 02:59:17PM +0900, Yun Levi wrote:
> Thank you for great comments. Thanks to you I understand what i missed.
> 
> I try to generate problem on mainline But, as you explained that
> situation isn't happen,
> 
> Maybe my other things which I made generate some problem (freeing
> network namespace..)
> 
> Thanks for great answering and sharing.
> 
> If I meet the situation, at that time I'll share. Thank you very much!
> 
> P.S. If I have a question, Could I ask via e-mail like this?

Sure, no problem...

netns: dangling pointer on netns bind mount point.

Commit Message

Comments

Patch