Message ID | 20221113170507.8205-1-yin31149@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v2] net: sched: fix memory leak in tcindex_set_parms | expand |
On Mon, 2022-11-14 at 01:05 +0800, Hawkins Jiawei wrote: > Syzkaller reports a memory leak as follows: > ==================================== > BUG: memory leak > unreferenced object 0xffff88810c287f00 (size 256): > comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046 > [<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline] > [<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline] > [<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline] > [<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline] > [<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342 > [<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553 > [<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147 > [<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082 > [<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540 > [<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] > [<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345 > [<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921 > [<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline] > [<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734 > [<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482 > [<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536 > [<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622 > [<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline] > [<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline] > [<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648 > [<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] > [<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 > [<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd > ==================================== > > Kernel uses tcindex_change() to change an existing > traffic-control-indices filter properties. During the > process of changing, kernel clears the old > traffic-control-indices filter result, and updates it > by RCU assigning new traffic-control-indices data. > > Yet the problem is that, kernel clears the old > traffic-control-indices filter result, without destroying > its tcf_exts structure, which triggers the above > memory leak. > > This patch solves it by using tcf_exts_destroy() to > destroy the tcf_exts structure in old > traffic-control-indices filter result, after the > RCU grace period. > > [Thanks to the suggestion from Jakub Kicinski and Cong Wang] > > Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()") > Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/ > Reported-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com > Tested-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com > Cc: Cong Wang <cong.wang@bytedance.com> > Cc: Jakub Kicinski <kuba@kernel.org> > Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> > --- > v2: > - remove all 'will' in commit message according to Jakub Kicinski > - add Fixes tag according to Jakub Kicinski > - remove all ifdefs according to Jakub Kicinski and Cong Wang > - add synchronize_rcu() before destorying old_e according to > Cong Wang > > v1: https://lore.kernel.org/all/20221031060835.11722-1-yin31149@gmail.com/ > net/sched/cls_tcindex.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c > index 1c9eeb98d826..d2fac9559d3e 100644 > --- a/net/sched/cls_tcindex.c > +++ b/net/sched/cls_tcindex.c > @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > struct tcf_result cr = {}; > int err, balloc = 0; > struct tcf_exts e; > + struct tcf_exts old_e = {}; > > err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); > if (err < 0) > @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > } > > if (old_r && old_r != r) { > + old_e = old_r->exts; > err = tcindex_filter_result_init(old_r, cp, net); > if (err < 0) { > kfree(f); > @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > tcf_exts_destroy(&new_filter_result.exts); > } > > + /* Note: old_e should be destroyed after the RCU grace period, > + * to avoid possible use-after-free by concurrent readers. > + */ > + synchronize_rcu(); this could make tc reconfiguration potentially very slow. I'm wondering if we can delegate the tcf_exts_destroy() to some workqueue? Thanks! Paolo
On Tue, 15 Nov 2022 at 12:36, Paolo Abeni <pabeni@redhat.com> wrote: > > On Mon, 2022-11-14 at 01:05 +0800, Hawkins Jiawei wrote: > > Syzkaller reports a memory leak as follows: > > ==================================== > > BUG: memory leak > > unreferenced object 0xffff88810c287f00 (size 256): > > comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s) > > hex dump (first 32 bytes): > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > backtrace: > > [<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046 > > [<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline] > > [<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline] > > [<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline] > > [<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline] > > [<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342 > > [<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553 > > [<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147 > > [<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082 > > [<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540 > > [<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] > > [<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345 > > [<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921 > > [<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline] > > [<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734 > > [<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482 > > [<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536 > > [<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622 > > [<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline] > > [<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline] > > [<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648 > > [<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] > > [<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 > > [<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd > > ==================================== > > > > Kernel uses tcindex_change() to change an existing > > traffic-control-indices filter properties. During the > > process of changing, kernel clears the old > > traffic-control-indices filter result, and updates it > > by RCU assigning new traffic-control-indices data. > > > > Yet the problem is that, kernel clears the old > > traffic-control-indices filter result, without destroying > > its tcf_exts structure, which triggers the above > > memory leak. > > > > This patch solves it by using tcf_exts_destroy() to > > destroy the tcf_exts structure in old > > traffic-control-indices filter result, after the > > RCU grace period. > > > > [Thanks to the suggestion from Jakub Kicinski and Cong Wang] > > > > Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()") > > Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/ > > Reported-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com > > Tested-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com > > Cc: Cong Wang <cong.wang@bytedance.com> > > Cc: Jakub Kicinski <kuba@kernel.org> > > Signed-off-by: Hawkins Jiawei <yin31149@gmail.com> > > --- > > v2: > > - remove all 'will' in commit message according to Jakub Kicinski > > - add Fixes tag according to Jakub Kicinski > > - remove all ifdefs according to Jakub Kicinski and Cong Wang > > - add synchronize_rcu() before destorying old_e according to > > Cong Wang > > > > v1: https://lore.kernel.org/all/20221031060835.11722-1-yin31149@gmail.com/ > > net/sched/cls_tcindex.c | 8 ++++++++ > > 1 file changed, 8 insertions(+) > > > > diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c > > index 1c9eeb98d826..d2fac9559d3e 100644 > > --- a/net/sched/cls_tcindex.c > > +++ b/net/sched/cls_tcindex.c > > @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > > struct tcf_result cr = {}; > > int err, balloc = 0; > > struct tcf_exts e; > > + struct tcf_exts old_e = {}; > > > > err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); > > if (err < 0) > > @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > > } > > > > if (old_r && old_r != r) { > > + old_e = old_r->exts; > > err = tcindex_filter_result_init(old_r, cp, net); > > if (err < 0) { > > kfree(f); > > @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > > tcf_exts_destroy(&new_filter_result.exts); > > } > > > > + /* Note: old_e should be destroyed after the RCU grace period, > > + * to avoid possible use-after-free by concurrent readers. > > + */ > > + synchronize_rcu(); > > this could make tc reconfiguration potentially very slow. I'm wondering > if we can delegate the tcf_exts_destroy() to some workqueue? call_rcu?
On Mon, 14 Nov 2022 01:05:08 +0800 Hawkins Jiawei wrote: > diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c > index 1c9eeb98d826..d2fac9559d3e 100644 > --- a/net/sched/cls_tcindex.c > +++ b/net/sched/cls_tcindex.c > @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > struct tcf_result cr = {}; > int err, balloc = 0; > struct tcf_exts e; > + struct tcf_exts old_e = {}; This is not a valid way of initializing a structure. tcf_exts_init() is supposed to be called. If we add a list member to that structure this code will break, again. > err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); > if (err < 0) > @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > } > > if (old_r && old_r != r) { > + old_e = old_r->exts; > err = tcindex_filter_result_init(old_r, cp, net); > if (err < 0) { > kfree(f); > @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > tcf_exts_destroy(&new_filter_result.exts); > } > > + /* Note: old_e should be destroyed after the RCU grace period, > + * to avoid possible use-after-free by concurrent readers. > + */ > + synchronize_rcu(); > + tcf_exts_destroy(&old_e); I don't think this dance is required, @cp is a copy of the original data, and the original (@p) is destroyed in a safe manner below. > if (oldp) > tcf_queue_work(&oldp->rwork, tcindex_partial_destroy_work); > return 0;
On Tue, 2022-11-15 at 09:02 -0800, Jakub Kicinski wrote: > On Mon, 14 Nov 2022 01:05:08 +0800 Hawkins Jiawei wrote: > > > @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > > } > > > > if (old_r && old_r != r) { > > + old_e = old_r->exts; > > err = tcindex_filter_result_init(old_r, cp, net); > > if (err < 0) { > > kfree(f); > > @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > > tcf_exts_destroy(&new_filter_result.exts); > > } > > > > + /* Note: old_e should be destroyed after the RCU grace period, > > + * to avoid possible use-after-free by concurrent readers. > > + */ > > + synchronize_rcu(); > > + tcf_exts_destroy(&old_e); > > I don't think this dance is required, @cp is a copy of the original > data, and the original (@p) is destroyed in a safe manner below. This code confuses me more than a bit, and I don't follow ?!? it looks like that at this point: * the data path could access 'old_r->exts' contents via 'p' just before the previous 'tcindex_filter_result_init(old_r, cp, net);' but still potentially within the same RCU grace period * 'tcindex_filter_result_init(old_r, cp, net);' has 'unlinked' the old exts from 'p' so that will not be freed by later tcindex_partial_destroy_work() Overall it looks to me that we need some somewhat wait for the RCU grace period, Somewhat side question: it looks like that the 'perfect hashing' usage is the root cause of the issue addressed here, and very likely is afflicted by other problems, e.g. the data curruption in 'err = tcindex_filter_result_init(old_r, cp, net);'. AFAICS 'perfect hashing' usage is a sort of optimization that the user- space may trigger with some combination of the tcindex arguments. I'm wondering if we could drop all perfect hashing related code? Paolo
On Tue, 15 Nov 2022 19:57:10 +0100 Paolo Abeni wrote: > This code confuses me more than a bit, and I don't follow ?!? It's very confusing :S For starters I don't know when r != old_r. I mean now it triggers randomly after the RCU-ification, but in the original code when it was just a memset(). When would old_r ever not be null and yet point to a different entry? > it looks like that at this point: > > * the data path could access 'old_r->exts' contents via 'p' just before > the previous 'tcindex_filter_result_init(old_r, cp, net);' but still > potentially within the same RCU grace period > > * 'tcindex_filter_result_init(old_r, cp, net);' has 'unlinked' the old > exts from 'p' so that will not be freed by later > tcindex_partial_destroy_work() > > Overall it looks to me that we need some somewhat wait for the RCU > grace period, Isn't it better to make @cp a deeper copy of @p ? I thought it already is but we don't seem to be cloning p->h. Also the cloning of p->perfect looks quite lossy. > Somewhat side question: it looks like that the 'perfect hashing' usage > is the root cause of the issue addressed here, and very likely is > afflicted by other problems, e.g. the data curruption in 'err = > tcindex_filter_result_init(old_r, cp, net);'. > > AFAICS 'perfect hashing' usage is a sort of optimization that the user- > space may trigger with some combination of the tcindex arguments. I'm > wondering if we could drop all perfect hashing related code? The thought of "how much of this can we delete" did cross my mind :)
Hi, On Wed, 16 Nov 2022 at 01:02, Jakub Kicinski <kuba@kernel.org> wrote: > > On Mon, 14 Nov 2022 01:05:08 +0800 Hawkins Jiawei wrote: > > diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c > > index 1c9eeb98d826..d2fac9559d3e 100644 > > --- a/net/sched/cls_tcindex.c > > +++ b/net/sched/cls_tcindex.c > > @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, > > struct tcf_result cr = {}; > > int err, balloc = 0; > > struct tcf_exts e; > > + struct tcf_exts old_e = {}; > > This is not a valid way of initializing a structure. > tcf_exts_init() is supposed to be called. > If we add a list member to that structure this code will break, again. Yes, you are right. But the `old_e` variable here is used only for freeing old_r->exts resource, `old_e` will be overwritten by old_r->exts content as follows: struct tcf_exts old_e = {}; ... if (old_r && old_r != r) { old_e = old_r->exts; ... } ... synchronize_rcu(); tcf_exts_destroy(&old_e); So this patch uses `struct tcf_exts old_e = {}` here just for a cleared space.
On Wed, 16 Nov 2022 at 10:44, Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 15 Nov 2022 19:57:10 +0100 Paolo Abeni wrote: > > This code confuses me more than a bit, and I don't follow ?!? > > It's very confusing :S > > For starters I don't know when r != old_r. I mean now it triggers > randomly after the RCU-ification, but in the original code when > it was just a memset(). When would old_r ever not be null and yet > point to a different entry? I am also confused about the code when I tried to fix this bug. As for when `old_r != r`, according to the simplified code below, this should be probably true if `p->perfect` is true or `!p->perfect && !pc->h` is true(please correct me if I am wrong) struct tcindex_filter_result new_filter_result, *old_r = r; struct tcindex_data *cp = NULL, *oldp; struct tcf_result cr = {}; /* tcindex_data attributes must look atomic to classifier/lookup so * allocate new tcindex data and RCU assign it onto root. Keeping * perfect hash and hash pointers from old data. */ cp = kzalloc(sizeof(*cp), GFP_KERNEL); if (p->perfect) { if (tcindex_alloc_perfect_hash(net, cp) < 0) goto errout; ... } cp->h = p->h; if (!cp->perfect && !cp->h) { if (valid_perfect_hash(cp)) { if (tcindex_alloc_perfect_hash(net, cp) < 0) goto errout_alloc; } else { struct tcindex_filter __rcu **hash; hash = kcalloc(cp->hash, sizeof(struct tcindex_filter *), GFP_KERNEL); if (!hash) goto errout_alloc; cp->h = hash; } } ... if (cp->perfect) r = cp->perfect + handle; else r = tcindex_lookup(cp, handle) ? : &new_filter_result; if (old_r && old_r != r) { err = tcindex_filter_result_init(old_r, cp, net); if (err < 0) { kfree(f); goto errout_alloc; } } * If `p->perfect` is true, tcindex_alloc_perfect_hash() newly alloctes cp->perfect. * If `!p->perfect && !p->h` is true, cp->perfect or cp->h is newly allocated. In either case, r probably points to the newly allocated memory, which should not equals to the old_r. > > > it looks like that at this point: > > > > * the data path could access 'old_r->exts' contents via 'p' just before > > the previous 'tcindex_filter_result_init(old_r, cp, net);' but still > > potentially within the same RCU grace period > > > > * 'tcindex_filter_result_init(old_r, cp, net);' has 'unlinked' the old > > exts from 'p' so that will not be freed by later > > tcindex_partial_destroy_work() > > > > Overall it looks to me that we need some somewhat wait for the RCU > > grace period, > > Isn't it better to make @cp a deeper copy of @p ? > I thought it already is but we don't seem to be cloning p->h. > Also the cloning of p->perfect looks quite lossy. Yes, I also think @cp should be a deeper copy of @p. But it seems that in tcindex_alloc_perfect_hash(), each @cp ->exts will be initialized by tcf_exts_init() as below, and tcindex_set_parms() forgets to free the old ->exts content, triggering this memory leak.(Please correct me if I am wrong) static int tcindex_alloc_perfect_hash(struct net *net, struct tcindex_data *cp) { int i, err = 0; cp->perfect = kcalloc(cp->hash, sizeof(struct tcindex_filter_result), GFP_KERNEL | __GFP_NOWARN); for (i = 0; i < cp->hash; i++) { err = tcf_exts_init(&cp->perfect[i].exts, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); if (err < 0) goto errout; cp->perfect[i].p = cp; } } static inline int tcf_exts_init(struct tcf_exts *exts, struct net *net, int action, int police) { #ifdef CONFIG_NET_CLS_ACT exts->type = 0; exts->nr_actions = 0; /* Note: we do not own yet a reference on net. * This reference might be taken later from tcf_exts_get_net(). */ exts->net = net; exts->actions = kcalloc(TCA_ACT_MAX_PRIO, sizeof(struct tc_action *), GFP_KERNEL); if (!exts->actions) return -ENOMEM; #endif exts->action = action; exts->police = police; return 0; } > > > Somewhat side question: it looks like that the 'perfect hashing' usage > > is the root cause of the issue addressed here, and very likely is > > afflicted by other problems, e.g. the data curruption in 'err = > > tcindex_filter_result_init(old_r, cp, net);'. > > > > AFAICS 'perfect hashing' usage is a sort of optimization that the user- > > space may trigger with some combination of the tcindex arguments. I'm > > wondering if we could drop all perfect hashing related code? > > The thought of "how much of this can we delete" did cross my mind :)
On Wed, 16 Nov 2022 at 20:10, Hawkins Jiawei <yin31149@gmail.com> wrote: > > On Wed, 16 Nov 2022 at 10:44, Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Tue, 15 Nov 2022 19:57:10 +0100 Paolo Abeni wrote: > > > This code confuses me more than a bit, and I don't follow ?!? > > > > It's very confusing :S > > > > For starters I don't know when r != old_r. I mean now it triggers > > randomly after the RCU-ification, but in the original code when > > it was just a memset(). When would old_r ever not be null and yet > > point to a different entry? > > I am also confused about the code when I tried to fix this bug. > > As for when `old_r != r`, according to the simplified > code below, this should be probably true if `p->perfect` is true > or `!p->perfect && !pc->h` is true(please correct me if I am wrong) > > struct tcindex_filter_result new_filter_result, *old_r = r; > struct tcindex_data *cp = NULL, *oldp; > struct tcf_result cr = {}; > > /* tcindex_data attributes must look atomic to classifier/lookup so > * allocate new tcindex data and RCU assign it onto root. Keeping > * perfect hash and hash pointers from old data. > */ > cp = kzalloc(sizeof(*cp), GFP_KERNEL); > > if (p->perfect) { > if (tcindex_alloc_perfect_hash(net, cp) < 0) > goto errout; > ... > } > cp->h = p->h; > > if (!cp->perfect && !cp->h) { > if (valid_perfect_hash(cp)) { > if (tcindex_alloc_perfect_hash(net, cp) < 0) > goto errout_alloc; > > } else { > struct tcindex_filter __rcu **hash; > > hash = kcalloc(cp->hash, > sizeof(struct tcindex_filter *), > GFP_KERNEL); > > if (!hash) > goto errout_alloc; > > cp->h = hash; > } > } > ... > > if (cp->perfect) > r = cp->perfect + handle; > else > r = tcindex_lookup(cp, handle) ? : &new_filter_result; > > if (old_r && old_r != r) { > err = tcindex_filter_result_init(old_r, cp, net); > if (err < 0) { > kfree(f); > goto errout_alloc; > } > } > > * If `p->perfect` is true, tcindex_alloc_perfect_hash() newly > alloctes cp->perfect. > > * If `!p->perfect && !p->h` is true, cp->perfect or cp->h is > newly allocated. > > In either case, r probably points to the newly allocated memory, > which should not equals to the old_r. Sorry for the error. In the second case, `r` is possibly pointing to the `&new_filter_result`, which is a stack variable address, and should still not equal to the `old_r`. > > > > > > it looks like that at this point: > > > > > > * the data path could access 'old_r->exts' contents via 'p' just before > > > the previous 'tcindex_filter_result_init(old_r, cp, net);' but still > > > potentially within the same RCU grace period > > > > > > * 'tcindex_filter_result_init(old_r, cp, net);' has 'unlinked' the old > > > exts from 'p' so that will not be freed by later > > > tcindex_partial_destroy_work() > > > > > > Overall it looks to me that we need some somewhat wait for the RCU > > > grace period, > > > > Isn't it better to make @cp a deeper copy of @p ? > > I thought it already is but we don't seem to be cloning p->h. > > Also the cloning of p->perfect looks quite lossy. > > Yes, I also think @cp should be a deeper copy of @p. > > But it seems that in tcindex_alloc_perfect_hash(), > each @cp ->exts will be initialized by tcf_exts_init() > as below, and tcindex_set_parms() forgets to free the > old ->exts content, triggering this memory leak.(Please > correct me if I am wrong) > > static int tcindex_alloc_perfect_hash(struct net *net, > struct tcindex_data *cp) > { > int i, err = 0; > > cp->perfect = kcalloc(cp->hash, sizeof(struct tcindex_filter_result), > GFP_KERNEL | __GFP_NOWARN); > > for (i = 0; i < cp->hash; i++) { > err = tcf_exts_init(&cp->perfect[i].exts, net, > TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); > if (err < 0) > goto errout; > cp->perfect[i].p = cp; > } > } > > static inline int tcf_exts_init(struct tcf_exts *exts, struct net *net, > int action, int police) > { > #ifdef CONFIG_NET_CLS_ACT > exts->type = 0; > exts->nr_actions = 0; > /* Note: we do not own yet a reference on net. > * This reference might be taken later from tcf_exts_get_net(). > */ > exts->net = net; > exts->actions = kcalloc(TCA_ACT_MAX_PRIO, sizeof(struct tc_action *), > GFP_KERNEL); > if (!exts->actions) > return -ENOMEM; > #endif > exts->action = action; > exts->police = police; > return 0; > } > > > > > > Somewhat side question: it looks like that the 'perfect hashing' usage > > > is the root cause of the issue addressed here, and very likely is > > > afflicted by other problems, e.g. the data curruption in 'err = > > > tcindex_filter_result_init(old_r, cp, net);'. > > > > > > AFAICS 'perfect hashing' usage is a sort of optimization that the user- > > > space may trigger with some combination of the tcindex arguments. I'm > > > wondering if we could drop all perfect hashing related code? > > > > The thought of "how much of this can we delete" did cross my mind :)
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c index 1c9eeb98d826..d2fac9559d3e 100644 --- a/net/sched/cls_tcindex.c +++ b/net/sched/cls_tcindex.c @@ -338,6 +338,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, struct tcf_result cr = {}; int err, balloc = 0; struct tcf_exts e; + struct tcf_exts old_e = {}; err = tcf_exts_init(&e, net, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE); if (err < 0) @@ -479,6 +480,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, } if (old_r && old_r != r) { + old_e = old_r->exts; err = tcindex_filter_result_init(old_r, cp, net); if (err < 0) { kfree(f); @@ -510,6 +512,12 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base, tcf_exts_destroy(&new_filter_result.exts); } + /* Note: old_e should be destroyed after the RCU grace period, + * to avoid possible use-after-free by concurrent readers. + */ + synchronize_rcu(); + tcf_exts_destroy(&old_e); + if (oldp) tcf_queue_work(&oldp->rwork, tcindex_partial_destroy_work); return 0;