From patchwork Thu Mar 23 20:55:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 13186085 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DC82DC76195 for ; Thu, 23 Mar 2023 20:55:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231574AbjCWUze (ORCPT ); Thu, 23 Mar 2023 16:55:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44642 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231561AbjCWUzc (ORCPT ); Thu, 23 Mar 2023 16:55:32 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CF2CB196B1; Thu, 23 Mar 2023 13:55:30 -0700 (PDT) Message-ID: <20230323102800.042297517@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1679604929; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=JXHMhp93Kw/ebB5Ti1phDvCi97j8rxdz0JvFtJbboLs=; b=1J+6XXE3vcpEqT8mTqFjAncgsl5Cgt6ofrHt9FERvrkIwnnYzfj1Jo+R5+HymBPHkrv+y1 15Q6JR9uASSyF21KhFwofgaCsWXeT836edO1FycUxo8sM+xArmOK4N1T7jlGD+78Ve90yE 5N86reja/fWZNcCWZV6WGzydHHLK3vjctyKWP4hrXeTiG/thIBFTu1DTvnK8FFnZRvDlOz Lb8pHbP/HY9bgW6JH/fKcv+pfnwbK34usOs8MvX4mYQj6jrftchTe14kxlImRB2aM5Ko2a ADz8Jw890XZA2ChO+i3umXQkmSaROggywUFscebPOOC+1o5li5PIODJa35uGPg== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1679604929; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=JXHMhp93Kw/ebB5Ti1phDvCi97j8rxdz0JvFtJbboLs=; b=bKm+gA+p9aWHHNJ6LVJzRsXesNvxQX+EMPjgSAAhdP3sFf1nJZimXNF64xIunUBxkBZLwp W4elR0fcr80qT8Dg== From: Thomas Gleixner To: LKML Cc: Linus Torvalds , x86@kernel.org, Wangyang Guo , Arjan van De Ven , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , netdev@vger.kernel.org, Will Deacon , Peter Zijlstra , Boqun Feng , Mark Rutland , Marc Zyngier , Qiuxu Zhuo , David Ahern Subject: [patch V3 1/4] [patch V2 1/4] net: dst: Prevent false sharing vs. dst_entry:: __refcnt References: <20230323102649.764958589@linutronix.de> MIME-Version: 1.0 Date: Thu, 23 Mar 2023 21:55:29 +0100 (CET) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org From: Wangyang Guo dst_entry::__refcnt is highly contended in scenarios where many connections happen from and to the same IP. The reference count is an atomic_t, so the reference count operations have to take the cache-line exclusive. Aside of the unavoidable reference count contention there is another significant problem which is caused by that: False sharing. perf top identified two affected read accesses. dst_entry::lwtstate and rtable::rt_genid. dst_entry:__refcnt is located at offset 64 of dst_entry, which puts it into a seperate cacheline vs. the read mostly members located at the beginning of the struct. That prevents false sharing vs. the struct members in the first 64 bytes of the structure, but there is also dst_entry::lwtstate which is located after the reference count and in the same cache line. This member is read after a reference count has been acquired. struct rtable embeds a struct dst_entry at offset 0. struct dst_entry has a size of 112 bytes, which means that the struct members of rtable which follow the dst member share the same cache line as dst_entry::__refcnt. Especially rtable::rt_genid is also read by the contexts which have a reference count acquired already. When dst_entry:__refcnt is incremented or decremented via an atomic operation these read accesses stall. This was found when analysing the memtier benchmark in 1:100 mode, which amplifies the problem extremly. Move the rt[6i]_uncached[_list] members out of struct rtable and struct rt6_info into struct dst_entry to provide padding and move the lwtstate member after that so it ends up in the same cache line. The resulting improvement depends on the micro-architecture and the number of CPUs. It ranges from +20% to +120% with a localhost memtier/memcached benchmark. [ tglx: Rearrange struct ] Signed-off-by: Wangyang Guo Signed-off-by: Arjan van de Ven Signed-off-by: Thomas Gleixner Reviewed-by: Eric Dumazet Reviewed-by: David Ahern --- V2: Move uncached[_list] into dst_entry (Eric) --- include/net/dst.h | 15 ++++++++++++++- include/net/ip6_fib.h | 3 --- include/net/ip6_route.h | 2 +- include/net/route.h | 3 --- net/ipv4/route.c | 20 ++++++++++---------- net/ipv4/xfrm4_policy.c | 4 ++-- net/ipv6/route.c | 26 +++++++++++++------------- net/ipv6/xfrm6_policy.c | 4 ++-- 8 files changed, 42 insertions(+), 35 deletions(-) --- a/include/net/dst.h +++ b/include/net/dst.h @@ -69,15 +69,28 @@ struct dst_entry { #endif int __use; unsigned long lastuse; - struct lwtunnel_state *lwtstate; struct rcu_head rcu_head; short error; short __pad; __u32 tclassid; #ifndef CONFIG_64BIT + struct lwtunnel_state *lwtstate; atomic_t __refcnt; /* 32-bit offset 64 */ #endif netdevice_tracker dev_tracker; + + /* + * Used by rtable and rt6_info. Moves lwtstate into the next cache + * line on 64bit so that lwtstate does not cause false sharing with + * __refcnt under contention of __refcnt. This also puts the + * frequently accessed members of rtable and rt6_info out of the + * __refcnt cache line. + */ + struct list_head rt_uncached; + struct uncached_list *rt_uncached_list; +#ifdef CONFIG_64BIT + struct lwtunnel_state *lwtstate; +#endif }; struct dst_metrics { --- a/include/net/ip6_fib.h +++ b/include/net/ip6_fib.h @@ -217,9 +217,6 @@ struct rt6_info { struct inet6_dev *rt6i_idev; u32 rt6i_flags; - struct list_head rt6i_uncached; - struct uncached_list *rt6i_uncached_list; - /* more non-fragment space at head required */ unsigned short rt6i_nfheader_len; }; --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -100,7 +100,7 @@ static inline struct dst_entry *ip6_rout static inline void ip6_rt_put_flags(struct rt6_info *rt, int flags) { if (!(flags & RT6_LOOKUP_F_DST_NOREF) || - !list_empty(&rt->rt6i_uncached)) + !list_empty(&rt->dst.rt_uncached)) ip6_rt_put(rt); } --- a/include/net/route.h +++ b/include/net/route.h @@ -78,9 +78,6 @@ struct rtable { /* Miscellaneous cached information */ u32 rt_mtu_locked:1, rt_pmtu:31; - - struct list_head rt_uncached; - struct uncached_list *rt_uncached_list; }; static inline bool rt_is_input_route(const struct rtable *rt) --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1508,20 +1508,20 @@ void rt_add_uncached_list(struct rtable { struct uncached_list *ul = raw_cpu_ptr(&rt_uncached_list); - rt->rt_uncached_list = ul; + rt->dst.rt_uncached_list = ul; spin_lock_bh(&ul->lock); - list_add_tail(&rt->rt_uncached, &ul->head); + list_add_tail(&rt->dst.rt_uncached, &ul->head); spin_unlock_bh(&ul->lock); } void rt_del_uncached_list(struct rtable *rt) { - if (!list_empty(&rt->rt_uncached)) { - struct uncached_list *ul = rt->rt_uncached_list; + if (!list_empty(&rt->dst.rt_uncached)) { + struct uncached_list *ul = rt->dst.rt_uncached_list; spin_lock_bh(&ul->lock); - list_del_init(&rt->rt_uncached); + list_del_init(&rt->dst.rt_uncached); spin_unlock_bh(&ul->lock); } } @@ -1546,13 +1546,13 @@ void rt_flush_dev(struct net_device *dev continue; spin_lock_bh(&ul->lock); - list_for_each_entry_safe(rt, safe, &ul->head, rt_uncached) { + list_for_each_entry_safe(rt, safe, &ul->head, dst.rt_uncached) { if (rt->dst.dev != dev) continue; rt->dst.dev = blackhole_netdev; netdev_ref_replace(dev, blackhole_netdev, &rt->dst.dev_tracker, GFP_ATOMIC); - list_move(&rt->rt_uncached, &ul->quarantine); + list_move(&rt->dst.rt_uncached, &ul->quarantine); } spin_unlock_bh(&ul->lock); } @@ -1644,7 +1644,7 @@ struct rtable *rt_dst_alloc(struct net_d rt->rt_uses_gateway = 0; rt->rt_gw_family = 0; rt->rt_gw4 = 0; - INIT_LIST_HEAD(&rt->rt_uncached); + INIT_LIST_HEAD(&rt->dst.rt_uncached); rt->dst.output = ip_output; if (flags & RTCF_LOCAL) @@ -1675,7 +1675,7 @@ struct rtable *rt_dst_clone(struct net_d new_rt->rt_gw4 = rt->rt_gw4; else if (rt->rt_gw_family == AF_INET6) new_rt->rt_gw6 = rt->rt_gw6; - INIT_LIST_HEAD(&new_rt->rt_uncached); + INIT_LIST_HEAD(&new_rt->dst.rt_uncached); new_rt->dst.input = rt->dst.input; new_rt->dst.output = rt->dst.output; @@ -2859,7 +2859,7 @@ struct dst_entry *ipv4_blackhole_route(s else if (rt->rt_gw_family == AF_INET6) rt->rt_gw6 = ort->rt_gw6; - INIT_LIST_HEAD(&rt->rt_uncached); + INIT_LIST_HEAD(&rt->dst.rt_uncached); } dst_release(dst_orig); --- a/net/ipv4/xfrm4_policy.c +++ b/net/ipv4/xfrm4_policy.c @@ -91,7 +91,7 @@ static int xfrm4_fill_dst(struct xfrm_ds xdst->u.rt.rt_gw6 = rt->rt_gw6; xdst->u.rt.rt_pmtu = rt->rt_pmtu; xdst->u.rt.rt_mtu_locked = rt->rt_mtu_locked; - INIT_LIST_HEAD(&xdst->u.rt.rt_uncached); + INIT_LIST_HEAD(&xdst->u.rt.dst.rt_uncached); rt_add_uncached_list(&xdst->u.rt); return 0; @@ -121,7 +121,7 @@ static void xfrm4_dst_destroy(struct dst struct xfrm_dst *xdst = (struct xfrm_dst *)dst; dst_destroy_metrics_generic(dst); - if (xdst->u.rt.rt_uncached_list) + if (xdst->u.rt.dst.rt_uncached_list) rt_del_uncached_list(&xdst->u.rt); xfrm_dst_destroy(xdst); } --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -139,20 +139,20 @@ void rt6_uncached_list_add(struct rt6_in { struct uncached_list *ul = raw_cpu_ptr(&rt6_uncached_list); - rt->rt6i_uncached_list = ul; + rt->dst.rt_uncached_list = ul; spin_lock_bh(&ul->lock); - list_add_tail(&rt->rt6i_uncached, &ul->head); + list_add_tail(&rt->dst.rt_uncached, &ul->head); spin_unlock_bh(&ul->lock); } void rt6_uncached_list_del(struct rt6_info *rt) { - if (!list_empty(&rt->rt6i_uncached)) { - struct uncached_list *ul = rt->rt6i_uncached_list; + if (!list_empty(&rt->dst.rt_uncached)) { + struct uncached_list *ul = rt->dst.rt_uncached_list; spin_lock_bh(&ul->lock); - list_del_init(&rt->rt6i_uncached); + list_del_init(&rt->dst.rt_uncached); spin_unlock_bh(&ul->lock); } } @@ -169,7 +169,7 @@ static void rt6_uncached_list_flush_dev( continue; spin_lock_bh(&ul->lock); - list_for_each_entry_safe(rt, safe, &ul->head, rt6i_uncached) { + list_for_each_entry_safe(rt, safe, &ul->head, dst.rt_uncached) { struct inet6_dev *rt_idev = rt->rt6i_idev; struct net_device *rt_dev = rt->dst.dev; bool handled = false; @@ -188,7 +188,7 @@ static void rt6_uncached_list_flush_dev( handled = true; } if (handled) - list_move(&rt->rt6i_uncached, + list_move(&rt->dst.rt_uncached, &ul->quarantine); } spin_unlock_bh(&ul->lock); @@ -334,7 +334,7 @@ static const struct rt6_info ip6_blk_hol static void rt6_info_init(struct rt6_info *rt) { memset_after(rt, 0, dst); - INIT_LIST_HEAD(&rt->rt6i_uncached); + INIT_LIST_HEAD(&rt->dst.rt_uncached); } /* allocate dst with ip6_dst_ops */ @@ -2638,7 +2638,7 @@ struct dst_entry *ip6_route_output_flags dst = ip6_route_output_flags_noref(net, sk, fl6, flags); rt6 = (struct rt6_info *)dst; /* For dst cached in uncached_list, refcnt is already taken. */ - if (list_empty(&rt6->rt6i_uncached) && !dst_hold_safe(dst)) { + if (list_empty(&rt6->dst.rt_uncached) && !dst_hold_safe(dst)) { dst = &net->ipv6.ip6_null_entry->dst; dst_hold(dst); } @@ -2748,7 +2748,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry from = rcu_dereference(rt->from); if (from && (rt->rt6i_flags & RTF_PCPU || - unlikely(!list_empty(&rt->rt6i_uncached)))) + unlikely(!list_empty(&rt->dst.rt_uncached)))) dst_ret = rt6_dst_from_check(rt, from, cookie); else dst_ret = rt6_check(rt, from, cookie); @@ -6477,7 +6477,7 @@ static int __net_init ip6_route_net_init net->ipv6.ip6_null_entry->dst.ops = &net->ipv6.ip6_dst_ops; dst_init_metrics(&net->ipv6.ip6_null_entry->dst, ip6_template_metrics, true); - INIT_LIST_HEAD(&net->ipv6.ip6_null_entry->rt6i_uncached); + INIT_LIST_HEAD(&net->ipv6.ip6_null_entry->dst.rt_uncached); #ifdef CONFIG_IPV6_MULTIPLE_TABLES net->ipv6.fib6_has_custom_rules = false; @@ -6489,7 +6489,7 @@ static int __net_init ip6_route_net_init net->ipv6.ip6_prohibit_entry->dst.ops = &net->ipv6.ip6_dst_ops; dst_init_metrics(&net->ipv6.ip6_prohibit_entry->dst, ip6_template_metrics, true); - INIT_LIST_HEAD(&net->ipv6.ip6_prohibit_entry->rt6i_uncached); + INIT_LIST_HEAD(&net->ipv6.ip6_prohibit_entry->dst.rt_uncached); net->ipv6.ip6_blk_hole_entry = kmemdup(&ip6_blk_hole_entry_template, sizeof(*net->ipv6.ip6_blk_hole_entry), @@ -6499,7 +6499,7 @@ static int __net_init ip6_route_net_init net->ipv6.ip6_blk_hole_entry->dst.ops = &net->ipv6.ip6_dst_ops; dst_init_metrics(&net->ipv6.ip6_blk_hole_entry->dst, ip6_template_metrics, true); - INIT_LIST_HEAD(&net->ipv6.ip6_blk_hole_entry->rt6i_uncached); + INIT_LIST_HEAD(&net->ipv6.ip6_blk_hole_entry->dst.rt_uncached); #ifdef CONFIG_IPV6_SUBTREES net->ipv6.fib6_routes_require_src = 0; #endif --- a/net/ipv6/xfrm6_policy.c +++ b/net/ipv6/xfrm6_policy.c @@ -89,7 +89,7 @@ static int xfrm6_fill_dst(struct xfrm_ds xdst->u.rt6.rt6i_gateway = rt->rt6i_gateway; xdst->u.rt6.rt6i_dst = rt->rt6i_dst; xdst->u.rt6.rt6i_src = rt->rt6i_src; - INIT_LIST_HEAD(&xdst->u.rt6.rt6i_uncached); + INIT_LIST_HEAD(&xdst->u.rt6.dst.rt_uncached); rt6_uncached_list_add(&xdst->u.rt6); return 0; @@ -121,7 +121,7 @@ static void xfrm6_dst_destroy(struct dst if (likely(xdst->u.rt6.rt6i_idev)) in6_dev_put(xdst->u.rt6.rt6i_idev); dst_destroy_metrics_generic(dst); - if (xdst->u.rt6.rt6i_uncached_list) + if (xdst->u.rt6.dst.rt_uncached_list) rt6_uncached_list_del(&xdst->u.rt6); xfrm_dst_destroy(xdst); } From patchwork Thu Mar 23 20:55:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 13186086 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1760FC6FD1C for ; Thu, 23 Mar 2023 20:55:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231594AbjCWUzh (ORCPT ); Thu, 23 Mar 2023 16:55:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44658 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229600AbjCWUzd (ORCPT ); Thu, 23 Mar 2023 16:55:33 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1B0207AB3; Thu, 23 Mar 2023 13:55:32 -0700 (PDT) Message-ID: <20230323102800.101763813@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1679604930; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=hkHQ0iTngxG13UkvP/TjosHXNw7lXOMqMuZo5ApdX6M=; b=C1841qRox1pkq0T60no0X5no6imEgAczoEQ3thBsG2fa4vhRnnLloMptti4Gv/aB2755L8 Zw09iCxcHIREgxODEYOOtVfqVUiV2TIWbMp0PY4L+xkQQ7hk5spPviSoeOjimY1L6gio4v FO6hGYl3tJLH8aHJIOdUbx0MZ1ldaJFwbAJJsisIgtW/7FC+jiLmTRZAFFznTGKzIAmTw4 ctaJCB9WKVHCFjRKPT+cWpLt6uCTXEidQMiudkTewLGSHBEwFETyHS1LoyfLvo7rQZQ31n KzHBINBr15q03fytwPt3ZtjiHsL05owhjTvFnmVYCelzgcHV7Gl6S32U0E2dXQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1679604930; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=hkHQ0iTngxG13UkvP/TjosHXNw7lXOMqMuZo5ApdX6M=; b=nKgQv5A592T+NK56P/si/yFJeSfs56lmzHlK8IYKMSgH1RutBOScfPoPlp25xyzeHr/B29 xeD7iLmaTvTQKlBg== From: Thomas Gleixner To: LKML Cc: Linus Torvalds , x86@kernel.org, Wangyang Guo , Arjan van De Ven , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , netdev@vger.kernel.org, Will Deacon , Peter Zijlstra , Boqun Feng , Mark Rutland , Marc Zyngier , Qiuxu Zhuo Subject: [patch V3 2/4] atomics: Provide atomic_add_negative() variants References: <20230323102649.764958589@linutronix.de> MIME-Version: 1.0 Date: Thu, 23 Mar 2023 21:55:30 +0100 (CET) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Thomas Gleixner atomic_add_negative() does not provide the relaxed/acquire/release variants. Provide them in preparation for a new scalable reference count algorithm. Signed-off-by: Thomas Gleixner Acked-by: Mark Rutland --- V3: Fix typos (Mark). Make kernel doc correct vs. ordering variants V2: New patch --- include/linux/atomic/atomic-arch-fallback.h | 208 ++++++++++++++++++++++++++-- include/linux/atomic/atomic-instrumented.h | 68 +++++++++ include/linux/atomic/atomic-long.h | 38 ++++- scripts/atomic/atomics.tbl | 2 scripts/atomic/fallbacks/add_negative | 11 - 5 files changed, 309 insertions(+), 18 deletions(-) --- a/include/linux/atomic/atomic-arch-fallback.h +++ b/include/linux/atomic/atomic-arch-fallback.h @@ -1208,15 +1208,21 @@ arch_atomic_inc_and_test(atomic_t *v) #define arch_atomic_inc_and_test arch_atomic_inc_and_test #endif +#ifndef arch_atomic_add_negative_relaxed +#ifdef arch_atomic_add_negative +#define arch_atomic_add_negative_acquire arch_atomic_add_negative +#define arch_atomic_add_negative_release arch_atomic_add_negative +#define arch_atomic_add_negative_relaxed arch_atomic_add_negative +#endif /* arch_atomic_add_negative */ + #ifndef arch_atomic_add_negative /** - * arch_atomic_add_negative - add and test if negative + * arch_atomic_add_negative - Add and test if negative * @i: integer value to add * @v: pointer of type atomic_t * - * Atomically adds @i to @v and returns true - * if the result is negative, or false when - * result is greater than or equal to zero. + * Atomically adds @i to @v and returns true if the result is negative, + * or false when the result is greater than or equal to zero. */ static __always_inline bool arch_atomic_add_negative(int i, atomic_t *v) @@ -1226,6 +1232,95 @@ arch_atomic_add_negative(int i, atomic_t #define arch_atomic_add_negative arch_atomic_add_negative #endif +#ifndef arch_atomic_add_negative_acquire +/** + * arch_atomic_add_negative_acquire - Add and test if negative + * @i: integer value to add + * @v: pointer of type atomic_t + * + * Atomically adds @i to @v and returns true if the result is negative, + * or false when the result is greater than or equal to zero. + */ +static __always_inline bool +arch_atomic_add_negative_acquire(int i, atomic_t *v) +{ + return arch_atomic_add_return_acquire(i, v) < 0; +} +#define arch_atomic_add_negative_acquire arch_atomic_add_negative_acquire +#endif + +#ifndef arch_atomic_add_negative_release +/** + * arch_atomic_add_negative_release - Add and test if negative + * @i: integer value to add + * @v: pointer of type atomic_t + * + * Atomically adds @i to @v and returns true if the result is negative, + * or false when the result is greater than or equal to zero. + */ +static __always_inline bool +arch_atomic_add_negative_release(int i, atomic_t *v) +{ + return arch_atomic_add_return_release(i, v) < 0; +} +#define arch_atomic_add_negative_release arch_atomic_add_negative_release +#endif + +#ifndef arch_atomic_add_negative_relaxed +/** + * arch_atomic_add_negative_relaxed - Add and test if negative + * @i: integer value to add + * @v: pointer of type atomic_t + * + * Atomically adds @i to @v and returns true if the result is negative, + * or false when the result is greater than or equal to zero. + */ +static __always_inline bool +arch_atomic_add_negative_relaxed(int i, atomic_t *v) +{ + return arch_atomic_add_return_relaxed(i, v) < 0; +} +#define arch_atomic_add_negative_relaxed arch_atomic_add_negative_relaxed +#endif + +#else /* arch_atomic_add_negative_relaxed */ + +#ifndef arch_atomic_add_negative_acquire +static __always_inline bool +arch_atomic_add_negative_acquire(int i, atomic_t *v) +{ + bool ret = arch_atomic_add_negative_relaxed(i, v); + __atomic_acquire_fence(); + return ret; +} +#define arch_atomic_add_negative_acquire arch_atomic_add_negative_acquire +#endif + +#ifndef arch_atomic_add_negative_release +static __always_inline bool +arch_atomic_add_negative_release(int i, atomic_t *v) +{ + __atomic_release_fence(); + return arch_atomic_add_negative_relaxed(i, v); +} +#define arch_atomic_add_negative_release arch_atomic_add_negative_release +#endif + +#ifndef arch_atomic_add_negative +static __always_inline bool +arch_atomic_add_negative(int i, atomic_t *v) +{ + bool ret; + __atomic_pre_full_fence(); + ret = arch_atomic_add_negative_relaxed(i, v); + __atomic_post_full_fence(); + return ret; +} +#define arch_atomic_add_negative arch_atomic_add_negative +#endif + +#endif /* arch_atomic_add_negative_relaxed */ + #ifndef arch_atomic_fetch_add_unless /** * arch_atomic_fetch_add_unless - add unless the number is already a given value @@ -2329,15 +2424,21 @@ arch_atomic64_inc_and_test(atomic64_t *v #define arch_atomic64_inc_and_test arch_atomic64_inc_and_test #endif +#ifndef arch_atomic64_add_negative_relaxed +#ifdef arch_atomic64_add_negative +#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative +#define arch_atomic64_add_negative_release arch_atomic64_add_negative +#define arch_atomic64_add_negative_relaxed arch_atomic64_add_negative +#endif /* arch_atomic64_add_negative */ + #ifndef arch_atomic64_add_negative /** - * arch_atomic64_add_negative - add and test if negative + * arch_atomic64_add_negative - Add and test if negative * @i: integer value to add * @v: pointer of type atomic64_t * - * Atomically adds @i to @v and returns true - * if the result is negative, or false when - * result is greater than or equal to zero. + * Atomically adds @i to @v and returns true if the result is negative, + * or false when the result is greater than or equal to zero. */ static __always_inline bool arch_atomic64_add_negative(s64 i, atomic64_t *v) @@ -2347,6 +2448,95 @@ arch_atomic64_add_negative(s64 i, atomic #define arch_atomic64_add_negative arch_atomic64_add_negative #endif +#ifndef arch_atomic64_add_negative_acquire +/** + * arch_atomic64_add_negative_acquire - Add and test if negative + * @i: integer value to add + * @v: pointer of type atomic64_t + * + * Atomically adds @i to @v and returns true if the result is negative, + * or false when the result is greater than or equal to zero. + */ +static __always_inline bool +arch_atomic64_add_negative_acquire(s64 i, atomic64_t *v) +{ + return arch_atomic64_add_return_acquire(i, v) < 0; +} +#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative_acquire +#endif + +#ifndef arch_atomic64_add_negative_release +/** + * arch_atomic64_add_negative_release - Add and test if negative + * @i: integer value to add + * @v: pointer of type atomic64_t + * + * Atomically adds @i to @v and returns true if the result is negative, + * or false when the result is greater than or equal to zero. + */ +static __always_inline bool +arch_atomic64_add_negative_release(s64 i, atomic64_t *v) +{ + return arch_atomic64_add_return_release(i, v) < 0; +} +#define arch_atomic64_add_negative_release arch_atomic64_add_negative_release +#endif + +#ifndef arch_atomic64_add_negative_relaxed +/** + * arch_atomic64_add_negative_relaxed - Add and test if negative + * @i: integer value to add + * @v: pointer of type atomic64_t + * + * Atomically adds @i to @v and returns true if the result is negative, + * or false when the result is greater than or equal to zero. + */ +static __always_inline bool +arch_atomic64_add_negative_relaxed(s64 i, atomic64_t *v) +{ + return arch_atomic64_add_return_relaxed(i, v) < 0; +} +#define arch_atomic64_add_negative_relaxed arch_atomic64_add_negative_relaxed +#endif + +#else /* arch_atomic64_add_negative_relaxed */ + +#ifndef arch_atomic64_add_negative_acquire +static __always_inline bool +arch_atomic64_add_negative_acquire(s64 i, atomic64_t *v) +{ + bool ret = arch_atomic64_add_negative_relaxed(i, v); + __atomic_acquire_fence(); + return ret; +} +#define arch_atomic64_add_negative_acquire arch_atomic64_add_negative_acquire +#endif + +#ifndef arch_atomic64_add_negative_release +static __always_inline bool +arch_atomic64_add_negative_release(s64 i, atomic64_t *v) +{ + __atomic_release_fence(); + return arch_atomic64_add_negative_relaxed(i, v); +} +#define arch_atomic64_add_negative_release arch_atomic64_add_negative_release +#endif + +#ifndef arch_atomic64_add_negative +static __always_inline bool +arch_atomic64_add_negative(s64 i, atomic64_t *v) +{ + bool ret; + __atomic_pre_full_fence(); + ret = arch_atomic64_add_negative_relaxed(i, v); + __atomic_post_full_fence(); + return ret; +} +#define arch_atomic64_add_negative arch_atomic64_add_negative +#endif + +#endif /* arch_atomic64_add_negative_relaxed */ + #ifndef arch_atomic64_fetch_add_unless /** * arch_atomic64_fetch_add_unless - add unless the number is already a given value @@ -2456,4 +2646,4 @@ arch_atomic64_dec_if_positive(atomic64_t #endif #endif /* _LINUX_ATOMIC_FALLBACK_H */ -// b5e87bdd5ede61470c29f7a7e4de781af3770f09 +// 00071fffa021cec66f6290d706d69c91df87bade --- a/include/linux/atomic/atomic-instrumented.h +++ b/include/linux/atomic/atomic-instrumented.h @@ -592,6 +592,28 @@ atomic_add_negative(int i, atomic_t *v) return arch_atomic_add_negative(i, v); } +static __always_inline bool +atomic_add_negative_acquire(int i, atomic_t *v) +{ + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic_add_negative_acquire(i, v); +} + +static __always_inline bool +atomic_add_negative_release(int i, atomic_t *v) +{ + kcsan_release(); + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic_add_negative_release(i, v); +} + +static __always_inline bool +atomic_add_negative_relaxed(int i, atomic_t *v) +{ + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic_add_negative_relaxed(i, v); +} + static __always_inline int atomic_fetch_add_unless(atomic_t *v, int a, int u) { @@ -1211,6 +1233,28 @@ atomic64_add_negative(s64 i, atomic64_t return arch_atomic64_add_negative(i, v); } +static __always_inline bool +atomic64_add_negative_acquire(s64 i, atomic64_t *v) +{ + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic64_add_negative_acquire(i, v); +} + +static __always_inline bool +atomic64_add_negative_release(s64 i, atomic64_t *v) +{ + kcsan_release(); + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic64_add_negative_release(i, v); +} + +static __always_inline bool +atomic64_add_negative_relaxed(s64 i, atomic64_t *v) +{ + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic64_add_negative_relaxed(i, v); +} + static __always_inline s64 atomic64_fetch_add_unless(atomic64_t *v, s64 a, s64 u) { @@ -1830,6 +1874,28 @@ atomic_long_add_negative(long i, atomic_ return arch_atomic_long_add_negative(i, v); } +static __always_inline bool +atomic_long_add_negative_acquire(long i, atomic_long_t *v) +{ + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic_long_add_negative_acquire(i, v); +} + +static __always_inline bool +atomic_long_add_negative_release(long i, atomic_long_t *v) +{ + kcsan_release(); + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic_long_add_negative_release(i, v); +} + +static __always_inline bool +atomic_long_add_negative_relaxed(long i, atomic_long_t *v) +{ + instrument_atomic_read_write(v, sizeof(*v)); + return arch_atomic_long_add_negative_relaxed(i, v); +} + static __always_inline long atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u) { @@ -2083,4 +2149,4 @@ atomic_long_dec_if_positive(atomic_long_ }) #endif /* _LINUX_ATOMIC_INSTRUMENTED_H */ -// 764f741eb77a7ad565dc8d99ce2837d5542e8aee +// 1b485de9cbaa4900de59e14ee2084357eaeb1c3a --- a/include/linux/atomic/atomic-long.h +++ b/include/linux/atomic/atomic-long.h @@ -479,6 +479,24 @@ arch_atomic_long_add_negative(long i, at return arch_atomic64_add_negative(i, v); } +static __always_inline bool +arch_atomic_long_add_negative_acquire(long i, atomic_long_t *v) +{ + return arch_atomic64_add_negative_acquire(i, v); +} + +static __always_inline bool +arch_atomic_long_add_negative_release(long i, atomic_long_t *v) +{ + return arch_atomic64_add_negative_release(i, v); +} + +static __always_inline bool +arch_atomic_long_add_negative_relaxed(long i, atomic_long_t *v) +{ + return arch_atomic64_add_negative_relaxed(i, v); +} + static __always_inline long arch_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u) { @@ -973,6 +991,24 @@ arch_atomic_long_add_negative(long i, at return arch_atomic_add_negative(i, v); } +static __always_inline bool +arch_atomic_long_add_negative_acquire(long i, atomic_long_t *v) +{ + return arch_atomic_add_negative_acquire(i, v); +} + +static __always_inline bool +arch_atomic_long_add_negative_release(long i, atomic_long_t *v) +{ + return arch_atomic_add_negative_release(i, v); +} + +static __always_inline bool +arch_atomic_long_add_negative_relaxed(long i, atomic_long_t *v) +{ + return arch_atomic_add_negative_relaxed(i, v); +} + static __always_inline long arch_atomic_long_fetch_add_unless(atomic_long_t *v, long a, long u) { @@ -1011,4 +1047,4 @@ arch_atomic_long_dec_if_positive(atomic_ #endif /* CONFIG_64BIT */ #endif /* _LINUX_ATOMIC_LONG_H */ -// e8f0e08ff072b74d180eabe2ad001282b38c2c88 +// a194c07d7d2f4b0e178d3c118c919775d5d65f50 --- a/scripts/atomic/atomics.tbl +++ b/scripts/atomic/atomics.tbl @@ -33,7 +33,7 @@ try_cmpxchg B v p:old i:new sub_and_test b i v dec_and_test b v inc_and_test b v -add_negative b i v +add_negative B i v add_unless fb v i:a i:u inc_not_zero b v inc_unless_negative b v --- a/scripts/atomic/fallbacks/add_negative +++ b/scripts/atomic/fallbacks/add_negative @@ -1,16 +1,15 @@ cat < X-Patchwork-Id: 13186087 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 803BFC6FD1C for ; Thu, 23 Mar 2023 20:55:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231634AbjCWUzw (ORCPT ); Thu, 23 Mar 2023 16:55:52 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44712 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231228AbjCWUzf (ORCPT ); Thu, 23 Mar 2023 16:55:35 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5407A211E9; Thu, 23 Mar 2023 13:55:33 -0700 (PDT) Message-ID: <20230323102800.158429195@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1679604932; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=TIcy7hjzbgoCve+ENTr0UmXx1uxEj1nHMIp4rJd/ZA4=; b=JtOFQ3cei0e7THPaF11AJh9RKKoP8mJ+YZgfoGw25WsGGOO6RD/FzVCX8dFQDuKku+QZeg K05G+1i4VJI55RMfnMQihXDQwv5bC9OV8NxfA2EHbLtusPwIjJCVJ0onHcB1rxQ0lkfKt8 l/5cg879+EkBxVxIVFMDMequ3Mnh7TlBKb6d3YWc0ca33KgT7hxsZdZIoYuJyrRtwsExzO /MHf+T7uQ46qX0ehtOX1zbs7M8WnSLNbShuTk7uPuvAg12TfMTS8XzJv+oVOlUm/BZ2UnC oy/cGOAm9Uj4HWWnK3Uvgq6x5N+2bADc8Emdk6cO0lnPuzVIGR+nndSaWstAqw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1679604932; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=TIcy7hjzbgoCve+ENTr0UmXx1uxEj1nHMIp4rJd/ZA4=; b=x+HXOZLlwD4eHocuIDHyDeStTtYrI63GoSpb/TSbRdL1Atp3GrKHm39tfEfSMnxPVEESci HWyrCvSIaywBw3AQ== From: Thomas Gleixner To: LKML Cc: Linus Torvalds , x86@kernel.org, Wangyang Guo , Arjan van De Ven , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , netdev@vger.kernel.org, Will Deacon , Peter Zijlstra , Boqun Feng , Mark Rutland , Marc Zyngier , Qiuxu Zhuo , Arjan Van De Ven Subject: [patch V3 3/4] atomics: Provide rcuref - scalable reference counting References: <20230323102649.764958589@linutronix.de> MIME-Version: 1.0 Date: Thu, 23 Mar 2023 21:55:31 +0100 (CET) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Thomas Gleixner atomic_t based reference counting, including refcount_t, uses atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is implemented with a atomic_try_cmpxchg() loop. High contention of the reference count leads to retry loops and scales badly. There is nothing to improve on this implementation as the semantics have to be preserved. Provide rcuref as a scalable alternative solution which is suitable for RCU managed objects. Similar to refcount_t it comes with overflow and underflow detection and mitigation. rcuref treats the underlying atomic_t as an unsigned integer and partitions this space into zones: 0x00000000 - 0x7FFFFFFF valid zone (1 .. (INT_MAX + 1) references) 0x80000000 - 0xBFFFFFFF saturation zone 0xC0000000 - 0xFFFFFFFE dead zone 0xFFFFFFFF no reference rcuref_get() unconditionally increments the reference count with atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the reference count with atomic_add_negative_release(). This unconditional increment avoids the inc_not_zero() problem, but requires a more complex implementation on the put() side when the count drops from 0 to -1. When this transition is detected then it is attempted to mark the reference count dead, by setting it to the midpoint of the dead zone with a single atomic_cmpxchg_release() operation. This operation can fail due to a concurrent rcuref_get() elevating the reference count from -1 to 0 again. If the unconditional increment in rcuref_get() hits a reference count which is marked dead (or saturated) it will detect it after the fact and bring back the reference count to the midpoint of the respective zone. The zones provide enough tolerance which makes it practically impossible to escape from a zone. The racy implementation of rcuref_put() requires to protect rcuref_put() against a grace period ending in order to prevent a subtle use after free. As RCU is the only mechanism which allows to protect against that, it is not possible to fully replace the atomic_inc_not_zero() based implementation of refcount_t with this scheme. The final drop is slightly more expensive than the atomic_dec_return() counterpart, but that's not the case which this is optimized for. The optimization is on the high frequeunt get()/put() pairs and their scalability. The performance of an uncontended rcuref_get()/put() pair where the put() is not dropping the last reference is still on par with the plain atomic operations, while at the same time providing overflow and underflow detection and mitigation. The performance of rcuref compared to plain atomic_inc_not_zero() and atomic_dec_return() based reference counting under contention: - Micro benchmark: All CPUs running a increment/decrement loop on an elevated reference count, which means the 0 to -1 transition never happens. The performance gain depends on microarchitecture and the number of CPUs and has been observed in the range of 1.3X to 4.7X - Conversion of dst_entry::__refcnt to rcuref and testing with the localhost memtier/memcached benchmark. That benchmark shows the reference count contention prominently. The performance gain depends on microarchitecture and the number of CPUs and has been observed in the range of 1.1X to 2.6X over the previous fix for the false sharing issue vs. struct dst_entry::__refcnt. When memtier is run over a real 1Gb network connection, there is a small gain on top of the false sharing fix. The two changes combined result in a 2%-5% total gain for that networked test. Reported-by: Wangyang Guo Reported-by: Arjan Van De Ven Signed-off-by: Thomas Gleixner --- V3: Fix details in changelog and comments (Qiuxu) V2: Switch to atomic_add_negative() to make the fast path lean (Linus) --- include/linux/rcuref.h | 155 +++++++++++++++++++++++++++ include/linux/types.h | 6 + lib/Makefile | 2 lib/rcuref.c | 281 +++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 443 insertions(+), 1 deletion(-) --- /dev/null +++ b/include/linux/rcuref.h @@ -0,0 +1,155 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef _LINUX_RCUREF_H +#define _LINUX_RCUREF_H + +#include +#include +#include +#include +#include +#include + +#define RCUREF_ONEREF 0x00000000U +#define RCUREF_MAXREF 0x7FFFFFFFU +#define RCUREF_SATURATED 0xA0000000U +#define RCUREF_RELEASED 0xC0000000U +#define RCUREF_DEAD 0xE0000000U +#define RCUREF_NOREF 0xFFFFFFFFU + +/** + * rcuref_init - Initialize a rcuref reference count with the given reference count + * @ref: Pointer to the reference count + * @cnt: The initial reference count typically '1' + */ +static inline void rcuref_init(rcuref_t *ref, unsigned int cnt) +{ + atomic_set(&ref->refcnt, cnt - 1); +} + +/** + * rcuref_read - Read the number of held reference counts of a rcuref + * @ref: Pointer to the reference count + * + * Return: The number of held references (0 ... N) + */ +static inline unsigned int rcuref_read(rcuref_t *ref) +{ + unsigned int c = atomic_read(&ref->refcnt); + + /* Return 0 if within the DEAD zone. */ + return c >= RCUREF_RELEASED ? 0 : c + 1; +} + +extern __must_check bool rcuref_get_slowpath(rcuref_t *ref); + +/** + * rcuref_get - Acquire one reference on a rcuref reference count + * @ref: Pointer to the reference count + * + * Similar to atomic_inc_not_zero() but saturates at RCUREF_MAXREF. + * + * Provides no memory ordering, it is assumed the caller has guaranteed the + * object memory to be stable (RCU, etc.). It does provide a control dependency + * and thereby orders future stores. See documentation in lib/rcuref.c + * + * Return: + * False if the attempt to acquire a reference failed. This happens + * when the last reference has been put already + * + * True if a reference was successfully acquired + */ +static inline __must_check bool rcuref_get(rcuref_t *ref) +{ + /* + * Unconditionally increase the reference count. The saturation and + * dead zones provide enough tolerance for this. + */ + if (likely(!atomic_add_negative_relaxed(1, &ref->refcnt))) + return true; + + /* Handle the cases inside the saturation and dead zones */ + return rcuref_get_slowpath(ref); +} + +extern __must_check bool rcuref_put_slowpath(rcuref_t *ref); + +/* + * Internal helper. Do not invoke directly. + */ +static __always_inline __must_check bool __rcuref_put(rcuref_t *ref) +{ + RCU_LOCKDEP_WARN(!rcu_read_lock_held() && preemptible(), + "suspicious rcuref_put_rcusafe() usage"); + /* + * Unconditionally decrease the reference count. The saturation and + * dead zones provide enough tolerance for this. + */ + if (likely(!atomic_add_negative_release(-1, &ref->refcnt))) + return false; + + /* + * Handle the last reference drop and cases inside the saturation + * and dead zones. + */ + return rcuref_put_slowpath(ref); +} + +/** + * rcuref_put_rcusafe -- Release one reference for a rcuref reference count RCU safe + * @ref: Pointer to the reference count + * + * Provides release memory ordering, such that prior loads and stores are done + * before, and provides an acquire ordering on success such that free() + * must come after. + * + * Can be invoked from contexts, which guarantee that no grace period can + * happen which would free the object concurrently if the decrement drops + * the last reference and the slowpath races against a concurrent get() and + * put() pair. rcu_read_lock()'ed and atomic contexts qualify. + * + * Return: + * True if this was the last reference with no future references + * possible. This signals the caller that it can safely release the + * object which is protected by the reference counter. + * + * False if there are still active references or the put() raced + * with a concurrent get()/put() pair. Caller is not allowed to + * release the protected object. + */ +static inline __must_check bool rcuref_put_rcusafe(rcuref_t *ref) +{ + return __rcuref_put(ref); +} + +/** + * rcuref_put -- Release one reference for a rcuref reference count + * @ref: Pointer to the reference count + * + * Can be invoked from any context. + * + * Provides release memory ordering, such that prior loads and stores are done + * before, and provides an acquire ordering on success such that free() + * must come after. + * + * Return: + * + * True if this was the last reference with no future references + * possible. This signals the caller that it can safely schedule the + * object, which is protected by the reference counter, for + * deconstruction. + * + * False if there are still active references or the put() raced + * with a concurrent get()/put() pair. Caller is not allowed to + * deconstruct the protected object. + */ +static inline __must_check bool rcuref_put(rcuref_t *ref) +{ + bool released; + + preempt_disable(); + released = __rcuref_put(ref); + preempt_enable(); + return released; +} + +#endif --- a/include/linux/types.h +++ b/include/linux/types.h @@ -175,6 +175,12 @@ typedef struct { } atomic64_t; #endif +typedef struct { + atomic_t refcnt; +} rcuref_t; + +#define RCUREF_INIT(i) { .refcnt = ATOMIC_INIT(i - 1) } + struct list_head { struct list_head *next, *prev; }; --- a/lib/Makefile +++ b/lib/Makefile @@ -47,7 +47,7 @@ obj-y += bcd.o sort.o parser.o debug_loc list_sort.o uuid.o iov_iter.o clz_ctz.o \ bsearch.o find_bit.o llist.o memweight.o kfifo.o \ percpu-refcount.o rhashtable.o base64.o \ - once.o refcount.o usercopy.o errseq.o bucket_locks.o \ + once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \ generic-radix-tree.o obj-$(CONFIG_STRING_SELFTEST) += test_string.o obj-y += string_helpers.o --- /dev/null +++ b/lib/rcuref.c @@ -0,0 +1,281 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * rcuref - A scalable reference count implementation for RCU managed objects + * + * rcuref is provided to replace open coded reference count implementations + * based on atomic_t. It protects explicitely RCU managed objects which can + * be visible even after the last reference has been dropped and the object + * is heading towards destruction. + * + * A common usage pattern is: + * + * get() + * rcu_read_lock(); + * p = get_ptr(); + * if (p && !atomic_inc_not_zero(&p->refcnt)) + * p = NULL; + * rcu_read_unlock(); + * return p; + * + * put() + * if (!atomic_dec_return(&->refcnt)) { + * remove_ptr(p); + * kfree_rcu((p, rcu); + * } + * + * atomic_inc_not_zero() is implemented with a try_cmpxchg() loop which has + * O(N^2) behaviour under contention with N concurrent operations. + * + * rcuref uses atomic_add_negative_relaxed() for the fast path, which scales + * better under contention. + * + * Why not refcount? + * ================= + * + * In principle it should be possible to make refcount use the rcuref + * scheme, but the destruction race described below cannot be prevented + * unless the protected object is RCU managed. + * + * Theory of operation + * =================== + * + * rcuref uses an unsigned integer reference counter. As long as the + * counter value is greater than or equal to RCUREF_ONEREF and not larger + * than RCUREF_MAXREF the reference is alive: + * + * ONEREF MAXREF SATURATED RELEASED DEAD NOREF + * 0 0x7FFFFFFF 0x8000000 0xA0000000 0xBFFFFFFF 0xC0000000 0xE0000000 0xFFFFFFFF + * <---valid --------> <-------saturation zone-------> <-----dead zone-----> + * + * The get() and put() operations do unconditional increments and + * decrements. The result is checked after the operation. This optimizes + * for the fast path. + * + * If the reference count is saturated or dead, then the increments and + * decrements are not harmful as the reference count still stays in the + * respective zones and is always set back to STATURATED resp. DEAD. The + * zones have room for 2^28 racing operations in each direction, which + * makes it practically impossible to escape the zones. + * + * Once the last reference is dropped the reference count becomes + * RCUREF_NOREF which forces rcuref_put() into the slowpath operation. The + * slowpath then tries to set the reference count from RCUREF_NOREF to + * RCUREF_DEAD via a cmpxchg(). This opens a small window where a + * concurrent rcuref_get() can acquire the reference count and bring it + * back to RCUREF_ONEREF or even drop the reference again and mark it DEAD. + * + * If the cmpxchg() succeeds then a concurrent rcuref_get() will result in + * DEAD + 1, which is inside the dead zone. If that happens the reference + * count is put back to DEAD. + * + * The actual race is possible due to the unconditional increment and + * decrements in rcuref_get() and rcuref_put(): + * + * T1 T2 + * get() put() + * if (atomic_add_negative(-1, &ref->refcnt)) + * succeeds-> atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); + * + * atomic_add_negative(1, &ref->refcnt); <- Elevates refcount to DEAD + 1 + * + * As the result of T1's add is negative, the get() goes into the slow path + * and observes refcnt being in the dead zone which makes the operation fail. + * + * Possible critical states: + * + * Context Counter References Operation + * T1 0 1 init() + * T2 1 2 get() + * T1 0 1 put() + * T2 -1 0 put() tries to mark dead + * T1 0 1 get() + * T2 0 1 put() mark dead fails + * T1 -1 0 put() tries to mark dead + * T1 DEAD 0 put() mark dead succeeds + * T2 DEAD+1 0 get() fails and puts it back to DEAD + * + * Of course there are more complex scenarios, but the above illustrates + * the working principle. The rest is left to the imagination of the + * reader. + * + * Deconstruction race + * =================== + * + * The release operation must be protected by prohibiting a grace period in + * order to prevent a possible use after free: + * + * T1 T2 + * put() get() + * // ref->refcnt = ONEREF + * if (!atomic_add_negative(-1, &ref->refcnt)) + * return false; <- Not taken + * + * // ref->refcnt == NOREF + * --> preemption + * // Elevates ref->refcnt to ONEREF + * if (!atomic_add_negative(1, &ref->refcnt)) + * return true; <- taken + * + * if (put(&p->ref)) { <-- Succeeds + * remove_pointer(p); + * kfree_rcu(p, rcu); + * } + * + * RCU grace period ends, object is freed + * + * atomic_cmpxchg(&ref->refcnt, NOREF, DEAD); <- UAF + * + * This is prevented by disabling preemption around the put() operation as + * that's in most kernel configurations cheaper than a rcu_read_lock() / + * rcu_read_unlock() pair and in many cases even a NOOP. In any case it + * prevents the grace period which keeps the object alive until all put() + * operations complete. + * + * Saturation protection + * ===================== + * + * The reference count has a saturation limit RCUREF_MAXREF (INT_MAX). + * Once this is exceedded the reference count becomes stale by setting it + * to RCUREF_SATURATED, which will cause a memory leak, but it prevents + * wrap arounds which obviously cause worse problems than a memory + * leak. When saturation is reached a warning is emitted. + * + * Race conditions + * =============== + * + * All reference count increment/decrement operations are unconditional and + * only verified after the fact. This optimizes for the good case and takes + * the occasional race vs. a dead or already saturated refcount into + * account. The saturation and dead zones are large enough to accomodate + * for that. + * + * Memory ordering + * =============== + * + * Memory ordering rules are slightly relaxed wrt regular atomic_t functions + * and provide only what is strictly required for refcounts. + * + * The increments are fully relaxed; these will not provide ordering. The + * rationale is that whatever is used to obtain the object to increase the + * reference count on will provide the ordering. For locked data + * structures, its the lock acquire, for RCU/lockless data structures its + * the dependent load. + * + * rcuref_get() provides a control dependency ordering future stores which + * ensures that the object is not modified when acquiring a reference + * fails. + * + * rcuref_put() provides release order, i.e. all prior loads and stores + * will be issued before. It also provides a control dependency ordering + * against the subsequent destruction of the object. + * + * If rcuref_put() successfully dropped the last reference and marked the + * object DEAD it also provides acquire ordering. + */ + +#include +#include + +/** + * rcuref_get_slowpath - Slowpath of rcuref_get() + * @ref: Pointer to the reference count + * + * Invoked when the reference count is outside of the valid zone. + * + * Return: + * False if the reference count was already marked dead + * + * True if the reference count is saturated, which prevents the + * object from being deconstructed ever. + */ +bool rcuref_get_slowpath(rcuref_t *ref) +{ + unsigned int cnt = atomic_read(&ref->refcnt); + + /* + * If the reference count was already marked dead, undo the + * increment so it stays in the middle of the dead zone and return + * fail. + */ + if (cnt >= RCUREF_RELEASED) { + atomic_set(&ref->refcnt, RCUREF_DEAD); + return false; + } + + /* + * If it was saturated, warn and mark it so. In case the increment + * was already on a saturated value restore the saturation + * marker. This keeps it in the middle of the saturation zone and + * prevents the reference count from overflowing. This leaks the + * object memory, but prevents the obvious reference count overflow + * damage. + */ + if (WARN_ONCE(cnt > RCUREF_MAXREF, "rcuref saturated - leaking memory")) + atomic_set(&ref->refcnt, RCUREF_SATURATED); + return true; +} +EXPORT_SYMBOL_GPL(rcuref_get_slowpath); + +/** + * rcuref_put_slowpath - Slowpath of __rcuref_put() + * @ref: Pointer to the reference count + * + * Invoked when the reference count is outside of the valid zone. + * + * Return: + * True if this was the last reference with no future references + * possible. This signals the caller that it can safely schedule the + * object, which is protected by the reference counter, for + * deconstruction. + * + * False if there are still active references or the put() raced + * with a concurrent get()/put() pair. Caller is not allowed to + * deconstruct the protected object. + */ +bool rcuref_put_slowpath(rcuref_t *ref) +{ + unsigned int cnt = atomic_read(&ref->refcnt); + + /* Did this drop the last reference? */ + if (likely(cnt == RCUREF_NOREF)) { + /* + * Carefully try to set the reference count to RCUREF_DEAD. + * + * This can fail if a concurrent get() operation has + * elevated it again or the corresponding put() even marked + * it dead already. Both are valid situations and do not + * require a retry. If this fails the caller is not + * allowed to deconstruct the object. + */ + if (atomic_cmpxchg_release(&ref->refcnt, RCUREF_NOREF, RCUREF_DEAD) != RCUREF_NOREF) + return false; + + /* + * The caller can safely schedule the object for + * deconstruction. Provide acquire ordering. + */ + smp_acquire__after_ctrl_dep(); + return true; + } + + /* + * If the reference count was already in the dead zone, then this + * put() operation is imbalanced. Warn, put the reference count back to + * DEAD and tell the caller to not deconstruct the object. + */ + if (WARN_ONCE(cnt >= RCUREF_RELEASED, "rcuref - imbalanced put()")) { + atomic_set(&ref->refcnt, RCUREF_DEAD); + return false; + } + + /* + * This is a put() operation on a saturated refcount. Restore the + * mean saturation value and tell the caller to not deconstruct the + * object. + */ + if (cnt > RCUREF_MAXREF) + atomic_set(&ref->refcnt, RCUREF_SATURATED); + return false; +} +EXPORT_SYMBOL_GPL(rcuref_put_slowpath); From patchwork Thu Mar 23 20:55:32 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Thomas Gleixner X-Patchwork-Id: 13186088 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B229EC761AF for ; Thu, 23 Mar 2023 20:55:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231685AbjCWUzy (ORCPT ); Thu, 23 Mar 2023 16:55:54 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44736 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231580AbjCWUzg (ORCPT ); Thu, 23 Mar 2023 16:55:36 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 81C3C22018; Thu, 23 Mar 2023 13:55:34 -0700 (PDT) Message-ID: <20230323102800.215027837@linutronix.de> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1679604933; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=zemSn9bwMuohAbZIJdZqdqifWHmv1X857M8xMLEwTMk=; b=bU8on7hNzZNid/j31YZIKpZYzS/UzQOUW612ucQTcm/QAfMfGSmeXHVSV1ZzoXS/kP1Nn+ IGOU+VQfJtRmvzngJLL9J9/TS0ZogndMFY0t3MvXa2EQxrX8dqK/9vBjCa4HbNBdgg85dN EnWJ9Ol2cTVVimt9SmCCqHhc/jTgD0YPj19zBNtqfnl75S9Dz3u/hWwTuVtE+DRIdNQxO8 5xdYK1Thm6eNZ4vWG9L2AqBSm/P4bOrbF2O7FIoPlNpZDy2N942sMV/egrfN/9aN/frt/c gK4rgdApKiKUNuFbAMD/cMWlimRJfEPKL4tVqDVNYVCkHs+n0bWCCgirnP5bpQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1679604933; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: references:references; bh=zemSn9bwMuohAbZIJdZqdqifWHmv1X857M8xMLEwTMk=; b=6aWbqEmg5KPHPKXvit+0OB62ebkKDg8DRby2KaNiUEBJJLn5GSCiSv4Hx2bwZwh6QgFDs4 Jg8zCT9QcsOp6hCA== From: Thomas Gleixner To: LKML Cc: Linus Torvalds , x86@kernel.org, Wangyang Guo , Arjan van De Ven , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , netdev@vger.kernel.org, Will Deacon , Peter Zijlstra , Boqun Feng , Mark Rutland , Marc Zyngier , Qiuxu Zhuo , Arjan Van De Ven Subject: [patch V3 4/4] net: dst: Switch to rcuref_t reference counting References: <20230323102649.764958589@linutronix.de> MIME-Version: 1.0 Date: Thu, 23 Mar 2023 21:55:32 +0100 (CET) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org From: Thomas Gleixner Under high contention dst_entry::__refcnt becomes a significant bottleneck. atomic_inc_not_zero() is implemented with a cmpxchg() loop, which goes into high retry rates on contention. Switch the reference count to rcuref_t which results in a significant performance gain. Rename the reference count member to __rcuref to reflect the change. The gain depends on the micro-architecture and the number of concurrent operations and has been measured in the range of +25% to +130% with a localhost memtier/memcached benchmark which amplifies the problem massively. Running the memtier/memcached benchmark over a real (1Gb) network connection the conversion on top of the false sharing fix for struct dst_entry::__refcnt results in a total gain in the 2%-5% range over the upstream baseline. Reported-by: Wangyang Guo Reported-by: Arjan Van De Ven Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20230307125538.989175656@linutronix.de --- V3: Rename the refcount member to __rcuref (Linus) --- include/net/dst.h | 19 ++++++++++--------- include/net/sock.h | 2 +- net/bridge/br_nf_core.c | 2 +- net/core/dst.c | 26 +++++--------------------- net/core/rtnetlink.c | 2 +- net/ipv6/route.c | 6 +++--- net/netfilter/ipvs/ip_vs_xmit.c | 4 ++-- 7 files changed, 23 insertions(+), 38 deletions(-) --- a/include/net/dst.h +++ b/include/net/dst.h @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -61,11 +62,11 @@ struct dst_entry { unsigned short trailer_len; /* space to reserve at tail */ /* - * __refcnt wants to be on a different cache line from + * __rcuref wants to be on a different cache line from * input/output/ops or performance tanks badly */ #ifdef CONFIG_64BIT - atomic_t __refcnt; /* 64-bit offset 64 */ + rcuref_t __rcuref; /* 64-bit offset 64 */ #endif int __use; unsigned long lastuse; @@ -75,16 +76,16 @@ struct dst_entry { __u32 tclassid; #ifndef CONFIG_64BIT struct lwtunnel_state *lwtstate; - atomic_t __refcnt; /* 32-bit offset 64 */ + rcuref_t __rcuref; /* 32-bit offset 64 */ #endif netdevice_tracker dev_tracker; /* * Used by rtable and rt6_info. Moves lwtstate into the next cache * line on 64bit so that lwtstate does not cause false sharing with - * __refcnt under contention of __refcnt. This also puts the + * __rcuref under contention of __rcuref. This also puts the * frequently accessed members of rtable and rt6_info out of the - * __refcnt cache line. + * __rcuref cache line. */ struct list_head rt_uncached; struct uncached_list *rt_uncached_list; @@ -238,10 +239,10 @@ static inline void dst_hold(struct dst_e { /* * If your kernel compilation stops here, please check - * the placement of __refcnt in struct dst_entry + * the placement of __rcuref in struct dst_entry */ - BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63); - WARN_ON(atomic_inc_not_zero(&dst->__refcnt) == 0); + BUILD_BUG_ON(offsetof(struct dst_entry, __rcuref) & 63); + WARN_ON(!rcuref_get(&dst->__rcuref)); } static inline void dst_use_noref(struct dst_entry *dst, unsigned long time) @@ -305,7 +306,7 @@ static inline void skb_dst_copy(struct s */ static inline bool dst_hold_safe(struct dst_entry *dst) { - return atomic_inc_not_zero(&dst->__refcnt); + return rcuref_get(&dst->__rcuref); } /** --- a/include/net/sock.h +++ b/include/net/sock.h @@ -2131,7 +2131,7 @@ sk_dst_get(struct sock *sk) rcu_read_lock(); dst = rcu_dereference(sk->sk_dst_cache); - if (dst && !atomic_inc_not_zero(&dst->__refcnt)) + if (dst && !rcuref_get(&dst->__rcuref)) dst = NULL; rcu_read_unlock(); return dst; --- a/net/bridge/br_nf_core.c +++ b/net/bridge/br_nf_core.c @@ -73,7 +73,7 @@ void br_netfilter_rtable_init(struct net { struct rtable *rt = &br->fake_rtable; - atomic_set(&rt->dst.__refcnt, 1); + rcuref_init(&rt->dst.__rcuref, 1); rt->dst.dev = br->dev; dst_init_metrics(&rt->dst, br_dst_default_metrics, true); rt->dst.flags = DST_NOXFRM | DST_FAKE_RTABLE; --- a/net/core/dst.c +++ b/net/core/dst.c @@ -66,7 +66,7 @@ void dst_init(struct dst_entry *dst, str dst->tclassid = 0; #endif dst->lwtstate = NULL; - atomic_set(&dst->__refcnt, initial_ref); + rcuref_init(&dst->__rcuref, initial_ref); dst->__use = 0; dst->lastuse = jiffies; dst->flags = flags; @@ -162,31 +162,15 @@ EXPORT_SYMBOL(dst_dev_put); void dst_release(struct dst_entry *dst) { - if (dst) { - int newrefcnt; - - newrefcnt = atomic_dec_return(&dst->__refcnt); - if (WARN_ONCE(newrefcnt < 0, "dst_release underflow")) - net_warn_ratelimited("%s: dst:%p refcnt:%d\n", - __func__, dst, newrefcnt); - if (!newrefcnt) - call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu); - } + if (dst && rcuref_put(&dst->__rcuref)) + call_rcu_hurry(&dst->rcu_head, dst_destroy_rcu); } EXPORT_SYMBOL(dst_release); void dst_release_immediate(struct dst_entry *dst) { - if (dst) { - int newrefcnt; - - newrefcnt = atomic_dec_return(&dst->__refcnt); - if (WARN_ONCE(newrefcnt < 0, "dst_release_immediate underflow")) - net_warn_ratelimited("%s: dst:%p refcnt:%d\n", - __func__, dst, newrefcnt); - if (!newrefcnt) - dst_destroy(dst); - } + if (dst && rcuref_put(&dst->__rcuref)) + dst_destroy(dst); } EXPORT_SYMBOL(dst_release_immediate); --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -840,7 +840,7 @@ int rtnl_put_cacheinfo(struct sk_buff *s if (dst) { ci.rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse); ci.rta_used = dst->__use; - ci.rta_clntref = atomic_read(&dst->__refcnt); + ci.rta_clntref = rcuref_read(&dst->__rcuref); } if (expires) { unsigned long clock; --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -293,7 +293,7 @@ static const struct fib6_info fib6_null_ static const struct rt6_info ip6_null_entry_template = { .dst = { - .__refcnt = ATOMIC_INIT(1), + .__rcuref = RCUREF_INIT(1), .__use = 1, .obsolete = DST_OBSOLETE_FORCE_CHK, .error = -ENETUNREACH, @@ -307,7 +307,7 @@ static const struct rt6_info ip6_null_en static const struct rt6_info ip6_prohibit_entry_template = { .dst = { - .__refcnt = ATOMIC_INIT(1), + .__rcuref = RCUREF_INIT(1), .__use = 1, .obsolete = DST_OBSOLETE_FORCE_CHK, .error = -EACCES, @@ -319,7 +319,7 @@ static const struct rt6_info ip6_prohibi static const struct rt6_info ip6_blk_hole_entry_template = { .dst = { - .__refcnt = ATOMIC_INIT(1), + .__rcuref = RCUREF_INIT(1), .__use = 1, .obsolete = DST_OBSOLETE_FORCE_CHK, .error = -EINVAL, --- a/net/netfilter/ipvs/ip_vs_xmit.c +++ b/net/netfilter/ipvs/ip_vs_xmit.c @@ -339,7 +339,7 @@ static int spin_unlock_bh(&dest->dst_lock); IP_VS_DBG(10, "new dst %pI4, src %pI4, refcnt=%d\n", &dest->addr.ip, &dest_dst->dst_saddr.ip, - atomic_read(&rt->dst.__refcnt)); + rcuref_read(&rt->dst.__rcuref)); } if (ret_saddr) *ret_saddr = dest_dst->dst_saddr.ip; @@ -507,7 +507,7 @@ static int spin_unlock_bh(&dest->dst_lock); IP_VS_DBG(10, "new dst %pI6, src %pI6, refcnt=%d\n", &dest->addr.in6, &dest_dst->dst_saddr.in6, - atomic_read(&rt->dst.__refcnt)); + rcuref_read(&rt->dst.__rcuref)); } if (ret_saddr) *ret_saddr = dest_dst->dst_saddr.in6;