From patchwork Fri Jun 25 20:04:52 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12345879 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8CA97C49EA7 for ; Fri, 25 Jun 2021 20:05:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 720AD6195F for ; Fri, 25 Jun 2021 20:05:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229906AbhFYUHU (ORCPT ); Fri, 25 Jun 2021 16:07:20 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:48266 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S229878AbhFYUHS (ORCPT ); Fri, 25 Jun 2021 16:07:18 -0400 Received: from pps.filterd (m0089730.ppops.net [127.0.0.1]) by m0089730.ppops.net (8.16.0.43/8.16.0.43) with SMTP id 15PK4LAa001695 for ; Fri, 25 Jun 2021 13:04:56 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=BCXA1is2SgfRbBG9IuoIHQcqTpKa0CM3/e/1g+2tH20=; b=TuTaZPWPpM9Ey4AEBjZ0j6/1t9cDjawfQdV5ECpbIxlSPoAGbKw12ItDUzcOp3VRXQ49 U19iCSA6yoK7/K652BApNjBQ+SOCGaGf9aMpOEV5TroBYQy/6abDt4N7olKj+9jvgKh8 tg5EF0DzUMyKbhiL8fTHX0ZrkmN65xlCXu0= Received: from maileast.thefacebook.com ([163.114.130.16]) by m0089730.ppops.net with ESMTP id 39d253xpe0-3 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 25 Jun 2021 13:04:56 -0700 Received: from intmgw001.05.ash9.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:83::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Fri, 25 Jun 2021 13:04:55 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 74D1E2942295; Fri, 25 Jun 2021 13:04:52 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH bpf-next 1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos Date: Fri, 25 Jun 2021 13:04:52 -0700 Message-ID: <20210625200452.723506-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210625200446.723230-1-kafai@fb.com> References: <20210625200446.723230-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: LLQzaaBIou10o1ufzq1VNvrRI19U9Kpt X-Proofpoint-GUID: LLQzaaBIou10o1ufzq1VNvrRI19U9Kpt X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391,18.0.790 definitions=2021-06-25_07:2021-06-25,2021-06-25 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 phishscore=0 priorityscore=1501 bulkscore=0 clxscore=1015 impostorscore=0 lowpriorityscore=0 adultscore=0 suspectscore=0 mlxlogscore=964 mlxscore=0 spamscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106250123 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net st->bucket stores the current bucket number. st->offset stores the offset within this bucket that is the sk to be seq_show(). Thus, st->offset only makes sense within the same st->bucket. These two variables are an optimization for the common no-lseek case. When resuming the seq_file iteration (i.e. seq_start()), tcp_seek_last_pos() tries to continue from the st->offset at bucket st->bucket. However, it is possible that the bucket pointed by st->bucket has changed and st->offset may end up skipping the whole st->bucket without finding a sk. In this case, tcp_seek_last_pos() currently continues to satisfy the offset condition in the next (and incorrect) bucket. Instead, regardless of the offset value, the first sk of the next bucket should be returned. Thus, "bucket == st->bucket" check is added to tcp_seek_last_pos(). The chance of hitting this is small and the issue is a decade old, so targeting for the next tree. Fixes: a8b690f98baf ("tcp: Fix slowness in read /proc/net/tcp") Signed-off-by: Martin KaFai Lau --- net/ipv4/tcp_ipv4.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 6cb8e269f1ab..7a49427eefbf 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2451,6 +2451,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos) static void *tcp_seek_last_pos(struct seq_file *seq) { struct tcp_iter_state *st = seq->private; + int bucket = st->bucket; int offset = st->offset; int orig_num = st->num; void *rc = NULL; @@ -2461,7 +2462,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq) break; st->state = TCP_SEQ_STATE_LISTENING; rc = listening_get_next(seq, NULL); - while (offset-- && rc) + while (offset-- && rc && bucket == st->bucket) rc = listening_get_next(seq, rc); if (rc) break; @@ -2472,7 +2473,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq) if (st->bucket > tcp_hashinfo.ehash_mask) break; rc = established_get_first(seq); - while (offset-- && rc) + while (offset-- && rc && bucket == st->bucket) rc = established_get_next(seq, rc); } From patchwork Fri Jun 25 20:04:58 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12345881 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A6083C48BC2 for ; Fri, 25 Jun 2021 20:05:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 88A6361973 for ; Fri, 25 Jun 2021 20:05:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229573AbhFYUHX (ORCPT ); Fri, 25 Jun 2021 16:07:23 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:40494 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229916AbhFYUHW (ORCPT ); Fri, 25 Jun 2021 16:07:22 -0400 Received: from pps.filterd (m0044012.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 15PJxa2I012186 for ; Fri, 25 Jun 2021 13:05:01 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=iYGTfo9zw+Q2bdKpFAkgkKkC4nVVQRvhm8hS4gvCfxs=; b=GIgaRIdbMQ0wQRJeIqySn9vyXvXkeUwivMlYmoBj9vEcpMPmRtdTcdGmSIk1Kc6Yo3Fc IZ5aT63s/237ldAWQvAWBZ6EzqkBs0ZxPIz+JU8nxbyPz+QvzVfs/qHj/IvlqoyaFX5B bLH6y6/Z1QjJnuxnv9/usFCQeE32kxhIdhA= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 39d24nxkqq-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 25 Jun 2021 13:05:01 -0700 Received: from intmgw001.06.ash9.facebook.com (2620:10d:c085:208::11) by mail.thefacebook.com (2620:10d:c085:21d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Fri, 25 Jun 2021 13:05:00 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id B7B3729422B0; Fri, 25 Jun 2021 13:04:58 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH bpf-next 2/8] tcp: seq_file: Refactor net and family matching Date: Fri, 25 Jun 2021 13:04:58 -0700 Message-ID: <20210625200458.724177-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210625200446.723230-1-kafai@fb.com> References: <20210625200446.723230-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: lTZpQlQNpD52yWuAsB00RXPBl1wAROtk X-Proofpoint-ORIG-GUID: lTZpQlQNpD52yWuAsB00RXPBl1wAROtk X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391,18.0.790 definitions=2021-06-25_07:2021-06-25,2021-06-25 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 mlxscore=0 phishscore=0 impostorscore=0 mlxlogscore=999 priorityscore=1501 lowpriorityscore=0 adultscore=0 spamscore=0 malwarescore=0 bulkscore=0 suspectscore=0 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106250122 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net This patch refactors the net and family matching into two new helpers, seq_sk_match() and seq_file_family(). seq_file_family() is in the later part of the file to prepare the change of a following patch. Signed-off-by: Martin KaFai Lau --- net/ipv4/tcp_ipv4.c | 68 ++++++++++++++++++++------------------------- 1 file changed, 30 insertions(+), 38 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 7a49427eefbf..13a8b6e8d6bc 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2277,6 +2277,17 @@ EXPORT_SYMBOL(tcp_v4_destroy_sock); #ifdef CONFIG_PROC_FS /* Proc filesystem TCP sock list dumping. */ +static unsigned short seq_file_family(const struct seq_file *seq); + +static bool seq_sk_match(struct seq_file *seq, const struct sock *sk) +{ + unsigned short family = seq_file_family(seq); + + /* AF_UNSPEC is used as a match all */ + return ((family == AF_UNSPEC || family == sk->sk_family) && + net_eq(sock_net(sk), seq_file_net(seq))); +} + /* * Get next listener socket follow cur. If cur is NULL, get first socket * starting from bucket given in st->bucket; when st->bucket is zero the @@ -2284,18 +2295,11 @@ EXPORT_SYMBOL(tcp_v4_destroy_sock); */ static void *listening_get_next(struct seq_file *seq, void *cur) { - struct tcp_seq_afinfo *afinfo; struct tcp_iter_state *st = seq->private; - struct net *net = seq_file_net(seq); struct inet_listen_hashbucket *ilb; struct hlist_nulls_node *node; struct sock *sk = cur; - if (st->bpf_seq_afinfo) - afinfo = st->bpf_seq_afinfo; - else - afinfo = PDE_DATA(file_inode(seq->file)); - if (!sk) { get_head: ilb = &tcp_hashinfo.listening_hash[st->bucket]; @@ -2311,10 +2315,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur) sk = sk_nulls_next(sk); get_sk: sk_nulls_for_each_from(sk, node) { - if (!net_eq(sock_net(sk), net)) - continue; - if (afinfo->family == AF_UNSPEC || - sk->sk_family == afinfo->family) + if (seq_sk_match(seq, sk)) return sk; } spin_unlock(&ilb->lock); @@ -2351,15 +2352,7 @@ static inline bool empty_bucket(const struct tcp_iter_state *st) */ static void *established_get_first(struct seq_file *seq) { - struct tcp_seq_afinfo *afinfo; struct tcp_iter_state *st = seq->private; - struct net *net = seq_file_net(seq); - void *rc = NULL; - - if (st->bpf_seq_afinfo) - afinfo = st->bpf_seq_afinfo; - else - afinfo = PDE_DATA(file_inode(seq->file)); st->offset = 0; for (; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) { @@ -2373,32 +2366,20 @@ static void *established_get_first(struct seq_file *seq) spin_lock_bh(lock); sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) { - if ((afinfo->family != AF_UNSPEC && - sk->sk_family != afinfo->family) || - !net_eq(sock_net(sk), net)) { - continue; - } - rc = sk; - goto out; + if (seq_sk_match(seq, sk)) + return sk; } spin_unlock_bh(lock); } -out: - return rc; + + return NULL; } static void *established_get_next(struct seq_file *seq, void *cur) { - struct tcp_seq_afinfo *afinfo; struct sock *sk = cur; struct hlist_nulls_node *node; struct tcp_iter_state *st = seq->private; - struct net *net = seq_file_net(seq); - - if (st->bpf_seq_afinfo) - afinfo = st->bpf_seq_afinfo; - else - afinfo = PDE_DATA(file_inode(seq->file)); ++st->num; ++st->offset; @@ -2406,9 +2387,7 @@ static void *established_get_next(struct seq_file *seq, void *cur) sk = sk_nulls_next(sk); sk_nulls_for_each_from(sk, node) { - if ((afinfo->family == AF_UNSPEC || - sk->sk_family == afinfo->family) && - net_eq(sock_net(sk), net)) + if (seq_sk_match(seq, sk)) return sk; } @@ -2754,6 +2733,19 @@ static const struct seq_operations bpf_iter_tcp_seq_ops = { .stop = bpf_iter_tcp_seq_stop, }; #endif +static unsigned short seq_file_family(const struct seq_file *seq) +{ + const struct tcp_iter_state *st = seq->private; + const struct tcp_seq_afinfo *afinfo = st->bpf_seq_afinfo; + + /* Iterated from bpf_iter. Let the bpf prog to filter instead. */ + if (afinfo) + return AF_UNSPEC; + + /* Iterated from proc fs */ + afinfo = PDE_DATA(file_inode(seq->file)); + return afinfo->family; +} static const struct seq_operations tcp4_seq_ops = { .show = tcp4_seq_show, From patchwork Fri Jun 25 20:05:04 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12345883 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 520C5C2B9F4 for ; Fri, 25 Jun 2021 20:05:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3D44C61429 for ; Fri, 25 Jun 2021 20:05:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229933AbhFYUHc (ORCPT ); Fri, 25 Jun 2021 16:07:32 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:56084 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229812AbhFYUHb (ORCPT ); Fri, 25 Jun 2021 16:07:31 -0400 Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 15PJvgxx029177 for ; Fri, 25 Jun 2021 13:05:10 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=vdoppFcFAh9nTnqcwp4GpQgR/4yTZArT/1CSGl0BaKo=; b=np5cXTGEithZDZt85OAG8nLvb5S4pvvN0NJbJn8urku9tvs1sSM0dIuk2cN4o9Da41rY xLBr2WxDYANmzK9r5a7uScjzPY6tjYCOe1s31an6eNvRkqp9QKk9n3pq+C9ds57b3G6y A6f1sPPP0D9QCu+2TS648H5IGKgbSoA8b3c= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 39d23qxpmh-11 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 25 Jun 2021 13:05:10 -0700 Received: from intmgw001.37.frc1.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:11d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Fri, 25 Jun 2021 13:05:07 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 021FC29422B0; Fri, 25 Jun 2021 13:05:04 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH bpf-next 3/8] bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state Date: Fri, 25 Jun 2021 13:05:04 -0700 Message-ID: <20210625200504.725683-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210625200446.723230-1-kafai@fb.com> References: <20210625200446.723230-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: CmR4dtoGpGxpYlHRyoYvcEbm8GBZ5jR4 X-Proofpoint-ORIG-GUID: CmR4dtoGpGxpYlHRyoYvcEbm8GBZ5jR4 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391,18.0.790 definitions=2021-06-25_07:2021-06-25,2021-06-25 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 bulkscore=0 lowpriorityscore=0 phishscore=0 clxscore=1015 suspectscore=0 spamscore=0 malwarescore=0 mlxscore=0 adultscore=0 impostorscore=0 priorityscore=1501 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106250122 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net A following patch will create a separate struct to store extra bpf_iter state and it will embed the existing tcp_iter_state like this: struct bpf_tcp_iter_state { struct tcp_iter_state state; /* More bpf_iter specific states here ... */ } As a prep work, this patch removes the "struct tcp_seq_afinfo *bpf_seq_afinfo" where its purpose is to tell if it is iterating from bpf_iter instead of proc fs. Currently, if "*bpf_seq_afinfo" is not NULL, it is iterating from bpf_iter. The kernel should not filter by the addr family and leave this filtering decision to the bpf prog. Instead of adding a "*bpf_seq_afinfo" pointer, this patch uses the "seq->op == &bpf_iter_tcp_seq_ops" test to tell if it is iterating from the bpf iter. The bpf_iter_(init|fini)_tcp() is left here to prepare for the change of a following patch. Signed-off-by: Martin KaFai Lau --- include/net/tcp.h | 1 - net/ipv4/tcp_ipv4.c | 25 +++++-------------------- 2 files changed, 5 insertions(+), 21 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index e668f1bf780d..06ce38967890 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1955,7 +1955,6 @@ struct tcp_iter_state { struct seq_net_private p; enum tcp_seq_states state; struct sock *syn_wait_sk; - struct tcp_seq_afinfo *bpf_seq_afinfo; int bucket, offset, sbucket, num; loff_t last_pos; }; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 13a8b6e8d6bc..ca55e87f6cc9 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2735,12 +2735,13 @@ static const struct seq_operations bpf_iter_tcp_seq_ops = { #endif static unsigned short seq_file_family(const struct seq_file *seq) { - const struct tcp_iter_state *st = seq->private; - const struct tcp_seq_afinfo *afinfo = st->bpf_seq_afinfo; + const struct tcp_seq_afinfo *afinfo; +#ifdef CONFIG_BPF_SYSCALL /* Iterated from bpf_iter. Let the bpf prog to filter instead. */ - if (afinfo) + if (seq->op == &bpf_iter_tcp_seq_ops) return AF_UNSPEC; +#endif /* Iterated from proc fs */ afinfo = PDE_DATA(file_inode(seq->file)); @@ -2998,27 +2999,11 @@ DEFINE_BPF_ITER_FUNC(tcp, struct bpf_iter_meta *meta, static int bpf_iter_init_tcp(void *priv_data, struct bpf_iter_aux_info *aux) { - struct tcp_iter_state *st = priv_data; - struct tcp_seq_afinfo *afinfo; - int ret; - - afinfo = kmalloc(sizeof(*afinfo), GFP_USER | __GFP_NOWARN); - if (!afinfo) - return -ENOMEM; - - afinfo->family = AF_UNSPEC; - st->bpf_seq_afinfo = afinfo; - ret = bpf_iter_init_seq_net(priv_data, aux); - if (ret) - kfree(afinfo); - return ret; + return bpf_iter_init_seq_net(priv_data, aux); } static void bpf_iter_fini_tcp(void *priv_data) { - struct tcp_iter_state *st = priv_data; - - kfree(st->bpf_seq_afinfo); bpf_iter_fini_seq_net(priv_data); } From patchwork Fri Jun 25 20:05:11 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12345885 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 16766C2B9F4 for ; Fri, 25 Jun 2021 20:05:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EC90961429 for ; Fri, 25 Jun 2021 20:05:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229881AbhFYUHm (ORCPT ); Fri, 25 Jun 2021 16:07:42 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:19890 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229774AbhFYUHm (ORCPT ); Fri, 25 Jun 2021 16:07:42 -0400 Received: from pps.filterd (m0148461.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 15PJxHr9022848 for ; Fri, 25 Jun 2021 13:05:21 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=D5gYKZmeoLvbMoqCkGct6o7H9L2LBkqzP4GWz9cx/MM=; b=nRJWp1X0KVtVDHVTqtuKEA0m84H0FNChMuAeINpZrx2Fw2K2Fi7K2m22Tcw5Qu65/vAc B2eHQQjYLAm4ZToGZRxGnsTPniaz3jQJh5gSrbSiMzL6c+OY7fd5kKE6VtoJBFU0COMb ZpXpv1m7spLTZaKOL1jEQh2snN4bi8fvVew= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 39d23kequb-9 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 25 Jun 2021 13:05:21 -0700 Received: from intmgw001.37.frc1.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:11d::5) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Fri, 25 Jun 2021 13:05:19 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 4048029422B0; Fri, 25 Jun 2021 13:05:11 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH bpf-next 4/8] tcp: seq_file: Add listening_get_first() Date: Fri, 25 Jun 2021 13:05:11 -0700 Message-ID: <20210625200511.726185-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210625200446.723230-1-kafai@fb.com> References: <20210625200446.723230-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: EM0zabSA98hWYHsu7IzisfOyQ6d9a2QO X-Proofpoint-ORIG-GUID: EM0zabSA98hWYHsu7IzisfOyQ6d9a2QO X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391,18.0.790 definitions=2021-06-25_07:2021-06-25,2021-06-25 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 mlxlogscore=933 lowpriorityscore=0 impostorscore=0 malwarescore=0 bulkscore=0 clxscore=1015 spamscore=0 phishscore=0 suspectscore=0 mlxscore=0 adultscore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106250122 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net The current listening_get_next() is overloaded by passing NULL to the 2nd arg, like listening_get_next(seq, NULL), to mean get_first(). This patch moves some logic from the listening_get_next() into a new function listening_get_first(). It will be equivalent to the current established_get_first() and established_get_next() setup. get_first() is to find a non empty bucket and return the first sk. get_next() is to find the next sk of the current bucket and then resorts to get_first() if the current bucket is exhausted. The next patch is to move the listener seq_file iteration from listening_hash (port only) to lhash2 (port+addr). Separating out listening_get_first() from listening_get_next() here will make the following lhash2 changes cleaner and easier to follow. Signed-off-by: Martin KaFai Lau --- net/ipv4/tcp_ipv4.c | 59 ++++++++++++++++++++++++++++++--------------- 1 file changed, 39 insertions(+), 20 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index ca55e87f6cc9..693bda155876 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2288,10 +2288,38 @@ static bool seq_sk_match(struct seq_file *seq, const struct sock *sk) net_eq(sock_net(sk), seq_file_net(seq))); } -/* - * Get next listener socket follow cur. If cur is NULL, get first socket - * starting from bucket given in st->bucket; when st->bucket is zero the - * very first socket in the hash table is returned. +/* Find a non empty bucket (starting from st->bucket) + * and return the first sk from it. + */ +static void *listening_get_first(struct seq_file *seq) +{ + struct tcp_iter_state *st = seq->private; + + st->offset = 0; + for (; st->bucket < INET_LHTABLE_SIZE; st->bucket++) { + struct inet_listen_hashbucket *ilb; + struct hlist_nulls_node *node; + struct sock *sk; + + ilb = &tcp_hashinfo.listening_hash[st->bucket]; + if (hlist_nulls_empty(&ilb->nulls_head)) + continue; + + spin_lock(&ilb->lock); + sk_nulls_for_each(sk, node, &ilb->nulls_head) { + if (seq_sk_match(seq, sk)) + return sk; + } + spin_unlock(&ilb->lock); + } + + return NULL; +} + +/* Find the next sk of "cur" within the same bucket (i.e. st->bucket). + * If "cur" is the last one in the st->bucket, + * call listening_get_first() to return the first sk of the next + * non empty bucket. */ static void *listening_get_next(struct seq_file *seq, void *cur) { @@ -2300,29 +2328,20 @@ static void *listening_get_next(struct seq_file *seq, void *cur) struct hlist_nulls_node *node; struct sock *sk = cur; - if (!sk) { -get_head: - ilb = &tcp_hashinfo.listening_hash[st->bucket]; - spin_lock(&ilb->lock); - sk = sk_nulls_head(&ilb->nulls_head); - st->offset = 0; - goto get_sk; - } - ilb = &tcp_hashinfo.listening_hash[st->bucket]; ++st->num; ++st->offset; sk = sk_nulls_next(sk); -get_sk: + sk_nulls_for_each_from(sk, node) { if (seq_sk_match(seq, sk)) return sk; } + + ilb = &tcp_hashinfo.listening_hash[st->bucket]; spin_unlock(&ilb->lock); - st->offset = 0; - if (++st->bucket < INET_LHTABLE_SIZE) - goto get_head; - return NULL; + ++st->bucket; + return listening_get_first(seq); } static void *listening_get_idx(struct seq_file *seq, loff_t *pos) @@ -2332,7 +2351,7 @@ static void *listening_get_idx(struct seq_file *seq, loff_t *pos) st->bucket = 0; st->offset = 0; - rc = listening_get_next(seq, NULL); + rc = listening_get_first(seq); while (rc && *pos) { rc = listening_get_next(seq, rc); @@ -2440,7 +2459,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq) if (st->bucket >= INET_LHTABLE_SIZE) break; st->state = TCP_SEQ_STATE_LISTENING; - rc = listening_get_next(seq, NULL); + rc = listening_get_first(seq); while (offset-- && rc && bucket == st->bucket) rc = listening_get_next(seq, rc); if (rc) From patchwork Fri Jun 25 20:05:17 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12345887 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DF1FEC2B9F4 for ; Fri, 25 Jun 2021 20:05:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C7BE16195D for ; Fri, 25 Jun 2021 20:05:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229987AbhFYUHq (ORCPT ); Fri, 25 Jun 2021 16:07:46 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:32737 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S229831AbhFYUHn (ORCPT ); Fri, 25 Jun 2021 16:07:43 -0400 Received: from pps.filterd (m0089730.ppops.net [127.0.0.1]) by m0089730.ppops.net (8.16.0.43/8.16.0.43) with SMTP id 15PK4JUw001642 for ; Fri, 25 Jun 2021 13:05:22 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=PjRtIq2lwdfr6jQlWN1rqSauL6MjzsyAin4/GjXSODc=; b=Y/cbgm+PuI6bazl2LirxCThGjCivmXkDD2S28mKjnuGZKEdcUDTVrhfW8BnN3fJzuaez kRULIjFs1VZ5YDrx4seXwHgEify4KdBzzaExf9sVhZKzZMy7u9AnDkXtB0BXNVf7yIfZ FIriBe3gkgJnLODaxS+oPnsPBJA/Fmzla8U= Received: from maileast.thefacebook.com ([163.114.130.16]) by m0089730.ppops.net with ESMTP id 39d253xpg0-5 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 25 Jun 2021 13:05:22 -0700 Received: from intmgw001.05.ash7.facebook.com (2620:10d:c0a8:1b::d) by mail.thefacebook.com (2620:10d:c0a8:82::f) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Fri, 25 Jun 2021 13:05:20 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 8148329422B0; Fri, 25 Jun 2021 13:05:17 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH bpf-next 5/8] tcp: seq_file: Replace listening_hash with lhash2 Date: Fri, 25 Jun 2021 13:05:17 -0700 Message-ID: <20210625200517.726679-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210625200446.723230-1-kafai@fb.com> References: <20210625200446.723230-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: imOfE8zRovk09QtfylRctdKvQ5QBEiKO X-Proofpoint-GUID: imOfE8zRovk09QtfylRctdKvQ5QBEiKO X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391,18.0.790 definitions=2021-06-25_07:2021-06-25,2021-06-25 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 phishscore=0 priorityscore=1501 bulkscore=0 clxscore=1015 impostorscore=0 lowpriorityscore=0 adultscore=0 suspectscore=0 mlxlogscore=706 mlxscore=0 spamscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106250123 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net This patch moves the tcp seq_file iteration on listeners from the port only listening_hash to the port+addr lhash2. When iterating from the bpf iter, the next patch will need to lock the socket such that the bpf iter with netadmin cap can do setsockopt (e.g. to change the TCP_CONGESTION). To avoid locking the bucket and then locking the sock, the bpf iter will first batch some sockets from the same bucket and then unlock the bucket. If the bucket size is small (which usually is), it is easier to batch the whole bucket such that it is less likely to miss a setsockopt on a socket due to changes in the bucket. However, the port only listening_hash could have many listeners hashed to a bucket (e.g. many individual VIP(s):443 and also multiple by the number of SO_REUSEPORT). We have seen bucket size in tens of thousands range. Also, the chance of having changes in some popular port buckets (e.g. 443) is also high. The port+addr lhash2 was introduced to solve this large listener bucket issue. Also, the listening_hash usage has already been replaced with lhash2 in the fast path inet[6]_lookup_listener(). This patch follows the same direction on moving to lhash2 and iterates the lhash2 instead of listening_hash. Signed-off-by: Martin KaFai Lau --- include/net/inet_hashtables.h | 6 ++++++ net/ipv4/tcp_ipv4.c | 35 ++++++++++++++++++----------------- 2 files changed, 24 insertions(+), 17 deletions(-) diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index ca6a3ea9057e..f72ec113ae56 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -160,6 +160,12 @@ struct inet_hashinfo { ____cacheline_aligned_in_smp; }; +#define inet_lhash2_for_each_icsk_continue(__icsk) \ + hlist_for_each_entry_continue(__icsk, icsk_listen_portaddr_node) + +#define inet_lhash2_for_each_icsk(__icsk, list) \ + hlist_for_each_entry(__icsk, list, icsk_listen_portaddr_node) + #define inet_lhash2_for_each_icsk_rcu(__icsk, list) \ hlist_for_each_entry_rcu(__icsk, list, icsk_listen_portaddr_node) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 693bda155876..0d851289a89e 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2296,21 +2296,22 @@ static void *listening_get_first(struct seq_file *seq) struct tcp_iter_state *st = seq->private; st->offset = 0; - for (; st->bucket < INET_LHTABLE_SIZE; st->bucket++) { - struct inet_listen_hashbucket *ilb; - struct hlist_nulls_node *node; + for (; st->bucket <= tcp_hashinfo.lhash2_mask; st->bucket++) { + struct inet_listen_hashbucket *ilb2; + struct inet_connection_sock *icsk; struct sock *sk; - ilb = &tcp_hashinfo.listening_hash[st->bucket]; - if (hlist_nulls_empty(&ilb->nulls_head)) + ilb2 = &tcp_hashinfo.lhash2[st->bucket]; + if (hlist_empty(&ilb2->head)) continue; - spin_lock(&ilb->lock); - sk_nulls_for_each(sk, node, &ilb->nulls_head) { + spin_lock(&ilb2->lock); + inet_lhash2_for_each_icsk(icsk, &ilb2->head) { + sk = (struct sock *)icsk; if (seq_sk_match(seq, sk)) return sk; } - spin_unlock(&ilb->lock); + spin_unlock(&ilb2->lock); } return NULL; @@ -2324,22 +2325,22 @@ static void *listening_get_first(struct seq_file *seq) static void *listening_get_next(struct seq_file *seq, void *cur) { struct tcp_iter_state *st = seq->private; - struct inet_listen_hashbucket *ilb; - struct hlist_nulls_node *node; + struct inet_listen_hashbucket *ilb2; + struct inet_connection_sock *icsk; struct sock *sk = cur; ++st->num; ++st->offset; - sk = sk_nulls_next(sk); - - sk_nulls_for_each_from(sk, node) { + icsk = inet_csk(sk); + inet_lhash2_for_each_icsk_continue(icsk) { + sk = (struct sock *)icsk; if (seq_sk_match(seq, sk)) return sk; } - ilb = &tcp_hashinfo.listening_hash[st->bucket]; - spin_unlock(&ilb->lock); + ilb2 = &tcp_hashinfo.lhash2[st->bucket]; + spin_unlock(&ilb2->lock); ++st->bucket; return listening_get_first(seq); } @@ -2456,7 +2457,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq) switch (st->state) { case TCP_SEQ_STATE_LISTENING: - if (st->bucket >= INET_LHTABLE_SIZE) + if (st->bucket > tcp_hashinfo.lhash2_mask) break; st->state = TCP_SEQ_STATE_LISTENING; rc = listening_get_first(seq); @@ -2541,7 +2542,7 @@ void tcp_seq_stop(struct seq_file *seq, void *v) switch (st->state) { case TCP_SEQ_STATE_LISTENING: if (v != SEQ_START_TOKEN) - spin_unlock(&tcp_hashinfo.listening_hash[st->bucket].lock); + spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock); break; case TCP_SEQ_STATE_ESTABLISHED: if (v) From patchwork Fri Jun 25 20:05:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12345889 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 049D7C2B9F4 for ; Fri, 25 Jun 2021 20:05:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DF05E61429 for ; Fri, 25 Jun 2021 20:05:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229994AbhFYUH4 (ORCPT ); Fri, 25 Jun 2021 16:07:56 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:15886 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229999AbhFYUHr (ORCPT ); Fri, 25 Jun 2021 16:07:47 -0400 Received: from pps.filterd (m0109333.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 15PK4uET011711 for ; Fri, 25 Jun 2021 13:05:26 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=HGzIbjVF/3v01Ve9Zt6KGZ4clT0vLwLXgwY9JVsQDfA=; b=CIUPk5PAO9g0J2nA1XqlKyNMDTNBylFOY8F5QIcmO8xnIS8jcGUByqT6B17Qct//5XUo Sts1ewKRa9qR1UX00CXd/n3m+Zx0ZuUmpbb7qHx40S8vzMPH6RtANFgD+S1IAC6HIJmT H/jEEr36rGdfM/riEcbTGa+3qMY5NQRSNtg= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 39d255pq03-6 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 25 Jun 2021 13:05:26 -0700 Received: from intmgw001.38.frc1.facebook.com (2620:10d:c085:208::f) by mail.thefacebook.com (2620:10d:c085:21d::7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Fri, 25 Jun 2021 13:05:25 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id CB27D29422B0; Fri, 25 Jun 2021 13:05:23 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH bpf-next 6/8] bpf: tcp: bpf iter batching and lock_sock Date: Fri, 25 Jun 2021 13:05:23 -0700 Message-ID: <20210625200523.726854-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210625200446.723230-1-kafai@fb.com> References: <20210625200446.723230-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: ZaMlBjWfK67i7RJI55yCnEbV7WpQCzBf X-Proofpoint-GUID: ZaMlBjWfK67i7RJI55yCnEbV7WpQCzBf X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391,18.0.790 definitions=2021-06-25_07:2021-06-25,2021-06-25 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 priorityscore=1501 mlxscore=0 bulkscore=0 suspectscore=0 mlxlogscore=999 malwarescore=0 impostorscore=0 phishscore=0 spamscore=0 clxscore=1015 adultscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106250123 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net This patch does batching and lock_sock for the bpf tcp iter. It does not affect the proc fs iteration. With bpf-tcp-cc, new algo rollout happens more often. Instead of restarting the application to pick up the new tcp-cc, the next patch will allow bpf iter with CAP_NET_ADMIN to do setsockopt(TCP_CONGESTION). This requires locking the sock. Also, unlike the proc iteration (cat /proc/net/tcp[6]), the bpf iter can inspect all fields of a tcp_sock. It will be useful to have a consistent view on some of the fields (e.g. the ones reported in tcp_get_info() that also acquires the sock lock). Double lock: locking the bucket first and then locking the sock could lead to deadlock. This patch takes a batching approach similar to inet_diag. While holding the bucket lock, it batch a number of sockets into an array first and then unlock the bucket. Before doing show(), it then calls lock_sock_fast(). In a machine with ~400k connections, the maximum number of sk in a bucket of the established hashtable is 7. 0.02% of the established connections fall into this bucket size. For listen hash (port+addr lhash2), the bucket is usually very small also except for the SO_REUSEPORT use case which the userspace could have one SO_REUSEPORT socket per thread. While batching is used, it can also minimize the chance of missing sock in the setsockopt use case if the whole bucket is batched. This patch will start with a batch array with INIT_BATCH_SZ (16) which will be enough for the most common cases. bpf_iter_tcp_batch() will try to realloc to a larger array to handle exception case (e.g. the SO_REUSEPORT case in the lhash2). Signed-off-by: Martin KaFai Lau --- net/ipv4/tcp_ipv4.c | 236 ++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 230 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 0d851289a89e..856144d33f52 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2687,6 +2687,15 @@ static int tcp4_seq_show(struct seq_file *seq, void *v) } #ifdef CONFIG_BPF_SYSCALL +struct bpf_tcp_iter_state { + struct tcp_iter_state state; + unsigned int cur_sk; + unsigned int end_sk; + unsigned int max_sk; + struct sock **batch; + bool st_bucket_done; +}; + struct bpf_iter__tcp { __bpf_md_ptr(struct bpf_iter_meta *, meta); __bpf_md_ptr(struct sock_common *, sk_common); @@ -2705,16 +2714,203 @@ static int tcp_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta, return bpf_iter_run_prog(prog, &ctx); } +static void bpf_iter_tcp_put_batch(struct bpf_tcp_iter_state *iter) +{ + while (iter->cur_sk < iter->end_sk) + sock_put(iter->batch[iter->cur_sk++]); +} + +static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter, + unsigned int new_batch_sz) +{ + struct sock **new_batch; + + new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz, GFP_USER); + if (!new_batch) + return -ENOMEM; + + bpf_iter_tcp_put_batch(iter); + kvfree(iter->batch); + iter->batch = new_batch; + iter->max_sk = new_batch_sz; + + return 0; +} + +static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq, + struct sock *start_sk) +{ + struct bpf_tcp_iter_state *iter = seq->private; + struct tcp_iter_state *st = &iter->state; + struct inet_connection_sock *icsk; + unsigned int expected = 1; + struct sock *sk; + + sock_hold(start_sk); + iter->batch[iter->end_sk++] = start_sk; + + icsk = inet_csk(start_sk); + inet_lhash2_for_each_icsk_continue(icsk) { + sk = (struct sock *)icsk; + if (seq_sk_match(seq, sk)) { + if (iter->end_sk < iter->max_sk) { + sock_hold(sk); + iter->batch[iter->end_sk++] = sk; + } + expected++; + } + } + spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock); + + return expected; +} + +static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq, + struct sock *start_sk) +{ + struct bpf_tcp_iter_state *iter = seq->private; + struct tcp_iter_state *st = &iter->state; + struct hlist_nulls_node *node; + unsigned int expected = 1; + struct sock *sk; + + sock_hold(start_sk); + iter->batch[iter->end_sk++] = start_sk; + + sk = sk_nulls_next(start_sk); + sk_nulls_for_each_from(sk, node) { + if (seq_sk_match(seq, sk)) { + if (iter->end_sk < iter->max_sk) { + sock_hold(sk); + iter->batch[iter->end_sk++] = sk; + } + expected++; + } + } + spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket)); + + return expected; +} + +static struct sock *bpf_iter_tcp_batch(struct seq_file *seq) +{ + struct bpf_tcp_iter_state *iter = seq->private; + struct tcp_iter_state *st = &iter->state; + unsigned int expected; + bool resized = false; + struct sock *sk; + + /* The st->bucket is done. Directly advance to the next + * bucket instead of having the tcp_seek_last_pos() to skip + * one by one in the current bucket and eventually find out + * it has to advance to the next bucket. + */ + if (iter->st_bucket_done) { + st->offset = 0; + st->bucket++; + if (st->state == TCP_SEQ_STATE_LISTENING && + st->bucket > tcp_hashinfo.lhash2_mask) { + st->state = TCP_SEQ_STATE_ESTABLISHED; + st->bucket = 0; + } + } + +again: + /* Get a new batch */ + iter->cur_sk = 0; + iter->end_sk = 0; + iter->st_bucket_done = false; + + sk = tcp_seek_last_pos(seq); + if (!sk) + return NULL; /* Done */ + + if (st->state == TCP_SEQ_STATE_LISTENING) + expected = bpf_iter_tcp_listening_batch(seq, sk); + else + expected = bpf_iter_tcp_established_batch(seq, sk); + + if (iter->end_sk == expected) { + iter->st_bucket_done = true; + return sk; + } + + if (!resized && !bpf_iter_tcp_realloc_batch(iter, expected * 3 / 2)) { + resized = true; + goto again; + } + + return sk; +} + +static void *bpf_iter_tcp_seq_start(struct seq_file *seq, loff_t *pos) +{ + /* bpf iter does not support lseek, so it always + * continue from where it was stop()-ped. + */ + if (*pos) + return bpf_iter_tcp_batch(seq); + + return SEQ_START_TOKEN; +} + +static void *bpf_iter_tcp_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + struct bpf_tcp_iter_state *iter = seq->private; + struct tcp_iter_state *st = &iter->state; + struct sock *sk; + + /* Whenever seq_next() is called, the iter->cur_sk is + * done with seq_show(), so advance to the next sk in + * the batch. + */ + if (iter->cur_sk < iter->end_sk) { + /* Keeping st->num consistent in tcp_iter_state. + * bpf_iter_tcp does not use st->num. + * meta.seq_num is used instead. + */ + st->num++; + /* Move st->offset to the next sk in the bucket such that + * the future start() will resume at st->offset in + * st->bucket. See tcp_seek_last_pos(). + */ + st->offset++; + sock_put(iter->batch[iter->cur_sk++]); + } + + if (iter->cur_sk < iter->end_sk) + sk = iter->batch[iter->cur_sk]; + else + sk = bpf_iter_tcp_batch(seq); + + ++*pos; + /* Keeping st->last_pos consistent in tcp_iter_state. + * bpf iter does not do lseek, so st->last_pos always equals to *pos. + */ + st->last_pos = *pos; + return sk; +} + static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v) { struct bpf_iter_meta meta; struct bpf_prog *prog; struct sock *sk = v; + bool slow; uid_t uid; + int ret; if (v == SEQ_START_TOKEN) return 0; + if (sk_fullsock(sk)) + slow = lock_sock_fast(sk); + + if (unlikely(sk_unhashed(sk))) { + ret = SEQ_SKIP; + goto unlock; + } + if (sk->sk_state == TCP_TIME_WAIT) { uid = 0; } else if (sk->sk_state == TCP_NEW_SYN_RECV) { @@ -2728,11 +2924,18 @@ static int bpf_iter_tcp_seq_show(struct seq_file *seq, void *v) meta.seq = seq; prog = bpf_iter_get_info(&meta, false); - return tcp_prog_seq_show(prog, &meta, v, uid); + ret = tcp_prog_seq_show(prog, &meta, v, uid); + +unlock: + if (sk_fullsock(sk)) + unlock_sock_fast(sk, slow); + return ret; + } static void bpf_iter_tcp_seq_stop(struct seq_file *seq, void *v) { + struct bpf_tcp_iter_state *iter = seq->private; struct bpf_iter_meta meta; struct bpf_prog *prog; @@ -2743,13 +2946,16 @@ static void bpf_iter_tcp_seq_stop(struct seq_file *seq, void *v) (void)tcp_prog_seq_show(prog, &meta, v, 0); } - tcp_seq_stop(seq, v); + if (iter->cur_sk < iter->end_sk) { + bpf_iter_tcp_put_batch(iter); + iter->st_bucket_done = false; + } } static const struct seq_operations bpf_iter_tcp_seq_ops = { .show = bpf_iter_tcp_seq_show, - .start = tcp_seq_start, - .next = tcp_seq_next, + .start = bpf_iter_tcp_seq_start, + .next = bpf_iter_tcp_seq_next, .stop = bpf_iter_tcp_seq_stop, }; #endif @@ -3017,21 +3223,39 @@ static struct pernet_operations __net_initdata tcp_sk_ops = { DEFINE_BPF_ITER_FUNC(tcp, struct bpf_iter_meta *meta, struct sock_common *sk_common, uid_t uid) +#define INIT_BATCH_SZ 16 + static int bpf_iter_init_tcp(void *priv_data, struct bpf_iter_aux_info *aux) { - return bpf_iter_init_seq_net(priv_data, aux); + struct bpf_tcp_iter_state *iter = priv_data; + int err; + + err = bpf_iter_init_seq_net(priv_data, aux); + if (err) + return err; + + err = bpf_iter_tcp_realloc_batch(iter, INIT_BATCH_SZ); + if (err) { + bpf_iter_fini_seq_net(priv_data); + return err; + } + + return 0; } static void bpf_iter_fini_tcp(void *priv_data) { + struct bpf_tcp_iter_state *iter = priv_data; + bpf_iter_fini_seq_net(priv_data); + kvfree(iter->batch); } static const struct bpf_iter_seq_info tcp_seq_info = { .seq_ops = &bpf_iter_tcp_seq_ops, .init_seq_private = bpf_iter_init_tcp, .fini_seq_private = bpf_iter_fini_tcp, - .seq_priv_size = sizeof(struct tcp_iter_state), + .seq_priv_size = sizeof(struct bpf_tcp_iter_state), }; static struct bpf_iter_reg tcp_reg_info = { From patchwork Fri Jun 25 20:05:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12345891 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9BB8CC2B9F4 for ; Fri, 25 Jun 2021 20:05:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7267161429 for ; Fri, 25 Jun 2021 20:05:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229774AbhFYUIB (ORCPT ); Fri, 25 Jun 2021 16:08:01 -0400 Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:12204 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229930AbhFYUIB (ORCPT ); Fri, 25 Jun 2021 16:08:01 -0400 Received: from pps.filterd (m0109334.ppops.net [127.0.0.1]) by mx0a-00082601.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 15PJvjiT029232 for ; Fri, 25 Jun 2021 13:05:40 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=KzAP4i8LUIb74pSh25a6Kd1ypWtRJzZfK/iNf/FUH5Y=; b=MOHkeMPSmSdNhfVRFig0GLr9z2dRM9p4vQ1qK6M/cTxKxQBnp8MSan0Nw7eRi/zP1vdE qKjjuDLeqxyjSghpj+cGAdtYXPt7FyDcOKiHbD6IoC0W9/qS1MofuqY5QFJz3noa/tz5 iRhNjtEZVD/pUGBxc/c6rOffz6jZhQjyBDU= Received: from mail.thefacebook.com ([163.114.132.120]) by mx0a-00082601.pphosted.com with ESMTP id 39d23qxpr9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 25 Jun 2021 13:05:39 -0700 Received: from intmgw001.38.frc1.facebook.com (2620:10d:c085:208::f) by mail.thefacebook.com (2620:10d:c085:21d::7) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Fri, 25 Jun 2021 13:05:39 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 1540529422B0; Fri, 25 Jun 2021 13:05:30 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH bpf-next 7/8] bpf: tcp: Support bpf_setsockopt in bpf tcp iter Date: Fri, 25 Jun 2021 13:05:30 -0700 Message-ID: <20210625200530.727573-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210625200446.723230-1-kafai@fb.com> References: <20210625200446.723230-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-GUID: k9iLDNYe2OwPfjLGmCrtfXyzUhKmJ5HP X-Proofpoint-ORIG-GUID: k9iLDNYe2OwPfjLGmCrtfXyzUhKmJ5HP X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391,18.0.790 definitions=2021-06-25_07:2021-06-25,2021-06-25 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 bulkscore=0 lowpriorityscore=0 phishscore=0 clxscore=1015 suspectscore=0 spamscore=0 malwarescore=0 mlxscore=0 adultscore=0 impostorscore=0 priorityscore=1501 mlxlogscore=999 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106250122 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net This patch allows bpf tcp iter with netadmin cap to call bpf_setsockopt. To allow a specific bpf iter (tcp here) to call a set of helpers, get_func_proto function pointer is added to bpf_iter_reg. Signed-off-by: Martin KaFai Lau --- include/linux/bpf.h | 7 +++++++ kernel/bpf/bpf_iter.c | 22 ++++++++++++++++++++++ kernel/trace/bpf_trace.c | 7 ++++++- net/core/filter.c | 17 +++++++++++++++++ net/ipv4/tcp_ipv4.c | 15 +++++++++++++++ 5 files changed, 67 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f309fc1509f2..31f51a601925 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1397,6 +1397,9 @@ typedef void (*bpf_iter_show_fdinfo_t) (const struct bpf_iter_aux_info *aux, struct seq_file *seq); typedef int (*bpf_iter_fill_link_info_t)(const struct bpf_iter_aux_info *aux, struct bpf_link_info *info); +typedef const struct bpf_func_proto * +(*bpf_iter_get_func_proto_t)(enum bpf_func_id func_id, + const struct bpf_prog *prog); enum bpf_iter_feature { BPF_ITER_RESCHED = BIT(0), @@ -1409,6 +1412,7 @@ struct bpf_iter_reg { bpf_iter_detach_target_t detach_target; bpf_iter_show_fdinfo_t show_fdinfo; bpf_iter_fill_link_info_t fill_link_info; + bpf_iter_get_func_proto_t get_func_proto; u32 ctx_arg_info_size; u32 feature; struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX]; @@ -1431,6 +1435,8 @@ struct bpf_iter__bpf_map_elem { int bpf_iter_reg_target(const struct bpf_iter_reg *reg_info); void bpf_iter_unreg_target(const struct bpf_iter_reg *reg_info); bool bpf_iter_prog_supported(struct bpf_prog *prog); +const struct bpf_func_proto * +bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog); int bpf_iter_link_attach(const union bpf_attr *attr, bpfptr_t uattr, struct bpf_prog *prog); int bpf_iter_new_fd(struct bpf_link *link); bool bpf_link_is_iter(struct bpf_link *link); @@ -1997,6 +2003,7 @@ extern const struct bpf_func_proto bpf_task_storage_get_proto; extern const struct bpf_func_proto bpf_task_storage_delete_proto; extern const struct bpf_func_proto bpf_for_each_map_elem_proto; extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto; +extern const struct bpf_func_proto bpf_sk_setsockopt_proto; const struct bpf_func_proto *bpf_tracing_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index 2d4fbdbb194e..2e9d47bb40ff 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -360,6 +360,28 @@ bool bpf_iter_prog_supported(struct bpf_prog *prog) return supported; } +const struct bpf_func_proto * +bpf_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + const struct bpf_iter_target_info *tinfo; + const struct bpf_func_proto *fn = NULL; + + mutex_lock(&targets_mutex); + list_for_each_entry(tinfo, &targets, list) { + if (tinfo->btf_id == prog->aux->attach_btf_id) { + const struct bpf_iter_reg *reg_info; + + reg_info = tinfo->reg_info; + if (reg_info->get_func_proto) + fn = reg_info->get_func_proto(func_id, prog); + break; + } + } + mutex_unlock(&targets_mutex); + + return fn; +} + static void bpf_iter_link_release(struct bpf_link *link) { struct bpf_iter_link *iter_link = diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 7a52bc172841..11e69734d67b 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1428,6 +1428,8 @@ raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) const struct bpf_func_proto * tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { + const struct bpf_func_proto *fn; + switch (func_id) { #ifdef CONFIG_NET case BPF_FUNC_skb_output: @@ -1468,7 +1470,10 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_d_path: return &bpf_d_path_proto; default: - return raw_tp_prog_func_proto(func_id, prog); + fn = raw_tp_prog_func_proto(func_id, prog); + if (!fn && prog->expected_attach_type == BPF_TRACE_ITER) + fn = bpf_iter_get_func_proto(func_id, prog); + return fn; } } diff --git a/net/core/filter.c b/net/core/filter.c index d22895caa164..ab4d60c043d4 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -5063,6 +5063,23 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = { .arg5_type = ARG_CONST_SIZE, }; +BPF_CALL_5(bpf_sk_setsockopt, struct sock *, sk, int, level, + int, optname, char *, optval, int, optlen) +{ + return _bpf_setsockopt(sk, level, optname, optval, optlen); +} + +const struct bpf_func_proto bpf_sk_setsockopt_proto = { + .func = bpf_sk_setsockopt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID_SOCK_COMMON, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, + .arg4_type = ARG_PTR_TO_MEM, + .arg5_type = ARG_CONST_SIZE, +}; + static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock, int optname, const u8 **start) { diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 856144d33f52..5bfa6233fff1 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3258,6 +3258,20 @@ static const struct bpf_iter_seq_info tcp_seq_info = { .seq_priv_size = sizeof(struct bpf_tcp_iter_state), }; +static const struct bpf_func_proto * +bpf_iter_tcp_get_func_proto(enum bpf_func_id func_id, + const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_setsockopt: + if (capable(CAP_NET_ADMIN)) + return &bpf_sk_setsockopt_proto; + fallthrough; + default: + return NULL; + } +} + static struct bpf_iter_reg tcp_reg_info = { .target = "tcp", .ctx_arg_info_size = 1, @@ -3265,6 +3279,7 @@ static struct bpf_iter_reg tcp_reg_info = { { offsetof(struct bpf_iter__tcp, sk_common), PTR_TO_BTF_ID_OR_NULL }, }, + .get_func_proto = bpf_iter_tcp_get_func_proto, .seq_info = &tcp_seq_info, }; From patchwork Fri Jun 25 20:05:36 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 12345893 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 99BAFC49EAB for ; Fri, 25 Jun 2021 20:05:45 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 841B76195F for ; Fri, 25 Jun 2021 20:05:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230002AbhFYUIF (ORCPT ); Fri, 25 Jun 2021 16:08:05 -0400 Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:12206 "EHLO mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S229913AbhFYUIE (ORCPT ); Fri, 25 Jun 2021 16:08:04 -0400 Received: from pps.filterd (m0089730.ppops.net [127.0.0.1]) by m0089730.ppops.net (8.16.0.43/8.16.0.43) with SMTP id 15PK4KNj001664 for ; Fri, 25 Jun 2021 13:05:42 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fb.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding : content-type; s=facebook; bh=2mbJAQuXRNK9sRcpe5TiGXGLOqhlnJ9UNN3EDEUBaLQ=; b=hyNMbEGQhtq2AiLrK7WZwzUsOXh7xbAyZ/qQfsoGXwccyoReDvEi1AZhk8LaXYFftYln 0Ipe6M3xNmog4Pt1tv5zWEi/ZXJ/fXXzxzkU1NxyRMY9yqEHreW5k9GNyi0pE8lUskFE 8Mq7iPLsnrBDstfwyjw1bTTUtMU1NGHId14= Received: from mail.thefacebook.com ([163.114.132.120]) by m0089730.ppops.net with ESMTP id 39d253xpj4-3 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT) for ; Fri, 25 Jun 2021 13:05:42 -0700 Received: from intmgw006.03.ash8.facebook.com (2620:10d:c085:108::4) by mail.thefacebook.com (2620:10d:c085:21d::6) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Fri, 25 Jun 2021 13:05:39 -0700 Received: by devbig005.ftw2.facebook.com (Postfix, from userid 6611) id 51BB329422B0; Fri, 25 Jun 2021 13:05:36 -0700 (PDT) From: Martin KaFai Lau To: CC: Alexei Starovoitov , Daniel Borkmann , Eric Dumazet , , Neal Cardwell , , Yonghong Song , Yuchung Cheng Subject: [PATCH bpf-next 8/8] bpf: selftest: Test batching and bpf_setsockopt in bpf tcp iter Date: Fri, 25 Jun 2021 13:05:36 -0700 Message-ID: <20210625200536.728323-1-kafai@fb.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210625200446.723230-1-kafai@fb.com> References: <20210625200446.723230-1-kafai@fb.com> MIME-Version: 1.0 X-FB-Internal: Safe X-Proofpoint-ORIG-GUID: lcpx8bRmTzlbxWQarNsi3L_MzZNIvu0R X-Proofpoint-GUID: lcpx8bRmTzlbxWQarNsi3L_MzZNIvu0R X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391,18.0.790 definitions=2021-06-25_07:2021-06-25,2021-06-25 signatures=0 X-Proofpoint-Spam-Details: rule=fb_default_notspam policy=fb_default score=0 phishscore=0 priorityscore=1501 bulkscore=0 clxscore=1015 impostorscore=0 lowpriorityscore=0 adultscore=0 suspectscore=0 mlxlogscore=999 mlxscore=0 spamscore=0 malwarescore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104190000 definitions=main-2106250123 X-FB-Internal: deliver Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net This patch adds tests for the batching and bpf_setsockopt in bpf tcp iter. It first creates: a) 1 non SO_REUSEPORT listener in lhash2. b) 256 passive and active fds connected to the listener in (a). c) 256 SO_REUSEPORT listeners in one of the lhash2 bucket. The test sets all listeners and connections to bpf_cubic before running the bpf iter. The bpf iter then calls setsockopt(TCP_CONGESTION) to switch each listener and connection from bpf_cubic to bpf_dctcp. The bpf iter has a random_retry mode such that it can return EAGAIN to the usespace in the middle of a batch. Signed-off-by: Martin KaFai Lau --- tools/testing/selftests/bpf/network_helpers.c | 85 ++++++- tools/testing/selftests/bpf/network_helpers.h | 4 + .../bpf/prog_tests/bpf_iter_setsockopt.c | 226 ++++++++++++++++++ .../selftests/bpf/progs/bpf_iter_setsockopt.c | 76 ++++++ .../selftests/bpf/progs/bpf_tracing_net.h | 4 + 5 files changed, 386 insertions(+), 9 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c index 2060bc122c53..26468a8f44f3 100644 --- a/tools/testing/selftests/bpf/network_helpers.c +++ b/tools/testing/selftests/bpf/network_helpers.c @@ -66,17 +66,13 @@ int settimeo(int fd, int timeout_ms) #define save_errno_close(fd) ({ int __save = errno; close(fd); errno = __save; }) -int start_server(int family, int type, const char *addr_str, __u16 port, - int timeout_ms) +static int __start_server(int type, const struct sockaddr *addr, + socklen_t addrlen, int timeout_ms, bool reuseport) { - struct sockaddr_storage addr = {}; - socklen_t len; + int on = 1; int fd; - if (make_sockaddr(family, addr_str, port, &addr, &len)) - return -1; - - fd = socket(family, type, 0); + fd = socket(addr->sa_family, type, 0); if (fd < 0) { log_err("Failed to create server socket"); return -1; @@ -85,7 +81,13 @@ int start_server(int family, int type, const char *addr_str, __u16 port, if (settimeo(fd, timeout_ms)) goto error_close; - if (bind(fd, (const struct sockaddr *)&addr, len) < 0) { + if (reuseport && + setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on))) { + log_err("Failed to set SO_REUSEPORT"); + return -1; + } + + if (bind(fd, addr, addrlen) < 0) { log_err("Failed to bind socket"); goto error_close; } @@ -104,6 +106,69 @@ int start_server(int family, int type, const char *addr_str, __u16 port, return -1; } +int start_server(int family, int type, const char *addr_str, __u16 port, + int timeout_ms) +{ + struct sockaddr_storage addr; + socklen_t addrlen; + + if (make_sockaddr(family, addr_str, port, &addr, &addrlen)) + return -1; + + return __start_server(type, (struct sockaddr *)&addr, + addrlen, timeout_ms, false); +} + +int *start_reuseport_server(int family, int type, const char *addr_str, + __u16 port, int timeout_ms, unsigned int nr_listens) +{ + struct sockaddr_storage addr; + unsigned int nr_fds = 0; + socklen_t addrlen; + int *fds; + + if (!nr_listens) + return NULL; + + if (make_sockaddr(family, addr_str, port, &addr, &addrlen)) + return NULL; + + fds = malloc(sizeof(*fds) * nr_listens); + if (!fds) + return NULL; + + fds[0] = __start_server(type, (struct sockaddr *)&addr, addrlen, + timeout_ms, true); + if (fds[0] == -1) + goto close_fds; + nr_fds = 1; + + if (getsockname(fds[0], (struct sockaddr *)&addr, &addrlen)) + goto close_fds; + + for (; nr_fds < nr_listens; nr_fds++) { + fds[nr_fds] = __start_server(type, (struct sockaddr *)&addr, + addrlen, timeout_ms, true); + if (fds[nr_fds] == -1) + goto close_fds; + } + + return fds; + +close_fds: + free_fds(fds, nr_fds); + return NULL; +} + +void free_fds(int *fds, unsigned int nr_close_fds) +{ + if (fds) { + while (nr_close_fds) + close(fds[--nr_close_fds]); + free(fds); + } +} + int fastopen_connect(int server_fd, const char *data, unsigned int data_len, int timeout_ms) { @@ -217,6 +282,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port, if (family == AF_INET) { struct sockaddr_in *sin = (void *)addr; + memset(addr, 0, sizeof(*sin)); sin->sin_family = AF_INET; sin->sin_port = htons(port); if (addr_str && @@ -230,6 +296,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port, } else if (family == AF_INET6) { struct sockaddr_in6 *sin6 = (void *)addr; + memset(addr, 0, sizeof(*sin6)); sin6->sin6_family = AF_INET6; sin6->sin6_port = htons(port); if (addr_str && diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h index 5e0d51c07b63..d60bc2897770 100644 --- a/tools/testing/selftests/bpf/network_helpers.h +++ b/tools/testing/selftests/bpf/network_helpers.h @@ -36,6 +36,10 @@ extern struct ipv6_packet pkt_v6; int settimeo(int fd, int timeout_ms); int start_server(int family, int type, const char *addr, __u16 port, int timeout_ms); +int *start_reuseport_server(int family, int type, const char *addr_str, + __u16 port, int timeout_ms, + unsigned int nr_listens); +void free_fds(int *fds, unsigned int nr_close_fds); int connect_to_fd(int server_fd, int timeout_ms); int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms); int fastopen_connect(int server_fd, const char *data, unsigned int data_len, diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c new file mode 100644 index 000000000000..85babb0487b3 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter_setsockopt.c @@ -0,0 +1,226 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2021 Facebook */ +#define _GNU_SOURCE +#include +#include +#include "network_helpers.h" +#include "bpf_dctcp.skel.h" +#include "bpf_cubic.skel.h" +#include "bpf_iter_setsockopt.skel.h" + +static int create_netns(void) +{ + if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns")) + return -1; + + if (!ASSERT_OK(system("ip link set dev lo up"), "bring up lo")) + return -1; + + return 0; +} + +static unsigned int set_bpf_cubic(int *fds, unsigned int nr_fds) +{ + unsigned int i; + + for (i = 0; i < nr_fds; i++) { + if (setsockopt(fds[i], SOL_TCP, TCP_CONGESTION, "bpf_cubic", + sizeof("bpf_cubic"))) + return i; + } + + return nr_fds; +} + +static unsigned int check_bpf_dctcp(int *fds, unsigned int nr_fds) +{ + char tcp_cc[16]; + socklen_t optlen = sizeof(tcp_cc); + unsigned int i; + + for (i = 0; i < nr_fds; i++) { + if (getsockopt(fds[i], SOL_TCP, TCP_CONGESTION, + tcp_cc, &optlen) || + strcmp(tcp_cc, "bpf_dctcp")) + return i; + } + + return nr_fds; +} + +static int *make_established(int listen_fd, unsigned int nr_est, + int **paccepted_fds) +{ + int *est_fds, *accepted_fds; + unsigned int i; + + est_fds = malloc(sizeof(*est_fds) * nr_est); + if (!est_fds) + return NULL; + + accepted_fds = malloc(sizeof(*accepted_fds) * nr_est); + if (!accepted_fds) { + free(est_fds); + return NULL; + } + + for (i = 0; i < nr_est; i++) { + est_fds[i] = connect_to_fd(listen_fd, 0); + if (est_fds[i] == -1) + break; + if (set_bpf_cubic(&est_fds[i], 1) != 1) { + close(est_fds[i]); + break; + } + + accepted_fds[i] = accept(listen_fd, NULL, 0); + if (accepted_fds[i] == -1) { + close(est_fds[i]); + break; + } + } + + if (!ASSERT_EQ(i, nr_est, "create established fds")) { + free_fds(accepted_fds, i); + free_fds(est_fds, i); + return NULL; + } + + *paccepted_fds = accepted_fds; + return est_fds; +} + +static unsigned short get_local_port(int fd) +{ + struct sockaddr_in6 addr; + socklen_t addrlen = sizeof(addr); + + if (!getsockname(fd, &addr, &addrlen)) + return ntohs(addr.sin6_port); + + return 0; +} + +static void do_bpf_iter_setsockopt(struct bpf_iter_setsockopt *iter_skel, + bool random_retry) +{ + int *reuse_listen_fds = NULL, *accepted_fds = NULL, *est_fds = NULL; + unsigned int nr_reuse_listens = 256, nr_est = 256; + int err, iter_fd = -1, listen_fd = -1; + char buf; + + /* Prepare non-reuseport listen_fd */ + listen_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0); + if (!ASSERT_GE(listen_fd, 0, "start_server")) + return; + if (!ASSERT_EQ(set_bpf_cubic(&listen_fd, 1), 1, + "set listen_fd to cubic")) + goto done; + iter_skel->bss->listen_hport = get_local_port(listen_fd); + if (!ASSERT_NEQ(iter_skel->bss->listen_hport, 0, + "get_local_port(listen_fd)")) + goto done; + + /* Connect to non-reuseport listen_fd */ + est_fds = make_established(listen_fd, nr_est, &accepted_fds); + if (!ASSERT_OK_PTR(est_fds, "create established")) + goto done; + + /* Prepare reuseport listen fds */ + reuse_listen_fds = start_reuseport_server(AF_INET6, SOCK_STREAM, + "::1", 0, 0, + nr_reuse_listens); + if (!ASSERT_OK_PTR(reuse_listen_fds, "start_reuseport_server")) + goto done; + if (!ASSERT_EQ(set_bpf_cubic(reuse_listen_fds, nr_reuse_listens), + nr_reuse_listens, "set reuse_listen_fds to cubic")) + goto done; + iter_skel->bss->reuse_listen_hport = get_local_port(reuse_listen_fds[0]); + if (!ASSERT_NEQ(iter_skel->bss->reuse_listen_hport, 0, + "get_local_port(reuse_listen_fds[0])")) + goto done; + + /* Run bpf tcp iter to switch from bpf_cubic to bpf_dctcp */ + iter_skel->bss->random_retry = random_retry; + iter_fd = bpf_iter_create(bpf_link__fd(iter_skel->links.change_tcp_cc)); + if (!ASSERT_GE(iter_fd, 0, "create iter_fd")) + goto done; + + while ((err = read(iter_fd, &buf, sizeof(buf))) == -1 && + errno == EAGAIN) + ; + if (!ASSERT_OK(err, "read iter error")) + goto done; + + /* Check reuseport listen fds for dctcp */ + ASSERT_EQ(check_bpf_dctcp(reuse_listen_fds, nr_reuse_listens), + nr_reuse_listens, + "check reuse_listen_fds dctcp"); + + /* Check non reuseport listen fd for dctcp */ + ASSERT_EQ(check_bpf_dctcp(&listen_fd, 1), 1, + "check listen_fd dctcp"); + + /* Check established fds for dctcp */ + ASSERT_EQ(check_bpf_dctcp(est_fds, nr_est), nr_est, + "check est_fds dctcp"); + + /* Check accepted fds for dctcp */ + ASSERT_EQ(check_bpf_dctcp(accepted_fds, nr_est), nr_est, + "check accepted_fds dctcp"); + +done: + if (iter_fd != -1) + close(iter_fd); + if (listen_fd != -1) + close(listen_fd); + free_fds(reuse_listen_fds, nr_reuse_listens); + free_fds(accepted_fds, nr_est); + free_fds(est_fds, nr_est); +} + +void test_bpf_iter_setsockopt(void) +{ + struct bpf_iter_setsockopt *iter_skel = NULL; + struct bpf_cubic *cubic_skel = NULL; + struct bpf_dctcp *dctcp_skel = NULL; + struct bpf_link *cubic_link = NULL; + struct bpf_link *dctcp_link = NULL; + + if (create_netns()) + return; + + /* Load iter_skel */ + iter_skel = bpf_iter_setsockopt__open_and_load(); + if (!ASSERT_OK_PTR(iter_skel, "iter_skel")) + return; + iter_skel->links.change_tcp_cc = bpf_program__attach_iter(iter_skel->progs.change_tcp_cc, NULL); + if (!ASSERT_OK_PTR(iter_skel->links.change_tcp_cc, "attach iter")) + goto done; + + /* Load bpf_cubic */ + cubic_skel = bpf_cubic__open_and_load(); + if (!ASSERT_OK_PTR(cubic_skel, "cubic_skel")) + goto done; + cubic_link = bpf_map__attach_struct_ops(cubic_skel->maps.cubic); + if (!ASSERT_OK_PTR(cubic_link, "cubic_link")) + goto done; + + /* Load bpf_dctcp */ + dctcp_skel = bpf_dctcp__open_and_load(); + if (!ASSERT_OK_PTR(dctcp_skel, "dctcp_skel")) + goto done; + dctcp_link = bpf_map__attach_struct_ops(dctcp_skel->maps.dctcp); + if (!ASSERT_OK_PTR(dctcp_link, "dctcp_link")) + goto done; + + do_bpf_iter_setsockopt(iter_skel, true); + do_bpf_iter_setsockopt(iter_skel, false); + +done: + bpf_link__destroy(cubic_link); + bpf_link__destroy(dctcp_link); + bpf_cubic__destroy(cubic_skel); + bpf_dctcp__destroy(dctcp_skel); + bpf_iter_setsockopt__destroy(iter_skel); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c b/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c new file mode 100644 index 000000000000..72cc4c1c681e --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_iter_setsockopt.c @@ -0,0 +1,76 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2021 Facebook */ +#include "bpf_iter.h" +#include "bpf_tracing_net.h" +#include +#include + +#define sk_num __sk_common.skc_num +#define sk_dport __sk_common.skc_dport +#define sk_state __sk_common.skc_state +#define sk_family __sk_common.skc_family + +#define bpf_tcp_sk(skc) ({ \ + struct sock_common *_skc = skc; \ + sk = NULL; \ + tp = NULL; \ + if (_skc) { \ + tp = bpf_skc_to_tcp_sock(_skc); \ + sk = (struct sock *)tp; \ + } \ + tp; \ +}) + +unsigned short reuse_listen_hport = 0; +unsigned short listen_hport = 0; +char old_cc[TCP_CA_NAME_MAX] = "bpf_cubic"; +char new_cc[TCP_CA_NAME_MAX] = "bpf_dctcp"; +bool random_retry = false; + +static bool tcp_ca_eq(const char *a, const char *b) +{ + int i; + + for (i = 0; i < TCP_CA_NAME_MAX; i++) { + if (a[i] != b[i]) + return false; + if (!a[i]) + break; + } + + return true; +} + +SEC("iter/tcp") +int change_tcp_cc(struct bpf_iter__tcp *ctx) +{ + struct bpf_iter_meta *meta = ctx->meta; + struct inet_connection_sock *icsk; + struct seq_file *seq = meta->seq; + struct tcp_sock *tp; + struct sock *sk; + int ret; + + if (!bpf_tcp_sk(ctx->sk_common)) + return 0; + + if (sk->sk_family != AF_INET6 || + (sk->sk_state != TCP_LISTEN && + sk->sk_state != TCP_ESTABLISHED) || + (sk->sk_num != reuse_listen_hport && + sk->sk_num != listen_hport && + bpf_ntohs(sk->sk_dport) != listen_hport)) + return 0; + + icsk = (struct inet_connection_sock *)tp; + if (!tcp_ca_eq(icsk->icsk_ca_ops->name, old_cc)) + return 0; + + if (random_retry && bpf_get_prandom_u32() % 4 == 1) + return 1; + + bpf_setsockopt(tp, SOL_TCP, TCP_CONGESTION, new_cc, sizeof(new_cc)); + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h index 01378911252b..d66c4caaeec1 100644 --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h @@ -5,6 +5,10 @@ #define AF_INET 2 #define AF_INET6 10 +#define SOL_TCP 6 +#define TCP_CONGESTION 13 +#define TCP_CA_NAME_MAX 16 + #define ICSK_TIME_RETRANS 1 #define ICSK_TIME_PROBE0 3 #define ICSK_TIME_LOSS_PROBE 5