From patchwork Fri Apr 18 22:46:39 2025
X-Patchwork-Id: 14057741
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 01/12] bpf: Check KF_bpf_rbtree_add_impl for the "case KF_ARG_PTR_TO_RB_NODE"
Date: Fri, 18 Apr 2025 15:46:39 -0700
Message-ID: <20250418224652.105998-2-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

In a later patch, two new kfuncs will take a bpf_rb_node pointer arg:
    struct bpf_rb_node *bpf_rbtree_left(struct bpf_rb_root *root,
                                        struct bpf_rb_node *node);
    struct bpf_rb_node *bpf_rbtree_right(struct bpf_rb_root *root,
                                         struct bpf_rb_node *node);

In check_kfunc_call, the "case KF_ARG_PTR_TO_RB_NODE" checks whether
reg->type should be an allocated pointer or a non_owning_ref. The later
patch will need to ensure that the bpf_rb_node pointer passed to the new
bpf_rbtree_{left,right} is a non_owning_ref, the same requirement as the
existing bpf_rbtree_remove.

This patch swaps the current "if else" statement: instead of checking for
bpf_rbtree_remove, it checks for bpf_rbtree_add, so the new
bpf_rbtree_{left,right} will fall into the "else" case and keep the later
patch simpler. bpf_rbtree_add should be the only kfunc that needs an
allocated pointer. This should be a no-op change, considering rbtree_add
and rbtree_remove are currently the only two kfuncs taking a bpf_rb_node
pointer arg.

Signed-off-by: Martin KaFai Lau
---
 kernel/bpf/verifier.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 54c6953a8b84..2e1ce7debc16 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13200,22 +13200,22 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			return ret;
 		break;
 	case KF_ARG_PTR_TO_RB_NODE:
-		if (meta->func_id == special_kfunc_list[KF_bpf_rbtree_remove]) {
-			if (!type_is_non_owning_ref(reg->type) || reg->ref_obj_id) {
-				verbose(env, "rbtree_remove node input must be non-owning ref\n");
+		if (meta->func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
+			if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
+				verbose(env, "arg#%d expected pointer to allocated object\n", i);
 				return -EINVAL;
 			}
-			if (in_rbtree_lock_required_cb(env)) {
-				verbose(env, "rbtree_remove not allowed in rbtree cb\n");
+			if (!reg->ref_obj_id) {
+				verbose(env, "allocated object must be referenced\n");
 				return -EINVAL;
 			}
 		} else {
-			if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
-				verbose(env, "arg#%d expected pointer to allocated object\n", i);
+			if (!type_is_non_owning_ref(reg->type) || reg->ref_obj_id) {
+				verbose(env, "rbtree_remove node input must be non-owning ref\n");
 				return -EINVAL;
 			}
-			if (!reg->ref_obj_id) {
-				verbose(env, "allocated object must be referenced\n");
+			if (in_rbtree_lock_required_cb(env)) {
+				verbose(env, "rbtree_remove not allowed in rbtree cb\n");
 				return -EINVAL;
 			}
 		}

From patchwork Fri Apr 18 22:46:40 2025
X-Patchwork-Id: 14057742
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 02/12] bpf: Simplify reg0 marking for the rbtree kfuncs that return a bpf_rb_node pointer
Date: Fri, 18 Apr 2025 15:46:40 -0700
Message-ID: <20250418224652.105998-3-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

The current rbtree kfuncs, bpf_rbtree_{first,remove}, return a bpf_rb_node
pointer. check_kfunc_call currently checks the kfunc btf_id instead of the
return pointer type to decide whether it needs to do
mark_reg_graph_node(reg0) and ref_set_non_owning(reg0). The later patch
will add bpf_rbtree_{root,left,right}, which also return a bpf_rb_node
pointer.

Instead of adding more kfunc btf_id checks to the "if" case, this patch
changes the test to check the kfunc's return type. The is_rbtree_node_type()
function is added to test whether a pointer type is a bpf_rb_node. The
callers have already skipped the modifiers of the pointer type.

A note on ref_set_non_owning(): although bpf_rbtree_remove() also returns
a bpf_rb_node pointer, it has the KF_ACQUIRE flag, so its reg0 will not
become non-owning.
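For illustration, a minimal sketch of the owning vs non-owning distinction
described above, using selftest-style names (struct node_data, groot and
glock are hypothetical here, not part of this patch):

    struct bpf_rb_node *n, *res = NULL;

    bpf_spin_lock(&glock);
    n = bpf_rbtree_first(&groot);       /* non-owning: only valid under glock */
    if (n)
        res = bpf_rbtree_remove(&groot, n); /* KF_ACQUIRE: owning ref */
    bpf_spin_unlock(&glock);

    if (res)    /* the owning ref stays valid after unlock */
        bpf_obj_drop(container_of(res, struct node_data, node));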
Signed-off-by: Martin KaFai Lau
---
 kernel/bpf/verifier.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e1ce7debc16..bf14da00f09a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11987,6 +11987,11 @@ static bool is_kfunc_arg_res_spin_lock(const struct btf *btf, const struct btf_p
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RES_SPIN_LOCK_ID);
 }
 
+static bool is_rbtree_node_type(const struct btf_type *t)
+{
+	return t == btf_type_by_id(btf_vmlinux, kf_arg_btf_ids[KF_ARG_RB_NODE_ID]);
+}
+
 static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
 				  const struct btf_param *arg)
 {
@@ -13750,8 +13755,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			struct btf_field *field = meta.arg_list_head.field;
 
 			mark_reg_graph_node(regs, BPF_REG_0, &field->graph_root);
-		} else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
-			   meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
+		} else if (is_rbtree_node_type(ptr_type)) {
 			struct btf_field *field = meta.arg_rbtree_root.field;
 
 			mark_reg_graph_node(regs, BPF_REG_0, &field->graph_root);
@@ -13881,7 +13885,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			if (is_kfunc_ret_null(&meta))
 				regs[BPF_REG_0].id = id;
 			regs[BPF_REG_0].ref_obj_id = id;
-		} else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
+		} else if (is_rbtree_node_type(ptr_type)) {
 			ref_set_non_owning(env, &regs[BPF_REG_0]);
 		}

From patchwork Fri Apr 18 22:46:41 2025
X-Patchwork-Id: 14057743
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 03/12] bpf: Add bpf_rbtree_{root,left,right} kfunc
Date: Fri, 18 Apr 2025 15:46:41 -0700
Message-ID: <20250418224652.105998-4-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

The kernel fq qdisc implementation needs to traverse an rbtree that stores
the networking "flows". In the later bpf selftest prog, the much simplified
logic that uses bpf_rbtree_{root,left,right} to traverse the tree looks
like:

    struct fq_flow {
        struct bpf_rb_node fq_node;
        struct bpf_rb_node rate_node;
        struct bpf_refcount refcount;
        unsigned long sk_long;
    };

    struct fq_flow_root {
        struct bpf_spin_lock lock;
        struct bpf_rb_root root __contains(fq_flow, fq_node);
    };

    struct fq_flow *fq_classify(...)
    {
        struct bpf_rb_node *tofree[FQ_GC_MAX];
        struct fq_flow_root *root;
        struct fq_flow *gc_f, *f;
        struct bpf_rb_node *p;
        int i, fcnt = 0;

        /* ... */

        f = NULL;
        bpf_spin_lock(&root->lock);
        p = bpf_rbtree_root(&root->root);
        while (can_loop) {
            if (!p)
                break;

            gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
            if (gc_f->sk_long == sk_long) {
                f = bpf_refcount_acquire(gc_f);
                break;
            }

            /* To be removed from the rbtree */
            if (fcnt < FQ_GC_MAX && fq_gc_candidate(gc_f, jiffies_now))
                tofree[fcnt++] = p;

            if (gc_f->sk_long > sk_long)
                p = bpf_rbtree_left(&root->root, p);
            else
                p = bpf_rbtree_right(&root->root, p);
        }

        /* remove from the rbtree */
        for (i = 0; i < fcnt; i++) {
            p = tofree[i];
            tofree[i] = bpf_rbtree_remove(&root->root, p);
        }

        bpf_spin_unlock(&root->lock);

        /* bpf_obj_drop the fq_flow(s) that have just been removed
         * from the rbtree.
         */
        for (i = 0; i < fcnt; i++) {
            p = tofree[i];
            if (p) {
                gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
                bpf_obj_drop(gc_f);
            }
        }

        return f;
    }

The above simplified code needs to traverse the rbtree for two purposes:
1) find the flow with the desired sk_long value, and
2) while searching for the sk_long, collect flows that are fq_gc_candidate(s),
   which will be removed from the rbtree.

This patch adds the bpf_rbtree_{root,left,right} kfuncs to enable the
rbtree traversal. The returned bpf_rb_node pointer will be a non-owning
reference, the same as the returned pointer of the existing
bpf_rbtree_first kfunc.
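As a minimal sketch of this non-owning property (reusing the hypothetical
fq_flow names from the example above, not code from this patch), using the
returned node after the lock is released should be rejected by the
verifier, like the "invalid mem access" checks in the selftests later in
this series:

    struct fq_flow *gc_f;
    struct bpf_rb_node *p;

    bpf_spin_lock(&root->lock);
    p = bpf_rbtree_root(&root->root);
    bpf_spin_unlock(&root->lock);

    if (p) {
        /* Rejected: the non-owning ref was invalidated at unlock */
        gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
        return gc_f->sk_long;
    }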
Signed-off-by: Martin KaFai Lau
---
 kernel/bpf/helpers.c  | 30 ++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c | 23 ++++++++++++++++++-----
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index e3a2662f4e33..36150d340c16 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2366,6 +2366,33 @@ __bpf_kfunc struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root)
 	return (struct bpf_rb_node *)rb_first_cached(r);
 }
 
+__bpf_kfunc struct bpf_rb_node *bpf_rbtree_root(struct bpf_rb_root *root)
+{
+	struct rb_root_cached *r = (struct rb_root_cached *)root;
+
+	return (struct bpf_rb_node *)r->rb_root.rb_node;
+}
+
+__bpf_kfunc struct bpf_rb_node *bpf_rbtree_left(struct bpf_rb_root *root, struct bpf_rb_node *node)
+{
+	struct bpf_rb_node_kern *node_internal = (struct bpf_rb_node_kern *)node;
+
+	if (READ_ONCE(node_internal->owner) != root)
+		return NULL;
+
+	return (struct bpf_rb_node *)node_internal->rb_node.rb_left;
+}
+
+__bpf_kfunc struct bpf_rb_node *bpf_rbtree_right(struct bpf_rb_root *root, struct bpf_rb_node *node)
+{
+	struct bpf_rb_node_kern *node_internal = (struct bpf_rb_node_kern *)node;
+
+	if (READ_ONCE(node_internal->owner) != root)
+		return NULL;
+
+	return (struct bpf_rb_node *)node_internal->rb_node.rb_right;
+}
+
 /**
  * bpf_task_acquire - Acquire a reference to a task. A task acquired by this
  * kfunc which is not stored in a map as a kptr, must be released by calling
@@ -3214,6 +3241,9 @@ BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_rbtree_add_impl)
 BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
 
 #ifdef CONFIG_CGROUPS
 BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bf14da00f09a..3624de1c6925 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12081,6 +12081,9 @@ enum special_kfunc_type {
 	KF_bpf_rbtree_remove,
 	KF_bpf_rbtree_add_impl,
 	KF_bpf_rbtree_first,
+	KF_bpf_rbtree_root,
+	KF_bpf_rbtree_left,
+	KF_bpf_rbtree_right,
 	KF_bpf_dynptr_from_skb,
 	KF_bpf_dynptr_from_xdp,
 	KF_bpf_dynptr_slice,
@@ -12121,6 +12124,9 @@ BTF_ID(func, bpf_rdonly_cast)
 BTF_ID(func, bpf_rbtree_remove)
 BTF_ID(func, bpf_rbtree_add_impl)
 BTF_ID(func, bpf_rbtree_first)
+BTF_ID(func, bpf_rbtree_root)
+BTF_ID(func, bpf_rbtree_left)
+BTF_ID(func, bpf_rbtree_right)
 #ifdef CONFIG_NET
 BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_dynptr_from_xdp)
@@ -12156,6 +12162,9 @@ BTF_ID(func, bpf_rcu_read_unlock)
 BTF_ID(func, bpf_rbtree_remove)
 BTF_ID(func, bpf_rbtree_add_impl)
 BTF_ID(func, bpf_rbtree_first)
+BTF_ID(func, bpf_rbtree_root)
+BTF_ID(func, bpf_rbtree_left)
+BTF_ID(func, bpf_rbtree_right)
 #ifdef CONFIG_NET
 BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_dynptr_from_xdp)
@@ -12591,7 +12600,10 @@ static bool is_bpf_rbtree_api_kfunc(u32 btf_id)
 {
 	return btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl] ||
 	       btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
-	       btf_id == special_kfunc_list[KF_bpf_rbtree_first];
+	       btf_id == special_kfunc_list[KF_bpf_rbtree_first] ||
+	       btf_id == special_kfunc_list[KF_bpf_rbtree_root] ||
+	       btf_id == special_kfunc_list[KF_bpf_rbtree_left] ||
+	       btf_id == special_kfunc_list[KF_bpf_rbtree_right];
 }
 
 static bool is_bpf_iter_num_api_kfunc(u32 btf_id)
@@ -12691,7 +12703,9 @@ static bool check_kfunc_is_graph_node_api(struct bpf_verifier_env *env,
 		break;
 	case BPF_RB_NODE:
 		ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
-		       kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl]);
+		       kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_add_impl] ||
+		       kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_left] ||
+		       kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_right]);
 		break;
 	default:
 		verbose(env, "verifier internal error: unexpected graph node argument type %s\n",
@@ -13216,15 +13230,14 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			}
 		} else {
 			if (!type_is_non_owning_ref(reg->type) || reg->ref_obj_id) {
-				verbose(env, "rbtree_remove node input must be non-owning ref\n");
+				verbose(env, "%s can only take non-owning bpf_rb_node pointer\n", func_name);
 				return -EINVAL;
 			}
 			if (in_rbtree_lock_required_cb(env)) {
-				verbose(env, "rbtree_remove not allowed in rbtree cb\n");
+				verbose(env, "%s not allowed in rbtree cb\n", func_name);
 				return -EINVAL;
 			}
 		}
-
 		ret = process_kf_arg_ptr_to_rbtree_node(env, reg, regno, meta);
 		if (ret < 0)
 			return ret;

From patchwork Fri Apr 18 22:46:42 2025
X-Patchwork-Id: 14057744
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 04/12] selftests/bpf: Adjust failure message in the rbtree_fail test
Date: Fri, 18 Apr 2025 15:46:42 -0700
Message-ID: <20250418224652.105998-5-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

Adjust some of the failure messages in the rbtree_fail test. After the
earlier verifier change, the expected message is now "bpf_rbtree_remove
can only take non-owning bpf_rb_node pointer".

Signed-off-by: Martin KaFai Lau
---
 tools/testing/selftests/bpf/progs/rbtree_fail.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/rbtree_fail.c b/tools/testing/selftests/bpf/progs/rbtree_fail.c
index dbd5eee8e25e..528122320471 100644
--- a/tools/testing/selftests/bpf/progs/rbtree_fail.c
+++ b/tools/testing/selftests/bpf/progs/rbtree_fail.c
@@ -69,7 +69,7 @@ long rbtree_api_nolock_first(void *ctx)
 }
 
 SEC("?tc")
-__failure __msg("rbtree_remove node input must be non-owning ref")
+__failure __msg("bpf_rbtree_remove can only take non-owning bpf_rb_node pointer")
 long rbtree_api_remove_unadded_node(void *ctx)
 {
 	struct node_data *n, *m;
@@ -178,7 +178,7 @@ long rbtree_api_use_unchecked_remove_retval(void *ctx)
 }
 
 SEC("?tc")
-__failure __msg("rbtree_remove node input must be non-owning ref")
+__failure __msg("bpf_rbtree_remove can only take non-owning bpf_rb_node pointer")
 long rbtree_api_add_release_unlock_escape(void *ctx)
 {
 	struct node_data *n;
@@ -202,7 +202,7 @@ long rbtree_api_add_release_unlock_escape(void *ctx)
 }
 
 SEC("?tc")
-__failure __msg("rbtree_remove node input must be non-owning ref")
+__failure __msg("bpf_rbtree_remove can only take non-owning bpf_rb_node pointer")
 long rbtree_api_first_release_unlock_escape(void *ctx)
 {
 	struct bpf_rb_node *res;

From patchwork Fri Apr 18 22:46:43 2025
X-Patchwork-Id: 14057745
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 05/12] bpf: Allow refcounted bpf_rb_node used in bpf_rbtree_{remove,left,right}
Date: Fri, 18 Apr 2025 15:46:43 -0700
Message-ID: <20250418224652.105998-6-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

bpf_rbtree_{remove,left,right} require the root's lock to be held. They
also check that node_internal->owner is still owned by that root before
proceeding, so it is safe to allow a refcounted bpf_rb_node pointer to be
used in these kfuncs.

In the later selftest, a networking flow (allocated by bpf_obj_new) can be
added to two different rbtrees. There are cases where the flow is searched
in one rbtree, its refcount is taken, and it is then removed from another
rbtree:

    struct fq_flow {
        struct bpf_rb_node fq_node;
        struct bpf_rb_node rate_node;
        struct bpf_refcount refcount;
        unsigned long sk_long;
    };

    int bpf_fq_enqueue(...)
    {
        /* ... */

        bpf_spin_lock(&root->lock);
        while (can_loop) {
            /* ... */
            if (!p)
                break;

            gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
            if (gc_f->sk_long == sk_long) {
                f = bpf_refcount_acquire(gc_f);
                break;
            }

            /* ... */
        }
        bpf_spin_unlock(&root->lock);

        if (f) {
            bpf_spin_lock(&q->lock);
            bpf_rbtree_remove(&q->delayed, &f->rate_node);
            bpf_spin_unlock(&q->lock);
        }
    }

bpf_rbtree_{left,right} do not need this change but are relaxed together
with bpf_rbtree_remove instead of adding extra verifier logic to exclude
these kfuncs.

Signed-off-by: Martin KaFai Lau
---
 kernel/bpf/verifier.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3624de1c6925..3b905331ca0e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13229,8 +13229,8 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			return -EINVAL;
 		}
 	} else {
-		if (!type_is_non_owning_ref(reg->type) || reg->ref_obj_id) {
-			verbose(env, "%s can only take non-owning bpf_rb_node pointer\n", func_name);
+		if (!type_is_non_owning_ref(reg->type) && !reg->ref_obj_id) {
+			verbose(env, "%s can only take non-owning or refcounted bpf_rb_node pointer\n", func_name);
 			return -EINVAL;
 		}
 		if (in_rbtree_lock_required_cb(env)) {

From patchwork Fri Apr 18 22:46:44 2025
X-Patchwork-Id: 14057746
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 06/12] selftests/bpf: Adjust test that does not allow refcounted node in rbtree_remove
Date: Fri, 18 Apr 2025 15:46:44 -0700
Message-ID: <20250418224652.105998-7-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

rbtree_remove now allows a refcounted node, so the
rbtree_api_remove_unadded_node test needs to be adjusted. First, it no
longer expects a verifier error. Second, the test now expects
bpf_rbtree_remove(&groot, &m->node) to return NULL; the test uses
__retval(0) to ensure this NULL return value.

Some of the "only take non-owning..." failure messages are also changed.

Signed-off-by: Martin KaFai Lau
---
 .../testing/selftests/bpf/progs/rbtree_fail.c | 26 ++++++++++---------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/rbtree_fail.c b/tools/testing/selftests/bpf/progs/rbtree_fail.c
index 528122320471..b2e24f018a3f 100644
--- a/tools/testing/selftests/bpf/progs/rbtree_fail.c
+++ b/tools/testing/selftests/bpf/progs/rbtree_fail.c
@@ -69,11 +69,11 @@ long rbtree_api_nolock_first(void *ctx)
 }
 
 SEC("?tc")
-__failure __msg("bpf_rbtree_remove can only take non-owning bpf_rb_node pointer")
+__retval(0)
 long rbtree_api_remove_unadded_node(void *ctx)
 {
 	struct node_data *n, *m;
-	struct bpf_rb_node *res;
+	struct bpf_rb_node *res_n, *res_m;
 
 	n = bpf_obj_new(typeof(*n));
 	if (!n)
@@ -89,18 +89,20 @@ long rbtree_api_remove_unadded_node(void *ctx)
 	bpf_rbtree_add(&groot, &n->node, less);
 
 	/* This remove should pass verifier */
-	res = bpf_rbtree_remove(&groot, &n->node);
-	n = container_of(res, struct node_data, node);
+	res_n = bpf_rbtree_remove(&groot, &n->node);
 
 	/* This remove shouldn't, m isn't in an rbtree */
-	res = bpf_rbtree_remove(&groot, &m->node);
-	m = container_of(res, struct node_data, node);
+	res_m = bpf_rbtree_remove(&groot, &m->node);
 	bpf_spin_unlock(&glock);
 
-	if (n)
-		bpf_obj_drop(n);
-	if (m)
-		bpf_obj_drop(m);
+	bpf_obj_drop(m);
+	if (res_n)
+		bpf_obj_drop(container_of(res_n, struct node_data, node));
+	if (res_m) {
+		bpf_obj_drop(container_of(res_m, struct node_data, node));
+		return 2;
+	}
+
 	return 0;
 }
 
@@ -178,7 +180,7 @@ long rbtree_api_use_unchecked_remove_retval(void *ctx)
 }
 
 SEC("?tc")
-__failure __msg("bpf_rbtree_remove can only take non-owning bpf_rb_node pointer")
+__failure __msg("bpf_rbtree_remove can only take non-owning or refcounted bpf_rb_node pointer")
 long rbtree_api_add_release_unlock_escape(void *ctx)
 {
 	struct node_data *n;
@@ -202,7 +204,7 @@ long rbtree_api_add_release_unlock_escape(void *ctx)
 }
 
 SEC("?tc")
-__failure __msg("bpf_rbtree_remove can only take non-owning bpf_rb_node pointer")
+__failure __msg("bpf_rbtree_remove can only take non-owning or refcounted bpf_rb_node pointer")
 long rbtree_api_first_release_unlock_escape(void *ctx)
 {
 	struct bpf_rb_node *res;

From patchwork Fri Apr 18 22:46:45 2025
X-Patchwork-Id: 14057747
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 07/12] selftests/bpf: Add rbtree_search test
Date: Fri, 18 Apr 2025 15:46:45 -0700
Message-ID: <20250418224652.105998-8-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

This patch adds a much simplified version of the rbtree usage in the
kernel sch_fq qdisc. It has a "struct node_data" that can be added to two
different rbtrees, each ordered by a different key.

The test first populates both rbtrees. It then searches for a lookup_key
in the "groot0" rbtree. Once the lookup_key is found, the node's refcount
is taken. The node is then removed from the other "groot1" rbtree. While
searching for the lookup_key, the test also tries to remove all rbnodes
in the path leading to the lookup_key.

Signed-off-by: Martin KaFai Lau
---
 .../testing/selftests/bpf/prog_tests/rbtree.c |   6 +
 .../selftests/bpf/progs/rbtree_search.c       | 137 ++++++++++++++++++
 2 files changed, 143 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/rbtree_search.c

diff --git a/tools/testing/selftests/bpf/prog_tests/rbtree.c b/tools/testing/selftests/bpf/prog_tests/rbtree.c
index 9818f06c97c5..d8f3d7a45fe9 100644
--- a/tools/testing/selftests/bpf/prog_tests/rbtree.c
+++ b/tools/testing/selftests/bpf/prog_tests/rbtree.c
@@ -8,6 +8,7 @@
 #include "rbtree_fail.skel.h"
 #include "rbtree_btf_fail__wrong_node_type.skel.h"
 #include "rbtree_btf_fail__add_wrong_type.skel.h"
+#include "rbtree_search.skel.h"
 
 static void test_rbtree_add_nodes(void)
 {
@@ -187,3 +188,8 @@ void test_rbtree_fail(void)
 {
 	RUN_TESTS(rbtree_fail);
 }
+
+void test_rbtree_search(void)
+{
+	RUN_TESTS(rbtree_search);
+}
diff --git a/tools/testing/selftests/bpf/progs/rbtree_search.c b/tools/testing/selftests/bpf/progs/rbtree_search.c
new file mode 100644
index 000000000000..475f7cf3285f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rbtree_search.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+struct node_data {
+	struct bpf_refcount ref;
+	struct bpf_rb_node r0;
+	struct bpf_rb_node r1;
+	int key0;
+	int key1;
+};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+private(A) struct bpf_spin_lock glock0;
+private(A) struct bpf_rb_root groot0 __contains(node_data, r0);
+
+private(B) struct bpf_spin_lock glock1;
+private(B) struct bpf_rb_root groot1 __contains(node_data, r1);
+
+#define rb_entry(ptr, type, member) container_of(ptr, type, member)
+#define NR_NODES 16
+
+int zero = 0;
+
+static bool less0(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct node_data *node_a;
+	struct node_data *node_b;
+
+	node_a = rb_entry(a, struct node_data, r0);
+	node_b = rb_entry(b, struct node_data, r0);
+
+	return node_a->key0 < node_b->key0;
+}
+
+static bool less1(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+	struct node_data *node_a;
+	struct node_data *node_b;
+
+	node_a = rb_entry(a, struct node_data, r1);
+	node_b = rb_entry(b, struct node_data, r1);
+
+	return node_a->key1 < node_b->key1;
+}
+
+SEC("syscall")
+__retval(0)
+long rbtree_search(void *ctx)
+{
+	struct bpf_rb_node *rb_n, *rb_m, *gc_ns[NR_NODES];
+	long lookup_key = NR_NODES / 2;
+	struct node_data *n, *m;
+	int i, err, nr_gc = 0;
+
+	for (i = zero; i < NR_NODES && can_loop; i++) {
+		n = bpf_obj_new(typeof(*n));
+		if (!n)
+			return __LINE__;
+
+		m = bpf_refcount_acquire(n);
+
+		n->key0 = i;
+		m->key1 = i;
+
+		bpf_spin_lock(&glock0);
+		err = bpf_rbtree_add(&groot0, &n->r0, less0);
+		bpf_spin_unlock(&glock0);
+
+		bpf_spin_lock(&glock1);
+		err = bpf_rbtree_add(&groot1, &m->r1, less1);
+		bpf_spin_unlock(&glock1);
+
+		if (err)
+			return __LINE__;
+	}
+
+	n = NULL;
+	bpf_spin_lock(&glock0);
+	rb_n = bpf_rbtree_root(&groot0);
+	while (can_loop) {
+		if (!rb_n) {
+			bpf_spin_unlock(&glock0);
+			return __LINE__;
+		}
+
+		n = rb_entry(rb_n, struct node_data, r0);
+		if (lookup_key == n->key0)
+			break;
+		if (nr_gc < NR_NODES)
+			gc_ns[nr_gc++] = rb_n;
+		if (lookup_key < n->key0)
+			rb_n = bpf_rbtree_left(&groot0, rb_n);
+		else
+			rb_n = bpf_rbtree_right(&groot0, rb_n);
+	}
+
+	if (!n || lookup_key != n->key0) {
+		bpf_spin_unlock(&glock0);
+		return __LINE__;
+	}
+
+	for (i = 0; i < nr_gc; i++) {
+		rb_n = gc_ns[i];
+		gc_ns[i] = bpf_rbtree_remove(&groot0, rb_n);
+	}
+
+	m = bpf_refcount_acquire(n);
+	bpf_spin_unlock(&glock0);
+
+	for (i = 0; i < nr_gc; i++) {
+		rb_n = gc_ns[i];
+		if (rb_n) {
+			n = rb_entry(rb_n, struct node_data, r0);
+			bpf_obj_drop(n);
+		}
+	}
+
+	if (!m)
+		return __LINE__;
+
+	bpf_spin_lock(&glock1);
+	rb_m = bpf_rbtree_remove(&groot1, &m->r1);
+	bpf_spin_unlock(&glock1);
+	bpf_obj_drop(m);
+	if (!rb_m)
+		return __LINE__;
+	bpf_obj_drop(rb_entry(rb_m, struct node_data, r1));
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";

From patchwork Fri Apr 18 22:46:46 2025
X-Patchwork-Id: 14057748
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 08/12] bpf: Simplify reg0 marking for the list kfuncs that return a bpf_list_node pointer
Date: Fri, 18 Apr 2025 15:46:46 -0700
Message-ID: <20250418224652.105998-9-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

The next patch will add bpf_list_{front,back} kfuncs to peek at the head
and tail of a list. Both of them will return a 'struct bpf_list_node *'.

Following the earlier change for rbtree, this patch checks whether the
return btf type is a 'struct bpf_list_node' pointer instead of checking
each kfunc individually to decide if mark_reg_graph_node should be called.
This will make the bpf_list_{front,back} kfunc addition easier in the
later patch.
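For context, a rough sketch of the program-visible difference this is
building toward (using the struct node_data/ghead/glock naming from the
selftest added later in this series; the snippet itself is illustrative,
not part of this patch): bpf_list_pop_front() is KF_ACQUIRE and returns an
owning node, while the upcoming bpf_list_front() only peeks and returns a
non-owning node:

    struct bpf_list_node *ln;
    int val = 0;

    bpf_spin_lock(&glock);
    ln = bpf_list_pop_front(&ghead);    /* KF_ACQUIRE: owning, node is unlinked */
    bpf_spin_unlock(&glock);
    if (ln)     /* still valid after unlock, must be dropped */
        bpf_obj_drop(container_of(ln, struct node_data, l));

    bpf_spin_lock(&glock);
    ln = bpf_list_front(&ghead);        /* non-owning: node stays linked */
    if (ln)     /* must be inspected before unlock */
        val = container_of(ln, struct node_data, l)->key;
    bpf_spin_unlock(&glock);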
Signed-off-by: Martin KaFai Lau
---
 kernel/bpf/verifier.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3b905331ca0e..aacde0274e0f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11992,6 +11992,11 @@ static bool is_rbtree_node_type(const struct btf_type *t)
 	return t == btf_type_by_id(btf_vmlinux, kf_arg_btf_ids[KF_ARG_RB_NODE_ID]);
 }
 
+static bool is_list_node_type(const struct btf_type *t)
+{
+	return t == btf_type_by_id(btf_vmlinux, kf_arg_btf_ids[KF_ARG_LIST_NODE_ID]);
+}
+
 static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
 				  const struct btf_param *arg)
 {
@@ -13763,8 +13768,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			insn_aux->kptr_struct_meta = btf_find_struct_meta(meta.arg_btf,
 									  meta.arg_btf_id);
-		} else if (meta.func_id == special_kfunc_list[KF_bpf_list_pop_front] ||
-			   meta.func_id == special_kfunc_list[KF_bpf_list_pop_back]) {
+		} else if (is_list_node_type(ptr_type)) {
 			struct btf_field *field = meta.arg_list_head.field;
 
 			mark_reg_graph_node(regs, BPF_REG_0, &field->graph_root);

From patchwork Fri Apr 18 22:46:47 2025
X-Patchwork-Id: 14057749
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
    netdev@vger.kernel.org, kernel-team@meta.com, Amery Hung
Subject: [RFC PATCH bpf-next 09/12] bpf: Add bpf_list_{front,back} kfunc
Date: Fri, 18 Apr 2025 15:46:47 -0700
Message-ID: <20250418224652.105998-10-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>

The kernel fq qdisc implementation often only needs to look at the fields
of the first node in a list but does not always need to remove it from
the list. It is more convenient to have a peek kfunc for the list, working
similarly to bpf_rbtree_first().

This patch adds the bpf_list_{front,back} kfuncs. The verifier is changed
such that a kfunc returning "struct bpf_list_node *" will have its return
pointer marked as non-owning. The exception is a KF_ACQUIRE kfunc. The net
effect is that only the new bpf_list_{front,back} kfuncs will have their
return pointers marked as non-owning.
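As a rough sketch of the fq-style use case described above (the field and
variable names q->old_flows, flow_node and time_next_packet are
illustrative, not from this patch), a program can inspect the head of a
list under the lock without dequeuing it:

    struct bpf_list_node *ln;
    struct fq_flow *f;
    bool expired = false;

    bpf_spin_lock(&q->lock);
    ln = bpf_list_front(&q->old_flows);	/* peek only, no dequeue */
    if (ln) {
        f = container_of(ln, struct fq_flow, flow_node);
        expired = f->time_next_packet <= now;	/* inspect under the lock */
    }
    bpf_spin_unlock(&q->lock);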
Signed-off-by: Martin KaFai Lau
---
 kernel/bpf/helpers.c  | 22 ++++++++++++++++++++++
 kernel/bpf/verifier.c | 12 ++++++++++--
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 36150d340c16..78cefb41266a 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2293,6 +2293,26 @@ __bpf_kfunc struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
 	return __bpf_list_del(head, true);
 }
 
+__bpf_kfunc struct bpf_list_node *bpf_list_front(struct bpf_list_head *head)
+{
+	struct list_head *h = (struct list_head *)head;
+
+	if (list_empty(h) || unlikely(!h->next))
+		return NULL;
+
+	return (struct bpf_list_node *)h->next;
+}
+
+__bpf_kfunc struct bpf_list_node *bpf_list_back(struct bpf_list_head *head)
+{
+	struct list_head *h = (struct list_head *)head;
+
+	if (list_empty(h) || unlikely(!h->next))
+		return NULL;
+
+	return (struct bpf_list_node *)h->prev;
+}
+
 __bpf_kfunc struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
 						  struct bpf_rb_node *node)
 {
@@ -3236,6 +3256,8 @@ BTF_ID_FLAGS(func, bpf_list_push_front_impl)
 BTF_ID_FLAGS(func, bpf_list_push_back_impl)
 BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_list_front, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_list_back, KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE | KF_RET_NULL)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index aacde0274e0f..78a9b3d1cd29 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12079,6 +12079,8 @@ enum special_kfunc_type {
 	KF_bpf_list_push_back_impl,
 	KF_bpf_list_pop_front,
 	KF_bpf_list_pop_back,
+	KF_bpf_list_front,
+	KF_bpf_list_back,
 	KF_bpf_cast_to_kern_ctx,
 	KF_bpf_rdonly_cast,
 	KF_bpf_rcu_read_lock,
@@ -12124,6 +12126,8 @@ BTF_ID(func, bpf_list_push_front_impl)
 BTF_ID(func, bpf_list_push_back_impl)
 BTF_ID(func, bpf_list_pop_front)
 BTF_ID(func, bpf_list_pop_back)
+BTF_ID(func, bpf_list_front)
+BTF_ID(func, bpf_list_back)
 BTF_ID(func, bpf_cast_to_kern_ctx)
 BTF_ID(func, bpf_rdonly_cast)
 BTF_ID(func, bpf_rbtree_remove)
@@ -12160,6 +12164,8 @@ BTF_ID(func, bpf_list_push_front_impl)
 BTF_ID(func, bpf_list_push_back_impl)
 BTF_ID(func, bpf_list_pop_front)
 BTF_ID(func, bpf_list_pop_back)
+BTF_ID(func, bpf_list_front)
+BTF_ID(func, bpf_list_back)
 BTF_ID(func, bpf_cast_to_kern_ctx)
 BTF_ID(func, bpf_rdonly_cast)
 BTF_ID(func, bpf_rcu_read_lock)
@@ -12598,7 +12604,9 @@ static bool is_bpf_list_api_kfunc(u32 btf_id)
 	return btf_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
 	       btf_id == special_kfunc_list[KF_bpf_list_push_back_impl] ||
 	       btf_id == special_kfunc_list[KF_bpf_list_pop_front] ||
-	       btf_id == special_kfunc_list[KF_bpf_list_pop_back];
+	       btf_id == special_kfunc_list[KF_bpf_list_pop_back] ||
+	       btf_id == special_kfunc_list[KF_bpf_list_front] ||
+	       btf_id == special_kfunc_list[KF_bpf_list_back];
 }
 
 static bool is_bpf_rbtree_api_kfunc(u32 btf_id)
@@ -13902,7 +13910,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		if (is_kfunc_ret_null(&meta))
 			regs[BPF_REG_0].id = id;
 		regs[BPF_REG_0].ref_obj_id = id;
-	} else if (is_rbtree_node_type(ptr_type)) {
+	} else if (is_rbtree_node_type(ptr_type) || is_list_node_type(ptr_type)) {
 		ref_set_non_owning(env, &regs[BPF_REG_0]);
 	}

From patchwork Fri Apr 18 22:46:48 2025
X-Patchwork-Id: 14057750
charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Martin KaFai Lau X-Patchwork-Id: 14057750 X-Patchwork-Delegate: bpf@iogearbox.net Received: from out-179.mta1.migadu.com (out-179.mta1.migadu.com [95.215.58.179]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EBE1F255249 for ; Fri, 18 Apr 2025 22:47:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.179 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745016458; cv=none; b=j4ft+HBZxquD/X2/XHUb92l7zDiWQhZhVDRloLAXHEwJ5KIEPkmtfHlKd+eiTwgDlhLeUP2NOTDYuP0vX0yGnUt7UfJBxaMKEhtEhpviS9E3SN60w5GgyWTfhVtjo1bEQW0kCwE99u+1HXnxmbK66VueouZS8vFXDgI932o2LHs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1745016458; c=relaxed/simple; bh=FAVDjxgMGD68wDfcuJ9OgkQ6hNFmKo+dnfT7mWT24ZY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=BqS6svYth2VEdAp+xzWD+Zrj5CpAY07jT8gslMmfASooRWmRb3e/cjlMlGR5WLzRDjXPF8IFm+Ab1WYpt4OSRMz0Yu2RIRmWOIoMWv80oYOSQz3ojssFpS5JNNNQAQZ8vpQQqrGr+nKUuhEmaOij7zCxfTiuzjcA9DZprcIoXX4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=RCORlsvf; arc=none smtp.client-ip=95.215.58.179 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="RCORlsvf" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1745016455; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=npWiVMwD7DIHq80EQckjRfNw4J8zE0gtQXoaeTokPy4=; b=RCORlsvflOWcJtCyPfT83iZnWA1EGtV3k29LbEQNybeQYvgFn9vAa5ObV/XdlG9NP0ZTWp r39rqn0EuencjeOM+sRwp6chUxH9ejB+FxZIYu0HdBHLG+eU9cnSQ+rnWcFQU7zr4lOK4L /SSfqeHBWvkyQsFda2jy5t7gHekMzTo= From: Martin KaFai Lau To: bpf@vger.kernel.org Cc: 'Alexei Starovoitov ' , 'Andrii Nakryiko ' , 'Daniel Borkmann ' , netdev@vger.kernel.org, kernel-team@meta.com, 'Amery Hung ' Subject: [RFC PATCH bpf-next 10/12] selftests/bpf: Add test for bpf_list_{front,back} Date: Fri, 18 Apr 2025 15:46:48 -0700 Message-ID: <20250418224652.105998-11-martin.lau@linux.dev> In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev> References: <20250418224652.105998-1-martin.lau@linux.dev> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-Migadu-Flow: FLOW_OUT X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC From: Martin KaFai Lau This patch adds a test for the new bpf_list_{front,back} kfunc. It also adds a test to ensure the non-owning node pointer cannot be used after unlock. 
Signed-off-by: Martin KaFai Lau
---
 .../selftests/bpf/prog_tests/linked_list.c    |   2 +
 .../selftests/bpf/progs/linked_list_peek.c    | 104 ++++++++++++++++++
 2 files changed, 106 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/linked_list_peek.c

diff --git a/tools/testing/selftests/bpf/prog_tests/linked_list.c b/tools/testing/selftests/bpf/prog_tests/linked_list.c
index 77d07e0a4a55..559f45239a83 100644
--- a/tools/testing/selftests/bpf/prog_tests/linked_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/linked_list.c
@@ -7,6 +7,7 @@
 
 #include "linked_list.skel.h"
 #include "linked_list_fail.skel.h"
+#include "linked_list_peek.skel.h"
 
 static char log_buf[1024 * 1024];
 
@@ -804,4 +805,5 @@ void test_linked_list(void)
 	test_linked_list_success(LIST_IN_LIST, false);
 	test_linked_list_success(LIST_IN_LIST, true);
 	test_linked_list_success(TEST_ALL, false);
+	RUN_TESTS(linked_list_peek);
 }
diff --git a/tools/testing/selftests/bpf/progs/linked_list_peek.c b/tools/testing/selftests/bpf/progs/linked_list_peek.c
new file mode 100644
index 000000000000..26c978091e5b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/linked_list_peek.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+struct node_data {
+	struct bpf_list_node l;
+	int key;
+};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+private(A) struct bpf_spin_lock glock;
+private(A) struct bpf_list_head ghead __contains(node_data, l);
+
+#define list_entry(ptr, type, member) container_of(ptr, type, member)
+#define NR_NODES 32
+
+int zero = 0;
+
+SEC("syscall")
+__failure __msg("invalid mem access 'scalar'")
+long list_peek_unlock_scalar_node(void *ctx)
+{
+	struct bpf_list_node *l_n;
+	struct node_data *n;
+
+	bpf_spin_lock(&glock);
+	l_n = bpf_list_front(&ghead);
+	bpf_spin_unlock(&glock);
+
+	if (l_n) {
+		n = list_entry(l_n, struct node_data, l);
+		if (n->key == 0)
+			return __LINE__;
+	}
+
+	return 0;
+}
+
+SEC("syscall")
+__retval(0)
+long list_peek(void *ctx)
+{
+	struct bpf_list_node *l_n;
+	struct node_data *n;
+	int i, err = 0;
+
+	bpf_spin_lock(&glock);
+	l_n = bpf_list_front(&ghead);
+	bpf_spin_unlock(&glock);
+	if (l_n)
+		return __LINE__;
+
+	bpf_spin_lock(&glock);
+	l_n = bpf_list_back(&ghead);
+	bpf_spin_unlock(&glock);
+	if (l_n)
+		return __LINE__;
+
+	for (i = zero; i < NR_NODES && can_loop; i++) {
+		n = bpf_obj_new(typeof(*n));
+		if (!n)
+			return __LINE__;
+		n->key = i;
+		bpf_spin_lock(&glock);
+		bpf_list_push_back(&ghead, &n->l);
+		bpf_spin_unlock(&glock);
+	}
+
+	bpf_spin_lock(&glock);
+
+	l_n = bpf_list_front(&ghead);
+	if (!l_n) {
+		err = __LINE__;
+		goto done;
+	}
+
+	n = list_entry(l_n, struct node_data, l);
+	if (n->key != 0) {
+		err = __LINE__;
+		goto done;
+	}
+
+	l_n = bpf_list_back(&ghead);
+	if (!l_n) {
+		err = __LINE__;
+		goto done;
+	}
+
+	n = list_entry(l_n, struct node_data, l);
+	if (n->key != NR_NODES - 1) {
+		err = __LINE__;
+		goto done;
+	}
+
+done:
+	bpf_spin_unlock(&glock);
+	return err;
+}
+
+char _license[] SEC("license") = "GPL";

From patchwork Fri Apr 18 22:46:49 2025
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ', 'Daniel Borkmann ', netdev@vger.kernel.org, kernel-team@meta.com, 'Amery Hung '
Subject: [RFC PATCH bpf-next 11/12] bpf: net: Add a qdisc kfunc to set sk_pacing_status.
Date: Fri, 18 Apr 2025 15:46:49 -0700
Message-ID: <20250418224652.105998-12-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>
X-Patchwork-State: RFC

From: Martin KaFai Lau

bpf_qdisc_set_sk_pacing() is added to set the sk_pacing_status. It tells the TCP stack that the qdisc will do the pacing, so the TCP stack does not need to do its own internal pacing. The kfunc is only allowed in the enqueue ops.
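For illustration (not part of the patch), a minimal sketch of an enqueue op calling the new kfunc; the program name is hypothetical, the kfunc declaration mirrors the patch below, and the flow-queueing details are elided:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define NET_XMIT_SUCCESS 0x00

int bpf_qdisc_set_sk_pacing(struct sk_buff *skb, enum sk_pacing pacing) __ksym;

SEC("struct_ops")
int BPF_PROG(sketch_enqueue, struct sk_buff *skb, struct Qdisc *sch,
	     struct bpf_sk_buff_ptr *to_free)
{
	/* Tell the TCP stack that this qdisc does the pacing, so TCP
	 * can skip its internal pacing. Per the patch below, the kfunc
	 * returns -EINVAL for listener/timewait sockets or for a
	 * pacing mode other than SK_PACING_FQ.
	 */
	bpf_qdisc_set_sk_pacing(skb, SK_PACING_FQ);

	/* ... classify the skb into a flow and queue it ... */
	return NET_XMIT_SUCCESS;
}

char _license[] SEC("license") = "GPL";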
Signed-off-by: Martin KaFai Lau
---
 net/sched/bpf_qdisc.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/sched/bpf_qdisc.c b/net/sched/bpf_qdisc.c
index 9f32b305636f..45776918efcf 100644
--- a/net/sched/bpf_qdisc.c
+++ b/net/sched/bpf_qdisc.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 
 #define QDISC_OP_IDX(op) (offsetof(struct Qdisc_ops, op) / sizeof(void (*)(void)))
 #define QDISC_MOFF_IDX(moff) (moff / sizeof(void (*)(void)))
@@ -214,6 +215,17 @@ __bpf_kfunc void bpf_qdisc_skb_drop(struct sk_buff *skb,
 	__qdisc_drop(skb, (struct sk_buff **)to_free_list);
 }
 
+__bpf_kfunc int bpf_qdisc_set_sk_pacing(struct sk_buff *skb, enum sk_pacing pacing)
+{
+	struct sock *sk = skb->sk;
+
+	if (!sk || sk_listener_or_tw(sk) || pacing != SK_PACING_FQ)
+		return -EINVAL;
+
+	smp_store_release(&sk->sk_pacing_status, pacing);
+	return 0;
+}
+
 /* bpf_qdisc_watchdog_schedule - Schedule a qdisc to a later time using a timer.
  * @sch: The qdisc to be scheduled.
  * @expire: The expiry time of the timer.
@@ -274,6 +286,7 @@ BTF_KFUNCS_START(qdisc_kfunc_ids)
 BTF_ID_FLAGS(func, bpf_skb_get_hash, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_kfree_skb, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_qdisc_skb_drop, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_qdisc_set_sk_pacing, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_dynptr_from_skb, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_qdisc_watchdog_schedule, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_qdisc_init_prologue, KF_TRUSTED_ARGS)
@@ -290,6 +303,7 @@ BTF_SET_END(qdisc_common_kfunc_set)
 BTF_SET_START(qdisc_enqueue_kfunc_set)
 BTF_ID(func, bpf_qdisc_skb_drop)
 BTF_ID(func, bpf_qdisc_watchdog_schedule)
+BTF_ID(func, bpf_qdisc_set_sk_pacing)
 BTF_SET_END(qdisc_enqueue_kfunc_set)
 
 BTF_SET_START(qdisc_dequeue_kfunc_set)

From patchwork Fri Apr 18 22:46:50 2025
From: Martin KaFai Lau
To: bpf@vger.kernel.org
Cc: 'Alexei Starovoitov ', 'Andrii Nakryiko ', 'Daniel Borkmann ', netdev@vger.kernel.org, kernel-team@meta.com, 'Amery Hung '
Subject: [RFC PATCH bpf-next 12/12] selftests/bpf: A bpf fq implementation similar to the kernel sch_fq
Date: Fri, 18 Apr 2025 15:46:50 -0700
Message-ID: <20250418224652.105998-13-martin.lau@linux.dev>
In-Reply-To: <20250418224652.105998-1-martin.lau@linux.dev>
References: <20250418224652.105998-1-martin.lau@linux.dev>
X-Patchwork-State: RFC

From: Martin KaFai Lau

This patch adds a fuller fq qdisc implementation that is comparable to the kernel fq implementation. The code is mostly borrowed from sch_fq.c before the WRR addition; WRR should be doable as a followup.

Some highlights:

* The current struct_ops does not support the qdisc_priv() concept. qdisc_priv() is the additional private data that the qdisc subsystem allocates at the end of a struct_ops object. This patch uses a map-in-map approach to emulate qdisc_priv() (see the sketch after this list). The outer map is an arraymap. When a qdisc instance starts, it grabs an available index (idx) in the ".init" ops. This idx is then the key to look up the outer arraymap, and the inner map serves as the qdisc_priv, which is the 'struct fq_sched_data'.

* Each qdisc instance has a hash table of rbtrees. This patch also uses map-in-map for this. The outer arraymap's key is the qdisc "idx". The inner map is the array of bpf_rb_root.

* With bpf_rbtree_{root,left,right} and bpf_list_{front,back}, the fq_classify/enqueue/dequeue paths should be more recognizable when compared with sch_fq.c, e.g. searching for a flow and doing gc.

* Most of the code deviation from sch_fq.c is because of the lock requirement and the refcount requirement.
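As a distilled illustration of the qdisc_priv() emulation described above (a sketch only; the map and helper names mirror the patch below, and 'struct fq_sched_data' is trimmed to a stand-in here):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define NR_QUEUES 4

/* Trimmed stand-in for the patch's 'struct fq_sched_data'. */
struct fq_sched_data {
	__u32 quantum;
	/* ... */
};

/* One private-data slot per qdisc instance, keyed by idx. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__type(key, __u32);
	__type(value, struct fq_sched_data);
	__uint(max_entries, NR_QUEUES);
} priv_data_array SEC(".maps");

/* Qdisc pointer value -> the idx it grabbed in the ".init" ops. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, unsigned long);
	__type(value, int);
	__uint(max_entries, NR_QUEUES);
} sch_to_idx_map SEC(".maps");

/* Emulated qdisc_priv(): map the Qdisc instance to its idx, then
 * use the idx to find the private data of this instance.
 */
static struct fq_sched_data *qdisc_priv(struct Qdisc *sch)
{
	unsigned long sch_long = (unsigned long)sch;
	int *idx;

	idx = bpf_map_lookup_elem(&sch_to_idx_map, &sch_long);
	if (!idx || *idx < 0 || *idx >= NR_QUEUES)
		return NULL;

	return bpf_map_lookup_elem(&priv_data_array, idx);
}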
veristat: File Program Verdict Duration (us) Insns States Program size Jited size ---------------- -------------- ------- ------------- ----- ------ ------------ ---------- bpf_sch_fq.bpf.o bpf_fq_dequeue success 43043 1367 119 531 2798 bpf_sch_fq.bpf.o bpf_fq_destroy success 12414 543 54 232 1350 bpf_sch_fq.bpf.o bpf_fq_enqueue success 91888 4750 335 695 3645 bpf_sch_fq.bpf.o bpf_fq_init success 7439 149 11 123 897 bpf_sch_fq.bpf.o bpf_fq_reset success 12553 541 53 198 1189 ---------------- -------------- ------- ------------- ----- ------ ------------ ---------- Signed-off-by: Martin KaFai Lau --- .../selftests/bpf/prog_tests/bpf_qdisc.c | 21 + .../selftests/bpf/progs/bpf_qdisc_common.h | 97 +- .../selftests/bpf/progs/bpf_qdisc_fq.c | 2 - .../testing/selftests/bpf/progs/bpf_sch_fq.c | 1171 +++++++++++++++++ .../selftests/bpf/progs/bpf_tracing_net.h | 1 + 5 files changed, 1289 insertions(+), 3 deletions(-) create mode 100644 tools/testing/selftests/bpf/progs/bpf_sch_fq.c diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c index c9a54177c84e..2955d88a35cc 100644 --- a/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c +++ b/tools/testing/selftests/bpf/prog_tests/bpf_qdisc.c @@ -7,6 +7,7 @@ #include "network_helpers.h" #include "bpf_qdisc_fifo.skel.h" #include "bpf_qdisc_fq.skel.h" +#include "bpf_sch_fq.skel.h" #define LO_IFINDEX 1 @@ -88,6 +89,24 @@ static void test_fq(void) bpf_qdisc_fq__destroy(fq_skel); } +static void test_sch_fq(void) +{ + struct bpf_sch_fq *skel; + + skel = bpf_sch_fq__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_sch_fq__open_and_load")) + return; + + skel->links.fq = bpf_map__attach_struct_ops(skel->maps.fq); + if (!ASSERT_OK_PTR(skel->links.fq, "bpf_map__attach_struct_ops")) + goto done; + + do_test("bpf_sch_fq"); + +done: + bpf_sch_fq__destroy(skel); +} + static void test_qdisc_attach_to_mq(void) { DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, @@ -171,6 +190,8 @@ void test_bpf_qdisc(void) test_fifo(); if (test__start_subtest("fq")) test_fq(); + if (test__start_subtest("sch_fq")) + test_sch_fq(); if (test__start_subtest("attach to mq")) test_qdisc_attach_to_mq(); if (test__start_subtest("attach to non root")) diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h index 65a2c561c0bb..94b4766d226b 100644 --- a/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_common.h @@ -3,6 +3,11 @@ #ifndef _BPF_QDISC_COMMON_H #define _BPF_QDISC_COMMON_H +#include "bpf_tracing_net.h" + +#define E2BIG 7 +#define EINVAL 22 + #define NET_XMIT_SUCCESS 0x00 #define NET_XMIT_DROP 0x01 /* skb dropped */ #define NET_XMIT_CN 0x02 /* congestion notification */ @@ -10,15 +15,25 @@ #define TC_PRIO_CONTROL 7 #define TC_PRIO_MAX 15 +#define MSEC_PER_SEC 1000L +#define NSEC_PER_SEC 1000000000L +#define NSEC_PER_USEC 1000L + +#define INT64_MAX (9223372036854775807L) +#define MAX_JIFFY_OFFSET ((INT64_MAX >> 1)-1) + #define private(name) SEC(".data." 
#name) __hidden __attribute__((aligned(8))) +extern unsigned long CONFIG_HZ __kconfig; +#define HZ CONFIG_HZ + u32 bpf_skb_get_hash(struct sk_buff *p) __ksym; void bpf_kfree_skb(struct sk_buff *p) __ksym; void bpf_qdisc_skb_drop(struct sk_buff *p, struct bpf_sk_buff_ptr *to_free) __ksym; void bpf_qdisc_watchdog_schedule(struct Qdisc *sch, u64 expire, u64 delta_ns) __ksym; void bpf_qdisc_bstats_update(struct Qdisc *sch, const struct sk_buff *skb) __ksym; -static struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb) +static inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb) { return (struct qdisc_skb_cb *)skb->cb; } @@ -28,4 +43,84 @@ static inline unsigned int qdisc_pkt_len(const struct sk_buff *skb) return qdisc_skb_cb(skb)->pkt_len; } +static inline unsigned long msecs_to_jiffies(const unsigned int m) +{ + /* + * ONLY work for + * HZ is equal to or smaller than 1000, and 1000 is a nice round + * multiple of HZ, divide with the factor between them, but round + * upwards: + */ + if (HZ <= MSEC_PER_SEC && MSEC_PER_SEC % HZ) + return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ); + else + return MAX_JIFFY_OFFSET; +} + +static inline unsigned int psched_mtu(const struct net_device *dev) +{ + return dev->mtu + dev->hard_header_len; +} + +static inline struct net_device *qdisc_dev(const struct Qdisc *qdisc) +{ + return qdisc->dev_queue->dev; +} + +#define time_after(a,b) ((long)((b) - (a)) < 0) +#define bpf_rb_entry(ptr, type, member) container_of(ptr, type, member) + +static inline void qdisc_qstats_backlog_dec(struct Qdisc *sch, + const struct sk_buff *skb) +{ + sch->qstats.backlog -= qdisc_pkt_len(skb); +} + +static inline void qdisc_qstats_backlog_inc(struct Qdisc *sch, + const struct sk_buff *skb) +{ + sch->qstats.backlog += qdisc_pkt_len(skb); +} + +static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch, + struct bpf_sk_buff_ptr *to_free) +{ + bpf_qdisc_skb_drop(skb, to_free); + + return NET_XMIT_DROP; +} + +static inline bool sk_listener_or_tw(const struct sock *sk) +{ + return (1 << sk->sk_state) & + (TCPF_LISTEN | TCPF_NEW_SYN_RECV | TCPF_TIME_WAIT); +} + +static inline bool sk_fullsock(const struct sock *sk) +{ + return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV); +} + +static inline bool sk_is_inet(const struct sock *sk) +{ + int family = sk->sk_family; + + return family == AF_INET || family == AF_INET6; +} + +static inline bool sk_is_tcp(const struct sock *sk) +{ + return sk_is_inet(sk) && + sk->sk_type == SOCK_STREAM && + sk->sk_protocol == IPPROTO_TCP; +} + +#define GOLDEN_RATIO_64 0x61C8864680B583EBull + +static inline u32 hash_ptr(u64 val, unsigned int bits) +{ + /* 64x64-bit multiply is efficient on all 64-bit processors */ + return val * GOLDEN_RATIO_64 >> (64 - bits); +} + #endif diff --git a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c index 7c110a156224..60683ad9c76f 100644 --- a/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c +++ b/tools/testing/selftests/bpf/progs/bpf_qdisc_fq.c @@ -643,8 +643,6 @@ static int fq_remove_flows_in_list(u32 index, void *ctx) return 0; } -extern unsigned CONFIG_HZ __kconfig; - /* limit number of collected flows per round */ #define FQ_GC_MAX 8 #define FQ_GC_AGE (3*CONFIG_HZ) diff --git a/tools/testing/selftests/bpf/progs/bpf_sch_fq.c b/tools/testing/selftests/bpf/progs/bpf_sch_fq.c new file mode 100644 index 000000000000..a57b90b54a96 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_sch_fq.c @@ -0,0 +1,1171 @@ +// 
SPDX-License-Identifier: GPL-2.0 + +#define fq_flow fq_flow_kern +#define fq_sched_data fq_sched_data_kern +#include +#undef fq_sched_data +#undef fq_flow +#include +#include +#include +#include "bpf_experimental.h" +#include "bpf_misc.h" +#include "bpf_qdisc_common.h" +#include "bpf_tracing_net.h" + +#define NR_QUEUES 4 + +#define FQ_TREE_LOG 15 +#define FQ_FLOW_TREE_ENTRIES (1 << FQ_TREE_LOG) /* 32k */ + +#define FQ_GC_MAX 8 +#define FQ_GC_AGE (3*HZ) + +#define likely(x) __builtin_expect(!!(x), 1) +#define unlikely(x) __builtin_expect(!!(x), 0) + +#define max_t(type, x, y) ({ \ + type __max1 = (x); \ + type __max2 = (y); \ + __max1 > __max2 ? __max1 : __max2; }) + +#define min_t(type, x, y) ({ \ + type __min1 = (x); \ + type __min2 = (y); \ + __min1 < __min2 ? __min1 : __min2; }) + +struct sock *dummy_sk; +struct sk_buff *dummy_skb; +int zero = 0; + +static inline struct fq_skb_cb *fq_skb_cb(struct sk_buff *skb) +{ + return (struct fq_skb_cb *)qdisc_skb_cb(skb)->data; +} + +struct skb_node { + /* cannot directly read time_to_send from skbn->skb, + * so duplicate the time_to_send here. + */ + u64 time_to_send; + struct sk_buff __kptr * skb; + struct bpf_list_node list_node; + struct bpf_rb_node rb_node; + struct bpf_refcount refcount; +}; + +struct fq_flow { + struct bpf_spin_lock lock; + struct bpf_rb_root skb_root __contains(skb_node, rb_node); + struct bpf_list_head skb_list __contains(skb_node, list_node); + struct bpf_rb_node fq_node; + u64 tail_time_to_send; + u64 stat_fastpath_packets; + unsigned long sk_long; + unsigned long age; + u32 socket_hash; + u32 qlen; + u64 time_next_packet; + int credit; + bool throttled; + struct bpf_list_node new_flow_node; + struct bpf_list_node old_flow_node; + struct bpf_rb_node rate_node; + struct bpf_refcount refcount; +}; + +struct fq_sched_data { + struct bpf_spin_lock lock; + struct bpf_list_head new_flows __contains(fq_flow, new_flow_node); + struct bpf_list_head old_flows __contains(fq_flow, old_flow_node); + struct bpf_rb_root delayed __contains(fq_flow, rate_node); /* for rate limited flows */ + + u64 time_next_delayed_flow; + unsigned long unthrottle_latency_ns; + + u32 quantum; + u32 initial_quantum; + u32 flow_refill_delay; + u32 flow_plimit; /* max packets per flow */ + unsigned long flow_max_rate; /* optional max rate per flow */ + u64 horizon; /* horizon in ns */ + u32 orphan_mask; /* mask for orphaned skb */ + u32 low_rate_threshold; + u8 rate_enable; + u8 fq_trees_log; + u8 horizon_drop; + u32 timer_slack; /* hrtimer slack in ns */ + + u32 flows; + u32 inactive_flows; /* Flows with no packet to send. 
*/ + u32 throttled_flows; + + u64 stat_throttled; + u64 stat_gc_flows; + + u64 stat_internal_packets; /* aka highprio */ + u64 stat_horizon_drops; + u64 stat_horizon_caps; + u64 stat_flows_plimit; + u64 stat_pkts_too_long; + u64 stat_allocation_errors; +}; + +struct fq_flow_root { + struct bpf_spin_lock lock; + struct bpf_rb_root root __contains(fq_flow, fq_node); +}; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, __u32); + __type(value, struct fq_sched_data); + __uint(max_entries, NR_QUEUES); +} priv_data_array SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, __u32); + __type(value, struct fq_flow); + __uint(max_entries, NR_QUEUES); +} fq_internal_array SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, unsigned long); + __type(value, int); + __uint(max_entries, NR_QUEUES); +} sch_to_idx_map SEC(".maps"); + +struct fq_flow_root_array { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, __u32); + __type(value, struct fq_flow_root); + __uint(max_entries, FQ_FLOW_TREE_ENTRIES); +} fq_flows0 SEC(".maps"), + fq_flows1 SEC(".maps"), + fq_flows2 SEC(".maps"), + fq_flows3 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); + __type(key, __u32); + __uint(max_entries, NR_QUEUES); + __array(values, struct fq_flow_root_array); +} fq_flow_roots SEC(".maps") = { + .values = { + [0] = &fq_flows0, + [1] = &fq_flows1, + [2] = &fq_flows2, + [3] = &fq_flows3, + } +}; + +#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8))) + +private(A) struct bpf_spin_lock lock; +private(A) __u64 used_idx[NR_QUEUES]; + +struct bpf_rb_node *bpf_rbtree_root(struct bpf_rb_root *root) __ksym; +struct bpf_rb_node *bpf_rbtree_left(struct bpf_rb_root *root, struct bpf_rb_node *node) __ksym; +struct bpf_rb_node *bpf_rbtree_right(struct bpf_rb_root *root, struct bpf_rb_node *node) __ksym; + +static int sch_idx(struct Qdisc *sch) +{ + unsigned long sch_long = (unsigned long)sch; + int *idx; + + idx = bpf_map_lookup_elem(&sch_to_idx_map, &sch_long); + if (!idx || *idx < 0 || *idx >= NR_QUEUES) + return -1; + + return *idx; +} + +static struct fq_sched_data *qdisc_priv(struct Qdisc *sch) +{ + int idx = sch_idx(sch); + + if (idx == -1) + return NULL; + + return bpf_map_lookup_elem(&priv_data_array, &idx); +} + +static struct fq_flow *fq_internal(struct Qdisc *sch) +{ + int idx = sch_idx(sch); + + if (idx == -1) + return NULL; + + return bpf_map_lookup_elem(&fq_internal_array, &idx); +} + +/* + * f->tail and f->age share the same location. + * We can use the low order bit to differentiate if this location points + * to a sk_buff or contains a jiffies value, if we force this value to be odd. 
+ * This assumes f->tail low order bit must be 0 since alignof(struct sk_buff) >= 2 + */ +static void fq_flow_set_detached(struct fq_flow *f, u64 jiffies_now) +{ + f->age = jiffies_now; +} + +static bool fq_flow_is_detached(const struct fq_flow *f) +{ + return !!f->age; +} + +static bool fq_flow_less(struct bpf_rb_node *a, const struct bpf_rb_node *b) +{ + struct fq_flow *f_a; + struct fq_flow *f_b; + + f_a = bpf_rb_entry(a, struct fq_flow, fq_node); + f_b = bpf_rb_entry(b, struct fq_flow, fq_node); + + return f_a->sk_long < f_b->sk_long; +} + +static bool fq_flow_is_throttled(const struct fq_flow *f) +{ + return f->throttled; +} + +static void fq_flow_unset_throttled(struct fq_sched_data *q, struct fq_flow *f) +{ + struct bpf_rb_node *rb_n; + + rb_n = bpf_rbtree_remove(&q->delayed, &f->rate_node); + if (rb_n) { + f = container_of(rb_n, struct fq_flow, rate_node); + q->throttled_flows--; + bpf_list_push_back(&q->old_flows, &f->old_flow_node); + } +} + +static bool fq_flow_delay_less(struct bpf_rb_node *a, const struct bpf_rb_node *b) +{ + struct fq_flow *f_a; + struct fq_flow *f_b; + + f_a = bpf_rb_entry(a, struct fq_flow, rate_node); + f_b = bpf_rb_entry(b, struct fq_flow, rate_node); + + return f_a->time_next_packet < f_b->time_next_packet; +} + +static void fq_flow_set_throttled(struct fq_sched_data *q, struct fq_flow *f) +{ + bpf_rbtree_add(&q->delayed, &f->rate_node, fq_flow_delay_less); + q->throttled_flows++; + q->stat_throttled++; + f->throttled = true; + if (q->time_next_delayed_flow > f->time_next_packet) + q->time_next_delayed_flow = f->time_next_packet; +} + +static bool fq_gc_candidate(const struct fq_flow *f, u64 jiffies_now) +{ + return fq_flow_is_detached(f) && + time_after(jiffies_now, f->age + FQ_GC_AGE); +} + +/* Fast path can be used if : + * 1) Packet tstamp is in the past. + * 2) FQ qlen == 0 OR + * (no flow is currently eligible for transmit, + * AND fast path queue has less than 8 packets) + * 3) No SO_MAX_PACING_RATE on the socket (if any). + * 4) No @maxrate attribute on this qdisc, + * + * FQ can not use generic TCQ_F_CAN_BYPASS infrastructure. + */ +static bool fq_fastpath_check(struct Qdisc *sch, struct sk_buff *skb, u64 now, + struct fq_flow *f_internal) +{ + const struct fq_sched_data *q = qdisc_priv(sch); + const struct sock *sk; + + /* impossible */ + if (!q) + return false; + + if (fq_skb_cb(skb)->time_to_send > now) + return false; + + if (sch->q.qlen != 0) { + /* Even if some packets are stored in this qdisc, + * we can still enable fast path if all of them are + * scheduled in the future (ie no flows are eligible) + * or in the fast path queue. + */ + if (q->flows != q->inactive_flows + q->throttled_flows) + return false; + + /* Do not allow fast path queue to explode, we want Fair Queue mode + * under pressure. 
+ */ + if (f_internal->qlen >= 8) + return false; + } + + sk = skb->sk; + if (sk && sk_fullsock(sk) && !sk_is_tcp(sk) && + sk->sk_max_pacing_rate != ~0UL) + return false; + + if (q->flow_max_rate != ~0UL) + return false; + + return true; +} + +static struct fq_flow *fq_classify(struct Qdisc *sch, struct sk_buff *skb, u64 jiffies_now, + bool *unthrottle) +{ + struct fq_sched_data *q = qdisc_priv(sch); + struct bpf_rb_node *p; + struct sock *sk = skb->sk; + struct fq_flow_root *root; + unsigned long sk_long; + struct fq_flow *f; + + __u32 idx = sch_idx(sch), socket_hash = 0, key; + struct bpf_rb_node *tofree[FQ_GC_MAX]; + struct bpf_map *fq_flows_bucket; + struct fq_flow *gc_f; + int i, fcnt = 0; + + /* impossible */ + if (!q || idx == -1) + return NULL; + + /* This force sk_long to be a scalar. Otherwise, + * reading skb->sk is a ptr_to_btf_id and the verifier + * does not allow to do ptr math that is needed in + * the hash_ptr() + */ + bpf_probe_read_kernel(&sk_long, sizeof(sk_long), &skb->sk); + + *unthrottle = false; + + /* SYNACK messages are attached to a TCP_NEW_SYN_RECV request socket + * or a listener (SYNCOOKIE mode) + * 1) request sockets are not full blown, + * they do not contain sk_pacing_rate + * 2) They are not part of a 'flow' yet + * 3) We do not want to rate limit them (eg SYNFLOOD attack), + * especially if the listener set SO_MAX_PACING_RATE + * 4) We pretend they are orphaned + */ + if (!sk || sk_listener_or_tw(sk)) { + unsigned long hash = bpf_skb_get_hash(skb) & q->orphan_mask; + + /* By forcing low order bit to 1, we make sure to not + * collide with a local flow (socket pointers are word aligned) + */ + sk_long = (hash << 1) | 1UL; + } else if (sk->sk_state == TCP_CLOSE) { + unsigned long hash = bpf_skb_get_hash(skb) & q->orphan_mask; + /* + * Sockets in TCP_CLOSE are non connected. + * Typical use case is UDP sockets, they can send packets + * with sendto() to many different destinations. + * We probably could use a generic bit advertising + * non connected sockets, instead of sk_state == TCP_CLOSE, + * if we care enough. + */ + sk_long = (hash << 1) | 1UL; + } else { + socket_hash = sk->sk_hash; + } + + fq_flows_bucket = bpf_map_lookup_elem(&fq_flow_roots, &idx); + if (!fq_flows_bucket) + return NULL; + + key = hash_ptr(sk_long, q->fq_trees_log); + root = bpf_map_lookup_elem(fq_flows_bucket, &key); + if (!root) + return NULL; + + f = NULL; + bpf_spin_lock(&root->lock); + p = bpf_rbtree_root(&root->root); + while (can_loop) { + if (!p) + break; + + gc_f = bpf_rb_entry(p, struct fq_flow, fq_node); + if (gc_f->sk_long == sk_long) { + f = bpf_refcount_acquire(gc_f); + break; + } + + if (fcnt < FQ_GC_MAX && fq_gc_candidate(gc_f, jiffies_now)) + tofree[fcnt++] = p; + + if (gc_f->sk_long > sk_long) + p = bpf_rbtree_left(&root->root, p); + else + p = bpf_rbtree_right(&root->root, p); + } + + for (i = 0; i < fcnt; i++) { + p = tofree[i]; + tofree[i] = bpf_rbtree_remove(&root->root, p); + } + + bpf_spin_unlock(&root->lock); + + for (i = 0; i < fcnt; i++) { + p = tofree[i]; + if (p) { + gc_f = bpf_rb_entry(p, struct fq_flow, fq_node); + bpf_obj_drop(gc_f); + } + } + + q->flows -= fcnt; + q->inactive_flows -= fcnt; + q->stat_gc_flows += fcnt; + + if (f) { + /* socket might have been reallocated, so check + * if its sk_hash is the same. 
+ * It not, we need to refill credit with + * initial quantum + */ + if (unlikely((unsigned long)skb->sk == sk_long && + f->socket_hash != socket_hash)) { + f->credit = q->initial_quantum; + f->socket_hash = socket_hash; + f->time_next_packet = 0ULL; + if (q->rate_enable) + bpf_qdisc_set_sk_pacing(skb, SK_PACING_FQ); + if (fq_flow_is_throttled(f)) + *unthrottle = true; + } + } else { + struct fq_flow *f_new; + + f_new = bpf_obj_new(typeof(*f_new)); + if (unlikely(!f_new)) { + q->stat_allocation_errors++; + return NULL; + } + + /* bpf mem allocator does not zero memory */ + f_new->tail_time_to_send = 0; + f_new->stat_fastpath_packets = 0; + f_new->sk_long = sk_long; + f_new->socket_hash = socket_hash; + f_new->qlen = 0; + f_new->time_next_packet = 0; + f_new->credit = q->initial_quantum; + f_new->throttled = 0; + + fq_flow_set_detached(f_new, jiffies_now); + f = bpf_refcount_acquire(f_new); + bpf_spin_lock(&root->lock); + bpf_rbtree_add(&root->root, &f_new->fq_node, fq_flow_less); + bpf_spin_unlock(&root->lock); + + q->flows++; + q->inactive_flows++; + if (q->rate_enable) + bpf_qdisc_set_sk_pacing(skb, SK_PACING_FQ); + } + + return f; +} + +static bool bpf_list_empty(struct bpf_list_head *bpf_head) +{ + struct list_head *head; + + head = bpf_core_cast(bpf_head, struct list_head); + return (!head->next || head->next == head); +} + +static struct skb_node *list_to_skbn(struct bpf_list_node *l_n) +{ + return l_n ? container_of(l_n, struct skb_node, list_node) : NULL; +} + +static struct skb_node *rb_to_skbn(struct bpf_rb_node *rb_n) +{ + return rb_n ? container_of(rb_n, struct skb_node, rb_node) : NULL; +} + +/* Remove one skb from flow queue. + * This skb must be the return value of prior fq_peek(). + */ +static struct skb_node *fq_dequeue_skbn(struct Qdisc *sch, struct fq_flow *flow, + struct skb_node *skbn, bool from_rb) +{ + struct bpf_list_node *l_n; + struct bpf_rb_node *rb_n; + + if (from_rb) { + rb_n = bpf_rbtree_remove(&flow->skb_root, &skbn->rb_node); + skbn = rb_n ? rb_to_skbn(rb_n) : NULL; + } else { + l_n = bpf_list_pop_front(&flow->skb_list); + skbn = l_n ? list_to_skbn(l_n) : NULL; + } + + return skbn; +} + +static struct skb_node *fq_internal_dequeue_skbn(struct Qdisc *sch, struct fq_flow *flow, + struct skb_node *skbn, bool from_rb) +{ + struct bpf_list_node *l_n; + struct bpf_rb_node *rb_n; + + if (from_rb) { + rb_n = bpf_rbtree_remove(&flow->skb_root, &skbn->rb_node); + skbn = rb_n ? rb_to_skbn(rb_n) : NULL; + } else { + l_n = bpf_list_pop_front(&flow->skb_list); + skbn = l_n ? list_to_skbn(l_n) : NULL; + } + + return skbn; +} + +static struct sk_buff *skbn_drop(struct Qdisc *sch, struct skb_node *skbn) +{ + struct sk_buff *skb; + + if (!skbn) + return NULL; + + skb = bpf_kptr_xchg(&skbn->skb, NULL); + bpf_obj_drop(skbn); + + if (skb) { + bpf_qdisc_bstats_update(sch, skb); + qdisc_qstats_backlog_dec(sch, skb); + sch->q.qlen--; + } + + return skb; +} + +static struct skb_node *fq_peek(struct fq_flow *flow, bool *from_rb) +{ + struct bpf_rb_node *rb_n = bpf_rbtree_first(&flow->skb_root); + struct bpf_list_node *l_n = bpf_list_front(&flow->skb_list); + struct skb_node *rb_skbn; + struct skb_node *l_skbn; + + if (!rb_n) { + *from_rb = false; + return l_n ? list_to_skbn(l_n) : NULL; + } + + if (!l_n) { + *from_rb = true; + return rb_n ? 
rb_to_skbn(rb_n) : NULL; + } + + l_skbn = list_to_skbn(l_n); + rb_skbn = rb_to_skbn(rb_n); + + if (rb_skbn->time_to_send < l_skbn->time_to_send) { + *from_rb = true; + return rb_skbn; + } else { + *from_rb = false; + return l_skbn; + } +} + +static bool skbn_less(struct bpf_rb_node *a, const struct bpf_rb_node *b) +{ + struct skb_node *skbn_a, *skbn_b; + + skbn_a = container_of(a, struct skb_node, rb_node); + skbn_b = container_of(b, struct skb_node, rb_node); + + return skbn_a->time_to_send < skbn_b->time_to_send; +} + +/* __always_inline. Otherwise, verifier rejects the "flow" related insn: + * "same insn cannot be used with different pointers" + * flow can be a map_value or a ptr-to-alloc-mem. + */ +static __always_inline void flow_queue_add(struct fq_flow *flow, struct skb_node *skbn) +{ + bool empty_list = bpf_list_empty(&flow->skb_list); + + bpf_spin_lock(&flow->lock); + if (!empty_list || skbn->time_to_send >= flow->tail_time_to_send) { + flow->tail_time_to_send = skbn->time_to_send; + bpf_list_push_back(&flow->skb_list, &skbn->list_node); + } else { + bpf_rbtree_add(&flow->skb_root, &skbn->rb_node, skbn_less); + } + bpf_spin_unlock(&flow->lock); + flow->age = 0; +} + +static bool fq_packet_beyond_horizon(const struct sk_buff *skb, + const struct fq_sched_data *q, u64 now) +{ + return unlikely((s64)skb->tstamp > (s64)(now + q->horizon)); +} + +/* inline because >5 arguments */ +static __always_inline +int queue_skb_and_drop_flow(struct Qdisc *sch, struct fq_sched_data *q, + struct fq_flow *f, struct sk_buff *skb, + struct bpf_sk_buff_ptr *to_free, + u64 jiffies_now, bool unthrottle) +{ + struct skb_node *skbn; + + skbn = bpf_obj_new(typeof(*skbn)); + if (!skbn) { + bpf_obj_drop(f); + return qdisc_drop(skb, sch, to_free); + } + + f->qlen++; + sch->q.qlen++; + + /* finish reading everything we need on skb before + * bpf_kptr_xchg. After xchg, neither skb nor + * skbn->skb is readable. + */ + skbn->time_to_send = fq_skb_cb(skb)->time_to_send; + qdisc_qstats_backlog_inc(sch, skb); + skb = bpf_kptr_xchg(&skbn->skb, skb); + if (skb) + qdisc_drop(skb, sch, to_free); + /* Note: this overwrites f->age */ + flow_queue_add(f, skbn); + + bpf_spin_lock(&q->lock); + if (fq_flow_is_detached(f)) { + struct fq_flow *f_dup; + + f_dup = bpf_refcount_acquire(f); + if (f_dup) + bpf_list_push_back(&q->new_flows, &f_dup->new_flow_node); + if (time_after(jiffies_now, f->age + q->flow_refill_delay)) + f->credit = max_t(u32, f->credit, q->quantum); + } else { + if (unthrottle) + fq_flow_unset_throttled(q, f); + } + bpf_spin_unlock(&q->lock); + bpf_obj_drop(f); + + return NET_XMIT_SUCCESS; +} + +static int queue_skb(struct Qdisc *sch, struct fq_sched_data *q, + struct fq_flow *f, struct sk_buff *skb, + struct bpf_sk_buff_ptr *to_free) +{ + struct skb_node *skbn; + + skbn = bpf_obj_new(typeof(*skbn)); + if (!skbn) + return qdisc_drop(skb, sch, to_free); + + f->qlen++; + sch->q.qlen++; + + /* finish reading everything we need on skb before + * bpf_kptr_xchg. After xchg, neither skb nor + * skbn->skb is readable. 
+ */ + skbn->time_to_send = fq_skb_cb(skb)->time_to_send; + qdisc_qstats_backlog_inc(sch, skb); + skb = bpf_kptr_xchg(&skbn->skb, skb); + if (skb) + qdisc_drop(skb, sch, to_free); + /* Note: this overwrites f->age */ + flow_queue_add(f, skbn); + + return NET_XMIT_SUCCESS; +} + +SEC("struct_ops") +int BPF_PROG(bpf_fq_enqueue, struct sk_buff *skb, struct Qdisc *sch, + struct bpf_sk_buff_ptr *to_free) +{ + struct fq_flow *f, *f_internal = fq_internal(sch); + struct fq_sched_data *q = qdisc_priv(sch); + bool unthrottle; + u64 now, jiffies_now; + + if (!q || !f_internal) + return qdisc_drop(skb, sch, to_free); + + if (unlikely(sch->q.qlen >= sch->limit)) + return qdisc_drop(skb, sch, to_free); + + now = bpf_ktime_get_ns(); + jiffies_now = bpf_jiffies64(); + + if (!skb->tstamp) { + fq_skb_cb(skb)->time_to_send = now; + } else { + /* Check if packet timestamp is too far in the future. */ + if (fq_packet_beyond_horizon(skb, q, now)) { + if (q->horizon_drop) { + q->stat_horizon_drops++; + return qdisc_drop(skb, sch, to_free); + } + q->stat_horizon_caps++; + skb->tstamp = now + q->horizon; + } + fq_skb_cb(skb)->time_to_send = skb->tstamp; + } + + /* warning: no starvation prevention... */ + if (unlikely((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL) || + fq_fastpath_check(sch, skb, now, f_internal)) { + q->stat_internal_packets++; + return queue_skb(sch, q, f_internal, skb, to_free); + } + + f = fq_classify(sch, skb, jiffies_now, &unthrottle); + + if (!f) + return qdisc_drop(skb, sch, to_free); + + if (unlikely(f->qlen >= q->flow_plimit)) { + q->stat_flows_plimit++; + bpf_obj_drop(f); + return qdisc_drop(skb, sch, to_free); + } + + return queue_skb_and_drop_flow(sch, q, f, skb, to_free, jiffies_now, unthrottle); +} + +static void fq_check_throttled(struct fq_sched_data *q, u64 now) +{ + struct bpf_rb_node *rb_n; + unsigned long sample; + struct fq_flow *f; + + if (q->time_next_delayed_flow > now) + return; + + /* Update unthrottle latency EWMA. + * This is cheap and can help diagnosing timer/latency problems. 
+ */ + sample = (unsigned long)(now - q->time_next_delayed_flow); + q->unthrottle_latency_ns -= q->unthrottle_latency_ns >> 3; + q->unthrottle_latency_ns += sample >> 3; + + q->time_next_delayed_flow = ~0ULL; + while (can_loop) { + rb_n = bpf_rbtree_first(&q->delayed); + if (!rb_n) + break; + + f = container_of(rb_n, struct fq_flow, rate_node); + if (f->time_next_packet > now) { + q->time_next_delayed_flow = f->time_next_packet; + break; + } + fq_flow_unset_throttled(q, f); + } +} + +struct fq_flow *fq_flow_peek(struct fq_sched_data *q, bool *new) +{ + struct bpf_list_node *l_n; + + *new = false; + l_n = bpf_list_front(&q->new_flows); + if (l_n) { + *new = true; + return container_of(l_n, struct fq_flow, new_flow_node); + } + + l_n = bpf_list_front(&q->old_flows); + if (l_n) { + *new = false; + return container_of(l_n, struct fq_flow, old_flow_node); + } + + return NULL; +} + +struct fq_flow *fq_flow_pop_front(struct fq_sched_data *q, bool new) +{ + struct bpf_list_node *l_n; + struct fq_flow *f = NULL; + + if (new) { + l_n = bpf_list_pop_front(&q->new_flows); + if (l_n) + f = container_of(l_n, struct fq_flow, new_flow_node); + } else { + l_n = bpf_list_pop_front(&q->old_flows); + if (l_n) + f = container_of(l_n, struct fq_flow, old_flow_node); + } + + return f; +} + +SEC("struct_ops/bpf_fq_dequeue") +struct sk_buff *BPF_PROG(bpf_fq_dequeue, struct Qdisc *sch) +{ + struct fq_flow *f = NULL, *f_internal = fq_internal(sch); + struct fq_sched_data *q = qdisc_priv(sch); + u64 time_next_packet, now, jiffies_now; + struct sk_buff *skb = NULL; + struct skb_node *skbn = NULL; + unsigned long rate; + bool from_rb, new; + u32 plen; + + if (!q || !f_internal) + return NULL; + + if (!sch->q.qlen) + return NULL; + + now = bpf_ktime_get_ns(); + jiffies_now = bpf_jiffies64(); + + if (unlikely(f_internal->qlen)) { + bpf_spin_lock(&f_internal->lock); + skbn = fq_peek(f_internal, &from_rb); + if (skbn) { + skbn = fq_internal_dequeue_skbn(sch, f_internal, skbn, from_rb); + bpf_spin_unlock(&f_internal->lock); + skb = skbn_drop(sch, skbn); + f_internal->qlen--; + return skb; + } + bpf_spin_unlock(&f_internal->lock); + } + + bpf_spin_lock(&q->lock); + fq_check_throttled(q, now); + bpf_spin_unlock(&q->lock); + + while (can_loop) { + bpf_spin_lock(&q->lock); + f = fq_flow_peek(q, &new); + if (!f) { + bpf_spin_unlock(&q->lock); + if (q->time_next_delayed_flow != ~0ULL) + bpf_qdisc_watchdog_schedule(sch, q->time_next_delayed_flow, + q->timer_slack); + return NULL; + } + + if (f->credit <= 0) { + f->credit += q->quantum; + f = fq_flow_pop_front(q, new); + if (f) { + bpf_list_push_back(&q->old_flows, &f->old_flow_node); + f = NULL; + } + bpf_spin_unlock(&q->lock); + continue; + } + + f = bpf_refcount_acquire(f); + bpf_spin_unlock(&q->lock); + + if (!f) + continue; + + bpf_spin_lock(&f->lock); + skbn = fq_peek(f, &from_rb); + if (skbn) { + time_next_packet = max_t(u64, skbn->time_to_send, f->time_next_packet); + + if (now < time_next_packet) { + bpf_spin_unlock(&f->lock); + bpf_obj_drop(f); + + bpf_spin_lock(&q->lock); + f = fq_flow_pop_front(q, new); + if (f) { + f->time_next_packet = time_next_packet; + fq_flow_set_throttled(q, f); + f = NULL; + } + bpf_spin_unlock(&q->lock); + continue; + } + skbn = fq_dequeue_skbn(sch, f, skbn, from_rb); + bpf_spin_unlock(&f->lock); + break; + } else { + bpf_spin_unlock(&f->lock); + bpf_obj_drop(f); + + bpf_spin_lock(&q->lock); + f = fq_flow_pop_front(q, new); + if (f) { + /* force a pass through old_flows to prevent starvation */ + if (new && bpf_list_front(&q->old_flows)) { + 
bpf_list_push_back(&q->old_flows, &f->old_flow_node); + f = NULL; + } else { + fq_flow_set_detached(f, jiffies_now); + } + } + bpf_spin_unlock(&q->lock); + if (f) { + bpf_obj_drop(f); + f = NULL; + } + } + } + + if (!f) + return NULL; + + skb = skbn_drop(sch, skbn);; + if (!skb) { + bpf_obj_drop(f); + return NULL; + } + + if (--f->qlen == 0) + q->inactive_flows++; + + plen = qdisc_pkt_len(skb); + f->credit -= plen; + + if (!q->rate_enable) + goto out; + + rate = q->flow_max_rate; + + /* If EDT time was provided for this skb, we need to + * update f->time_next_packet only if this qdisc enforces + * a flow max rate. + */ + if (!skb->tstamp) { + struct sock *sk = skb->sk; + + if (sk && !sk_listener_or_tw(sk)) + rate = min_t(unsigned long, sk->sk_pacing_rate, rate); + + if (rate <= q->low_rate_threshold) { + f->credit = 0; + } else { + plen = max_t(u32, plen, q->quantum); + if (f->credit > 0) + goto out; + } + } + if (rate != ~0UL) { + u64 len = (u64)plen * NSEC_PER_SEC; + + if (likely(rate)) + len = len / rate; + /* Since socket rate can change later, + * clamp the delay to 1 second. + * Really, providers of too big packets should be fixed ! + */ + if (unlikely(len > NSEC_PER_SEC)) { + len = NSEC_PER_SEC; + q->stat_pkts_too_long++; + } + /* Account for schedule/timers drifts. + * f->time_next_packet was set when prior packet was sent, + * and current time (@now) can be too late by tens of us. + */ + if (f->time_next_packet) + len -= min_t(u64, len/2, now - f->time_next_packet); + f->time_next_packet = now + len; + } + +out: + bpf_obj_drop(f); + bpf_qdisc_bstats_update(sch, skb); + return skb; +} + +static int fq_init(struct Qdisc *sch, struct fq_sched_data *q) +{ + sch->limit = 10000; + q->flow_plimit = 100; + q->quantum = 2 * psched_mtu(qdisc_dev(sch)); + q->initial_quantum = 10 * psched_mtu(qdisc_dev(sch)); + q->flow_refill_delay = msecs_to_jiffies(40); + q->flow_max_rate = ~0UL; + q->time_next_delayed_flow = ~0ULL; + q->rate_enable = 1; + q->fq_trees_log = FQ_TREE_LOG; + q->orphan_mask = 1024 - 1; + q->low_rate_threshold = 550000 / 8; + q->timer_slack = 10 * NSEC_PER_USEC; /* 10 usec of hrtimer slack */ + q->horizon = 10ULL * NSEC_PER_SEC; /* 10 seconds */ + q->horizon_drop = 1; /* by default, drop packets beyond horizon */ + return 0; +} + +SEC("struct_ops") +int BPF_PROG(bpf_fq_init, struct Qdisc *sch, struct nlattr *opt, + struct netlink_ext_ack *extack) +{ + unsigned long sch_long = (unsigned long)sch; + struct fq_sched_data *q; + int idx, err; + + bpf_spin_lock(&lock); + for (idx = 0; idx < NR_QUEUES; idx++) { + if (!used_idx[idx]) { + used_idx[idx] = 1; + break; + } + } + bpf_spin_unlock(&lock); + + if (idx == NR_QUEUES) + return -E2BIG; + + err = bpf_map_update_elem(&sch_to_idx_map, &sch_long, &idx, 0); + if (err) + return err; + + q = qdisc_priv(sch); + if (!q) + return -EINVAL; + + return fq_init(sch, q); +} + +static void fq_flow_purge(struct fq_flow *f) +{ + struct bpf_list_node *l_n; + struct bpf_rb_node *rb_n; + struct skb_node *skbn; + + while (can_loop) { + bpf_spin_lock(&f->lock); + rb_n = bpf_rbtree_first(&f->skb_root); + if (rb_n) + rb_n = bpf_rbtree_remove(&f->skb_root, rb_n); + l_n = bpf_list_pop_front(&f->skb_list); + bpf_spin_unlock(&f->lock); + skbn = NULL; + if (rb_n) { + skbn = rb_to_skbn(rb_n); + bpf_obj_drop(skbn); + } + if (l_n) { + skbn = list_to_skbn(l_n); + bpf_obj_drop(skbn); + } + if (!skbn) + break; + } +} + +SEC("struct_ops") +void BPF_PROG(bpf_fq_reset, struct Qdisc *sch) +{ + struct fq_flow *f, *f_internal = fq_internal(sch); + struct fq_sched_data *q = 
qdisc_priv(sch); + struct bpf_map *fq_flows_bucket; + struct bpf_list_node *l_n0, *l_n1; + struct bpf_rb_node *rb_n; + struct fq_flow_root *root; + int i, idx = sch_idx(sch); + + if (!q || !f_internal || idx == -1) + return; + + fq_flows_bucket = bpf_map_lookup_elem(&fq_flow_roots, &idx); + if (!fq_flows_bucket) + return; + + fq_flow_purge(f_internal); + + while (can_loop) { + bpf_spin_lock(&q->lock); + l_n0 = bpf_list_pop_front(&q->new_flows); + l_n1 = bpf_list_pop_front(&q->old_flows); + rb_n = bpf_rbtree_first(&q->delayed); + if (rb_n) + rb_n = bpf_rbtree_remove(&q->delayed, rb_n); + bpf_spin_unlock(&q->lock); + + f = NULL; + if (l_n0) { + f = container_of(l_n0, struct fq_flow, new_flow_node); + bpf_obj_drop(f); + } + if (l_n1) { + f = container_of(l_n1, struct fq_flow, old_flow_node); + bpf_obj_drop(f); + } + if (rb_n) { + f = container_of(rb_n, struct fq_flow, rate_node); + bpf_obj_drop(f); + } + if (!f) + break; + } + + for (i = zero; i < FQ_FLOW_TREE_ENTRIES && can_loop; i++) { + root = bpf_map_lookup_elem(fq_flows_bucket, &i); + if (!root) + break; + bpf_spin_lock(&root->lock); + rb_n = bpf_rbtree_first(&root->root); + if (rb_n) + rb_n = bpf_rbtree_remove(&root->root, rb_n); + bpf_spin_unlock(&root->lock); + + if (rb_n) { + f = container_of(rb_n, struct fq_flow, fq_node); + /* this should be the final refcount of the flow and + * this drop will flush all the queued skb. + */ + bpf_obj_drop(f); + } + } +} + +SEC("struct_ops") +void BPF_PROG(bpf_fq_destroy, struct Qdisc *sch) +{ + unsigned long sch_long = (unsigned long)sch; + int idx = sch_idx(sch); + + if (idx == -1) + return; + + ____bpf_fq_reset(ctx, sch); + + bpf_map_delete_elem(&sch_to_idx_map, &sch_long); + bpf_spin_lock(&lock); + used_idx[idx] = 0; + bpf_spin_unlock(&lock); +} + +SEC(".struct_ops.link") +struct Qdisc_ops fq = { + .enqueue = (void *)bpf_fq_enqueue, + .dequeue = (void *)bpf_fq_dequeue, + .init = (void *)bpf_fq_init, + .reset = (void *)bpf_fq_reset, + .destroy = (void *)bpf_fq_destroy, + .id = "bpf_sch_fq", +}; + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h index 659694162739..557657c92652 100644 --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h @@ -121,6 +121,7 @@ #define ir_v6_rmt_addr req.__req_common.skc_v6_daddr #define ir_v6_loc_addr req.__req_common.skc_v6_rcv_saddr +#define sk_hash __sk_common.skc_hash #define sk_num __sk_common.skc_num #define sk_dport __sk_common.skc_dport #define sk_family __sk_common.skc_family
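For completeness, a minimal sketch of the userspace side, mirroring the test_sch_fq() helper added earlier in this patch: load the skeleton and attach the struct_ops map, after which the qdisc can be instantiated by its .id "bpf_sch_fq" (e.g. via tc) while the link stays alive.

#include "bpf_sch_fq.skel.h"

static int attach_bpf_fq(void)
{
	struct bpf_sch_fq *skel;

	skel = bpf_sch_fq__open_and_load();
	if (!skel)
		return -1;

	/* Registers the Qdisc_ops defined in bpf_sch_fq.c. */
	skel->links.fq = bpf_map__attach_struct_ops(skel->maps.fq);
	if (!skel->links.fq) {
		bpf_sch_fq__destroy(skel);
		return -1;
	}
	return 0;
}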