From patchwork Wed Oct 5 17:17:05 2022
X-Patchwork-Submitter: Cong Wang
X-Patchwork-Id: 12999507
X-Patchwork-Delegate: bpf@iogearbox.net
From: Cong Wang
To: netdev@vger.kernel.org
Cc: yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, bpf@vger.kernel.org, sdf@google.com, Cong Wang
Subject: [RFC Patch v6 1/5] bpf: Introduce rbtree map
Date: Wed, 5 Oct 2022 10:17:05 -0700
Message-Id: <20221005171709.150520-2-xiyou.wangcong@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221005171709.150520-1-xiyou.wangcong@gmail.com>
References: <20221005171709.150520-1-xiyou.wangcong@gmail.com>
MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC From: Cong Wang Insert: bpf_map_update(&map, &key, &val, flag); Delete a specific key-val pair: bpf_map_delete_elem(&map, &key); Pop the minimum one: bpf_map_pop(&map, &val); Lookup: val = bpf_map_lookup_elem(&map, &key); Iterator: bpf_for_each_map_elem(&map, callback, key, val); Signed-off-by: Cong Wang --- include/linux/bpf_types.h | 1 + include/uapi/linux/bpf.h | 1 + kernel/bpf/Makefile | 2 +- kernel/bpf/rbtree.c | 445 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 448 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/rbtree.c diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index 2c6a4f2562a7..c53ba6de1613 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -127,6 +127,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE, rbtree_map_ops) BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint) BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 51b9aa640ad2..9492cd3af701 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -935,6 +935,7 @@ enum bpf_map_type { BPF_MAP_TYPE_TASK_STORAGE, BPF_MAP_TYPE_BLOOM_FILTER, BPF_MAP_TYPE_USER_RINGBUF, + BPF_MAP_TYPE_RBTREE, }; /* Note that tracing related programs such as diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 341c94f208f4..e60249258c74 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -7,7 +7,7 @@ endif CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy) obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o rbtree.o obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c new file mode 100644 index 000000000000..f1a9b1c40b8b --- /dev/null +++ b/kernel/bpf/rbtree.c @@ -0,0 +1,445 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * rbtree.c: eBPF rbtree map + * + * Copyright (C) 2022, ByteDance, Cong Wang + */ +#include +#include +#include +#include +#include +#include +#include +#include + +#define RBTREE_CREATE_FLAG_MASK \ + (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK) + +/* each rbtree element is struct rbtree_elem + key + value */ +struct rbtree_elem { + struct rb_node rbnode; + char key[] __aligned(8); +}; + +struct rbtree_map { + struct bpf_map map; + struct bpf_mem_alloc ma; + raw_spinlock_t lock; + struct rb_root root; + atomic_t nr_entries; +}; + +#define rb_to_elem(rb) rb_entry_safe(rb, struct rbtree_elem, rbnode) +#define elem_rb_first(root) rb_to_elem(rb_first(root)) +#define elem_rb_last(root) rb_to_elem(rb_last(root)) +#define elem_rb_next(e) rb_to_elem(rb_next(&(e)->rbnode)) +#define rbtree_walk_safe(e, tmp, root) \ + for (e = elem_rb_first(root); \ + tmp = e ? 
elem_rb_next(e) : NULL, (e != NULL); \ + e = tmp) + +static struct rbtree_map *rbtree_map(struct bpf_map *map) +{ + return container_of(map, struct rbtree_map, map); +} + +/* Called from syscall */ +static int rbtree_map_alloc_check(union bpf_attr *attr) +{ + if (!bpf_capable()) + return -EPERM; + + /* check sanity of attributes */ + if (attr->max_entries == 0 || + attr->map_flags & ~RBTREE_CREATE_FLAG_MASK || + !bpf_map_flags_access_ok(attr->map_flags)) + return -EINVAL; + + if (attr->value_size > KMALLOC_MAX_SIZE) + /* if value_size is bigger, the user space won't be able to + * access the elements. + */ + return -E2BIG; + + return 0; +} + +static struct bpf_map *rbtree_map_alloc(union bpf_attr *attr) +{ + int numa_node = bpf_map_attr_numa_node(attr); + struct rbtree_map *rb; + u32 elem_size; + int err; + + rb = bpf_map_area_alloc(sizeof(*rb), numa_node); + if (!rb) + return ERR_PTR(-ENOMEM); + + memset(rb, 0, sizeof(*rb)); + bpf_map_init_from_attr(&rb->map, attr); + raw_spin_lock_init(&rb->lock); + rb->root = RB_ROOT; + atomic_set(&rb->nr_entries, 0); + + elem_size = sizeof(struct rbtree_elem) + + round_up(rb->map.key_size, 8); + elem_size += round_up(rb->map.value_size, 8); + err = bpf_mem_alloc_init(&rb->ma, elem_size, false); + if (err) { + bpf_map_area_free(rb); + return ERR_PTR(err); + } + return &rb->map; +} + +static void check_and_free_fields(struct rbtree_map *rb, + struct rbtree_elem *elem) +{ + void *map_value = elem->key + round_up(rb->map.key_size, 8); + + if (map_value_has_kptrs(&rb->map)) + bpf_map_free_kptrs(&rb->map, map_value); +} + +static void rbtree_map_purge(struct bpf_map *map) +{ + struct rbtree_map *rb = rbtree_map(map); + struct rbtree_elem *e, *tmp; + + rbtree_walk_safe(e, tmp, &rb->root) { + rb_erase(&e->rbnode, &rb->root); + check_and_free_fields(rb, e); + bpf_mem_cache_free(&rb->ma, e); + } +} + +/* Called when map->refcnt goes to zero, either from workqueue or from syscall */ +static void rbtree_map_free(struct bpf_map *map) +{ + struct rbtree_map *rb = rbtree_map(map); + unsigned long flags; + + raw_spin_lock_irqsave(&rb->lock, flags); + rbtree_map_purge(map); + raw_spin_unlock_irqrestore(&rb->lock, flags); + bpf_mem_alloc_destroy(&rb->ma); + bpf_map_area_free(rb); +} + +static struct rbtree_elem *bpf_rbtree_find(struct rb_root *root, void *key, int size) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct rbtree_elem *e; + + while (*p) { + int ret; + + parent = *p; + e = rb_to_elem(parent); + ret = memcmp(key, e->key, size); + if (ret < 0) + p = &parent->rb_left; + else if (ret > 0) + p = &parent->rb_right; + else + return e; + } + return NULL; +} + +/* Called from eBPF program or syscall */ +static void *rbtree_map_lookup_elem(struct bpf_map *map, void *key) +{ + struct rbtree_map *rb = rbtree_map(map); + struct rbtree_elem *e; + + e = bpf_rbtree_find(&rb->root, key, rb->map.key_size); + if (!e) + return NULL; + return e->key + round_up(rb->map.key_size, 8); +} + +static int check_flags(struct rbtree_elem *old, u64 map_flags) +{ + if (old && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST) + /* elem already exists */ + return -EEXIST; + + if (!old && (map_flags & ~BPF_F_LOCK) == BPF_EXIST) + /* elem doesn't exist, cannot update it */ + return -ENOENT; + + return 0; +} + +static void rbtree_map_insert(struct rbtree_map *rb, struct rbtree_elem *e) +{ + struct rb_root *root = &rb->root; + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct rbtree_elem *e1; + + while (*p) { + parent = *p; + e1 = 
rb_to_elem(parent); + if (memcmp(e->key, e1->key, rb->map.key_size) < 0) + p = &parent->rb_left; + else + p = &parent->rb_right; + } + rb_link_node(&e->rbnode, parent, p); + rb_insert_color(&e->rbnode, root); +} + +/* Called from syscall or from eBPF program */ +static int rbtree_map_update_elem(struct bpf_map *map, void *key, void *value, + u64 map_flags) +{ + struct rbtree_map *rb = rbtree_map(map); + void *val = rbtree_map_lookup_elem(map, key); + int ret; + + ret = check_flags(val, map_flags); + if (ret) + return ret; + + if (!val) { + struct rbtree_elem *e_new; + unsigned long flags; + + e_new = bpf_mem_cache_alloc(&rb->ma); + if (!e_new) + return -ENOMEM; + val = e_new->key + round_up(rb->map.key_size, 8); + check_and_init_map_value(&rb->map, val); + memcpy(e_new->key, key, rb->map.key_size); + raw_spin_lock_irqsave(&rb->lock, flags); + rbtree_map_insert(rb, e_new); + raw_spin_unlock_irqrestore(&rb->lock, flags); + atomic_inc(&rb->nr_entries); + } + + if (map_flags & BPF_F_LOCK) + copy_map_value_locked(map, val, value, false); + else + copy_map_value(map, val, value); + return 0; +} + +/* Called from syscall or from eBPF program */ +static int rbtree_map_delete_elem(struct bpf_map *map, void *key) +{ + struct rbtree_map *rb = rbtree_map(map); + struct rbtree_elem *e; + unsigned long flags; + + raw_spin_lock_irqsave(&rb->lock, flags); + e = bpf_rbtree_find(&rb->root, key, rb->map.key_size); + if (!e) { + raw_spin_unlock_irqrestore(&rb->lock, flags); + return -ENOENT; + } + rb_erase(&e->rbnode, &rb->root); + raw_spin_unlock_irqrestore(&rb->lock, flags); + check_and_free_fields(rb, e); + bpf_mem_cache_free(&rb->ma, e); + atomic_dec(&rb->nr_entries); + return 0; +} + +/* Called from syscall or from eBPF program */ +static int rbtree_map_pop_elem(struct bpf_map *map, void *value) +{ + struct rbtree_map *rb = rbtree_map(map); + struct rbtree_elem *e = elem_rb_first(&rb->root); + unsigned long flags; + void *val; + + if (!e) + return -ENOENT; + raw_spin_lock_irqsave(&rb->lock, flags); + rb_erase(&e->rbnode, &rb->root); + raw_spin_unlock_irqrestore(&rb->lock, flags); + val = e->key + round_up(rb->map.key_size, 8); + copy_map_value(map, value, val); + check_and_free_fields(rb, e); + bpf_mem_cache_free(&rb->ma, e); + atomic_dec(&rb->nr_entries); + return 0; +} + +/* Called from syscall */ +static int rbtree_map_get_next_key(struct bpf_map *map, void *key, void *next_key) +{ + struct rbtree_map *rb = rbtree_map(map); + struct rbtree_elem *e; + + if (!key) { + e = elem_rb_first(&rb->root); + if (!e) + return -ENOENT; + goto found; + } + e = bpf_rbtree_find(&rb->root, key, rb->map.key_size); + if (!e) + return -ENOENT; + e = elem_rb_next(e); + if (!e) + return 0; +found: + memcpy(next_key, e->key, map->key_size); + return 0; +} + +static int bpf_for_each_rbtree_map(struct bpf_map *map, + bpf_callback_t callback_fn, + void *callback_ctx, u64 flags) +{ + struct rbtree_map *rb = rbtree_map(map); + struct rbtree_elem *e, *tmp; + void *key, *value; + u32 num_elems = 0; + u64 ret = 0; + + if (flags != 0) + return -EINVAL; + + rbtree_walk_safe(e, tmp, &rb->root) { + num_elems++; + key = e->key; + value = key + round_up(rb->map.key_size, 8); + ret = callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, + (u64)(long)callback_ctx, 0); + /* return value: 0 - continue, 1 - stop and return */ + if (ret) + break; + } + + return num_elems; +} + +struct rbtree_map_seq_info { + struct bpf_map *map; + struct rbtree_map *rb; +}; + +static void *rbtree_map_seq_find_next(struct rbtree_map_seq_info *info, + 
struct rbtree_elem *prev_elem) +{ + const struct rbtree_map *rb = info->rb; + struct rbtree_elem *elem; + + /* try to find next elem in the same bucket */ + if (prev_elem) { + elem = elem_rb_next(prev_elem); + if (elem) + return elem; + return NULL; + } + + return elem_rb_first(&rb->root); +} + +static void *rbtree_map_seq_start(struct seq_file *seq, loff_t *pos) +{ + struct rbtree_map_seq_info *info = seq->private; + + if (*pos == 0) + ++*pos; + + /* pairs with rbtree_map_seq_stop */ + rcu_read_lock(); + return rbtree_map_seq_find_next(info, NULL); +} + +static void *rbtree_map_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + struct rbtree_map_seq_info *info = seq->private; + + ++*pos; + return rbtree_map_seq_find_next(info, v); +} + +static int rbtree_map_seq_show(struct seq_file *seq, void *v) +{ + struct rbtree_map_seq_info *info = seq->private; + struct bpf_iter__bpf_map_elem ctx = {}; + struct rbtree_elem *elem = v; + struct bpf_iter_meta meta; + struct bpf_prog *prog; + + meta.seq = seq; + prog = bpf_iter_get_info(&meta, !elem); + if (!prog) + return 0; + + ctx.meta = &meta; + ctx.map = info->map; + if (elem) { + ctx.key = elem->key; + ctx.value = elem->key + round_up(info->map->key_size, 8); + } + + return bpf_iter_run_prog(prog, &ctx); +} + +static void rbtree_map_seq_stop(struct seq_file *seq, void *v) +{ + if (!v) + (void)rbtree_map_seq_show(seq, NULL); + + /* pairs with rbtree_map_seq_start */ + rcu_read_unlock(); +} + +static const struct seq_operations rbtree_map_seq_ops = { + .start = rbtree_map_seq_start, + .next = rbtree_map_seq_next, + .stop = rbtree_map_seq_stop, + .show = rbtree_map_seq_show, +}; + +static int rbtree_map_init_seq_private(void *priv_data, + struct bpf_iter_aux_info *aux) +{ + struct rbtree_map_seq_info *info = priv_data; + + bpf_map_inc_with_uref(aux->map); + info->map = aux->map; + info->rb = rbtree_map(info->map); + return 0; +} + +static void rbtree_map_fini_seq_private(void *priv_data) +{ + struct rbtree_map_seq_info *info = priv_data; + + bpf_map_put_with_uref(info->map); +} + +static const struct bpf_iter_seq_info rbtree_map_iter_seq_info = { + .seq_ops = &rbtree_map_seq_ops, + .init_seq_private = rbtree_map_init_seq_private, + .fini_seq_private = rbtree_map_fini_seq_private, + .seq_priv_size = sizeof(struct rbtree_map_seq_info), +}; + +BTF_ID_LIST_SINGLE(rbtree_map_btf_ids, struct, rbtree_map) +const struct bpf_map_ops rbtree_map_ops = { + .map_meta_equal = bpf_map_meta_equal, + .map_alloc_check = rbtree_map_alloc_check, + .map_alloc = rbtree_map_alloc, + .map_free = rbtree_map_free, + .map_lookup_elem = rbtree_map_lookup_elem, + .map_update_elem = rbtree_map_update_elem, + .map_delete_elem = rbtree_map_delete_elem, + .map_pop_elem = rbtree_map_pop_elem, + .map_get_next_key = rbtree_map_get_next_key, + .map_set_for_each_callback_args = map_set_for_each_callback_args, + .map_for_each_callback = bpf_for_each_rbtree_map, + .map_btf_id = &rbtree_map_btf_ids[0], + .iter_seq_info = &rbtree_map_iter_seq_info, +}; + From patchwork Wed Oct 5 17:17:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Cong Wang X-Patchwork-Id: 12999508 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 54392C4332F for ; Wed, 5 Oct 2022 17:18:22 +0000 (UTC) Received: 
From: Cong Wang
To: netdev@vger.kernel.org
Cc: yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, bpf@vger.kernel.org, sdf@google.com, Cong Wang
Subject: [RFC Patch v6 2/5] bpf: Add map in map support to rbtree
Date: Wed, 5 Oct 2022 10:17:06 -0700
Message-Id: <20221005171709.150520-3-xiyou.wangcong@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221005171709.150520-1-xiyou.wangcong@gmail.com>
References: <20221005171709.150520-1-xiyou.wangcong@gmail.com>

From: Cong Wang

Signed-off-by: Cong Wang
---
 include/linux/bpf.h       |   4 +
 include/linux/bpf_types.h |   1 +
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/rbtree.c       | 158 ++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c      |   7 ++
 5 files changed, 171 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9e7d46d16032..d4d85df1e8ea 100644
---
a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1913,6 +1913,10 @@ int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value); int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file, void *key, void *value, u64 map_flags); int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value); +int bpf_fd_rbtree_map_update_elem(struct bpf_map *map, struct file *map_file, + void *key, void *value, u64 map_flags); +int bpf_fd_rbtree_map_lookup_elem(struct bpf_map *map, void *key, u32 *value); +int bpf_fd_rbtree_map_pop_elem(struct bpf_map *map, void *value); int bpf_get_file_flag(int flags); int bpf_check_uarg_tail_zero(bpfptr_t uaddr, size_t expected_size, diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index c53ba6de1613..d1ef13b08e28 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -128,6 +128,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_RINGBUF, ringbuf_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE, rbtree_map_ops) +BPF_MAP_TYPE(BPF_MAP_TYPE_RBTREE_OF_MAPS, rbtree_map_in_map_ops) BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint) BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 9492cd3af701..994a3e42a4fa 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -936,6 +936,7 @@ enum bpf_map_type { BPF_MAP_TYPE_BLOOM_FILTER, BPF_MAP_TYPE_USER_RINGBUF, BPF_MAP_TYPE_RBTREE, + BPF_MAP_TYPE_RBTREE_OF_MAPS, }; /* Note that tracing related programs such as diff --git a/kernel/bpf/rbtree.c b/kernel/bpf/rbtree.c index f1a9b1c40b8b..43d3d4193ce4 100644 --- a/kernel/bpf/rbtree.c +++ b/kernel/bpf/rbtree.c @@ -12,6 +12,7 @@ #include #include #include +#include "map_in_map.h" #define RBTREE_CREATE_FLAG_MASK \ (BPF_F_NUMA_NODE | BPF_F_ACCESS_MASK) @@ -443,3 +444,160 @@ const struct bpf_map_ops rbtree_map_ops = { .iter_seq_info = &rbtree_map_iter_seq_info, }; +static struct bpf_map *rbtree_map_in_map_alloc(union bpf_attr *attr) +{ + struct bpf_map *map, *inner_map_meta; + + inner_map_meta = bpf_map_meta_alloc(attr->inner_map_fd); + if (IS_ERR(inner_map_meta)) + return inner_map_meta; + + map = rbtree_map_alloc(attr); + if (IS_ERR(map)) { + bpf_map_meta_free(inner_map_meta); + return map; + } + + map->inner_map_meta = inner_map_meta; + return map; +} + +static void *fd_rbtree_map_get_ptr(const struct bpf_map *map, struct rbtree_elem *e) +{ + return *(void **)(e->key + roundup(map->key_size, 8)); +} + +static void rbtree_map_in_map_purge(struct bpf_map *map) +{ + struct rbtree_map *rb = rbtree_map(map); + struct rbtree_elem *e, *tmp; + + rbtree_walk_safe(e, tmp, &rb->root) { + void *ptr = fd_rbtree_map_get_ptr(map, e); + + map->ops->map_fd_put_ptr(ptr); + } +} + +static void rbtree_map_in_map_free(struct bpf_map *map) +{ + struct rbtree_map *rb = rbtree_map(map); + + bpf_map_meta_free(map->inner_map_meta); + rbtree_map_in_map_purge(map); + bpf_map_area_free(rb); +} + +/* Called from eBPF program */ +static void *rbtree_map_in_map_lookup_elem(struct bpf_map *map, void *key) +{ + struct bpf_map **inner_map = rbtree_map_lookup_elem(map, key); + + if (!inner_map) + return NULL; + + return READ_ONCE(*inner_map); +} + +static int rbtree_map_in_map_alloc_check(union bpf_attr *attr) +{ + if (attr->value_size != sizeof(u32)) + return -EINVAL; + return rbtree_map_alloc_check(attr); +} + +/* Called from eBPF program */ +static int 
rbtree_map_in_map_pop_elem(struct bpf_map *map, void *value) +{ + struct rbtree_map *rb = rbtree_map(map); + struct rbtree_elem *e = elem_rb_first(&rb->root); + struct bpf_map **inner_map; + unsigned long flags; + + if (!e) + return -ENOENT; + raw_spin_lock_irqsave(&rb->lock, flags); + rb_erase(&e->rbnode, &rb->root); + raw_spin_unlock_irqrestore(&rb->lock, flags); + inner_map = fd_rbtree_map_get_ptr(map, e); + *(void **)value = *inner_map; + bpf_mem_cache_free(&rb->ma, e); + atomic_dec(&rb->nr_entries); + return 0; +} + +/* only called from syscall */ +int bpf_fd_rbtree_map_pop_elem(struct bpf_map *map, void *value) +{ + struct bpf_map *ptr; + int ret = 0; + + if (!map->ops->map_fd_sys_lookup_elem) + return -ENOTSUPP; + + rcu_read_lock(); + ret = rbtree_map_in_map_pop_elem(map, &ptr); + if (!ret) + *(u32 *)value = map->ops->map_fd_sys_lookup_elem(ptr); + else + ret = -ENOENT; + rcu_read_unlock(); + + return ret; +} + +/* only called from syscall */ +int bpf_fd_rbtree_map_lookup_elem(struct bpf_map *map, void *key, u32 *value) +{ + void **ptr; + int ret = 0; + + if (!map->ops->map_fd_sys_lookup_elem) + return -ENOTSUPP; + + rcu_read_lock(); + ptr = rbtree_map_lookup_elem(map, key); + if (ptr) + *value = map->ops->map_fd_sys_lookup_elem(READ_ONCE(*ptr)); + else + ret = -ENOENT; + rcu_read_unlock(); + + return ret; +} + +/* only called from syscall */ +int bpf_fd_rbtree_map_update_elem(struct bpf_map *map, struct file *map_file, + void *key, void *value, u64 map_flags) +{ + void *ptr; + int ret; + u32 ufd = *(u32 *)value; + + ptr = map->ops->map_fd_get_ptr(map, map_file, ufd); + if (IS_ERR(ptr)) + return PTR_ERR(ptr); + + ret = rbtree_map_update_elem(map, key, &ptr, map_flags); + if (ret) + map->ops->map_fd_put_ptr(ptr); + + return ret; +} + +const struct bpf_map_ops rbtree_map_in_map_ops = { + .map_alloc_check = rbtree_map_in_map_alloc_check, + .map_alloc = rbtree_map_in_map_alloc, + .map_free = rbtree_map_in_map_free, + .map_get_next_key = rbtree_map_get_next_key, + .map_lookup_elem = rbtree_map_in_map_lookup_elem, + .map_update_elem = rbtree_map_update_elem, + .map_pop_elem = rbtree_map_in_map_pop_elem, + .map_delete_elem = rbtree_map_delete_elem, + .map_fd_get_ptr = bpf_map_fd_get_ptr, + .map_fd_put_ptr = bpf_map_fd_put_ptr, + .map_fd_sys_lookup_elem = bpf_map_fd_sys_lookup_elem, + .map_check_btf = map_check_no_btf, + .map_btf_id = &rbtree_map_btf_ids[0], +}; + diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 7b373a5e861f..1b968dc38500 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -213,6 +213,11 @@ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key, err = bpf_fd_htab_map_update_elem(map, f.file, key, value, flags); rcu_read_unlock(); + } else if (map->map_type == BPF_MAP_TYPE_RBTREE_OF_MAPS) { + rcu_read_lock(); + err = bpf_fd_rbtree_map_update_elem(map, f.file, key, value, + flags); + rcu_read_unlock(); } else if (map->map_type == BPF_MAP_TYPE_REUSEPORT_SOCKARRAY) { /* rcu_read_lock() is not needed */ err = bpf_fd_reuseport_array_update_elem(map, key, value, @@ -1832,6 +1837,8 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr) if (map->map_type == BPF_MAP_TYPE_QUEUE || map->map_type == BPF_MAP_TYPE_STACK) { err = map->ops->map_pop_elem(map, value); + } else if (map->map_type == BPF_MAP_TYPE_RBTREE_OF_MAPS) { + bpf_fd_rbtree_map_pop_elem(map, value); } else if (map->map_type == BPF_MAP_TYPE_HASH || map->map_type == BPF_MAP_TYPE_PERCPU_HASH || map->map_type == BPF_MAP_TYPE_LRU_HASH || From patchwork Wed Oct 5 17:17:07 
2022
X-Patchwork-Submitter: Cong Wang
X-Patchwork-Id: 12999506
X-Patchwork-Delegate: bpf@iogearbox.net
From: Cong Wang
To: netdev@vger.kernel.org
Cc: yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, bpf@vger.kernel.org, sdf@google.com, Cong Wang, Cong Wang
Subject: [RFC Patch v6 3/5] net_sched: introduce eBPF based Qdisc
Date: Wed, 5 Oct 2022 10:17:07 -0700
Message-Id: <20221005171709.150520-4-xiyou.wangcong@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221005171709.150520-1-xiyou.wangcong@gmail.com>
References: <20221005171709.150520-1-xiyou.wangcong@gmail.com>
Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC Introduce a new Qdisc which is completely managed by eBPF program of type BPF_PROG_TYPE_QDISC. It accepts two eBPF programs of the same type, but one for enqueue and the other for dequeue. And it interacts with Qdisc layer in two ways: 1) It relies on Qdisc watchdog to handle throttling; 2) It could pass the skb enqueue/dequeue down to child classes The context of this eBPF program is different, as shown below: ┌──────────┬───────────────┬─────────────────────────────────┐ │ │ │ │ │ prog │ input │ output │ │ │ │ │ ├──────────┼───────────────┼─────────────────────────────────┤ │ │ ctx->skb │ SCH_BPF_THROTTLE: ctx->delay │ │ │ │ │ │ enqueue │ ctx->classid │ SCH_BPF_QUEUED: None │ │ │ │ │ │ │ │ SCH_BPF_DROP: None │ │ │ │ │ │ │ │ SCH_BPF_CN: None │ │ │ │ │ │ │ │ SCH_BPF_PASS: ctx->classid │ ├──────────┼───────────────┼─────────────────────────────────┤ │ │ │ SCH_BPF_THROTTLE: ctx->delay │ │ │ │ │ │ dequeue │ ctx->classid │ SCH_BPF_DEQUEUED: ctx->skb │ │ │ │ │ │ │ │ SCH_BPF_DROP: None │ │ │ │ │ │ │ │ SCH_BPF_PASS: ctx->classid │ └──────────┴───────────────┴─────────────────────────────────┘ Signed-off-by: Cong Wang --- include/linux/bpf_types.h | 2 + include/uapi/linux/bpf.h | 16 ++ include/uapi/linux/pkt_sched.h | 17 ++ net/core/filter.c | 12 + net/sched/Kconfig | 15 + net/sched/Makefile | 1 + net/sched/sch_bpf.c | 485 +++++++++++++++++++++++++++++++++ 7 files changed, 548 insertions(+) create mode 100644 net/sched/sch_bpf.c diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index d1ef13b08e28..4e375abe0f03 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -8,6 +8,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_CLS, tc_cls_act, struct __sk_buff, struct sk_buff) BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act, struct __sk_buff, struct sk_buff) +BPF_PROG_TYPE(BPF_PROG_TYPE_QDISC, tc_qdisc, + struct __sk_buff, struct sk_buff) BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff) #ifdef CONFIG_CGROUP_BPF diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 994a3e42a4fa..c21fd1f189bc 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -980,6 +980,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_LSM, BPF_PROG_TYPE_SK_LOOKUP, BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ + BPF_PROG_TYPE_QDISC, }; enum bpf_attach_type { @@ -6984,4 +6985,19 @@ struct bpf_core_relo { enum bpf_core_relo_kind kind; }; +struct sch_bpf_ctx { + struct __sk_buff *skb; + __u32 classid; + __u64 delay; +}; + +enum { + SCH_BPF_QUEUED, + SCH_BPF_DEQUEUED = SCH_BPF_QUEUED, + SCH_BPF_DROP, + SCH_BPF_CN, + SCH_BPF_THROTTLE, + SCH_BPF_PASS, +}; + #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h index 000eec106856..229af4cc54f6 100644 --- a/include/uapi/linux/pkt_sched.h +++ b/include/uapi/linux/pkt_sched.h @@ -1278,4 +1278,21 @@ enum { #define TCA_ETS_MAX (__TCA_ETS_MAX - 1) +#define TCA_SCH_BPF_FLAG_DIRECT _BITUL(0) +enum { + TCA_SCH_BPF_UNSPEC, + TCA_SCH_BPF_FLAGS, /* u32 */ + TCA_SCH_BPF_ENQUEUE_PROG_NAME, /* string */ + TCA_SCH_BPF_ENQUEUE_PROG_FD, /* u32 */ + TCA_SCH_BPF_ENQUEUE_PROG_ID, /* u32 */ + TCA_SCH_BPF_ENQUEUE_PROG_TAG, /* data */ + TCA_SCH_BPF_DEQUEUE_PROG_NAME, /* string */ + TCA_SCH_BPF_DEQUEUE_PROG_FD, /* u32 */ + TCA_SCH_BPF_DEQUEUE_PROG_ID, /* u32 */ + TCA_SCH_BPF_DEQUEUE_PROG_TAG, /* data */ + __TCA_SCH_BPF_MAX, +}; + +#define 
TCA_SCH_BPF_MAX (__TCA_SCH_BPF_MAX - 1) + #endif diff --git a/net/core/filter.c b/net/core/filter.c index bb0136e7a8e4..7a271b77a2cc 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -10655,6 +10655,18 @@ const struct bpf_prog_ops tc_cls_act_prog_ops = { .test_run = bpf_prog_test_run_skb, }; +const struct bpf_verifier_ops tc_qdisc_verifier_ops = { + .get_func_proto = tc_cls_act_func_proto, + .is_valid_access = tc_cls_act_is_valid_access, + .convert_ctx_access = tc_cls_act_convert_ctx_access, + .gen_prologue = tc_cls_act_prologue, + .gen_ld_abs = bpf_gen_ld_abs, +}; + +const struct bpf_prog_ops tc_qdisc_prog_ops = { + .test_run = bpf_prog_test_run_skb, +}; + const struct bpf_verifier_ops xdp_verifier_ops = { .get_func_proto = xdp_func_proto, .is_valid_access = xdp_is_valid_access, diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 1e8ab4749c6c..19f68aac79b1 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -439,6 +439,21 @@ config NET_SCH_ETS If unsure, say N. +config NET_SCH_BPF + tristate "eBPF based programmable queue discipline" + help + This eBPF based queue discipline offers a way to program your + own packet scheduling algorithm. This is a classful qdisc which + also allows you to decide the hierarchy. + + Say Y here if you want to use the eBPF based programmable queue + discipline. + + To compile this driver as a module, choose M here: the module + will be called sch_bpf. + + If unsure, say N. + menuconfig NET_SCH_DEFAULT bool "Allow override default queue discipline" help diff --git a/net/sched/Makefile b/net/sched/Makefile index dd14ef413fda..9ef0d579f5ff 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -65,6 +65,7 @@ obj-$(CONFIG_NET_SCH_FQ_PIE) += sch_fq_pie.o obj-$(CONFIG_NET_SCH_CBS) += sch_cbs.o obj-$(CONFIG_NET_SCH_ETF) += sch_etf.o obj-$(CONFIG_NET_SCH_TAPRIO) += sch_taprio.o +obj-$(CONFIG_NET_SCH_BPF) += sch_bpf.o obj-$(CONFIG_NET_CLS_U32) += cls_u32.o obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c new file mode 100644 index 000000000000..2998d576708d --- /dev/null +++ b/net/sched/sch_bpf.c @@ -0,0 +1,485 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Programmable Qdisc with eBPF + * + * Copyright (C) 2022, ByteDance, Cong Wang + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define ACT_BPF_NAME_LEN 256 + +struct sch_bpf_prog { + struct bpf_prog *prog; + const char *name; +}; + +struct sch_bpf_class { + struct Qdisc_class_common common; + struct Qdisc *qdisc; + + unsigned int drops; + unsigned int overlimits; + struct gnet_stats_basic_sync bstats; +}; + +struct sch_bpf_qdisc { + struct tcf_proto __rcu *filter_list; /* optional external classifier */ + struct tcf_block *block; + struct Qdisc_class_hash clhash; + struct sch_bpf_prog enqueue_prog; + struct sch_bpf_prog dequeue_prog; + + struct qdisc_watchdog watchdog; +}; + +static int sch_bpf_dump_prog(const struct sch_bpf_prog *prog, struct sk_buff *skb, + int name, int id, int tag) +{ + struct nlattr *nla; + + if (prog->name && + nla_put_string(skb, name, prog->name)) + return -EMSGSIZE; + + if (nla_put_u32(skb, id, prog->prog->aux->id)) + return -EMSGSIZE; + + nla = nla_reserve(skb, tag, sizeof(prog->prog->tag)); + if (!nla) + return -EMSGSIZE; + + memcpy(nla_data(nla), prog->prog->tag, nla_len(nla)); + return 0; +} + +static int sch_bpf_dump(struct Qdisc *sch, struct sk_buff *skb) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + 
struct nlattr *opts; + u32 bpf_flags = 0; + + opts = nla_nest_start_noflag(skb, TCA_OPTIONS); + if (!opts) + goto nla_put_failure; + + if (bpf_flags && nla_put_u32(skb, TCA_SCH_BPF_FLAGS, bpf_flags)) + goto nla_put_failure; + + if (sch_bpf_dump_prog(&q->enqueue_prog, skb, TCA_SCH_BPF_ENQUEUE_PROG_NAME, + TCA_SCH_BPF_ENQUEUE_PROG_ID, TCA_SCH_BPF_ENQUEUE_PROG_TAG)) + goto nla_put_failure; + if (sch_bpf_dump_prog(&q->dequeue_prog, skb, TCA_SCH_BPF_DEQUEUE_PROG_NAME, + TCA_SCH_BPF_DEQUEUE_PROG_ID, TCA_SCH_BPF_DEQUEUE_PROG_TAG)) + goto nla_put_failure; + + return nla_nest_end(skb, opts); + +nla_put_failure: + return -1; +} + +static int sch_bpf_dump_stats(struct Qdisc *sch, struct gnet_dump *d) +{ + return 0; +} + +static struct sch_bpf_class *sch_bpf_find(struct Qdisc *sch, u32 classid) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + struct Qdisc_class_common *clc; + + clc = qdisc_class_find(&q->clhash, classid); + if (!clc) + return NULL; + return container_of(clc, struct sch_bpf_class, common); +} + +static int sch_bpf_enqueue(struct sk_buff *skb, struct Qdisc *sch, + struct sk_buff **to_free) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + unsigned int len = qdisc_pkt_len(skb); + struct sch_bpf_ctx ctx = {}; + struct sch_bpf_class *cl; + int res = NET_XMIT_SUCCESS; + struct bpf_prog *enqueue; + s64 now; + + enqueue = rcu_dereference(q->enqueue_prog.prog); + bpf_compute_data_pointers(skb); + ctx.skb = (struct __sk_buff *)skb; + ctx.classid = sch->handle; + res = bpf_prog_run(enqueue, &ctx); + switch (res) { + case SCH_BPF_THROTTLE: + now = ktime_get_ns(); + qdisc_watchdog_schedule_ns(&q->watchdog, now + ctx.delay); + qdisc_qstats_overlimit(sch); + fallthrough; + case SCH_BPF_QUEUED: + return NET_XMIT_SUCCESS; + case SCH_BPF_CN: + return NET_XMIT_CN; + case SCH_BPF_PASS: + break; + default: + __qdisc_drop(skb, to_free); + return NET_XMIT_DROP; + } + + cl = sch_bpf_find(sch, ctx.classid); + if (!cl || !cl->qdisc) { + if (res & __NET_XMIT_BYPASS) + qdisc_qstats_drop(sch); + __qdisc_drop(skb, to_free); + return res; + } + + res = qdisc_enqueue(skb, cl->qdisc, to_free); + if (res != NET_XMIT_SUCCESS) { + if (net_xmit_drop_count(res)) { + qdisc_qstats_drop(sch); + cl->drops++; + } + return res; + } + + sch->qstats.backlog += len; + sch->q.qlen++; + return res; +} + +static struct sk_buff *sch_bpf_dequeue(struct Qdisc *sch) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + struct sk_buff *ret = NULL; + struct sch_bpf_ctx ctx = {}; + struct bpf_prog *dequeue; + struct sch_bpf_class *cl; + s64 now; + int res; + + dequeue = rcu_dereference(q->dequeue_prog.prog); + ctx.classid = sch->handle; + res = bpf_prog_run(dequeue, &ctx); + switch (res) { + case SCH_BPF_DEQUEUED: + ret = (struct sk_buff *)ctx.skb; + break; + case SCH_BPF_THROTTLE: + now = ktime_get_ns(); + qdisc_watchdog_schedule_ns(&q->watchdog, now + ctx.delay); + qdisc_qstats_overlimit(sch); + cl->overlimits++; + return NULL; + case SCH_BPF_PASS: + cl = sch_bpf_find(sch, ctx.classid); + if (!cl || !cl->qdisc) + return NULL; + ret = qdisc_dequeue_peeked(cl->qdisc); + if (ret) { + qdisc_bstats_update(sch, ret); + qdisc_qstats_backlog_dec(sch, ret); + sch->q.qlen--; + } + } + + return ret; +} + +static struct Qdisc *sch_bpf_leaf(struct Qdisc *sch, unsigned long arg) +{ + struct sch_bpf_class *cl = (struct sch_bpf_class *)arg; + + return cl->qdisc; +} + +static int sch_bpf_graft(struct Qdisc *sch, unsigned long arg, struct Qdisc *new, + struct Qdisc **old, struct netlink_ext_ack *extack) +{ + struct sch_bpf_class *cl = (struct sch_bpf_class 
*)arg; + + if (new) + *old = qdisc_replace(sch, new, &cl->qdisc); + return 0; +} + +static unsigned long sch_bpf_bind(struct Qdisc *sch, unsigned long parent, + u32 classid) +{ + return 0; +} + +static void sch_bpf_unbind(struct Qdisc *q, unsigned long cl) +{ +} + +static unsigned long sch_bpf_search(struct Qdisc *sch, u32 handle) +{ + return (unsigned long)sch_bpf_find(sch, handle); +} + +static struct tcf_block *sch_bpf_tcf_block(struct Qdisc *sch, unsigned long cl, + struct netlink_ext_ack *extack) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + + if (cl) + return NULL; + return q->block; +} + +static const struct nla_policy sch_bpf_policy[TCA_SCH_BPF_MAX + 1] = { + [TCA_SCH_BPF_FLAGS] = { .type = NLA_U32 }, + [TCA_SCH_BPF_ENQUEUE_PROG_FD] = { .type = NLA_U32 }, + [TCA_SCH_BPF_ENQUEUE_PROG_NAME] = { .type = NLA_NUL_STRING, + .len = ACT_BPF_NAME_LEN }, + [TCA_SCH_BPF_DEQUEUE_PROG_FD] = { .type = NLA_U32 }, + [TCA_SCH_BPF_DEQUEUE_PROG_NAME] = { .type = NLA_NUL_STRING, + .len = ACT_BPF_NAME_LEN }, +}; + +static int bpf_init_prog(struct nlattr *fd, struct nlattr *name, struct sch_bpf_prog *prog) +{ + char *prog_name = NULL; + struct bpf_prog *fp; + u32 bpf_fd; + + if (!fd) + return -EINVAL; + bpf_fd = nla_get_u32(fd); + + fp = bpf_prog_get_type(bpf_fd, BPF_PROG_TYPE_QDISC); + if (IS_ERR(fp)) + return PTR_ERR(fp); + + if (name) { + prog_name = nla_memdup(name, GFP_KERNEL); + if (!prog_name) { + bpf_prog_put(fp); + return -ENOMEM; + } + } + + prog->name = prog_name; + prog->prog = fp; + return 0; +} + +static void bpf_cleanup_prog(struct sch_bpf_prog *prog) +{ + if (prog->prog) + bpf_prog_put(prog->prog); + kfree(prog->name); +} + +static int sch_bpf_change(struct Qdisc *sch, struct nlattr *opt, + struct netlink_ext_ack *extack) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + struct nlattr *tb[TCA_SCH_BPF_MAX + 1]; + int err; + + if (!opt) + return -EINVAL; + + err = nla_parse_nested_deprecated(tb, TCA_SCH_BPF_MAX, opt, + sch_bpf_policy, NULL); + if (err < 0) + return err; + + if (tb[TCA_SCH_BPF_FLAGS]) { + u32 bpf_flags = nla_get_u32(tb[TCA_SCH_BPF_FLAGS]); + + if (bpf_flags & ~TCA_SCH_BPF_FLAG_DIRECT) + return -EINVAL; + } + + err = bpf_init_prog(tb[TCA_SCH_BPF_ENQUEUE_PROG_FD], + tb[TCA_SCH_BPF_ENQUEUE_PROG_NAME], &q->enqueue_prog); + if (err) + return err; + err = bpf_init_prog(tb[TCA_SCH_BPF_DEQUEUE_PROG_FD], + tb[TCA_SCH_BPF_DEQUEUE_PROG_NAME], &q->dequeue_prog); + return err; +} + +static int sch_bpf_init(struct Qdisc *sch, struct nlattr *opt, + struct netlink_ext_ack *extack) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + int err; + + qdisc_watchdog_init(&q->watchdog, sch); + if (opt) { + err = sch_bpf_change(sch, opt, extack); + if (err) + return err; + } + + err = tcf_block_get(&q->block, &q->filter_list, sch, extack); + if (err) + return err; + + return qdisc_class_hash_init(&q->clhash); +} + +static void sch_bpf_reset(struct Qdisc *sch) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + + qdisc_watchdog_cancel(&q->watchdog); +} + +static void sch_bpf_destroy(struct Qdisc *sch) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + + qdisc_watchdog_cancel(&q->watchdog); + tcf_block_put(q->block); + qdisc_class_hash_destroy(&q->clhash); + bpf_cleanup_prog(&q->enqueue_prog); + bpf_cleanup_prog(&q->dequeue_prog); +} + +static int sch_bpf_change_class(struct Qdisc *sch, u32 classid, + u32 parentid, struct nlattr **tca, + unsigned long *arg, + struct netlink_ext_ack *extack) +{ + struct sch_bpf_class *cl = (struct sch_bpf_class *)*arg; + struct sch_bpf_qdisc *q = qdisc_priv(sch); + + 
if (!cl) { + cl = kzalloc(sizeof(*cl), GFP_KERNEL); + if (!cl) + return -ENOBUFS; + qdisc_class_hash_insert(&q->clhash, &cl->common); + } + + qdisc_class_hash_grow(sch, &q->clhash); + *arg = (unsigned long)cl; + return 0; +} + +static int sch_bpf_delete(struct Qdisc *sch, unsigned long arg, + struct netlink_ext_ack *extack) +{ + struct sch_bpf_class *cl = (struct sch_bpf_class *)arg; + struct sch_bpf_qdisc *q = qdisc_priv(sch); + + qdisc_class_hash_remove(&q->clhash, &cl->common); + if (cl->qdisc) + qdisc_put(cl->qdisc); + return 0; +} + +static int sch_bpf_dump_class(struct Qdisc *sch, unsigned long arg, + struct sk_buff *skb, struct tcmsg *tcm) +{ + return 0; +} + +static int +sch_bpf_dump_class_stats(struct Qdisc *sch, unsigned long arg, struct gnet_dump *d) +{ + struct sch_bpf_class *cl = (struct sch_bpf_class *)arg; + struct gnet_stats_queue qs = { + .drops = cl->drops, + .overlimits = cl->overlimits, + }; + __u32 qlen = 0; + + if (cl->qdisc) + qdisc_qstats_qlen_backlog(cl->qdisc, &qlen, &qs.backlog); + else + qlen = 0; + + if (gnet_stats_copy_basic(d, NULL, &cl->bstats, true) < 0 || + gnet_stats_copy_queue(d, NULL, &qs, qlen) < 0) + return -1; + return 0; +} + +static void sch_bpf_walk(struct Qdisc *sch, struct qdisc_walker *arg) +{ + struct sch_bpf_qdisc *q = qdisc_priv(sch); + struct sch_bpf_class *cl; + unsigned int i; + + if (arg->stop) + return; + + for (i = 0; i < q->clhash.hashsize; i++) { + hlist_for_each_entry(cl, &q->clhash.hash[i], common.hnode) { + if (arg->count < arg->skip) { + arg->count++; + continue; + } + if (arg->fn(sch, (unsigned long)cl, arg) < 0) { + arg->stop = 1; + return; + } + arg->count++; + } + } +} + +static const struct Qdisc_class_ops sch_bpf_class_ops = { + .graft = sch_bpf_graft, + .leaf = sch_bpf_leaf, + .find = sch_bpf_search, + .change = sch_bpf_change_class, + .delete = sch_bpf_delete, + .tcf_block = sch_bpf_tcf_block, + .bind_tcf = sch_bpf_bind, + .unbind_tcf = sch_bpf_unbind, + .dump = sch_bpf_dump_class, + .dump_stats = sch_bpf_dump_class_stats, + .walk = sch_bpf_walk, +}; + +static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = { + .cl_ops = &sch_bpf_class_ops, + .id = "bpf", + .priv_size = sizeof(struct sch_bpf_qdisc), + .enqueue = sch_bpf_enqueue, + .dequeue = sch_bpf_dequeue, + .peek = qdisc_peek_dequeued, + .init = sch_bpf_init, + .reset = sch_bpf_reset, + .destroy = sch_bpf_destroy, + .change = sch_bpf_change, + .dump = sch_bpf_dump, + .dump_stats = sch_bpf_dump_stats, + .owner = THIS_MODULE, +}; + +static int __init sch_bpf_mod_init(void) +{ + return register_qdisc(&sch_bpf_qdisc_ops); +} + +static void __exit sch_bpf_mod_exit(void) +{ + unregister_qdisc(&sch_bpf_qdisc_ops); +} + +module_init(sch_bpf_mod_init) +module_exit(sch_bpf_mod_exit) +MODULE_AUTHOR("Cong Wang"); +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("eBPF queue discipline"); From patchwork Wed Oct 5 17:17:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Cong Wang X-Patchwork-Id: 12999505 X-Patchwork-Delegate: kuba@kernel.org Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4CA8FC433F5 for ; Wed, 5 Oct 2022 17:18:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230134AbiJERSP (ORCPT ); Wed, 5 Oct 2022 13:18:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58390 
"EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229681AbiJERSN (ORCPT ); Wed, 5 Oct 2022 13:18:13 -0400 Received: from mail-qv1-xf2c.google.com (mail-qv1-xf2c.google.com [IPv6:2607:f8b0:4864:20::f2c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E165726111; Wed, 5 Oct 2022 10:18:10 -0700 (PDT) Received: by mail-qv1-xf2c.google.com with SMTP id mx8so4474643qvb.8; Wed, 05 Oct 2022 10:18:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date; bh=ZhAUYVqrdKYklNGIsEXG4znDGVspQ80aq1aJ4554Cc0=; b=XVTKmDU0btvmwiJm2tarmHaZXjec2OSoNYPt63uGtSpVIlbgGxxTyuah0+G6y1HLtX F6klWNxyzedwZVUyRj7krLaQ0z3XH/eNh6t40mdbgi2JqaLqp08TDjZgwgKAOuZ8ZjTy WdVkG4C6pdiufVYjfrv6cpVa9Srhw7wXLPU1t60Udf9RcYKsdthZmwqRdllwo02XTdFh gQOHI3iHD4CWkAD94mzYBD5JKRzRMZ5Im1TtdJTLZe1n5l+BJCd2kX0BWb7BlQ/vAbXB +DhBou99kHHwyokw32MDyYQDxqlC+sXs06p/uFEf9SK9ws1mrI0EsuO/RzdMZ1eqzLzr f+0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date; bh=ZhAUYVqrdKYklNGIsEXG4znDGVspQ80aq1aJ4554Cc0=; b=GUh1NOpC+kRZs96b6VZQyLzlnmxKH8rru88vtKdR09PVpNwMDGAaKLqXzwYgwGT7Ft /hVz5qS0IYqE/fuotrM6EVOIqblN3KG8kkIn1yFRgKggqs6/NdKQZClRTSo8y9tP9h6X zP/FlC40dTwaa/uLfkZjIFeMqwg8U1PfrfxZ9jjrkksGFC4yGZBaWpNk3ZfFcuCL2c1z lXtscKKRvFIrxpUs9OL/rYj3XbGP/bvD2ZNbwI/eubtGKoCiiwRtykAKbMTToIkDUo8q pb66INSct8elNZeFACBty/jdJBGFsLkcOoUVYI67bVCHi0G9dz64XBg3yIURmyrjBL25 Uysw== X-Gm-Message-State: ACrzQf2JP0b7JrnMEzrHbnEgpvt/tFalXFWRMyU4cKoMWcArWKf+VVBm g9JD11aBGqSRnxyEN3HLPhs6BpFg7WY= X-Google-Smtp-Source: AMsMyM5IYwCwEN1mk5GDaZ6ZOVUMnadpOXpI0I2WBqgbWV4iXT2KsktcL7kQ9PlOgCTMzQKL3g9upA== X-Received: by 2002:a05:6214:5018:b0:4b1:c2d9:f0a with SMTP id jo24-20020a056214501800b004b1c2d90f0amr615128qvb.45.1664990267095; Wed, 05 Oct 2022 10:17:47 -0700 (PDT) Received: from pop-os.attlocal.net ([2600:1700:65a0:ab60:2bd1:c1af:4b3b:4384]) by smtp.gmail.com with ESMTPSA id m13-20020ac85b0d000000b003913996dce3sm1764552qtw.6.2022.10.05.10.17.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 05 Oct 2022 10:17:46 -0700 (PDT) From: Cong Wang To: netdev@vger.kernel.org Cc: yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, bpf@vger.kernel.org, sdf@google.com, Cong Wang Subject: [RFC Patch v6 4/5] net_sched: Add kfuncs for storing skb Date: Wed, 5 Oct 2022 10:17:08 -0700 Message-Id: <20221005171709.150520-5-xiyou.wangcong@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221005171709.150520-1-xiyou.wangcong@gmail.com> References: <20221005171709.150520-1-xiyou.wangcong@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: kuba@kernel.org X-Patchwork-State: RFC From: Cong Wang Signed-off-by: Cong Wang --- net/sched/sch_bpf.c | 81 ++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 80 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_bpf.c b/net/sched/sch_bpf.c index 2998d576708d..6850eb8bb574 100644 --- a/net/sched/sch_bpf.c +++ b/net/sched/sch_bpf.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -468,9 +469,87 @@ static struct Qdisc_ops sch_bpf_qdisc_ops __read_mostly = { .owner = THIS_MODULE, }; +__diag_push(); 
+__diag_ignore_all("-Wmissing-prototypes", + "Global functions as their definitions will be in vmlinux BTF"); + +/** + * bpf_skb_acquire - Acquire a reference to an skb. An skb acquired by this + * kfunc which is not stored in a map as a kptr, must be released by calling + * bpf_skb_release(). + * @p: The skb on which a reference is being acquired. + */ +__used noinline +struct sk_buff *bpf_skb_acquire(struct sk_buff *p) +{ + return skb_get(p); +} + +/** + * bpf_skb_kptr_get - Acquire a reference on a struct sk_buff kptr. An skb + * kptr acquired by this kfunc which is not subsequently stored in a map, must + * be released by calling bpf_skb_release(). + * @pp: A pointer to an skb kptr on which a reference is being acquired. + */ +__used noinline +struct sk_buff *bpf_skb_kptr_get(struct sk_buff **pp) +{ + struct sk_buff *p; + + rcu_read_lock(); + p = READ_ONCE(*pp); + if (p && !refcount_inc_not_zero(&p->users)) + p = NULL; + rcu_read_unlock(); + + return p; +} + +/** + * bpf_skb_release - Release the reference acquired on a struct sk_buff *. + * @p: The skb on which a reference is being released. + */ +__used noinline void bpf_skb_release(struct sk_buff *p) +{ + consume_skb(p); +} + +__diag_pop(); + +BTF_SET8_START(skb_kfunc_btf_ids) +BTF_ID_FLAGS(func, bpf_skb_acquire, KF_ACQUIRE) +BTF_ID_FLAGS(func, bpf_skb_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_skb_release, KF_RELEASE | KF_TRUSTED_ARGS) +BTF_SET8_END(skb_kfunc_btf_ids) + +static const struct btf_kfunc_id_set skb_kfunc_set = { + .owner = THIS_MODULE, + .set = &skb_kfunc_btf_ids, +}; + +BTF_ID_LIST(skb_kfunc_dtor_ids) +BTF_ID(struct, sk_buff) +BTF_ID(func, bpf_skb_release) + static int __init sch_bpf_mod_init(void) { - return register_qdisc(&sch_bpf_qdisc_ops); + int ret; + const struct btf_id_dtor_kfunc skb_kfunc_dtors[] = { + { + .btf_id = skb_kfunc_dtor_ids[0], + .kfunc_btf_id = skb_kfunc_dtor_ids[1] + }, + }; + + ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_QDISC, &skb_kfunc_set); + if (ret) + return ret; + ret = register_btf_id_dtor_kfuncs(skb_kfunc_dtors, + ARRAY_SIZE(skb_kfunc_dtors), + THIS_MODULE); + if (ret == 0) + return register_qdisc(&sch_bpf_qdisc_ops); + return ret; } static void __exit sch_bpf_mod_exit(void) From patchwork Wed Oct 5 17:17:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Cong Wang X-Patchwork-Id: 12999509 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C8115C433F5 for ; Wed, 5 Oct 2022 17:18:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230289AbiJERSV (ORCPT ); Wed, 5 Oct 2022 13:18:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58512 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230231AbiJERSQ (ORCPT ); Wed, 5 Oct 2022 13:18:16 -0400 Received: from mail-qv1-xf2e.google.com (mail-qv1-xf2e.google.com [IPv6:2607:f8b0:4864:20::f2e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 493953A157; Wed, 5 Oct 2022 10:18:11 -0700 (PDT) Received: by mail-qv1-xf2e.google.com with SMTP id i9so7352929qvu.1; Wed, 05 Oct 2022 10:18:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; 
From: Cong Wang
To: netdev@vger.kernel.org
Cc: yangpeihao@sjtu.edu.cn, toke@redhat.com, jhs@mojatatu.com, jiri@resnulli.us, bpf@vger.kernel.org, sdf@google.com, Cong Wang
Subject: [RFC Patch v6 5/5] net_sched: Introduce helper bpf_skb_tc_classify()
Date: Wed, 5 Oct 2022 10:17:09 -0700
Message-Id: <20221005171709.150520-6-xiyou.wangcong@gmail.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20221005171709.150520-1-xiyou.wangcong@gmail.com>
References: <20221005171709.150520-1-xiyou.wangcong@gmail.com>

From: Cong Wang

Introduce an eBPF helper function bpf_skb_tc_classify() to reuse existing TC filters on *any* Qdisc to classify the skb.
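For illustration, a minimal enqueue-program sketch (not part of this patch) showing the intended use: classify the skb against TC filters already attached to another qdisc and steer it to the returned class. The SEC() name is an assumption, since libbpf plumbing for BPF_PROG_TYPE_QDISC is not part of this series; struct sch_bpf_ctx, the SCH_BPF_* return codes and BPF_FUNC_skb_tc_classify are taken from the include/uapi/linux/bpf.h changes in patches 3/5 and 5/5, assuming headers installed from this tree.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hand-rolled helper declaration; BPF_FUNC_skb_tc_classify is the enum
     * value generated from the FN(skb_tc_classify) entry added below.
     */
    static long (*bpf_skb_tc_classify)(void *skb, int ifindex, __u32 handle) =
            (void *)BPF_FUNC_skb_tc_classify;

    SEC("qdisc/enqueue") /* assumed section name */
    int enqueue_prog(struct sch_bpf_ctx *ctx)
    {
            /* Example values: reuse the filters of qdisc 1: on ifindex 2. */
            long classid = bpf_skb_tc_classify(ctx->skb, 2, 0x00010000);

            if (!classid)
                    return SCH_BPF_DROP;    /* no filter matched */

            ctx->classid = classid;         /* hand the skb to that child class */
            return SCH_BPF_PASS;
    }

    char _license[] SEC("license") = "GPL";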
Signed-off-by: Cong Wang --- include/uapi/linux/bpf.h | 1 + net/core/filter.c | 17 +++++++++- net/sched/cls_api.c | 69 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 86 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c21fd1f189bc..7ed04736c4e4 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5650,6 +5650,7 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6), \ FN(ktime_get_tai_ns), \ FN(user_ringbuf_drain), \ + FN(skb_tc_classify), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/net/core/filter.c b/net/core/filter.c index 7a271b77a2cc..d1ed60114794 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -7926,6 +7926,21 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) } } +const struct bpf_func_proto bpf_skb_tc_classify_proto __weak; + +static const struct bpf_func_proto * +tc_qdisc_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { +#ifdef CONFIG_NET_CLS_ACT + case BPF_FUNC_skb_tc_classify: + return &bpf_skb_tc_classify_proto; +#endif + default: + return tc_cls_act_func_proto(func_id, prog); + } +} + static const struct bpf_func_proto * xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@ -10656,7 +10671,7 @@ const struct bpf_prog_ops tc_cls_act_prog_ops = { }; const struct bpf_verifier_ops tc_qdisc_verifier_ops = { - .get_func_proto = tc_cls_act_func_proto, + .get_func_proto = tc_qdisc_func_proto, .is_valid_access = tc_cls_act_is_valid_access, .convert_ctx_access = tc_cls_act_convert_ctx_access, .gen_prologue = tc_cls_act_prologue, diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c index 50566db45949..64470a8947b1 100644 --- a/net/sched/cls_api.c +++ b/net/sched/cls_api.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -1655,6 +1656,74 @@ int tcf_classify(struct sk_buff *skb, } EXPORT_SYMBOL(tcf_classify); +#ifdef CONFIG_BPF_SYSCALL +BPF_CALL_3(bpf_skb_tc_classify, struct sk_buff *, skb, int, ifindex, u32, handle) +{ + struct net *net = dev_net(skb->dev); + const struct Qdisc_class_ops *cops; + struct tcf_result res = {}; + struct tcf_block *block; + struct tcf_chain *chain; + struct net_device *dev; + unsigned long cl = 0; + struct Qdisc *q; + int result; + + rcu_read_lock(); + dev = dev_get_by_index_rcu(net, ifindex); + if (!dev) + goto out; + q = qdisc_lookup_rcu(dev, handle); + if (!q) + goto out; + + cops = q->ops->cl_ops; + if (!cops) + goto out; + if (!cops->tcf_block) + goto out; + if (TC_H_MIN(handle)) { + cl = cops->find(q, handle); + if (cl == 0) + goto out; + } + block = cops->tcf_block(q, cl, NULL); + if (!block) + goto out; + + for (chain = tcf_get_next_chain(block, NULL); + chain; + chain = tcf_get_next_chain(block, chain)) { + struct tcf_proto *tp; + + result = tcf_classify(skb, NULL, tp, &res, false); + if (result >= 0) { + switch (result) { + case TC_ACT_QUEUED: + case TC_ACT_STOLEN: + case TC_ACT_TRAP: + fallthrough; + case TC_ACT_SHOT: + rcu_read_unlock(); + return 0; + } + } + } +out: + rcu_read_unlock(); + return res.class; +} + +const struct bpf_func_proto bpf_skb_tc_classify_proto = { + .func = bpf_skb_tc_classify, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, +}; +#endif + struct tcf_chain_info { struct tcf_proto __rcu **pprev; struct tcf_proto __rcu *next;