From patchwork Tue Sep 5 01:52:51 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "wuqiang.matt"
X-Patchwork-Id: 13374775
From: "wuqiang.matt"
To: linux-trace-kernel@vger.kernel.org, mhiramat@kernel.org, davem@davemloft.net,
 anil.s.keshavamurthy@intel.com, naveen.n.rao@linux.ibm.com, rostedt@goodmis.org,
 peterz@infradead.org, akpm@linux-foundation.org, sander@svanheule.net,
 ebiggers@google.com, dan.j.williams@intel.com, jpoimboe@kernel.org
Cc: linux-kernel@vger.kernel.org, lkp@intel.com, mattwu@163.com,
"wuqiang.matt" Subject: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC Date: Tue, 5 Sep 2023 09:52:51 +0800 Message-Id: <20230905015255.81545-2-wuqiang.matt@bytedance.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230905015255.81545-1-wuqiang.matt@bytedance.com> References: <20230905015255.81545-1-wuqiang.matt@bytedance.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org The object pool is a scalable implementaion of high performance queue for object allocation and reclamation, such as kretprobe instances. With leveraging percpu ring-array to mitigate the hot spot of memory contention, it could deliver near-linear scalability for high parallel scenarios. The objpool is best suited for following cases: 1) Memory allocation or reclamation are prohibited or too expensive 2) Consumers are of different priorities, such as irqs and threads Limitations: 1) Maximum objects (capacity) is determined during pool initializing and can't be modified (extended) after objpool creation 2) The memory of objects won't be freed until objpool is finalized 3) Object allocation (pop) may fail after trying all cpu slots Signed-off-by: wuqiang.matt Signed-off-by: Masami Hiramatsu (Google) --- include/linux/objpool.h | 174 +++++++++++++++++++++ lib/Makefile | 2 +- lib/objpool.c | 338 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 513 insertions(+), 1 deletion(-) create mode 100644 include/linux/objpool.h create mode 100644 lib/objpool.c diff --git a/include/linux/objpool.h b/include/linux/objpool.h new file mode 100644 index 000000000000..33c832216b98 --- /dev/null +++ b/include/linux/objpool.h @@ -0,0 +1,174 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _LINUX_OBJPOOL_H +#define _LINUX_OBJPOOL_H + +#include +#include + +/* + * objpool: ring-array based lockless MPMC queue + * + * Copyright: wuqiang.matt@bytedance.com + * + * The object pool is a scalable implementaion of high performance queue + * for objects allocation and reclamation, such as kretprobe instances. + * + * With leveraging per-cpu ring-array to mitigate the hot spots of memory + * contention, it could deliver near-linear scalability for high parallel + * scenarios. The ring-array is compactly managed in a single cache-line + * to benefit from warmed L1 cache for most cases (<= 4 objects per-core). + * The body of pre-allocated objects is stored in continuous cache-lines + * just after the ring-array. + * + * The object pool is interrupt safe. Both allocation and reclamation + * (object pop and push operations) can be preemptible or interruptable. + * + * It's best suited for following cases: + * 1) Memory allocation or reclamation are prohibited or too expensive + * 2) Consumers are of different priorities, such as irqs and threads + * + * Limitations: + * 1) Maximum objects (capacity) is determined during pool initializing + * 2) The memory of objects won't be freed until the pool is finalized + * 3) Object allocation (pop) may fail after trying all cpu slots + */ + +/** + * struct objpool_slot - percpu ring array of objpool + * @head: head of the local ring array (to retrieve at) + * @tail: tail of the local ring array (to append at) + * @bits: log2 of capacity (for bitwise operations) + * @mask: capacity - 1 + * + * Represents a cpu-local array-based ring buffer, its size is specialized + * during initialization of object pool. 
The percpu objpool slot is to be + * allocated from local memory for NUMA system, and to be kept compact in + * continuous memory: ages[] is stored just after the body of objpool_slot, + * and then entries[]. ages[] describes revision of each item, solely used + * to avoid ABA; entries[] contains pointers of the actual objects + * + * Layout of objpool_slot in memory: + * + * 64bit: + * 4 8 12 16 32 64 + * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects + * + * 32bit: + * 4 8 12 16 32 48 64 + * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects + * + */ +struct objpool_slot { + uint32_t head; + uint32_t tail; + uint32_t bits; + uint32_t mask; +} __packed; + +struct objpool_head; + +/* + * caller-specified callback for object initial setup, it's only called + * once for each object (just after the memory allocation of the object) + */ +typedef int (*objpool_init_obj_cb)(void *obj, void *context); + +/* caller-specified cleanup callback for objpool destruction */ +typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); + +/** + * struct objpool_head - object pooling metadata + * @obj_size: object & element size + * @nr_objs: total objs (to be pre-allocated) + * @nr_cpus: nr_cpu_ids + * @capacity: max objects per cpuslot + * @gfp: gfp flags for kmalloc & vmalloc + * @ref: refcount for objpool + * @flags: flags for objpool management + * @cpu_slots: array of percpu slots + * @slot_sizes: size in bytes of slots + * @release: resource cleanup callback + * @context: caller-provided context + */ +struct objpool_head { + int obj_size; + int nr_objs; + int nr_cpus; + int capacity; + gfp_t gfp; + refcount_t ref; + unsigned long flags; + struct objpool_slot **cpu_slots; + int *slot_sizes; + objpool_fini_cb release; + void *context; +}; + +#define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */ +#define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */ + +/** + * objpool_init() - initialize objpool and pre-allocated objects + * @head: the object pool to be initialized, declared by caller + * @nr_objs: total objects to be pre-allocated by this object pool + * @object_size: size of an object (should be > 0) + * @gfp: flags for memory allocation (via kmalloc or vmalloc) + * @context: user context for object initialization callback + * @objinit: object initialization callback for extra setup + * @release: cleanup callback for extra cleanup task + * + * return value: 0 for success, otherwise error code + * + * All pre-allocated objects are to be zeroed after memory allocation. + * Caller could do extra initialization in objinit callback. objinit() + * will be called just after slot allocation and will be only once for + * each object. Since then the objpool won't touch any content of the + * objects. It's caller's duty to perform reinitialization after each + * pop (object allocation) or do clearance before each push (object + * reclamation). 
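To make the calling convention concrete, here is a minimal usage sketch (the object type, callback and function names are illustrative only, not part of this patch):

	struct my_obj {
		int state;
	};

	/* objinit callback: called once per object, right after the pool
	 * has allocated and zeroed it */
	static int my_obj_init(void *obj, void *context)
	{
		((struct my_obj *)obj)->state = 0;
		return 0;
	}

	static struct objpool_head my_pool;

	static int my_pool_setup(void)
	{
		/* pre-allocate 128 objects of sizeof(struct my_obj) each */
		return objpool_init(&my_pool, 128, sizeof(struct my_obj),
				    GFP_KERNEL, NULL, my_obj_init, NULL);
	}

	static void my_pool_use(void)
	{
		struct my_obj *obj = objpool_pop(&my_pool);

		if (!obj)
			return;		/* all per-cpu slots were empty */

		obj->state = 1;		/* caller re-initializes after each pop */
		objpool_push(obj, &my_pool);
	}

	static void my_pool_teardown(void)
	{
		/* free unused objects and drop the pool's own reference */
		objpool_fini(&my_pool);
	}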
+ */ +int objpool_init(struct objpool_head *head, int nr_objs, int object_size, + gfp_t gfp, void *context, objpool_init_obj_cb objinit, + objpool_fini_cb release); + +/** + * objpool_pop() - allocate an object from objpool + * @head: object pool + * + * return value: object ptr or NULL if failed + */ +void *objpool_pop(struct objpool_head *head); + +/** + * objpool_push() - reclaim the object and return back to objpool + * @obj: object ptr to be pushed to objpool + * @head: object pool + * + * return: 0 or error code (it fails only when user tries to push + * the same object multiple times or wrong "objects" into objpool) + */ +int objpool_push(void *obj, struct objpool_head *head); + +/** + * objpool_drop() - discard the object and deref objpool + * @obj: object ptr to be discarded + * @head: object pool + * + * return: 0 if objpool was released or error code + */ +int objpool_drop(void *obj, struct objpool_head *head); + +/** + * objpool_free() - release objpool forcely (all objects to be freed) + * @head: object pool to be released + */ +void objpool_free(struct objpool_head *head); + +/** + * objpool_fini() - deref object pool (also releasing unused objects) + * @head: object pool to be dereferenced + */ +void objpool_fini(struct objpool_head *head); + +#endif /* _LINUX_OBJPOOL_H */ diff --git a/lib/Makefile b/lib/Makefile index 1ffae65bb7ee..7a84c922d9ff 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ is_single_threaded.o plist.o decompress.o kobject_uevent.o \ earlycpio.o seq_buf.o siphash.o dec_and_lock.o \ nmi_backtrace.o win_minmax.o memcat_p.o \ - buildid.o + buildid.o objpool.o lib-$(CONFIG_PRINTK) += dump_stack.o lib-$(CONFIG_SMP) += cpumask.o diff --git a/lib/objpool.c b/lib/objpool.c new file mode 100644 index 000000000000..22e752371820 --- /dev/null +++ b/lib/objpool.c @@ -0,0 +1,338 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * objpool: ring-array based lockless MPMC/FIFO queues + * + * Copyright: wuqiang.matt@bytedance.com + */ + +#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot))) +#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \ + (sizeof(uint32_t) << (s)->bits))) +#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \ + ((sizeof(uint32_t) + sizeof(void *)) << (s)->bits))) +#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask) + +/* compute the suitable num of objects to be managed per slot */ +static int objpool_nobjs(int size) +{ + return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) / + (sizeof(uint32_t) + sizeof(void *))); +} + +/* allocate and initialize percpu slots */ +static int +objpool_init_percpu_slots(struct objpool_head *head, int nobjs, + void *context, objpool_init_obj_cb objinit) +{ + int i, j, n, size, objsz, cpu = 0, nents = head->capacity; + + /* aligned object size by sizeof(void *) */ + objsz = ALIGN(head->obj_size, sizeof(void *)); + /* shall we allocate objects along with percpu-slot */ + if (objsz) + head->flags |= OBJPOOL_HAVE_OBJECTS; + + /* vmalloc is used in default to allocate percpu-slots */ + if (!(head->gfp & GFP_ATOMIC)) + head->flags |= OBJPOOL_FROM_VMALLOC; + + for (i = 0; i < head->nr_cpus; i++) { + struct objpool_slot *os; + + /* skip the cpus which could never be present */ + if (!cpu_possible(i)) + continue; + + /* compute how many objects to be managed by this slot */ + 
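	/*
	 * For illustration: with nobjs = 10 and 4 possible CPUs, each slot
	 * starts from 10 / 4 = 2 objects, and the remaining 10 % 4 = 2
	 * objects go to the first two slots, giving a 3/3/2/2 split.
	 */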
n = nobjs / num_possible_cpus(); + if (cpu < (nobjs % num_possible_cpus())) + n++; + size = sizeof(struct objpool_slot) + sizeof(void *) * nents + + sizeof(uint32_t) * nents + objsz * n; + + /* + * here we allocate percpu-slot & objects together in a single + * allocation, taking advantage of warm caches and TLB hits as + * vmalloc always aligns the request size to pages + */ + if (head->flags & OBJPOOL_FROM_VMALLOC) + os = __vmalloc_node(size, sizeof(void *), head->gfp, + cpu_to_node(i), __builtin_return_address(0)); + else + os = kmalloc_node(size, head->gfp, cpu_to_node(i)); + if (!os) + return -ENOMEM; + + /* initialize percpu slot for the i-th slot */ + memset(os, 0, size); + os->bits = ilog2(head->capacity); + os->mask = head->capacity - 1; + head->cpu_slots[i] = os; + head->slot_sizes[i] = size; + cpu = cpu + 1; + + /* + * manually set head & tail to avoid possible conflict: + * We assume that the head item is ready for retrieval + * iff head is equal to ages[head & mask]. but ages is + * initialized as 0, so in view of the caller of pop(), + * the 1st item (0th) is always ready, but the reality + * could be: push() is stalled before the final update, + * thus the item being inserted will be lost forever + */ + os->head = os->tail = head->capacity; + + if (!objsz) + continue; + + for (j = 0; j < n; j++) { + uint32_t *ages = SLOT_AGES(os); + void **ents = SLOT_ENTS(os); + void *obj = SLOT_OBJS(os) + j * objsz; + uint32_t ie = os->tail & os->mask; + + /* perform object initialization */ + if (objinit) { + int rc = objinit(obj, context); + if (rc) + return rc; + } + + /* add obj into the ring array */ + ents[ie] = obj; + ages[ie] = os->tail; + os->tail++; + head->nr_objs++; + } + } + + return 0; +} + +/* cleanup all percpu slots of the object pool */ +static void objpool_fini_percpu_slots(struct objpool_head *head) +{ + int i; + + if (!head->cpu_slots) + return; + + for (i = 0; i < head->nr_cpus; i++) { + if (!head->cpu_slots[i]) + continue; + if (head->flags & OBJPOOL_FROM_VMALLOC) + vfree(head->cpu_slots[i]); + else + kfree(head->cpu_slots[i]); + } + kfree(head->cpu_slots); + head->cpu_slots = NULL; + head->slot_sizes = NULL; +} + +/* initialize object pool and pre-allocate objects */ +int objpool_init(struct objpool_head *head, int nr_objs, int object_size, + gfp_t gfp, void *context, objpool_init_obj_cb objinit, + objpool_fini_cb release) +{ + int nents, rc; + + /* check input parameters */ + if (nr_objs <= 0 || object_size <= 0) + return -EINVAL; + + /* calculate percpu slot size (rounded to pow of 2) */ + nents = max_t(int, roundup_pow_of_two(nr_objs), + objpool_nobjs(L1_CACHE_BYTES)); + + /* initialize objpool head */ + memset(head, 0, sizeof(struct objpool_head)); + head->nr_cpus = nr_cpu_ids; + head->obj_size = object_size; + head->capacity = nents; + head->gfp = gfp & ~__GFP_ZERO; + head->context = context; + head->release = release; + + /* allocate array for percpu slots */ + head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) + + head->nr_cpus * sizeof(int), head->gfp); + if (!head->cpu_slots) + return -ENOMEM; + head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus]; + + /* initialize per-cpu slots */ + rc = objpool_init_percpu_slots(head, nr_objs, context, objinit); + if (rc) + objpool_fini_percpu_slots(head); + else + refcount_set(&head->ref, nr_objs + 1); + + return rc; +} +EXPORT_SYMBOL_GPL(objpool_init); + +/* adding object to slot, abort if the slot was already full */ +static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os) +{ + uint32_t 
*ages = SLOT_AGES(os); + void **ents = SLOT_ENTS(os); + uint32_t head, tail; + + do { + /* perform memory loading for both head and tail */ + head = READ_ONCE(os->head); + tail = READ_ONCE(os->tail); + /* just abort if slot is full */ + if (tail - head > os->mask) + return -ENOENT; + /* try to extend tail by 1 using CAS to avoid races */ + if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1)) + break; + } while (1); + + /* the tail-th of slot is reserved for the given obj */ + WRITE_ONCE(ents[tail & os->mask], obj); + /* update epoch id to make this object available for pop() */ + smp_store_release(&ages[tail & os->mask], tail); + return 0; +} + +/* reclaim an object to object pool */ +int objpool_push(void *obj, struct objpool_head *oh) +{ + unsigned long flags; + int cpu, rc; + + /* disable local irq to avoid preemption & interruption */ + raw_local_irq_save(flags); + cpu = raw_smp_processor_id(); + do { + rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]); + if (!rc) + break; + cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1); + } while (1); + raw_local_irq_restore(flags); + + return rc; +} +EXPORT_SYMBOL_GPL(objpool_push); + +/* drop the allocated object, rather reclaim it to objpool */ +int objpool_drop(void *obj, struct objpool_head *head) +{ + if (!obj || !head) + return -EINVAL; + + if (refcount_dec_and_test(&head->ref)) { + objpool_free(head); + return 0; + } + + return -EAGAIN; +} +EXPORT_SYMBOL_GPL(objpool_drop); + +/* try to retrieve object from slot */ +static inline void *objpool_try_get_slot(struct objpool_slot *os) +{ + uint32_t *ages = SLOT_AGES(os); + void **ents = SLOT_ENTS(os); + /* do memory load of head to local head */ + uint32_t head = smp_load_acquire(&os->head); + + /* loop if slot isn't empty */ + while (head != READ_ONCE(os->tail)) { + uint32_t id = head & os->mask, prev = head; + + /* do prefetching of object ents */ + prefetch(&ents[id]); + + /* check whether this item was ready for retrieval */ + if (smp_load_acquire(&ages[id]) == head) { + /* node must have been udpated by push() */ + void *node = READ_ONCE(ents[id]); + /* commit and move forward head of the slot */ + if (try_cmpxchg_release(&os->head, &head, head + 1)) + return node; + /* head was already updated by others */ + } + + /* re-load head from memory and continue trying */ + head = READ_ONCE(os->head); + /* + * head stays unchanged, so it's very likely there's an + * ongoing push() on other cpu nodes but yet not update + * ages[] to mark it's completion + */ + if (head == prev) + break; + } + + return NULL; +} + +/* allocate an object from object pool */ +void *objpool_pop(struct objpool_head *head) +{ + unsigned long flags; + int i, cpu; + void *obj = NULL; + + /* disable local irq to avoid preemption & interruption */ + raw_local_irq_save(flags); + + cpu = raw_smp_processor_id(); + for (i = 0; i < num_possible_cpus(); i++) { + obj = objpool_try_get_slot(head->cpu_slots[cpu]); + if (obj) + break; + cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1); + } + raw_local_irq_restore(flags); + + return obj; +} +EXPORT_SYMBOL_GPL(objpool_pop); + +/* release whole objpool forcely */ +void objpool_free(struct objpool_head *head) +{ + if (!head->cpu_slots) + return; + + /* release percpu slots */ + objpool_fini_percpu_slots(head); + + /* call user's cleanup callback if provided */ + if (head->release) + head->release(head, head->context); +} +EXPORT_SYMBOL_GPL(objpool_free); + +/* drop unused objects and defref objpool for releasing */ +void objpool_fini(struct objpool_head *head) +{ + void *nod; 
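	/*
	 * The pool was created with a reference count of nr_objs + 1. Every
	 * object popped (and discarded) here drops one reference; the final
	 * iteration, where objpool_pop() returns NULL, drops the extra one.
	 * Objects still held by users keep the pool alive until they are
	 * returned via objpool_drop(), which frees the pool on the last
	 * reference.
	 */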
+
+	do {
+		/* grab object from objpool and drop it */
+		nod = objpool_pop(head);
+
+		/* drop the extra ref of objpool */
+		if (refcount_dec_and_test(&head->ref))
+			objpool_free(head);
+	} while (nod);
+}
+EXPORT_SYMBOL_GPL(objpool_fini);

From patchwork Tue Sep 5 01:52:52 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "wuqiang.matt"
X-Patchwork-Id: 13374778
From: "wuqiang.matt"
To: linux-trace-kernel@vger.kernel.org, mhiramat@kernel.org, davem@davemloft.net,
 anil.s.keshavamurthy@intel.com,
naveen.n.rao@linux.ibm.com, rostedt@goodmis.org, peterz@infradead.org, akpm@linux-foundation.org, sander@svanheule.net, ebiggers@google.com, dan.j.williams@intel.com, jpoimboe@kernel.org Cc: linux-kernel@vger.kernel.org, lkp@intel.com, mattwu@163.com, "wuqiang.matt" Subject: [PATCH v9 2/5] lib: objpool test module added Date: Tue, 5 Sep 2023 09:52:52 +0800 Message-Id: <20230905015255.81545-3-wuqiang.matt@bytedance.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230905015255.81545-1-wuqiang.matt@bytedance.com> References: <20230905015255.81545-1-wuqiang.matt@bytedance.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org The test_objpool module (test_objpool) will run several testcases for objpool stress and performance evaluation. Each testcase will have all available cpu cores involved to create a situation of high parallel and high contention. As of now there are 5 groups and 5 * 2 testcases in total: 1) group 1: synchronous mode objpool is managed synchronously, that is, all objects are to be reclaimed before objpool finalization and the objpool owner makes sure of it. All threads on different cores run in the same pace 2) group 2: synchronous mode + hrtimer this case have 2 customers: normal threads and hrtimer softirqs 3) group 3: synchronous + overrun mode This test group is mainly for performance evaluation of missing cases when pre-allocated objects are less than the requested 4) group 4: asynchronous mode This case is just an emulation of kretprobe, with refcount used to control the objpool lifecycle 5) group 5: asynchronous mode with hrtimer hrtimer softirq is introduced to stress async objpool operations Signed-off-by: wuqiang.matt --- lib/Kconfig.debug | 11 + lib/Makefile | 2 + lib/test_objpool.c | 689 +++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 702 insertions(+) create mode 100644 lib/test_objpool.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index d6798513a8c2..6598604cf6c8 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2931,6 +2931,17 @@ config TEST_CLOCKSOURCE_WATCHDOG If unsure, say N. +config TEST_OBJPOOL + tristate "Test module for correctness and stress of objpool" + default n + depends on m && DEBUG_KERNEL + help + This builds the "test_objpool" module that should be used for + correctness verification and concurrent testings of objects + allocation and reclamation. + + If unsure, say N. + endif # RUNTIME_TESTING_MENU config ARCH_USE_MEMTEST diff --git a/lib/Makefile b/lib/Makefile index 7a84c922d9ff..19b936f2af1c 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -106,6 +106,8 @@ obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o obj-$(CONFIG_TEST_REF_TRACKER) += test_ref_tracker.o CFLAGS_test_fprobe.o += $(CC_FLAGS_FTRACE) obj-$(CONFIG_FPROBE_SANITY_TEST) += test_fprobe.o +obj-$(CONFIG_TEST_OBJPOOL) += test_objpool.o + # # CFLAGS for compiling floating point code inside the kernel. 
x86/Makefile turns # off the generation of FPU/SSE* instructions for kernel proper but FPU_FLAGS diff --git a/lib/test_objpool.c b/lib/test_objpool.c new file mode 100644 index 000000000000..d329472f8ab6 --- /dev/null +++ b/lib/test_objpool.c @@ -0,0 +1,689 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Test module for lockless object pool + * + * Copyright: wuqiang.matt@bytedance.com + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define OT_NR_MAX_BULK (16) + +/* memory usage */ +struct ot_mem_stat { + atomic_long_t alloc; + atomic_long_t free; +}; + +/* object allocation results */ +struct ot_obj_stat { + unsigned long nhits; + unsigned long nmiss; +}; + +/* control & results per testcase */ +struct ot_data { + struct rw_semaphore start; + struct completion wait; + struct completion rcu; + atomic_t nthreads ____cacheline_aligned_in_smp; + atomic_t stop ____cacheline_aligned_in_smp; + struct ot_mem_stat kmalloc; + struct ot_mem_stat vmalloc; + struct ot_obj_stat objects; + u64 duration; +}; + +/* testcase */ +struct ot_test { + int async; /* synchronous or asynchronous */ + int mode; /* only mode 0 supported */ + int objsz; /* object size */ + int duration; /* ms */ + int delay; /* ms */ + int bulk_normal; + int bulk_irq; + unsigned long hrtimer; /* ms */ + const char *name; + struct ot_data data; +}; + +/* per-cpu worker */ +struct ot_item { + struct objpool_head *pool; /* pool head */ + struct ot_test *test; /* test parameters */ + + void (*worker)(struct ot_item *item, int irq); + + /* hrtimer control */ + ktime_t hrtcycle; + struct hrtimer hrtimer; + + int bulk[2]; /* for thread and irq */ + int delay; + u32 niters; + + /* summary per thread */ + struct ot_obj_stat stat[2]; /* thread and irq */ + u64 duration; +}; + +/* + * memory leakage checking + */ + +static void *ot_kzalloc(struct ot_test *test, long size) +{ + void *ptr = kzalloc(size, GFP_KERNEL); + + if (ptr) + atomic_long_add(size, &test->data.kmalloc.alloc); + return ptr; +} + +static void ot_kfree(struct ot_test *test, void *ptr, long size) +{ + if (!ptr) + return; + atomic_long_add(size, &test->data.kmalloc.free); + kfree(ptr); +} + +static void ot_mem_report(struct ot_test *test) +{ + long alloc, free; + + pr_info("memory allocation summary for %s\n", test->name); + + alloc = atomic_long_read(&test->data.kmalloc.alloc); + free = atomic_long_read(&test->data.kmalloc.free); + pr_info(" kmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free); + + alloc = atomic_long_read(&test->data.vmalloc.alloc); + free = atomic_long_read(&test->data.vmalloc.free); + pr_info(" vmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free); +} + +/* user object instance */ +struct ot_node { + void *owner; + unsigned long data; + unsigned long refs; + unsigned long payload[32]; +}; + +/* user objpool manager */ +struct ot_context { + struct objpool_head pool; /* objpool head */ + struct ot_test *test; /* test parameters */ + void *ptr; /* user pool buffer */ + unsigned long size; /* buffer size */ + struct rcu_head rcu; +}; + +static DEFINE_PER_CPU(struct ot_item, ot_pcup_items); + +static int ot_init_data(struct ot_data *data) +{ + memset(data, 0, sizeof(*data)); + init_rwsem(&data->start); + init_completion(&data->wait); + init_completion(&data->rcu); + atomic_set(&data->nthreads, 1); + + return 0; +} + +static int ot_init_node(void *nod, void *context) +{ + struct ot_context *sop = context; + struct ot_node *on = nod; + + 
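	/*
	 * objinit callback for the test pool: invoked once per object when
	 * the pool is created; it only records the owning pool in each node.
	 */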
on->owner = &sop->pool; + return 0; +} + +static enum hrtimer_restart ot_hrtimer_handler(struct hrtimer *hrt) +{ + struct ot_item *item = container_of(hrt, struct ot_item, hrtimer); + struct ot_test *test = item->test; + + if (atomic_read_acquire(&test->data.stop)) + return HRTIMER_NORESTART; + + /* do bulk-testings for objects pop/push */ + item->worker(item, 1); + + hrtimer_forward(hrt, hrt->base->get_time(), item->hrtcycle); + return HRTIMER_RESTART; +} + +static void ot_start_hrtimer(struct ot_item *item) +{ + if (!item->test->hrtimer) + return; + hrtimer_start(&item->hrtimer, item->hrtcycle, HRTIMER_MODE_REL); +} + +static void ot_stop_hrtimer(struct ot_item *item) +{ + if (!item->test->hrtimer) + return; + hrtimer_cancel(&item->hrtimer); +} + +static int ot_init_hrtimer(struct ot_item *item, unsigned long hrtimer) +{ + struct hrtimer *hrt = &item->hrtimer; + + if (!hrtimer) + return -ENOENT; + + item->hrtcycle = ktime_set(0, hrtimer * 1000000UL); + hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + hrt->function = ot_hrtimer_handler; + return 0; +} + +static int ot_init_cpu_item(struct ot_item *item, + struct ot_test *test, + struct objpool_head *pool, + void (*worker)(struct ot_item *, int)) +{ + memset(item, 0, sizeof(*item)); + item->pool = pool; + item->test = test; + item->worker = worker; + + item->bulk[0] = test->bulk_normal; + item->bulk[1] = test->bulk_irq; + item->delay = test->delay; + + /* initialize hrtimer */ + ot_init_hrtimer(item, item->test->hrtimer); + return 0; +} + +static int ot_thread_worker(void *arg) +{ + struct ot_item *item = arg; + struct ot_test *test = item->test; + ktime_t start; + + atomic_inc(&test->data.nthreads); + down_read(&test->data.start); + up_read(&test->data.start); + start = ktime_get(); + ot_start_hrtimer(item); + do { + if (atomic_read_acquire(&test->data.stop)) + break; + /* do bulk-testings for objects pop/push */ + item->worker(item, 0); + } while (!kthread_should_stop()); + ot_stop_hrtimer(item); + item->duration = (u64) ktime_us_delta(ktime_get(), start); + if (atomic_dec_and_test(&test->data.nthreads)) + complete(&test->data.wait); + + return 0; +} + +static void ot_perf_report(struct ot_test *test, u64 duration) +{ + struct ot_obj_stat total, normal = {0}, irq = {0}; + int cpu, nthreads = 0; + + pr_info("\n"); + pr_info("Testing summary for %s\n", test->name); + + for_each_possible_cpu(cpu) { + struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu); + if (!item->duration) + continue; + normal.nhits += item->stat[0].nhits; + normal.nmiss += item->stat[0].nmiss; + irq.nhits += item->stat[1].nhits; + irq.nmiss += item->stat[1].nmiss; + pr_info("CPU: %d duration: %lluus\n", cpu, item->duration); + pr_info("\tthread:\t%16lu hits \t%16lu miss\n", + item->stat[0].nhits, item->stat[0].nmiss); + pr_info("\tirq: \t%16lu hits \t%16lu miss\n", + item->stat[1].nhits, item->stat[1].nmiss); + pr_info("\ttotal: \t%16lu hits \t%16lu miss\n", + item->stat[0].nhits + item->stat[1].nhits, + item->stat[0].nmiss + item->stat[1].nmiss); + nthreads++; + } + + total.nhits = normal.nhits + irq.nhits; + total.nmiss = normal.nmiss + irq.nmiss; + + pr_info("ALL: \tnthreads: %d duration: %lluus\n", nthreads, duration); + pr_info("SUM: \t%16lu hits \t%16lu miss\n", + total.nhits, total.nmiss); + + test->data.objects = total; + test->data.duration = duration; +} + +/* + * synchronous test cases for objpool manipulation + */ + +/* objpool manipulation for synchronous mode (percpu objpool) */ +static struct ot_context *ot_init_sync_m0(struct ot_test *test) +{ 
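	/*
	 * Mode-0 synchronous initializer: the test pool is sized to 8 objects
	 * per possible CPU (num_possible_cpus() << 3); the WARN_ON() after
	 * objpool_init() checks that exactly that many objects were
	 * pre-allocated.
	 */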
+ struct ot_context *sop = NULL; + int max = num_possible_cpus() << 3; + + sop = (struct ot_context *)ot_kzalloc(test, sizeof(*sop)); + if (!sop) + return NULL; + sop->test = test; + + if (objpool_init(&sop->pool, max, test->objsz, + GFP_KERNEL, sop, ot_init_node, NULL)) { + ot_kfree(test, sop, sizeof(*sop)); + return NULL; + } + WARN_ON(max != sop->pool.nr_objs); + + return sop; +} + +static void ot_fini_sync(struct ot_context *sop) +{ + objpool_fini(&sop->pool); + ot_kfree(sop->test, sop, sizeof(*sop)); +} + +struct { + struct ot_context * (*init)(struct ot_test *oc); + void (*fini)(struct ot_context *sop); +} g_ot_sync_ops[] = { + {.init = ot_init_sync_m0, .fini = ot_fini_sync}, +}; + +/* + * synchronous test cases: performance mode + */ + +static void ot_bulk_sync(struct ot_item *item, int irq) +{ + struct ot_node *nods[OT_NR_MAX_BULK]; + int i; + + for (i = 0; i < item->bulk[irq]; i++) + nods[i] = objpool_pop(item->pool); + + if (!irq && (item->delay || !(++(item->niters) & 0x7FFF))) + msleep(item->delay); + + while (i-- > 0) { + struct ot_node *on = nods[i]; + if (on) { + on->refs++; + objpool_push(on, item->pool); + item->stat[irq].nhits++; + } else { + item->stat[irq].nmiss++; + } + } +} + +static int ot_start_sync(struct ot_test *test) +{ + struct ot_context *sop; + ktime_t start; + u64 duration; + unsigned long timeout; + int cpu; + + /* initialize objpool for syncrhonous testcase */ + sop = g_ot_sync_ops[test->mode].init(test); + if (!sop) + return -ENOMEM; + + /* grab rwsem to block testing threads */ + down_write(&test->data.start); + + for_each_possible_cpu(cpu) { + struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu); + struct task_struct *work; + + ot_init_cpu_item(item, test, &sop->pool, ot_bulk_sync); + + /* skip offline cpus */ + if (!cpu_online(cpu)) + continue; + + work = kthread_create_on_node(ot_thread_worker, item, + cpu_to_node(cpu), "ot_worker_%d", cpu); + if (IS_ERR(work)) { + pr_err("failed to create thread for cpu %d\n", cpu); + } else { + kthread_bind(work, cpu); + wake_up_process(work); + } + } + + /* wait a while to make sure all threads waiting at start line */ + msleep(20); + + /* in case no threads were created: memory insufficient ? 
*/ + if (atomic_dec_and_test(&test->data.nthreads)) + complete(&test->data.wait); + + // sched_set_fifo_low(current); + + /* start objpool testing threads */ + start = ktime_get(); + up_write(&test->data.start); + + /* yeild cpu to worker threads for duration ms */ + timeout = msecs_to_jiffies(test->duration); + schedule_timeout_interruptible(timeout); + + /* tell workers threads to quit */ + atomic_set_release(&test->data.stop, 1); + + /* wait all workers threads finish and quit */ + wait_for_completion(&test->data.wait); + duration = (u64) ktime_us_delta(ktime_get(), start); + + /* cleanup objpool */ + g_ot_sync_ops[test->mode].fini(sop); + + /* report testing summary and performance results */ + ot_perf_report(test, duration); + + /* report memory allocation summary */ + ot_mem_report(test); + + return 0; +} + +/* + * asynchronous test cases: pool lifecycle controlled by refcount + */ + +static void ot_fini_async_rcu(struct rcu_head *rcu) +{ + struct ot_context *sop = container_of(rcu, struct ot_context, rcu); + struct ot_test *test = sop->test; + + /* here all cpus are aware of the stop event: test->data.stop = 1 */ + WARN_ON(!atomic_read_acquire(&test->data.stop)); + + objpool_fini(&sop->pool); + complete(&test->data.rcu); +} + +static void ot_fini_async(struct ot_context *sop) +{ + /* make sure the stop event is acknowledged by all cores */ + call_rcu(&sop->rcu, ot_fini_async_rcu); +} + +static int ot_objpool_release(struct objpool_head *head, void *context) +{ + struct ot_context *sop = context; + + WARN_ON(!head || !sop || head != &sop->pool); + + /* do context cleaning if needed */ + if (sop) + ot_kfree(sop->test, sop, sizeof(*sop)); + + return 0; +} + +static struct ot_context *ot_init_async_m0(struct ot_test *test) +{ + struct ot_context *sop = NULL; + int max = num_possible_cpus() << 3; + + sop = (struct ot_context *)ot_kzalloc(test, sizeof(*sop)); + if (!sop) + return NULL; + sop->test = test; + + if (objpool_init(&sop->pool, max, test->objsz, GFP_KERNEL, + sop, ot_init_node, ot_objpool_release)) { + ot_kfree(test, sop, sizeof(*sop)); + return NULL; + } + WARN_ON(max != sop->pool.nr_objs); + + return sop; +} + +struct { + struct ot_context * (*init)(struct ot_test *oc); + void (*fini)(struct ot_context *sop); +} g_ot_async_ops[] = { + {.init = ot_init_async_m0, .fini = ot_fini_async}, +}; + +static void ot_nod_recycle(struct ot_node *on, struct objpool_head *pool, + int release) +{ + struct ot_context *sop; + + on->refs++; + + if (!release) { + /* push object back to opjpool for reuse */ + objpool_push(on, pool); + return; + } + + sop = container_of(pool, struct ot_context, pool); + WARN_ON(sop != pool->context); + + /* unref objpool with nod removed forever */ + objpool_drop(on, pool); +} + +static void ot_bulk_async(struct ot_item *item, int irq) +{ + struct ot_test *test = item->test; + struct ot_node *nods[OT_NR_MAX_BULK]; + int i, stop; + + for (i = 0; i < item->bulk[irq]; i++) + nods[i] = objpool_pop(item->pool); + + if (!irq) { + if (item->delay || !(++(item->niters) & 0x7FFF)) + msleep(item->delay); + get_cpu(); + } + + stop = atomic_read_acquire(&test->data.stop); + + /* drop all objects and deref objpool */ + while (i-- > 0) { + struct ot_node *on = nods[i]; + + if (on) { + on->refs++; + ot_nod_recycle(on, item->pool, stop); + item->stat[irq].nhits++; + } else { + item->stat[irq].nmiss++; + } + } + + if (!irq) + put_cpu(); +} + +static int ot_start_async(struct ot_test *test) +{ + struct ot_context *sop; + ktime_t start; + u64 duration; + unsigned long timeout; + int 
cpu; + + /* initialize objpool for syncrhonous testcase */ + sop = g_ot_async_ops[test->mode].init(test); + if (!sop) + return -ENOMEM; + + /* grab rwsem to block testing threads */ + down_write(&test->data.start); + + for_each_possible_cpu(cpu) { + struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu); + struct task_struct *work; + + ot_init_cpu_item(item, test, &sop->pool, ot_bulk_async); + + /* skip offline cpus */ + if (!cpu_online(cpu)) + continue; + + work = kthread_create_on_node(ot_thread_worker, item, + cpu_to_node(cpu), "ot_worker_%d", cpu); + if (IS_ERR(work)) { + pr_err("failed to create thread for cpu %d\n", cpu); + } else { + kthread_bind(work, cpu); + wake_up_process(work); + } + } + + /* wait a while to make sure all threads waiting at start line */ + msleep(20); + + /* in case no threads were created: memory insufficient ? */ + if (atomic_dec_and_test(&test->data.nthreads)) + complete(&test->data.wait); + + /* start objpool testing threads */ + start = ktime_get(); + up_write(&test->data.start); + + /* yeild cpu to worker threads for duration ms */ + timeout = msecs_to_jiffies(test->duration); + schedule_timeout_interruptible(timeout); + + /* tell workers threads to quit */ + atomic_set_release(&test->data.stop, 1); + + /* do async-finalization */ + g_ot_async_ops[test->mode].fini(sop); + + /* wait all workers threads finish and quit */ + wait_for_completion(&test->data.wait); + duration = (u64) ktime_us_delta(ktime_get(), start); + + /* assure rcu callback is triggered */ + wait_for_completion(&test->data.rcu); + + /* + * now we are sure that objpool is finalized either + * by rcu callback or by worker threads + */ + + /* report testing summary and performance results */ + ot_perf_report(test, duration); + + /* report memory allocation summary */ + ot_mem_report(test); + + return 0; +} + +/* + * predefined testing cases: + * synchronous case / overrun case / async case + * + * async: synchronous or asynchronous testing + * mode: only mode 0 supported + * duration: int, total test time in ms + * delay: int, delay (in ms) between each iteration + * bulk_normal: int, repeat times for thread worker + * bulk_irq: int, repeat times for irq consumer + * hrtimer: unsigned long, hrtimer intervnal in ms + * name: char *, tag for current test ot_item + */ + +#define NODE_COMPACT sizeof(struct ot_node) +#define NODE_VMALLOC (512) + +struct ot_test g_testcases[] = { + + /* sync & normal */ + {0, 0, NODE_COMPACT, 1000, 0, 1, 0, 0, "sync: percpu objpool"}, + {0, 0, NODE_VMALLOC, 1000, 0, 1, 0, 0, "sync: percpu objpool from vmalloc"}, + + /* sync & hrtimer */ + {0, 0, NODE_COMPACT, 1000, 0, 1, 1, 4, "sync & hrtimer: percpu objpool"}, + {0, 0, NODE_VMALLOC, 1000, 0, 1, 1, 4, "sync & hrtimer: percpu objpool from vmalloc"}, + + /* sync & overrun */ + {0, 0, NODE_COMPACT, 1000, 0, 16, 0, 0, "sync overrun: percpu objpool"}, + {0, 0, NODE_VMALLOC, 1000, 0, 16, 0, 0, "sync overrun: percpu objpool from vmalloc"}, + + /* async mode */ + {1, 0, NODE_COMPACT, 1000, 0, 1, 0, 0, "async: percpu objpool"}, + {1, 0, NODE_VMALLOC, 1000, 0, 1, 0, 0, "async: percpu objpool from vmalloc"}, + + /* async + hrtimer mode */ + {1, 0, NODE_COMPACT, 1000, 0, 4, 4, 4, "async & hrtimer: percpu objpool"}, + {1, 0, NODE_VMALLOC, 1000, 0, 4, 4, 4, "async & hrtimer: percpu objpool from vmalloc"}, +}; + +static int __init ot_mod_init(void) +{ + int i; + + /* perform testings */ + for (i = 0; i < ARRAY_SIZE(g_testcases); i++) { + ot_init_data(&g_testcases[i].data); + if (g_testcases[i].async) + 
ot_start_async(&g_testcases[i]);
+		else
+			ot_start_sync(&g_testcases[i]);
+	}
+
+	/* show tests summary */
+	pr_info("\n");
+	pr_info("Summary of testcases:\n");
+	for (i = 0; i < ARRAY_SIZE(g_testcases); i++) {
+		pr_info(" duration: %lluus \thits: %10lu \tmiss: %10lu \t%s\n",
+			g_testcases[i].data.duration, g_testcases[i].data.objects.nhits,
+			g_testcases[i].data.objects.nmiss, g_testcases[i].name);
+	}
+
+	return -EAGAIN;
+}
+
+static void __exit ot_mod_exit(void)
+{
+}
+
+module_init(ot_mod_init);
+module_exit(ot_mod_exit);
+
+MODULE_LICENSE("GPL");
\ No newline at end of file

From patchwork Tue Sep 5 01:52:53 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "wuqiang.matt"
X-Patchwork-Id: 13374779
Received: from devz1.bytedance.net ([203.208.167.146]) by
smtp.gmail.com with ESMTPSA id y5-20020aa78045000000b0064d74808738sm7910483pfm.214.2023.09.04.18.53.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 04 Sep 2023 18:53:23 -0700 (PDT) From: "wuqiang.matt" To: linux-trace-kernel@vger.kernel.org, mhiramat@kernel.org, davem@davemloft.net, anil.s.keshavamurthy@intel.com, naveen.n.rao@linux.ibm.com, rostedt@goodmis.org, peterz@infradead.org, akpm@linux-foundation.org, sander@svanheule.net, ebiggers@google.com, dan.j.williams@intel.com, jpoimboe@kernel.org Cc: linux-kernel@vger.kernel.org, lkp@intel.com, mattwu@163.com, "wuqiang.matt" Subject: [PATCH v9 3/5] kprobes: kretprobe scalability improvement with objpool Date: Tue, 5 Sep 2023 09:52:53 +0800 Message-Id: <20230905015255.81545-4-wuqiang.matt@bytedance.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230905015255.81545-1-wuqiang.matt@bytedance.com> References: <20230905015255.81545-1-wuqiang.matt@bytedance.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org kretprobe is using freelist to manage return-instances, but freelist, as LIFO queue based on singly linked list, scales badly and reduces the overall throughput of kretprobed routines, especially for high contention scenarios. Here's a typical throughput test of sys_prctl (counts in 10 seconds, measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl): OS: Debian 10 X86_64, Linux 6.5rc7 with freelist HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s 1T 2T 4T 8T 16T 24T 24150045 29317964 15446741 12494489 18287272 18287272 32T 48T 64T 72T 96T 128T 16200682 1620068 11645677 11269858 10470118 9931051 This patch introduces objpool to kretprobe and rethook, with orginal freelist replaced and brings near-linear scalability to kretprobed routines. Tests of kretprobe throughput show the biggest ratio as 166.7x of the original freelist. Here's the comparison: 1T 2T 4T 8T 16T native: 41186213 82336866 164250978 328662645 658810299 freelist: 24150045 29317964 15446741 12494489 18287272 objpool: 24663700 49410571 98544977 197725940 396294399 32T 48T 64T 96T 128T native: 1330338351 1969957941 2512291791 1514690434 2671040914 freelist: 16200682 13737658 11645677 10470118 9931051 objpool: 78673470 1177354156 1514690434 1604472348 1655086824 Tests on 96-core ARM64 system output similarly, but with the biggest ratio up to 454.8x: OS: Debian 10 AARCH64, Linux 6.5rc7 HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s 1T 2T 4T 8T 16T native: . 
30066096 13733813 126194076 257447289 505800181 freelist: 16152090 11064397 11124068 7215768 5663013 objpool: 13733813 27749031 56540679 112291770 223482778 24T 32T 48T 64T 96T native: 763305277 1015925192 1521075123 2033009392 3021013752 freelist: 5015810 4602893 3766792 3382478 2945292 objpool: 334605663 448310646 675018951 903449904 1339693418 Signed-off-by: wuqiang.matt Acked-by: Masami Hiramatsu (Google) --- include/linux/kprobes.h | 11 ++--- include/linux/rethook.h | 16 ++----- kernel/kprobes.c | 93 +++++++++++++++++------------------------ kernel/trace/fprobe.c | 32 ++++++-------- kernel/trace/rethook.c | 90 ++++++++++++++++++--------------------- 5 files changed, 98 insertions(+), 144 deletions(-) diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h index 85a64cb95d75..365eb092e9c4 100644 --- a/include/linux/kprobes.h +++ b/include/linux/kprobes.h @@ -26,8 +26,7 @@ #include #include #include -#include -#include +#include #include #include @@ -141,7 +140,7 @@ static inline bool kprobe_ftrace(struct kprobe *p) */ struct kretprobe_holder { struct kretprobe *rp; - refcount_t ref; + struct objpool_head pool; }; struct kretprobe { @@ -154,7 +153,6 @@ struct kretprobe { #ifdef CONFIG_KRETPROBE_ON_RETHOOK struct rethook *rh; #else - struct freelist_head freelist; struct kretprobe_holder *rph; #endif }; @@ -165,10 +163,7 @@ struct kretprobe_instance { #ifdef CONFIG_KRETPROBE_ON_RETHOOK struct rethook_node node; #else - union { - struct freelist_node freelist; - struct rcu_head rcu; - }; + struct rcu_head rcu; struct llist_node llist; struct kretprobe_holder *rph; kprobe_opcode_t *ret_addr; diff --git a/include/linux/rethook.h b/include/linux/rethook.h index 26b6f3c81a76..ce69b2b7bc35 100644 --- a/include/linux/rethook.h +++ b/include/linux/rethook.h @@ -6,11 +6,10 @@ #define _LINUX_RETHOOK_H #include -#include +#include #include #include #include -#include struct rethook_node; @@ -30,14 +29,12 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, unsigned long, struct rethook { void *data; rethook_handler_t handler; - struct freelist_head pool; - refcount_t ref; + struct objpool_head pool; struct rcu_head rcu; }; /** * struct rethook_node - The rethook shadow-stack entry node. - * @freelist: The freelist, linked to struct rethook::pool. * @rcu: The rcu_head for deferred freeing. * @llist: The llist, linked to a struct task_struct::rethooks. * @rethook: The pointer to the struct rethook. @@ -48,20 +45,16 @@ struct rethook { * on each entry of the shadow stack. 
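With this change a rethook user no longer allocates and registers nodes one by one: the node pool is sized once, when the rethook itself is allocated. A brief sketch of the new call, using a hypothetical node type (names are illustrative, not from this patch):

	struct my_ret_node {
		struct rethook_node node;	/* embedded first, as kretprobe and fprobe do */
		unsigned long cookie;		/* per-entry user data */
	};

	rh = rethook_alloc(my_data, my_ret_handler,
			   sizeof(struct my_ret_node), num_nodes);
	if (IS_ERR(rh))
		return PTR_ERR(rh);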
*/ struct rethook_node { - union { - struct freelist_node freelist; - struct rcu_head rcu; - }; + struct rcu_head rcu; struct llist_node llist; struct rethook *rethook; unsigned long ret_addr; unsigned long frame; }; -struct rethook *rethook_alloc(void *data, rethook_handler_t handler); +struct rethook *rethook_alloc(void *data, rethook_handler_t handler, int size, int num); void rethook_stop(struct rethook *rh); void rethook_free(struct rethook *rh); -void rethook_add_node(struct rethook *rh, struct rethook_node *node); struct rethook_node *rethook_try_get(struct rethook *rh); void rethook_recycle(struct rethook_node *node); void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount); @@ -98,4 +91,3 @@ void rethook_flush_task(struct task_struct *tk); #endif #endif - diff --git a/kernel/kprobes.c b/kernel/kprobes.c index ca385b61d546..075a632e6c7c 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1877,13 +1877,27 @@ static struct notifier_block kprobe_exceptions_nb = { #ifdef CONFIG_KRETPROBES #if !defined(CONFIG_KRETPROBE_ON_RETHOOK) + +/* callbacks for objpool of kretprobe instances */ +static int kretprobe_init_inst(void *nod, void *context) +{ + struct kretprobe_instance *ri = nod; + + ri->rph = context; + return 0; +} +static int kretprobe_fini_pool(struct objpool_head *head, void *context) +{ + kfree(context); + return 0; +} + static void free_rp_inst_rcu(struct rcu_head *head) { struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu); + struct kretprobe_holder *rph = ri->rph; - if (refcount_dec_and_test(&ri->rph->ref)) - kfree(ri->rph); - kfree(ri); + objpool_drop(ri, &rph->pool); } NOKPROBE_SYMBOL(free_rp_inst_rcu); @@ -1892,7 +1906,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri) struct kretprobe *rp = get_kretprobe(ri); if (likely(rp)) - freelist_add(&ri->freelist, &rp->freelist); + objpool_push(ri, &rp->rph->pool); else call_rcu(&ri->rcu, free_rp_inst_rcu); } @@ -1929,23 +1943,12 @@ NOKPROBE_SYMBOL(kprobe_flush_task); static inline void free_rp_inst(struct kretprobe *rp) { - struct kretprobe_instance *ri; - struct freelist_node *node; - int count = 0; - - node = rp->freelist.head; - while (node) { - ri = container_of(node, struct kretprobe_instance, freelist); - node = node->next; - - kfree(ri); - count++; - } + struct kretprobe_holder *rph = rp->rph; - if (refcount_sub_and_test(count, &rp->rph->ref)) { - kfree(rp->rph); - rp->rph = NULL; - } + if (!rph) + return; + rp->rph = NULL; + objpool_fini(&rph->pool); } /* This assumes the 'tsk' is the current task or the is not running. 
*/ @@ -2087,19 +2090,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler) static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs) { struct kretprobe *rp = container_of(p, struct kretprobe, kp); + struct kretprobe_holder *rph = rp->rph; struct kretprobe_instance *ri; - struct freelist_node *fn; - fn = freelist_try_get(&rp->freelist); - if (!fn) { + ri = objpool_pop(&rph->pool); + if (!ri) { rp->nmissed++; return 0; } - ri = container_of(fn, struct kretprobe_instance, freelist); - if (rp->entry_handler && rp->entry_handler(ri, regs)) { - freelist_add(&ri->freelist, &rp->freelist); + objpool_push(ri, &rph->pool); return 0; } @@ -2193,7 +2194,6 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o int register_kretprobe(struct kretprobe *rp) { int ret; - struct kretprobe_instance *inst; int i; void *addr; @@ -2227,20 +2227,12 @@ int register_kretprobe(struct kretprobe *rp) rp->maxactive = max_t(unsigned int, 10, 2*num_possible_cpus()); #ifdef CONFIG_KRETPROBE_ON_RETHOOK - rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler); - if (!rp->rh) - return -ENOMEM; + rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler, + sizeof(struct kretprobe_instance) + + rp->data_size, rp->maxactive); + if (IS_ERR(rp->rh)) + return PTR_ERR(rp->rh); - for (i = 0; i < rp->maxactive; i++) { - inst = kzalloc(sizeof(struct kretprobe_instance) + - rp->data_size, GFP_KERNEL); - if (inst == NULL) { - rethook_free(rp->rh); - rp->rh = NULL; - return -ENOMEM; - } - rethook_add_node(rp->rh, &inst->node); - } rp->nmissed = 0; /* Establish function entry probe point */ ret = register_kprobe(&rp->kp); @@ -2249,25 +2241,18 @@ int register_kretprobe(struct kretprobe *rp) rp->rh = NULL; } #else /* !CONFIG_KRETPROBE_ON_RETHOOK */ - rp->freelist.head = NULL; rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL); if (!rp->rph) return -ENOMEM; - rp->rph->rp = rp; - for (i = 0; i < rp->maxactive; i++) { - inst = kzalloc(sizeof(struct kretprobe_instance) + - rp->data_size, GFP_KERNEL); - if (inst == NULL) { - refcount_set(&rp->rph->ref, i); - free_rp_inst(rp); - return -ENOMEM; - } - inst->rph = rp->rph; - freelist_add(&inst->freelist, &rp->freelist); + if (objpool_init(&rp->rph->pool, rp->maxactive, rp->data_size + + sizeof(struct kretprobe_instance), GFP_KERNEL, + rp->rph, kretprobe_init_inst, kretprobe_fini_pool)) { + kfree(rp->rph); + rp->rph = NULL; + return -ENOMEM; } - refcount_set(&rp->rph->ref, i); - + rp->rph->rp = rp; rp->nmissed = 0; /* Establish function entry probe point */ ret = register_kprobe(&rp->kp); diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c index 3b21f4063258..f5bf98e6b2ac 100644 --- a/kernel/trace/fprobe.c +++ b/kernel/trace/fprobe.c @@ -187,9 +187,9 @@ static void fprobe_init(struct fprobe *fp) static int fprobe_init_rethook(struct fprobe *fp, int num) { - int i, size; + int size; - if (num < 0) + if (num <= 0) return -EINVAL; if (!fp->exit_handler) { @@ -202,29 +202,21 @@ static int fprobe_init_rethook(struct fprobe *fp, int num) size = fp->nr_maxactive; else size = num * num_possible_cpus() * 2; - if (size < 0) + if (size <= 0) return -E2BIG; - fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler); - if (!fp->rethook) - return -ENOMEM; - for (i = 0; i < size; i++) { - struct fprobe_rethook_node *node; - - node = kzalloc(sizeof(*node) + fp->entry_data_size, GFP_KERNEL); - if (!node) { - rethook_free(fp->rethook); - fp->rethook = NULL; - return -ENOMEM; - } - rethook_add_node(fp->rethook, &node->node); - } + /* 
Initialize rethook */ + fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler, + sizeof(struct fprobe_rethook_node), size); + if (IS_ERR(fp->rethook)) + return PTR_ERR(fp->rethook); + return 0; } static void fprobe_fail_cleanup(struct fprobe *fp) { - if (fp->rethook) { + if (!IS_ERR_OR_NULL(fp->rethook)) { /* Don't need to cleanup rethook->handler because this is not used. */ rethook_free(fp->rethook); fp->rethook = NULL; @@ -379,14 +371,14 @@ int unregister_fprobe(struct fprobe *fp) if (!fprobe_is_registered(fp)) return -EINVAL; - if (fp->rethook) + if (!IS_ERR_OR_NULL(fp->rethook)) rethook_stop(fp->rethook); ret = unregister_ftrace_function(&fp->ops); if (ret < 0) return ret; - if (fp->rethook) + if (!IS_ERR_OR_NULL(fp->rethook)) rethook_free(fp->rethook); ftrace_free_filter(&fp->ops); diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c index 5eb9b598f4e9..13c8e6773892 100644 --- a/kernel/trace/rethook.c +++ b/kernel/trace/rethook.c @@ -9,6 +9,7 @@ #include #include #include +#include /* Return hook list (shadow stack by list) */ @@ -36,21 +37,7 @@ void rethook_flush_task(struct task_struct *tk) static void rethook_free_rcu(struct rcu_head *head) { struct rethook *rh = container_of(head, struct rethook, rcu); - struct rethook_node *rhn; - struct freelist_node *node; - int count = 1; - - node = rh->pool.head; - while (node) { - rhn = container_of(node, struct rethook_node, freelist); - node = node->next; - kfree(rhn); - count++; - } - - /* The rh->ref is the number of pooled node + 1 */ - if (refcount_sub_and_test(count, &rh->ref)) - kfree(rh); + objpool_fini(&rh->pool); } /** @@ -83,54 +70,62 @@ void rethook_free(struct rethook *rh) call_rcu(&rh->rcu, rethook_free_rcu); } +static int rethook_init_node(void *nod, void *context) +{ + struct rethook_node *node = nod; + + node->rethook = context; + return 0; +} + +static int rethook_fini_pool(struct objpool_head *head, void *context) +{ + kfree(context); + return 0; +} + /** * rethook_alloc() - Allocate struct rethook. * @data: a data to pass the @handler when hooking the return. - * @handler: the return hook callback function. + * @handler: the return hook callback function, must NOT be NULL + * @size: node size: rethook node and additional data + * @num: number of rethook nodes to be preallocated * * Allocate and initialize a new rethook with @data and @handler. - * Return NULL if memory allocation fails or @handler is NULL. + * Return pointer of new rethook, or error codes for failures. + * * Note that @handler == NULL means this rethook is going to be freed. */ -struct rethook *rethook_alloc(void *data, rethook_handler_t handler) +struct rethook *rethook_alloc(void *data, rethook_handler_t handler, + int size, int num) { - struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL); + struct rethook *rh; - if (!rh || !handler) { - kfree(rh); - return NULL; - } + if (!handler || num <= 0 || size < sizeof(struct rethook_node)) + return ERR_PTR(-EINVAL); + + rh = kzalloc(sizeof(struct rethook), GFP_KERNEL); + if (!rh) + return ERR_PTR(-ENOMEM); rh->data = data; rh->handler = handler; - rh->pool.head = NULL; - refcount_set(&rh->ref, 1); + /* initialize the objpool for rethook nodes */ + if (objpool_init(&rh->pool, num, size, GFP_KERNEL, rh, + rethook_init_node, rethook_fini_pool)) { + kfree(rh); + return ERR_PTR(-ENOMEM); + } return rh; } -/** - * rethook_add_node() - Add a new node to the rethook. - * @rh: the struct rethook. - * @node: the struct rethook_node to be added. - * - * Add @node to @rh. 
User must allocate @node (as a part of user's - * data structure.) The @node fields are initialized in this function. - */ -void rethook_add_node(struct rethook *rh, struct rethook_node *node) -{ - node->rethook = rh; - freelist_add(&node->freelist, &rh->pool); - refcount_inc(&rh->ref); -} - static void free_rethook_node_rcu(struct rcu_head *head) { struct rethook_node *node = container_of(head, struct rethook_node, rcu); + struct rethook *rh = node->rethook; - if (refcount_dec_and_test(&node->rethook->ref)) - kfree(node->rethook); - kfree(node); + objpool_drop(node, &rh->pool); } /** @@ -145,7 +140,7 @@ void rethook_recycle(struct rethook_node *node) lockdep_assert_preemption_disabled(); if (likely(READ_ONCE(node->rethook->handler))) - freelist_add(&node->freelist, &node->rethook->pool); + objpool_push(node, &node->rethook->pool); else call_rcu(&node->rcu, free_rethook_node_rcu); } @@ -161,7 +156,6 @@ NOKPROBE_SYMBOL(rethook_recycle); struct rethook_node *rethook_try_get(struct rethook *rh) { rethook_handler_t handler = READ_ONCE(rh->handler); - struct freelist_node *fn; lockdep_assert_preemption_disabled(); @@ -178,11 +172,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) if (unlikely(!rcu_is_watching())) return NULL; - fn = freelist_try_get(&rh->pool); - if (!fn) - return NULL; - - return container_of(fn, struct rethook_node, freelist); + return (struct rethook_node *)objpool_pop(&rh->pool); } NOKPROBE_SYMBOL(rethook_try_get); From patchwork Tue Sep 5 01:52:54 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "wuqiang.matt" X-Patchwork-Id: 13374783 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1B264CA0FE2 for ; Tue, 5 Sep 2023 16:19:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229714AbjIEQTU (ORCPT ); Tue, 5 Sep 2023 12:19:20 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48944 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S245176AbjIEBxd (ORCPT ); Mon, 4 Sep 2023 21:53:33 -0400 Received: from mail-pf1-x431.google.com (mail-pf1-x431.google.com [IPv6:2607:f8b0:4864:20::431]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BD258CDA for ; Mon, 4 Sep 2023 18:53:29 -0700 (PDT) Received: by mail-pf1-x431.google.com with SMTP id d2e1a72fcca58-68c0cb00fb3so1641732b3a.2 for ; Mon, 04 Sep 2023 18:53:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1693878809; x=1694483609; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=DeCI1NwP8+gNkfLZ//g6QG7eHPqu4x99wp4QGWvVdhI=; b=NOx9AcKliZNOePnhEiaoAeWZJglS5Zj/1K2YXz/9Fg4EQk2nDPru3bIpizeRqqOqin G9inMTbuHUff7P5qQhD5HPHxYNQle+N26loc/WBRpZjiI4YgplFB9Y1AaStfKuOxM2l7 BmfCEozjeqY+Izqf5d3Ab1JJ8F647qLIvu1JDeDSseTBeAcSz6Z0lWyHRtUj9okAs/H6 8UnFut9fi0DZ3vVTIJBA+PbnBAPNVCOviCo0IJeNJ+L2C6zVnresm6nPLvVnzhulcFdN cf//31qm01CWbQ2irrImEvA3UCSzT2xHZxw28lwrFb+wa9zZeSx+DYZabDmPVxl+yMXR nopA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693878809; x=1694483609; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc 
:subject:date:message-id:reply-to; bh=DeCI1NwP8+gNkfLZ//g6QG7eHPqu4x99wp4QGWvVdhI=; b=AJTrPT7vM6F/YQn4dieHzzwRgLWWWLxP3FzsRa24u1k4XvYHTmNFBaQlIedi0w5Xpz Q0IPtZTx50+XSv7ILX+wEIDXIo+JZ3U6dCS37Sxo+u1mX+HjwezKTwuci78kUiX/c/NC vJKWDyNJGffNsaQMw/0/2aDpEud1Cx9dvFpEuzsvd2lLcpQaapBshqbL5DouqHoSdEUQ bU6o+6oWaIo1H+oC/xj93ZInIHQPgmLXFJVUZsF6aEfUIXC2wqxPfkmh/fGQ18IOeuU6 5shJAtACptJjZW4/jPDnBwPV7X0EIqvkWkH0RsCXAAK46cQRhDdWpRtrtuydRxoHmqq5 FTwA== X-Gm-Message-State: AOJu0YzqM6AuwUugm1SRJx2bHR6rB22kT5Ix4dR1lrAMZC/ZpmR2AfeA VMcY7a3S85GmS6SDuwcPdvJRtfD2ni8cSH8jvrM= X-Google-Smtp-Source: AGHT+IEb+MdC5QmuX1vP1E88PNYHkXtGqyNFHPovzxjNaejFjyx3gru9y+Ck6zxKDFbmkmp/puqYMA== X-Received: by 2002:a05:6a21:7905:b0:14d:7b6:cf2f with SMTP id bg5-20020a056a21790500b0014d07b6cf2fmr13014056pzc.47.1693878809047; Mon, 04 Sep 2023 18:53:29 -0700 (PDT) Received: from devz1.bytedance.net ([203.208.167.146]) by smtp.gmail.com with ESMTPSA id y5-20020aa78045000000b0064d74808738sm7910483pfm.214.2023.09.04.18.53.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 04 Sep 2023 18:53:28 -0700 (PDT) From: "wuqiang.matt" To: linux-trace-kernel@vger.kernel.org, mhiramat@kernel.org, davem@davemloft.net, anil.s.keshavamurthy@intel.com, naveen.n.rao@linux.ibm.com, rostedt@goodmis.org, peterz@infradead.org, akpm@linux-foundation.org, sander@svanheule.net, ebiggers@google.com, dan.j.williams@intel.com, jpoimboe@kernel.org Cc: linux-kernel@vger.kernel.org, lkp@intel.com, mattwu@163.com, "wuqiang.matt" Subject: [PATCH v9 4/5] kprobes: freelist.h removed Date: Tue, 5 Sep 2023 09:52:54 +0800 Message-Id: <20230905015255.81545-5-wuqiang.matt@bytedance.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230905015255.81545-1-wuqiang.matt@bytedance.com> References: <20230905015255.81545-1-wuqiang.matt@bytedance.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org This patch will remove freelist.h from kernel source tree, since the only use cases (kretprobe and rethook) are converted to objpool. Signed-off-by: wuqiang.matt --- include/linux/freelist.h | 129 --------------------------------------- 1 file changed, 129 deletions(-) delete mode 100644 include/linux/freelist.h diff --git a/include/linux/freelist.h b/include/linux/freelist.h deleted file mode 100644 index fc1842b96469..000000000000 --- a/include/linux/freelist.h +++ /dev/null @@ -1,129 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */ -#ifndef FREELIST_H -#define FREELIST_H - -#include - -/* - * Copyright: cameron@moodycamel.com - * - * A simple CAS-based lock-free free list. Not the fastest thing in the world - * under heavy contention, but simple and correct (assuming nodes are never - * freed until after the free list is destroyed), and fairly speedy under low - * contention. - * - * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists - */ - -struct freelist_node { - atomic_t refs; - struct freelist_node *next; -}; - -struct freelist_head { - struct freelist_node *head; -}; - -#define REFS_ON_FREELIST 0x80000000 -#define REFS_MASK 0x7FFFFFFF - -static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list) -{ - /* - * Since the refcount is zero, and nobody can increase it once it's - * zero (except us, and we run only one copy of this method per node at - * a time, i.e. 
the single thread case), then we know we can safely - * change the next pointer of the node; however, once the refcount is - * back above zero, then other threads could increase it (happens under - * heavy contention, when the refcount goes to zero in between a load - * and a refcount increment of a node in try_get, then back up to - * something non-zero, then the refcount increment is done by the other - * thread) -- so if the CAS to add the node to the actual list fails, - * decrese the refcount and leave the add operation to the next thread - * who puts the refcount back to zero (which could be us, hence the - * loop). - */ - struct freelist_node *head = READ_ONCE(list->head); - - for (;;) { - WRITE_ONCE(node->next, head); - atomic_set_release(&node->refs, 1); - - if (!try_cmpxchg_release(&list->head, &head, node)) { - /* - * Hmm, the add failed, but we can only try again when - * the refcount goes back to zero. - */ - if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1) - continue; - } - return; - } -} - -static inline void freelist_add(struct freelist_node *node, struct freelist_head *list) -{ - /* - * We know that the should-be-on-freelist bit is 0 at this point, so - * it's safe to set it using a fetch_add. - */ - if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) { - /* - * Oh look! We were the last ones referencing this node, and we - * know we want to add it to the free list, so let's do it! - */ - __freelist_add(node, list); - } -} - -static inline struct freelist_node *freelist_try_get(struct freelist_head *list) -{ - struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head); - unsigned int refs; - - while (head) { - prev = head; - refs = atomic_read(&head->refs); - if ((refs & REFS_MASK) == 0 || - !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) { - head = smp_load_acquire(&list->head); - continue; - } - - /* - * Good, reference count has been incremented (it wasn't at - * zero), which means we can read the next and not worry about - * it changing between now and the time we do the CAS. - */ - next = READ_ONCE(head->next); - if (try_cmpxchg_acquire(&list->head, &head, next)) { - /* - * Yay, got the node. This means it was on the list, - * which means should-be-on-freelist must be false no - * matter the refcount (because nobody else knows it's - * been taken off yet, it can't have been put back on). - */ - WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST); - - /* - * Decrease refcount twice, once for our ref, and once - * for the list's ref. - */ - atomic_fetch_add(-2, &head->refs); - - return head; - } - - /* - * OK, the head must have changed on us, but we still need to decrement - * the refcount we increased. 
- */ - refs = atomic_fetch_add(-1, &prev->refs); - if (refs == REFS_ON_FREELIST + 1) - __freelist_add(prev, list); - } - - return NULL; -} - -#endif /* FREELIST_H */ From patchwork Tue Sep 5 01:52:55 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "wuqiang.matt" X-Patchwork-Id: 13374776 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4F6EECA0FF3 for ; Tue, 5 Sep 2023 16:19:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233405AbjIEQTE (ORCPT ); Tue, 5 Sep 2023 12:19:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48972 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S245226AbjIEBx5 (ORCPT ); Mon, 4 Sep 2023 21:53:57 -0400 Received: from mail-pf1-x433.google.com (mail-pf1-x433.google.com [IPv6:2607:f8b0:4864:20::433]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 34E9DCC6 for ; Mon, 4 Sep 2023 18:53:35 -0700 (PDT) Received: by mail-pf1-x433.google.com with SMTP id d2e1a72fcca58-68c0d262933so1056297b3a.0 for ; Mon, 04 Sep 2023 18:53:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1693878814; x=1694483614; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=eW5Gw9lfx88ca1Dfq2MtXvfhvw7cUu9QhwY8xI1fswY=; b=cEkFZlAiFFAfCFNylE/JG5GVfeqbg8cHop9ks5qJIEzA8r8kq8JdNYI3/ER0mTd39A RaxPQKt1+UmvZvh1JOOxGI0CUrkjFG0uhs199x3YNjVFIj5iEhzAdzga+cJqNIQBLS6c sl2srVJLM/bKHh6sfSLdvrqOApck6ioJxhpsoGlxnD71CQd2e/S3mBa7D3n4u4xbdDF3 vuw3c0RRGWOYcIADnRD+aNG6S+WZCEpQRw5dIRxzchIHHqz6ti/ZlS1vwO3mSgSmRL24 BLgijAdcPL1tAdTjS1+6oM+0kMr7sPGNbxBghKHSPwampYewpe69Mo4GHYISrMY+MQuv oHsQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693878814; x=1694483614; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=eW5Gw9lfx88ca1Dfq2MtXvfhvw7cUu9QhwY8xI1fswY=; b=kOzgybI9sQC40eAeCYognoEqz07rZDmQksh+7+wXVFgL0rtIXVKf914AUsyqj3W6Tc 2sYwP0+cChV2HP8qZrBHq/aKymKFrmTb28evqSyBPPWeE+0RX1it0wVd73sQxyTuMAC6 HpmUWymIEq1GkGJV2fAC58AzRnGlnn5CaatokM7NHEhIDUJSD/zcR5TcwmuY5XrX1KoD R1B6A0FRNIdmsElPyjl1FMFZQWiKdX0pPAj8+6v+Nrj1Gb2qKFjtbAUNryBfR4dLu1wb Hk5LK4J2sHjn2m2aUQ/eJTDhjdIgxsrJ1g8wb5O+LUWnWlOtJ8W6DmbIgN5w20/LzUd8 1YoA== X-Gm-Message-State: AOJu0Yya0dy3sWQ+u80pbsPl+SlRYZ/7SaFFy2rhWVPNvuHXluPujsKy +Xwxugb2y5xcGZD4+VTF53bGIDo1D3N1ehrwRR8= X-Google-Smtp-Source: AGHT+IFPRLMPPittNPbLARa/xaabWtmV32rvjPcjRdHbe3f9dKJ55stslpiC2Fy94Jw1DqmXP4jCjw== X-Received: by 2002:a05:6a00:844:b0:682:b6c8:2eb with SMTP id q4-20020a056a00084400b00682b6c802ebmr10352183pfk.1.1693878814666; Mon, 04 Sep 2023 18:53:34 -0700 (PDT) Received: from devz1.bytedance.net ([203.208.167.146]) by smtp.gmail.com with ESMTPSA id y5-20020aa78045000000b0064d74808738sm7910483pfm.214.2023.09.04.18.53.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 04 Sep 2023 18:53:34 -0700 (PDT) From: "wuqiang.matt" To: linux-trace-kernel@vger.kernel.org, mhiramat@kernel.org, davem@davemloft.net, anil.s.keshavamurthy@intel.com, naveen.n.rao@linux.ibm.com, rostedt@goodmis.org, peterz@infradead.org, 
akpm@linux-foundation.org, sander@svanheule.net, ebiggers@google.com, dan.j.williams@intel.com, jpoimboe@kernel.org Cc: linux-kernel@vger.kernel.org, lkp@intel.com, mattwu@163.com, "wuqiang.matt" Subject: [PATCH v9 5/5] MAINTAINERS: objpool added Date: Tue, 5 Sep 2023 09:52:55 +0800 Message-Id: <20230905015255.81545-6-wuqiang.matt@bytedance.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20230905015255.81545-1-wuqiang.matt@bytedance.com> References: <20230905015255.81545-1-wuqiang.matt@bytedance.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-trace-kernel@vger.kernel.org objpool, a scalable and lockless ring-array based object pool, was introduced to replace the original freelist (a LIFO queue based on singly linked list) to improve kretprobe scalability. Signed-off-by: wuqiang.matt --- MAINTAINERS | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 48abe1a281f2..1c0a38d763a2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -15328,6 +15328,13 @@ F: include/linux/objagg.h F: lib/objagg.c F: lib/test_objagg.c +OBJPOOL +M: Matt Wu +S: Supported +F: include/linux/objpool.h +F: lib/objpool.c +F: lib/test_objpool.c + OBJTOOL M: Josh Poimboeuf M: Peter Zijlstra
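For reference, the conversions in this series all follow the same objpool lifecycle: preallocate objects with objpool_init(), hand them out with objpool_pop(), return them with objpool_push(), discard objects that must not go back with objpool_drop(), and tear the pool down with objpool_fini(). The sketch below is a minimal, hypothetical illustration of that pattern using only the API calls and callback signatures visible in the diffs above; the demo_* names, the 16-object pool size, and the 64-byte context allocation are made up for the example and are not part of the series.

/* Illustrative sketch only -- not part of this patch series. */
#include <linux/objpool.h>
#include <linux/slab.h>

/* hypothetical per-object layout, analogous to struct rethook_node */
struct demo_node {
	void		*owner;		/* filled in once by the init callback */
	unsigned long	cookie;		/* per-use scratch data */
};

/* run once for every preallocated object, like rethook_init_node() above */
static int demo_init_node(void *nod, void *context)
{
	struct demo_node *node = nod;

	node->owner = context;
	return 0;
}

/* run when the pool's last reference is gone, like rethook_fini_pool() above */
static int demo_fini_pool(struct objpool_head *head, void *context)
{
	kfree(context);
	return 0;
}

static int demo_objpool_usage(void)
{
	struct objpool_head pool;
	struct demo_node *node;
	void *context;
	int ret;

	context = kzalloc(64, GFP_KERNEL);
	if (!context)
		return -ENOMEM;

	/* preallocate 16 objects; 'context' is handed to every callback */
	ret = objpool_init(&pool, 16, sizeof(struct demo_node), GFP_KERNEL,
			   context, demo_init_node, demo_fini_pool);
	if (ret) {
		kfree(context);
		return ret;
	}

	/* fast path: grab an object, use it, hand it back */
	node = objpool_pop(&pool);
	if (node) {
		node->cookie++;
		objpool_push(node, &pool);
	}

	/*
	 * An object that must not return to the pool (e.g. because the pool
	 * is being torn down) is dropped instead, as free_rp_inst_rcu() and
	 * free_rethook_node_rcu() do above.
	 */
	node = objpool_pop(&pool);
	if (node)
		objpool_drop(node, &pool);

	/* release the pool; demo_fini_pool() then frees 'context' */
	objpool_fini(&pool);
	return 0;
}

Note also that rethook_alloc() now returns ERR_PTR() codes on failure rather than NULL, so callers must test the result with IS_ERR() (or IS_ERR_OR_NULL() on the cleanup paths) and propagate PTR_ERR() instead of assuming -ENOMEM, exactly as the fprobe changes above do.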