Message ID | 20230619143231.222536-3-houtao@huaweicloud.com (mailing list archive)
---|---
State | RFC
Delegated to: | BPF
Series | Handle immediate reuse in bpf memory allocator
On Mon, Jun 19, 2023 at 7:00 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> +static void bpf_rcu_gp_acc_work(struct callback_head *head)
> +{
> +	struct bpf_rcu_gp_acc_ctx *ctx = container_of(head, struct bpf_rcu_gp_acc_ctx, work);
> +
> +	local_irq_disable();
> +	rcu_momentary_dyntick_idle();
> +	local_irq_enable();

We discussed this with Paul off-line and decided to go a different route.
Paul prepared a patch for us to expose rcu_request_urgent_qs_task().
I'll be sending the series later this week.
On Tue, Jun 20, 2023 at 09:15:03AM -0700, Alexei Starovoitov wrote:
> On Mon, Jun 19, 2023 at 7:00 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > +static void bpf_rcu_gp_acc_work(struct callback_head *head)
> > +{
> > +	struct bpf_rcu_gp_acc_ctx *ctx = container_of(head, struct bpf_rcu_gp_acc_ctx, work);
> > +
> > +	local_irq_disable();
> > +	rcu_momentary_dyntick_idle();
> > +	local_irq_enable();
>
> We discussed this with Paul off-line and decided to go a different route.
> Paul prepared a patch for us to expose rcu_request_urgent_qs_task().
> I'll be sending the series later this week.

Just for completeness, this patch is in -rcu here and as shown below:

b1edaa1e6364 ("rcu: Export rcu_request_urgent_qs_task()")

							Thanx, Paul

------------------------------------------------------------------------

commit b1edaa1e6364caf9833a30b5dd7599080b6806b8
Author: Paul E. McKenney <paulmck@kernel.org>
Date:   Thu Jun 8 16:31:49 2023 -0700

    rcu: Export rcu_request_urgent_qs_task()

    If a CPU is executing a long series of non-sleeping system calls,
    RCU grace periods can be delayed for on the order of a couple hundred
    milliseconds.  This is normally not a problem, but if each system call
    does a call_rcu(), those callbacks can stack up.  RCU will eventually
    notice this callback storm, but use of rcu_request_urgent_qs_task()
    allows the code invoking call_rcu() to give RCU a heads up.

    This function is not for general use, not yet, anyway.

    Reported-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 7f17acf29dda..7b949292908a 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -138,6 +138,8 @@
 	return 0;
 }
 
+static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
+
 /*
  * Take advantage of the fact that there is only one CPU, which
  * allows us to ignore virtualization-based context switches.
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 56bccb5a8fde..126f6b418f6a 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -21,6 +21,7 @@ void rcu_softirq_qs(void);
 void rcu_note_context_switch(bool preempt);
 int rcu_needs_cpu(void);
 void rcu_cpu_stall_reset(void);
+void rcu_request_urgent_qs_task(struct task_struct *t);
 
 /*
  * Note a virtualization-based context switch.  This is simply a
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 9829d8161b21..0c99ec19c8e0 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -493,7 +493,6 @@ static inline void rcu_expedite_gp(void) { }
 static inline void rcu_unexpedite_gp(void) { }
 static inline void rcu_async_hurry(void) { }
 static inline void rcu_async_relax(void) { }
-static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
 #else /* #ifdef CONFIG_TINY_RCU */
 bool rcu_gp_is_normal(void);     /* Internal RCU use. */
 bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
@@ -514,7 +513,6 @@ struct task_struct *get_rcu_tasks_rude_gp_kthread(void);
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
 static inline void show_rcu_tasks_gp_kthreads(void) {}
 #endif /* #else #ifdef CONFIG_TASKS_RCU_GENERIC */
-void rcu_request_urgent_qs_task(struct task_struct *t);
 #endif /* #else #ifdef CONFIG_TINY_RCU */
 
 #define RCU_SCHEDULER_INACTIVE	0
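[Editor's note: as a rough illustration of the intended use described in the commit message above, a minimal sketch is shown below. It is not taken from any posted patch; the helper name, the parameter, and the 128 threshold are made up for this example.]

/* Hypothetical example: a path that queues many call_rcu() callbacks from
 * a task that rarely sleeps can give RCU a heads up.  The helper name,
 * its parameter, and the 128 threshold are illustrative only.
 */
#include <linux/rcupdate.h>
#include <linux/sched.h>

static void example_hint_rcu(unsigned long queued_cbs)
{
	if (queued_cbs < 128)
		return;

	/* Ask RCU to get a quiescent state out of the current task soon,
	 * so the backlog of call_rcu() callbacks can be invoked.
	 */
	rcu_request_urgent_qs_task(current);
}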
Hi,

On 6/21/2023 12:15 AM, Alexei Starovoitov wrote:
> On Mon, Jun 19, 2023 at 7:00 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> +static void bpf_rcu_gp_acc_work(struct callback_head *head)
>> +{
>> +	struct bpf_rcu_gp_acc_ctx *ctx = container_of(head, struct bpf_rcu_gp_acc_ctx, work);
>> +
>> +	local_irq_disable();
>> +	rcu_momentary_dyntick_idle();
>> +	local_irq_enable();
> We discussed this with Paul off-line and decided to go a different route.

"A different route" means the method used to reduce the memory footprint
is different or the method to do reuse-after-rcu-gp is different ?

> Paul prepared a patch for us to expose rcu_request_urgent_qs_task().
> I'll be sending the series later this week.

Do you plan to take over the reuse-after-rcu-gp patchset ?

I did a quick test, it showed that the memory footprint is smaller and
the performance is similar when using rcu_request_urgent_qs_task()
instead of rcu_momentary_dyntick_idle().

overwrite:
Summary: per-prod-op 27.25 ± 0.08k/s, memory usage 29.52 ± 1.60MiB, peak memory usage 37.84MiB
batch_add_batch_del:
Summary: per-prod-op 45.84 ± 0.29k/s, memory usage 24.18 ± 1.44MiB, peak memory usage 30.38MiB
add_del_on_diff_cpu:
Summary: per-prod-op 13.09 ± 0.33k/s, memory usage 32.48 ± 3.51MiB, peak memory usage 53.54MiB
On Tue, Jun 20, 2023 at 6:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 6/21/2023 12:15 AM, Alexei Starovoitov wrote:
> > On Mon, Jun 19, 2023 at 7:00 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> +static void bpf_rcu_gp_acc_work(struct callback_head *head)
> >> +{
> >> +	struct bpf_rcu_gp_acc_ctx *ctx = container_of(head, struct bpf_rcu_gp_acc_ctx, work);
> >> +
> >> +	local_irq_disable();
> >> +	rcu_momentary_dyntick_idle();
> >> +	local_irq_enable();
> > We discussed this with Paul off-line and decided to go a different route.
> "A different route" means the method used to reduce the memory footprint
> is different or the method to do reuse-after-rcu-gp is different ?

Pretty much everything is different.

> > Paul prepared a patch for us to expose rcu_request_urgent_qs_task().
> > I'll be sending the series later this week.
> Do you plan to take over the reuse-after-rcu-gp patchset ?

I took a different approach.
It will be easier to discuss when I post patches.
Hopefully later today or tomorrow.

> I did a quick test, it showed that the memory footprint is smaller and
> the performance is similar when using rcu_request_urgent_qs_task()
> instead of rcu_momentary_dyntick_idle().

Right. I saw a similar effect as well.
My understanding is that rcu_momentary_dyntick_idle() is a heavier mechanism
and absolutely should not be used while holding rcu_read_lock().
So calling from irq_work is not ok.
Only kworker, as you did, is ok from safety pov.
Still not recommended in general. Hence rcu_request_urgent_qs_task.
Hi,

On 6/21/2023 9:20 AM, Alexei Starovoitov wrote:
> On Tue, Jun 20, 2023 at 6:07 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 6/21/2023 12:15 AM, Alexei Starovoitov wrote:
>>> On Mon, Jun 19, 2023 at 7:00 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> +static void bpf_rcu_gp_acc_work(struct callback_head *head)
>>>> +{
>>>> +	struct bpf_rcu_gp_acc_ctx *ctx = container_of(head, struct bpf_rcu_gp_acc_ctx, work);
>>>> +
>>>> +	local_irq_disable();
>>>> +	rcu_momentary_dyntick_idle();
>>>> +	local_irq_enable();
>>> We discussed this with Paul off-line and decided to go a different route.
>> "A different route" means the method used to reduce the memory footprint
>> is different or the method to do reuse-after-rcu-gp is different ?
> Pretty much everything is different.

I see.

>>> Paul prepared a patch for us to expose rcu_request_urgent_qs_task().
>>> I'll be sending the series later this week.
>> Do you plan to take over the reuse-after-rcu-gp patchset ?
> I took a different approach.
> It will be easier to discuss when I post patches.
> Hopefully later today or tomorrow.

Looking forward to the new approach.

>> I did a quick test, it showed that the memory footprint is smaller and
>> the performance is similar when using rcu_request_urgent_qs_task()
>> instead of rcu_momentary_dyntick_idle().
> Right. I saw a similar effect as well.
> My understanding is that rcu_momentary_dyntick_idle() is a heavier mechanism
> and absolutely should not be used while holding rcu_read_lock().
> So calling from irq_work is not ok.
> Only kworker, as you did, is ok from safety pov.

It is not a kworker; it is task work which runs when the process returns
back to userspace.

> Still not recommended in general. Hence rcu_request_urgent_qs_task.

Thanks for the explanation.
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 9b31c53fd285..c4b4cae04400 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -6,6 +6,7 @@
 #include <linux/irq_work.h>
 #include <linux/bpf_mem_alloc.h>
 #include <linux/memcontrol.h>
+#include <linux/task_work.h>
 #include <asm/local.h>
 
 /* Any context (including NMI) BPF specific memory allocator.
@@ -123,6 +124,16 @@ struct bpf_reuse_batch {
 	struct rcu_head rcu;
 };
 
+#define BPF_GP_ACC_RUNNING 0
+
+struct bpf_rcu_gp_acc_ctx {
+	unsigned long flags;
+	unsigned long next_run;
+	struct callback_head work;
+};
+
+static DEFINE_PER_CPU(struct bpf_rcu_gp_acc_ctx, bpf_acc_ctx);
+
 static struct llist_node notrace *__llist_del_first(struct llist_head *head)
 {
 	struct llist_node *entry, *next;
@@ -347,12 +358,47 @@ static void dyn_reuse_rcu(struct rcu_head *rcu)
 	kfree(batch);
 }
 
+static void bpf_rcu_gp_acc_work(struct callback_head *head)
+{
+	struct bpf_rcu_gp_acc_ctx *ctx = container_of(head, struct bpf_rcu_gp_acc_ctx, work);
+
+	local_irq_disable();
+	rcu_momentary_dyntick_idle();
+	local_irq_enable();
+
+	/* The interval between rcu_momentary_dyntick_idle() calls is
+	 * at least 10ms.
+	 */
+	WRITE_ONCE(ctx->next_run, jiffies + msecs_to_jiffies(10));
+	clear_bit(BPF_GP_ACC_RUNNING, &ctx->flags);
+}
+
+static void bpf_mem_rcu_gp_acc(struct bpf_mem_cache *c)
+{
+	struct bpf_rcu_gp_acc_ctx *ctx = this_cpu_ptr(&bpf_acc_ctx);
+
+	if (atomic_read(&c->dyn_reuse_rcu_cnt) < 128 ||
+	    time_before(jiffies, READ_ONCE(ctx->next_run)))
+		return;
+
+	if ((current->flags & PF_KTHREAD) ||
+	    test_and_set_bit(BPF_GP_ACC_RUNNING, &ctx->flags))
+		return;
+
+	init_task_work(&ctx->work, bpf_rcu_gp_acc_work);
+	/* Task is exiting ? */
+	if (task_work_add(current, &ctx->work, TWA_RESUME))
+		clear_bit(BPF_GP_ACC_RUNNING, &ctx->flags);
+}
+
 static void reuse_bulk(struct bpf_mem_cache *c)
 {
 	struct llist_node *head, *tail;
 	struct bpf_reuse_batch *batch;
 	unsigned long flags;
 
+	bpf_mem_rcu_gp_acc(c);
+
 	head = llist_del_all(&c->free_llist_extra);
 	tail = head;
 	while (tail && tail->next)
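[Editor's note: for reference, a rough sketch of the kind of swap discussed in the thread above, with rcu_request_urgent_qs_task() replacing the rcu_momentary_dyntick_idle() task work. This is illustrative only, reusing the patch's own struct, threshold, and 10ms throttle; it is not the series Alexei posted and may differ from what Hou benchmarked.]

/* Illustrative sketch only: keep the trigger condition and throttle from
 * the patch above, but hint RCU directly instead of scheduling task work
 * that calls rcu_momentary_dyntick_idle().
 */
static void bpf_mem_rcu_gp_acc(struct bpf_mem_cache *c)
{
	struct bpf_rcu_gp_acc_ctx *ctx = this_cpu_ptr(&bpf_acc_ctx);

	if (atomic_read(&c->dyn_reuse_rcu_cnt) < 128 ||
	    time_before(jiffies, READ_ONCE(ctx->next_run)))
		return;

	/* Give RCU a heads up that this task should pass through a
	 * quiescent state soon so queued callbacks can be invoked;
	 * no task work or IRQ toggling is required.
	 */
	rcu_request_urgent_qs_task(current);
	WRITE_ONCE(ctx->next_run, jiffies + msecs_to_jiffies(10));
}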