[RFC,0/4] Add hazard pointers to kernel

Message ID	20240917143402.930114-1-boqun.feng@gmail.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> Feedback-ID: iad51458e:Fastmail From: Boqun Feng <boqun.feng@gmail.com> To: linux-kernel@vger.kernel.org, rcu@vger.kernel.org, linux-mm@kvack.org, lkmm@vger.kernel.org Cc: "Paul E. McKenney" <paulmck@kernel.org>, Frederic Weisbecker <frederic@kernel.org>, Neeraj Upadhyay <neeraj.upadhyay@kernel.org>, Joel Fernandes <joel@joelfernandes.org>, Josh Triplett <josh@joshtriplett.org>, Boqun Feng <boqun.feng@gmail.com>, Uladzislau Rezki <urezki@gmail.com>, Steven Rostedt <rostedt@goodmis.org>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Lai Jiangshan <jiangshanlai@gmail.com>, Zqiang <qiang.zhang1211@gmail.com>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Will Deacon <will@kernel.org>, Waiman Long <longman@redhat.com>, Mark Rutland <mark.rutland@arm.com>, Thomas Gleixner <tglx@linutronix.de>, Kent Overstreet <kent.overstreet@gmail.com>, Linus Torvalds <torvalds@linux-foundation.org>, Vlastimil Babka <vbabka@suse.cz>, maged.michael@gmail.com Subject: [RFC PATCH 0/4] Add hazard pointers to kernel Date: Tue, 17 Sep 2024 07:33:58 -0700 Message-ID: <20240917143402.930114-1-boqun.feng@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Add hazard pointers to kernel \| expand [RFC,0/4] Add hazard pointers to kernel [RFC,1/4] hazptr: Add initial implementation of hazard pointers [RFC,2/4] refscale: Add benchmarks for hazptr [RFC,3/4] refscale: Add benchmarks for percpu_ref [RFC,4/4] WIP: hazptr: Add hazptr test sample

Boqun Feng Sept. 17, 2024, 2:33 p.m. UTC

Hi,

This series introduces hazard pointers [1] to kernel space. A TL;DR
description of hazard pointers is "a scalable refcounting mechanim
with RCU-like API". More information can be found at [2].

The problem we are trying to resolve here is refcount scalability
issues that cannot be resolved simply by RCU or SRCU (maybe due to the
requirement of an unbound protect duration). Neeraj has tried it in the
scalability issue[3] he has been working on, and he will share more
information in our LPC session [4] (and I will update in the list for
those who cannot make it to the session later).

My micro-benchmark shows the hazard pointers provide very good
scalability on par with percpu_ref/RCU/SRCU on the reader side:

(refscale in x86_64 + PREEMPT=y, avg reader duration in ns)
nreaders	1		8		32
percpu_ref	6.95123		10.0869		8.9674
rcu		2.97923		3.243		3.55077
hazptr		8.5991		8.40443		8.5762
srcu		16.7754  	22.4807		20.2406

Things that we know are currently not working:

*	Handling module unload, probably needs a hazptr_barrier()
	similar to rcu_barrier().

*	rcutorture support should be added to catch potential bugs (esp.
	for callback handling).

*	Improvement for updater side performance, currently all
	callbacks are handled in one work, this can be improved by using
	multiple work_structs or threads.

Of course, I might create some bugs, so please take a look. Also love to
hear anything on the current API. Any feedback is welcome!

Patch #1 is the implemenation of hazptr, Paul and Neeraj contributed a
lot, but all bugs are mine ;-)

Patch 2-3 add micro-benchmarks for hazptr and percpu_ref.

Patch #4 is a simple test I've used for development, I put it here just
in case someone wants to give a quick try, eventually, we need to add
hazptr to rcutorture (or has its own torture) for more testing.

Regards,
Boqun

[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
     lock-free objects," in IEEE Transactions on Parallel and
     Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
[2]: https://docs.google.com/document/d/113WFjGlAW4m72xNbZWHUSE-yU2HIJnWpiXp91ShtgeE/
[3]: https://lore.kernel.org/lkml/20240916050811.473556-1-Neeraj.Upadhyay@amd.com/
[4]: https://lpc.events/event/18/contributions/1731/
[5]: Herlihy, Maurice, Victor Luchangco, and Mark Moir. "The repeat
     offender problem: A mechanism for supporting dynamic-sized,
     lock-free data structures." International Symposium on Distributed
     Computing. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002.

Boqun Feng (4):
  hazptr: Add initial implementation of hazard pointers
  refscale: Add benchmarks for hazptr
  refscale: Add benchmarks for percpu_ref
  WIP: hazptr: Add hazptr test sample

 include/linux/hazptr.h       |  83 +++++++
 kernel/Makefile              |   1 +
 kernel/hazptr.c              | 463 +++++++++++++++++++++++++++++++++++
 kernel/rcu/refscale.c        | 127 +++++++++-
 samples/Kconfig              |   6 +
 samples/Makefile             |   1 +
 samples/hazptr/hazptr_test.c |  87 +++++++
 7 files changed, 767 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/hazptr.h
 create mode 100644 kernel/hazptr.c
 create mode 100644 samples/hazptr/hazptr_test.c

Linus Torvalds Sept. 18, 2024, 7:18 a.m. UTC | #1

On Tue, 17 Sept 2024 at 16:34, Boqun Feng <boqun.feng@gmail.com> wrote:
>
> This series introduces hazard pointers [1] to kernel space. A TL;DR
> description of hazard pointers is "a scalable refcounting mechanim
> with RCU-like API". More information can be found at [2].

Please give actual "this is useful for X, and here is an actual real
load with numbers showing why it matters".

We don't just merge random infrastructure without a use-case and an
argument for it.

                 Linus

Neeraj Upadhyay Sept. 18, 2024, 10:44 p.m. UTC | #2

On 9/18/2024 12:48 PM, Linus Torvalds wrote:
> On Tue, 17 Sept 2024 at 16:34, Boqun Feng <boqun.feng@gmail.com> wrote:
>>
>> This series introduces hazard pointers [1] to kernel space. A TL;DR
>> description of hazard pointers is "a scalable refcounting mechanim
>> with RCU-like API". More information can be found at [2].
> 
> Please give actual "this is useful for X, and here is an actual real
> load with numbers showing why it matters".
> 

One of the use case where we had seen improvement is - Nginx
web server throughput scalability with AppArmor enabled. For this use
case we see refcount scalability problem when kref operations
are done for AppArmor label object in Nginx worker's context. More
details about this are captured @ [1] [2].

When we switch from kref to hazard pointer in apparmor_file_open(),
we see ~7% improvement in Nginx throughput for this use case.

While we were working on this problem, this refcount scalability issue got
resolved  recently with conditional ref acquisition [3] (however, there are new
developments in apparmor code which might bring back the refcount problem [4]).

[1] https://lore.kernel.org/lkml/20240110111856.87370-7-Neeraj.Upadhyay@amd.com/T/
[2] https://lore.kernel.org/lkml/20240916050811.473556-1-Neeraj.Upadhyay@amd.com/
[3] https://lore.kernel.org/lkml/20240620131524.156312-1-mjguzik@gmail.com/
[4] https://lore.kernel.org/lkml/71c0ea18-8b8b-402b-b03c-029aeedc2747@canonical.com/

- Neeraj

> We don't just merge random infrastructure without a use-case and an
> argument for it.
> 
>                  Linus

Linus Torvalds Sept. 19, 2024, 6:46 a.m. UTC | #3

On Thu, 19 Sept 2024 at 00:44, Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> wrote:
>
> While we were working on this problem, this refcount scalability issue got
> resolved  recently with conditional ref acquisition [3] (however, there are new
> developments in apparmor code which might bring back the refcount problem [4]).

Honestly, the various security layers should be a whole lot more
careful about their horrid performance issues, and I think that [4]
you point at needs to just be headed off at the pass.

No  more "the security layer is so bad at performance that we have to
introduce new ref mechanisms", please. Let's push back on bad security
layer code instead.

                Linus

Christoph Hellwig Sept. 19, 2024, 2:14 p.m. UTC | #4

On Wed, Sep 18, 2024 at 09:18:43AM +0200, Linus Torvalds wrote:
> On Tue, 17 Sept 2024 at 16:34, Boqun Feng <boqun.feng@gmail.com> wrote:
> >
> > This series introduces hazard pointers [1] to kernel space. A TL;DR
> > description of hazard pointers is "a scalable refcounting mechanim
> > with RCU-like API". More information can be found at [2].
> 
> Please give actual "this is useful for X, and here is an actual real
> load with numbers showing why it matters".

Agreed.  From the description this would seem like a good fit for
q_usage_counter in the block layer, which currently makes creative use
of percpu counters.

Linus Torvalds Sept. 19, 2024, 2:21 p.m. UTC | #5

On Thu, 19 Sept 2024 at 16:15, Christoph Hellwig <hch@infradead.org> wrote:
>
> Agreed.  From the description this would seem like a good fit for
> q_usage_counter in the block layer, which currently makes creative use
> of percpu counters.

Yes, if this actually could simplify code that currently used percpu
counters, that might be lovely.

The percpu counters often perform very well, but then have huge pain
in either managing the percpu allocation, or in trying to synchronize
across CPU's.

I'd be a lot more interested in "we can fix complex code" than in "we
have crappy code in bad subsystems where we can hide the performance
impact of the subsystem not having been done right".

               Linus

Mateusz Guzik Sept. 19, 2024, 2:30 p.m. UTC | #6

On Thu, Sep 19, 2024 at 04:14:05AM +0530, Neeraj Upadhyay wrote:
> On 9/18/2024 12:48 PM, Linus Torvalds wrote:
> > On Tue, 17 Sept 2024 at 16:34, Boqun Feng <boqun.feng@gmail.com> wrote:
> >>
> >> This series introduces hazard pointers [1] to kernel space. A TL;DR
> >> description of hazard pointers is "a scalable refcounting mechanim
> >> with RCU-like API". More information can be found at [2].
> > 
> > Please give actual "this is useful for X, and here is an actual real
> > load with numbers showing why it matters".
> > 
> 
> One of the use case where we had seen improvement is - Nginx
> web server throughput scalability with AppArmor enabled. For this use
> case we see refcount scalability problem when kref operations
> are done for AppArmor label object in Nginx worker's context. More
> details about this are captured @ [1] [2].
> 
> When we switch from kref to hazard pointer in apparmor_file_open(),
> we see ~7% improvement in Nginx throughput for this use case.
> 
> While we were working on this problem, this refcount scalability issue got
> resolved  recently with conditional ref acquisition [3] (however, there are new
> developments in apparmor code which might bring back the refcount problem [4]).
> 

The open/close thing is still serializing across different processes,
the slowdown just got lower. As in apparmor *as is* continues to be a
problem at big enough scale.

Per my messages in the area in the past, I'm confident this is fixable
with changing the refcount model to cache ref changes per-thread. I
employed this very scheme $elsewhere.

Since equivalent mechanism is applicable to creds this may want to be
implemented as something under lib/. I even started to work on it for
Linux, but real life got in the way and then I could not be arsed to
finish. 

It is a little reminiscenet of per-cpu refs. Here is the outline again:

kref usage gets replaced with a touple of { kref users; s64 refs; }

task_struct grows a pointer to the cached label and refs counter on it

when a new thread is created it bumps users and stores the pointer. on
destruction it decrements users and rolls up the local changes.
Similarly, if it turns out the label has to change during thread's
lifetime, the same thing happens.

In pseudo-code for apparmor_file_open():
	if (unlikely(current->aa_cached_label != check_label())) {
		/* do a replacement here */
	}
	/* just bump the local counter, no synchronisation with other
	 * cpus in the common case */
	current->aa_cached_label_refs++;

In apparmor_file_close():
	/* common case fast path */
	if (file->aa_label == current->aa_cached_label) {
		current->aa_cached_label_refs--;
		return;
	}
	/* we get here if apparmor got reconfigured or this is a file we
	 * inherited from another proc which had a different label and
	 * this is the last fput */
	kref_put(file->aa_label);

Conceptually there is almost nothing to see here.

As outlined above stale labels would clear themselves out as threads
open files. However, a thread which stubborly refuses to call allocate a
new file obj may hold on to a stale label indefinitely.

One way to sort it out:
I presume there is a spot somewhere in user<->kernel transition handling
which updates the credentials pointer, should it have changed.

$elsewhere I patched it up with a "cow" generation counter. If not
matching with the real task struct you know you need to take the fast
path and check creds, apparmor and whatever else. No extra branches in
the fast path, but a new int does have to be read. Given that
task_struct is a little bit of a cluster fuck I don't think it's a
problem.

That would be a rough sketch, anyone interested can fill in the details.
This still performs serializing atomics in *certain* cases, but avoids
them in almost all cases and there is nothing complicated about this
that I see, just some effort to implement.

So I don't believe patching up RCU with hazard pointers is warranted if
apparmor is the only justification.

Anyway no ETA from my end, anyone interested is free to take the idea or
do better.

Neeraj Upadhyay Sept. 20, 2024, 5 a.m. UTC | #7

On 9/19/2024 12:16 PM, Linus Torvalds wrote:
> On Thu, 19 Sept 2024 at 00:44, Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> wrote:
>>
>> While we were working on this problem, this refcount scalability issue got
>> resolved  recently with conditional ref acquisition [3] (however, there are new
>> developments in apparmor code which might bring back the refcount problem [4]).
> 
> Honestly, the various security layers should be a whole lot more
> careful about their horrid performance issues, and I think that [4]
> you point at needs to just be headed off at the pass.
> 
> No  more "the security layer is so bad at performance that we have to
> introduce new ref mechanisms", please. Let's push back on bad security
> layer code instead.
> 

Ok got it. Thanks for your feedback! I had tried using percpu refcount first
(in place of kref) in AppArmor. However, that required managing the last
reference drop (implemented in [1] and [2]). Mateusz has shared some ideas
in his reply to this thread. Maybe that is a workable solution. Will defer
to John on this as I have limited understanding of the cred management code.

- Neeraj

>                 Linus

[RFC,0/4] Add hazard pointers to kernel

Message

Comments