
[RFC,06/19] Provide refcount_t, an atomic_t like primitive built just for refcounting.

Message ID 1482994571-18687-7-git-send-email-elena.reshetova@intel.com (mailing list archive)
State New, archived

Commit Message

Reshetova, Elena Dec. 29, 2016, 6:55 a.m. UTC
It provides saturation semantics such that overflow becomes impossible
and thereby 'spurious' use-after-free is avoided.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/kref.h     |  23 ++---
 include/linux/refcount.h | 262 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 273 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/refcount.h

Comments

Eric Biggers Dec. 30, 2016, 1:06 a.m. UTC | #1
On Thu, Dec 29, 2016 at 08:55:58AM +0200, Elena Reshetova wrote:
> +
> +static inline __must_check
> +bool refcount_add_not_zero(unsigned int i, refcount_t *r)
> +{
> +       unsigned int old, new, val = atomic_read(&r->refs);
> +
> +       for (;;) {
> +               if (!val)
> +                       return false;
> +
> +               if (unlikely(val == UINT_MAX))
> +                       return true;
> +
> +               new = val + i;
> +               if (new < val)
> +                       new = UINT_MAX;
> +               old = atomic_cmpxchg_relaxed(&r->refs, val, new);
> +               if (old == val)
> +                       break;
> +
> +               val = old;
> +       }
> +
> +       WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");
> +
> +       return true;
> +}
> +
> +/*
> + * Similar to atomic_inc_not_zero(), will saturate at UINT_MAX and WARN.
> + *
> + * Provides no memory ordering, it is assumed the caller has guaranteed the
> + * object memory to be stable (RCU, etc.). It does provide a control dependency
> + * and thereby orders future stores. See the comment on top.
> + */
> +static inline __must_check
> +bool refcount_inc_not_zero(refcount_t *r)
> +{
> +       return refcount_add_not_zero(1, r);
> +}
> +
> +/*
> + * Similar to atomic_inc(), will saturate at UINT_MAX and WARN.
> + *
> + * Provides no memory ordering, it is assumed the caller already has a
> + * reference on the object, will WARN when this is not so.
> + */
> +static inline void refcount_inc(refcount_t *r)
> +{
> +       WARN(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
> +}
> +

... and refcount_inc() compiles to over 100 bytes of instructions on x86_64.
This is the wrong approach.  We need a low-overhead solution, otherwise no one
will turn on refcount protection and the feature will be useless.

What exactly is wrong with the current solution in PAX/grsecurity?  Looking at
the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo'
to, if the counter overflowed, jump to *out-of-line* error handling code, in a
separate section of the kernel image.   Then it raises a software interrupt, and
the interrupt handler sets the overflowed counter to INT_MAX and does the needed
logging and signal raising.

That approach seems very efficient.  It seems the only overhead when someone
isn't actually exploiting a refcount bug is the 'jo' instruction, with the
branch not taken.  There aren't even any other in-line instructions to waste
icache space.

I do see they used to use a slightly different approach that did a decrement
instead of setting the counter to INT_MAX.  And that was clearly racy because
two concurrent increments could circumvent the overflow protection.  But AFAICS
their current solution is not racy in any meaningful way, since the setting to
INT_MAX means an overflow will be detected again on the next increment, even if
there were some concurrent increments in the mean time.  (And if by some stretch
of the imagination, it was possible to execute *another* INT_MAX increments
before the fixup code had a chance to run, the correct solution would be to
simply use 'js' instead of 'jo' to detect overflows.  I'm guessing the only
reason they don't do that is because some atomic_t's are used to store negative
values...)
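
For illustration only, here is a minimal inline-asm sketch of that fast path. This is
not the actual PaX code: the real implementation jumps to *out-of-line* fixup code and
raises an exception so the handler can saturate the counter and do the logging; the
fixup is inlined here purely to keep the example short.

#include <linux/atomic.h>
#include <linux/kernel.h>	/* INT_MAX */

static inline void atomic_inc_checked(atomic_t *v)
{
	asm volatile("lock incl %0\n\t"
		     "jno 1f\n\t"	/* common case: no signed overflow, fall through */
		     "movl %1, %0\n\t"	/* overflow: saturate at INT_MAX (inlined for brevity) */
		     "1:"
		     : "+m" (v->counter)
		     : "i" (INT_MAX)
		     : "cc", "memory");
}

The only in-line cost on the non-exploit path is the single not-taken conditional branch.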

So given that there is an existing solution which AFAICS is efficient and
achieves the desired protection, why has the proposal turned into a monstrous
cmpxchg loop that won't be practical to enable by default?

I also think that the "warn when incrementing a 0 refcount" part of the change
shouldn't be there.  It's additional overhead that seems tangential to the main
goal of the feature which is to protect against refcount overflows, not to
protect against random increments in some object which has *already* been freed
and potentially exploited.

- Eric
Reshetova, Elena Dec. 30, 2016, 1:17 p.m. UTC | #2
> On Thu, Dec 29, 2016 at 08:55:58AM +0200, Elena Reshetova wrote:
> > +
> > +static inline __must_check
> > +bool refcount_add_not_zero(unsigned int i, refcount_t *r)
> > +{
> > +       unsigned int old, new, val = atomic_read(&r->refs);
> > +
> > +       for (;;) {
> > +               if (!val)
> > +                       return false;
> > +
> > +               if (unlikely(val == UINT_MAX))
> > +                       return true;
> > +
> > +               new = val + i;
> > +               if (new < val)
> > +                       new = UINT_MAX;
> > +               old = atomic_cmpxchg_relaxed(&r->refs, val, new);
> > +               if (old == val)
> > +                       break;
> > +
> > +               val = old;
> > +       }
> > +
> > +       WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");
> > +
> > +       return true;
> > +}
> > +
> > +/*
> > + * Similar to atomic_inc_not_zero(), will saturate at UINT_MAX and WARN.
> > + *
> > + * Provides no memory ordering, it is assumed the caller has guaranteed the
> > + * object memory to be stable (RCU, etc.). It does provide a control dependency
> > + * and thereby orders future stores. See the comment on top.
> > + */
> > +static inline __must_check
> > +bool refcount_inc_not_zero(refcount_t *r)
> > +{
> > +       return refcount_add_not_zero(1, r);
> > +}
> > +
> > +/*
> > + * Similar to atomic_inc(), will saturate at UINT_MAX and WARN.
> > + *
> > + * Provides no memory ordering, it is assumed the caller already has a
> > + * reference on the object, will WARN when this is not so.
> > + */
> > +static inline void refcount_inc(refcount_t *r)
> > +{
> > +       WARN(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-
> free.\n");
> > +}
> > +
> 
> ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64.
> This is the wrong approach.  We need a low-overhead solution, otherwise no one
> will turn on refcount protection and the feature will be useless.
> 
> What exactly is wrong with the current solution in PAX/grsecurity?  Looking at
> the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo'
> to, if the counter overflowed, jump to *out-of-line* error handling code, in a
> separate section of the kernel image.   Then it raises a software interrupt, and
> the interrupt handler sets the overflowed counter to INT_MAX and does the
> needed
> logging and signal raising.
> 
> That approach seems very efficient.  It seems the only overhead when someone
> isn't actually exploiting a refcount bug is the 'jo' instruction, with the
> branch not taken.  There aren't even any other in-line instructions to waste
> icache space.
> I do see they used to use a slightly different approach that did a decrement
> instead of setting the counter to INT_MAX.  And that was clearly racy because
> two concurrent increments could circumvent the overflow protection.  But AFAICS
> their current solution is not racy in any meaningful way, since the setting to
> INT_MAX means an overflow will be detected again on the next increment, even if
> there were some concurrent increments in the mean time.  (And if by some stretch
> of the imagination, it was possible to execute *another* INT_MAX increments
> before the fixup code had a chance to run, the correct solution would be to
> simply use 'js' instead of 'jo' to detect overflows.  I'm guessing the only
> reason they don't do that is because some atomic_t's are used to store negative
> values...)

> 
> So given that there is an existing solution which AFAICS is efficient and
> achieves the desired protection, why has the proposal turned into a monstrous
> cmpxchg loop that won't be practical to enable by default?

I guess we can try to benchmark the whole thing to see what the overhead is,
especially now that we have the subsystem parts ready.

And if you have been following the story since the beginning, we also have the
PaX/grsecurity approach done for linux-next/stable, and benchmarks have been
posted previously, so it would be easy to compare if needed.

I guess one could in principle think of a mixed approach between this one and the
one that PaX/grsecurity had: define the refcount_t type and API, but use assembly
instructions behind it to speed things up.
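
A very rough sketch of how that could look inside refcount.h (CONFIG_ARCH_HAS_FAST_REFCOUNT
and asm/refcount.h are made-up names here, only to illustrate keeping the generic cmpxchg
loop as the fallback):

/* Hypothetical: let an architecture override the generic cmpxchg-loop
 * implementation with a faster checked increment of its own. */
#ifdef CONFIG_ARCH_HAS_FAST_REFCOUNT
#include <asm/refcount.h>	/* arch-provided refcount_inc() and friends */
#else
static inline void refcount_inc(refcount_t *r)
{
	WARN(!refcount_inc_not_zero(r),
	     "refcount_t: increment on 0; use-after-free.\n");
}
#endif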

> I also think that the "warn when incrementing a 0 refcount" part of the change
> shouldn't be there.  It's additional overhead that seems tangential to the main
> goal of the feature which is to protect against refcount overflows, not to
> protect against random increments in some object which has *already* been freed
> and potentially exploited.

Actually, having the WARN for now was a useful debugging feature: it let us find many places which would not work otherwise.

Best Regards,
Elena.

> 
> - Eric
Eric Biggers Dec. 30, 2016, 7:52 p.m. UTC | #3
On Fri, Dec 30, 2016 at 01:17:08PM +0000, Reshetova, Elena wrote:
> > 
> > ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64.
> > This is the wrong approach.  We need a low-overhead solution, otherwise no one
> > will turn on refcount protection and the feature will be useless.
> > 
> > What exactly is wrong with the current solution in PAX/grsecurity?  Looking at
> > the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo'
> > to, if the counter overflowed, jump to *out-of-line* error handling code, in a
> > separate section of the kernel image.   Then it raises a software interrupt, and
> > the interrupt handler sets the overflowed counter to INT_MAX and does the
> > needed
> > logging and signal raising.
> > 
> > That approach seems very efficient.  It seems the only overhead when someone
> > isn't actually exploiting a refcount bug is the 'jo' instruction, with the
> > branch not taken.  There aren't even any other in-line instructions to waste
> > icache space.
> > I do see they used to use a slightly different approach that did a decrement
> > instead of setting the counter to INT_MAX.  And that was clearly racy because
> > two concurrent increments could circumvent the overflow protection.  But AFAICS
> > their current solution is not racy in any meaningful way, since the setting to
> > INT_MAX means an overflow will be detected again on the next increment, even if
> > there were some concurrent increments in the mean time.  (And if by some stretch
> > of the imagination, it was possible to execute *another* INT_MAX increments
> > before the fixup code had a chance to run, the correct solution would be to
> > simply use 'js' instead of 'jo' to detect overflows.  I'm guessing the only
> > reason they don't do that is because some atomic_t's are used to store negative
> > values...)
> 
> > 
> > So given that there is an existing solution which AFAICS is efficient and
> > achieves the desired protection, why has the proposal turned into a monstrous
> > cmpxchg loop that won't be practical to enable by default?
> 
> I guess we can try to benchmark the whole thing to see what is the overhead.
> Especially now when we have the subsystem parts ready. 
> 
> And if you have been following the story since beginning, we also have PaX/grsecurity
> approach done for linux-next/stable and benchmarks have been previously posted, so
> would be easy to compare if needed. 
> 
> I guess one could in principle think of mixed approach between this one and the one that 
> PaX/grsecurity had: define refcount_t type and API, but use assembly instructions
> behind to speed things up. 

I haven't been following the whole story, sorry.  Are your "PaX/grsecurity
approach" patches based on their latest version, i.e. without the racy
decrement?  Also, with regards to benchmarks, there really are two things that
are important: the performance impact of actually executing all the cmpxchg loop
stuff instead of a simple 'lock incl', and the icache footprint of all the extra
inlined instructions (or the overhead of a function call if that is "solved" by
making refcount_inc() out of line).  The latter is unlikely to be visible in a
microbenchmark but it's still very important.

AFAICS the whole question of refcount_t/atomic_t vs.
atomic_t/atomic_unchecked_t has nothing to do with the details of how the
checked refcount operations are actually implemented.  These things need to be
considered separately.

> 
> > I also think that the "warn when incrementing a 0 refcount" part of the change
> > shouldn't be there.  It's additional overhead that seems tangential to the main
> > goal of the feature which is to protect against refcount overflows, not to
> > protect against random increments in some object which has *already* been freed
> > and potentially exploited.
> 
> Actually, having the WARN for now was a useful debugging feature: it let us find many places which would not work otherwise.
> 

The point of the feature is exploit mitigation, not refcount debugging.  The
mitigation needs to be practical to turn on in real production systems.  If we
want to have extra debugging features too, fine, but that should be a *debugging*
feature, not a security feature, controllable by a separate config option, e.g.
CONFIG_DEBUG_REFCOUNT for debugging vs. CONFIG_HARDENED_REFCOUNT for the actual
mitigation.
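
As a rough sketch of that split (CONFIG_DEBUG_REFCOUNT is only the suggested name here,
not an existing option), the saturation logic would stay unconditional while the
increment-on-zero WARN compiles away:

#ifdef CONFIG_DEBUG_REFCOUNT
#define REFCOUNT_WARN(cond, msg)	WARN(cond, msg)
#else
#define REFCOUNT_WARN(cond, msg)	((void)(cond))	/* still evaluates cond */
#endif

static inline void refcount_inc(refcount_t *r)
{
	/* The increment (and saturation) always happens; only the warning
	 * about incrementing a zero refcount is made optional. */
	REFCOUNT_WARN(!refcount_inc_not_zero(r),
		      "refcount_t: increment on 0; use-after-free.\n");
}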

- Eric
Peter Zijlstra Jan. 3, 2017, 1:21 p.m. UTC | #4
On Thu, Dec 29, 2016 at 07:06:27PM -0600, Eric Biggers wrote:
> 
> ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64.
> This is the wrong approach.  We need a low-overhead solution, otherwise no one
> will turn on refcount protection and the feature will be useless.

It's not something that can be turned on or off; refcount_t is
unconditional code. But you raise a good point on the size of the thing.

I count 116 bytes on x86_64. If I disable all the debug crud, that
reduces to 45 bytes, and that's because GCC-6 is generating particularly
stupid code (albeit not as stupid as GCC-5 did).

A hand coded one ends up being 29 bytes, of which 6 are the function
pro- and epilogue, so effectively 23 bytes for inline.

Which we can reduce to 22 if instead of using a literal for UINT_MAX we
do: xor %[t], %[t]; dec %[t].

0000000000009ee0 <ponies>:
    9ee0:       55                      push   %rbp
    9ee1:       48 89 e5                mov    %rsp,%rbp
    9ee4:       8b 07                   mov    (%rdi),%eax
    9ee6:       85 c0                   test   %eax,%eax
    9ee8:       74 10                   je     9efa <ponies+0x1a>
    9eea:       89 c2                   mov    %eax,%edx
    9eec:       ff c2                   inc    %edx
    9eee:       73 04                   jae    9ef4 <ponies+0x14>
    9ef0:       31 d2                   xor    %edx,%edx
    9ef2:       ff ca                   dec    %edx
    9ef4:       f0 0f b1 17             lock cmpxchg %edx,(%rdi)
    9ef8:       75 ec                   jne    9ee6 <ponies+0x6>
    9efa:       5d                      pop    %rbp
    9efb:       c3			retq

(caveat: I wrote this on a post-holidays brain without testing)

Also note that call overhead on an x86_64 (big core) is something like
1.5 cycles. And afaik Sparc64 is the architecture with the worst call
overhead, but that already has its atomic functions out of line for
different reasons.

> What exactly is wrong with the current solution in PAX/grsecurity?  Looking at
> the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo'
> to, if the counter overflowed, jump to *out-of-line* error handling code, in a
> separate section of the kernel image.   Then it raises a software interrupt, and
> the interrupt handler sets the overflowed counter to INT_MAX and does the needed
> logging and signal raising.

Doing an unconditional INC on INT_MAX gives a temporarily visible
artifact of INT_MAX+1 (or INT_MIN) in the best case.

This is fundamentally not an atomic operation and therefore does not
belong in the atomic_* family, full stop.

Not to mention that the whole wrap/unwrap or checked/unchecked split of
atomic_t is a massive trainwreck. Moving over to refcount_t, which has
simple and well-defined semantics, forces us to audit and clean up all the
reference counting crud as it gets converted, and this is a good thing.

Yes it takes more time and effort, but the end result is better code.

I understand why PaX/grsecurity chose not to do this, but that doesn't
make it a proper solution for upstream.


Now as to why refcount cannot be implemented using that scheme you
outlined:

	vCPU0			vCPU1

	lock inc %[r]
	jo

	<vcpu preempt-out>

				for lots
					refcount_dec_and_test(&obj->ref)
						/* hooray, we hit 0 */
						kfree(obj);

	<vcpu preempt-in>

	mov $0xFFFFFFFF, %[r] /* OOPS use-after-free */


Is this unlikely, yes, extremely so. Do I want to be the one debugging
this, heck no.
Eric Biggers Jan. 4, 2017, 8:36 p.m. UTC | #5
On Tue, Jan 03, 2017 at 02:21:36PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 29, 2016 at 07:06:27PM -0600, Eric Biggers wrote:
> > 
> > ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64.
> > This is the wrong approach.  We need a low-overhead solution, otherwise no one
> > will turn on refcount protection and the feature will be useless.
> 
> Its not something that can be turned on or off, refcount_t is
> unconditional code. But you raise a good point on the size of the thing.
...
> Doing an unconditional INC on INT_MAX gives a temporarily visible
> artifact of INT_MAX+1 (or INT_MIN) in the best case.
> 
> This is fundamentally not an atomic operation and therefore does not
> belong in the atomic_* family, full stop.

Again I feel this is going down the wrong track.  The point of the PaX feature
this is based on is to offer protection against *exploits* involving abuse of
refcount leak bugs.  If the overflow logic triggers then there is a kernel *bug*
and the rules have already been broken.  The question of whether the exploit
mitigation is "atomic" is not important unless it allows the mitigation to be
circumvented.

And yes this should be a config option just like other hardening options like
CONFIG_HARDENED_USERCOPY.  Making it unconditional only makes it harder to get
merged and hurts users who, for whatever reason, don't want/need extra
protections against kernel exploits.  This is especially true if an
implementation with significant performance and code size overhead is chosen.

> Now as to why refcount cannot be implemented using that scheme you
> outlined:
> 
> 	vCPU0			vCPU1
> 
> 	lock inc %[r]
> 	jo
> 
> 	<vcpu preempt-out>
> 
> 				for lots
> 					refcount_dec_and_test(&obj->ref)
> 						/* hooray, we hit 0 */
> 						kfree(obj);
> 
> 	<vcpu preempt-in>
> 
> 	mov $0xFFFFFFFF, %[r] /* OOPS use-after-free */

This scenario doesn't make sense.  If there's no bug that causes extra refcount
decrements, then it would be impossible to generate the INT_MAX+1 decrements
needed to free the object.  Or if there *is* a bug that causes extra refcount
decrements, then it could already be abused at any time in any of the proposed
solutions to trivially cause a use-after-free.

Eric
Peter Zijlstra Jan. 5, 2017, 10:44 a.m. UTC | #6
On Wed, Jan 04, 2017 at 12:36:01PM -0800, Eric Biggers wrote:
> On Tue, Jan 03, 2017 at 02:21:36PM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 29, 2016 at 07:06:27PM -0600, Eric Biggers wrote:
> > > 
> > > ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64.
> > > This is the wrong approach.  We need a low-overhead solution, otherwise no one
> > > will turn on refcount protection and the feature will be useless.
> > 
> > Its not something that can be turned on or off, refcount_t is
> > unconditional code. But you raise a good point on the size of the thing.
> ...
> > Doing an unconditional INC on INT_MAX gives a temporarily visible
> > artifact of INT_MAX+1 (or INT_MIN) in the best case.
> > 
> > This is fundamentally not an atomic operation and therefore does not
> > belong in the atomic_* family, full stop.
> 
> Again I feel this is going down the wrong track.  The point of the PaX feature
> this is based on is to offer protection against *exploits* involving abuse of
> refcount leak bugs.  If the overflow logic triggers then there is a kernel *bug*
> and the rules have already been broken.  The question of whether the exploit
> mitigation is "atomic" is not important unless it allows the mitigation to be
> circumvented.

It matters for where you put it. It cannot be part of atomic_t if it is
intrinsically not atomic.

Not to mention that the whole checked/unchecked split doesn't make sense
for atomic_t, since the concept only applies to a subset of the
operations, and for those it is only desired for a subset of applications;
notably the reference count case.

The proposed refcount_t aims to address the exact exploit scenario you
mention, but does so from a separate type with well defined semantics.

Our current effort has mostly been aimed at exploring the space and
defining the semantics and API that make sense. For example the refcount
type should not have functions that can subvert the protection or create
malformed stuff, ie. it should be internally consistent.

Note that the saturation semantics effectively mitigate the described
use-after-free exploit by degrading it into a resource-leak denial-of-service;
although Kees wants to later add an instant-panic option, which would
create a more immediate DoS scenario.

> And yes this should be a config option just like other hardening options like
> CONFIG_HARDENED_USERCOPY.  Making it unconditional only makes it harder to get
> merged and hurts users who, for whatever reason, don't want/need extra
> protections against kernel exploits.  This is especially true if an
> implementation with significant performance and code size overhead is chosen.

Maybe. And while I agree that whatever GCC generates from the generic
implementation currently available is quite horrible, I'm not convinced
a hand-coded equivalent is an actual problem. The cmpxchg loop only
really suffers in performance when the variable is contended, and when a
refcount is contended enough for this to show up, your code has issues
anyway. That leaves size, and let's first show that that ends up being a
real problem.

Premature optimization, etc.

> This scenario doesn't make sense.  If there's no bug that causes extra refcount
> decrements, then it would be impossible to generate the INT_MAX+1 decrements
> needed to free the object.  Or if there *is* a bug that causes extra refcount
> decrements, then it could already be abused at any time in any of the proposed
> solutions to trivially cause a use-after-free.

Right you are; so much for thinking straight :-) I still feel deeply
uncomfortable with the non-atomic nature of the thing, and I'll have to
think a bit more about it.

Also note that such a scheme does not make sense for LL/SC
architectures.

In any case, if it can be proven to be equivalent, we can always change
the x86 implementation later.
pageexec@freemail.hu Jan. 5, 2017, 9:21 p.m. UTC | #7
On 3 Jan 2017 at 14:21, Peter Zijlstra wrote:

> Its not something that can be turned on or off, refcount_t is
> unconditional code. But you raise a good point on the size of the thing.
> 
> I count 116 bytes on x86_64. If I disable all the debug crud...

...then the whole thing loses its raison d'etre ;). so let's not cheat
and compare apples to apples in the future. this means that your solution
has a very bad size overhead (dozens of insns vs. one, multiplied by a few
thousand uses) and performance overhead (several insns that the processor
cannot parallelize due to register and control dependencies, at least one
mispredicted conditional forward jump, etc) which is simply not acceptable
as it doesn't balance it with any sort of security advantage (see below).

> > What exactly is wrong with the current solution in PAX/grsecurity?  Looking at
> > the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo'
> > to, if the counter overflowed, jump to *out-of-line* error handling code, in a
> > separate section of the kernel image.   Then it raises a software interrupt, and
> > the interrupt handler sets the overflowed counter to INT_MAX and does the needed
> > logging and signal raising.
> 
> Doing an unconditional INC on INT_MAX gives a temporarily visible
> artifact of INT_MAX+1 (or INT_MIN) in the best case.
> 
> This is fundamentally not an atomic operation and therefore does not
> belong in the atomic_* family, full stop.

i don't know why people keep repeating this myth but what exactly is not
atomic in the PaX REFCOUNT feature? hint: it's all atomic ops based,
there's no semantic change in there.

what is undesired behaviour (for the security goal, not the underlying
atomic_t API!) is that the undo/saturation logic makes the whole dance
visible to other CPUs on non-LL/SC archs which means that this case needs
special care there to achieve the security goal but there's no breakage in
atomic ops. put another way, if you consider REFCOUNT not atomic somehow
then so is the upstream atomic_t API since it makes INT_MAX+1 visible to
all CPUs just like REFCOUNT does (in fact the whole point of the REFCOUNT
accessors is that they can be better than upstream on LL/SC in this regard).

> Not to mention that the whole wrap/unwrap or checked/unchecked split of
> atomic_t is a massive trainwreck. Moving over to refcount_t, which has
> simple and well defined semantics forces us to audit and cleanup all the
> reference counting crud as it gets converted, this is a good thing.

while a noble goal your approach has two problems: one, it doesn't go
nearly far enough, and two, what it does do currently is anything but
clear (beyond the above discussed implementation problems). let me
elaborate on both.

1. the proper approach should be to separate the low-level implementation
machinery from the higher level APIs exposed to users. the current low-level
API is the atomic_t type and its accessors (including the 64/long variants)
which is directly used by most code and there're few higher level abstractions
on top of it. the proper low-level design should first of all separate the
underlying type into signed/unsigned types as the conversions between the
two make for a clumsy and error prone API to start with (just look at your
own refcount_t API, you're abusing signed/unsigned conversions where one
direction is implementation defined and allows a trap representation, is
your code ready to handle that? not to mention that the C implementations
of atomic_t trigger UB when signed overflow occurs).

second, based on the signed/unsigned atomic API you should build not one
but several abstractions. refcount_t is an obvious candidate (based on the
unsigned low-level machinery) but there's a lot more kinds (IIRC, i counted
something like a dozen different patterns when i last looked at the
exceptional cases for REFCOUNT). this leads to...

2. your current refcount_t API is wrong and i think reflects a misunderstanding
of what reference counts are about in general. in particular, checking for
a 0 refcount value for any purpose other than freeing the underlying object
makes no sense and means that the usage is not a refcount or the user has
a UAF bug to start with (the check in kref_get is also nonsense for this
reason). this is because refcounts protect (account for) references/handles
to an object, not the object itself. this also means that such references
(think pointers in C) can only be created by *copying* an existing one but
they cannot be made up out of thin air.

in theory, each time a reference is copied, the corresponding refcount
must be increased and when the reference goes out of scope, the refcount
must be decremented. when the refcount reaches 0 we know that there're no
references left so the object can be freed. in practice, we optimize the
refcount increments on copies by observing that the lifetimes of certain
reference copies nest in each other (think local variables, non-captured
function parameters, etc) so we can get away by only accounting for the
outermost reference in such a nesting hierarchy of references (and we get
the refcounting bugs when we fail at this arguably premature optimization
and miss a decrement or increment for a reference copy).

now with this introduction tell me what sense does e.g., refcount_add_not_zero
make? any refcount API should never even expose the underlying counter
mechanism since refcounts are about references, not the counter underneath.
that is, a proper refcount API would look like this:

// establishes the initial refcount of 0 (which doesn't mean 'freed object'
// but 'no references, yet')
// it's needed for object construction if the object may come from non-zero
// initialized memory, otherwise ref_get should be used
void ref_reset(objtype *ref);

// returns a copy of the reference if the refcount can be safely
// incremented, NULL otherwise (or crash&burn, etc)
objtype *ref_get(objtype *ref);

// decrements the refcount and free the object if it reached 0
// dtor could be a field of objtype but that'd cost memory (if there are
// more objects than ref_put callers which is the likely case in real life)
// and is an attack surface (kref follows this model too)
void ref_put(objtype *ref, destruct_t *dtor);

conceptually that is all you need to provide for refcount users (note that
there's no leakage of the lower level implementation into the API, e.g., it
can be lockless based on atomic_t or synchronized explicitly, all that is
hidden from the users of the API).

any other need/usage is not a refcount and needs its own API (and analysis
of potential security problems when misused). now to implement this in C
you'll have to break the abstraction in that it'd be an undue burden on
the API implementor to provide the API for each type of reference (in C++
you could use templates) so in practice you'd also expose the immediately
underlying refcount type (a reference to it) in the API and pray that users
will get it right. something like this:

objtype *ref_get(objtype *ref, refcount_t *count);
void ref_put(objtype *ref, refcount_t *count, destruct_t *dtor);

where 'count' would be passed as &ref->refcount_field or you could make
this API macro based instead and stick to the proper API by mandating the
name of the object's refcount field:

#define REF_GET(ref) ({ refcount_inc(&ref->refcount) ? ref : NULL;})
#define REF_PUT(ref, dtor) do { if (refcount_dec(&ref->refcount)) dtor(ref); } while(0)

where refcount_inc would do the increment and overflow check and return
a boolean based on the result and refcount_dec would do the decrement and
return a boolean to trigger object destruction (which has to at least free
the object's memory of course).
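
as a purely hypothetical usage sketch of the macro form (the 'foo' names and the
mandated 'refcount' field name are assumptions):

struct foo {
	refcount_t refcount;
	/* ... payload ... */
};

static void foo_free(struct foo *f)
{
	kfree(f);
}

static void foo_use(struct foo *f)
{
	struct foo *copy = REF_GET(f);	/* copy the reference, or NULL on failure */

	if (!copy)
		return;
	/* ... use 'copy' ... */
	REF_PUT(copy, foo_free);	/* drop it; frees the object on the last put */
}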

> Yes it takes more time and effort, but the end result is better code.

most definitely but that road is a lot longer than some of you seem to
anticipate ;).

> Now as to why refcount cannot be implemented using that scheme you
> outlined:
> 
>  vCPU0			vCPU1
> 
>  lock inc %[r]
>  jo
> 
>  <vcpu preempt-out>
> 
> 				for lots
>					refcount_dec_and_test(&obj->ref)
>						/* hooray, we hit 0 */
>						kfree(obj);
> 
>  <vcpu preempt-in>
> 
>  mov $0xFFFFFFFF, %[r] /* OOPS use-after-free */
> 
> 
> Is this unlikely, yes, extremely so. Do I want to be the one debugging
> this, heck no.

i can't make any sense of your example. which kind of refcount bug is
assumed here? if you have an underflow bug then you can just abuse it at
any refcount value, there's no need to win any race at the saturation
value. if it's an overflow bug (the use case for REFCOUNT) then i don't
understand the refcount_dec_and_test call... note also that in the
example's underflow case you have a UAF regardless of the refcount
protection logic so your debugging woes don't go away just because the
UAF is for the non-refcount fields of the underlying object.

as for the security of the current PaX REFCOUNT feature on x86, if you
wanted to break it you'd have to win that race above 2G times in a row
(saturation is at INT_MAX, not UINT_MAX as your example assumes), that is,
all at once in 2G different tasks. good luck finding a machine that can
spawn that many tasks and then even better luck winning the race in every
single one of them in a row without a single schedule back into any one
of them. and if someone's really worried about this case, preemption (or
interrupts for an even shorter window) can be simply disabled around this
logic which would still result in better code than the proposed refcount_t
implementation.

cheers,
 PaX Team
Greg KH Jan. 20, 2017, 10:35 a.m. UTC | #8
Sorry I missed responding to this earlier, my fault.

I'm just going to comment on one big part here, as I was the one who
created the kref api, and has watched it mutate over the years...

On Thu, Jan 05, 2017 at 10:21:31PM +0100, PaX Team wrote:
> 2. your current refcount_t API is wrong and i think reflects a misunderstanding
> of what reference counts are about in general. in particular, checking for
> a 0 refcount value for any purpose other than freeing the underlying object
> makes no sense and means that the usage is not a refcount or the user has
> a UAF bug to start with (the check in kref_get is also nonsense for this
> reason). this is because refcounts protect (account for) references/handles
> to an object, not the object itself. this also means that such references
> (think pointers in C) can only be created by *copying* an existing one but
> they cannot be made up out of thin air.

I agree in principle here, and that's how the original kref api worked.
But then the nasty real-world got in the way here and people started
using it :)

> in theory, each time a reference is copied, the corresponding refcount
> must be increased and when the reference goes out of scope, the refcount
> must be decremented. when the refcount reaches 0 we know that there're no
> references left so the object can be freed. in practice, we optimize the
> refcount increments on copies by observing that the lifetimes of certain
> reference copies nest in each other (think local variables, non-captured
> function parameters, etc) so we can get away by only accounting for the
> outermost reference in such a nesting hierarchy of references (and we get
> the refcounting bugs when we fail at this arguably premature optimization
> and miss a decrement or increment for a reference copy).

Yes, in theory, I totally agree with you.  That _should_ be the only way
a refcount works.  And that's how kref worked for a few years, until
real-world got in the way...

> now with this introduction tell me what sense does e.g., refcount_add_not_zero
> make? any refcount API should never even expose the underlying counter
> mechanism since refcounts are about references, not the counter underneath.
> that is, a proper refcount API would look like this:

I agree, but if you look at the places where that is used, there are
good reasons for it.  Unfortunately some types of object lifecycles don't
fit into the "once it hits zero it needs to be removed" model, as some
other type of "object cleanup" model comes into play where things are
removed later on.  Once an object's reference count hits zero it means,
for that object, that no _more_ references can be grabbed, because it is
"dead" or something like that.

I've tried to argue otherwise, but almost every time I ended up agreeing
with the usage in the end.  That's not to say that there isn't room in
the current kernel to fix up a few of these usages (I'm suspicious about
the nvme usage, as those people are notorious for doing horrid things to
an api just because they can...)

> // establishes the initial refcount of 0 (which doesn't mean 'freed object'
> // but 'no references, yet')
> // it's needed for object construction if the object may come from non-zero
> // initialized memory, otherwise ref_get should be used
> void ref_reset(objtype *ref);

kref_init() in today's kernel.  I agree it is needed, but not that you
should be able to call kref_get without ever calling it first.  It's
best to always init things in my opinion.

> // returns a copy of the reference if the refcount can be safely
> // incremented, NULL otherwise (or crash&burn, etc)
> objtype *ref_get(objtype *ref);

kref_get().

> // decrements the refcount and free the object if it reached 0
> // dtor could be a field of objtype but that'd cost memory (if there are
> // more objects than ref_put callers which is the likely case in real life)
> // and is an attack surface (kref follows this model too)
> void ref_put(objtype *ref, destruct_t *dtor);

kref_put().

Or kref_put_mutex() as it's nice to be able to keep the lock within the
api to prevent it being forced to be kept outside of it (which in your
example, is required, which forces us to audit things better.)

> conceptually that is all you need to provide for refcount users (note that
> there's no leakage of the lower level implementation into the API, e.g., it
> can be lockless based on atomic_t or synchronized explicitly, all that is
> hidden from the users of the API).

I agree in theory, but then the real world got in the way and showed
that there are other usages that need more than just this.  Which is why
the kref api has grown over the years to accommodate them.

To wave things away and say "you are doing it all wrong" is dangerous.
I know I did just that in my youth, but experience has taught me that
sometimes I am wrong, and that apis need to be extended to fit other,
valid, usage models.

In short, the wonderful quote, "In theory, theory and practice are the
same. In practice, they are not." applies here as well.  I've had it
happen to me too many times to count over the years :)

Moving to a refcount_t api is good, and a solid thing to do; it allows
us to start catching the places where atomic_t is being used incorrectly,
fix them up to use a reference count properly, and move on to the next
issue at hand.

You all can argue about whether or not the current refcount api is
"performant" enough, which is great, as it can then be fixed up in one
single place and made more "secure" by catching reference count overflows.

thanks,

greg k-h
Peter Zijlstra Jan. 20, 2017, 1:10 p.m. UTC | #9
On Thu, Jan 05, 2017 at 10:21:31PM +0100, PaX Team wrote:
> On 3 Jan 2017 at 14:21, Peter Zijlstra wrote:
> 
> > Its not something that can be turned on or off, refcount_t is
> > unconditional code. But you raise a good point on the size of the thing.
> > 
> > I count 116 bytes on x86_64. If I disable all the debug crud...
> 
> ...then the whole thing loses its raison d'etre ;).

It does not; without the WARN() stuff it is still saturating and
avoiding use-after-free. If we fix WARN to not suck on x86, for
example by extending the "ud2" bug table, it would not blow up the text
size so much.

> > This is fundamentally not an atomic operation and therefore does not
> > belong in the atomic_* family, full stop.
> 
> i don't know why people keep repeating this myth but what exactly is not
> atomic in the PaX REFCOUNT feature? hint: it's all atomic ops based,
> there's no semantic change in there.
> 
> what is undesired behaviour (for the security goal, not the underlying
> atomic_t API!) is that the undo/saturation logic makes the whole dance
> visible to other CPUs

And therefore it's not atomic. If intermediate state is observable, it's
not atomic. That's not a myth; that's the very definition of what atomic
means.

> > Not to mention that the whole wrap/unwrap or checked/unchecked split of
> > atomic_t is a massive trainwreck. Moving over to refcount_t, which has
> > simple and well defined semantics forces us to audit and cleanup all the
> > reference counting crud as it gets converted, this is a good thing.
> 
> while a noble goal your approach has two problems: one, it doesn't go
> nearly far enough, and two, what it does do currently is anything but
> clear (beyond the above discussed implementation problems). let me
> elaborate on both.
> 
> 1. the proper approach should be to separate the low-level implementation
> machinery from the higher level APIs exposed to users. the current low-level
> API is the atomic_t type and its accessors (including the 64/long variants)
> which is directly used by most code and there're few higher level abstractions
> on top of it. the proper low-level design should first of all separate the
> underlying type into signed/unsigned types as the conversions between the
> two make for a clumsy and error prone API to start with (just look at your
> own refcount_t API, you're abusing signed/unsigned conversions where one
> direction is implementation defined and allows a trap representation, is
> your code ready to handle that? not to mention that the C implementations
> of atomic_t trigger UB when signed overflow occurs).

Thing is, the Linux kernel isn't written in C as per the spec; it's a C
dialect or C-like language. By using -fno-strict-overflow and assuming
2's complement, there is no UB.

(And I strongly believe removing UB from a language is a good thing.)

> 2. your current refcount_t API is wrong and i think reflects a misunderstanding
> of what reference counts are about in general. in particular, checking for
> a 0 refcount value for any purpose other than freeing the underlying object
> makes no sense and means that the usage is not a refcount or the user has
> a UAF bug to start with (the check in kref_get is also nonsense for this
> reason).

_That_ UAF bug is the exact reason the WARN is there. People write buggy
code; warning them wherever possible is a good thing.

> this is because refcounts protect (account for) references/handles
> to an object, not the object itself. this also means that such references
> (think pointers in C) can only be created by *copying* an existing one but
> they cannot be made up out of thin air.

Which is exactly why get() on 0 is wrong and _should_ warn. I think
we very much agree on what a refcount is.

> now with this introduction tell me what sense does e.g., refcount_add_not_zero
> make? any refcount API should never even expose the underlying counter
> mechanism since refcounts are about references, not the counter underneath.
> that is, a proper refcount API would look like this:

Without language support (and there are a few proposals for both C and
C++) we're stuck with manual refcounts.

> // establishes the initial refcount of 0 (which doesn't mean 'freed object'
> // but 'no references, yet')
> // it's needed for object construction if the object may come from non-zero
> // initialized memory, otherwise ref_get should be used
> void ref_reset(objtype *ref);

Here we violently disagree. I pose that initialization should be !0;
after all, the act of creating the object means you have a reference to
it.

In Bjarne's owner vs non-owner pointer types proposal, new<>() returns an
owner pointer, i.e. refcount == 1.

The corollary is that 0 then becomes special, meaning 'no concurrency' /
'dead'.

> // returns a copy of the reference if the refcount can be safely
> // incremented, NULL otherwise (or crash&burn, etc)
> objtype *ref_get(objtype *ref);

Without initializing to !0, I pose there is no way to tell if you can
safely increment. The current thing does WARN; Kees would like crash and
burn, but those are details.

> // decrements the refcount and free the object if it reached 0
> // dtor could be a field of objtype but that'd cost memory (if there are
> // more objects than ref_put callers which is the likely case in real life)
> // and is an attack surface (kref follows this model too)
> void ref_put(objtype *ref, destruct_t *dtor);
> 
> conceptually that is all you need to provide for refcount users (note that
> there's no leakage of the lower level implementation into the API, e.g., it
> can be lockless based on atomic_t or synchronized explicitly, all that is
> hidden from the users of the API).
> 
> any other need/usage is not a refcount and needs its own API (and analysis
> of potential security problems when misused).

Here again we have to disagree. RCU/lockless lookups create a
difference in semantics on get(). Where normally, in the serialized
case, get() on 0 is an immediate UAF, the lockless case needs to
explicitly deal with that.

So we get:

	get_serialized() -- must never observe 0 and will return the
			    same pointer.
	get_lockless() -- can observe 0 and we must check the return
	                  value.

Since most people do not write lockless code, we default to the first
for the regular inc/get and provide inc_not_zero() for the latter one.

With your proposed interface we'd have to write:

	my_ptr = ref_get(ptr);
	WARN_ON(!my_ptr);

All over the place in order to provide the same warning to people,
because ref_get() could not include the WARN because of the lockless
case.

Providing this WARN for people writing code is, I feel, important. It's a
small and fairly simple check that avoids/catches mistakes as they're
made.
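
To illustrate the two flavours with the proposed API (the object and function
names below are made up):

struct obj {
	refcount_t ref;
	/* ... */
};

/* Serialized case: the caller already holds a reference (or a lock pinning
 * the object), so observing 0 here would be a bug and triggers the WARN. */
static void obj_get(struct obj *o)
{
	refcount_inc(&o->ref);
}

/* Lockless/RCU lookup: the object may be concurrently released, so hitting
 * 0 is expected and the caller must check the result. */
static struct obj *obj_tryget(struct obj *o)
{
	if (!refcount_inc_not_zero(&o->ref))
		return NULL;	/* lost the race; object already on its way out */
	return o;
}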

Then there's the whole dec_and_lock class, which isn't really
fundamental, but more a nice optimization where we can avoid taking the
lock in the common case.

As to the whole add/sub, I would rather not have that either.

> now to implement this in C
> you'll have to break the abstraction in that it'd be an undue burden on
> the API implementor to provide the API for each type of reference (in C++
> you could use templates) so in practice you'd also expose the immediately
> underlying refcount type (a reference to it) in the API and pray that users
> will get it right. something like this:
> 
> objtype *ref_get(objtype *ref, refcount_t *count);
> void ref_put(objtype *ref, refcount_t *count, destruct_t *dtor);
> 
> where 'count' would be passed as &ref->refcount_field or you could make
> this API macro based instead and stick to the proper API by mandating the
> name of the object's refcount field:
> 
> #define REF_GET(ref) ({ refcount_inc(&ref->refcount) ? ref : NULL;})
> #define REF_PUT(ref, dtor) do { if (refcount_dec(&ref->refcount)) dtor(ref); } while(0)

Since, as you say, we'd have to 'pray that users will get it right', I'm
not convinced of this particular form over any other. But it's trivial to
provide a wrapper like that if people feel strongly about it.

> > Yes it takes more time and effort, but the end result is better code.
> 
> most definitely but that road is a lot longer than some of you seem to
> anticipate ;).

Oh, I've been at this long enough to know that :-)

> i can't make any sense of your example.

Yeah, got my brain in a twist, it didn't make sense.

Patch

diff --git a/include/linux/kref.h b/include/linux/kref.h
index a5181bd..78f840a 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -19,23 +19,26 @@ 
 #include <linux/atomic.h>
 #include <linux/kernel.h>
 #include <linux/mutex.h>
+#include <linux/refcount.h>
 
 struct kref {
-	atomic_t refcount;
+	refcount_t refcount;
 };
 
+#define KREF_INIT(n)	{ .refcount = REFCOUNT_INIT(n), }
+
 /**
  * kref_init - initialize object.
  * @kref: object in question.
  */
 static inline void kref_init(struct kref *kref)
 {
-	atomic_set(&kref->refcount, 1);
+	refcount_set(&kref->refcount, 1);
 }
 
 static inline int kref_read(const struct kref *kref)
 {
-	return atomic_read(&kref->refcount);
+	return refcount_read(&kref->refcount);
 }
 
 /**
@@ -44,11 +47,7 @@  static inline int kref_read(const struct kref *kref)
  */
 static inline void kref_get(struct kref *kref)
 {
-	/* If refcount was 0 before incrementing then we have a race
-	 * condition when this kref is freeing by some other thread right now.
-	 * In this case one should use kref_get_unless_zero()
-	 */
-	WARN_ON_ONCE(atomic_inc_return(&kref->refcount) < 2);
+	refcount_inc(&kref->refcount);
 }
 
 /**
@@ -72,7 +71,7 @@  static inline int kref_put(struct kref *kref, void (*release)(struct kref *kref)
 {
 	WARN_ON(release == NULL);
 
-	if (atomic_dec_and_test(&kref->refcount)) {
+	if (refcount_dec_and_test(&kref->refcount)) {
 		release(kref);
 		return 1;
 	}
@@ -85,7 +84,7 @@  static inline int kref_put_mutex(struct kref *kref,
 {
 	WARN_ON(release == NULL);
 
-	if (atomic_dec_and_mutex_lock(&kref->refcount, lock)) {
+	if (refcount_dec_and_mutex_lock(&kref->refcount, lock)) {
 		release(kref);
 		return 1;
 	}
@@ -98,7 +97,7 @@  static inline int kref_put_lock(struct kref *kref,
 {
 	WARN_ON(release == NULL);
 
-	if (atomic_dec_and_lock(&kref->refcount, lock)) {
+	if (refcount_dec_and_lock(&kref->refcount, lock)) {
 		release(kref);
 		return 1;
 	}
@@ -123,6 +122,6 @@  static inline int kref_put_lock(struct kref *kref,
  */
 static inline int __must_check kref_get_unless_zero(struct kref *kref)
 {
-	return atomic_add_unless(&kref->refcount, 1, 0);
+	return refcount_inc_not_zero(&kref->refcount);
 }
 #endif /* _KREF_H_ */
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
new file mode 100644
index 0000000..fc5abdb
--- /dev/null
+++ b/include/linux/refcount.h
@@ -0,0 +1,262 @@ 
+#ifndef _LINUX_REFCOUNT_H
+#define _LINUX_REFCOUNT_H
+
+/*
+ * Variant of atomic_t specialized for reference counts.
+ *
+ * The interface matches the atomic_t interface (to aid in porting) but only
+ * provides the few functions one should use for reference counting.
+ *
+ * It differs in that the counter saturates at UINT_MAX and will not move once
+ * there. This avoids wrapping the counter and causing 'spurious'
+ * use-after-free issues.
+ *
+ * Memory ordering rules are slightly relaxed wrt regular atomic_t functions
+ * and provide only what is strictly required for refcounts.
+ *
+ * The increments are fully relaxed; these will not provide ordering. The
+ * rationale is that whatever is used to obtain the object we're increasing the
+ * reference count on will provide the ordering. For locked data structures,
+ * its the lock acquire, for RCU/lockless data structures its the dependent
+ * load.
+ *
+ * Do note that inc_not_zero() provides a control dependency which will order
+ * future stores against the inc, this ensures we'll never modify the object
+ * if we did not in fact acquire a reference.
+ *
+ * The decrements will provide release order, such that all the prior loads and
+ * stores will be issued before, it also provides a control dependency, which
+ * will order us against the subsequent free().
+ *
+ * The control dependency is against the load of the cmpxchg (ll/sc) that
+ * succeeded. This means the stores aren't fully ordered, but this is fine
+ * because the 1->0 transition indicates no concurrency.
+ *
+ * Note that the allocator is responsible for ordering things between free()
+ * and alloc().
+ *
+ */
+
+#include <linux/atomic.h>
+#include <linux/bug.h>
+#include <linux/mutex.h>
+#include <linux/spinlock.h>
+
+typedef struct refcount_struct {
+       atomic_t refs;
+} refcount_t;
+
+#define REFCOUNT_INIT(n)       { .refs = ATOMIC_INIT(n), }
+
+static inline void refcount_set(refcount_t *r, unsigned int n)
+{
+       atomic_set(&r->refs, n);
+}
+
+static inline unsigned int refcount_read(const refcount_t *r)
+{
+       return atomic_read(&r->refs);
+}
+
+static inline __must_check
+bool refcount_add_not_zero(unsigned int i, refcount_t *r)
+{
+       unsigned int old, new, val = atomic_read(&r->refs);
+
+       for (;;) {
+               if (!val)
+                       return false;
+
+               if (unlikely(val == UINT_MAX))
+                       return true;
+
+               new = val + i;
+               if (new < val)
+                       new = UINT_MAX;
+               old = atomic_cmpxchg_relaxed(&r->refs, val, new);
+               if (old == val)
+                       break;
+
+               val = old;
+       }
+
+       WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");
+
+       return true;
+}
+
+/*
+ * Similar to atomic_inc_not_zero(), will saturate at UINT_MAX and WARN.
+ *
+ * Provides no memory ordering, it is assumed the caller has guaranteed the
+ * object memory to be stable (RCU, etc.). It does provide a control dependency
+ * and thereby orders future stores. See the comment on top.
+ */
+static inline __must_check
+bool refcount_inc_not_zero(refcount_t *r)
+{
+       return refcount_add_not_zero(1, r);
+}
+
+/*
+ * Similar to atomic_inc(), will saturate at UINT_MAX and WARN.
+ *
+ * Provides no memory ordering, it is assumed the caller already has a
+ * reference on the object, will WARN when this is not so.
+ */
+static inline void refcount_inc(refcount_t *r)
+{
+       WARN(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n");
+}
+
+static inline void refcount_add(unsigned int i, refcount_t *r)
+{
+       WARN(!refcount_add_not_zero(i, r), "refcount_t: addition on 0; use-after-free.\n");
+}
+
+/*
+ * Similar to atomic_dec_and_test(), it will WARN on underflow and fail to
+ * decrement when saturated at UINT_MAX.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides a control dependency such that free() must come after.
+ * See the comment on top.
+ */
+static inline __must_check
+bool refcount_sub_and_test(unsigned int i, refcount_t *r)
+{
+       unsigned int old, new, val = atomic_read(&r->refs);
+
+       for (;;) {
+               if (val == UINT_MAX)
+                       return false;
+
+               new = val - i;
+               if (WARN(new > val, "refcount_t: underflow; use-after-free.\n"))
+                       return false;
+
+               old = atomic_cmpxchg_release(&r->refs, val, new);
+               if (old == val)
+                       break;
+
+               val = old;
+       }
+
+       return !new;
+}
+
+static inline __must_check
+bool refcount_dec_and_test(refcount_t *r)
+{
+       return refcount_sub_and_test(1, r);
+}
+
+/*
+ * Similar to atomic_dec(), it will WARN on underflow and fail to decrement
+ * when saturated at UINT_MAX.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before.
+ */
+static inline
+void refcount_dec(refcount_t *r)
+{
+       WARN(refcount_dec_and_test(r), "refcount_t: decrement hit 0; leaking memory.\n");
+}
+
+/*
+ * No atomic_t counterpart, it attempts a 1 -> 0 transition and returns the
+ * success thereof.
+ *
+ * Like all decrement operations, it provides release memory order and provides
+ * a control dependency.
+ *
+ * It can be used like a try-delete operator; this explicit case is provided
+ * and not cmpxchg in generic, because that would allow implementing unsafe
+ * operations.
+ */
+static inline __must_check
+bool refcount_dec_if_one(refcount_t *r)
+{
+       return atomic_cmpxchg_release(&r->refs, 1, 0) == 1;
+}
+
+/*
+ * No atomic_t counterpart, it decrements unless the value is 1, in which case
+ * it will return false.
+ *
+ * Was often done like: atomic_add_unless(&var, -1, 1)
+ */
+static inline __must_check
+bool refcount_dec_not_one(refcount_t *r)
+{
+       unsigned int old, new, val = atomic_read(&r->refs);
+
+       for (;;) {
+               if (val == UINT_MAX)
+                       return true;
+
+               if (val == 1)
+                       return false;
+
+               new = val - 1;
+               if (WARN(new > val, "refcount_t: underflow; use-after-free.\n"))
+                       return true;
+
+               old = atomic_cmpxchg_release(&r->refs, val, new);
+               if (old == val)
+                       break;
+
+               val = old;
+       }
+
+       return true;
+}
+
+/*
+ * Similar to atomic_dec_and_mutex_lock(), it will WARN on underflow and fail
+ * to decrement when saturated at UINT_MAX.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides a control dependency such that free() must come after.
+ * See the comment on top.
+ */
+static inline __must_check
+bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock)
+{
+       if (refcount_dec_not_one(r))
+               return false;
+
+       mutex_lock(lock);
+       if (!refcount_dec_and_test(r)) {
+               mutex_unlock(lock);
+               return false;
+       }
+
+       return true;
+}
+
+/*
+ * Similar to atomic_dec_and_lock(), it will WARN on underflow and fail to
+ * decrement when saturated at UINT_MAX.
+ *
+ * Provides release memory ordering, such that prior loads and stores are done
+ * before, and provides a control dependency such that free() must come after.
+ * See the comment on top.
+ */
+static inline __must_check
+bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock)
+{
+       if (refcount_dec_not_one(r))
+               return false;
+
+       spin_lock(lock);
+       if (!refcount_dec_and_test(r)) {
+               spin_unlock(lock);
+               return false;
+       }
+
+       return true;
+}
+
+#endif /* _LINUX_REFCOUNT_H */