
x86/refcount: Implement fast refcount_t handling

Message ID 20170421220939.GA65363@beast (mailing list archive)
State New, archived

Commit Message

Kees Cook April 21, 2017, 10:09 p.m. UTC
This patch ports the x86-specific atomic overflow handling from PaX's
PAX_REFCOUNT to the upstream refcount_t API. This is an updated version
from PaX that eliminates the saturation race condition by resetting the
atomic counter back to the INT_MAX saturation value on both overflow and
underflow. To win a race, a system would have to have INT_MAX threads
simultaneously overflow before the saturation handler runs.

With this, the commonly used inc/dec_and_test usage patterns present
in performance-sensitive areas of the kernel (mm, net, block) will
use the regular inline atomic operations with only a single overflow
test instruction added to the fast path.
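
For illustration only (a hypothetical "struct obj", not taken from any
particular subsystem), the caller-side pattern this is meant to keep
fast is:

	struct obj {
		refcount_t ref;
		/* ... other fields ... */
	};

	static void obj_get(struct obj *obj)
	{
		refcount_inc(&obj->ref);	/* lock incl + jo <.text.refcount_overflow> */
	}

	static void obj_put(struct obj *obj)
	{
		/* roughly: lock decl + jo <.text.refcount_underflow> + sete */
		if (refcount_dec_and_test(&obj->ref))
			kfree(obj);
	}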

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/Kconfig                       |  19 ++++++
 arch/x86/Kconfig                   |   1 +
 arch/x86/entry/entry_32.S          |   9 +++
 arch/x86/entry/entry_64.S          |   3 +
 arch/x86/include/asm/irq_vectors.h |   3 +
 arch/x86/include/asm/refcount.h    | 123 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/traps.h       |   5 ++
 arch/x86/kernel/traps.c            |  38 ++++++++++++
 drivers/misc/lkdtm_bugs.c          |  19 ++++--
 include/asm-generic/sections.h     |   4 ++
 include/asm-generic/vmlinux.lds.h  |   9 +++
 include/linux/kernel.h             |   2 +
 include/linux/refcount.h           |   4 ++
 kernel/panic.c                     |  23 +++++++
 lib/refcount.c                     |   6 +-
 15 files changed, 263 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/include/asm/refcount.h

Comments

Peter Zijlstra April 24, 2017, 8:32 a.m. UTC | #1
On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
> This patch ports the x86-specific atomic overflow handling from PaX's
> PAX_REFCOUNT to the upstream refcount_t API. This is an updated version
> from PaX that eliminates the saturation race condition by resetting the
> atomic counter back to the INT_MAX saturation value on both overflow and
> underflow. To win a race, a system would have to have INT_MAX threads
> simultaneously overflow before the saturation handler runs.

And is this impossible? Highly unlikely I'll grant you, but absolutely
impossible?

Also, you forgot nr_cpus in your bound. Afaict the worst case here is
O(nr_tasks + 3*nr_cpus).

Because PaX does it, is not a correctness argument. And this really
wants one.
Jann Horn April 24, 2017, 8:53 a.m. UTC | #2
On Mon, Apr 24, 2017 at 10:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
>> This patch ports the x86-specific atomic overflow handling from PaX's
>> PAX_REFCOUNT to the upstream refcount_t API. This is an updated version
>> from PaX that eliminates the saturation race condition by resetting the
>> atomic counter back to the INT_MAX saturation value on both overflow and
>> underflow. To win a race, a system would have to have INT_MAX threads
>> simultaneously overflow before the saturation handler runs.
>
> And is this impossible? Highly unlikely I'll grant you, but absolutely
> impossible?
>
> Also, you forgot nr_cpus in your bound. Afaict the worst case here is
> O(nr_tasks + 3*nr_cpus).
>
> Because PaX does it, is not a correctness argument. And this really
> wants one.

From include/linux/threads.h:

/*
 * A maximum of 4 million PIDs should be enough for a while.
 * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
 */
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

AFAICS that means you can only have up to 2^22 running tasks at
a time, since every running task requires a PID in the init pid namespace.
Peter Zijlstra April 24, 2017, 9:20 a.m. UTC | #3
On Mon, Apr 24, 2017 at 10:53:30AM +0200, Jann Horn wrote:
> On Mon, Apr 24, 2017 at 10:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
> >> This patch ports the x86-specific atomic overflow handling from PaX's
> >> PAX_REFCOUNT to the upstream refcount_t API. This is an updated version
> >> from PaX that eliminates the saturation race condition by resetting the
> >> atomic counter back to the INT_MAX saturation value on both overflow and
> >> underflow. To win a race, a system would have to have INT_MAX threads
> >> simultaneously overflow before the saturation handler runs.
> >
> > And is this impossible? Highly unlikely I'll grant you, but absolutely
> > impossible?
> >
> > Also, you forgot nr_cpus in your bound. Afaict the worst case here is
> > O(nr_tasks + 3*nr_cpus).
> >
> > Because PaX does it, is not a correctness argument. And this really
> > wants one.
> 
> From include/linux/threads.h:
> 
> /*
>  * A maximum of 4 million PIDs should be enough for a while.
>  * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
>  */
> #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
> (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
> 
> AFAICS that means you can only have up to 2^22 running tasks at
> a time, since every running task requires a PID in the init pid namespace.

That's but an artificial limit and bound to be increased sooner rather
than later, given the trend in building ever larger machines.

If we go with the futex limit on the full tid space, we end up with 2^30
(the comment here about 29 is wrong, see FUTEX_TID_MASK).

We then still have to add a multiple of nr_cpus. Granted, that will
still be slightly short of 3^31, but not to any amount I'm really
comfortable with, we're _really_ close.

Point remains though; Changelog (and arguably the code itself) should
very much include such bits.
Peter Zijlstra April 24, 2017, 10:45 a.m. UTC | #4
On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
> +static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
> +{
> +	const int a = 1;
> +	const int u = 0;
> +	int c, old;
> +
> +	c = atomic_read(&(r->refs));
> +	for (;;) {
> +		if (unlikely(c == (u)))
> +			break;
> +		old = atomic_cmpxchg(&(r->refs), c, c + (a));

Please use atomic_try_cmpxchg(), that generates saner code.

> +		if (likely(old == c))
> +			break;
> +		c = old;
> +	}
> +	return c != u;
> +}
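
For reference, the same loop written with atomic_try_cmpxchg() -- which
returns false and refreshes its "old" argument when the exchange fails --
would look something like this (a sketch, not a drop-in replacement):

	static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
	{
		int c = atomic_read(&r->refs);

		do {
			if (unlikely(c == 0))
				return false;
		} while (!atomic_try_cmpxchg(&r->refs, &c, c + 1));

		return true;
	}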
Peter Zijlstra April 24, 2017, 10:48 a.m. UTC | #5
On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
> diff --git a/drivers/misc/lkdtm_bugs.c b/drivers/misc/lkdtm_bugs.c
> index e3f4cd8876b5..1bdafb29b802 100644
> --- a/drivers/misc/lkdtm_bugs.c
> +++ b/drivers/misc/lkdtm_bugs.c
> @@ -135,9 +135,15 @@ void lkdtm_HUNG_TASK(void)
>  	schedule();
>  }
>  
> +#ifdef CONFIG_FAST_REFCOUNT
> +#define REFCOUNT_MAX	INT_MAX
> +#else
> +#define REFCOUNT_MAX	UINT_MAX
> +#endif

That doesn't seem like a sensible place for this.
pageexec@freemail.hu April 24, 2017, 11 a.m. UTC | #6
On 24 Apr 2017 at 10:32, Peter Zijlstra wrote:

> On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
> > This patch ports the x86-specific atomic overflow handling from PaX's
> > PAX_REFCOUNT to the upstream refcount_t API. This is an updated version
> > from PaX that eliminates the saturation race condition by resetting the
> > atomic counter back to the INT_MAX saturation value on both overflow and
> > underflow. To win a race, a system would have to have INT_MAX threads
> > simultaneously overflow before the saturation handler runs.

note that the above is wrong (and even contradicting itself and the code).

> And is this impossible? Highly unlikely I'll grant you, but absolutely
> impossible?

here's my analysis from a while ago:
http://www.openwall.com/lists/kernel-hardening/2017/01/05/19

> Also, you forgot nr_cpus in your bound. Afaict the worst case here is
> O(nr_tasks + 3*nr_cpus).

what does nr_cpus have to do with winning the race?

> Because PaX does it, is not a correctness argument. And this really
> wants one.

heh, do you want to tell me about how checking for a 0 refcount prevents
exploiting a bug?
Peter Zijlstra April 24, 2017, 11:15 a.m. UTC | #7
On Mon, Apr 24, 2017 at 01:00:18PM +0200, PaX Team wrote:
> On 24 Apr 2017 at 10:32, Peter Zijlstra wrote:

> > Also, you forgot nr_cpus in your bound. Afaict the worst case here is
> > O(nr_tasks + 3*nr_cpus).
> 
> what does nr_cpus have to do with winning the race?

The CPUs could each run nested softirq/hardirq/nmi context poking at the
refcount, irrespective of the (preempted) task context.

> > Because PaX does it, is not a correctness argument. And this really
> > wants one.
> 
> heh, do you want to tell me about how checking for a 0 refcount prevents
> exploiting a bug?

Not the point. All I said was that saying somebody else does it (anybody
else, doesn't matter it was you) isn't an argument for correctness.
pageexec@freemail.hu April 24, 2017, 1:08 p.m. UTC | #8
On 24 Apr 2017 at 13:15, Peter Zijlstra wrote:

> On Mon, Apr 24, 2017 at 01:00:18PM +0200, PaX Team wrote:
> > On 24 Apr 2017 at 10:32, Peter Zijlstra wrote:
> 
> > > Also, you forgot nr_cpus in your bound. Afaict the worst case here is
> > > O(nr_tasks + 3*nr_cpus).
> > 
> > what does nr_cpus have to do with winning the race?
> 
> The CPUs could each run nested softirq/hardirq/nmi context poking at the
> refcount, irrespective of the (preempted) task context.

that's fine but are you also assuming that the code executed in each of
those contexts leaks the same refcount? otherwise whatever they do to the
refcount is no more relevant than a non-leaking preemptible path that runs
to completion in a bounded amount of time (i.e., you get temporary bumps
and thus need to win yet another set of races to get their effects at once).

> > > Because PaX does it, is not a correctness argument. And this really
> > > wants one.
> > 
> > heh, do you want to tell me about how checking for a 0 refcount prevents
> > exploiting a bug?
> 
> Not the point. All I said was that saying somebody else does it (anybody
> else, doesn't matter it was you) isn't an argument for correctness.

that was exactly my point: all this applies to you as well. so let me ask
the 3rd time: what is your "argument for correctness" for a 0 refcount
value check? how does it prevent exploitation?
Peter Zijlstra April 24, 2017, 1:33 p.m. UTC | #9
On Mon, Apr 24, 2017 at 03:08:20PM +0200, PaX Team wrote:
> On 24 Apr 2017 at 13:15, Peter Zijlstra wrote:
> 
> > On Mon, Apr 24, 2017 at 01:00:18PM +0200, PaX Team wrote:
> > > On 24 Apr 2017 at 10:32, Peter Zijlstra wrote:
> > 
> > > > Also, you forgot nr_cpus in your bound. Afaict the worst case here is
> > > > O(nr_tasks + 3*nr_cpus).
> > > 
> > > what does nr_cpus have to do with winning the race?
> > 
> > The CPUs could each run nested softirq/hardirq/nmi context poking at the
> > refcount, irrespective of the (preempted) task context.
> 
> that's fine but are you also assuming that the code executed in each of
> those contexts leaks the same refcount? otherwise whatever they do to the
> refcount is no more relevant than a non-leaking preemptible path that runs
> to completion in a bounded amount of time (i.e., you get temporary bumps
> and thus need to win yet another set of races to get their effects at once).

For worst case analysis we have to assume it does, unless we can prove
it doesn't. And that proof is very very hard, and would need to be
redone every time the kernel changes.

> that was exactly my point: all this applies to you as well. so let me ask
> the 3rd time: what is your "argument for correctness" for a 0 refcount
> value check? how does it prevent exploitation?

What 0 count check are you talking about, the one that triggers when we
want to increment 0 ?

I think I've explained that before; per reference count rules 0 means
freed (or about to be freed when we talk RCU).

The whole pattern:

	if (dec_and_test(&obj->ref))
		kfree(obj);

expresses this etc.. Other reference counts also do this. No references
means it's getting freed.

Can you agree with this?


If so; any attempt to increase the reference count while it's (being)
freed() is a use-after-free.

Therefore we disallow 0 increment.


Yes, this is an annoyance when you consider usage-counts, where 0 means
something else. But then, we were talking about reference counts, not
something else.
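
The lookup-side counterpart of that rule is the guarded increment; as a
sketch (hypothetical object type), taking a new reference is only valid
while the count is still observed non-zero:

	static struct obj *obj_tryget(struct obj *obj)
	{
		if (!refcount_inc_not_zero(&obj->ref))
			return NULL;	/* count already hit 0; object is being freed */
		return obj;
	}

Any unconditional increment that can observe 0 is, per the above,
touching an object that is already (being) freed.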
pageexec@freemail.hu April 24, 2017, 3:15 p.m. UTC | #10
On 24 Apr 2017 at 15:33, Peter Zijlstra wrote:

> On Mon, Apr 24, 2017 at 03:08:20PM +0200, PaX Team wrote:
> > On 24 Apr 2017 at 13:15, Peter Zijlstra wrote:
> > 
> > > On Mon, Apr 24, 2017 at 01:00:18PM +0200, PaX Team wrote:
> > > > On 24 Apr 2017 at 10:32, Peter Zijlstra wrote:
> > > 
> > > > > Also, you forgot nr_cpus in your bound. Afaict the worst case here is
> > > > > O(nr_tasks + 3*nr_cpus).
> > > > 
> > > > what does nr_cpus have to do with winning the race?
> > > 
> > > The CPUs could each run nested softirq/hardirq/nmi context poking at the
> > > refcount, irrespective of the (preempted) task context.
> > 
> > that's fine but are you also assuming that the code executed in each of
> > those contexts leaks the same refcount? otherwise whatever they do to the
> > refcount is no more relevant than a non-leaking preemptible path that runs
> > to completion in a bounded amount of time (i.e., you get temporary bumps
> > and thus need to win yet another set of races to get their effects at once).
> 
> For worst case analysis we have to assume it does, unless we can proof
> it doesn't. And that proof is very very hard, and would need to be
> redone every time the kernel changes.

for worst case analysis you need to show the existence of an amd64 system that
can spawn 2G tasks. then you'll have to show the feasibility of making all of
them get preempted (without a later reschedule) inside a 2 insn window.

> > that was exactly my point: all this applies to you as well. so let me ask
> > the 3rd time: what is your "argument for correctness" for a 0 refcount
> > value check? how does it prevent exploitation?
> 
> What 0 count check are you talking about, the one that triggers when we
> want to increment 0 ?

are there any other 0 checks in there?

> I think I've explained that before; per reference count rules 0 means
> freed (or about to be freed when we talk RCU).

you only said the same thing, what 0 means. you (still) didn't explain how
checking for it prevents exploitation.

> The whole pattern:
> 
>  if (dec_and_test(&obj->ref))
>   kfree(obj);
> 
> expresses this etc.. Other reference counts also do this. No references
> means its getting freed.
> 
> Can you agree with this?

sure, so far so good.

> If so; any attempt to increase the reference count while its (being)
> freed() is a use-after-free.

why would ever be there such an attempt? a freed object with intact memory
content is as useful for an attacker as a live one, that is, not at all.
Kees Cook April 24, 2017, 8:16 p.m. UTC | #11
On Mon, Apr 24, 2017 at 1:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
>> This patch ports the x86-specific atomic overflow handling from PaX's
>> PAX_REFCOUNT to the upstream refcount_t API. This is an updated version
>> from PaX that eliminates the saturation race condition by resetting the
>> atomic counter back to the INT_MAX saturation value on both overflow and
>> underflow. To win a race, a system would have to have INT_MAX threads
>> simultaneously overflow before the saturation handler runs.
>
> And is this impossible? Highly unlikely I'll grant you, but absolutely
> impossible?

I'll adjust the language. "Highly unlikely" is still better than
"trivially doable with a single thread". :)

> Also, you forgot nr_cpus in your bound. Afaict the worst case here is
> O(nr_tasks + 3*nr_cpus).
>
> Because PaX does it, is not a correctness argument. And this really
> wants one.

Sure, I didn't mean to imply anything other than a demonstration of
what PaX is doing (and that it's better than not having it). If we can
improve it, that's great.

-Kees
Kees Cook April 24, 2017, 8:19 p.m. UTC | #12
On Mon, Apr 24, 2017 at 3:45 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
>> +static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
>> +{
>> +     const int a = 1;
>> +     const int u = 0;
>> +     int c, old;
>> +
>> +     c = atomic_read(&(r->refs));
>> +     for (;;) {
>> +             if (unlikely(c == (u)))
>> +                     break;
>> +             old = atomic_cmpxchg(&(r->refs), c, c + (a));
>
> Please use atomic_try_cmpxchg(), that generates saner code.

Ah-ha, thanks. I actually copied this directly out of the existing
atomic_t function, so we should probably update it there too.

-Kees

>
>> +             if (likely(old == c))
>> +                     break;
>> +             c = old;
>> +     }
>> +     return c != u;
>> +}
Kees Cook April 24, 2017, 8:21 p.m. UTC | #13
On Mon, Apr 24, 2017 at 3:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
>> diff --git a/drivers/misc/lkdtm_bugs.c b/drivers/misc/lkdtm_bugs.c
>> index e3f4cd8876b5..1bdafb29b802 100644
>> --- a/drivers/misc/lkdtm_bugs.c
>> +++ b/drivers/misc/lkdtm_bugs.c
>> @@ -135,9 +135,15 @@ void lkdtm_HUNG_TASK(void)
>>       schedule();
>>  }
>>
>> +#ifdef CONFIG_FAST_REFCOUNT
>> +#define REFCOUNT_MAX INT_MAX
>> +#else
>> +#define REFCOUNT_MAX UINT_MAX
>> +#endif
>
> That doesn't seem like a sensible place for this.

I'll drop the LKDTM changes from this particular patch. As for the
define, I think it's only interesting to LKDTM since it's the only
part interested in refcount_t internals. (i.e. nothing else would (or
should) use this information.)

-Kees
Kees Cook April 24, 2017, 8:33 p.m. UTC | #14
On Mon, Apr 24, 2017 at 4:00 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 24 Apr 2017 at 10:32, Peter Zijlstra wrote:
>
>> On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
>> > This patch ports the x86-specific atomic overflow handling from PaX's
>> > PAX_REFCOUNT to the upstream refcount_t API. This is an updated version
>> > from PaX that eliminates the saturation race condition by resetting the
>> > atomic counter back to the INT_MAX saturation value on both overflow and
>> > underflow. To win a race, a system would have to have INT_MAX threads
>> > simultaneously overflow before the saturation handler runs.
>
> note that the above is wrong (and even contradicting itself and the code).

True, this changelog could be more accurate (it resets to INT_MAX on
overflow and INT_MIN on underflow). I think I'm right in saying that a
system would need INT_MAX threads running a refcount_inc() (and a
refcount_dec_and_test() at exactly the right moment) before the reset
handler got scheduled, though, yes?

I'll attempt to clarify this.

-Kees
Kees Cook April 24, 2017, 8:40 p.m. UTC | #15
On Mon, Apr 24, 2017 at 8:15 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 24 Apr 2017 at 15:33, Peter Zijlstra wrote:
>> On Mon, Apr 24, 2017 at 03:08:20PM +0200, PaX Team wrote:
>> > On 24 Apr 2017 at 13:15, Peter Zijlstra wrote:
>> > that was exactly my point: all this applies to you as well. so let me ask
>> > the 3rd time: what is your "argument for correctness" for a 0 refcount
>> > value check? how does it prevent exploitation?
>>
>> I think I've explained that before; per reference count rules 0 means
>> freed (or about to be freed when we talk RCU).
>
> you only said the same thing, what 0 means. you (still) didn't explain how
> checking for it prevents exploitation.
>
>> The whole pattern:
>>
>>  if (dec_and_test(&obj->ref))
>>   kfree(obj);
>>
>> expresses this etc.. Other reference counts also do this. No references
>> means its getting freed.
>>
>> Can you agree with this?
>
> sure, so far so good.
>
>> If so; any attempt to increase the reference count while its (being)
>> freed() is a use-after-free.
>
> why would ever be there such an attempt? a freed object with intact memory
> content is as useful for an attacker as a live one, that is, not at all.

I think we're way off in the weeds here. The "cannot inc from 0" check
is about general sanity checks on refcounts. It should never happen,
and if it does, there's a bug. However, what the refcount hardening
protection is trying to do is protect against the exploitable condition:
overflow. Inc-from-0 isn't an exploitable condition since in theory
the memory suddenly becomes correctly managed again. We're just
discussing different things.

The point is to have very fast refcount_t that protects against
overflow so the mm, net, and block subsystems aren't worried about
making the atomic_t -> refcount_t changes there.

-Kees
Peter Zijlstra April 24, 2017, 10:01 p.m. UTC | #16
On Mon, Apr 24, 2017 at 01:40:56PM -0700, Kees Cook wrote:
> I think we're way off in the weeds here. The "cannot inc from 0" check
> is about general sanity checks on refcounts.

I disagree, although sanity checks are good too.

> It should never happen, and if it does, there's a bug.

The very same is true of the overflow thing.

> However, what the refcount hardening protection is trying to do is
> protect again the exploitable condition: overflow.

Sure..

> Inc-from-0 isn't an exploitable condition since in theory
> the memory suddenly becomes correctly managed again.

It does not. It just got free'ed. Nothing will stop the free from
happening (or already having happened).

> We're just discussing different things.

No, both are bugs, neither overflow nor inc-from-zero _should_ happen.
The whole point is that code is buggy. If it weren't for that, we'd not
be doing this.

How is the below not useful fodder for an exploit? It might be a less
common bug, and perhaps a bit more fiddly to make work, but afaict it's
still a full use-after-free and therefore useful.

---

Thread-A                                        Thread-B

if(dec_and_test(&obj->ref)) { // true, ref==0

                                                inc(&obj->ref) // ref: 0->1

        kfree(obj);
}



        ~~~/ Thread-C /~~~

        obj = kmalloc(); // is obj from Thread-A

	obj->ref = 1;
        obj->func = ....


                                                obj->func();

                                                        -- OR --

                                                put(obj);
                                                  if (dec_and_test(&obj->ref))
                                                    kfree(obj); // which was of Thread-C
Kees Cook April 24, 2017, 10:37 p.m. UTC | #17
On Mon, Apr 24, 2017 at 3:01 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Apr 24, 2017 at 01:40:56PM -0700, Kees Cook wrote:
>> I think we're way off in the weeds here. The "cannot inc from 0" check
>> is about general sanity checks on refcounts.
>
> I disagree, although sanity check are good too.
>
>> It should never happen, and if it does, there's a bug.
>
> The very same is true of the overflow thing.
>
>> However, what the refcount hardening protection is trying to do is
>> protect again the exploitable condition: overflow.
>
> Sure..
>
>> Inc-from-0 isn't an exploitable condition since in theory
>> the memory suddenly becomes correctly managed again.
>
> It does not. It just got free'ed. Nothing will stop the free from
> happening (or already having happened).

Well, yes, but that's kind of my point. Detecting inc-from-0 is "too
late" to offer a protection. It offers notification of a bug, rather
than stopping an exploit from happening.

>> We're just discussing different things.
>
> No, both are bugs, neither overflow not inc-from-zero _should_ happen.
> The whole point is that code is buggy. If it weren't for that, we'd not
> be doing this.
>
> How is the below not useful fodder for an exploit? It might be a less
> common bug, and perhaps a bit more fiddly to make work, but afaict its
> still a full use-after-free and therefore useful.
>
> ---
>
> Thread-A                                        Thread-B
>
> if(dec_and_test(&obj->ref)) { // true, ref==0
>
>                                                 inc(&obj->ref) // ref: 0->1
>
>         kfree(obj);
> }
>
>
>
>         ~~~/ Thread-C /~~~
>
>         obj = kmalloc(); // is obj from Thread-A
>
>         obj->ref = 1;
>         obj->func = ....
>
>
>                                                 obj->func();
>
>                                                         -- OR --
>
>                                                 put(obj);
>                                                   if (dec_and_test(&obj->ref))
>                                                     kfree(obj); // which was of Thread-C

Right, both are bugs, but the exploitable condition is "refcount hit
zero and got freed while there were still things using it". Having too
many put()s that cause the refcount to reach zero early can't be
detected, but having too many get()s that wrap around, ultimately to
zero, can be (the overflow condition). With overflow protection, the
refcount can never (realistically) hit zero since the refcount will
get saturated at INT_MAX, so no exploit is possible -- no memory is
ever freed (it just leaks the allocation). The inc-from-zero case
performing a notification is certainly nice, but the exploit already
happened.

I want the overflow protection to work everywhere the kernel uses
refcounts, which means dealing with performance concerns from mm, net,
block, etc. Since this is a 1 instruction change (which branch
prediction should make almost no cost at all), this will easily
address any performance concerns from the other subsystems.

I'd rather have the overflow protection everywhere than only in some
areas, and I want to break the deadlock of this catch-22 of subsystems
not being able to say what benchmarks are important to them but
refusing to switch to refcount_t due to performance concerns.

-Kees
Rik van Riel April 25, 2017, 1:11 a.m. UTC | #18
On Mon, 2017-04-24 at 15:37 -0700, Kees Cook wrote:
> On Mon, Apr 24, 2017 at 3:01 PM, Peter Zijlstra <peterz@infradead.org
> > wrote:
> > On Mon, Apr 24, 2017 at 01:40:56PM -0700, Kees Cook wrote:
> > > I think we're way off in the weeds here. The "cannot inc from 0"
> > > check
> > > is about general sanity checks on refcounts.
> > 
> > I disagree, although sanity check are good too.
> > 
> > > It should never happen, and if it does, there's a bug.
> > 
> > The very same is true of the overflow thing.
> > 
> > > However, what the refcount hardening protection is trying to do
> > > is
> > > protect again the exploitable condition: overflow.
> > 
> > Sure..
> > 
> > > Inc-from-0 isn't an exploitable condition since in theory
> > > the memory suddenly becomes correctly managed again.
> > 
> > It does not. It just got free'ed. Nothing will stop the free from
> > happening (or already having happened).
> 
> Well, yes, but that's kind of my point. Detecting inc-from-0 is "too
> late" to offer a protection. It offers notification of a bug, rather
> than stopping an exploit from happening.

inc-from-0 could allow the attacker to gain access to
an object which gets allocated to a new user afterwards.

Certainly much less useful as an exploit, but still a
potential privilege escalation.
Peter Zijlstra April 25, 2017, 9:05 a.m. UTC | #19
On Mon, Apr 24, 2017 at 03:37:32PM -0700, Kees Cook wrote:
> On Mon, Apr 24, 2017 at 3:01 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> > It does not. It just got free'ed. Nothing will stop the free from
> > happening (or already having happened).
> 
> Well, yes, but that's kind of my point. Detecting inc-from-0 is "too
> late" to offer a protection. It offers notification of a bug, rather
> than stopping an exploit from happening.

Well, your setup (panic_on_warn et al) would have it panic the box. That
will effectively stop the exploit by virtue of stopping everything.

And warn/bug/panic etc.. are I think a better option than silently
letting it happen.
Peter Zijlstra April 25, 2017, 10:23 a.m. UTC | #20
On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
> +static __always_inline void refcount_inc(refcount_t *r)
> +{
> +	asm volatile(LOCK_PREFIX "incl %0\n\t"
> +		REFCOUNT_CHECK_OVERFLOW(4)
> +		: [counter] "+m" (r->refs.counter)
> +		: : "cc", "cx");
> +}
> +
> +static __always_inline void refcount_dec(refcount_t *r)
> +{
> +	asm volatile(LOCK_PREFIX "decl %0\n\t"
> +		REFCOUNT_CHECK_UNDERFLOW(4)
> +		: [counter] "+m" (r->refs.counter)
> +		: : "cc", "cx");
> +}

> +dotraplinkage void do_refcount_error(struct pt_regs *regs, long error_code)
> +{
> +	const char *str = NULL;
> +
> +	BUG_ON(!(regs->flags & X86_EFLAGS_OF));
> +
> +#define range_check(size, direction, type, value) \
> +	if ((unsigned long)__##size##_##direction##_start <= regs->ip && \
> +	    regs->ip < (unsigned long)__##size##_##direction##_end) { \
> +		*(type *)regs->cx = value; \
> +		str = #size " " #direction; \
> +	}
> +
> +	range_check(refcount,   overflow,  int, INT_MAX)
> +	range_check(refcount,   underflow, int, INT_MIN)
> +
> +#undef range_check
> +
> +	BUG_ON(!str);
> +	do_error_trap(regs, error_code, (char *)str, X86_REFCOUNT_VECTOR,
> +		      SIGILL);
> +}
> +#endif


So what avoids this:

	CPU0				CPU1


	lock inc %[val]; # 0x7fffffff
	jo  2f
1:	...

					lock dec %[val]; # 0x80000000
					jo  2f
				1:	...




2:	mov $0x7fffffff, %[val]
	jmp 1b

				2:	mov $0x80000000, %[val]
					jmp 1b




	~~~~//~~~~


	lock inc %val; #0x80000000
	....

	lock inc %val; 0xffffffff
	lock inc %val; 0x00000000
pageexec@freemail.hu April 25, 2017, 11:26 a.m. UTC | #21
On 25 Apr 2017 at 0:01, Peter Zijlstra wrote:

> On Mon, Apr 24, 2017 at 01:40:56PM -0700, Kees Cook wrote:
> > I think we're way off in the weeds here. The "cannot inc from 0" check
> > is about general sanity checks on refcounts.
> 
> I disagree, although sanity check are good too.

exactly, an attacker doesn't care how a premature free occurs due
to reaching a 0 refcount, afterwards it's memory corruption time for
both old and new references regardless.

> > However, what the refcount hardening protection is trying to do is
> > protect again the exploitable condition: overflow.
> 
> Sure..

underflow is also exploitable, it's just much harder to defend against
(there're no known practical solutions).

> > Inc-from-0 isn't an exploitable condition since in theory
> > the memory suddenly becomes correctly managed again.
> 
> It does not. It just got free'ed. Nothing will stop the free from
> happening (or already having happened).

now hold this thought...

> How is the below not useful fodder for an exploit? It might be a less
> common bug, and perhaps a bit more fiddly to make work, but afaict its
> still a full use-after-free and therefore useful.
> 
> ---
> 
> Thread-A                                        Thread-B
> 
> if(dec_and_test(&obj->ref)) { // true, ref==0
> 
>                                                 inc(&obj->ref) // ref: 0->1
> 
>         kfree(obj);
> }

... and tell me why an attacker would let Thread-B do that increment
(that you're trying to detect) *before* the underlying memory gets
reused and thus the 0 changed to something else? hint: he'll do everything
in his power to prevent that, either by winning the race or if there's
no race (no refcount users outside his control), he'll win every time.
IOW, checking for 0 is pointless and you kinda proved it yourself now.
pageexec@freemail.hu April 25, 2017, 11:26 a.m. UTC | #22
On 24 Apr 2017 at 13:33, Kees Cook wrote:

> On Mon, Apr 24, 2017 at 4:00 AM, PaX Team <pageexec@freemail.hu> wrote:
> > On 24 Apr 2017 at 10:32, Peter Zijlstra wrote:
> >
> >> On Fri, Apr 21, 2017 at 03:09:39PM -0700, Kees Cook wrote:
> >> > This patch ports the x86-specific atomic overflow handling from PaX's
> >> > PAX_REFCOUNT to the upstream refcount_t API. This is an updated version
> >> > from PaX that eliminates the saturation race condition by resetting the
> >> > atomic counter back to the INT_MAX saturation value on both overflow and
> >> > underflow. To win a race, a system would have to have INT_MAX threads
> >> > simultaneously overflow before the saturation handler runs.
> >
> > note that the above is wrong (and even contradicting itself and the code).
> 
> True, this changelog could be more accurate (it resets to INT_MAX on
> overflow and INT_MIN on underflow). I think I'm right in saying that a
> system would need INT_MAX threads running a refcount_inc() (and a
> refcount_dec_and_test() at exactly the right moment) before the reset
> handler got scheduled, though, yes?

there's no uniform answer to this as there're several conditions
that can affect the effectiveness of the refcount protection.

e.g., how many independent leaking paths can the attacker exercise
(typically one), are these paths under some kind of locks (would
already prevent unbounded leaks/increments should the overflow
detecting thread be preempted), are negative refcounts allowed and
checked for or only signed overflow, etc.

INT_MAX threads would be needed when the leaking path is locked so
that it can only be exercised once and you'll need to get normal
(balanced) paths preempted just after the increment. if the leaking
path is lockless (can be exercised in parallel without bounds) then
2 threads are enough where the one triggering the signed overflow
would have to be preempted while the other one does INT_MAX increments
and trigger the UAF. this is where the other mechanisms i talked about
in the past become relevant: preemption or interrupts can be disabled
or negative refcount values can be detected and acted upon (your blind
copy-pasting effort passed up on this latter opportunity by not
specializing the 'jo' into 'js' for the refcount case).
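
to make the jo/js difference concrete, here's a stand-alone sketch (plain
C, not kernel code) of the two detection windows for a single increment:

	#include <limits.h>
	#include <stdbool.h>

	/* "jo" only fires on the one increment that wraps INT_MAX. */
	static bool jo_traps_on_inc(int before)
	{
		return before == INT_MAX;
	}

	/* "js" fires whenever the incremented value has its sign bit set,
	 * i.e. anywhere in the 0x80000000..0xffffffff range, so increments
	 * done while the overflowing thread sits preempted still get caught. */
	static bool js_traps_on_inc(int before)
	{
		return before == INT_MAX || before < -1;
	}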
pageexec@freemail.hu April 25, 2017, 11:26 a.m. UTC | #23
On 25 Apr 2017 at 12:23, Peter Zijlstra wrote:

> So what avoids this:

simple, you noted it yourself in your previous mail:

> Well, your setup (panic_on_warn et al) would have it panic the box. That
> will effectively stop the exploit by virtue of stopping everything.

with that in mind the actual code looks like this:

>	CPU0				CPU1
>
>
>	lock inc %[val]; # 0x7fffffff
>	jo  2f
>1:	...
>
>					lock dec %[val]; # 0x80000000
>					jo  2f
>				1:	...
>
>
>
>
>2:	mov $0x7fffffff, %[val]

	panic()

>	jmp 1b
>
>				2:	mov $0x80000000, %[val]

					panic()

>					jmp 1b
>

... and we never get this far.
Kees Cook April 25, 2017, 4:36 p.m. UTC | #24
On Tue, Apr 25, 2017 at 4:26 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 25 Apr 2017 at 0:01, Peter Zijlstra wrote:
>> How is the below not useful fodder for an exploit? It might be a less
>> common bug, and perhaps a bit more fiddly to make work, but afaict its
>> still a full use-after-free and therefore useful.
>>
>> ---
>>
>> Thread-A                                        Thread-B
>>
>> if(dec_and_test(&obj->ref)) { // true, ref==0
>>
>>                                                 inc(&obj->ref) // ref: 0->1
>>
>>         kfree(obj);
>> }
>
> ... and tell me why an attacker would let Thread-B do that increment
> (that you're trying to detect) *before* the underlying memory gets
> reused and thus the 0 changed to something else? hint: he'll do everything
> in his power to prevent that, either by winning the race or if there's
> no race (no refcount users outside his control), he'll win every time.
> IOW, checking for 0 is pointless and you kinda proved it yourself now.

Right, having a deterministic protection (checking for overflow) is
best since it stops all exploits using that path. Hoping that an
attacker is unlucky and hits a notification after they've already
landed their corruption is not a very useful defense. It certainly has
a non-zero value, but stopping overflow 100% is better. Especially
when we can do it with no meaningful change in performance which lets
us actually do the atomic_t -> refcount_t conversion everywhere.

-Kees
Kees Cook April 25, 2017, 4:39 p.m. UTC | #25
On Tue, Apr 25, 2017 at 4:26 AM, PaX Team <pageexec@freemail.hu> wrote:
> INT_MAX threads would be needed when the leaking path is locked so
> that it can only be exercised once and you'll need to get normal
> (balanced) paths preempted just after the increment. if the leaking
> path is lockless (can be exercised in parallel without bounds) then
> 2 threads are enough where the one triggering the signed overflow
> would have to be preempted while the other one does INT_MAX increments
> and trigger the UAF. this is where the other mechanisms i talked about
> in the past become relevant: preemption or interrupts can be disabled
> or negative refcount values can be detected and acted upon (your blind
> copy-pasting effort passed upon this latter opportunity by not
> specializing the 'jo' into 'js' for the refcount case).

Well, it's not "blind" -- I'm trying to bring the code as-is to
upstream for discussion/examination with as little functional
differences as possible so it's easier to compare apples to apples.
(Which already resulted in more eyes looking at the code to find a bug
-- thanks Jann!) But yes, jo -> js hugely increases the coverage. I'll
make that change for v2.

Thanks!

-Kees
pageexec@freemail.hu April 26, 2017, 2:14 a.m. UTC | #26
On 25 Apr 2017 at 9:39, Kees Cook wrote:

> On Tue, Apr 25, 2017 at 4:26 AM, PaX Team <pageexec@freemail.hu> wrote:
> > INT_MAX threads would be needed when the leaking path is locked so
> > that it can only be exercised once and you'll need to get normal
> > (balanced) paths preempted just after the increment. if the leaking
> > path is lockless (can be exercised in parallel without bounds) then
> > 2 threads are enough where the one triggering the signed overflow
> > would have to be preempted while the other one does INT_MAX increments
> > and trigger the UAF. this is where the other mechanisms i talked about
> > in the past become relevant: preemption or interrupts can be disabled
> > or negative refcount values can be detected and acted upon (your blind
> > copy-pasting effort passed upon this latter opportunity by not
> > specializing the 'jo' into 'js' for the refcount case).
> 
> Well, it's not "blind" -- I'm trying to bring the code as-is to
> upstream for discussion/examination with as little functional
> differences as possible so it's easier to compare apples to apples.

you copied code from a version which is at least 2 major kernel revisions
behind (so much for those apples), you chose the one version which had a
bug that you didn't spot nor fix properly, you didn't realize the opportunity
that a special refcount type represents, you claimed refcount underflows
aren't exploitable but copied code that would detect signed underflow, you
didn't understand the limits and edge cases i explained above... need i go
on? doesn't leave one with great confidence in your ability to understand
and maintain this code...
Kees Cook April 26, 2017, 4:42 a.m. UTC | #27
On Tue, Apr 25, 2017 at 7:14 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 25 Apr 2017 at 9:39, Kees Cook wrote:
>
>> On Tue, Apr 25, 2017 at 4:26 AM, PaX Team <pageexec@freemail.hu> wrote:
>> > INT_MAX threads would be needed when the leaking path is locked so
>> > that it can only be exercised once and you'll need to get normal
>> > (balanced) paths preempted just after the increment. if the leaking
>> > path is lockless (can be exercised in parallel without bounds) then
>> > 2 threads are enough where the one triggering the signed overflow
>> > would have to be preempted while the other one does INT_MAX increments
>> > and trigger the UAF. this is where the other mechanisms i talked about
>> > in the past become relevant: preemption or interrupts can be disabled
>> > or negative refcount values can be detected and acted upon (your blind
>> > copy-pasting effort passed upon this latter opportunity by not
>> > specializing the 'jo' into 'js' for the refcount case).
>>
>> Well, it's not "blind" -- I'm trying to bring the code as-is to
>> upstream for discussion/examination with as little functional
>> differences as possible so it's easier to compare apples to apples.
>
> you copied code from a version which is at least 2 major kernel revisions
> behind (so much for those apples)

Hmm, this was from your 4.9 port. Linus hasn't quite released 4.11
yet, so that's actually "at most 2 major kernel revisions behind". :)
Regardless, I'd be happy to refresh the port. Will you share a URL to
your latest rebase against upstream?

> you chose the one version which had a
> bug that you didn't spot nor fix properly, you didn't realize the opportunity
> that a special refcount type represents, you claimed refcount underflows
> aren't exploitable but copied code that would detect signed underflow, you
> didn't understand the limits and edge cases i explained above... need i go

As I said, I was trying to minimize changes to your implementation,
which included the bug and the other issues. The point of this was to
share it with others, and work collaboratively on it. I think this
clearly succeeded with benefits to both upstream and PaX: Jann spotted
the fix for the bug causing weird crashes I saw when doing initial
testing, you pointed out the benefit of using js over jo, I've
reorganized the RMWcc macros for more easily adding trailing
instructions, Peter is thinking about ways around the protection, etc.

> on? doesn't leave one with great confidence in your ability to understand
> and maintain this code...

Well, that's your opinion. I think the patch and its discussion helped
several people, including myself, understand this code. Since many
people will share its maintenance, I think this is the right way to
handle upstreaming these kinds of things. I don't claim to be
omniscient, just persistent. Getting this protection into upstream
means every Linux user will benefit from what you created, which I
think is awesome; this whole class of refcount flaws goes away. Thank
you for writing it, sharing it, and discussing it!

-Kees

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index cd211a14a88f..2cd150f03175 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -847,4 +847,23 @@  config STRICT_MODULE_RWX
 config ARCH_WANT_RELAX_ORDER
 	bool
 
+config ARCH_HAS_FAST_REFCOUNT
+	bool
+	help
+	  An architecture selects this when it has implemented refcount_t
+	  using primitives that provide a faster runtime at the expense
+	  of some refcount state checks. The refcount overflow condition,
+	  however, must be retained. Catching overflows is the primary
+	  security concern for protecting against bugs in reference counts.
+
+config FAST_REFCOUNT
+	bool "Speed up reference counting at the expense of full validation"
+	depends on ARCH_HAS_FAST_REFCOUNT
+	help
+	  The regular reference counting infrastructure in the kernel checks
+	  many error conditions. If this option is selected, refcounting
+	  is made faster using architecture-specific implementations that may
+	  only check for reference count overflows (which is the primary
+	  way reference counting bugs are turned into security exploits).
+
 source "kernel/gcov/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..a13db97e0d71 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -50,6 +50,7 @@  config X86
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_FAST_MULTIPLIER
+	select ARCH_HAS_FAST_REFCOUNT
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_KCOV			if X86_64
 	select ARCH_HAS_MMIO_FLUSH
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 57f7ec35216e..9e8d9e2d70bf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -792,6 +792,15 @@  ENTRY(spurious_interrupt_bug)
 	jmp	common_exception
 END(spurious_interrupt_bug)
 
+#ifdef CONFIG_FAST_REFCOUNT
+ENTRY(refcount_error)
+	ASM_CLAC
+	pushl   $0
+	pushl   $do_refcount_error
+	jmp     error_code
+ENDPROC(refcount_error)
+#endif
+
 #ifdef CONFIG_XEN
 ENTRY(xen_hypervisor_callback)
 	pushl	$-1				/* orig_ax = -1 => not a system call */
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 044d18ebc43c..a736b882ec76 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -858,6 +858,9 @@  idtentry coprocessor_error		do_coprocessor_error		has_error_code=0
 idtentry alignment_check		do_alignment_check		has_error_code=1
 idtentry simd_coprocessor_error		do_simd_coprocessor_error	has_error_code=0
 
+#ifdef CONFIG_FAST_REFCOUNT
+idtentry refcount_error			do_refcount_error		has_error_code=0
+#endif
 
 	/*
 	 * Reload gs selector with exception handling
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 6ca9fd6234e1..64ca4dcc29ec 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -48,6 +48,9 @@ 
 
 #define IA32_SYSCALL_VECTOR		0x80
 
+/* Refcount Overflow or Underflow Exception. */
+#define X86_REFCOUNT_VECTOR		0x81
+
 /*
  * Vectors 0x30-0x3f are used for ISA interrupts.
  *   round up to the next 16-vector boundary
diff --git a/arch/x86/include/asm/refcount.h b/arch/x86/include/asm/refcount.h
new file mode 100644
index 000000000000..79e35981e42f
--- /dev/null
+++ b/arch/x86/include/asm/refcount.h
@@ -0,0 +1,123 @@ 
+#ifndef __ASM_X86_REFCOUNT_H
+#define __ASM_X86_REFCOUNT_H
+/*
+ * x86-specific implementation of refcount_t. Ported from PAX_REFCOUNT in
+ * PaX/grsecurity.
+ */
+#include <linux/refcount.h>
+#include <asm/irq_vectors.h>
+
+#define __REFCOUNT_CHECK(size)				\
+	"jo 111f\n"					\
+	".if "__stringify(size)" == 4\n\t"		\
+	".pushsection .text.refcount_overflow\n"	\
+	".elseif "__stringify(size)" == -4\n\t"		\
+	".pushsection .text.refcount_underflow\n"	\
+	".else\n"					\
+	".error \"invalid size\"\n"			\
+	".endif\n"					\
+	"111:\tlea %[counter],%%"_ASM_CX"\n\t"		\
+	"int $"__stringify(X86_REFCOUNT_VECTOR)"\n"	\
+	"222:\n\t"					\
+	".popsection\n"					\
+	"333:\n"					\
+	_ASM_EXTABLE(222b, 333b)
+
+#define REFCOUNT_CHECK_OVERFLOW(size)	__REFCOUNT_CHECK(size)
+#define REFCOUNT_CHECK_UNDERFLOW(size)	__REFCOUNT_CHECK(-(size))
+
+#if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CC_HAVE_ASM_GOTO)
+/* Use asm goto */
+#define __GEN_CHECKED_RMWcc(fullop, var, size, cc, ...)			\
+do {									\
+	asm_volatile_goto(fullop					\
+			"\n\t"__REFCOUNT_CHECK(size)			\
+			";j" #cc " %l[cc_label]"			\
+			: : [counter] "m" (var), ## __VA_ARGS__		\
+			: "memory", "cc", "cx" : cc_label);		\
+	return 0;							\
+cc_label:								\
+	return 1;							\
+} while (0)
+
+#define GEN_BINARY_CHECKED_RMWcc(op, var, size, vcon, val, arg0, cc)	\
+	__GEN_CHECKED_RMWcc(op " %1, " arg0, var, size, cc, vcon (val))
+
+#else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
+
+#define __GEN_CHECKED_RMWcc(fullop, var, size, cc, ...)			\
+do {									\
+	bool c;								\
+	asm volatile (fullop						\
+			"\n\t"__REFCOUNT_CHECK(size)			\
+			";" CC_SET(cc)					\
+			: [counter] "+m" (var), CC_OUT(cc) (c)		\
+			: __VA_ARGS__ : "memory", "cc", "cx");		\
+	return c != 0;							\
+} while (0)
+
+#define GEN_BINARY_CHECKED_RMWcc(op, var, size, vcon, val, arg0, cc)	\
+	__GEN_CHECKED_RMWcc(op " %2, " arg0, var, size, cc, vcon (val))
+
+#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
+
+#define GEN_UNARY_CHECKED_RMWcc(op, var, size, arg0, cc)		\
+	__GEN_CHECKED_RMWcc(op " " arg0, var, size, cc)
+
+static __always_inline void refcount_add(unsigned int i, refcount_t *r)
+{
+	asm volatile(LOCK_PREFIX "addl %1,%0\n\t"
+		REFCOUNT_CHECK_OVERFLOW(4)
+		: [counter] "+m" (r->refs.counter)
+		: "ir" (i)
+		: "cc", "cx");
+}
+
+static __always_inline void refcount_inc(refcount_t *r)
+{
+	asm volatile(LOCK_PREFIX "incl %0\n\t"
+		REFCOUNT_CHECK_OVERFLOW(4)
+		: [counter] "+m" (r->refs.counter)
+		: : "cc", "cx");
+}
+
+static __always_inline void refcount_dec(refcount_t *r)
+{
+	asm volatile(LOCK_PREFIX "decl %0\n\t"
+		REFCOUNT_CHECK_UNDERFLOW(4)
+		: [counter] "+m" (r->refs.counter)
+		: : "cc", "cx");
+}
+
+static __always_inline __must_check
+bool refcount_sub_and_test(unsigned int i, refcount_t *r)
+{
+	GEN_BINARY_CHECKED_RMWcc(LOCK_PREFIX "subl", r->refs.counter,
+				-4, "er", i, "%0", e);
+}
+
+static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
+{
+	GEN_UNARY_CHECKED_RMWcc(LOCK_PREFIX "decl", r->refs.counter,
+				-4, "%0", e);
+}
+
+static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
+{
+	const int a = 1;
+	const int u = 0;
+	int c, old;
+
+	c = atomic_read(&(r->refs));
+	for (;;) {
+		if (unlikely(c == (u)))
+			break;
+		old = atomic_cmpxchg(&(r->refs), c, c + (a));
+		if (likely(old == c))
+			break;
+		c = old;
+	}
+	return c != u;
+}
+
+#endif
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 01fd0a7f48cd..e4d8db75d85e 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -38,6 +38,10 @@  asmlinkage void machine_check(void);
 #endif /* CONFIG_X86_MCE */
 asmlinkage void simd_coprocessor_error(void);
 
+#ifdef CONFIG_FAST_REFCOUNT
+asmlinkage void refcount_error(void);
+#endif
+
 #ifdef CONFIG_TRACING
 asmlinkage void trace_page_fault(void);
 #define trace_stack_segment stack_segment
@@ -54,6 +58,7 @@  asmlinkage void trace_page_fault(void);
 #define trace_alignment_check alignment_check
 #define trace_simd_coprocessor_error simd_coprocessor_error
 #define trace_async_page_fault async_page_fault
+#define trace_refcount_error refcount_error
 #endif
 
 dotraplinkage void do_divide_error(struct pt_regs *, long);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 4e496379a871..999d324119c0 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -192,6 +192,13 @@  do_trap_no_signal(struct task_struct *tsk, int trapnr, char *str,
 			tsk->thread.trap_nr = trapnr;
 			die(str, regs, error_code);
 		}
+
+#ifdef CONFIG_FAST_REFCOUNT
+		if (trapnr == X86_REFCOUNT_VECTOR) {
+			regs->ip -= 2;	/* sizeof(int $xx) */
+			refcount_error_report(regs, str);
+		}
+#endif
 		return 0;
 	}
 
@@ -308,6 +315,32 @@  __visible void __noreturn handle_stack_overflow(const char *message,
 }
 #endif
 
+#ifdef CONFIG_FAST_REFCOUNT
+
+dotraplinkage void do_refcount_error(struct pt_regs *regs, long error_code)
+{
+	const char *str = NULL;
+
+	BUG_ON(!(regs->flags & X86_EFLAGS_OF));
+
+#define range_check(size, direction, type, value) \
+	if ((unsigned long)__##size##_##direction##_start <= regs->ip && \
+	    regs->ip < (unsigned long)__##size##_##direction##_end) { \
+		*(type *)regs->cx = value; \
+		str = #size " " #direction; \
+	}
+
+	range_check(refcount,   overflow,  int, INT_MAX)
+	range_check(refcount,   underflow, int, INT_MIN)
+
+#undef range_check
+
+	BUG_ON(!str);
+	do_error_trap(regs, error_code, (char *)str, X86_REFCOUNT_VECTOR,
+		      SIGILL);
+}
+#endif
+
 #ifdef CONFIG_X86_64
 /* Runs on IST stack */
 dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
@@ -983,6 +1016,11 @@  void __init trap_init(void)
 	set_bit(IA32_SYSCALL_VECTOR, used_vectors);
 #endif
 
+#ifdef CONFIG_FAST_REFCOUNT
+	set_intr_gate(X86_REFCOUNT_VECTOR, refcount_error);
+	set_bit(X86_REFCOUNT_VECTOR, used_vectors);
+#endif
+
 	/*
 	 * Set the IDT descriptor to a fixed read-only location, so that the
 	 * "sidt" instruction will not leak the location of the kernel, and
diff --git a/drivers/misc/lkdtm_bugs.c b/drivers/misc/lkdtm_bugs.c
index e3f4cd8876b5..1bdafb29b802 100644
--- a/drivers/misc/lkdtm_bugs.c
+++ b/drivers/misc/lkdtm_bugs.c
@@ -135,9 +135,15 @@  void lkdtm_HUNG_TASK(void)
 	schedule();
 }
 
+#ifdef CONFIG_FAST_REFCOUNT
+#define REFCOUNT_MAX	INT_MAX
+#else
+#define REFCOUNT_MAX	UINT_MAX
+#endif
+
 void lkdtm_REFCOUNT_SATURATE_INC(void)
 {
-	refcount_t over = REFCOUNT_INIT(UINT_MAX - 1);
+	refcount_t over = REFCOUNT_INIT(REFCOUNT_MAX - 1);
 
 	pr_info("attempting good refcount decrement\n");
 	refcount_dec(&over);
@@ -146,7 +152,7 @@  void lkdtm_REFCOUNT_SATURATE_INC(void)
 	pr_info("attempting bad refcount inc overflow\n");
 	refcount_inc(&over);
 	refcount_inc(&over);
-	if (refcount_read(&over) == UINT_MAX)
+	if (refcount_read(&over) == REFCOUNT_MAX)
 		pr_err("Correctly stayed saturated, but no BUG?!\n");
 	else
 		pr_err("Fail: refcount wrapped\n");
@@ -154,7 +160,7 @@  void lkdtm_REFCOUNT_SATURATE_INC(void)
 
 void lkdtm_REFCOUNT_SATURATE_ADD(void)
 {
-	refcount_t over = REFCOUNT_INIT(UINT_MAX - 1);
+	refcount_t over = REFCOUNT_INIT(REFCOUNT_MAX - 1);
 
 	pr_info("attempting good refcount decrement\n");
 	refcount_dec(&over);
@@ -162,7 +168,7 @@  void lkdtm_REFCOUNT_SATURATE_ADD(void)
 
 	pr_info("attempting bad refcount add overflow\n");
 	refcount_add(2, &over);
-	if (refcount_read(&over) == UINT_MAX)
+	if (refcount_read(&over) == REFCOUNT_MAX)
 		pr_err("Correctly stayed saturated, but no BUG?!\n");
 	else
 		pr_err("Fail: refcount wrapped\n");
@@ -178,6 +184,11 @@  void lkdtm_REFCOUNT_ZERO_DEC(void)
 		pr_err("Stayed at zero, but no BUG?!\n");
 	else
 		pr_err("Fail: refcount went crazy\n");
+
+	pr_info("attempting bad refcount decrement past INT_MIN\n");
+	atomic_set(&zero.refs, INT_MIN);
+	refcount_dec(&zero);
+	pr_err("Fail: wrap not detected\n");
 }
 
 void lkdtm_REFCOUNT_ZERO_SUB(void)
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index 532372c6cf15..0590f384f234 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -20,6 +20,8 @@ 
  *                   may be out of this range on some architectures.
  * [_sinittext, _einittext]: contains .init.text.* sections
  * [__bss_start, __bss_stop]: contains BSS sections
+ * [__refcount_overflow/underflow_start, ..._end]: contains .text sections
+ *		     for refcount error handling.
  *
  * Following global variables are optional and may be unavailable on some
  * architectures and/or kernel configurations.
@@ -39,6 +41,8 @@  extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
 extern char __kprobes_text_start[], __kprobes_text_end[];
 extern char __entry_text_start[], __entry_text_end[];
 extern char __start_rodata[], __end_rodata[];
+extern char __refcount_overflow_start[], __refcount_overflow_end[];
+extern char __refcount_underflow_start[], __refcount_underflow_end[];
 
 /* Start and end of .ctors section - used for constructor calls. */
 extern char __ctors_start[], __ctors_end[];
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 143db9c523e2..a04aae39e820 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -448,9 +448,18 @@ 
 		ALIGN_FUNCTION();					\
 		*(.text.hot .text .text.fixup .text.unlikely)		\
 		*(.ref.text)						\
+		REFCOUNT_TEXT						\
 	MEM_KEEP(init.text)						\
 	MEM_KEEP(exit.text)						\
 
+#define __REFCOUNT_TEXT(section)					\
+		VMLINUX_SYMBOL(__##section##_start) = .;                \
+		*(.text.##section)                                      \
+		VMLINUX_SYMBOL(__##section##_end) = .;
+
+#define REFCOUNT_TEXT							\
+	__REFCOUNT_TEXT(refcount_overflow)				\
+	__REFCOUNT_TEXT(refcount_underflow)
 
 /* sched.text is aling to function alignment to secure we have same
  * address even at second ld pass when generating System.map */
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4c26dc3a8295..bc15822b24eb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -275,6 +275,8 @@  extern int oops_may_print(void);
 void do_exit(long error_code) __noreturn;
 void complete_and_exit(struct completion *, long) __noreturn;
 
+void refcount_error_report(struct pt_regs *regs, const char *kind);
+
 /* Internal, do not use. */
 int __must_check _kstrtoul(const char *s, unsigned int base, unsigned long *res);
 int __must_check _kstrtol(const char *s, unsigned int base, long *res);
diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 0023fee4bbbc..fdb82bcaf975 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -22,6 +22,9 @@  static inline unsigned int refcount_read(const refcount_t *r)
 	return atomic_read(&r->refs);
 }
 
+#ifdef CONFIG_FAST_REFCOUNT
+#include <asm/refcount.h>
+#else
 extern __must_check bool refcount_add_not_zero(unsigned int i, refcount_t *r);
 extern void refcount_add(unsigned int i, refcount_t *r);
 
@@ -33,6 +36,7 @@  extern void refcount_sub(unsigned int i, refcount_t *r);
 
 extern __must_check bool refcount_dec_and_test(refcount_t *r);
 extern void refcount_dec(refcount_t *r);
+#endif
 
 extern __must_check bool refcount_dec_if_one(refcount_t *r);
 extern __must_check bool refcount_dec_not_one(refcount_t *r);
diff --git a/kernel/panic.c b/kernel/panic.c
index a58932b41700..a1745b60cc36 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -26,6 +26,7 @@ 
 #include <linux/nmi.h>
 #include <linux/console.h>
 #include <linux/bug.h>
+#include <linux/ratelimit.h>
 
 #define PANIC_TIMER_STEP 100
 #define PANIC_BLINK_SPD 18
@@ -601,6 +602,28 @@  EXPORT_SYMBOL(__stack_chk_fail);
 
 #endif
 
+#ifdef CONFIG_FAST_REFCOUNT
+static DEFINE_RATELIMIT_STATE(refcount_ratelimit, 15 * HZ, 3);
+
+void refcount_error_report(struct pt_regs *regs, const char *kind)
+{
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, current, true);
+
+	if (!__ratelimit(&refcount_ratelimit))
+		return;
+
+	pr_emerg("%s detected in: %s:%d, uid/euid: %u/%u\n",
+		kind ? kind : "refcount error",
+		current->comm, task_pid_nr(current),
+		from_kuid_munged(&init_user_ns, current_uid()),
+		from_kuid_munged(&init_user_ns, current_euid()));
+	print_symbol(KERN_EMERG "refcount error occurred at: %s\n",
+		instruction_pointer(regs));
+	BUG();
+}
+EXPORT_SYMBOL(refcount_error_report);
+#endif
+
 core_param(panic, panic_timeout, int, 0644);
 core_param(pause_on_oops, pause_on_oops, int, 0644);
 core_param(panic_on_warn, panic_on_warn, int, 0644);
diff --git a/lib/refcount.c b/lib/refcount.c
index aa09ad3c30b0..903a59557893 100644
--- a/lib/refcount.c
+++ b/lib/refcount.c
@@ -37,6 +37,9 @@ 
 #include <linux/refcount.h>
 #include <linux/bug.h>
 
+/* Leave out architecture-specific implementations. */
+#ifndef CONFIG_FAST_REFCOUNT
+
 bool refcount_add_not_zero(unsigned int i, refcount_t *r)
 {
 	unsigned int old, new, val = atomic_read(&r->refs);
@@ -168,6 +171,8 @@  void refcount_dec(refcount_t *r)
 }
 EXPORT_SYMBOL_GPL(refcount_dec);
 
+#endif /* CONFIG_FAST_REFCOUNT */
+
 /*
  * No atomic_t counterpart, it attempts a 1 -> 0 transition and returns the
  * success thereof.
@@ -264,4 +269,3 @@  bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock)
 	return true;
 }
 EXPORT_SYMBOL_GPL(refcount_dec_and_lock);
-