Message ID | 1482994571-18687-7-git-send-email-elena.reshetova@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, Dec 29, 2016 at 08:55:58AM +0200, Elena Reshetova wrote: > + > +static inline __must_check > +bool refcount_add_not_zero(unsigned int i, refcount_t *r) > +{ > + unsigned int old, new, val = atomic_read(&r->refs); > + > + for (;;) { > + if (!val) > + return false; > + > + if (unlikely(val == UINT_MAX)) > + return true; > + > + new = val + i; > + if (new < val) > + new = UINT_MAX; > + old = atomic_cmpxchg_relaxed(&r->refs, val, new); > + if (old == val) > + break; > + > + val = old; > + } > + > + WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n"); > + > + return true; > +} > + > +/* > + * Similar to atomic_inc_not_zero(), will saturate at UINT_MAX and WARN. > + * > + * Provides no memory ordering, it is assumed the caller has guaranteed the > + * object memory to be stable (RCU, etc.). It does provide a control dependency > + * and thereby orders future stores. See the comment on top. > + */ > +static inline __must_check > +bool refcount_inc_not_zero(refcount_t *r) > +{ > + return refcount_add_not_zero(1, r); > +} > + > +/* > + * Similar to atomic_inc(), will saturate at UINT_MAX and WARN. > + * > + * Provides no memory ordering, it is assumed the caller already has a > + * reference on the object, will WARN when this is not so. > + */ > +static inline void refcount_inc(refcount_t *r) > +{ > + WARN(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n"); > +} > + ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64. This is the wrong approach. We need a low-overhead solution, otherwise no one will turn on refcount protection and the feature will be useless. What exactly is wrong with the current solution in PAX/grsecurity? Looking at the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo' to, if the counter overflowed, jump to *out-of-line* error handling code, in a separate section of the kernel image. Then it raises a software interrupt, and the interrupt handler sets the overflowed counter to INT_MAX and does the needed logging and signal raising. That approach seems very efficient. It seems the only overhead when someone isn't actually exploiting a refcount bug is the 'jo' instruction, with the branch not taken. There aren't even any other in-line instructions to waste icache space. I do see they used to use a slightly different approach that did a decrement instead of setting the counter to INT_MAX. And that was clearly racy because two concurrent increments could circumvent the overflow protection. But AFAICS their current solution is not racy in any meaningful way, since the setting to INT_MAX means an overflow will be detected again on the next increment, even if there were some concurrent increments in the mean time. (And if by some stretch of the imagination, it was possible to execute *another* INT_MAX increments before the fixup code had a chance to run, the correct solution would be to simply use 'js' instead of 'jo' to detect overflows. I'm guessing the only reason they don't do that is because some atomic_t's are used to store negative values...) So given that there is an existing solution which AFAICS is efficient and achieves the desired protection, why has the proposal turned into a monstrous cmpxchg loop that won't be practical to enable by default? I also think that the "warn when incrementing a 0 refcount" part of the change shouldn't be there. 
It's additional overhead that seems tangential to the main goal of the feature which is to protect against refcount overflows, not to protect against random increments in some object which has *already* been freed and potentially exploited. - Eric
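A rough sketch of the kind of construct described above may help; this is NOT the actual PaX/grsecurity code, and the function name atomic_inc_checked is made up. The real feature branches with 'jo' to out-of-line fixup code that raises a trap; this simplified version clamps the counter back to INT_MAX inline, so the common path stays a single 'lock incl' plus a never-taken branch.

#include <linux/atomic.h>
#include <linux/kernel.h>

/* Hypothetical checked increment in the style described above. */
static inline void atomic_inc_checked(atomic_t *v)
{
	asm volatile("lock incl %0\n\t"
		     "jno 1f\n\t"		/* no overflow: fast path done */
		     "movl %1, %0\n"		/* overflow: clamp back to INT_MAX */
		     "1:"
		     : "+m" (v->counter)
		     : "i" (INT_MAX)
		     : "cc");
	/* The real out-of-line handler would also log and raise a signal. */
}

Note that the clamp is a separate plain store issued after the incremented value has already been visible; that momentary INT_MAX+1 state is exactly what Peter objects to later in the thread.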
> On Thu, Dec 29, 2016 at 08:55:58AM +0200, Elena Reshetova wrote: > > + > > +static inline __must_check > > +bool refcount_add_not_zero(unsigned int i, refcount_t *r) > > +{ > > + unsigned int old, new, val = atomic_read(&r->refs); > > + > > + for (;;) { > > + if (!val) > > + return false; > > + > > + if (unlikely(val == UINT_MAX)) > > + return true; > > + > > + new = val + i; > > + if (new < val) > > + new = UINT_MAX; > > + old = atomic_cmpxchg_relaxed(&r->refs, val, new); > > + if (old == val) > > + break; > > + > > + val = old; > > + } > > + > > + WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n"); > > + > > + return true; > > +} > > + > > +/* > > + * Similar to atomic_inc_not_zero(), will saturate at UINT_MAX and WARN. > > + * > > + * Provides no memory ordering, it is assumed the caller has guaranteed the > > + * object memory to be stable (RCU, etc.). It does provide a control dependency > > + * and thereby orders future stores. See the comment on top. > > + */ > > +static inline __must_check > > +bool refcount_inc_not_zero(refcount_t *r) > > +{ > > + return refcount_add_not_zero(1, r); > > +} > > + > > +/* > > + * Similar to atomic_inc(), will saturate at UINT_MAX and WARN. > > + * > > + * Provides no memory ordering, it is assumed the caller already has a > > + * reference on the object, will WARN when this is not so. > > + */ > > +static inline void refcount_inc(refcount_t *r) > > +{ > > + WARN(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after- > free.\n"); > > +} > > + > > ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64. > This is the wrong approach. We need a low-overhead solution, otherwise no one > will turn on refcount protection and the feature will be useless. > > What exactly is wrong with the current solution in PAX/grsecurity? Looking at > the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo' > to, if the counter overflowed, jump to *out-of-line* error handling code, in a > separate section of the kernel image. Then it raises a software interrupt, and > the interrupt handler sets the overflowed counter to INT_MAX and does the > needed > logging and signal raising. > > That approach seems very efficient. It seems the only overhead when someone > isn't actually exploiting a refcount bug is the 'jo' instruction, with the > branch not taken. There aren't even any other in-line instructions to waste > icache space. > I do see they used to use a slightly different approach that did a decrement > instead of setting the counter to INT_MAX. And that was clearly racy because > two concurrent increments could circumvent the overflow protection. But AFAICS > their current solution is not racy in any meaningful way, since the setting to > INT_MAX means an overflow will be detected again on the next increment, even if > there were some concurrent increments in the mean time. (And if by some stretch > of the imagination, it was possible to execute *another* INT_MAX increments > before the fixup code had a chance to run, the correct solution would be to > simply use 'js' instead of 'jo' to detect overflows. I'm guessing the only > reason they don't do that is because some atomic_t's are used to store negative > values...) > > So given that there is an existing solution which AFAICS is efficient and > achieves the desired protection, why has the proposal turned into a monstrous > cmpxchg loop that won't be practical to enable by default? 
I guess we can try to benchmark the whole thing to see what the overhead is, especially now that we have the subsystem parts ready.

And if you have been following the story since the beginning, we also have the PaX/grsecurity approach done for linux-next/stable, and benchmarks have been posted previously, so it would be easy to compare if needed.

I guess one could in principle think of a mixed approach between this one and the one that PaX/grsecurity had: define the refcount_t type and API, but use assembly instructions behind it to speed things up.

> I also think that the "warn when incrementing a 0 refcount" part of the change shouldn't be there. It's additional overhead that seems tangential to the main goal of the feature which is to protect against refcount overflows, not to protect against random increments in some object which has *already* been freed and potentially exploited.

Actually, having the warn has so far been a useful debugging feature: it let us find many places which would not have worked otherwise.

Best Regards,
Elena.

>
> - Eric
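One way the "mixed approach" mentioned above could look, purely as an illustration: the ARCH_HAS_REFCOUNT symbol and arch_refcount_inc() helper are hypothetical names, and the fallback is the generic helper from the posted patch.

#ifdef ARCH_HAS_REFCOUNT
/* arch supplies a checked fast path, e.g. 'lock incl' + 'jo' fixup */
static inline void refcount_inc(refcount_t *r)
{
	arch_refcount_inc(r);
}
#else
/* generic cmpxchg-based implementation from the posted patch */
static inline void refcount_inc(refcount_t *r)
{
	WARN(!refcount_inc_not_zero(r),
	     "refcount_t: increment on 0; use-after-free.\n");
}
#endif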
On Fri, Dec 30, 2016 at 01:17:08PM +0000, Reshetova, Elena wrote: > > > > ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64. > > This is the wrong approach. We need a low-overhead solution, otherwise no one > > will turn on refcount protection and the feature will be useless. > > > > What exactly is wrong with the current solution in PAX/grsecurity? Looking at > > the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo' > > to, if the counter overflowed, jump to *out-of-line* error handling code, in a > > separate section of the kernel image. Then it raises a software interrupt, and > > the interrupt handler sets the overflowed counter to INT_MAX and does the > > needed > > logging and signal raising. > > > > That approach seems very efficient. It seems the only overhead when someone > > isn't actually exploiting a refcount bug is the 'jo' instruction, with the > > branch not taken. There aren't even any other in-line instructions to waste > > icache space. > > I do see they used to use a slightly different approach that did a decrement > > instead of setting the counter to INT_MAX. And that was clearly racy because > > two concurrent increments could circumvent the overflow protection. But AFAICS > > their current solution is not racy in any meaningful way, since the setting to > > INT_MAX means an overflow will be detected again on the next increment, even if > > there were some concurrent increments in the mean time. (And if by some stretch > > of the imagination, it was possible to execute *another* INT_MAX increments > > before the fixup code had a chance to run, the correct solution would be to > > simply use 'js' instead of 'jo' to detect overflows. I'm guessing the only > > reason they don't do that is because some atomic_t's are used to store negative > > values...) > > > > > So given that there is an existing solution which AFAICS is efficient and > > achieves the desired protection, why has the proposal turned into a monstrous > > cmpxchg loop that won't be practical to enable by default? > > I guess we can try to benchmark the whole thing to see what is the overhead. > Especially now when we have the subsystem parts ready. > > And if you have been following the story since beginning, we also have PaX/grsecurity > approach done for linux-next/stable and benchmarks have been previously posted, so > would be easy to compare if needed. > > I guess one could in principle think of mixed approach between this one and the one that > PaX/grsecurity had: define refcount_t type and API, but use assembly instructions > behind to speed things up. I haven't been following the whole story, sorry. Are your "PaX/grsecurity approach" patches based on their latest version, i.e. without the racy decrement? Also, with regards to benchmarks, there really are two things that are important: the performance impact of actually executing all the cmpxchg loop stuff instead of a simple 'lock incl', and the icache footprint of all the extra inlined instructions (or the overhead of a function call if that is "solved" by making refcount_inc() out of line). The latter is unlikely to be visible in a microbenchmark but it's still very important. AFAICS the whole question of refcount_t/atomic_t vs. atomic_t/atomic_unchecked_t has nothing to do with the details of how the checked refcount operations are actually implemented. These things need to be considered separately. 
> > I also think that the "warn when incrementing a 0 refcount" part of the change shouldn't be there. It's additional overhead that seems tangential to the main goal of the feature which is to protect against refcount overflows, not to protect against random increments in some object which has *already* been freed and potentially exploited.
>
> Actually, having the warn has so far been a useful debugging feature: it let us find many places which would not have worked otherwise.

The point of the feature is exploit mitigation, not refcount debugging. The mitigation needs to be practical to turn on in real production systems. If we want extra debugging features too, fine, but those should be *debugging* features rather than security features, controllable by a separate config option, e.g. CONFIG_DEBUG_REFCOUNT for debugging vs. CONFIG_HARDENED_REFCOUNT for the actual mitigation.

- Eric
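A sketch of the split Eric proposes, using the config names from his mail; the code itself is only illustrative and would stand in for the patch's refcount_inc(). The zero-increment WARN compiles away unless the debug option is set, while the saturation inside refcount_inc_not_zero() remains as the hardening part.

static inline void refcount_inc(refcount_t *r)
{
	bool ok = refcount_inc_not_zero(r);

#ifdef CONFIG_DEBUG_REFCOUNT
	/* debugging aid: catch increments on already-freed objects */
	WARN(!ok, "refcount_t: increment on 0; use-after-free.\n");
#else
	(void)ok;	/* saturation still happened; nothing to report */
#endif
}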
On Thu, Dec 29, 2016 at 07:06:27PM -0600, Eric Biggers wrote:
>
> ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64. This is the wrong approach. We need a low-overhead solution, otherwise no one will turn on refcount protection and the feature will be useless.

Its not something that can be turned on or off, refcount_t is unconditional code. But you raise a good point on the size of the thing.

I count 116 bytes on x86_64. If I disable all the debug crud that reduces to 45 bytes, and that's because GCC-6 is generating particularly stupid code (albeit not as stupid as GCC-5 did).

A hand coded one ends up being 29 bytes, of which 6 are the function pro- and epilogue, so effectively 23 bytes for inline. Which we can reduce to 22 if instead of using a literal for UINT_MAX we do: xor %[t], %[t]; dec %[t].

0000000000009ee0 <ponies>:
    9ee0:  55                    push   %rbp
    9ee1:  48 89 e5              mov    %rsp,%rbp
    9ee4:  8b 07                 mov    (%rdi),%eax
    9ee6:  85 c0                 test   %eax,%eax
    9ee8:  74 10                 je     9efa <ponies+0x1a>
    9eea:  89 c2                 mov    %eax,%edx
    9eec:  ff c2                 inc    %edx
    9eee:  73 04                 jae    9ef4 <ponies+0x14>
    9ef0:  31 d2                 xor    %edx,%edx
    9ef2:  ff ca                 dec    %edx
    9ef4:  f0 0f b1 17           lock cmpxchg %edx,(%rdi)
    9ef8:  75 ec                 jne    9ee6 <ponies+0x6>
    9efa:  5d                    pop    %rbp
    9efb:  c3                    retq

(caveat: I wrote this on a post-holidays brain without testing)

Also note that call overhead on an x86_64 (big core) is something like 1.5 cycles. And afaik Sparc64 is the architecture with the worst call overhead, but that already has its atomic functions out of line for different reasons.

> What exactly is wrong with the current solution in PAX/grsecurity? Looking at the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo' to, if the counter overflowed, jump to *out-of-line* error handling code, in a separate section of the kernel image. Then it raises a software interrupt, and the interrupt handler sets the overflowed counter to INT_MAX and does the needed logging and signal raising.

Doing an unconditional INC on INT_MAX gives a temporarily visible artifact of INT_MAX+1 (or INT_MIN) in the best case.

This is fundamentally not an atomic operation and therefore does not belong in the atomic_* family, full stop.

Not to mention that the whole wrap/unwrap or checked/unchecked split of atomic_t is a massive trainwreck. Moving over to refcount_t, which has simple and well defined semantics forces us to audit and cleanup all the reference counting crud as it gets converted, this is a good thing. Yes it takes more time and effort, but the end result is better code. I understand why PaX/grsecurity chose not to do this, but that doesn't make it a proper solution for upstream.

Now as to why refcount cannot be implemented using that scheme you outlined:

    vCPU0                               vCPU1

    lock inc %[r]
    jo

    <vcpu preempt-out>

                                        for lots
                                          refcount_dec_and_test(&obj->ref)
                                          /* hooray, we hit 0 */
                                          kfree(obj);

    <vcpu preempt-in>

    mov $0xFFFFFFFF, %[r]               /* OOPS use-after-free */

Is this unlikely, yes, extremely so. Do I want to be the one debugging this, heck no.
On Tue, Jan 03, 2017 at 02:21:36PM +0100, Peter Zijlstra wrote: > On Thu, Dec 29, 2016 at 07:06:27PM -0600, Eric Biggers wrote: > > > > ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64. > > This is the wrong approach. We need a low-overhead solution, otherwise no one > > will turn on refcount protection and the feature will be useless. > > Its not something that can be turned on or off, refcount_t is > unconditional code. But you raise a good point on the size of the thing. ... > Doing an unconditional INC on INT_MAX gives a temporarily visible > artifact of INT_MAX+1 (or INT_MIN) in the best case. > > This is fundamentally not an atomic operation and therefore does not > belong in the atomic_* family, full stop. Again I feel this is going down the wrong track. The point of the PaX feature this is based on is to offer protection against *exploits* involving abuse of refcount leak bugs. If the overflow logic triggers then there is a kernel *bug* and the rules have already been broken. The question of whether the exploit mitigation is "atomic" is not important unless it allows the mitigation to be circumvented. And yes this should be a config option just like other hardening options like CONFIG_HARDENED_USERCOPY. Making it unconditional only makes it harder to get merged and hurts users who, for whatever reason, don't want/need extra protections against kernel exploits. This is especially true if an implementation with significant performance and code size overhead is chosen. > Now as to why refcount cannot be implemented using that scheme you > outlined: > > vCPU0 vCPU1 > > lock inc %[r] > jo > > <vcpu preempt-out> > > for lots > refcount_dec_and_test(&obj->ref) > /* hooray, we hit 0 */ > kfree(obj); > > <vcpu preempt-in> > > mov $0xFFFFFFFF, %[r] /* OOPS use-after-free */ This scenario doesn't make sense. If there's no bug that causes extra refcount decrements, then it would be impossible to generate the INT_MAX+1 decrements needed to free the object. Or if there *is* a bug that causes extra refcount decrements, then it could already be abused at any time in any of the proposed solutions to trivially cause a use-after-free. Eric
On Wed, Jan 04, 2017 at 12:36:01PM -0800, Eric Biggers wrote: > On Tue, Jan 03, 2017 at 02:21:36PM +0100, Peter Zijlstra wrote: > > On Thu, Dec 29, 2016 at 07:06:27PM -0600, Eric Biggers wrote: > > > > > > ... and refcount_inc() compiles to over 100 bytes of instructions on x86_64. > > > This is the wrong approach. We need a low-overhead solution, otherwise no one > > > will turn on refcount protection and the feature will be useless. > > > > Its not something that can be turned on or off, refcount_t is > > unconditional code. But you raise a good point on the size of the thing. > ... > > Doing an unconditional INC on INT_MAX gives a temporarily visible > > artifact of INT_MAX+1 (or INT_MIN) in the best case. > > > > This is fundamentally not an atomic operation and therefore does not > > belong in the atomic_* family, full stop. > > Again I feel this is going down the wrong track. The point of the PaX feature > this is based on is to offer protection against *exploits* involving abuse of > refcount leak bugs. If the overflow logic triggers then there is a kernel *bug* > and the rules have already been broken. The question of whether the exploit > mitigation is "atomic" is not important unless it allows the mitigation to be > circumvented. It matters for where you put it. It cannot be part of atomic_t if it is intrinsically not atomic. Not to mention that the whole checked/unchecked split doesn't make sense for atomic_t since the concept only applies to a subset of the operations, and for those are only desired for a subset of applications; notably the reference count case. The proposed refcount_t aims to address the exact exploit scenario you mention, but does so from a separate type with well defined semantics. Our current effort has mostly been aimed at exploring the space and defining the semantics and API that make sense. For example the refcount type should not have functions that can subvert the protection or create malformed stuff, ie. it should be internally consistent. Note that the saturation semantics are an effective mitigation of the described use-after-free exploit into a resource leak denial-of-service; although Kees wants to later add an instant panic option, which would create a more immediate DoS scenario. > And yes this should be a config option just like other hardening options like > CONFIG_HARDENED_USERCOPY. Making it unconditional only makes it harder to get > merged and hurts users who, for whatever reason, don't want/need extra > protections against kernel exploits. This is especially true if an > implementation with significant performance and code size overhead is chosen. Maybe.. and while I agree that whatever GCC generates from the generic implementation currently available is quite horrible, I'm not convinced a hand coded equivalent is an actual problem. The cmpxchg loop only really suffers in performance when the variable is contended, and when a refcount is contended enough for this to show up, your code has issues anyway. That leaves size, and lets first show that that ends up being a real problem. Premature optimization etc.. > This scenario doesn't make sense. If there's no bug that causes extra refcount > decrements, then it would be impossible to generate the INT_MAX+1 decrements > needed to free the object. Or if there *is* a bug that causes extra refcount > decrements, then it could already be abused at any time in any of the proposed > solutions to trivially cause a use-after-free. 
Right you are, so much for thinking straight :-)

I still feel deeply uncomfortable with the nonatomic nature of the thing; I'll have to think a bit more on it.

Also note that such a scheme does not make sense for LL/SC architectures.

In any case, if it can be proven to be equivalent, we can always change the x86 implementation later.
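To make the saturation point from the previous mail concrete, here is a hedged sketch (struct obj and obj_put() are made-up names) of why a saturated counter turns a would-be use-after-free into a leak: refcount_sub_and_test() in the posted patch refuses to move a counter stuck at UINT_MAX, so the final free never happens.

#include <linux/refcount.h>
#include <linux/slab.h>

struct obj {
	refcount_t ref;
	/* ... payload ... */
};

static void obj_put(struct obj *o)
{
	/* Always false once the counter has saturated at UINT_MAX. */
	if (refcount_dec_and_test(&o->ref))
		kfree(o);	/* never reached after saturation: leak, not UAF */
}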
On 3 Jan 2017 at 14:21, Peter Zijlstra wrote: > Its not something that can be turned on or off, refcount_t is > unconditional code. But you raise a good point on the size of the thing. > > I count 116 bytes on x86_64. If I disable all the debug crud... ...then the whole thing loses its raison d'etre ;). so let's not cheat and compare apples to apples in the future. this means that your solution has a very bad size overhead (dozens of insns vs. one, multiplied by a few thousand uses) and performance overhead (several insns that the processor cannot parallelize due to register and control dependencies, at least one mispredicted conditional forward jump, etc) which is simply not acceptable as it doesn't balance it with any sort of security advantage (see below). > > What exactly is wrong with the current solution in PAX/grsecurity? Looking at > > the x86 version they have atomic_inc() do 'lock incl' like usual, then use 'jo' > > to, if the counter overflowed, jump to *out-of-line* error handling code, in a > > separate section of the kernel image. Then it raises a software interrupt, and > > the interrupt handler sets the overflowed counter to INT_MAX and does the needed > > logging and signal raising. > > Doing an unconditional INC on INT_MAX gives a temporarily visible > artifact of INT_MAX+1 (or INT_MIN) in the best case. > > This is fundamentally not an atomic operation and therefore does not > belong in the atomic_* family, full stop. i don't know why people keep repeating this myth but what exactly is not atomic in the PaX REFCOUNT feature? hint: it's all atomic ops based, there's no semantic change in there. what is undesired behaviour (for the security goal, not the underlying atomic_t API!) is that the undo/saturation logic makes the whole dance visible to other CPUs on non-LL/SC archs which means that this case needs special care there to achieve the security goal but there's no breakage in atomic ops. put another way, if you consider REFCOUNT not atomic somehow then so is the upstream atomic_t API since it makes INT_MAX+1 visible to all CPUs just like REFCOUNT does (in fact the whole point of the REFCOUNT accessors is that they can be better than upstream on LL/SC in this regard). > Not to mention that the whole wrap/unwrap or checked/unchecked split of > atomic_t is a massive trainwreck. Moving over to refcount_t, which has > simple and well defined semantics forces us to audit and cleanup all the > reference counting crud as it gets converted, this is a good thing. while a noble goal your approach has two problems: one, it doesn't go nearly far enough, and two, what it does do currently is anything but clear (beyond the above discussed implementation problems). let me elaborate on both. 1. the proper approach should be to separate the low-level implementation machinery from the higher level APIs exposed to users. the current low-level API is the atomic_t type and its accessors (including the 64/long variants) which is directly used by most code and there're few higher level abstractions on top of it. the proper low-level design should first of all separate the underlying type into signed/unsigned types as the conversions between the two make for a clumsy and error prone API to start with (just look at your own refcount_t API, you're abusing signed/unsigned conversions where one direction is implementation defined and allows a trap representation, is your code ready to handle that? not to mention that the C implementations of atomic_t trigger UB when signed overflow occurs). 
second, based on the signed/unsigned atomic API you should build not one but several abstractions. refcount_t is an obvious candidate (based on the unsigned low-level machinery) but there's a lot more kinds (IIRC, i counted something like a dozen different patterns when i last looked at the exceptional cases for REFCOUNT). this leads to... 2. your current refcount_t API is wrong and i think reflects a misunderstanding of what reference counts are about in general. in particular, checking for a 0 refcount value for any purpose other than freeing the underlying object makes no sense and means that the usage is not a refcount or the user has a UAF bug to start with (the check in kref_get is also nonsense for this reason). this is because refcounts protect (account for) references/handles to an object, not the object itself. this also means that such references (think pointers in C) can only be created by *copying* an existing one but they cannot be made up out of thin air. in theory, each time a reference is copied, the corresponding refcount must be increased and when the reference goes out of scope, the refcount must be decremented. when the refcount reaches 0 we know that there're no references left so the object can be freed. in practice, we optimize the refcount increments on copies by observing that the lifetimes of certain reference copies nest in each other (think local variables, non-captured function parameters, etc) so we can get away by only accounting for the outermost reference in such a nesting hierarchy of references (and we get the refcounting bugs when we fail at this arguably premature optimization and miss a decrement or increment for a reference copy). now with this introduction tell me what sense does e.g., refcount_add_not_zero make? any refcount API should never even expose the underlying counter mechanism since refcounts are about references, not the counter underneath. that is, a proper refcount API would look like this: // establishes the initial refcount of 0 (which doesn't mean 'freed object' // but 'no references, yet') // it's needed for object construction if the object may come from non-zero // initialized memory, otherwise ref_get should be used void ref_reset(objtype *ref); // returns a copy of the reference if the refcount can be safely // incremented, NULL otherwise (or crash&burn, etc) objtype *ref_get(objtype *ref); // decrements the refcount and free the object if it reached 0 // dtor could be a field of objtype but that'd cost memory (if there are // more objects than ref_put callers which is the likely case in real life) // and is an attack surface (kref follows this model too) void ref_put(objtype *ref, destruct_t *dtor); conceptually that is all you need to provide for refcount users (note that there's no leakage of the lower level implementation into the API, e.g., it can be lockless based on atomic_t or synchronized explicitly, all that is hidden from the users of the API). any other need/usage is not a refcount and needs its own API (and analysis of potential security problems when misused). now to implement this in C you'll have to break the abstraction in that it'd be an undue burden on the API implementor to provide the API for each type of reference (in C++ you could use templates) so in practice you'd also expose the immediately underlying refcount type (a reference to it) in the API and pray that users will get it right. 
something like this: objtype *ref_get(objtype *ref, refcount_t *count); void ref_put(objtype *ref, refcount_t *count, destruct_t *dtor); where 'count' would be passed as &ref->refcount_field or you could make this API macro based instead and stick to the proper API by mandating the name of the object's refcount field: #define REF_GET(ref) ({ refcount_inc(&ref->refcount) ? ref : NULL;}) #define REF_PUT(ref, dtor) do { if (refcount_dec(&ref->refcount)) dtor(ref); } while(0) where refcount_inc would do the increment and overflow check and return a boolean based on the result and refcount_dec would do the decrement and return a boolean to trigger object destruction (which has to at least free the object's memory of course). > Yes it takes more time and effort, but the end result is better code. most definitely but that road is a lot longer than some of you seem to anticipate ;). > Now as to why refcount cannot be implemented using that scheme you > outlined: > > vCPU0 vCPU1 > > lock inc %[r] > jo > > <vcpu preempt-out> > > for lots > refcount_dec_and_test(&obj->ref) > /* hooray, we hit 0 */ > kfree(obj); > > <vcpu preempt-in> > > mov $0xFFFFFFFF, %[r] /* OOPS use-after-free */ > > > Is this unlikely, yes, extremely so. Do I want to be the one debugging > this, heck no. i can't make any sense of your example. which kind of refcount bug is assumed here? if you have an underflow bug then you can just abuse it at any refcount value, there's no need to win any race at the saturation value. if it's an overflow bug (the use case for REFCOUNT) then i don't understand the refcount_dec_and_test call... note also that in the example's underflow case you have a UAF regardless of the refcount protection logic so your debugging woes don't go away just because the UAF is for the non-refcount fields of the underlying object. as for the security of the current PaX REFCOUNT feature on x86, if you wanted to break it you'd have to win that race above 2G times in a row (saturation is at INT_MAX, not UINT_MAX as your example assumes), that is, all at once in 2G different tasks. good luck finding a machine that can spawn that many tasks and then even better luck winning the race in every single one of them in a row without a single schedule back into any one of them. and if someone's really worried about this case, preemption (or interrupts for an even shorter window) can be simply disabled around this logic which would still result in better code than the proposed refcount_t implementation. cheers, PaX Team
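For concreteness, a sketch of how the macro-based interface proposed above might be used. struct session and the helper names are made up, and the example assumes the boolean-returning refcount_inc()/refcount_dec() described in this mail, which differ from the posted patch.

#include <linux/slab.h>

struct session {
	refcount_t refcount;	/* field name mandated by the macros */
	/* ... */
};

static void session_free(struct session *s)
{
	kfree(s);
}

/* Copy an existing reference; NULL if no reference could be taken. */
static struct session *session_get(struct session *s)
{
	return REF_GET(s);
}

/* Drop our copy; the destructor runs when the last reference goes away. */
static void session_put(struct session *s)
{
	REF_PUT(s, session_free);
}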
Sorry I missed responding to this earlier, my fault. I'm just going to comment on one big part here, as I was the one who created the kref api, and have watched it mutate over the years...

On Thu, Jan 05, 2017 at 10:21:31PM +0100, PaX Team wrote:
> 2. your current refcount_t API is wrong and i think reflects a misunderstanding of what reference counts are about in general. in particular, checking for a 0 refcount value for any purpose other than freeing the underlying object makes no sense and means that the usage is not a refcount or the user has a UAF bug to start with (the check in kref_get is also nonsense for this reason). this is because refcounts protect (account for) references/handles to an object, not the object itself. this also means that such references (think pointers in C) can only be created by *copying* an existing one but they cannot be made up out of thin air.

I agree in principle here, and that's how the original kref api worked. But then the nasty real-world got in the way here and people started using it :)

> in theory, each time a reference is copied, the corresponding refcount must be increased and when the reference goes out of scope, the refcount must be decremented. when the refcount reaches 0 we know that there're no references left so the object can be freed. in practice, we optimize the refcount increments on copies by observing that the lifetimes of certain reference copies nest in each other (think local variables, non-captured function parameters, etc) so we can get away by only accounting for the outermost reference in such a nesting hierarchy of references (and we get the refcounting bugs when we fail at this arguably premature optimization and miss a decrement or increment for a reference copy).

Yes, in theory, I totally agree with you. That _should_ be the only way a refcount works. And that's how kref worked for a few years, until real-world got in the way...

> now with this introduction tell me what sense does e.g., refcount_add_not_zero make? any refcount API should never even expose the underlying counter mechanism since refcounts are about references, not the counter underneath. that is, a proper refcount API would look like this:

I agree, but if you look at the places where that is used, there are good reasons for it. Unfortunately some types of object lifecycles don't fit into the "once it hits zero it needs to be removed" model, as some other type of "object cleanup" model comes into play where things are removed later on. Once an object's reference hits zero it means, for that object, that no _more_ references can be grabbed, because it is "dead" or something like that. I've tried to argue otherwise, but almost every time I ended up agreeing with the usage in the end. That's not to say that there isn't room in the current kernel to fix up a few of these usages (I'm suspicious about the nvme usage, as those people are notorious for doing horrid things to an api just because they can...)

> // establishes the initial refcount of 0 (which doesn't mean 'freed object'
> // but 'no references, yet')
> // it's needed for object construction if the object may come from non-zero
> // initialized memory, otherwise ref_get should be used
> void ref_reset(objtype *ref);

kref_init() in today's kernel. I agree it is needed, but not that you should be able to call kref_get without ever calling it first. It's best to always init things in my opinion.
> // returns a copy of the reference if the refcount can be safely
> // incremented, NULL otherwise (or crash&burn, etc)
> objtype *ref_get(objtype *ref);

kref_get().

> // decrements the refcount and free the object if it reached 0
> // dtor could be a field of objtype but that'd cost memory (if there are
> // more objects than ref_put callers which is the likely case in real life)
> // and is an attack surface (kref follows this model too)
> void ref_put(objtype *ref, destruct_t *dtor);

kref_put(). Or kref_put_mutex(), as it's nice to be able to keep the lock within the api to prevent it being forced to be kept outside of it (which, in your example, is required, which forces us to audit things better.)

> conceptually that is all you need to provide for refcount users (note that there's no leakage of the lower level implementation into the API, e.g., it can be lockless based on atomic_t or synchronized explicitly, all that is hidden from the users of the API).

I agree in theory, but then the real-world got in the way and showed that there are other usages that need more than just this. Which is why the kref api has grown over the years to accommodate them. To wave things away and say "you are doing it all wrong" is dangerous. I know I did just that in my youth, but experience has taught me that sometimes I am wrong, and that apis need to be extended to fit other, valid, usage models.

In short, the wonderful quote, "In theory, theory and practice are the same. In practice, they are not." applies here as well. I've had it happen to me too many times to count over the years :)

Moving to a refcount_t api is good, and a solid thing to do; it allows us to start catching the places where atomic_t is being used incorrectly, fix them up to use a reference count properly, and move on to the next issue at hand. You all can argue about whether the current refcount api is "performant" enough or not, which is great, as it can then be fixed up in one single place and made more "secure" by catching reference count overflows or not.

thanks,

greg k-h
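A minimal example of the kref pattern Greg describes, with a made-up struct widget: the count starts at 1 on creation, kref_get() copies a reference, and kref_put() runs the release callback on the final put.

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct widget {
	struct kref ref;
	/* ... */
};

static void widget_release(struct kref *kref)
{
	struct widget *w = container_of(kref, struct widget, ref);

	kfree(w);
}

static struct widget *widget_create(void)
{
	struct widget *w = kzalloc(sizeof(*w), GFP_KERNEL);

	if (w)
		kref_init(&w->ref);	/* the creator holds the first reference */
	return w;
}

static void widget_get(struct widget *w)
{
	kref_get(&w->ref);		/* copy an existing reference */
}

static void widget_put(struct widget *w)
{
	kref_put(&w->ref, widget_release);
}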
On Thu, Jan 05, 2017 at 10:21:31PM +0100, PaX Team wrote:
> On 3 Jan 2017 at 14:21, Peter Zijlstra wrote:
>
> > Its not something that can be turned on or off, refcount_t is unconditional code. But you raise a good point on the size of the thing.
> >
> > I count 116 bytes on x86_64. If I disable all the debug crud...
>
> ...then the whole thing loses its raison d'etre ;).

It does not, without the WARN() stuff it is still saturating and avoiding use-after-free.

If we fix WARN to not suck on x86 by for example extending the "ud2" bugtable, it would not blow up the textsize so much.

> > This is fundamentally not an atomic operation and therefore does not belong in the atomic_* family, full stop.
>
> i don't know why people keep repeating this myth but what exactly is not atomic in the PaX REFCOUNT feature? hint: it's all atomic ops based, there's no semantic change in there.
>
> what is undesired behaviour (for the security goal, not the underlying atomic_t API!) is that the undo/saturation logic makes the whole dance visible to other CPUs

And therefore it's not atomic. If intermediate state is observable, it's not atomic. That's not a myth, that's the very definition of what atomic is.

> > Not to mention that the whole wrap/unwrap or checked/unchecked split of atomic_t is a massive trainwreck. Moving over to refcount_t, which has simple and well defined semantics forces us to audit and cleanup all the reference counting crud as it gets converted, this is a good thing.
>
> while a noble goal your approach has two problems: one, it doesn't go nearly far enough, and two, what it does do currently is anything but clear (beyond the above discussed implementation problems). let me elaborate on both.
>
> 1. the proper approach should be to separate the low-level implementation machinery from the higher level APIs exposed to users. the current low-level API is the atomic_t type and its accessors (including the 64/long variants) which is directly used by most code and there're few higher level abstractions on top of it. the proper low-level design should first of all separate the underlying type into signed/unsigned types as the conversions between the two make for a clumsy and error prone API to start with (just look at your own refcount_t API, you're abusing signed/unsigned conversions where one direction is implementation defined and allows a trap representation, is your code ready to handle that? not to mention that the C implementations of atomic_t trigger UB when signed overflow occurs).

Thing is, the Linux kernel isn't written in C as per the spec, it's a C dialect or C-like language. By using -fno-strict-overflow and assuming 2s complement there is no UB. (and I strongly believe removing UB from a language is a good thing)

> 2. your current refcount_t API is wrong and i think reflects a misunderstanding of what reference counts are about in general. in particular, checking for a 0 refcount value for any purpose other than freeing the underlying object makes no sense and means that the usage is not a refcount or the user has a UAF bug to start with (the check in kref_get is also nonsense for this reason).

_That_ UAF bug is the exact reason the WARN is there. People write buggy code, warning them wherever possible is a good thing.

> this is because refcounts protect (account for) references/handles to an object, not the object itself.
this also means that such references > (think pointers in C) can only be created by *copying* an existing one but > they cannot be made up out of thin air. Which is exactly why get() on 0 is wrong and _should_ warn. I think we very much agree on what a refcount is. > now with this introduction tell me what sense does e.g., refcount_add_not_zero > make? any refcount API should never even expose the underlying counter > mechanism since refcounts are about references, not the counter underneath. > that is, a proper refcount API would look like this: Without language support (and there are a few proposals for both C and C++) we're stuck with manual refcounts. > // establishes the initial refcount of 0 (which doesn't mean 'freed object' > // but 'no references, yet') > // it's needed for object construction if the object may come from non-zero > // initialized memory, otherwise ref_get should be used > void ref_reset(objtype *ref); Here we violently disagree. I pose that initialization should be !0, after all, the act of creating the objects means you have a reference to it. In Bjarne's owner vs non-owner pointer types proposal, new<>() returns an owner pointer, ie. refcount == 1. The corollary is that then 0 becomes special to mean 'no concurrency' / 'dead'. > // returns a copy of the reference if the refcount can be safely > // incremented, NULL otherwise (or crash&burn, etc) > objtype *ref_get(objtype *ref); Without initializing to !0, I pose there is no way to tell if you can safely increment. The current thing does WARN, Kees would like crash and burn, but that's details. > // decrements the refcount and free the object if it reached 0 > // dtor could be a field of objtype but that'd cost memory (if there are > // more objects than ref_put callers which is the likely case in real life) > // and is an attack surface (kref follows this model too) > void ref_put(objtype *ref, destruct_t *dtor); > > conceptually that is all you need to provide for refcount users (note that > there's no leakage of the lower level implementation into the API, e.g., it > can be lockless based on atomic_t or synchronized explicitly, all that is > hidden from the users of the API). > > any other need/usage is not a refcount and needs its own API (and analysis > of potential security problems when misused). Here again we have to disagree. RCU/lockless lookups create a difference in semantics on get(). Where normally, in the serialized case, get() on 0 is an immediate UAF, the lockless case needs to explicitly deal with that. So we get: get_serialized() -- must never observe 0 and will return the same pointer. get_lockless() -- can observe 0 and we must check the return value. Since most people do not write lockless code, we default to the first for the regular inc/get and provide inc_not_zero() for the latter one. With your proposed interface we'd have to write: my_ptr = ref_get(ptr); WARN_ON(!my_ptr); All over the place in order to provide the same warning to people, because ref_get() could not include the WARN because of the lockless case. Providing this warn for people writing code is I feel important. Its a small and fairly simple check that avoids/catches mistakes as they're made. Then there's the whole dec_and_lock class, which isn't really fundamental, but more a nice optimization where we can avoid taking the lock in the common case. As to the whole add/sub, I would rather not have that either. 
> now to implement this in C > you'll have to break the abstraction in that it'd be an undue burden on > the API implementor to provide the API for each type of reference (in C++ > you could use templates) so in practice you'd also expose the immediately > underlying refcount type (a reference to it) in the API and pray that users > will get it right. something like this: > > objtype *ref_get(objtype *ref, refcount_t *count); > void ref_put(objtype *ref, refcount_t *count, destruct_t *dtor); > > where 'count' would be passed as &ref->refcount_field or you could make > this API macro based instead and stick to the proper API by mandating the > name of the object's refcount field: > > #define REF_GET(ref) ({ refcount_inc(&ref->refcount) ? ref : NULL;}) > #define REF_PUT(ref, dtor) do { if (refcount_dec(&ref->refcount)) dtor(ref); } while(0) Since, as you say 'pray that users will get it right' I'm not convinced of this particular form over any other. But its trivial to provide a wrapper like that if people feel strongly about it. > > Yes it takes more time and effort, but the end result is better code. > > most definitely but that road is a lot longer than some of you seem to > anticipate ;). Oh, I've been at this long enough to know that :-) > i can't make any sense of your example. Yeah, got my brain in a twist, it didn't make sense.
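A sketch of the two acquisition patterns Peter distinguishes above, with made-up struct conn and list names: serialized code that already holds a reference can use refcount_inc() and rely on the WARN, while an RCU/lockless lookup must use refcount_inc_not_zero() and cope with failure, because it can race with the final put.

#include <linux/rculist.h>
#include <linux/refcount.h>

struct conn {
	refcount_t ref;
	struct list_head node;
	/* ... */
};

/* Caller already holds a reference, e.g. obtained under a lock. */
static void conn_get(struct conn *c)
{
	refcount_inc(&c->ref);	/* hitting 0 here would be a bug: WARN */
}

/* RCU lookup: the object may concurrently be dropping to 0. */
static struct conn *conn_lookup(struct list_head *head)
{
	struct conn *c;

	rcu_read_lock();
	list_for_each_entry_rcu(c, head, node) {
		if (refcount_inc_not_zero(&c->ref)) {
			rcu_read_unlock();
			return c;	/* reference successfully taken */
		}
	}
	rcu_read_unlock();
	return NULL;
}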
diff --git a/include/linux/kref.h b/include/linux/kref.h index a5181bd..78f840a 100644 --- a/include/linux/kref.h +++ b/include/linux/kref.h @@ -19,23 +19,26 @@ #include <linux/atomic.h> #include <linux/kernel.h> #include <linux/mutex.h> +#include <linux/refcount.h> struct kref { - atomic_t refcount; + refcount_t refcount; }; +#define KREF_INIT(n) { .refcount = REFCOUNT_INIT(n), } + /** * kref_init - initialize object. * @kref: object in question. */ static inline void kref_init(struct kref *kref) { - atomic_set(&kref->refcount, 1); + refcount_set(&kref->refcount, 1); } static inline int kref_read(const struct kref *kref) { - return atomic_read(&kref->refcount); + return refcount_read(&kref->refcount); } /** @@ -44,11 +47,7 @@ static inline int kref_read(const struct kref *kref) */ static inline void kref_get(struct kref *kref) { - /* If refcount was 0 before incrementing then we have a race - * condition when this kref is freeing by some other thread right now. - * In this case one should use kref_get_unless_zero() - */ - WARN_ON_ONCE(atomic_inc_return(&kref->refcount) < 2); + refcount_inc(&kref->refcount); } /** @@ -72,7 +71,7 @@ static inline int kref_put(struct kref *kref, void (*release)(struct kref *kref) { WARN_ON(release == NULL); - if (atomic_dec_and_test(&kref->refcount)) { + if (refcount_dec_and_test(&kref->refcount)) { release(kref); return 1; } @@ -85,7 +84,7 @@ static inline int kref_put_mutex(struct kref *kref, { WARN_ON(release == NULL); - if (atomic_dec_and_mutex_lock(&kref->refcount, lock)) { + if (refcount_dec_and_mutex_lock(&kref->refcount, lock)) { release(kref); return 1; } @@ -98,7 +97,7 @@ static inline int kref_put_lock(struct kref *kref, { WARN_ON(release == NULL); - if (atomic_dec_and_lock(&kref->refcount, lock)) { + if (refcount_dec_and_lock(&kref->refcount, lock)) { release(kref); return 1; } @@ -123,6 +122,6 @@ static inline int kref_put_lock(struct kref *kref, */ static inline int __must_check kref_get_unless_zero(struct kref *kref) { - return atomic_add_unless(&kref->refcount, 1, 0); + return refcount_inc_not_zero(&kref->refcount); } #endif /* _KREF_H_ */ diff --git a/include/linux/refcount.h b/include/linux/refcount.h new file mode 100644 index 0000000..fc5abdb --- /dev/null +++ b/include/linux/refcount.h @@ -0,0 +1,262 @@ +#ifndef _LINUX_REFCOUNT_H +#define _LINUX_REFCOUNT_H + +/* + * Variant of atomic_t specialized for reference counts. + * + * The interface matches the atomic_t interface (to aid in porting) but only + * provides the few functions one should use for reference counting. + * + * It differs in that the counter saturates at UINT_MAX and will not move once + * there. This avoids wrapping the counter and causing 'spurious' + * use-after-free issues. + * + * Memory ordering rules are slightly relaxed wrt regular atomic_t functions + * and provide only what is strictly required for refcounts. + * + * The increments are fully relaxed; these will not provide ordering. The + * rationale is that whatever is used to obtain the object we're increasing the + * reference count on will provide the ordering. For locked data structures, + * its the lock acquire, for RCU/lockless data structures its the dependent + * load. + * + * Do note that inc_not_zero() provides a control dependency which will order + * future stores against the inc, this ensures we'll never modify the object + * if we did not in fact acquire a reference. 
+ * + * The decrements will provide release order, such that all the prior loads and + * stores will be issued before, it also provides a control dependency, which + * will order us against the subsequent free(). + * + * The control dependency is against the load of the cmpxchg (ll/sc) that + * succeeded. This means the stores aren't fully ordered, but this is fine + * because the 1->0 transition indicates no concurrency. + * + * Note that the allocator is responsible for ordering things between free() + * and alloc(). + * + */ + +#include <linux/atomic.h> +#include <linux/bug.h> +#include <linux/mutex.h> +#include <linux/spinlock.h> + +typedef struct refcount_struct { + atomic_t refs; +} refcount_t; + +#define REFCOUNT_INIT(n) { .refs = ATOMIC_INIT(n), } + +static inline void refcount_set(refcount_t *r, unsigned int n) +{ + atomic_set(&r->refs, n); +} + +static inline unsigned int refcount_read(const refcount_t *r) +{ + return atomic_read(&r->refs); +} + +static inline __must_check +bool refcount_add_not_zero(unsigned int i, refcount_t *r) +{ + unsigned int old, new, val = atomic_read(&r->refs); + + for (;;) { + if (!val) + return false; + + if (unlikely(val == UINT_MAX)) + return true; + + new = val + i; + if (new < val) + new = UINT_MAX; + old = atomic_cmpxchg_relaxed(&r->refs, val, new); + if (old == val) + break; + + val = old; + } + + WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n"); + + return true; +} + +/* + * Similar to atomic_inc_not_zero(), will saturate at UINT_MAX and WARN. + * + * Provides no memory ordering, it is assumed the caller has guaranteed the + * object memory to be stable (RCU, etc.). It does provide a control dependency + * and thereby orders future stores. See the comment on top. + */ +static inline __must_check +bool refcount_inc_not_zero(refcount_t *r) +{ + return refcount_add_not_zero(1, r); +} + +/* + * Similar to atomic_inc(), will saturate at UINT_MAX and WARN. + * + * Provides no memory ordering, it is assumed the caller already has a + * reference on the object, will WARN when this is not so. + */ +static inline void refcount_inc(refcount_t *r) +{ + WARN(!refcount_inc_not_zero(r), "refcount_t: increment on 0; use-after-free.\n"); +} + +static inline void refcount_add(unsigned int i, refcount_t *r) +{ + WARN(!refcount_add_not_zero(i, r), "refcount_t: addition on 0; use-after-free.\n"); +} + +/* + * Similar to atomic_dec_and_test(), it will WARN on underflow and fail to + * decrement when saturated at UINT_MAX. + * + * Provides release memory ordering, such that prior loads and stores are done + * before, and provides a control dependency such that free() must come after. + * See the comment on top. + */ +static inline __must_check +bool refcount_sub_and_test(unsigned int i, refcount_t *r) +{ + unsigned int old, new, val = atomic_read(&r->refs); + + for (;;) { + if (val == UINT_MAX) + return false; + + new = val - i; + if (WARN(new > val, "refcount_t: underflow; use-after-free.\n")) + return false; + + old = atomic_cmpxchg_release(&r->refs, val, new); + if (old == val) + break; + + val = old; + } + + return !new; +} + +static inline __must_check +bool refcount_dec_and_test(refcount_t *r) +{ + return refcount_sub_and_test(1, r); +} + +/* + * Similar to atomic_dec(), it will WARN on underflow and fail to decrement + * when saturated at UINT_MAX. + * + * Provides release memory ordering, such that prior loads and stores are done + * before. 
+ */ +static inline +void refcount_dec(refcount_t *r) +{ + WARN(refcount_dec_and_test(r), "refcount_t: decrement hit 0; leaking memory.\n"); +} + +/* + * No atomic_t counterpart, it attempts a 1 -> 0 transition and returns the + * success thereof. + * + * Like all decrement operations, it provides release memory order and provides + * a control dependency. + * + * It can be used like a try-delete operator; this explicit case is provided + * and not cmpxchg in generic, because that would allow implementing unsafe + * operations. + */ +static inline __must_check +bool refcount_dec_if_one(refcount_t *r) +{ + return atomic_cmpxchg_release(&r->refs, 1, 0) == 1; +} + +/* + * No atomic_t counterpart, it decrements unless the value is 1, in which case + * it will return false. + * + * Was often done like: atomic_add_unless(&var, -1, 1) + */ +static inline __must_check +bool refcount_dec_not_one(refcount_t *r) +{ + unsigned int old, new, val = atomic_read(&r->refs); + + for (;;) { + if (val == UINT_MAX) + return true; + + if (val == 1) + return false; + + new = val - 1; + if (WARN(new > val, "refcount_t: underflow; use-after-free.\n")) + return true; + + old = atomic_cmpxchg_release(&r->refs, val, new); + if (old == val) + break; + + val = old; + } + + return true; +} + +/* + * Similar to atomic_dec_and_mutex_lock(), it will WARN on underflow and fail + * to decrement when saturated at UINT_MAX. + * + * Provides release memory ordering, such that prior loads and stores are done + * before, and provides a control dependency such that free() must come after. + * See the comment on top. + */ +static inline __must_check +bool refcount_dec_and_mutex_lock(refcount_t *r, struct mutex *lock) +{ + if (refcount_dec_not_one(r)) + return false; + + mutex_lock(lock); + if (!refcount_dec_and_test(r)) { + mutex_unlock(lock); + return false; + } + + return true; +} + +/* + * Similar to atomic_dec_and_lock(), it will WARN on underflow and fail to + * decrement when saturated at UINT_MAX. + * + * Provides release memory ordering, such that prior loads and stores are done + * before, and provides a control dependency such that free() must come after. + * See the comment on top. + */ +static inline __must_check +bool refcount_dec_and_lock(refcount_t *r, spinlock_t *lock) +{ + if (refcount_dec_not_one(r)) + return false; + + spin_lock(lock); + if (!refcount_dec_and_test(r)) { + spin_unlock(lock); + return false; + } + + return true; +} + +#endif /* _LINUX_REFCOUNT_H */
It provides saturation semantics such that overflow becomes impossible and thereby 'spurious' use-after-free is avoided. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- include/linux/kref.h | 23 ++--- include/linux/refcount.h | 262 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 273 insertions(+), 12 deletions(-) create mode 100644 include/linux/refcount.h
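To illustrate what a conversion to the posted API looks like in practice, a hedged before/after sketch with a made-up struct foo: only the counter type and the calls change, while the saturation behaviour and the WARNs come from refcount_t itself.

#include <linux/refcount.h>
#include <linux/slab.h>

struct foo {
	refcount_t users;	/* was: atomic_t users */
	/* ... */
};

static void foo_get(struct foo *f)
{
	refcount_inc(&f->users);		/* was: atomic_inc() */
}

static void foo_put(struct foo *f)
{
	if (refcount_dec_and_test(&f->users))	/* was: atomic_dec_and_test() */
		kfree(f);
}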