| Message ID | 1490811363-93944-5-git-send-email-keescook@chromium.org (mailing list archive) |
|---|---|
| State | New, archived |
On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
> allows HAVE_ARCH_RARE_WRITE to work on x86.
>
> +
> +static __always_inline unsigned long __arch_rare_write_begin(void)
> +{
> +    unsigned long cr0;
> +
> +    preempt_disable();

This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work, as
would local_irq_disable(). There's no way that just disabling preemption
is enough.

(Also, how does this interact with perf nmis?)

--Andy
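For illustration only (this code is not from the thread), the two options Andy mentions could be folded into the arch helper roughly as sketched below, keeping the rest of the helper as posted in the patch; it assumes <linux/debug_locks.h> and <linux/irqflags.h> are available in this context:

    static __always_inline unsigned long __arch_rare_write_begin(void)
    {
        unsigned long cr0;

        /*
         * Option 1: require callers to disable interrupts themselves and
         * complain (with lock debugging enabled) if they did not.
         */
        DEBUG_LOCKS_WARN_ON(!irqs_disabled());

        /* Option 2 (alternative): disable interrupts right here. */
        /* local_irq_disable(); */

        cr0 = read_cr0() ^ X86_CR0_WP;
        BUG_ON(cr0 & X86_CR0_WP);
        write_cr0(cr0);
        return cr0 ^ X86_CR0_WP;
    }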
On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote: >> Based on PaX's x86 pax_{open,close}_kernel() implementation, this >> allows HAVE_ARCH_RARE_WRITE to work on x86. >> > >> + >> +static __always_inline unsigned long __arch_rare_write_begin(void) >> +{ >> + unsigned long cr0; >> + >> + preempt_disable(); > > This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work, > as would local_irq_disable(). There's no way that just disabling > preemption is enough. > > (Also, how does this interact with perf nmis?) Do you mean preempt_disable() isn't strong enough here? I'm open to suggestions. The goal would be to make sure nothing between _begin and _end would get executed without interruption... -Kees
On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote: > On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote: >>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this >>> allows HAVE_ARCH_RARE_WRITE to work on x86. >>> >> >>> + >>> +static __always_inline unsigned long __arch_rare_write_begin(void) >>> +{ >>> + unsigned long cr0; >>> + >>> + preempt_disable(); >> >> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work, >> as would local_irq_disable(). There's no way that just disabling >> preemption is enough. >> >> (Also, how does this interact with perf nmis?) > > Do you mean preempt_disable() isn't strong enough here? I'm open to > suggestions. The goal would be to make sure nothing between _begin and > _end would get executed without interruption... > Sorry for the very slow response. preempt_disable() isn't strong enough to prevent interrupts, and an interrupt here would run with WP off, causing unknown havoc. I tend to think that the caller should be responsible for turning off interrupts. --Andy
On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>
>>>
>>>> +
>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>> +{
>>>> +    unsigned long cr0;
>>>> +
>>>> +    preempt_disable();
>>>
>>> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>> as would local_irq_disable(). There's no way that just disabling
>>> preemption is enough.
>>>
>>> (Also, how does this interact with perf nmis?)
>>
>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>> suggestions. The goal would be to make sure nothing between _begin and
>> _end would get executed without interruption...
>>
>
> Sorry for the very slow response.
>
> preempt_disable() isn't strong enough to prevent interrupts, and an
> interrupt here would run with WP off, causing unknown havoc. I tend
> to think that the caller should be responsible for turning off
> interrupts.

So, something like:

Top-level functions:

    static __always_inline rare_write_begin(void)
    {
        preempt_disable();
        local_irq_disable();
        barrier();
        __arch_rare_write_begin();
        barrier();
    }

    static __always_inline rare_write_end(void)
    {
        barrier();
        __arch_rare_write_end();
        barrier();
        local_irq_enable();
        preempt_enable_no_resched();
    }

x86-specific helpers:

    static __always_inline unsigned long __arch_rare_write_begin(void)
    {
        unsigned long cr0;

        cr0 = read_cr0() ^ X86_CR0_WP;
        BUG_ON(cr0 & X86_CR0_WP);
        write_cr0(cr0);
        return cr0 ^ X86_CR0_WP;
    }

    static __always_inline unsigned long __arch_rare_write_end(void)
    {
        unsigned long cr0;

        cr0 = read_cr0() ^ X86_CR0_WP;
        BUG_ON(!(cr0 & X86_CR0_WP));
        write_cr0(cr0);
        return cr0 ^ X86_CR0_WP;
    }

I can give it a spin...

-Kees
On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote: > On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote: >> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote: >>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote: >>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote: >>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this >>>>> allows HAVE_ARCH_RARE_WRITE to work on x86. >>>>> >>>> >>>>> + >>>>> +static __always_inline unsigned long __arch_rare_write_begin(void) >>>>> +{ >>>>> + unsigned long cr0; >>>>> + >>>>> + preempt_disable(); >>>> >>>> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work, >>>> as would local_irq_disable(). There's no way that just disabling >>>> preemption is enough. >>>> >>>> (Also, how does this interact with perf nmis?) >>> >>> Do you mean preempt_disable() isn't strong enough here? I'm open to >>> suggestions. The goal would be to make sure nothing between _begin and >>> _end would get executed without interruption... >>> >> >> Sorry for the very slow response. >> >> preempt_disable() isn't strong enough to prevent interrupts, and an >> interrupt here would run with WP off, causing unknown havoc. I tend >> to think that the caller should be responsible for turning off >> interrupts. > > So, something like: > > Top-level functions: > > static __always_inline rare_write_begin(void) > { > preempt_disable(); > local_irq_disable(); > barrier(); > __arch_rare_write_begin(); > barrier(); > } Looks good, except you don't need preempt_disable(). local_irq_disable() also disables preemption. You might need to use local_irq_save(), though, depending on whether any callers already have IRQs off. --Andy
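As a rough illustration of that refinement (again, not code from the thread), a local_irq_save()-based variant of the wrappers above might look like the sketch below, with the saved flags handed back to the caller; the __arch_rare_write_{begin,end}() helpers are the x86 CR0.WP ones quoted earlier, and <linux/irqflags.h> is assumed:

    static __always_inline unsigned long rare_write_begin(void)
    {
        unsigned long flags;

        /*
         * Disables interrupts (and therefore preemption) and saves the
         * previous state, so this also works if IRQs were already off.
         */
        local_irq_save(flags);
        barrier();
        __arch_rare_write_begin();
        barrier();
        return flags;
    }

    static __always_inline void rare_write_end(unsigned long flags)
    {
        barrier();
        __arch_rare_write_end();
        barrier();
        /* Restore, rather than unconditionally re-enable, interrupts. */
        local_irq_restore(flags);
    }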
On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote: > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote: >> On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote: >>> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote: >>>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote: >>>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote: >>>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this >>>>>> allows HAVE_ARCH_RARE_WRITE to work on x86. >>>>>> >>>>>> + >>>>>> +static __always_inline unsigned long __arch_rare_write_begin(void) >>>>>> +{ >>>>>> + unsigned long cr0; >>>>>> + >>>>>> + preempt_disable(); >>>>> >>>>> This looks wrong. DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work, >>>>> as would local_irq_disable(). There's no way that just disabling >>>>> preemption is enough. >>>>> >>>>> (Also, how does this interact with perf nmis?) >>>> >>>> Do you mean preempt_disable() isn't strong enough here? I'm open to >>>> suggestions. The goal would be to make sure nothing between _begin and >>>> _end would get executed without interruption... >>>> >>> >>> Sorry for the very slow response. >>> >>> preempt_disable() isn't strong enough to prevent interrupts, and an >>> interrupt here would run with WP off, causing unknown havoc. I tend >>> to think that the caller should be responsible for turning off >>> interrupts. >> >> So, something like: >> >> Top-level functions: >> >> static __always_inline rare_write_begin(void) >> { >> preempt_disable(); >> local_irq_disable(); >> barrier(); >> __arch_rare_write_begin(); >> barrier(); >> } > > Looks good, except you don't need preempt_disable(). > local_irq_disable() also disables preemption. You might need to use > local_irq_save(), though, depending on whether any callers already > have IRQs off. Well, doesn't look good to me. NMIs will still be able to interrupt this code and will run with CR0.WP = 0. Shouldn't you instead question yourself why PaX can do it "just" with preempt_disable() instead?! Cheers, Mathias
On Wed, Mar 29, 2017 at 11:15:56AM -0700, Kees Cook wrote:
> +static __always_inline unsigned long __arch_rare_write_end(void)
> +{
> +    unsigned long cr0;
> +
> +    barrier();
> +    cr0 = read_cr0() ^ X86_CR0_WP;
> +    BUG_ON(!(cr0 & X86_CR0_WP));
> +    write_cr0(cr0);
> +    barrier();
> +    preempt_enable_no_resched();

NAK

> +    return cr0 ^ X86_CR0_WP;
> +}
On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
> >> static __always_inline rare_write_begin(void)
> >> {
> >>     preempt_disable();
> >>     local_irq_disable();
> >>     barrier();
> >>     __arch_rare_write_begin();
> >>     barrier();
> >> }
> >
> > Looks good, except you don't need preempt_disable().
> > local_irq_disable() also disables preemption. You might need to use
> > local_irq_save(), though, depending on whether any callers already
> > have IRQs off.
>
> Well, doesn't look good to me. NMIs will still be able to interrupt
> this code and will run with CR0.WP = 0.
>
> Shouldn't you instead question yourself why PaX can do it "just" with
> preempt_disable() instead?!

That's silly. Just because PaX does it, doesn't mean it's correct. To be
honest, playing games with the CR0.WP bit is outright stupid to begin
with.

Whether protected by preempt_disable or local_irq_disable, to make that
work it needs CR0 handling in the exception entry/exit at the lowest
level. And that's just a nightmare maintenance wise as it's prone to be
broken over time.

Aside of that it's pointless overhead for the normal case.

The proper solution is:

    write_rare(ptr, val)
    {
        mp = map_shadow_rw(ptr);
        *mp = val;
        unmap_shadow_rw(mp);
    }

map_shadow_rw() is essentially the same thing as we do in the highmem
case where the kernel creates a shadow mapping of the user space pages
via kmap_atomic().

It's valid (at least on x86) to have a shadow map with the same page
attributes but write enabled. That does not require any fixups of CR0
and just works.

Thanks,

	tglx
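To make that idea concrete, here is a minimal, hedged sketch (not from the thread) that fills in tglx's hypothetical map_shadow_rw()/unmap_shadow_rw() with vmap() purely for illustration. A real implementation would, as described, use a CPU-local kmap_atomic-style slot rather than the global vmalloc area; this version also assumes the target address is covered by virt_to_page(), that the write does not cross a page boundary, and that the caller may sleep:

    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /* Create a temporary writable alias of the page backing 'ptr'. */
    static void *map_shadow_rw(void *ptr)
    {
        struct page *page = virt_to_page(ptr);
        void *alias = vmap(&page, 1, VM_MAP, PAGE_KERNEL);

        return alias ? (char *)alias + offset_in_page(ptr) : NULL;
    }

    static void unmap_shadow_rw(void *mp)
    {
        vunmap((void *)((unsigned long)mp & PAGE_MASK));
    }

    static void write_rare_ulong(unsigned long *ptr, unsigned long val)
    {
        unsigned long *mp = map_shadow_rw(ptr);

        BUG_ON(!mp);
        *mp = val;               /* write through the RW alias */
        unmap_shadow_rw(mp);     /* tear the alias down again */
    }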
On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote: > On Fri, 7 Apr 2017, Mathias Krause wrote: >> On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote: >> > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote: >> >> static __always_inline rare_write_begin(void) >> >> { >> >> preempt_disable(); >> >> local_irq_disable(); >> >> barrier(); >> >> __arch_rare_write_begin(); >> >> barrier(); >> >> } >> > >> > Looks good, except you don't need preempt_disable(). >> > local_irq_disable() also disables preemption. You might need to use >> > local_irq_save(), though, depending on whether any callers already >> > have IRQs off. >> >> Well, doesn't look good to me. NMIs will still be able to interrupt >> this code and will run with CR0.WP = 0. >> >> Shouldn't you instead question yourself why PaX can do it "just" with >> preempt_disable() instead?! > > That's silly. Just because PaX does it, doesn't mean it's correct. To be > honest, playing games with the CR0.WP bit is outright stupid to begin with. Why that? It allows fast and CPU local modifications of r/o memory. OTOH, an approach that needs to fiddle with page table entries requires global synchronization to keep the individual TLB states in sync. Hmm.. Not that fast, I'd say. > Whether protected by preempt_disable or local_irq_disable, to make that > work it needs CR0 handling in the exception entry/exit at the lowest > level. And that's just a nightmare maintainence wise as it's prone to be > broken over time. It seems to be working fine for more than a decade now in PaX. So it can't be such a big maintenance nightmare ;) > Aside of that it's pointless overhead for the normal case. > > The proper solution is: > > write_rare(ptr, val) > { > mp = map_shadow_rw(ptr); > *mp = val; > unmap_shadow_rw(mp); > } > > map_shadow_rw() is essentially the same thing as we do in the highmem case > where the kernel creates a shadow mapping of the user space pages via > kmap_atomic(). The "proper solution" seems to be much slower compared to just toggling CR0.WP (which is costly in itself, already) because of the TLB invalidation / synchronisation involved. > It's valid (at least on x86) to have a shadow map with the same page > attributes but write enabled. That does not require any fixups of CR0 and > just works. "Just works", sure -- but it's not as tightly focused as the PaX solution which is CPU local, while your proposed solution is globally visible. Cheers, Mathias
On Fri, 7 Apr 2017, Mathias Krause wrote: > On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote: > > Whether protected by preempt_disable or local_irq_disable, to make that > > work it needs CR0 handling in the exception entry/exit at the lowest > > level. And that's just a nightmare maintainence wise as it's prone to be > > broken over time. > > It seems to be working fine for more than a decade now in PaX. So it > can't be such a big maintenance nightmare ;) I really do not care whether PaX wants to chase and verify that over and over. I certainly don't want to take the chance to leak CR0.WP ever and I very much care about extra stuff to check in the entry/exit path. > The "proper solution" seems to be much slower compared to just > toggling CR0.WP (which is costly in itself, already) because of the > TLB invalidation / synchronisation involved. Why the heck should we care about rare writes being performant? > > It's valid (at least on x86) to have a shadow map with the same page > > attributes but write enabled. That does not require any fixups of CR0 and > > just works. > > "Just works", sure -- but it's not as tightly focused as the PaX > solution which is CPU local, while your proposed solution is globally > visible. Making the world and some more writeable hardly qualifies as tightly focussed. Making the mapping concept CPU local is not rocket science either. The question is whether it's worth the trouble. Thanks, tglx
On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote: > On Fri, 7 Apr 2017, Mathias Krause wrote: >> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote: >> > Whether protected by preempt_disable or local_irq_disable, to make that >> > work it needs CR0 handling in the exception entry/exit at the lowest >> > level. And that's just a nightmare maintainence wise as it's prone to be >> > broken over time. >> >> It seems to be working fine for more than a decade now in PaX. So it >> can't be such a big maintenance nightmare ;) > > I really do not care whether PaX wants to chase and verify that over and > over. I certainly don't want to take the chance to leak CR0.WP ever and I > very much care about extra stuff to check in the entry/exit path. Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) somewhere sensible should make those "leaks" visible fast -- and their exploitation impossible, i.e. fail hard. >> The "proper solution" seems to be much slower compared to just >> toggling CR0.WP (which is costly in itself, already) because of the >> TLB invalidation / synchronisation involved. > > Why the heck should we care about rare writes being performant? As soon as they stop being rare and people start extending the r/o protection to critical data structures accessed often. Then performance matters. >> > It's valid (at least on x86) to have a shadow map with the same page >> > attributes but write enabled. That does not require any fixups of CR0 and >> > just works. >> >> "Just works", sure -- but it's not as tightly focused as the PaX >> solution which is CPU local, while your proposed solution is globally >> visible. > > Making the world and some more writeable hardly qualifies as tightly > focussed. Making the mapping concept CPU local is not rocket science > either. The question is whethers it's worth the trouble. No, the question is if the value of the concept is well understood and if people can see what could be done with such a strong primitive. Apparently not... Cheers, Mathias
On Fri, Apr 07, 2017 at 12:51:15PM +0200, Mathias Krause wrote: > Why that? It allows fast and CPU local modifications of r/o memory. > OTOH, an approach that needs to fiddle with page table entries > requires global synchronization to keep the individual TLB states in > sync. Hmm.. Not that fast, I'd say. The fixmaps used for kmap_atomic are per-cpu, no global sync required.
On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote: > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote: >> On Fri, 7 Apr 2017, Mathias Krause wrote: >>> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote: >>> > Whether protected by preempt_disable or local_irq_disable, to make that >>> > work it needs CR0 handling in the exception entry/exit at the lowest >>> > level. And that's just a nightmare maintainence wise as it's prone to be >>> > broken over time. >>> >>> It seems to be working fine for more than a decade now in PaX. So it >>> can't be such a big maintenance nightmare ;) >> >> I really do not care whether PaX wants to chase and verify that over and >> over. I certainly don't want to take the chance to leak CR0.WP ever and I >> very much care about extra stuff to check in the entry/exit path. > > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) > somewhere sensible should make those "leaks" visible fast -- and their > exploitation impossible, i.e. fail hard. The leaks surely exist and now we'll just add an exploitable BUG. >>> > It's valid (at least on x86) to have a shadow map with the same page >>> > attributes but write enabled. That does not require any fixups of CR0 and >>> > just works. >>> >>> "Just works", sure -- but it's not as tightly focused as the PaX >>> solution which is CPU local, while your proposed solution is globally >>> visible. >> >> Making the world and some more writeable hardly qualifies as tightly >> focussed. Making the mapping concept CPU local is not rocket science >> either. The question is whethers it's worth the trouble. > > No, the question is if the value of the concept is well understood and > if people can see what could be done with such a strong primitive. > Apparently not... I think we're approaching this all wrong, actually. The fact that x86 has this CR0.WP thing is arguably a historical accident, and the fact that PaX uses it doesn't mean that PaX is doing it the best way for upstream Linux. Why don't we start at the other end and do a generic non-arch-specific implementation: set up an mm_struct that contains an RW alias of the relevant parts of rodata and use use_mm to access it. (That is, get_fs() to back up the old fs, set_fs(USER_DS), use_mm(&rare_write_mm), do the write using copy_to_user, undo everything.) Then someone who cares about performance can benchmark the CR0.WP approach against it and try to argue that it's a good idea. This benchmark should wait until I'm done with my PCID work, because PCID is going to make use_mm() a whole heck of a lot faster. --Andy
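A hedged sketch of that flow is below; every name in it is an assumption introduced for illustration, the boot-time setup of rare_write_mm and the alias offset is omitted, and, as noted later in the thread, use_mm() is documented for kernel-thread context, so this is only an outline of the idea rather than a working implementation:

    #include <linux/mmu_context.h>
    #include <linux/uaccess.h>

    /* Assumed to be created at boot: an mm with a single RW alias VMA. */
    static struct mm_struct *rare_write_mm;
    /* Assumed delta between the alias address and the real kernel address. */
    static unsigned long rare_write_offset;

    static int rare_write(void *dst, const void *src, size_t len)
    {
        void __user *alias = (void __user *)((unsigned long)dst + rare_write_offset);
        mm_segment_t old_fs = get_fs();
        int ret;

        set_fs(USER_DS);
        use_mm(rare_write_mm);

        /* Writes land in the RW alias; anything outside its VMA faults. */
        ret = copy_to_user(alias, src, len) ? -EFAULT : 0;

        unuse_mm(rare_write_mm);
        set_fs(old_fs);
        return ret;
    }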
On Fri, Apr 07, 2017 at 09:14:29AM -0700, Andy Lutomirski wrote: > I think we're approaching this all wrong, actually. The fact that x86 > has this CR0.WP thing is arguably a historical accident, and the fact > that PaX uses it doesn't mean that PaX is doing it the best way for > upstream Linux. > > Why don't we start at the other end and do a generic non-arch-specific > implementation: set up an mm_struct that contains an RW alias of the > relevant parts of rodata and use use_mm to access it. (That is, > get_fs() to back up the old fs, set_fs(USER_DS), > use_mm(&rare_write_mm), do the write using copy_to_user, undo > everything.) FWIW, I completely agree with this approach. That's largely the approach arm64 would have to take regardless (as per Hoeun's patches), and having a consistent common implementation would be desirable. There are a couple of other complications to handle (e.g. a perf interrupt coming in and trying to read from the mapping), but I think we can handle that generically. Thanks, Mark.
On Fri, 7 Apr 2017, Mathias Krause wrote: > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Fri, 7 Apr 2017, Mathias Krause wrote: > >> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote: > >> > Whether protected by preempt_disable or local_irq_disable, to make that > >> > work it needs CR0 handling in the exception entry/exit at the lowest > >> > level. And that's just a nightmare maintainence wise as it's prone to be > >> > broken over time. > >> > >> It seems to be working fine for more than a decade now in PaX. So it > >> can't be such a big maintenance nightmare ;) > > > > I really do not care whether PaX wants to chase and verify that over and > > over. I certainly don't want to take the chance to leak CR0.WP ever and I > > very much care about extra stuff to check in the entry/exit path. > > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) > somewhere sensible should make those "leaks" visible fast -- and their > exploitation impossible, i.e. fail hard. Sure, you trade leaking WP for a potentially exploitable BUG(). > >> The "proper solution" seems to be much slower compared to just > >> toggling CR0.WP (which is costly in itself, already) because of the > >> TLB invalidation / synchronisation involved. > > > > Why the heck should we care about rare writes being performant? > > As soon as they stop being rare and people start extending the r/o > protection to critical data structures accessed often. Then > performance matters. Emphasis on "Then". I'm not seeing it, because no matter what you do it's going to be slow. Aside of that increasing the usage will also increase the chance to leak stuff. In that case I'd rather leak a single page mapping temporarily than taking the chance to leak WP. > >> > It's valid (at least on x86) to have a shadow map with the same page > >> > attributes but write enabled. That does not require any fixups of CR0 and > >> > just works. > >> > >> "Just works", sure -- but it's not as tightly focused as the PaX > >> solution which is CPU local, while your proposed solution is globally > >> visible. > > > > Making the world and some more writeable hardly qualifies as tightly > > focussed. Making the mapping concept CPU local is not rocket science > > either. The question is whethers it's worth the trouble. > > No, the question is if the value of the concept is well understood and > if people can see what could be done with such a strong primitive. > Apparently not... Oh, well. We can stop that discussion right here, if all you can provide is a killer phrase. I'm well aware what can be done with a strong primitive and I certainly understand the concept, but I'm not naive enough to believe that lifting one of the strong protections the kernel has by globally disabling WP is anything which should be even considered. That bit is a horrible misconception and should be fused to 1. Aside of that, if you had taken the time to figure out how kmap_atomic stuff works then you would have noticed that it does not require cross CPU pagetable syncs and that the mapping place can be randomized to a certain degree. So this has neither global impact, nor does it become immediately globally visible. Thanks, tglx
On 7 Apr 2017 at 11:46, Thomas Gleixner wrote: > On Fri, 7 Apr 2017, Mathias Krause wrote: > > Well, doesn't look good to me. NMIs will still be able to interrupt > > this code and will run with CR0.WP = 0. > > > > Shouldn't you instead question yourself why PaX can do it "just" with > > preempt_disable() instead?! > > That's silly. Just because PaX does it, doesn't mean it's correct. is that FUD or do you have actionable information to share? > To be honest, playing games with the CR0.WP bit is outright stupid to begin with. why is that? cr0.wp exists since the i486 and its behaviour fits my purposes quite well, it's the best security/performance i know of. > Whether protected by preempt_disable or local_irq_disable, to make that > work it needs CR0 handling in the exception entry/exit at the lowest > level. correct. > And that's just a nightmare maintainence wise as it's prone to be > broken over time. i've got 14 years of experience of maintaining it and i never saw it break. > Aside of that it's pointless overhead for the normal case. unless it's optional code as the whole feature already is. > The proper solution is: > > write_rare(ptr, val) > { > mp = map_shadow_rw(ptr); > *mp = val; > unmap_shadow_rw(mp); > } this is not *the* proper solution, but only a naive one that suffers from the exact same need that the cr0.wp approach does and has worse performance impact. not exactly a win... [continuing from your next mail in order to save round-trip time] > I really do not care whether PaX wants to chase and verify that over and > over. verifying it is no different than verifying, say, swapgs use. > I certainly don't want to take the chance to leak CR0.WP ever why and where would cr0.wp leak? > and I very much care about extra stuff to check in the entry/exit path. your 'proper' solution needs (a lot more) extra stuff too. > Why the heck should we care about rare writes being performant? because you've been misled by the NIH crowd here that the PaX feature they tried to (badly) extract from has anything to do with frequency of writes. it does not. what it does do is provide an environment for variables that are conceptually writable but for security reasons should be read-only most of the time by most of the code (ditto for the grossly misunderstood and thus misnamed ro-after-shit). now imagine locking down the page table hierarchy with it... > Making the world and some more writeable hardly qualifies as tightly > focused. you forgot to add 'for a window of a few insns' and that the map/unmap approach does the same under an attacker controlled ptr. > Making the mapping concept CPU local is not rocket science > either. The question is whether it's worth the trouble. it is for people who care about the integrity of the kernel, and all this read-onlyness stuff implies that some do.
On 7 Apr 2017 at 9:14, Andy Lutomirski wrote: > On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote: > > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote: > > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) > > somewhere sensible should make those "leaks" visible fast -- and their > > exploitation impossible, i.e. fail hard. > > The leaks surely exist and now we'll just add an exploitable BUG. can you please share those leaks that 'surely exist' and CC oss-security while at it? > I think we're approaching this all wrong, actually. The fact that x86 > has this CR0.WP thing is arguably a historical accident, and the fact > that PaX uses it doesn't mean that PaX is doing it the best way for > upstream Linux. > > Why don't we start at the other end and do a generic non-arch-specific > implementation: set up an mm_struct that contains an RW alias of the > relevant parts of rodata and use use_mm to access it. (That is, > get_fs() to back up the old fs, set_fs(USER_DS), > use_mm(&rare_write_mm), do the write using copy_to_user, undo > everything.) > > Then someone who cares about performance can benchmark the CR0.WP > approach against it and try to argue that it's a good idea. This > benchmark should wait until I'm done with my PCID work, because PCID > is going to make use_mm() a whole heck of a lot faster. in my measurements switching PCID hovers around 230 cycles for snb-ivb and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's of course a whole lot more impact for switching address spaces so it'll never be fast enough to beat cr0.wp.
On Fri, 7 Apr 2017, Andy Lutomirski wrote: > On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote: > > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote: > >> I really do not care whether PaX wants to chase and verify that over and > >> over. I certainly don't want to take the chance to leak CR0.WP ever and I > >> very much care about extra stuff to check in the entry/exit path. > > > > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) > > somewhere sensible should make those "leaks" visible fast -- and their > > exploitation impossible, i.e. fail hard. > > The leaks surely exist and now we'll just add an exploitable BUG. :) > I think we're approaching this all wrong, actually. The fact that x86 > has this CR0.WP thing is arguably a historical accident, and the fact > that PaX uses it doesn't mean that PaX is doing it the best way for > upstream Linux. As I said before. It should be fused to 1. > Why don't we start at the other end and do a generic non-arch-specific > implementation: set up an mm_struct that contains an RW alias of the > relevant parts of rodata and use use_mm to access it. (That is, > get_fs() to back up the old fs, set_fs(USER_DS), > use_mm(&rare_write_mm), do the write using copy_to_user, undo > everything.) That works too, though I'm not sure that it's more efficient than temporarily creating and undoing a single cpu local alias mapping similar to the kmap_atomic mechanism in the highmem case. Aside of that the alias mapping requires a full PGDIR entry unless you want to get into the mess of keeping yet another identity mapping up to date. Thanks, tglx
On Fri, Apr 7, 2017 at 1:44 PM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Fri, 7 Apr 2017, Andy Lutomirski wrote: >> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote: >> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote: >> >> I really do not care whether PaX wants to chase and verify that over and >> >> over. I certainly don't want to take the chance to leak CR0.WP ever and I >> >> very much care about extra stuff to check in the entry/exit path. >> > >> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) >> > somewhere sensible should make those "leaks" visible fast -- and their >> > exploitation impossible, i.e. fail hard. >> >> The leaks surely exist and now we'll just add an exploitable BUG. > > :) > >> I think we're approaching this all wrong, actually. The fact that x86 >> has this CR0.WP thing is arguably a historical accident, and the fact >> that PaX uses it doesn't mean that PaX is doing it the best way for >> upstream Linux. > > As I said before. It should be fused to 1. > >> Why don't we start at the other end and do a generic non-arch-specific >> implementation: set up an mm_struct that contains an RW alias of the >> relevant parts of rodata and use use_mm to access it. (That is, >> get_fs() to back up the old fs, set_fs(USER_DS), >> use_mm(&rare_write_mm), do the write using copy_to_user, undo >> everything.) > > That works too, though I'm not sure that it's more efficient than > temporarily creating and undoing a single cpu local alias mapping similar > to the kmap_atomic mechanism in the highmem case. > > Aside of that the alias mapping requires a full PGDIR entry unless you want > to get into the mess of keeping yet another identity mapping up to date. I probably chose the wrong name for this feature (write rarely). That's _usually_ true, but "sensitive_write()" was getting rather long. The things that we need to protect with this are certainly stuff that doesn't get much writing, but some things are just plain sensitive (like page tables) and we should still try to be as fast as possible with them. I'll try to include a third example in the next posting of the series that makes this more obvious. I'm all for a general case for the infrastructure (as Andy and Mark have mentioned), but I don't want to get into the situation where people start refusing to use it because it's "too slow" (for example, see refcount_t vs net-dev right now). -Kees
> I probably chose the wrong name for this feature (write rarely). > That's _usually_ true, but "sensitive_write()" was getting rather > long. The things that we need to protect with this are certainly stuff > that doesn't get much writing, but some things are just plain > sensitive (like page tables) and we should still try to be as fast as > possible with them. Not too late to rename it. Scoped write? I think it makes sense to use a different API than PaX for portability too, but not a different x86 implementation. It's quite important to limit the writes to the calling thread and it needs to perform well to be introduced widely. > I'm all for a general case for the infrastructure (as Andy and Mark > have mentioned), but I don't want to get into the situation where > people start refusing to use it because it's "too slow" (for example, > see refcount_t vs net-dev right now). Meanwhile, the PaX implementation has improved to avoid the issues that were brought up while only introducing a single always-predicted (due to code placement) branch on the overflow flag. That seems to have gone unnoticed upstream, where there's now a much slower implementation that's not more secure, and is blocked from introduction in areas where it's most needed based on the performance. Not to mention that it's opt-in... which is never going to work.
>> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) >> somewhere sensible should make those "leaks" visible fast -- and their >> exploitation impossible, i.e. fail hard. > > The leaks surely exist and now we'll just add an exploitable BUG. That didn't seem to matter for landing a rewrite of KSTACKOVERFLOW with a bunch of *known* DoS bugs dealt with in grsecurity and those were known issues that were unfixed for no apparent reason other than keeping egos intact. It looks like there are still some left... In that case, there also wasn't a security/performance advantage.
On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote: > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote: > >> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote: >> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote: >> >> On Fri, 7 Apr 2017, Mathias Krause wrote: >> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) >> > somewhere sensible should make those "leaks" visible fast -- and their >> > exploitation impossible, i.e. fail hard. >> >> The leaks surely exist and now we'll just add an exploitable BUG. > > can you please share those leaks that 'surely exist' and CC oss-security > while at it? I meant in the patchset here, not in grsecurity. grsecurity (on very, very brief inspection) seems to read cr0 and fix it up in pax_enter_kernel. >> >> Then someone who cares about performance can benchmark the CR0.WP >> approach against it and try to argue that it's a good idea. This >> benchmark should wait until I'm done with my PCID work, because PCID >> is going to make use_mm() a whole heck of a lot faster. > > in my measurements switching PCID is hovers around 230 cycles for snb-ivb > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's > of course a whole lot more impact for switching address spaces so it'll never > be fast enough to beat cr0.wp. > If I'm reading this right, you're saying that a non-flushing CR3 write is about the same cost as a CR0.WP write. If so, then why should CR0 be preferred over the (arch-neutral) CR3 approach? And why would switching address spaces obviously be much slower? There'll be a very small number of TLB fills needed for the actual protected access. --Andy
On Fri, Apr 7, 2017 at 9:21 PM, Daniel Micay <danielmicay@gmail.com> wrote: >>> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP)) >>> somewhere sensible should make those "leaks" visible fast -- and their >>> exploitation impossible, i.e. fail hard. >> >> The leaks surely exist and now we'll just add an exploitable BUG. > > That didn't seem to matter for landing a rewrite of KSTACKOVERFLOW > with a bunch of *known* DoS bugs dealt with in grsecurity and those > were known issues that were unfixed for no apparent reason other than > keeping egos intact. It looks like there are still some left... > > In that case, there also wasn't a security/performance advantage. This is wildly off topic, but I think it's worth answering anyway because there's an important point here: grsecurity and PaX are great projects. They have a lot of good ideas, and they're put together quite nicely. The upstream kernel should *not* do things differently from the way they are in grsecurity/PaX just because it wants to be different. Conversely, the upstream kernel should not do things the same way as PaX just to be more like PaX. Keep in mind that the upstream kernel and grsecurity/PaX operate under different constraints. The upstream kernel tries to keep itself clean and to make tree-wide updates rather than keeping compatibility stuff around. PaX and grsecurity presumably want to retain some degree of simplicity when porting to newer upstream versions. In the context of virtually mapped stacks / KSTACKOVERFLOW, this naturally leads to different solutions. The upstream kernel had a bunch of buggy drivers that played badly with virtually mapped stacks. grsecurity sensibly went for the approach where the buggy drivers kept working. The upstream kernel went for the approach of fixing the drivers rather than keeping a compatibility workaround. Different constraints, different solutions. The point being that, if someone sends a patch to the x86 entry code that's justified by "be like PaX" or by "be different than PaX", that's not okay. It needs a real justification that stands on its own. In the case of rare writes or pax_open_kernel [1] or whatever we want to call it, CR3 would work without arch-specific code, and CR0 would not. That's an argument for CR3 that would need to be countered by something. (Sure, avoiding leaks either way might need arch changes. OTOH, a *randomized* CR3-based approach might not have as much of a leak issue to begin with.) [1] Contrary to popular belief, I don't sit around reading grsecurity code or config options, so I really don't know what this thing is called.
> grsecurity and PaX are great projects. They have a lot of good ideas, > and they're put together quite nicely. The upstream kernel should > *not* do things differently from they way they are in grsecurity/PaX > just because it wants to be different. Conversely, the upstream > kernel should not do things the same way as PaX just to be more like > PaX. > > Keep in mind that the upstream kernel and grsecurity/PaX operate under > different constraints. The upstream kernel tries to keep itself clean > and to make tree-wide updates rather that keeping compatibility stuff > around. PaX and grsecurity presumably want to retain some degree of > simplicity when porting to newer upstream versions. > > In the context of virtually mapped stacks / KSTACKOVERFLOW, this > naturally leads to different solutions. The upstream kernel had a > bunch of buggy drivers that played badly with virtually mapped stacks. > grsecurity sensibly went for the approach where the buggy drivers kept > working. The upstream kernel went for the approach of fixing the > drivers rather than keeping a compatibility workaround. Different > constraints, different solutions. Sure, but nothing is currently landed right now and the PaX implementation is a known quantity with a lot of testing. The submitted code is aimed at rare writes to globals, but this feature is more than that and design decisions shouldn't be based on just the short term. Kees is sensibly landing features by submitting them a little bit at a time, but a negative side effect of that is too much focus on just the initial proposed usage. As a downstream that's going to be making heavy use of mainline security features as a base due to PaX and grsecurity becoming commercial only private patches (*because* of upstream and the companies involved), I hope things go in the right direction i.e. not weaker/slower implementations than PaX. For example, while USERCOPY isn't entirely landed upstream (missing slab whitelisting and user offset/size), it's the base for the full feature and is going to get more testing. On the other hand, refcount_t and the slab/page sanitization work is 100% useless if you're willing to incorporate the changes to do it without needless performance loss and complexity. PAN emulation on 64-bit ARM was fresh ground while on ARMv7 a weaker implementation was used for no better reason than clashing egos. The upstream virtual mapped stacks will probably be perfectly good, but I just find it to be such a waste of time and effort to reinvent it and retread the same ground in terms of finding the bugs. I actually care a lot more about 64-bit ARM support than I do x86, but using a portable API for pax_open_kernel (for the simple uses at least) is separate from choosing the underlying implementation. There might not be a great way to do it on the architectures I care about but that doesn't need to hinder x86. It's really not that much code... A weaker/slower implementation for x86 also encourages the same elsewhere. > In the case of rare writes or pax_open_kernel [1] or whatever we want > to call it, CR3 would work without arch-specific code, and CR0 would > not. That's an argument for CR3 that would need to be countered by > something. (Sure, avoiding leaks either way might need arch changes. > OTOH, a *randomized* CR3-based approach might not have as much of a > leak issue to begin with.) By randomized do you mean just ASLR? 
Even if the window of opportunity to exploit it is small, it's really not the same and this has a lot more use than just rare writes to small global variables. I wouldn't consider stack ASLR to be a replacement for making them readable/writable only by the current thread either (required for race-free return CFI on x86, at least without not using actual returns). > [1] Contrary to popular belief, I don't sit around reading grsecurity > code or config options, so I really don't know what this thing is > called. Lots of the features aren't actually named. Maybe this could be considered part of KERNEXEC but I don't think it is.
On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote: > The > submitted code is aimed at rare writes to globals, but this feature is > more than that and design decisions shouldn't be based on just the > short term. Then, if you disagree with a proposed design, *explain why* in a standalone manner. Say what future uses a different design would have. > I actually care a lot more about 64-bit ARM support than I do x86, but > using a portable API for pax_open_kernel (for the simple uses at > least) is separate from choosing the underlying implementation. There > might not be a great way to do it on the architectures I care about > but that doesn't need to hinder x86. It's really not that much code... > A weaker/slower implementation for x86 also encourages the same > elsewhere. No one has explained how CR0.WP is weaker or slower than my proposal. Here's what I'm proposing: At boot, choose a random address A. Create an mm_struct that has a single VMA starting at A that represents the kernel's rarely-written section. Compute O = (A - VA of rarely-written section). To do a rare write, use_mm() the mm, write to (VA + O), then unuse_mm(). This should work on any arch that has an MMU that allows this type of aliasing and that doesn't have PA-based protections on the rarely-written section. It'll be considerably slower than CR0.WP on a current x86 kernel, but, with PCID landed, it shouldn't be much slower. It has the added benefit that writes to non-rare-write data using the rare-write primitive will fail. --Andy
* Andy Lutomirski <luto@kernel.org> wrote: > On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote: > > The > > submitted code is aimed at rare writes to globals, but this feature is > > more than that and design decisions shouldn't be based on just the > > short term. > > Then, if you disagree with a proposed design, *explain why* in a > standalone manner. Say what future uses a different design would > have. > > > I actually care a lot more about 64-bit ARM support than I do x86, but > > using a portable API for pax_open_kernel (for the simple uses at > > least) is separate from choosing the underlying implementation. There > > might not be a great way to do it on the architectures I care about > > but that doesn't need to hinder x86. It's really not that much code... > > A weaker/slower implementation for x86 also encourages the same > > elsewhere. > > No one has explained how CR0.WP is weaker or slower than my proposal. > Here's what I'm proposing: > > At boot, choose a random address A. Create an mm_struct that has a > single VMA starting at A that represents the kernel's rarely-written > section. Compute O = (A - VA of rarely-written section). To do a > rare write, use_mm() the mm, write to (VA + O), then unuse_mm(). BTW., note that this is basically a pagetable based protection key variant. > It'll be considerably slower than CR0.WP on a current x86 kernel, but, with PCID > landed, it shouldn't be much slower. It has the added benefit that writes to > non-rare-write data using the rare-write primitive will fail. ... which is a security advantage of the use_mm() based design you suggest. Thanks, Ingo
On 7 Apr 2017 at 21:58, Andy Lutomirski wrote: > On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote: > > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote: > >> Then someone who cares about performance can benchmark the CR0.WP > >> approach against it and try to argue that it's a good idea. This > >> benchmark should wait until I'm done with my PCID work, because PCID > >> is going to make use_mm() a whole heck of a lot faster. > > > > in my measurements switching PCID is hovers around 230 cycles for snb-ivb > > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's > > of course a whole lot more impact for switching address spaces so it'll never > > be fast enough to beat cr0.wp. > > > > If I'm reading this right, you're saying that a non-flushing CR3 write > is about the same cost as a CR0.WP write. If so, then why should CR0 > be preferred over the (arch-neutral) CR3 approach? cr3 (page table switching) isn't arch neutral at all ;). you probably meant the higher level primitives except they're not enough to implement the scheme as discussed before since the enter/exit paths are very much arch dependent. on x86 the cost of the pax_open/close_kernel primitives comes from the cr0 writes and nothing else, use_mm suffers not only from the cr3 writes but also locking/atomic ops and cr4 writes on its path and the inevitable TLB entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp a fast path in the microcode and reduce its overhead by an order of magnitude. > And why would switching address spaces obviously be much slower? > There'll be a very small number of TLB fills needed for the actual > protected access. you'll be duplicating TLB entries in the alternative PCID for both code and data, where they will accumulate (=take room away from the normal PCID and expose unwanted memory for access) unless you also flush them when switching back (which then will cost even more cycles). also i'm not sure that processors implement all the 12 PCID bits so depending on how many PCIDs you plan to use, you could be causing even more unnecessary TLB replacements.
On 7 Apr 2017 at 22:07, Andy Lutomirski wrote: > grsecurity and PaX are great projects. They have a lot of good ideas, > and they're put together quite nicely. The upstream kernel should > *not* do things differently from they way they are in grsecurity/PaX > just because it wants to be different. Conversely, the upstream > kernel should not do things the same way as PaX just to be more like > PaX. so weit so gut. > Keep in mind that the upstream kernel and grsecurity/PaX operate under > different constraints. The upstream kernel tries to keep itself clean so do we. > and to make tree-wide updates rather that keeping compatibility stuff > around. so do we (e.g., fptr fixes for RAP, non-refcount atomic users, etc). > PaX and grsecurity presumably want to retain some degree of > simplicity when porting to newer upstream versions. s/simplicity/minimality/ as the code itself can be complex but that'll be of the minimal complexity we can come up with. > In the context of virtually mapped stacks / KSTACKOVERFLOW, this > naturally leads to different solutions. The upstream kernel had a > bunch of buggy drivers that played badly with virtually mapped stacks. > grsecurity sensibly went for the approach where the buggy drivers kept > working. The upstream kernel went for the approach of fixing the > drivers rather than keeping a compatibility workaround. Different > constraints, different solutions. except that's not what happened at all. spender's first version did just a vmalloc for the kstack like the totally NIH'd version upstream does now. while we always anticipated buggy dma users and thus had code that would detect them so that we could fix them, we quickly figured that the upstream kernel wasn't quite up to snuff as we had assumed and faced with the amount of buggy code, we went for the current vmap approach which kept users' systems working instead of breaking them. you're trying to imply that upstream fixed the drivers but as the facts show, that's not true. you simply unleashed your code on the world and hoped(?) that enough suckers would try it out during the -rc window. as we all know several releases and almost a year later, that was a losing bet as you still keep fixing those drivers (and something tells me that we haven't seen the end of it). this is simply irresponsible engineering for no technical reason. > In the case of rare writes or pax_open_kernel [1] or whatever we want > to call it, CR3 would work without arch-specific code, and CR0 would > not. That's an argument for CR3 that would need to be countered by > something. (Sure, avoiding leaks either way might need arch changes. > OTOH, a *randomized* CR3-based approach might not have as much of a > leak issue to begin with.) i have yet to see anyone explain what they mean by 'leak' here but if it is what i think it is then the arch specific entry/exit changes are not optional but mandatory. see below for randomization. [merging in your other mail as it's the same topic] > No one has explained how CR0.WP is weaker or slower than my proposal. you misunderstood, Daniel was talking about your use_mm approach. > Here's what I'm proposing: > > At boot, choose a random address A. what is the threat that a random address defends against? > Create an mm_struct that has a > single VMA starting at A that represents the kernel's rarely-written > section. Compute O = (A - VA of rarely-written section). To do a > rare write, use_mm() the mm, write to (VA + O), then unuse_mm(). 
the problem is that the amount of __read_only data extends beyond vmlinux, i.e., this approach won't scale. another problem is that it can't be used inside use_mm and switch_mm themselves (no read-only task structs or percpu pgd for you ;) and probably several other contexts. last but not least, use_mm says this about itself: (Note: this routine is intended to be called only from a kernel thread context) so using it will need some engineering (or the comment be fixed). > This should work on any arch that has an MMU that allows this type of > aliasing and that doesn't have PA-based protections on the rarely-written > section. you didn't address the arch-specific changes needed in the enter/exit paths. > It'll be considerably slower than CR0.WP on a current x86 kernel, but, > with PCID landed, it shouldn't be much slower. based on my experience with UDEREF on amd64, unfortunately PCID isn't all it's cracked up to be (IIRC, it maybe halved the UDEREF overhead instead of basically eliminating it as i had anticipated, and that was on snb, ivb and later do even worse). > It has the added benefit that writes to non-rare-write data using the > rare-write primitive will fail. what is the threat model you're assuming for this feature? based on what i have for PaX (arbitrary read/write access exploited for data-only attacks), the above makes no sense to me...
On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote: > On 7 Apr 2017 at 21:58, Andy Lutomirski wrote: > >> On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote: >> > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote: >> >> Then someone who cares about performance can benchmark the CR0.WP >> >> approach against it and try to argue that it's a good idea. This >> >> benchmark should wait until I'm done with my PCID work, because PCID >> >> is going to make use_mm() a whole heck of a lot faster. >> > >> > in my measurements switching PCID is hovers around 230 cycles for snb-ivb >> > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's >> > of course a whole lot more impact for switching address spaces so it'll never >> > be fast enough to beat cr0.wp. >> > >> >> If I'm reading this right, you're saying that a non-flushing CR3 write >> is about the same cost as a CR0.WP write. If so, then why should CR0 >> be preferred over the (arch-neutral) CR3 approach? > > cr3 (page table switching) isn't arch neutral at all ;). you probably meant > the higher level primitives except they're not enough to implement the scheme > as discussed before since the enter/exit paths are very much arch dependent. Yes. > > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0 > writes and nothing else, use_mm suffers not only from the cr3 writes but > also locking/atomic ops and cr4 writes on its path and the inevitable TLB > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp > a fast path in the microcode and reduce its overhead by an order of magnitude. > If the CR4 writes happen for this use case, that's a bug. >> And why would switching address spaces obviously be much slower? >> There'll be a very small number of TLB fills needed for the actual >> protected access. > > you'll be duplicating TLB entries in the alternative PCID for both code > and data, where they will accumulate (=take room away from the normal PCID > and expose unwanted memory for access) unless you also flush them when > switching back (which then will cost even more cycles). also i'm not sure > that processors implement all the 12 PCID bits so depending on how many PCIDs > you plan to use, you could be causing even more unnecessary TLB replacements. > Unless the CPU is rather dumber than I expect, the only duplicated entries should be for the writable aliases of pages that are written. The rest of the pages are global and should be shared for all PCIDs. --Andy
On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote: > >> In the context of virtually mapped stacks / KSTACKOVERFLOW, this >> naturally leads to different solutions. The upstream kernel had a >> bunch of buggy drivers that played badly with virtually mapped stacks. >> grsecurity sensibly went for the approach where the buggy drivers kept >> working. The upstream kernel went for the approach of fixing the >> drivers rather than keeping a compatibility workaround. Different >> constraints, different solutions. > > except that's not what happened at all. spender's first version did just > a vmalloc for the kstack like the totally NIH'd version upstream does > now. while we always anticipated buggy dma users and thus had code that > would detect them so that we could fix them, we quickly figured that the > upstream kernel wasn't quite up to snuff as we had assumed and faced with > the amount of buggy code, we went for the current vmap approach which > kept users' systems working instead of breaking them. > > you're trying to imply that upstream fixed the drivers but as the facts > show, that's not true. you simply unleashed your code on the world and > hoped(?) that enough suckers would try it out during the -rc window. as > we all know several releases and almost a year later, that was a losing > bet as you still keep fixing those drivers (and something tells me that > we haven't seen the end of it). this is simply irresponsible engineering > for no technical reason. I consider breaking buggy drivers (in a way that they either generally work okay or that they break with a nice OOPS depending on config) to be better than having a special case in what's supposed to be a fast path to keep them working. I did consider forcing the relevant debug options on for a while just to help shake these bugs out the woodwork faster. > >> In the case of rare writes or pax_open_kernel [1] or whatever we want >> to call it, CR3 would work without arch-specific code, and CR0 would >> not. That's an argument for CR3 that would need to be countered by >> something. (Sure, avoiding leaks either way might need arch changes. >> OTOH, a *randomized* CR3-based approach might not have as much of a >> leak issue to begin with.) > > i have yet to see anyone explain what they mean by 'leak' here but if it > is what i think it is then the arch specific entry/exit changes are not > optional but mandatory. see below for randomization. By "leak" I mean that a bug or exploit causes unintended code to run with CR0.WP or a special CR3 or a special PTE or whatever loaded. PaX hooks the entry code to avoid leaks. >> At boot, choose a random address A. > > what is the threat that a random address defends against? Makes it harder to exploit a case where the CR3 setting leaks. > >> Create an mm_struct that has a >> single VMA starting at A that represents the kernel's rarely-written >> section. Compute O = (A - VA of rarely-written section). To do a >> rare write, use_mm() the mm, write to (VA + O), then unuse_mm(). > > the problem is that the amount of __read_only data extends beyond vmlinux, > i.e., this approach won't scale. another problem is that it can't be used > inside use_mm and switch_mm themselves (no read-only task structs or percpu > pgd for you ;) and probably several other contexts. Can you clarify these uses that extend beyond vmlinux? I haven't looked at the grsecurity patch extensively. Are you talking about the BPF JIT stuff? 
If so, I think that should possibly be handled a bit differently, since I think the normal write-to-rare-write-vmlinux-sections primitive should preferably *not* be usable to write to executable pages. Using a real mm_struct for this could help. > > last but not least, use_mm says this about itself: > > (Note: this routine is intended to be called only > from a kernel thread context) > > so using it will need some engineering (or the comment be fixed). Indeed. >> It has the added benefit that writes to non-rare-write data using the >> rare-write primitive will fail. > > what is the threat model you're assuming for this feature? based on what i > have for PaX (arbitrary read/write access exploited for data-only attacks), > the above makes no sense to me... > If I use the primitive to try to write a value to the wrong section (write to kernel text, for example), IMO it would be nice to OOPS instead of succeeding. Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like function may be carefully audited by a friendly security expert such as yourself. It would be nice to harden the primitive to a reasonable extent against minor misuses such as putting it in a context where the compiler will emit mov-a-reg-with-WP-set-to-CR0; ret.
On Fri, 7 Apr 2017, PaX Team wrote: > On 7 Apr 2017 at 11:46, Thomas Gleixner wrote: > > > On Fri, 7 Apr 2017, Mathias Krause wrote: > > > Well, doesn't look good to me. NMIs will still be able to interrupt > > > this code and will run with CR0.WP = 0. > > > > > > Shouldn't you instead question yourself why PaX can do it "just" with > > > preempt_disable() instead?! > > > > That's silly. Just because PaX does it, doesn't mean it's correct. > > is that FUD or do you have actionable information to share? That has absolutely nothing to do with FUD. I'm merely not accepting argumentations which say: PaX can do it "just".... That has exactly zero technical merit and it's not asked too much to provide precise technical arguments why one implementation is better than some other. > > To be honest, playing games with the CR0.WP bit is outright stupid to begin with. > > why is that? cr0.wp exists since the i486 and its behaviour fits my > purposes quite well, it's the best security/performance i know of. Works for me has never been a good engineering principle. > > Whether protected by preempt_disable or local_irq_disable, to make that > > work it needs CR0 handling in the exception entry/exit at the lowest > > level. > > correct. > > > And that's just a nightmare maintainence wise as it's prone to be > > broken over time. > > i've got 14 years of experience of maintaining it and i never saw it break. It's a difference whether you maintain a special purpose patch set out of tree for a subset of architectures - I certainly know what I'm talking about - or keep stuff sane in the upstream kernel. > > I certainly don't want to take the chance to leak CR0.WP ever > > why and where would cr0.wp leak? It's bound to happen due to some subtle mistake and up to the point where you catch it (in the scheduler or entry/exit path) the world is writeable. And that will be some almost never executed error path which can be triggered by a carefully crafted attack. A very restricted writeable region is definitely preferred over full world writeable then, right? > > Why the heck should we care about rare writes being performant? > > because you've been misled by the NIH crowd here that the PaX feature they > tried to (badly) extract from has anything to do with frequency of writes. It would be appreciated if you could keep your feud out of this. It's enough to tell me that 'rare write' is a misleading term and why. > > Making the world and some more writeable hardly qualifies as tightly > > focused. > > you forgot to add 'for a window of a few insns' and that the map/unmap If it'd be guaranteed to be a few instructions, then I wouldn't be that worried. The availability of make_world_writeable() as an unrestricted usable function makes me nervous as hell. We've had long standing issues where kmap_atomic() got leaked through a hard to spot almost never executed error handling path. And the same is bound to happen with this, just with a way worse outcome. > approach does the same under an attacker controlled ptr. Which attacker controlled pointer? Thanks, tglx
On Sat, Apr 08, 2017 at 08:20:03AM -0700, Andy Lutomirski wrote: > On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote: > > The > > submitted code is aimed at rare writes to globals, but this feature is > > more than that and design decisions shouldn't be based on just the > > short term. > > Then, if you disagree with a proposed design, *explain why* in a > standalone manner. Say what future uses a different design would > have. > > > I actually care a lot more about 64-bit ARM support than I do x86, but > > using a portable API for pax_open_kernel (for the simple uses at > > least) is separate from choosing the underlying implementation. There > > might not be a great way to do it on the architectures I care about > > but that doesn't need to hinder x86. It's really not that much code... > > A weaker/slower implementation for x86 also encourages the same > > elsewhere. > > No one has explained how CR0.WP is weaker or slower than my proposal. > Here's what I'm proposing: > > At boot, choose a random address A. Create an mm_struct that has a > single VMA starting at A that represents the kernel's rarely-written > section. Compute O = (A - VA of rarely-written section). To do a > rare write, use_mm() the mm, write to (VA + O), then unuse_mm(). > > This should work on any arch that has an MMU that allows this type of > aliasing and that doesn't have PA-based protections on the > rarely-written section. Modulo randomization, that sounds exactly like what I had envisaged [1], so that makes sense to me. Thanks, Mark. [1] http://www.openwall.com/lists/kernel-hardening/2016/11/18/3
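To make the proposal above concrete, here is a minimal sketch of the write path it implies. rare_mm, rare_offset and rare_write_ulong are hypothetical names; the boot-time creation of the alias mm/VMA, the locking, and the use_mm() kernel-thread caveat raised elsewhere in the thread are all omitted.

#include <linux/mm_types.h>
#include <linux/mmu_context.h>
#include <linux/compiler.h>

/* Alias mm built at boot: maps the rarely-written section writable at the
 * random address A.  rare_offset is O = A - VA of that section. */
static struct mm_struct *rare_mm;
static unsigned long rare_offset;

static void rare_write_ulong(unsigned long *va, unsigned long val)
{
	unsigned long *alias = (unsigned long *)((unsigned long)va + rare_offset);

	use_mm(rare_mm);		/* switch to the alias address space */
	WRITE_ONCE(*alias, val);	/* store through the writable alias */
	unuse_mm(rare_mm);		/* switch back */
}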
On Fri, Apr 07, 2017 at 04:45:26PM +0200, Peter Zijlstra wrote: > On Fri, Apr 07, 2017 at 12:51:15PM +0200, Mathias Krause wrote: > > Why that? It allows fast and CPU local modifications of r/o memory. > > OTOH, an approach that needs to fiddle with page table entries > > requires global synchronization to keep the individual TLB states in > > sync. Hmm.. Not that fast, I'd say. > > The fixmaps used for kmap_atomic are per-cpu, no global sync required. That might be fine for x86, but for some architectures fixmap slots and kmap_atomic mappings happen to be visible to other CPUs even if they're not required to be. Using an mm solves that for all, though. Thanks, Mark.
On 9 Apr 2017 at 17:10, Andy Lutomirski wrote: > On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote: > > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0 > > writes and nothing else, use_mm suffers not only from the cr3 writes but > > also locking/atomic ops and cr4 writes on its path and the inevitable TLB > > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp > > a fast path in the microcode and reduce its overhead by an order of magnitude. > > > > If the CR4 writes happen in for this use case, that's a bug. that depends on how you plan to handle perf/rdpmc users and how many alternative mm structs you plan to manage (one global, one per cpu, one per mm struct, etc). > > you'll be duplicating TLB entries in the alternative PCID for both code > > and data, where they will accumulate (=take room away from the normal PCID > > and expose unwanted memory for access) unless you also flush them when > > switching back (which then will cost even more cycles). also i'm not sure > > that processors implement all the 12 PCID bits so depending on how many PCIDs > > you plan to use, you could be causing even more unnecessary TLB replacements. > > > > Unless the CPU is rather dumber than I expect, the only duplicated > entries should be for the writable aliases of pages that are written. > The rest of the pages are global and should be shared for all PCIDs. well, 4.10.2.4 has language like this (4.10.3.2 implies similar): A logical processor may use a global TLB entry to translate a linear address, even if the TLB entry is associated with a PCID different from the current PCID. that to me says that global page entries are associated with a PCID and may (not) be used while in another PCID. in Intel-speak that's not 'dumb' but "tricks up our sleeve that we don't really want to tell you about in detail, except perhaps under a NDA".
On Mon, Apr 10, 2017 at 3:42 AM, PaX Team <pageexec@freemail.hu> wrote: > On 9 Apr 2017 at 17:10, Andy Lutomirski wrote: > >> On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote: >> > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0 >> > writes and nothing else, use_mm suffers not only from the cr3 writes but >> > also locking/atomic ops and cr4 writes on its path and the inevitable TLB >> > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp >> > a fast path in the microcode and reduce its overhead by an order of magnitude. >> > >> >> If the CR4 writes happen in for this use case, that's a bug. > > that depends on how you plan to handle perf/rdpmc users and how many > alternative mm structs you plan to manage (one global, one per cpu, > one per mm struct, etc). > I was thinking one global unless more are needed for some reason. >> > you'll be duplicating TLB entries in the alternative PCID for both code >> > and data, where they will accumulate (=take room away from the normal PCID >> > and expose unwanted memory for access) unless you also flush them when >> > switching back (which then will cost even more cycles). also i'm not sure >> > that processors implement all the 12 PCID bits so depending on how many PCIDs >> > you plan to use, you could be causing even more unnecessary TLB replacements. >> > >> >> Unless the CPU is rather dumber than I expect, the only duplicated >> entries should be for the writable aliases of pages that are written. >> The rest of the pages are global and should be shared for all PCIDs. > > well, 4.10.2.4 has language like this (4.10.3.2 implies similar): > > A logical processor may use a global TLB entry to translate a linear > address, even if the TLB entry is associated with a PCID different > from the current PCID. I read this as: the CPU still semantically tags global TLB entries with a PCID, but the CPU will (probably) use those TLB entries even if the PCIDs don't match. IIRC none of the TLB instructions have any effect that makes the PCID associated with a global entry visible, so the CPU could presumably omit the PCID tags entirely for global entries. E.g. I don't think there's any way to say "flush global entries with a given PCID". --Andy
On 9 Apr 2017 at 17:31, Andy Lutomirski wrote: > On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote: > > > I consider breaking buggy drivers (in a way that they either generally > work okay do they work okay when the dma transfer goes to a buffer that crosses physically non-contiguous page boundaries? > or that they break with a nice OOPS depending on config) to > be better than having a special case in what's supposed to be a fast > path to keep them working. I did consider forcing the relevant debug > options on for a while just to help shake these bugs out the woodwork > faster. that's a false dichotomy, discovering buggy drivers is orthogonal to (not) breaking users' systems as grsec shows. and how did you expect to 'shake these bugs out' when your own suggestion at the time was for distros to not enable this feature 'for a while'? > > i have yet to see anyone explain what they mean by 'leak' here but if it > > is what i think it is then the arch specific entry/exit changes are not > > optional but mandatory. see below for randomization. > > By "leak" I mean that a bug or exploit causes unintended code to run > with CR0.WP or a special CR3 or a special PTE or whatever loaded. how can a bug/exploit cause something like this? > PaX hooks the entry code to avoid leaks. PaX doesn't instrument enter/exit paths to prevent state leaks into interrupt context (it's a useful sideeffect though), rather it's needed for correctness if the kernel can be interrupted at all while it's open (address space switching will need to handle this too but you have yet to address it). > >> At boot, choose a random address A. > > > > what is the threat that a random address defends against? > > Makes it harder to exploit a case where the CR3 setting leaks. if an attacker has the ability to cause this leak (details of which are subject to the question i asked above) then why wouldn't he simply also make use of the primitives to modify his target via the writable vma without ever having to know the randomized address? i also wonder what exploit power you assume for this attack and whether that is already enough to simply go after page tables, etc instead of figuring out the alternative address space. > > the problem is that the amount of __read_only data extends beyond vmlinux, > > i.e., this approach won't scale. another problem is that it can't be used > > inside use_mm and switch_mm themselves (no read-only task structs or percpu > > pgd for you ;) and probably several other contexts. > > Can you clarify these uses that extend beyond vmlinux? one obvious candidate is modules. how do you want to handle them? then there's a whole bunch of dynamically allocated data that is a candidate for __read_only treatment. > > what is the threat model you're assuming for this feature? based on what i > > have for PaX (arbitrary read/write access exploited for data-only attacks), > > the above makes no sense to me... > > If I use the primitive to try to write a value to the wrong section > (write to kernel text, for example), IMO it would be nice to OOPS > instead of succeeding. this doesn't tell me what power you're assuming the attacker has. is it my generic arbitrary read-write ability or something more restricted and thus less realistic? i.e., how does the attacker get to 'use the primitive' and (presumably) also control the ptr/data? 
as for your specific example, kernel text isn't 'non-rare-write data' that you spoke of before, but that aside, what prevents an attacker from computing his target ptr so that after your accessor rebases it, it'd point back to his intended target instead? will you range-check (find_vma eventually?) each time? how will you make all this code safe from races from another task? the more checks you make, the more likely that something sensitive will spill to memory and be a target itself in order to hijack the sensitive write. > Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like > function will may be carefully audited by a friendly security expert > such as yourself. It would be nice to harden the primitive to a > reasonable extent against minor misuses such as putting it in a > context where the compiler will emit mov-a-reg-with-WP-set-to-CR0; > ret. i don't understand what's there to audit. if you want to treat a given piece of data as __read_only then you have no choice but to allow writes to it via the open/close mechanism and the compiler can tell you just where those writes are (and even do the instrumentation when you get tired of doing it by hand).
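For reference, this is the bracketing pattern being described, written out by hand the way the plugin would instrument it. allowed_flags and update_allowed_flags are hypothetical; __read_only and pax_open_kernel()/pax_close_kernel() are the PaX names used in this thread, with the RFC's equivalent pair sitting behind __arch_rare_write_begin()/__arch_rare_write_end().

/* A hand-instrumented write site: the only code inside the open/close
 * window is the single store to the __read_only object. */
static unsigned long allowed_flags __read_only;

static void update_allowed_flags(unsigned long val)
{
	pax_open_kernel();	/* e.g. clear CR0.WP on x86 */
	allowed_flags = val;	/* the one store to protected data */
	pax_close_kernel();	/* restore write protection */
}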
On 10 Apr 2017 at 10:26, Thomas Gleixner wrote: > On Fri, 7 Apr 2017, PaX Team wrote: > > On 7 Apr 2017 at 11:46, Thomas Gleixner wrote: > > > That's silly. Just because PaX does it, doesn't mean it's correct. > > > > is that FUD or do you have actionable information to share? > > That has absolutely nothing to do with FUD. I'm merily not accepting > argumentations which say: PaX can do it "just".... you implied that what PaX does may not be correct. if you can't back that up with facts and technical arguments then it is FUD. your turn. > That has exactly zero technical merit and it's not asked too much to > provide precise technical arguments why one implementation is better than > some other. exactly. start with explaining what is not correct in PaX with "precise technical arguments". > > > To be honest, playing games with the CR0.WP bit is outright stupid to begin with. > > > > why is that? cr0.wp exists since the i486 and its behaviour fits my > > purposes quite well, it's the best security/performance i know of. > > Works for me has never be a good engineering principle. good thing i didn't say that. on the other hand you failed to provide "precise technical arguments" for why "playing games with the CR0.WP bit is outright stupid to begin with". do you have any to share and discuss? > > > And that's just a nightmare maintainence wise as it's prone to be > > > broken over time. > > > > i've got 14 years of experience of maintaining it and i never saw it break. > > It's a difference whether you maintain a special purpose patch set out of > tree for a subset of architectures - I certainly know what I'm talking > about - or keeping stuff sane in the upstream kernel. there's no difference to me, i keep my stuff sane regardless. of course what you do with your out-of-tree code is your business but don't extrapolate it to mine. now besides argumentum ad verecundiam do you have "precise technical arguments" as to why maintaining a cr0.wp based approach would be "a nightmare maintainence wise as it's prone to be broken over time."? > > > I certainly don't want to take the chance to leak CR0.WP ever > > > > why and where would cr0.wp leak? > > It's bound to happen due to some subtle mistake i don't see what subtle mistake you're thinking of here. can you give me an example? > and up to the point where you catch it (in the scheduler or entry/exit path) > the world is writeable. where such a leak is caught depends on what subtle mistake you're talking about, so let's get back to this point once you answered that question. > And that will be some almost never executed error path which can > be triggered by a carefully crafted attack. open/close calls have nothing to do with error paths or even conditional execution, they're always executed as a sequence so this situation cannot occur. > A very restricted writeable region is definitely preferred over full > world writeable then, right? it doesn't matter when the attacker has an arbitrary read/write primitive which he can just use to modify that 'very restricted writeable region' to whatever he needs to cover first. now if all the data managing this region were also protected then it'd matter but that's never going to happen in the upstream kernel. > > > Making the world and some more writeable hardly qualifies as tightly > > > focused. > > > > you forgot to add 'for a window of a few insns' and that the map/unmap > > If it'd be guaranteed to be a few instructions, then I wouldn't be that > worried. 
it is: by definition, assignments to otherwise __read_only data are (have to be) bracketed with open/close instrumentation. > The availability of make_world_writeable() as an unrestricted > usable function makes me nervous as hell. i can't imagine the nightmares you must have lived through for the two decades during which the kernel was all wide open... joking aside, we can address these fears of yours once you explain just what kind of error situations you have in mind and why they don't also apply to, say, text_poke(). > We've had long standing issues where kmap_atomic() got leaked through a > hard to spot almost never executed error handling path. And the same is > bound to happen with this, just with a way worse outcome. the lifetime and use of kmaps is very different so you'll have to explain in more detail why the problems they had apply here as well. > > approach does the same under an attacker controlled ptr. > > Which attacker controlled pointer? i meant the one passed to the map/unmap code, attacker control over it means the whole world is effectively writable again (Andy's vma approach would restrict this to just the interesting read-only data except the whole thing is irrelevant until all participating data (vma, page tables, etc) are also protected).
On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote: > On 7 Apr 2017 at 22:07, Andy Lutomirski wrote: >> No one has explained how CR0.WP is weaker or slower than my proposal. > > you misunderstood, Daniel was talking about your use_mm approach. > >> Here's what I'm proposing: >> >> At boot, choose a random address A. > > what is the threat that a random address defends against? > >> Create an mm_struct that has a >> single VMA starting at A that represents the kernel's rarely-written >> section. Compute O = (A - VA of rarely-written section). To do a >> rare write, use_mm() the mm, write to (VA + O), then unuse_mm(). > > the problem is that the amount of __read_only data extends beyond vmlinux, > i.e., this approach won't scale. another problem is that it can't be used > inside use_mm and switch_mm themselves (no read-only task structs or percpu > pgd for you ;) and probably several other contexts. These are the limitations that concern me: what will we NOT be able to make read-only as a result of the use_mm() design choice? My RFC series included a simple case and a constify case, but I did not include things like making page tables read-only, etc. I can accept not using cr0, since we need to design something that works on arm64 too, which doesn't have anything like this (AFAIK), but I'd like to make sure we don't paint ourselves into a corner. -Kees
On Mon, Apr 10, 2017 at 1:13 PM, Kees Cook <keescook@chromium.org> wrote: > On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote: >> On 7 Apr 2017 at 22:07, Andy Lutomirski wrote: >>> No one has explained how CR0.WP is weaker or slower than my proposal. >> >> you misunderstood, Daniel was talking about your use_mm approach. >> >>> Here's what I'm proposing: >>> >>> At boot, choose a random address A. >> >> what is the threat that a random address defends against? >> >>> Create an mm_struct that has a >>> single VMA starting at A that represents the kernel's rarely-written >>> section. Compute O = (A - VA of rarely-written section). To do a >>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm(). >> >> the problem is that the amount of __read_only data extends beyond vmlinux, >> i.e., this approach won't scale. another problem is that it can't be used >> inside use_mm and switch_mm themselves (no read-only task structs or percpu >> pgd for you ;) and probably several other contexts. > > These are the limitations that concern me: what will we NOT be able to > make read-only as a result of the use_mm() design choice? My RFC > series included a simple case and a constify case, but I did not > include things like making page tables read-only, etc. If we make page tables read-only, we may need to have multiple levels of rareness. Page table writes aren't all that rare, and I can imagine distros configuring the kernel so that static structs full of function pointers are read-only (IMO that should be the default or even mandatory), but page tables may be a different story. That being said, CR3-twiddling to write to page tables could actually work. Hmm.
On Mon, Apr 10, 2017 at 12:47 PM, PaX Team <pageexec@freemail.hu> wrote: > On 9 Apr 2017 at 17:31, Andy Lutomirski wrote: > >> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote: >> > >> I consider breaking buggy drivers (in a way that they either generally >> work okay > > do they work okay when the dma transfer goes to a buffer that crosses > physically non-contiguous page boundaries? Nope. Like I said, I considered making the debugging mandatory. I may still send patches to do that. >> By "leak" I mean that a bug or exploit causes unintended code to run >> with CR0.WP or a special CR3 or a special PTE or whatever loaded. > > how can a bug/exploit cause something like this? For example: a bug in entry logic, a bug in perf NMI handling, or even a bug in *nested* perf NMI handling (egads!). Or maybe some super nasty interaction with suspend/resume. These are all fairly unlikely (except the nested perf case), but still. As a concrete example, back before my big NMI improvement series, it was possible for an NMI return to invoke espfix and/or take an IRET fault. This *shouldn't* happen on return to a context with CR0.WP clear, but it would be incredibly nasty if it did. The code is separated out now, so it should be okay... > >> PaX hooks the entry code to avoid leaks. > > PaX doesn't instrument enter/exit paths to prevent state leaks into interrupt > context (it's a useful sideeffect though), rather it's needed for correctness > if the kernel can be interrupted at all while it's open (address space switching > will need to handle this too but you have yet to address it). I don't think we disagree here. A leak would be a case of incorrectness. > >> >> At boot, choose a random address A. >> > >> > what is the threat that a random address defends against? >> >> Makes it harder to exploit a case where the CR3 setting leaks. > > if an attacker has the ability to cause this leak (details of which are subject > to the question i asked above) then why wouldn't he simply also make use of the > primitives to modify his target via the writable vma without ever having to know > the randomized address? i also wonder what exploit power you assume for this > attack and whether that is already enough to simply go after page tables, etc > instead of figuring out the alternative address space. I'm imagining the power to (a) cause some code path to execute while the kernel is "open" and (b) the ability to use the buggy code path in question to write to a fully- or partially-controlled address. With CR0.WP clear, this can write shellcode directly. With CR3 pointing to a page table that maps some parts of the kernel (but not text!) at a randomized offset, you need to figure out the offset and find some other target in the mapping that gets your exploit farther along. You can't write shellcode directly. > >> > the problem is that the amount of __read_only data extends beyond vmlinux, >> > i.e., this approach won't scale. another problem is that it can't be used >> > inside use_mm and switch_mm themselves (no read-only task structs or percpu >> > pgd for you ;) and probably several other contexts. >> >> Can you clarify these uses that extend beyond vmlinux? > > one obvious candidate is modules. how do you want to handle them? then there's > a whole bunch of dynamically allocated data that is a candidate for __read_only > treatment. Exactly the same way. Map those regions at the same offset, maybe even in the same VMA.
There's no reason that an artificial VMA used for this purpose can't be many gigabytes long and have vm_ops that only allow access to certain things. But multiple VMAs would work, too. > >> > what is the threat model you're assuming for this feature? based on what i >> > have for PaX (arbitrary read/write access exploited for data-only attacks), >> > the above makes no sense to me... >> >> If I use the primitive to try to write a value to the wrong section >> (write to kernel text, for example), IMO it would be nice to OOPS >> instead of succeeding. > > this doesn't tell me what power you're assuming the attacker has. is it > my generic arbitrary read-write ability or something more restricted and > thus less realistic? i.e., how does the attacker get to 'use the primitive' > and (presumably) also control the ptr/data? > > as for your specific example, kernel text isn't 'non-rare-write data' that > you spoke of before, but that aside, what prevents an attacker from computing > his target ptr so that after your accessor rebases it, it'd point back to his > intended target instead? It's a restriction on what targets can be hit. With CR0.WP, you can hit anything that has a VA. With CR3, you can hit only that which is mapped. > will you range-check (find_vma eventually?) each time? > how will you make all this code safe from races from another task? the more > checks you make, the more likely that something sensitive will spill to memory > and be a target itself in order to hijack the sensitive write. There's no code here making the checks at write time. It's just page table / VMA setup. > >> Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like >> function will may be carefully audited by a friendly security expert >> such as yourself. It would be nice to harden the primitive to a >> reasonable extent against minor misuses such as putting it in a >> context where the compiler will emit mov-a-reg-with-WP-set-to-CR0; >> ret. > > i don't understand what's there to audit. if you want to treat a given piece > of data as __read_only then you have no choice but to allow writes to it via > the open/close mechanism and the compiler can tell you just where those > writes are (and even do the instrumentation when you get tired of doing it > by hand). > I mean auditing all uses of pax_open_kernel() or any other function that opens the kernel. That function is, as used in PaX, terrifying. PaX probably gets every user right, but I don't trust driver writers with a function like pax_open_kernel() that's as powerful as PaX's. Suppose you get driver code like this: void foo(int (*func)()) { pax_open_kernel(); *thingy = func(); pax_close_kernel(); } That would be a very, very juicy target for a ROP-like attack. Just get the kernel to call this function with func pointing to something that does a memcpy or similar into executable space. Boom, shellcode execution. If CR3 is used instead, exploiting this is considerably more complicated.
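One way to blunt that kind of target, consistent with the bracketing discussed earlier, is to keep anything like func() out of the open window by letting the primitive perform the single store itself. A sketch, where rare_write_ptr() is a hypothetical helper built on the RFC's begin/end pair and thingy stands in for the slot from the example above:

#include <linux/compiler.h>

static void **thingy;			/* rarely-written slot from the example */

static void rare_write_ptr(void **slot, void *val)
{
	rare_write_begin();		/* open: CR0.WP cleared or CR3 alias loaded */
	WRITE_ONCE(*slot, val);		/* the only store inside the window */
	rare_write_end();		/* close again */
}

static void foo(void *(*func)(void))
{
	void *newval = func();		/* runs with write protection still intact */

	rare_write_ptr(thingy, newval);
}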
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index cc98d5a294ee..2d1d707aa036 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -106,6 +106,7 @@ config X86 select HAVE_ARCH_KMEMCHECK select HAVE_ARCH_MMAP_RND_BITS if MMU select HAVE_ARCH_MMAP_RND_COMPAT_BITS if MMU && COMPAT + select HAVE_ARCH_RARE_WRITE select HAVE_ARCH_SECCOMP_FILTER select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 1cfb36b8c024..2e6bf661bb84 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -91,6 +91,37 @@ extern struct mm_struct *pgd_page_get_mm(struct page *page); #endif /* CONFIG_PARAVIRT */ +/* TODO: Bad hack to deal with preempt macros being missing sometimes. */ +#ifndef preempt_disable +#include <linux/preempt.h> +#endif + +static __always_inline unsigned long __arch_rare_write_begin(void) +{ + unsigned long cr0; + + preempt_disable(); + barrier(); + cr0 = read_cr0() ^ X86_CR0_WP; + BUG_ON(cr0 & X86_CR0_WP); + write_cr0(cr0); + barrier(); + return cr0 ^ X86_CR0_WP; +} + +static __always_inline unsigned long __arch_rare_write_end(void) +{ + unsigned long cr0; + + barrier(); + cr0 = read_cr0() ^ X86_CR0_WP; + BUG_ON(!(cr0 & X86_CR0_WP)); + write_cr0(cr0); + barrier(); + preempt_enable_no_resched(); + return cr0 ^ X86_CR0_WP; +} + /* * The following only work if pte_present() is true. * Undefined behaviour if not..
Based on PaX's x86 pax_{open,close}_kernel() implementation, this allows HAVE_ARCH_RARE_WRITE to work on x86. Work remains to sort out some header file issues where preempt.h is missing, though it can't be included in pgtable.h unconditionally... some other solution will be needed, perhaps an entirely separate header file for rare_write()-related defines... This patch is also missing paravirt support. Signed-off-by: Kees Cook <keescook@chromium.org> --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 31 +++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+)