
[RFC,v2,04/11] x86: Implement __arch_rare_write_begin/unmap()

Message ID 1490811363-93944-5-git-send-email-keescook@chromium.org (mailing list archive)
State New, archived

Commit Message

Kees Cook March 29, 2017, 6:15 p.m. UTC
Based on PaX's x86 pax_{open,close}_kernel() implementation, this
allows HAVE_ARCH_RARE_WRITE to work on x86.

Work remains to sort out some header file issues: preempt.h is needed but
can't be included in pgtable.h unconditionally... some other solution will
be needed, perhaps an entirely separate header file for the
rare_write()-related defines...

This patch is also missing paravirt support.

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/pgtable.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

Comments

Andy Lutomirski March 29, 2017, 10:38 p.m. UTC | #1
On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
> allows HAVE_ARCH_RARE_WRITE to work on x86.
>

> +
> +static __always_inline unsigned long __arch_rare_write_begin(void)
> +{
> +       unsigned long cr0;
> +
> +       preempt_disable();

This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
as would local_irq_disable().  There's no way that just disabling
preemption is enough.

(Also, how does this interact with perf nmis?)

--Andy
Kees Cook March 30, 2017, 1:41 a.m. UTC | #2
On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>
>
>> +
>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>> +{
>> +       unsigned long cr0;
>> +
>> +       preempt_disable();
>
> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
> as would local_irq_disable().  There's no way that just disabling
> preemption is enough.
>
> (Also, how does this interact with perf nmis?)

Do you mean preempt_disable() isn't strong enough here? I'm open to
suggestions. The goal would be to make sure everything between _begin and
_end gets executed without interruption...

-Kees
Andy Lutomirski April 5, 2017, 11:57 p.m. UTC | #3
On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>
>>
>>> +
>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>> +{
>>> +       unsigned long cr0;
>>> +
>>> +       preempt_disable();
>>
>> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>> as would local_irq_disable().  There's no way that just disabling
>> preemption is enough.
>>
>> (Also, how does this interact with perf nmis?)
>
> Do you mean preempt_disable() isn't strong enough here? I'm open to
> suggestions. The goal would be to make sure nothing between _begin and
> _end would get executed without interruption...
>

Sorry for the very slow response.

preempt_disable() isn't strong enough to prevent interrupts, and an
interrupt here would run with WP off, causing unknown havoc.  I tend
to think that the caller should be responsible for turning off
interrupts.

--Andy
Kees Cook April 6, 2017, 12:14 a.m. UTC | #4
On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>
>>>
>>>> +
>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>> +{
>>>> +       unsigned long cr0;
>>>> +
>>>> +       preempt_disable();
>>>
>>> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>> as would local_irq_disable().  There's no way that just disabling
>>> preemption is enough.
>>>
>>> (Also, how does this interact with perf nmis?)
>>
>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>> suggestions. The goal would be to make sure nothing between _begin and
>> _end would get executed without interruption...
>>
>
> Sorry for the very slow response.
>
> preempt_disable() isn't strong enough to prevent interrupts, and an
> interrupt here would run with WP off, causing unknown havoc.  I tend
> to think that the caller should be responsible for turning off
> interrupts.

So, something like:

Top-level functions:

static __always_inline void rare_write_begin(void)
{
    preempt_disable();
    local_irq_disable();
    barrier();
    __arch_rare_write_begin();
    barrier();
}

static __always_inline void rare_write_end(void)
{
    barrier();
    __arch_rare_write_end();
    barrier();
    local_irq_enable();
    preempt_enable_no_resched();
}

x86-specific helpers:

static __always_inline unsigned long __arch_rare_write_begin(void)
{
       unsigned long cr0;

       cr0 = read_cr0() ^ X86_CR0_WP;
       BUG_ON(cr0 & X86_CR0_WP);
       write_cr0(cr0);
       return cr0 ^ X86_CR0_WP;
}

static __always_inline unsigned long __arch_rare_write_end(void)
{
       unsigned long cr0;

       cr0 = read_cr0() ^ X86_CR0_WP;
       BUG_ON(!(cr0 & X86_CR0_WP));
       write_cr0(cr0);
       return cr0 ^ X86_CR0_WP;
}

I can give it a spin...

-Kees
Andy Lutomirski April 6, 2017, 3:59 p.m. UTC | #5
On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
> On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>>
>>>>
>>>>> +
>>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>>> +{
>>>>> +       unsigned long cr0;
>>>>> +
>>>>> +       preempt_disable();
>>>>
>>>> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>>> as would local_irq_disable().  There's no way that just disabling
>>>> preemption is enough.
>>>>
>>>> (Also, how does this interact with perf nmis?)
>>>
>>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>>> suggestions. The goal would be to make sure nothing between _begin and
>>> _end would get executed without interruption...
>>>
>>
>> Sorry for the very slow response.
>>
>> preempt_disable() isn't strong enough to prevent interrupts, and an
>> interrupt here would run with WP off, causing unknown havoc.  I tend
>> to think that the caller should be responsible for turning off
>> interrupts.
>
> So, something like:
>
> Top-level functions:
>
> static __always_inline rare_write_begin(void)
> {
>     preempt_disable();
>     local_irq_disable();
>     barrier();
>     __arch_rare_write_begin();
>     barrier();
> }

Looks good, except you don't need preempt_disable().
local_irq_disable() also disables preemption.  You might need to use
local_irq_save(), though, depending on whether any callers already
have IRQs off.

--Andy
Mathias Krause April 7, 2017, 8:34 a.m. UTC | #6
On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
> On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Apr 5, 2017 at 4:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> On Wed, Mar 29, 2017 at 6:41 PM, Kees Cook <keescook@chromium.org> wrote:
>>>> On Wed, Mar 29, 2017 at 3:38 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>>> On Wed, Mar 29, 2017 at 11:15 AM, Kees Cook <keescook@chromium.org> wrote:
>>>>>> Based on PaX's x86 pax_{open,close}_kernel() implementation, this
>>>>>> allows HAVE_ARCH_RARE_WRITE to work on x86.
>>>>>>
>>>>>> +
>>>>>> +static __always_inline unsigned long __arch_rare_write_begin(void)
>>>>>> +{
>>>>>> +       unsigned long cr0;
>>>>>> +
>>>>>> +       preempt_disable();
>>>>>
>>>>> This looks wrong.  DEBUG_LOCKS_WARN_ON(!irqs_disabled()) would work,
>>>>> as would local_irq_disable().  There's no way that just disabling
>>>>> preemption is enough.
>>>>>
>>>>> (Also, how does this interact with perf nmis?)
>>>>
>>>> Do you mean preempt_disable() isn't strong enough here? I'm open to
>>>> suggestions. The goal would be to make sure nothing between _begin and
>>>> _end would get executed without interruption...
>>>>
>>>
>>> Sorry for the very slow response.
>>>
>>> preempt_disable() isn't strong enough to prevent interrupts, and an
>>> interrupt here would run with WP off, causing unknown havoc.  I tend
>>> to think that the caller should be responsible for turning off
>>> interrupts.
>>
>> So, something like:
>>
>> Top-level functions:
>>
>> static __always_inline rare_write_begin(void)
>> {
>>     preempt_disable();
>>     local_irq_disable();
>>     barrier();
>>     __arch_rare_write_begin();
>>     barrier();
>> }
>
> Looks good, except you don't need preempt_disable().
> local_irq_disable() also disables preemption.  You might need to use
> local_irq_save(), though, depending on whether any callers already
> have IRQs off.

Well, doesn't look good to me. NMIs will still be able to interrupt
this code and will run with CR0.WP = 0.

Shouldn't you instead question yourself why PaX can do it "just" with
preempt_disable() instead?!

Cheers,
Mathias
Peter Zijlstra April 7, 2017, 9:37 a.m. UTC | #7
On Wed, Mar 29, 2017 at 11:15:56AM -0700, Kees Cook wrote:
> +static __always_inline unsigned long __arch_rare_write_end(void)
> +{
> +	unsigned long cr0;
> +
> +	barrier();
> +	cr0 = read_cr0() ^ X86_CR0_WP;
> +	BUG_ON(!(cr0 & X86_CR0_WP));
> +	write_cr0(cr0);
> +	barrier();
> +	preempt_enable_no_resched();

NAK

> +	return cr0 ^ X86_CR0_WP;
> +}
Thomas Gleixner April 7, 2017, 9:46 a.m. UTC | #8
On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
> >> static __always_inline rare_write_begin(void)
> >> {
> >>     preempt_disable();
> >>     local_irq_disable();
> >>     barrier();
> >>     __arch_rare_write_begin();
> >>     barrier();
> >> }
> >
> > Looks good, except you don't need preempt_disable().
> > local_irq_disable() also disables preemption.  You might need to use
> > local_irq_save(), though, depending on whether any callers already
> > have IRQs off.
> 
> Well, doesn't look good to me. NMIs will still be able to interrupt
> this code and will run with CR0.WP = 0.
> 
> Shouldn't you instead question yourself why PaX can do it "just" with
> preempt_disable() instead?!

That's silly. Just because PaX does it, doesn't mean it's correct. To be
honest, playing games with the CR0.WP bit is outright stupid to begin with.

Whether protected by preempt_disable or local_irq_disable, to make that
work it needs CR0 handling in the exception entry/exit at the lowest
level. And that's just a nightmare maintenance-wise, as it's prone to be
broken over time. Aside from that, it's pointless overhead for the normal case.

The proper solution is:

write_rare(ptr, val)
{
	mp = map_shadow_rw(ptr);
	*mp = val;
	unmap_shadow_rw(mp);
}

map_shadow_rw() is essentially the same thing as we do in the highmem case
where the kernel creates a shadow mapping of the user space pages via
kmap_atomic().

It's valid (at least on x86) to have a shadow map with the same page
attributes but write enabled. That does not require any fixups of CR0 and
just works.

Thanks,

	tglx
Mathias Krause April 7, 2017, 10:51 a.m. UTC | #9
On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> On 6 April 2017 at 17:59, Andy Lutomirski <luto@amacapital.net> wrote:
>> > On Wed, Apr 5, 2017 at 5:14 PM, Kees Cook <keescook@chromium.org> wrote:
>> >> static __always_inline rare_write_begin(void)
>> >> {
>> >>     preempt_disable();
>> >>     local_irq_disable();
>> >>     barrier();
>> >>     __arch_rare_write_begin();
>> >>     barrier();
>> >> }
>> >
>> > Looks good, except you don't need preempt_disable().
>> > local_irq_disable() also disables preemption.  You might need to use
>> > local_irq_save(), though, depending on whether any callers already
>> > have IRQs off.
>>
>> Well, doesn't look good to me. NMIs will still be able to interrupt
>> this code and will run with CR0.WP = 0.
>>
>> Shouldn't you instead question yourself why PaX can do it "just" with
>> preempt_disable() instead?!
>
> That's silly. Just because PaX does it, doesn't mean it's correct. To be
> honest, playing games with the CR0.WP bit is outright stupid to begin with.

Why that? It allows fast and CPU local modifications of r/o memory.
OTOH, an approach that needs to fiddle with page table entries
requires global synchronization to keep the individual TLB states in
sync. Hmm.. Not that fast, I'd say.

> Whether protected by preempt_disable or local_irq_disable, to make that
> work it needs CR0 handling in the exception entry/exit at the lowest
> level. And that's just a nightmare maintainence wise as it's prone to be
> broken over time.

It seems to be working fine for more than a decade now in PaX. So it
can't be such a big maintenance nightmare ;)

>                    Aside of that it's pointless overhead for the normal case.

>
> The proper solution is:
>
> write_rare(ptr, val)
> {
>         mp = map_shadow_rw(ptr);
>         *mp = val;
>         unmap_shadow_rw(mp);
> }
>
> map_shadow_rw() is essentially the same thing as we do in the highmem case
> where the kernel creates a shadow mapping of the user space pages via
> kmap_atomic().

The "proper solution" seems to be much slower compared to just
toggling CR0.WP (which is costly in itself, already) because of the
TLB invalidation / synchronisation involved.

> It's valid (at least on x86) to have a shadow map with the same page
> attributes but write enabled. That does not require any fixups of CR0 and
> just works.

"Just works", sure -- but it's not as tightly focused as the PaX
solution which is CPU local, while your proposed solution is globally
visible.


Cheers,
Mathias
Thomas Gleixner April 7, 2017, 1:14 p.m. UTC | #10
On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Whether protected by preempt_disable or local_irq_disable, to make that
> > work it needs CR0 handling in the exception entry/exit at the lowest
> > level. And that's just a nightmare maintainence wise as it's prone to be
> > broken over time.
> 
> It seems to be working fine for more than a decade now in PaX. So it
> can't be such a big maintenance nightmare ;)

I really do not care whether PaX wants to chase and verify that over and
over. I certainly don't want to take the chance to leak CR0.WP ever and I
very much care about extra stuff to check in the entry/exit path.

> The "proper solution" seems to be much slower compared to just
> toggling CR0.WP (which is costly in itself, already) because of the
> TLB invalidation / synchronisation involved.

Why the heck should we care about rare writes being performant?

> > It's valid (at least on x86) to have a shadow map with the same page
> > attributes but write enabled. That does not require any fixups of CR0 and
> > just works.
> 
> "Just works", sure -- but it's not as tightly focused as the PaX
> solution which is CPU local, while your proposed solution is globally
> visible.

Making the world and some more writeable hardly qualifies as tightly
focussed. Making the mapping concept CPU-local is not rocket science
either. The question is whether it's worth the trouble.

Thanks,

	tglx
Mathias Krause April 7, 2017, 1:30 p.m. UTC | #11
On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Whether protected by preempt_disable or local_irq_disable, to make that
>> > work it needs CR0 handling in the exception entry/exit at the lowest
>> > level. And that's just a nightmare maintainence wise as it's prone to be
>> > broken over time.
>>
>> It seems to be working fine for more than a decade now in PaX. So it
>> can't be such a big maintenance nightmare ;)
>
> I really do not care whether PaX wants to chase and verify that over and
> over. I certainly don't want to take the chance to leak CR0.WP ever and I
> very much care about extra stuff to check in the entry/exit path.

Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
somewhere sensible should make those "leaks" visible fast -- and their
exploitation impossible, i.e. fail hard.

>> The "proper solution" seems to be much slower compared to just
>> toggling CR0.WP (which is costly in itself, already) because of the
>> TLB invalidation / synchronisation involved.
>
> Why the heck should we care about rare writes being performant?

As soon as they stop being rare and people start extending the r/o
protection to critical data structures accessed often. Then
performance matters.

>> > It's valid (at least on x86) to have a shadow map with the same page
>> > attributes but write enabled. That does not require any fixups of CR0 and
>> > just works.
>>
>> "Just works", sure -- but it's not as tightly focused as the PaX
>> solution which is CPU local, while your proposed solution is globally
>> visible.
>
> Making the world and some more writeable hardly qualifies as tightly
> focussed. Making the mapping concept CPU local is not rocket science
> either. The question is whethers it's worth the trouble.

No, the question is if the value of the concept is well understood and
if people can see what could be done with such a strong primitive.
Apparently not...


Cheers,
Mathias
Peter Zijlstra April 7, 2017, 2:45 p.m. UTC | #12
On Fri, Apr 07, 2017 at 12:51:15PM +0200, Mathias Krause wrote:
> Why that? It allows fast and CPU local modifications of r/o memory.
> OTOH, an approach that needs to fiddle with page table entries
> requires global synchronization to keep the individual TLB states in
> sync. Hmm.. Not that fast, I'd say.

The fixmaps used for kmap_atomic are per-cpu, no global sync required.
Andy Lutomirski April 7, 2017, 4:14 p.m. UTC | #13
On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Fri, 7 Apr 2017, Mathias Krause wrote:
>>> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> > Whether protected by preempt_disable or local_irq_disable, to make that
>>> > work it needs CR0 handling in the exception entry/exit at the lowest
>>> > level. And that's just a nightmare maintainence wise as it's prone to be
>>> > broken over time.
>>>
>>> It seems to be working fine for more than a decade now in PaX. So it
>>> can't be such a big maintenance nightmare ;)
>>
>> I really do not care whether PaX wants to chase and verify that over and
>> over. I certainly don't want to take the chance to leak CR0.WP ever and I
>> very much care about extra stuff to check in the entry/exit path.
>
> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> somewhere sensible should make those "leaks" visible fast -- and their
> exploitation impossible, i.e. fail hard.

The leaks surely exist and now we'll just add an exploitable BUG.

>>> > It's valid (at least on x86) to have a shadow map with the same page
>>> > attributes but write enabled. That does not require any fixups of CR0 and
>>> > just works.
>>>
>>> "Just works", sure -- but it's not as tightly focused as the PaX
>>> solution which is CPU local, while your proposed solution is globally
>>> visible.
>>
>> Making the world and some more writeable hardly qualifies as tightly
>> focussed. Making the mapping concept CPU local is not rocket science
>> either. The question is whethers it's worth the trouble.
>
> No, the question is if the value of the concept is well understood and
> if people can see what could be done with such a strong primitive.
> Apparently not...

I think we're approaching this all wrong, actually.  The fact that x86
has this CR0.WP thing is arguably a historical accident, and the fact
that PaX uses it doesn't mean that PaX is doing it the best way for
upstream Linux.

Why don't we start at the other end and do a generic non-arch-specific
implementation: set up an mm_struct that contains an RW alias of the
relevant parts of rodata and use use_mm to access it.  (That is,
get_fs() to back up the old fs, set_fs(USER_DS),
use_mm(&rare_write_mm), do the write using copy_to_user, undo
everything.)

Then someone who cares about performance can benchmark the CR0.WP
approach against it and try to argue that it's a good idea.  This
benchmark should wait until I'm done with my PCID work, because PCID
is going to make use_mm() a whole heck of a lot faster.

--Andy
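This proposal can be sketched as kernel-style pseudocode using the 2017-era interfaces named above (use_mm(), get_fs()/set_fs(), both since reworked or removed upstream). It is not compilable against any particular tree; rare_write_mm is a hypothetical mm_struct set up at boot whose page tables alias the read-only data read-write at the same addresses.

```c
/* Sketch only: assumes a boot-time rare_write_mm aliasing rodata RW. */
static void rare_write(void *ptr, const void *val, size_t len)
{
	mm_segment_t old_fs = get_fs();

	preempt_disable();           /* keep the alias mm CPU-local */
	set_fs(USER_DS);
	use_mm(&rare_write_mm);      /* switch to the RW-alias address space */

	if (copy_to_user((void __user *)ptr, val, len))
		BUG();               /* the alias must always be mapped */

	unuse_mm(&rare_write_mm);
	set_fs(old_fs);
	preempt_enable();
}
```

Because the writable view only exists in rare_write_mm, an interrupt arriving mid-write in the normal kernel address space still sees rodata as read-only, which is what the CR0.WP approach cannot guarantee without entry/exit fixups.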
Mark Rutland April 7, 2017, 4:22 p.m. UTC | #14
On Fri, Apr 07, 2017 at 09:14:29AM -0700, Andy Lutomirski wrote:
> I think we're approaching this all wrong, actually.  The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.
> 
> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it.  (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)

FWIW, I completely agree with this approach.

That's largely the approach arm64 would have to take regardless (as per
Hoeun's patches), and having a consistent common implementation would be
desireable.

There are a couple of other complications to handle (e.g. a perf
interrupt coming in and trying to read from the mapping), but I think we
can handle that generically.

Thanks,
Mark.
Thomas Gleixner April 7, 2017, 7:25 p.m. UTC | #15
On Fri, 7 Apr 2017, Mathias Krause wrote:
> On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 7 Apr 2017, Mathias Krause wrote:
> >> On 7 April 2017 at 11:46, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > Whether protected by preempt_disable or local_irq_disable, to make that
> >> > work it needs CR0 handling in the exception entry/exit at the lowest
> >> > level. And that's just a nightmare maintainence wise as it's prone to be
> >> > broken over time.
> >>
> >> It seems to be working fine for more than a decade now in PaX. So it
> >> can't be such a big maintenance nightmare ;)
> >
> > I really do not care whether PaX wants to chase and verify that over and
> > over. I certainly don't want to take the chance to leak CR0.WP ever and I
> > very much care about extra stuff to check in the entry/exit path.
> 
> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> somewhere sensible should make those "leaks" visible fast -- and their
> exploitation impossible, i.e. fail hard.

Sure, you trade leaking WP with a potentially exploitable BUG().

> >> The "proper solution" seems to be much slower compared to just
> >> toggling CR0.WP (which is costly in itself, already) because of the
> >> TLB invalidation / synchronisation involved.
> >
> > Why the heck should we care about rare writes being performant?
> 
> As soon as they stop being rare and people start extending the r/o
> protection to critical data structures accessed often. Then
> performance matters.

Emphasis on "Then". I'm not seeing it, because no matter what you do it's
going to be slow. Aside of that increasing the usage will also increase the
chance to leak stuff. In that case I rather leak a single page mapping
temporarily than taking the chance to leak WP.

> >> > It's valid (at least on x86) to have a shadow map with the same page
> >> > attributes but write enabled. That does not require any fixups of CR0 and
> >> > just works.
> >>
> >> "Just works", sure -- but it's not as tightly focused as the PaX
> >> solution which is CPU local, while your proposed solution is globally
> >> visible.
> >
> > Making the world and some more writeable hardly qualifies as tightly
> > focussed. Making the mapping concept CPU local is not rocket science
> > either. The question is whethers it's worth the trouble.
> 
> No, the question is if the value of the concept is well understood and
> if people can see what could be done with such a strong primitive.
> Apparently not...

Oh, well. We can stop that discussion right here, if all you can provide
is a killer phrase.

I'm well aware what can be done with a strong primitive and I certainly
understand the concept, but I'm not naive enough to believe that lifting
one of the strong protections the kernel has by globally disabling WP is
anything which should be even considered. That bit is a horrible
misconception and should be fused to 1.

Aside of that, if you had taken the time to figure out how kmap_atomic
stuff works then you would have noticed that it does not require cross CPU
pagetable syncs and that the mapping place can be randomized to a certain
degree. So this has neither global impact, nor does it become immediately
globally visible.

Thanks,

	tglx
pageexec@freemail.hu April 7, 2017, 7:52 p.m. UTC | #16
On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:

> On Fri, 7 Apr 2017, Mathias Krause wrote:
> > Well, doesn't look good to me. NMIs will still be able to interrupt
> > this code and will run with CR0.WP = 0.
> > 
> > Shouldn't you instead question yourself why PaX can do it "just" with
> > preempt_disable() instead?!
> 
> That's silly. Just because PaX does it, doesn't mean it's correct.

is that FUD or do you have actionable information to share?

> To be honest, playing games with the CR0.WP bit is outright stupid to begin with.

why is that? cr0.wp has existed since the i486 and its behaviour fits my
purposes quite well; it's the best security/performance trade-off i know of.

> Whether protected by preempt_disable or local_irq_disable, to make that
> work it needs CR0 handling in the exception entry/exit at the lowest
> level.

correct.

> And that's just a nightmare maintainence wise as it's prone to be
> broken over time.

i've got 14 years of experience of maintaining it and i never saw it break.

> Aside of that it's pointless overhead for the normal case.

unless it's optional code as the whole feature already is.

> The proper solution is:
> 
> write_rare(ptr, val)
> {
>  mp = map_shadow_rw(ptr);
>  *mp = val;
>  unmap_shadow_rw(mp);
> }

this is not *the* proper solution, but only a naive one that suffers from
the exact same need that the cr0.wp approach does and has worse performance
impact. not exactly a win...

[continuing from your next mail in order to save round-trip time]

> I really do not care whether PaX wants to chase and verify that over and
> over.

verifying it is no different than verifying, say, swapgs use.

> I certainly don't want to take the chance to leak CR0.WP ever

why and where would cr0.wp leak?

> and I very much care about extra stuff to check in the entry/exit path.

your 'proper' solution needs (a lot more) extra stuff too.

> Why the heck should we care about rare writes being performant?

because you've been misled by the NIH crowd here that the PaX feature they
tried to (badly) extract from has anything to do with frequency of writes.
it does not. what it does do is provide an environment for variables that
are conceptually writable but for security reasons should be read-only most
of the time by most of the code (ditto for the grossly misunderstood and thus
misnamed ro-after-init). now imagine locking down the page table hierarchy
with it...

> Making the world and some more writeable hardly qualifies as tightly
> focused.

you forgot to add 'for a window of a few insns' and that the map/unmap
approach does the same under an attacker controlled ptr.

> Making the mapping concept CPU local is not rocket science
> either. The question is whether it's worth the trouble.

it is for people who care about the integrity of the kernel, and all this
read-onlyness stuff implies that some do.
pageexec@freemail.hu April 7, 2017, 7:58 p.m. UTC | #17
On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:

> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> On Fri, 7 Apr 2017, Mathias Krause wrote:
> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> > somewhere sensible should make those "leaks" visible fast -- and their
> > exploitation impossible, i.e. fail hard.
> 
> The leaks surely exist and now we'll just add an exploitable BUG.

can you please share those leaks that 'surely exist' and CC oss-security
while at it?

> I think we're approaching this all wrong, actually.  The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.
> 
> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it.  (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)
> 
> Then someone who cares about performance can benchmark the CR0.WP
> approach against it and try to argue that it's a good idea.  This
> benchmark should wait until I'm done with my PCID work, because PCID
> is going to make use_mm() a whole heck of a lot faster.

in my measurements, switching PCID hovers around 230 cycles for snb-ivb
and 200-220 for hsw-skl, whereas cr0 writes are around 230-240 cycles. there's
of course a whole lot more impact for switching address spaces, so it'll never
be fast enough to beat cr0.wp.
Thomas Gleixner April 7, 2017, 8:44 p.m. UTC | #18
On Fri, 7 Apr 2017, Andy Lutomirski wrote:
> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> I really do not care whether PaX wants to chase and verify that over and
> >> over. I certainly don't want to take the chance to leak CR0.WP ever and I
> >> very much care about extra stuff to check in the entry/exit path.
> >
> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
> > somewhere sensible should make those "leaks" visible fast -- and their
> > exploitation impossible, i.e. fail hard.
> 
> The leaks surely exist and now we'll just add an exploitable BUG.

:)

> I think we're approaching this all wrong, actually.  The fact that x86
> has this CR0.WP thing is arguably a historical accident, and the fact
> that PaX uses it doesn't mean that PaX is doing it the best way for
> upstream Linux.

As I said before. It should be fused to 1.

> Why don't we start at the other end and do a generic non-arch-specific
> implementation: set up an mm_struct that contains an RW alias of the
> relevant parts of rodata and use use_mm to access it.  (That is,
> get_fs() to back up the old fs, set_fs(USER_DS),
> use_mm(&rare_write_mm), do the write using copy_to_user, undo
> everything.)

That works too, though I'm not sure that it's more efficient than
temporarily creating and undoing a single cpu local alias mapping similar
to the kmap_atomic mechanism in the highmem case.

Aside of that the alias mapping requires a full PGDIR entry unless you want
to get into the mess of keeping yet another identity mapping up to date.

Thanks,

	tglx
Kees Cook April 7, 2017, 9:20 p.m. UTC | #19
On Fri, Apr 7, 2017 at 1:44 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 7 Apr 2017, Andy Lutomirski wrote:
>> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
>> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> I really do not care whether PaX wants to chase and verify that over and
>> >> over. I certainly don't want to take the chance to leak CR0.WP ever and I
>> >> very much care about extra stuff to check in the entry/exit path.
>> >
>> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> > somewhere sensible should make those "leaks" visible fast -- and their
>> > exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> :)
>
>> I think we're approaching this all wrong, actually.  The fact that x86
>> has this CR0.WP thing is arguably a historical accident, and the fact
>> that PaX uses it doesn't mean that PaX is doing it the best way for
>> upstream Linux.
>
> As I said before. It should be fused to 1.
>
>> Why don't we start at the other end and do a generic non-arch-specific
>> implementation: set up an mm_struct that contains an RW alias of the
>> relevant parts of rodata and use use_mm to access it.  (That is,
>> get_fs() to back up the old fs, set_fs(USER_DS),
>> use_mm(&rare_write_mm), do the write using copy_to_user, undo
>> everything.)
>
> That works too, though I'm not sure that it's more efficient than
> temporarily creating and undoing a single cpu local alias mapping similar
> to the kmap_atomic mechanism in the highmem case.
>
> Aside of that the alias mapping requires a full PGDIR entry unless you want
> to get into the mess of keeping yet another identity mapping up to date.

I probably chose the wrong name for this feature (write rarely).
That's _usually_ true, but "sensitive_write()" was getting rather
long. The things that we need to protect with this are certainly stuff
that doesn't get much writing, but some things are just plain
sensitive (like page tables) and we should still try to be as fast as
possible with them.

I'll try to include a third example in the next posting of the series
that makes this more obvious.

I'm all for a general case for the infrastructure (as Andy and Mark
have mentioned), but I don't want to get into the situation where
people start refusing to use it because it's "too slow" (for example,
see refcount_t vs net-dev right now).

-Kees
Daniel Micay April 8, 2017, 4:12 a.m. UTC | #20
> I probably chose the wrong name for this feature (write rarely).
> That's _usually_ true, but "sensitive_write()" was getting rather
> long. The things that we need to protect with this are certainly stuff
> that doesn't get much writing, but some things are just plain
> sensitive (like page tables) and we should still try to be as fast as
> possible with them.

Not too late to rename it. Scoped write? I think it makes change to
use a different API than PaX for portability too, but not a different
x86 implementation. It's quite important to limit the writes to the
calling thread and it needs to perform well to be introduced widely.

> I'm all for a general case for the infrastructure (as Andy and Mark
> have mentioned), but I don't want to get into the situation where
> people start refusing to use it because it's "too slow" (for example,
> see refcount_t vs net-dev right now).

Meanwhile, the PaX implementation has improved to avoid the issues
that were brought up while only introducing a single always-predicted
(due to code placement) branch on the overflow flag. That seems to
have gone unnoticed upstream, where there's now a much slower
implementation that's not more secure, and is blocked from
introduction in areas where it's most needed based on the performance.
Not to mention that it's opt-in... which is never going to work.
Daniel Micay April 8, 2017, 4:13 a.m. UTC | #21
> Not too late to rename it. Scoped write? I think it makes change to

s/change/sense/
Daniel Micay April 8, 2017, 4:21 a.m. UTC | #22
>> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> somewhere sensible should make those "leaks" visible fast -- and their
>> exploitation impossible, i.e. fail hard.
>
> The leaks surely exist and now we'll just add an exploitable BUG.

That didn't seem to matter for landing a rewrite of KSTACKOVERFLOW
with a bunch of *known* DoS bugs dealt with in grsecurity and those
were known issues that were unfixed for no apparent reason other than
keeping egos intact. It looks like there are still some left...

In that case, there also wasn't a security/performance advantage.
Andy Lutomirski April 8, 2017, 4:58 a.m. UTC | #23
On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
>
>> On Fri, Apr 7, 2017 at 6:30 AM, Mathias Krause <minipli@googlemail.com> wrote:
>> > On 7 April 2017 at 15:14, Thomas Gleixner <tglx@linutronix.de> wrote:
>> >> On Fri, 7 Apr 2017, Mathias Krause wrote:
>> > Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>> > somewhere sensible should make those "leaks" visible fast -- and their
>> > exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> can you please share those leaks that 'surely exist' and CC oss-security
> while at it?

I meant in the patchset here, not in grsecurity.  grsecurity (on very,
very brief inspection) seems to read cr0 and fix it up in
pax_enter_kernel.

>>
>> Then someone who cares about performance can benchmark the CR0.WP
>> approach against it and try to argue that it's a good idea.  This
>> benchmark should wait until I'm done with my PCID work, because PCID
>> is going to make use_mm() a whole heck of a lot faster.
>
> > in my measurements switching PCID hovers around 230 cycles for snb-ivb
> and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
> of course a whole lot more impact for switching address spaces so it'll never
> be fast enough to beat cr0.wp.
>

If I'm reading this right, you're saying that a non-flushing CR3 write
is about the same cost as a CR0.WP write.  If so, then why should CR0
be preferred over the (arch-neutral) CR3 approach?  And why would
switching address spaces obviously be much slower?  There'll be a very
small number of TLB fills needed for the actual protected access.

--Andy
Andy Lutomirski April 8, 2017, 5:07 a.m. UTC | #24
On Fri, Apr 7, 2017 at 9:21 PM, Daniel Micay <danielmicay@gmail.com> wrote:
>>> Fair enough. However, placing a BUG_ON(!(read_cr0() & X86_CR0_WP))
>>> somewhere sensible should make those "leaks" visible fast -- and their
>>> exploitation impossible, i.e. fail hard.
>>
>> The leaks surely exist and now we'll just add an exploitable BUG.
>
> That didn't seem to matter for landing a rewrite of KSTACKOVERFLOW
> with a bunch of *known* DoS bugs dealt with in grsecurity and those
> were known issues that were unfixed for no apparent reason other than
> keeping egos intact. It looks like there are still some left...
>
> In that case, there also wasn't a security/performance advantage.

This is wildly off topic, but I think it's worth answering anyway
because there's an important point here:

grsecurity and PaX are great projects.  They have a lot of good ideas,
and they're put together quite nicely.  The upstream kernel should
*not* do things differently from they way they are in grsecurity/PaX
just because it wants to be different.  Conversely, the upstream
kernel should not do things the same way as PaX just to be more like
PaX.

Keep in mind that the upstream kernel and grsecurity/PaX operate under
different constraints.  The upstream kernel tries to keep itself clean
and to make tree-wide updates rather that keeping compatibility stuff
around.  PaX and grsecurity presumably want to retain some degree of
simplicity when porting to newer upstream versions.

In the context of virtually mapped stacks / KSTACKOVERFLOW, this
naturally leads to different solutions.  The upstream kernel had a
bunch of buggy drivers that played badly with virtually mapped stacks.
grsecurity sensibly went for the approach where the buggy drivers kept
working.  The upstream kernel went for the approach of fixing the
drivers rather than keeping a compatibility workaround.  Different
constraints, different solutions.

The point being that, if someone sends a patch to the x86 entry code
that's justified by "be like PaX" or by "be different than PaX",
that's not okay.  It needs a real justification that stands on its
own.

In the case of rare writes or pax_open_kernel [1] or whatever we want
to call it, CR3 would work without arch-specific code, and CR0 would
not.  That's an argument for CR3 that would need to be countered by
something.  (Sure, avoiding leaks either way might need arch changes.
OTOH, a *randomized* CR3-based approach might not have as much of a
leak issue to begin with.)

[1] Contrary to popular belief, I don't sit around reading grsecurity
code or config options, so I really don't know what this thing is
called.
Daniel Micay April 8, 2017, 7:33 a.m. UTC | #25
> grsecurity and PaX are great projects.  They have a lot of good ideas,
> and they're put together quite nicely.  The upstream kernel should
> *not* do things differently from they way they are in grsecurity/PaX
> just because it wants to be different.  Conversely, the upstream
> kernel should not do things the same way as PaX just to be more like
> PaX.
>
> Keep in mind that the upstream kernel and grsecurity/PaX operate under
> different constraints.  The upstream kernel tries to keep itself clean
> and to make tree-wide updates rather that keeping compatibility stuff
> around.  PaX and grsecurity presumably want to retain some degree of
> simplicity when porting to newer upstream versions.
>
> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
> naturally leads to different solutions.  The upstream kernel had a
> bunch of buggy drivers that played badly with virtually mapped stacks.
> grsecurity sensibly went for the approach where the buggy drivers kept
> working.  The upstream kernel went for the approach of fixing the
> drivers rather than keeping a compatibility workaround.  Different
> constraints, different solutions.

Sure, but nothing has landed yet and the PaX
implementation is a known quantity with a lot of testing. The
submitted code is aimed at rare writes to globals, but this feature is
more than that and design decisions shouldn't be based on just the
short term. Kees is sensibly landing features by submitting them a
little bit at a time, but a negative side effect of that is too much
focus on just the initial proposed usage.

As a downstream that's going to be making heavy use of mainline
security features as a base due to PaX and grsecurity becoming
commercial only private patches (*because* of upstream and the
companies involved), I hope things go in the right direction i.e. not
weaker/slower implementations than PaX. For example, while USERCOPY
isn't entirely landed upstream (missing slab whitelisting and user
offset/size), it's the base for the full feature and is going to get
more testing. On the other hand, refcount_t and the slab/page
sanitization work is 100% useless if you're willing to incorporate the
changes to do it without needless performance loss and complexity. PAN
emulation on 64-bit ARM was fresh ground while on ARMv7 a weaker
implementation was used for no better reason than clashing egos. The
upstream virtual mapped stacks will probably be perfectly good, but I
just find it to be such a waste of time and effort to reinvent it and
retread the same ground in terms of finding the bugs.

I actually care a lot more about 64-bit ARM support than I do x86, but
using a portable API for pax_open_kernel (for the simple uses at
least) is separate from choosing the underlying implementation. There
might not be a great way to do it on the architectures I care about
but that doesn't need to hinder x86. It's really not that much code...
A weaker/slower implementation for x86 also encourages the same
elsewhere.

> In the case of rare writes or pax_open_kernel [1] or whatever we want
> to call it, CR3 would work without arch-specific code, and CR0 would
> not.  That's an argument for CR3 that would need to be countered by
> something.  (Sure, avoiding leaks either way might need arch changes.
> OTOH, a *randomized* CR3-based approach might not have as much of a
> leak issue to begin with.)

By randomized do you mean just ASLR? Even if the window of opportunity
to exploit it is small, it's really not the same and this has a lot
more use than just rare writes to small global variables.

I wouldn't consider stack ASLR to be a replacement for making them
readable/writable only by the current thread either (required for
race-free return CFI on x86, at least when actual returns are used).

> [1] Contrary to popular belief, I don't sit around reading grsecurity
> code or config options, so I really don't know what this thing is
> called.

Lots of the features aren't actually named. Maybe this could be
considered part of KERNEXEC but I don't think it is.
Andy Lutomirski April 8, 2017, 3:20 p.m. UTC | #26
On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> The
> submitted code is aimed at rare writes to globals, but this feature is
> more than that and design decisions shouldn't be based on just the
> short term.

Then, if you disagree with a proposed design, *explain why* in a
standalone manner.  Say what future uses a different design would
have.

> I actually care a lot more about 64-bit ARM support than I do x86, but
> using a portable API for pax_open_kernel (for the simple uses at
> least) is separate from choosing the underlying implementation. There
> might not be a great way to do it on the architectures I care about
> but that doesn't need to hinder x86. It's really not that much code...
> A weaker/slower implementation for x86 also encourages the same
> elsewhere.

No one has explained how CR0.WP is weaker or slower than my proposal.
Here's what I'm proposing:

At boot, choose a random address A.  Create an mm_struct that has a
single VMA starting at A that represents the kernel's rarely-written
section.  Compute O = (A - VA of rarely-written section).  To do a
rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

This should work on any arch that has an MMU that allows this type of
aliasing and that doesn't have PA-based protections on the
rarely-written section.

It'll be considerably slower than CR0.WP on a current x86 kernel, but,
with PCID landed, it shouldn't be much slower.  It has the added
benefit that writes to non-rare-write data using the rare-write
primitive will fail.

--Andy
Ingo Molnar April 9, 2017, 10:53 a.m. UTC | #27
* Andy Lutomirski <luto@kernel.org> wrote:

> On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> > The
> > submitted code is aimed at rare writes to globals, but this feature is
> > more than that and design decisions shouldn't be based on just the
> > short term.
> 
> Then, if you disagree with a proposed design, *explain why* in a
> standalone manner.  Say what future uses a different design would
> have.
> 
> > I actually care a lot more about 64-bit ARM support than I do x86, but
> > using a portable API for pax_open_kernel (for the simple uses at
> > least) is separate from choosing the underlying implementation. There
> > might not be a great way to do it on the architectures I care about
> > but that doesn't need to hinder x86. It's really not that much code...
> > A weaker/slower implementation for x86 also encourages the same
> > elsewhere.
> 
> No one has explained how CR0.WP is weaker or slower than my proposal.
> Here's what I'm proposing:
> 
> At boot, choose a random address A.  Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section.  Compute O = (A - VA of rarely-written section).  To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

BTW., note that this is basically a pagetable based protection key variant.

> It'll be considerably slower than CR0.WP on a current x86 kernel, but, with PCID 
> landed, it shouldn't be much slower.  It has the added benefit that writes to 
> non-rare-write data using the rare-write primitive will fail.

... which is a security advantage of the use_mm() based design you suggest.

Thanks,

	Ingo
pageexec@freemail.hu April 9, 2017, 12:47 p.m. UTC | #28
On 7 Apr 2017 at 21:58, Andy Lutomirski wrote:

> On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
> > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
> >> Then someone who cares about performance can benchmark the CR0.WP
> >> approach against it and try to argue that it's a good idea.  This
> >> benchmark should wait until I'm done with my PCID work, because PCID
> >> is going to make use_mm() a whole heck of a lot faster.
> >
> > in my measurements switching PCID hovers around 230 cycles for snb-ivb
> > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
> > of course a whole lot more impact for switching address spaces so it'll never
> > be fast enough to beat cr0.wp.
> >
> 
> If I'm reading this right, you're saying that a non-flushing CR3 write
> is about the same cost as a CR0.WP write.  If so, then why should CR0
> be preferred over the (arch-neutral) CR3 approach?

cr3 (page table switching) isn't arch neutral at all ;). you probably meant
the higher level primitives except they're not enough to implement the scheme
as discussed before since the enter/exit paths are very much arch dependent.

on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
writes and nothing else, use_mm suffers not only from the cr3 writes but
also locking/atomic ops and cr4 writes on its path and the inevitable TLB
entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
a fast path in the microcode and reduce its overhead by an order of magnitude.

>  And why would switching address spaces obviously be much slower? 
> There'll be a very small number of TLB fills needed for the actual
> protected access. 

you'll be duplicating TLB entries in the alternative PCID for both code
and data, where they will accumulate (=take room away from the normal PCID
and expose unwanted memory for access) unless you also flush them when
switching back (which then will cost even more cycles). also i'm not sure
that processors implement all the 12 PCID bits so depending on how many PCIDs
you plan to use, you could be causing even more unnecessary TLB replacements.
pageexec@freemail.hu April 9, 2017, 8:24 p.m. UTC | #29
On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:
> grsecurity and PaX are great projects.  They have a lot of good ideas,
> and they're put together quite nicely.  The upstream kernel should
> *not* do things differently from they way they are in grsecurity/PaX
> just because it wants to be different.  Conversely, the upstream
> kernel should not do things the same way as PaX just to be more like
> PaX.

so far, so good.

> Keep in mind that the upstream kernel and grsecurity/PaX operate under
> different constraints.  The upstream kernel tries to keep itself clean

so do we.

> and to make tree-wide updates rather that keeping compatibility stuff
> around.

so do we (e.g., fptr fixes for RAP, non-refcount atomic users, etc).

>  PaX and grsecurity presumably want to retain some degree of
> simplicity when porting to newer upstream versions.

s/simplicity/minimality/ as the code itself can be complex but that'll be
of the minimal complexity we can come up with.

> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
> naturally leads to different solutions.  The upstream kernel had a
> bunch of buggy drivers that played badly with virtually mapped stacks.
> grsecurity sensibly went for the approach where the buggy drivers kept
> working.  The upstream kernel went for the approach of fixing the
> drivers rather than keeping a compatibility workaround.  Different
> constraints, different solutions.

except that's not what happened at all. spender's first version did just
a vmalloc for the kstack like the totally NIH'd version upstream does
now. while we always anticipated buggy dma users and thus had code that
would detect them so that we could fix them, we quickly figured that the
upstream kernel wasn't quite up to snuff as we had assumed and faced with
the amount of buggy code, we went for the current vmap approach which
kept users' systems working instead of breaking them.

you're trying to imply that upstream fixed the drivers but as the facts
show, that's not true. you simply unleashed your code on the world and
hoped(?) that enough suckers would try it out during the -rc window. as
we all know several releases and almost a year later, that was a losing
bet as you still keep fixing those drivers (and something tells me that
we haven't seen the end of it). this is simply irresponsible engineering
for no technical reason.

> In the case of rare writes or pax_open_kernel [1] or whatever we want
> to call it, CR3 would work without arch-specific code, and CR0 would
> not.  That's an argument for CR3 that would need to be countered by
> something.  (Sure, avoiding leaks either way might need arch changes.
> OTOH, a *randomized* CR3-based approach might not have as much of a
> leak issue to begin with.)

i have yet to see anyone explain what they mean by 'leak' here but if it
is what i think it is then the arch specific entry/exit changes are not
optional but mandatory. see below for randomization.

[merging in your other mail as it's the same topic]

> No one has explained how CR0.WP is weaker or slower than my proposal.

you misunderstood, Daniel was talking about your use_mm approach.

> Here's what I'm proposing:
> 
> At boot, choose a random address A.

what is the threat that a random address defends against?

>  Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section.  Compute O = (A - VA of rarely-written section).  To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().

the problem is that the amount of __read_only data extends beyond vmlinux,
i.e., this approach won't scale. another problem is that it can't be used
inside use_mm and switch_mm themselves (no read-only task structs or percpu
pgd for you ;) and probably several other contexts.

last but not least, use_mm says this about itself:

    (Note: this routine is intended to be called only
    from a kernel thread context)

so using it will need some engineering (or the comment be fixed).

> This should work on any arch that has an MMU that allows this type of
> aliasing and that doesn't have PA-based protections on the rarely-written
> section.

you didn't address the arch-specific changes needed in the enter/exit paths.

> It'll be considerably slower than CR0.WP on a current x86 kernel, but,
> with PCID landed, it shouldn't be much slower.

based on my experience with UDEREF on amd64, unfortunately PCID isn't all
it's cracked up to be (IIRC, it maybe halved the UDEREF overhead instead of
basically eliminating it as i had anticipated, and that was on snb, ivb and
later do even worse).

> It has the added benefit that writes to non-rare-write data using the
> rare-write primitive will fail.

what is the threat model you're assuming for this feature? based on what i
have for PaX (arbitrary read/write access exploited for data-only attacks),
the above makes no sense to me...
Andy Lutomirski April 10, 2017, 12:10 a.m. UTC | #30
On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 21:58, Andy Lutomirski wrote:
>
>> On Fri, Apr 7, 2017 at 12:58 PM, PaX Team <pageexec@freemail.hu> wrote:
>> > On 7 Apr 2017 at 9:14, Andy Lutomirski wrote:
>> >> Then someone who cares about performance can benchmark the CR0.WP
>> >> approach against it and try to argue that it's a good idea.  This
>> >> benchmark should wait until I'm done with my PCID work, because PCID
>> >> is going to make use_mm() a whole heck of a lot faster.
>> >
> >> > in my measurements switching PCID hovers around 230 cycles for snb-ivb
>> > and 200-220 for hsw-skl whereas cr0 writes are around 230-240 cycles. there's
>> > of course a whole lot more impact for switching address spaces so it'll never
>> > be fast enough to beat cr0.wp.
>> >
>>
>> If I'm reading this right, you're saying that a non-flushing CR3 write
>> is about the same cost as a CR0.WP write.  If so, then why should CR0
>> be preferred over the (arch-neutral) CR3 approach?
>
> cr3 (page table switching) isn't arch neutral at all ;). you probably meant
> the higher level primitives except they're not enough to implement the scheme
> as discussed before since the enter/exit paths are very much arch dependent.

Yes.

>
> on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
> writes and nothing else, use_mm suffers not only from the cr3 writes but
> also locking/atomic ops and cr4 writes on its path and the inevitable TLB
> entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
> a fast path in the microcode and reduce its overhead by an order of magnitude.
>

If the CR4 writes happen for this use case, that's a bug.

>>  And why would switching address spaces obviously be much slower?
>> There'll be a very small number of TLB fills needed for the actual
>> protected access.
>
> you'll be duplicating TLB entries in the alternative PCID for both code
> and data, where they will accumulate (=take room away from the normal PCID
> and expose unwanted memory for access) unless you also flush them when
> switching back (which then will cost even more cycles). also i'm not sure
> that processors implement all the 12 PCID bits so depending on how many PCIDs
> you plan to use, you could be causing even more unnecessary TLB replacements.
>

Unless the CPU is rather dumber than I expect, the only duplicated
entries should be for the writable aliases of pages that are written.
The rest of the pages are global and should be shared for all PCIDs.

--Andy
Andy Lutomirski April 10, 2017, 12:31 a.m. UTC | #31
On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>
>> In the context of virtually mapped stacks / KSTACKOVERFLOW, this
>> naturally leads to different solutions.  The upstream kernel had a
>> bunch of buggy drivers that played badly with virtually mapped stacks.
>> grsecurity sensibly went for the approach where the buggy drivers kept
>> working.  The upstream kernel went for the approach of fixing the
>> drivers rather than keeping a compatibility workaround.  Different
>> constraints, different solutions.
>
> except that's not what happened at all. spender's first version did just
> a vmalloc for the kstack like the totally NIH'd version upstream does
> now. while we always anticipated buggy dma users and thus had code that
> would detect them so that we could fix them, we quickly figured that the
> upstream kernel wasn't quite up to snuff as we had assumed and faced with
> the amount of buggy code, we went for the current vmap approach which
> kept users' systems working instead of breaking them.
>
> you're trying to imply that upstream fixed the drivers but as the facts
> show, that's not true. you simply unleashed your code on the world and
> hoped(?) that enough suckers would try it out during the -rc window. as
> we all know several releases and almost a year later, that was a losing
> bet as you still keep fixing those drivers (and something tells me that
> we haven't seen the end of it). this is simply irresponsible engineering
> for no technical reason.

I consider breaking buggy drivers (in a way that they either generally
work okay or that they break with a nice OOPS depending on config) to
be better than having a special case in what's supposed to be a fast
path to keep them working.  I did consider forcing the relevant debug
options on for a while just to help shake these bugs out of the woodwork
faster.

>
>> In the case of rare writes or pax_open_kernel [1] or whatever we want
>> to call it, CR3 would work without arch-specific code, and CR0 would
>> not.  That's an argument for CR3 that would need to be countered by
>> something.  (Sure, avoiding leaks either way might need arch changes.
>> OTOH, a *randomized* CR3-based approach might not have as much of a
>> leak issue to begin with.)
>
> i have yet to see anyone explain what they mean by 'leak' here but if it
> is what i think it is then the arch specific entry/exit changes are not
> optional but mandatory. see below for randomization.

By "leak" I mean that a bug or exploit causes unintended code to run
with CR0.WP or a special CR3 or a special PTE or whatever loaded.  PaX
hooks the entry code to avoid leaks.

>> At boot, choose a random address A.
>
> what is the threat that a random address defends against?

Makes it harder to exploit a case where the CR3 setting leaks.

>
>>  Create an mm_struct that has a
>> single VMA starting at A that represents the kernel's rarely-written
>> section.  Compute O = (A - VA of rarely-written section).  To do a
>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>
> the problem is that the amount of __read_only data extends beyond vmlinux,
> i.e., this approach won't scale. another problem is that it can't be used
> inside use_mm and switch_mm themselves (no read-only task structs or percpu
> pgd for you ;) and probably several other contexts.

Can you clarify these uses that extend beyond vmlinux?  I haven't
looked at the grsecurity patch extensively.  Are you talking about the
BPF JIT stuff?  If so, I think that should possibly be handled a bit
differently, since I think the normal
write-to-rare-write-vmlinux-sections primitive should preferably *not*
be usable to write to executable pages.  Using a real mm_struct for
this could help.

>
> last but not least, use_mm says this about itself:
>
>     (Note: this routine is intended to be called only
>     from a kernel thread context)
>
> so using it will need some engineering (or the comment be fixed).

Indeed.

>> It has the added benefit that writes to non-rare-write data using the
>> rare-write primitive will fail.
>
> what is the threat model you're assuming for this feature? based on what i
> have for PaX (arbitrary read/write access exploited for data-only attacks),
> the above makes no sense to me...
>

If I use the primitive to try to write a value to the wrong section
(write to kernel text, for example), IMO it would be nice to OOPS
instead of succeeding.

Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
function may be carefully audited by a friendly security expert
such as yourself.  It would be nice to harden the primitive to a
reasonable extent against minor misuses such as putting it in a
context where the compiler will emit mov-a-reg-with-WP-set-to-CR0;
ret.
Thomas Gleixner April 10, 2017, 8:26 a.m. UTC | #32
On Fri, 7 Apr 2017, PaX Team wrote:
> On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:
> 
> > On Fri, 7 Apr 2017, Mathias Krause wrote:
> > > Well, doesn't look good to me. NMIs will still be able to interrupt
> > > this code and will run with CR0.WP = 0.
> > > 
> > > Shouldn't you instead question yourself why PaX can do it "just" with
> > > preempt_disable() instead?!
> > 
> > That's silly. Just because PaX does it, doesn't mean it's correct.
> 
> is that FUD or do you have actionable information to share?

That has absolutely nothing to do with FUD. I'm merely not accepting
argumentations which say: PaX can do it "just"....

That has exactly zero technical merit and it's not asked too much to
provide precise technical arguments why one implementation is better than
some other.

> > To be honest, playing games with the CR0.WP bit is outright stupid to begin with.
> 
> why is that? cr0.wp exists since the i486 and its behaviour fits my
> purposes quite well, it's the best security/performance i know of.

Works for me has never been a good engineering principle.

> > Whether protected by preempt_disable or local_irq_disable, to make that
> > work it needs CR0 handling in the exception entry/exit at the lowest
> > level.
> 
> correct.
> 
> > And that's just a nightmare maintainence wise as it's prone to be
> > broken over time.
> 
> i've got 14 years of experience of maintaining it and i never saw it break.

It's a difference whether you maintain a special purpose patch set out of
tree for a subset of architectures - I certainly know what I'm talking
about - or keeping stuff sane in the upstream kernel.

> > I certainly don't want to take the chance to leak CR0.WP ever
>
> why and where would cr0.wp leak?

It's bound to happen due to some subtle mistake and up to the point where
you catch it (in the scheduler or entry/exit path) the world is
writeable. And that will be some almost never executed error path which can
be triggered by a carefully crafted attack. A very restricted writeable
region is definitely preferred over full world writeable then, right?

> > Why the heck should we care about rare writes being performant?
> 
> because you've been misled by the NIH crowd here that the PaX feature they
> tried to (badly) extract from has anything to do with frequency of writes.

It would be appreciated if you could keep your feud out of this. It's
enough to tell me that 'rare write' is a misleading term and why.

> > Making the world and some more writeable hardly qualifies as tightly
> > focused.
> 
> you forgot to add 'for a window of a few insns' and that the map/unmap

If it'd be guaranteed to be a few instructions, then I wouldn't be that
worried. The availability of make_world_writeable() as an unrestricted
usable function makes me nervous as hell. We've had long standing issues
where kmap_atomic() got leaked through a hard to spot almost never executed
error handling path. And the same is bound to happen with this, just with a
way worse outcome.

> approach does the same under an attacker controlled ptr.

Which attacker controlled pointer?

Thanks,

	tglx
Mark Rutland April 10, 2017, 10:22 a.m. UTC | #33
On Sat, Apr 08, 2017 at 08:20:03AM -0700, Andy Lutomirski wrote:
> On Sat, Apr 8, 2017 at 12:33 AM, Daniel Micay <danielmicay@gmail.com> wrote:
> > The
> > submitted code is aimed at rare writes to globals, but this feature is
> > more than that and design decisions shouldn't be based on just the
> > short term.
> 
> Then, if you disagree with a proposed design, *explain why* in a
> standalone manner.  Say what future uses a different design would
> have.
> 
> > I actually care a lot more about 64-bit ARM support than I do x86, but
> > using a portable API for pax_open_kernel (for the simple uses at
> > least) is separate from choosing the underlying implementation. There
> > might not be a great way to do it on the architectures I care about
> > but that doesn't need to hinder x86. It's really not that much code...
> > A weaker/slower implementation for x86 also encourages the same
> > elsewhere.
> 
> No one has explained how CR0.WP is weaker or slower than my proposal.
> Here's what I'm proposing:
> 
> At boot, choose a random address A.  Create an mm_struct that has a
> single VMA starting at A that represents the kernel's rarely-written
> section.  Compute O = (A - VA of rarely-written section).  To do a
> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
> 
> This should work on any arch that has an MMU that allows this type of
> aliasing and that doesn't have PA-based protections on the
> rarely-written section.

Modulo randomization, that sounds exactly like what I had envisaged [1],
so that makes sense to me.

Thanks,
Mark.

[1] http://www.openwall.com/lists/kernel-hardening/2016/11/18/3
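As an aside (not from the thread itself), the writable-alias idea Andy describes can be approximated from userspace: one read-only mapping of a page stands in for the normal view of the rarely-written section, and a second writable mapping of the same backing pages at an unrelated, ASLR-randomized address plays the role of the alias used only for the rare write. A minimal sketch, with all names hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Userspace approximation (hypothetical, not the patch's code) of the
 * use_mm()-style proposal: the same backing pages are mapped twice --
 * once read-only (the normal view of the "rarely written" data) and
 * once read-write at an unrelated address (the alias used only for
 * the rare write itself).
 */
int rare_write_demo(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/rare_write_XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                    /* backing file, anonymous once unlinked */
    if (ftruncate(fd, psz) != 0)
        return -1;

    /* Normal view: read-only, like a __read_only section. */
    const int *ro = mmap(NULL, psz, PROT_READ, MAP_SHARED, fd, 0);
    /* Writable alias at whatever (randomized) address mmap picks. */
    int *alias = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (ro == MAP_FAILED || alias == MAP_FAILED)
        return -1;

    alias[0] = 42;                   /* the "rare write", via the alias */
    int seen = ro[0];                /* visible through the RO view */

    munmap((void *)ro, psz);
    munmap(alias, psz);
    close(fd);
    return seen;
}
```

`rare_write_demo()` returns 42: the store through the alias is observed through the read-only mapping, while a direct store through `ro` would fault — mirroring how the real proposal confines writes to the aliased region.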
Mark Rutland April 10, 2017, 10:29 a.m. UTC | #34
On Fri, Apr 07, 2017 at 04:45:26PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 07, 2017 at 12:51:15PM +0200, Mathias Krause wrote:
> > Why that? It allows fast and CPU local modifications of r/o memory.
> > OTOH, an approach that needs to fiddle with page table entries
> > requires global synchronization to keep the individual TLB states in
> > sync. Hmm.. Not that fast, I'd say.
> 
> The fixmaps used for kmap_atomic are per-cpu, no global sync required.

That might be fine for x86, but for some architectures fixmap slots and
kmap_atomic mappings happen to be visible to other CPUs even if they're
not required to be.

Using an mm solves that for all, though.

Thanks,
Mark.
pageexec@freemail.hu April 10, 2017, 10:42 a.m. UTC | #35
On 9 Apr 2017 at 17:10, Andy Lutomirski wrote:

> On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
> > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
> > writes and nothing else, use_mm suffers not only from the cr3 writes but
> > also locking/atomic ops and cr4 writes on its path and the inevitable TLB
> > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
> > a fast path in the microcode and reduce its overhead by an order of magnitude.
> >
> 
> If CR4 writes happen for this use case, that's a bug.

that depends on how you plan to handle perf/rdpmc users and how many
alternative mm structs you plan to manage (one global, one per cpu,
one per mm struct, etc).

> > you'll be duplicating TLB entries in the alternative PCID for both code
> > and data, where they will accumulate (=take room away from the normal PCID
> > and expose unwanted memory for access) unless you also flush them when
> > switching back (which then will cost even more cycles). also i'm not sure
> > that processors implement all the 12 PCID bits so depending on how many PCIDs
> > you plan to use, you could be causing even more unnecessary TLB replacements.
> >
> 
> Unless the CPU is rather dumber than I expect, the only duplicated
> entries should be for the writable aliases of pages that are written.
> The rest of the pages are global and should be shared for all PCIDs.

well, 4.10.2.4 has language like this (4.10.3.2 implies similar):

   A logical processor may use a global TLB entry to translate a linear
   address, even if the TLB entry is associated with a PCID different
   from the current PCID.

that to me says that global page entries are associated with a PCID and
may (not) be used while in another PCID. in Intel-speak that's not 'dumb'
but "tricks up our sleeve that we don't really want to tell you about in
detail, except perhaps under a NDA".
Andy Lutomirski April 10, 2017, 4:01 p.m. UTC | #36
On Mon, Apr 10, 2017 at 3:42 AM, PaX Team <pageexec@freemail.hu> wrote:
> On 9 Apr 2017 at 17:10, Andy Lutomirski wrote:
>
>> On Sun, Apr 9, 2017 at 5:47 AM, PaX Team <pageexec@freemail.hu> wrote:
>> > on x86 the cost of the pax_open/close_kernel primitives comes from the cr0
>> > writes and nothing else, use_mm suffers not only from the cr3 writes but
>> > also locking/atomic ops and cr4 writes on its path and the inevitable TLB
>> > entry costs. and if cpu vendors cared enough, they could make toggling cr0.wp
>> > a fast path in the microcode and reduce its overhead by an order of magnitude.
>> >
>>
>> If CR4 writes happen for this use case, that's a bug.
>
> that depends on how you plan to handle perf/rdpmc users and how many
> alternative mm structs you plan to manage (one global, one per cpu,
> one per mm struct, etc).
>

I was thinking one global unless more are needed for some reason.

>> > you'll be duplicating TLB entries in the alternative PCID for both code
>> > and data, where they will accumulate (=take room away from the normal PCID
>> > and expose unwanted memory for access) unless you also flush them when
>> > switching back (which then will cost even more cycles). also i'm not sure
>> > that processors implement all the 12 PCID bits so depending on how many PCIDs
>> > you plan to use, you could be causing even more unnecessary TLB replacements.
>> >
>>
>> Unless the CPU is rather dumber than I expect, the only duplicated
>> entries should be for the writable aliases of pages that are written.
>> The rest of the pages are global and should be shared for all PCIDs.
>
> well, 4.10.2.4 has language like this (4.10.3.2 implies similar):
>
>    A logical processor may use a global TLB entry to translate a linear
>    address, even if the TLB entry is associated with a PCID different
>    from the current PCID.

I read this as: the CPU still semantically tags global TLB entries
with a PCID, but the CPU will (probably) use those TLB entries
even if the PCIDs don't match.  IIRC none of the TLB instructions have
any effect that makes the PCID associated with a global entry visible,
so the CPU could presumably omit the PCID tags entirely for global
entries.  E.g. I don't think there's any way to say "flush global
entries with a given PCID".

--Andy
pageexec@freemail.hu April 10, 2017, 7:47 p.m. UTC | #37
On 9 Apr 2017 at 17:31, Andy Lutomirski wrote:

> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
> >
> I consider breaking buggy drivers (in a way that they either generally
> work okay

do they work okay when the dma transfer goes to a buffer that crosses
physically non-contiguous page boundaries?

> or that they break with a nice OOPS depending on config) to
> be better than having a special case in what's supposed to be a fast
> path to keep them working.  I did consider forcing the relevant debug
> options on for a while just to help shake these bugs out the woodwork
> faster.

that's a false dichotomy, discovering buggy drivers is orthogonal to (not)
breaking users' systems as grsec shows. and how did you expect to 'shake
these bugs out' when your own suggestion at the time was for distros to
not enable this feature 'for a while'?

> > i have yet to see anyone explain what they mean by 'leak' here but if it
> > is what i think it is then the arch specific entry/exit changes are not
> > optional but mandatory. see below for randomization.
> 
> By "leak" I mean that a bug or exploit causes unintended code to run
> with CR0.WP or a special CR3 or a special PTE or whatever loaded.

how can a bug/exploit cause something like this?

>  PaX hooks the entry code to avoid leaks. 

PaX doesn't instrument enter/exit paths to prevent state leaks into interrupt
context (it's a useful sideeffect though), rather it's needed for correctness
if the kernel can be interrupted at all while it's open (address space switching
will need to handle this too but you have yet to address it).

> >> At boot, choose a random address A.
> >
> > what is the threat that a random address defends against?
> 
> Makes it harder to exploit a case where the CR3 setting leaks.

if an attacker has the ability to cause this leak (details of which are subject
to the question i asked above) then why wouldn't he simply also make use of the
primitives to modify his target via the writable vma without ever having to know
the randomized address? i also wonder what exploit power you assume for this
attack and whether that is already enough to simply go after page tables, etc
instead of figuring out the alternative address space.

> > the problem is that the amount of __read_only data extends beyond vmlinux,
> > i.e., this approach won't scale. another problem is that it can't be used
> > inside use_mm and switch_mm themselves (no read-only task structs or percpu
> > pgd for you ;) and probably several other contexts.
> 
> Can you clarify these uses that extend beyond vmlinux?

one obvious candidate is modules. how do you want to handle them? then there's
a whole bunch of dynamically allocated data that is a candidate for __read_only
treatment.

> > what is the threat model you're assuming for this feature? based on what i
> > have for PaX (arbitrary read/write access exploited for data-only attacks),
> > the above makes no sense to me...
> 
> If I use the primitive to try to write a value to the wrong section
> (write to kernel text, for example), IMO it would be nice to OOPS
> instead of succeeding.

this doesn't tell me what power you're assuming the attacker has. is it
my generic arbitrary read-write ability or something more restricted and
thus less realistic? i.e., how does the attacker get to 'use the primitive'
and (presumably) also control the ptr/data?

as for your specific example, kernel text isn't 'non-rare-write data' that
you spoke of before, but that aside, what prevents an attacker from computing
his target ptr so that after your accessor rebases it, it'd point back to his
intended target instead? will you range-check (find_vma eventually?) each time?
how will you make all this code safe from races from another task? the more
checks you make, the more likely that something sensitive will spill to memory
and be a target itself in order to hijack the sensitive write.

> Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
> function will may be carefully audited by a friendly security expert
> such as yourself.  It would be nice to harden the primitive to a
> reasonable extent against minor misuses such as putting it in a
> context where the compiler will emit mov-a-reg-with-WP-set-to-CR0;
> ret.

i don't understand what's there to audit. if you want to treat a given piece
of data as __read_only then you have no choice but to allow writes to it via
the open/close mechanism and the compiler can tell you just where those
writes are (and even do the instrumentation when you get tired of doing it
by hand).
pageexec@freemail.hu April 10, 2017, 7:55 p.m. UTC | #38
On 10 Apr 2017 at 10:26, Thomas Gleixner wrote:

> On Fri, 7 Apr 2017, PaX Team wrote:
> > On 7 Apr 2017 at 11:46, Thomas Gleixner wrote:
> > > That's silly. Just because PaX does it, doesn't mean it's correct.
> > 
> > is that FUD or do you have actionable information to share?
> 
> That has absolutely nothing to do with FUD. I'm merely not accepting
> argumentations which say: PaX can do it "just"....

you implied that what PaX does may not be correct. if you can't back that
up with facts and technical arguments then it is FUD. your turn.

> That has exactly zero technical merit and it's not asked too much to
> provide precise technical arguments why one implementation is better than
> some other.

exactly. start with explaining what is not correct in PaX with "precise
technical arguments".

> > > To be honest, playing games with the CR0.WP bit is outright stupid to begin with.
> > 
> > why is that? cr0.wp exists since the i486 and its behaviour fits my
> > purposes quite well, it's the best security/performance i know of.
> 
> Works for me has never been a good engineering principle.

good thing i didn't say that. on the other hand you failed to provide "precise
technical arguments" for why "playing games with the CR0.WP bit is outright
stupid to begin with". do you have any to share and discuss?

> > > And that's just a nightmare maintainence wise as it's prone to be
> > > broken over time.
> > 
> > i've got 14 years of experience of maintaining it and i never saw it break.
> 
> It's a difference whether you maintain a special purpose patch set out of
> tree for a subset of architectures - I certainly know what I'm talking
> about - or keeping stuff sane in the upstream kernel.

there's no difference to me, i keep my stuff sane regardless. of course what
you do with your out-of-tree code is your business but don't extrapolate it
to mine. now besides argumentum ad verecundiam do you have "precise technical
arguments" as to why maintaining a cr0.wp based approach would be "a nightmare
maintainence wise as it's prone to be broken over time."?

> > > I certainly don't want to take the chance to leak CR0.WP ever
> >
> > why and where would cr0.wp leak?
> 
> It's bound to happen due to some subtle mistake

i don't see what subtle mistake you're thinking of here. can you give me
an example?

> and up to the point where you catch it (in the scheduler or entry/exit path)
> the world is writeable.

where such a leak is caught depends on what subtle mistake you're talking
about, so let's get back to this point once you answered that question.

> And that will be some almost never executed error path which can
> be triggered by a carefully crafted attack.

open/close calls have nothing to do with error paths or even conditional
execution, they're always executed as a sequence so this situation cannot
occur.

> A very restricted writeable region is definitely preferred over full
> world writeable then, right? 

it doesn't matter when the attacker has an arbitrary read/write primitive
which he can just use to modify that 'very restricted writeable region'
to whatever he needs to cover first. now if all the data managing this
region were also protected then it'd matter but that's never going to
happen in the upstream kernel.

> > > Making the world and some more writeable hardly qualifies as tightly
> > > focused.
> > 
> > you forgot to add 'for a window of a few insns' and that the map/unmap
> 
> If it'd be guaranteed to be a few instructions, then I wouldn't be that
> worried.

it is, by definition assignments to otherwise __read_only data are (have to
be) bracketed with open/close instrumentation.

> The availability of make_world_writeable() as an unrestricted
> usable function makes me nervous as hell.

i can't imagine the nightmares you must have lived through for the two decades
during which the kernel was all wide open... joking aside, we can address
these fears of yours once you explain just what kind of error situations you
have in mind and why they don't also apply to say text_poke().

> We've had long standing issues where kmap_atomic() got leaked through a
> hard to spot almost never executed error handling path. And the same is
> bound to happen with this, just with a way worse outcome. 

the lifetime and use of kmaps is very different so you'll have to explain
in more detail why the problems they had apply here as well.

> > approach does the same under an attacker controlled ptr.
> 
> Which attacker controlled pointer?

i meant the one passed to the map/unmap code, attacker control over it means
the whole world is effectively writable again (Andy's vma approach would restrict
this to just the interesting read-only data except the whole thing is irrelevant
until all participating data (vma, page tables, etc) are also protected).
Kees Cook April 10, 2017, 8:13 p.m. UTC | #39
On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:
>> No one has explained how CR0.WP is weaker or slower than my proposal.
>
> you misunderstood, Daniel was talking about your use_mm approach.
>
>> Here's what I'm proposing:
>>
>> At boot, choose a random address A.
>
> what is the threat that a random address defends against?
>
>>  Create an mm_struct that has a
>> single VMA starting at A that represents the kernel's rarely-written
>> section.  Compute O = (A - VA of rarely-written section).  To do a
>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>
> the problem is that the amount of __read_only data extends beyond vmlinux,
> i.e., this approach won't scale. another problem is that it can't be used
> inside use_mm and switch_mm themselves (no read-only task structs or percpu
> pgd for you ;) and probably several other contexts.

These are the limitations that concern me: what will we NOT be able to
make read-only as a result of the use_mm() design choice? My RFC
series included a simple case and a constify case, but I did not
include things like making page tables read-only, etc.

I can accept not using cr0, since we need to design something that
works on arm64 too, which doesn't have anything like this (AFAIK), but
I'd like to make sure we don't paint ourselves into a corner.

-Kees
Andy Lutomirski April 10, 2017, 8:17 p.m. UTC | #40
On Mon, Apr 10, 2017 at 1:13 PM, Kees Cook <keescook@chromium.org> wrote:
> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>> On 7 Apr 2017 at 22:07, Andy Lutomirski wrote:
>>> No one has explained how CR0.WP is weaker or slower than my proposal.
>>
>> you misunderstood, Daniel was talking about your use_mm approach.
>>
>>> Here's what I'm proposing:
>>>
>>> At boot, choose a random address A.
>>
>> what is the threat that a random address defends against?
>>
>>>  Create an mm_struct that has a
>>> single VMA starting at A that represents the kernel's rarely-written
>>> section.  Compute O = (A - VA of rarely-written section).  To do a
>>> rare write, use_mm() the mm, write to (VA + O), then unuse_mm().
>>
>> the problem is that the amount of __read_only data extends beyond vmlinux,
>> i.e., this approach won't scale. another problem is that it can't be used
>> inside use_mm and switch_mm themselves (no read-only task structs or percpu
>> pgd for you ;) and probably several other contexts.
>
> These are the limitations that concern me: what will we NOT be able to
> make read-only as a result of the use_mm() design choice? My RFC
> series included a simple case and a constify case, but I did not
> include things like making page tables read-only, etc.

If we make page tables read-only, we may need to have multiple levels
of rareness.  Page table writes aren't all that rare, and I can
imagine distros configuring the kernel so that static structs full of
function pointers are read-only (IMO that should be the default or
even mandatory), but page tables may be a different story.

That being said, CR3-twiddling to write to page tables could actually
work.  Hmm.
Andy Lutomirski April 10, 2017, 8:27 p.m. UTC | #41
On Mon, Apr 10, 2017 at 12:47 PM, PaX Team <pageexec@freemail.hu> wrote:
> On 9 Apr 2017 at 17:31, Andy Lutomirski wrote:
>
>> On Sun, Apr 9, 2017 at 1:24 PM, PaX Team <pageexec@freemail.hu> wrote:
>> >
>> I consider breaking buggy drivers (in a way that they either generally
>> work okay
>
> do they work okay when the dma transfer goes to a buffer that crosses
> physically non-contiguous page boundaries?

Nope.  Like I said, I considered making the debugging mandatory.  I
may still send patches to do that.

>> By "leak" I mean that a bug or exploit causes unintended code to run
>> with CR0.WP or a special CR3 or a special PTE or whatever loaded.
>
> how can a bug/exploit cause something like this?

For example: a bug in entry logic, a bug in perf NMI handling, or even
a bug in *nested* perf NMI handling (egads!).  Or maybe some super
nasty interaction with suspend/resume.  These are all fairly unlikely
(except the nested perf case), but still.

As a concrete example, back before my big NMI improvement series, it
was possible for an NMI return to invoke espfix and/or take an IRET
fault.  This *shouldn't* happen on return to a context with CR0.WP
set, but it would be incredibly nasty if it did.  The code is
separated out now, so it should be okay...

>
>>  PaX hooks the entry code to avoid leaks.
>
> PaX doesn't instrument enter/exit paths to prevent state leaks into interrupt
> context (it's a useful sideeffect though), rather it's needed for correctness
> if the kernel can be interrupted at all while it's open (address space switching
> will need to handle this too but you have yet to address it).

I don't think we disagree here.  A leak would be a case of incorrectness.

>
>> >> At boot, choose a random address A.
>> >
>> > what is the threat that a random address defends against?
>>
>> Makes it harder to exploit a case where the CR3 setting leaks.
>
> if an attacker has the ability to cause this leak (details of which are subject
> to the question i asked above) then why wouldn't he simply also make use of the
> primitives to modify his target via the writable vma without ever having to know
> the randomized address? i also wonder what exploit power you assume for this
> attack and whether that is already enough to simply go after page tables, etc
> instead of figuring out the alternative address space.

I'm imagining the power to (a) cause some code path to execute while
the kernel is "open" and (b) the ability to use the buggy code path in
question to write to a fully- or partially-controlled address.  With
CR0.WP clear, this can write shellcode directly.  With CR3 pointing to
a page table that maps some parts of the kernel (but not text!) at a
randomized offset, you need to figure out the offset and find some
other target in the mapping that gets your exploit farther along.  You
can't write shellcode directly.

>
>> > the problem is that the amount of __read_only data extends beyond vmlinux,
>> > i.e., this approach won't scale. another problem is that it can't be used
>> > inside use_mm and switch_mm themselves (no read-only task structs or percpu
>> > pgd for you ;) and probably several other contexts.
>>
>> Can you clarify these uses that extend beyond vmlinux?
>
> one obvious candidate is modules. how do you want to handle them? then there's
> a whole bunch of dynamically allocated data that is a candidate for __read_only
> treatment.

Exactly the same way.  Map those regions at the same offset, maybe
even in the same VMA.  There's no reason that an artificial VMA used
for this purpose can't be many gigabytes long and have vm_ops that
only allow access to certain things.  But multiple VMAs would work,
too.

>
>> > what is the threat model you're assuming for this feature? based on what i
>> > have for PaX (arbitrary read/write access exploited for data-only attacks),
>> > the above makes no sense to me...
>>
>> If I use the primitive to try to write a value to the wrong section
>> (write to kernel text, for example), IMO it would be nice to OOPS
>> instead of succeeding.
>
> this doesn't tell me what power you're assuming the attacker has. is it
> my generic arbitrary read-write ability or something more restricted and
> thus less realistic? i.e., how does the attacker get to 'use the primitive'
> and (presumably) also control the ptr/data?
>
> as for your specific example, kernel text isn't 'non-rare-write data' that
> you spoke of before, but that aside, what prevents an attacker from computing
> his target ptr so that after your accessor rebases it, it'd point back to his
> intended target instead?

It's a restriction on what targets can be hit.  With CR0.WP, you can
hit anything that has a VA.  With CR3, you can hit only that which is
mapped.

> will you range-check (find_vma eventually?) each time?
> how will you make all this code safe from races from another task? the more
> checks you make, the more likely that something sensitive will spill to memory
> and be a target itself in order to hijack the sensitive write.

There's no code here making the checks at write time.  It's just page
table / VMA setup.

>
>> Please keep in mind that, unlike PaX, uses of a pax_open_kernel()-like
>> function may not be carefully audited by a friendly security expert
>> such as yourself.  It would be nice to harden the primitive to a
>> reasonable extent against minor misuses such as putting it in a
>> context where the compiler will emit mov-a-reg-with-WP-set-to-CR0;
>> ret.
>
> i don't understand what's there to audit. if you want to treat a given piece
> of data as __read_only then you have no choice but to allow writes to it via
> the open/close mechanism and the compiler can tell you just where those
> writes are (and even do the instrumentation when you get tired of doing it
> by hand).
>

I mean auditing all uses of pax_open_kernel() or any other function
that opens the kernel.  That function is, as used in PaX, terrifying.
PaX probably gets every user right, but I don't trust driver writers
with a function like pax_open_kernel() that's as powerful as PaX's.

Suppose you get driver code like this:

void foo(int (*func)()) {
  pax_open_kernel();
  *thingy = func();
  pax_close_kernel();
}

That would be a very, very juicy target for a ROP-like attack.  Just
get the kernel to call this function with func pointing to something
that does a memcpy or similar into executable space.  Boom, shellcode
execution.

If CR3 is used instead, exploiting this is considerably more complicated.
diff mbox

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc98d5a294ee..2d1d707aa036 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -106,6 +106,7 @@  config X86
 	select HAVE_ARCH_KMEMCHECK
 	select HAVE_ARCH_MMAP_RND_BITS		if MMU
 	select HAVE_ARCH_MMAP_RND_COMPAT_BITS	if MMU && COMPAT
+	select HAVE_ARCH_RARE_WRITE
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1cfb36b8c024..2e6bf661bb84 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -91,6 +91,37 @@  extern struct mm_struct *pgd_page_get_mm(struct page *page);
 
 #endif	/* CONFIG_PARAVIRT */
 
+/* TODO: Bad hack to deal with preempt macros being missing sometimes. */
+#ifndef preempt_disable
+#include <linux/preempt.h>
+#endif
+
+static __always_inline unsigned long __arch_rare_write_begin(void)
+{
+	unsigned long cr0;
+
+	preempt_disable();
+	barrier();
+	cr0 = read_cr0() ^ X86_CR0_WP;
+	BUG_ON(cr0 & X86_CR0_WP);
+	write_cr0(cr0);
+	barrier();
+	return cr0 ^ X86_CR0_WP;
+}
+
+static __always_inline unsigned long __arch_rare_write_end(void)
+{
+	unsigned long cr0;
+
+	barrier();
+	cr0 = read_cr0() ^ X86_CR0_WP;
+	BUG_ON(!(cr0 & X86_CR0_WP));
+	write_cr0(cr0);
+	barrier();
+	preempt_enable_no_resched();
+	return cr0 ^ X86_CR0_WP;
+}
+
 /*
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..