[bpf-next] bpf: Make trampolines W^X
diff mbox series

Message ID 20200103234725.22846-1-kpsingh@chromium.org
State New
Headers show
Series
  • [bpf-next] bpf: Make trampolines W^X
Related show

Commit Message

KP Singh Jan. 3, 2020, 11:47 p.m. UTC
From: KP Singh <kpsingh@google.com>

The image for the BPF trampolines is allocated with
bpf_jit_alloc_exe_page which marks this allocated page executable. This
means that the allocated memory is W and X at the same time making it
susceptible to WX based attacks.

Since the allocated memory is shared between two trampolines (the
current and the next), 2 pages must be allocated to adhere to W^X and
the following sequence is obeyed where trampolines are modified:

- Mark memory as non executable (set_memory_nx). While module_alloc for
  x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
  all implementations of module_alloc do so
- Mark the memory as read/write (set_memory_rw)
- Modify the trampoline
- Mark the memory as read-only (set_memory_ro)
- Mark the memory as executable (set_memory_x)

There's a window between the modification of the trampoline and setting
the trampoline as read-only where an attacker and ~could~ change the
contents of the memory. This can be further improved by adding more
verification after the memory is marked as read-only.

Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: KP Singh <kpsingh@google.com>
---
 arch/x86/net/bpf_jit_comp.c |  2 +-
 include/linux/bpf.h         |  1 -
 kernel/bpf/dispatcher.c     | 30 +++++++++++++++++++++++----
 kernel/bpf/trampoline.c     | 41 +++++++++++++++++++------------------
 4 files changed, 48 insertions(+), 26 deletions(-)

Comments

Andy Lutomirski Jan. 4, 2020, 12:49 a.m. UTC | #1
> On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@chromium.org> wrote:
> 
> From: KP Singh <kpsingh@google.com>
> 
> The image for the BPF trampolines is allocated with
> bpf_jit_alloc_exe_page which marks this allocated page executable. This
> means that the allocated memory is W and X at the same time making it
> susceptible to WX based attacks.
> 
> Since the allocated memory is shared between two trampolines (the
> current and the next), 2 pages must be allocated to adhere to W^X and
> the following sequence is obeyed where trampolines are modified:

Can we please do better rather than piling garbage on top of garbage?

> 
> - Mark memory as non executable (set_memory_nx). While module_alloc for
> x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
> all implementations of module_alloc do so

How about fixing this instead?

> - Mark the memory as read/write (set_memory_rw)

Probably harmless, but see above about fixing it.

> - Modify the trampoline

Seems reasonable. It’s worth noting that this whole approach is suboptimal: the “module” allocator should really be returning a list of pages to be written (not at the final address!) with the actual executable mapping to be materialized later, but that’s a bigger project that you’re welcome to ignore for now.  (Concretely, it should produce a vmap address with backing pages but with the vmap alias either entirely unmapped or read-only. A subsequent healer would, all at once, make the direct map pages RO or not-present and make the vmap alias RX.)

> - Mark the memory as read-only (set_memory_ro)
> - Mark the memory as executable (set_memory_x)

No, thanks. There’s very little excuse for doing two IPI flushes when one would suffice.

As far as I know, all architectures can do this with a single flush without races  x86 certainly can. The module freeing code gets this sequence right. Please reuse its mechanism or, if needed, export the relevant interfaces.
Justin Capella Jan. 5, 2020, 1:19 a.m. UTC | #2
I'm guessing 2 pages are used to allow for different protections? Does
only the first page's protections need to be changed? Is that
"old_image"?

+       set_memory_nx((unsigned long)image, 1);
+       set_memory_rw((unsigned long)image, 1);

+       set_memory_ro((unsigned long)new_image, 1);
+       set_memory_x((unsigned long)new_image, 1);

Because

+       void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE;
+       void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE


> > - Mark the memory as read-only (set_memory_ro)
> > - Mark the memory as executable (set_memory_x)
>
> No, thanks. There’s very little excuse for doing two IPI flushes when one would suffice.

If there were checks between these steps to verify the trampoline
wasn't tampered with while the page was writable it would make sense
to do so before enabling execution.

Could some of these int's be unsigned to be extra cautious?

One last thought, if the extra checks are implemented, maybe comparing
against the old image prior to setting rw would be worthwhile?
Peter Zijlstra Jan. 6, 2020, 8:23 a.m. UTC | #3
On Sat, Jan 04, 2020 at 09:49:10AM +0900, Andy Lutomirski wrote:

> > - Mark memory as non executable (set_memory_nx). While module_alloc for
> > x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
> > all implementations of module_alloc do so
> 
> How about fixing this instead?

We only care about STRICT_MODULE_RMW (iow, arm, arm64, s390, x86). Of
those only s390 seems to be 'off'.
Edgecombe, Rick P Jan. 6, 2020, 10:25 p.m. UTC | #4
On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
> > On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@chromium.org> wrote:
> > 
> > From: KP Singh <kpsingh@google.com>
> > 
> > The image for the BPF trampolines is allocated with
> > bpf_jit_alloc_exe_page which marks this allocated page executable. This
> > means that the allocated memory is W and X at the same time making it
> > susceptible to WX based attacks.
> > 
> > Since the allocated memory is shared between two trampolines (the
> > current and the next), 2 pages must be allocated to adhere to W^X and
> > the following sequence is obeyed where trampolines are modified:
> 
> Can we please do better rather than piling garbage on top of garbage?
> 
> > 
> > - Mark memory as non executable (set_memory_nx). While module_alloc for
> > x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
> > all implementations of module_alloc do so
> 
> How about fixing this instead?
> 
> > - Mark the memory as read/write (set_memory_rw)
> 
> Probably harmless, but see above about fixing it.
> 
> > - Modify the trampoline
> 
> Seems reasonable. It’s worth noting that this whole approach is suboptimal:
> the “module” allocator should really be returning a list of pages to be
> written (not at the final address!) with the actual executable mapping to be
> materialized later, but that’s a bigger project that you’re welcome to ignore
> for now.  (Concretely, it should produce a vmap address with backing pages but
> with the vmap alias either entirely unmapped or read-only. A subsequent healer
> would, all at once, make the direct map pages RO or not-present and make the
> vmap alias RX.)
> > - Mark the memory as read-only (set_memory_ro)
> > - Mark the memory as executable (set_memory_x)
> 
> No, thanks. There’s very little excuse for doing two IPI flushes when one
> would suffice.
> 
> As far as I know, all architectures can do this with a single flush without
> races  x86 certainly can. The module freeing code gets this sequence right.
> Please reuse its mechanism or, if needed, export the relevant interfaces.

So if I understand this right, some trampolines have been added that are
currently set as RWX at modification time AND left that way during runtime? The
discussion on the order of set_memory_() calls in the commit message made me
think that this was just a modification time thing at first.

Also, is there a reason you couldn't use text_poke() to modify the trampoline
with a single flush?
Andy Lutomirski Jan. 7, 2020, 1:36 a.m. UTC | #5
> On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
> On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
>>>> On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@chromium.org> wrote:
>>> 
>>> From: KP Singh <kpsingh@google.com>
>>> 
>>> The image for the BPF trampolines is allocated with
>>> bpf_jit_alloc_exe_page which marks this allocated page executable. This
>>> means that the allocated memory is W and X at the same time making it
>>> susceptible to WX based attacks.
>>> 
>>> Since the allocated memory is shared between two trampolines (the
>>> current and the next), 2 pages must be allocated to adhere to W^X and
>>> the following sequence is obeyed where trampolines are modified:
>> 
>> Can we please do better rather than piling garbage on top of garbage?
>> 
>>> 
>>> - Mark memory as non executable (set_memory_nx). While module_alloc for
>>> x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
>>> all implementations of module_alloc do so
>> 
>> How about fixing this instead?
>> 
>>> - Mark the memory as read/write (set_memory_rw)
>> 
>> Probably harmless, but see above about fixing it.
>> 
>>> - Modify the trampoline
>> 
>> Seems reasonable. It’s worth noting that this whole approach is suboptimal:
>> the “module” allocator should really be returning a list of pages to be
>> written (not at the final address!) with the actual executable mapping to be
>> materialized later, but that’s a bigger project that you’re welcome to ignore
>> for now.  (Concretely, it should produce a vmap address with backing pages but
>> with the vmap alias either entirely unmapped or read-only. A subsequent healer
>> would, all at once, make the direct map pages RO or not-present and make the
>> vmap alias RX.)
>>> - Mark the memory as read-only (set_memory_ro)
>>> - Mark the memory as executable (set_memory_x)
>> 
>> No, thanks. There’s very little excuse for doing two IPI flushes when one
>> would suffice.
>> 
>> As far as I know, all architectures can do this with a single flush without
>> races  x86 certainly can. The module freeing code gets this sequence right.
>> Please reuse its mechanism or, if needed, export the relevant interfaces.
> 
> So if I understand this right, some trampolines have been added that are
> currently set as RWX at modification time AND left that way during runtime? The
> discussion on the order of set_memory_() calls in the commit message made me
> think that this was just a modification time thing at first.

I’m not sure what the status quo is.

We really ought to have a genuinely good API for allocation and initialization of text.  We can do so much better than set_memory_blahblah.

FWIW, I have some ideas about making kernel flushes cheaper. It’s currently blocked on finding some time and on tglx’s irqtrace work.

> 
> Also, is there a reason you couldn't use text_poke() to modify the trampoline
> with a single flush?
> 

Does text_poke to an IPI these days?
Edgecombe, Rick P Jan. 7, 2020, 7:01 p.m. UTC | #6
CC Nadav and Jessica.

On Mon, 2020-01-06 at 15:36 -1000, Andy Lutomirski wrote:
> > On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> > 
> > On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
> > > > > On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@chromium.org> wrote:
> > > > 
> > > > From: KP Singh <kpsingh@google.com>
> > > > 
> > > > The image for the BPF trampolines is allocated with
> > > > bpf_jit_alloc_exe_page which marks this allocated page executable. This
> > > > means that the allocated memory is W and X at the same time making it
> > > > susceptible to WX based attacks.
> > > > 
> > > > Since the allocated memory is shared between two trampolines (the
> > > > current and the next), 2 pages must be allocated to adhere to W^X and
> > > > the following sequence is obeyed where trampolines are modified:
> > > 
> > > Can we please do better rather than piling garbage on top of garbage?
> > > 
> > > > 
> > > > - Mark memory as non executable (set_memory_nx). While module_alloc for
> > > > x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
> > > > all implementations of module_alloc do so
> > > 
> > > How about fixing this instead?
> > > 
> > > > - Mark the memory as read/write (set_memory_rw)
> > > 
> > > Probably harmless, but see above about fixing it.
> > > 
> > > > - Modify the trampoline
> > > 
> > > Seems reasonable. It’s worth noting that this whole approach is
> > > suboptimal:
> > > the “module” allocator should really be returning a list of pages to be
> > > written (not at the final address!) with the actual executable mapping to
> > > be
> > > materialized later, but that’s a bigger project that you’re welcome to
> > > ignore
> > > for now.  (Concretely, it should produce a vmap address with backing pages
> > > but
> > > with the vmap alias either entirely unmapped or read-only. A subsequent
> > > healer
> > > would, all at once, make the direct map pages RO or not-present and make
> > > the
> > > vmap alias RX.)
> > > > - Mark the memory as read-only (set_memory_ro)
> > > > - Mark the memory as executable (set_memory_x)
> > > 
> > > No, thanks. There’s very little excuse for doing two IPI flushes when one
> > > would suffice.
> > > 
> > > As far as I know, all architectures can do this with a single flush
> > > without
> > > races  x86 certainly can. The module freeing code gets this sequence
> > > right.
> > > Please reuse its mechanism or, if needed, export the relevant interfaces.
> > 
> > So if I understand this right, some trampolines have been added that are
> > currently set as RWX at modification time AND left that way during runtime?
> > The
> > discussion on the order of set_memory_() calls in the commit message made me
> > think that this was just a modification time thing at first.
> 
> I’m not sure what the status quo is.
> 
> We really ought to have a genuinely good API for allocation and initialization
> of text.  We can do so much better than set_memory_blahblah.
> 
> FWIW, I have some ideas about making kernel flushes cheaper. It’s currently
> blocked on finding some time and on tglx’s irqtrace work.
> 

Makes sense to me. I guess there are 6 types of text allocations now:
 - These two BPF trampolines
 - BPF JITs
 - Modules
 - Kprobes
 - Ftrace

All doing (or should be doing) pretty much the same thing. I believe Jessica had
said at one point that she didn't like all the other features using
module_alloc() as it was supposed to be just for real modules. Where would the
API live?

> > 
> > Also, is there a reason you couldn't use text_poke() to modify the
> > trampoline
> > with a single flush?
> > 
> 
> Does text_poke to an IPI these days?

I don't think so since the RW mapping is just on a single CPU. That was one of
the benefits of the temporary mm struct based thing Nadav did. I haven't looked
into PeterZ's changes though.
Andy Lutomirski Jan. 8, 2020, 8:41 a.m. UTC | #7
> On Jan 7, 2020, at 9:01 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> CC Nadav and Jessica.
>
> On Mon, 2020-01-06 at 15:36 -1000, Andy Lutomirski wrote:
>>> On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
>>>>>>> On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@chromium.org> wrote:
>>>>>>
>>>>>> From: KP Singh <kpsingh@google.com>
>>>>>>
>>>>>> The image for the BPF trampolines is allocated with
>>>>>> bpf_jit_alloc_exe_page which marks this allocated page executable. This
>>>>>> means that the allocated memory is W and X at the same time making it
>>>>>> susceptible to WX based attacks.
>>>>>>
>>>>>> Since the allocated memory is shared between two trampolines (the
>>>>>> current and the next), 2 pages must be allocated to adhere to W^X and
>>>>>> the following sequence is obeyed where trampolines are modified:
>>>>>
>>>>> Can we please do better rather than piling garbage on top of garbage?
>>>>>
>>>>>>
>>>>>> - Mark memory as non executable (set_memory_nx). While module_alloc for
>>>>>> x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC, not
>>>>>> all implementations of module_alloc do so
>>>>>
>>>>> How about fixing this instead?
>>>>>
>>>>>> - Mark the memory as read/write (set_memory_rw)
>>>>>
>>>>> Probably harmless, but see above about fixing it.
>>>>>
>>>>>> - Modify the trampoline
>>>>>
>>>>> Seems reasonable. It’s worth noting that this whole approach is
>>>>> suboptimal:
>>>>> the “module” allocator should really be returning a list of pages to be
>>>>> written (not at the final address!) with the actual executable mapping to
>>>>> be
>>>>> materialized later, but that’s a bigger project that you’re welcome to
>>>>> ignore
>>>>> for now.  (Concretely, it should produce a vmap address with backing pages
>>>>> but
>>>>> with the vmap alias either entirely unmapped or read-only. A subsequent
>>>>> healer
>>>>> would, all at once, make the direct map pages RO or not-present and make
>>>>> the
>>>>> vmap alias RX.)
>>>>>> - Mark the memory as read-only (set_memory_ro)
>>>>>> - Mark the memory as executable (set_memory_x)
>>>>>
>>>>> No, thanks. There’s very little excuse for doing two IPI flushes when one
>>>>> would suffice.
>>>>>
>>>>> As far as I know, all architectures can do this with a single flush
>>>>> without
>>>>> races  x86 certainly can. The module freeing code gets this sequence
>>>>> right.
>>>>> Please reuse its mechanism or, if needed, export the relevant interfaces.
>>>
>>> So if I understand this right, some trampolines have been added that are
>>> currently set as RWX at modification time AND left that way during runtime?
>>> The
>>> discussion on the order of set_memory_() calls in the commit message made me
>>> think that this was just a modification time thing at first.
>>
>> I’m not sure what the status quo is.
>>
>> We really ought to have a genuinely good API for allocation and initialization
>> of text.  We can do so much better than set_memory_blahblah.
>>
>> FWIW, I have some ideas about making kernel flushes cheaper. It’s currently
>> blocked on finding some time and on tglx’s irqtrace work.
>>
>
> Makes sense to me. I guess there are 6 types of text allocations now:
> - These two BPF trampolines
> - BPF JITs
> - Modules
> - Kprobes
> - Ftrace
>
> All doing (or should be doing) pretty much the same thing. I believe Jessica had
> said at one point that she didn't like all the other features using
> module_alloc() as it was supposed to be just for real modules. Where would the
> API live?

New header?  This shouldn’t matter that much.

Here are two strawman proposals.  All of this is very rough -- the
actual data structures and signatures are likely problematic for
multiple reasons.

--- First proposal ---

struct text_allocation {
  void *final_addr;
  struct page *pages;
  int npages;
};

int text_alloc(struct text_allocation *out, size_t size);

/* now final_addr is not accessible and pages is writable. */

int text_freeze(struct text_allocation *alloc);

/* now pages are not accessible and final_addr is RO.  Alternatively,
pages are RO and final_addr is unmapped. */

int text_finish(struct text_allocation *alloc);

/* now final_addr is RX.  All done. */

This gets it with just one flush and gives a chance to double-check in
case of race attacks from other CPUs.  Double-checking is annoying,
though.

--- Second proposal ---

struct text_allocation {
  void *final_addr;
  /* lots of opaque stuff including an mm_struct */
  /* optional: list of struct page, but this isn't obviously useful */
};

int text_alloc(struct text_allocation *out, size_t size);

/* Memory is allocated.  There is no way to access it at all right
now.  The memory is RO or not present in the direct map. */

void __user *text_activate_mapping(struct text_allocation *out);

/* Now the text is RW at *user* address given by return value.
Preemption is off if required by use_temporary_mm().  Real user memory
cannot be accessed. */

void text_deactivate_mapping(struct text_allocation *alloc);

/* Now the memory is inaccessible again. */

void text_finalize(struct text_allocation *alloc);

/* Now it's RX or XO at the final address. */


Pros of second approach:

 - Inherently immune to cross-CPU attack.  No double-check.

 - If we ever implement a cache of non-direct-mapped, unaliased pages,
then it works with no flushes at all.  We could even relax it a bit to
allow non-direct-mapped pages that may have RX / XO aliases but no W
aliases.

 - Can easily access without worrying about page boundaries.

Cons:

 - The use of a temporary mm is annoying -- you can't copy from user
memory, for example.
Edgecombe, Rick P Jan. 8, 2020, 8:52 p.m. UTC | #8
On Wed, 2020-01-08 at 00:41 -0800, Andy Lutomirski wrote:
> > On Jan 7, 2020, at 9:01 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> > 
> > CC Nadav and Jessica.
> > 
> > On Mon, 2020-01-06 at 15:36 -1000, Andy Lutomirski wrote:
> > > > On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P <
> > > > rick.p.edgecombe@intel.com>
> > > > wrote:
> > > > 
> > > > On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
> > > > > > > > On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@chromium.org>
> > > > > > > > wrote:
> > > > > > > 
> > > > > > > From: KP Singh <kpsingh@google.com>
> > > > > > > 
> > > > > > > The image for the BPF trampolines is allocated with
> > > > > > > bpf_jit_alloc_exe_page which marks this allocated page executable.
> > > > > > > This
> > > > > > > means that the allocated memory is W and X at the same time making
> > > > > > > it
> > > > > > > susceptible to WX based attacks.
> > > > > > > 
> > > > > > > Since the allocated memory is shared between two trampolines (the
> > > > > > > current and the next), 2 pages must be allocated to adhere to W^X
> > > > > > > and
> > > > > > > the following sequence is obeyed where trampolines are modified:
> > > > > > 
> > > > > > Can we please do better rather than piling garbage on top of
> > > > > > garbage?
> > > > > > 
> > > > > > > 
> > > > > > > - Mark memory as non executable (set_memory_nx). While
> > > > > > > module_alloc for
> > > > > > > x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC,
> > > > > > > not
> > > > > > > all implementations of module_alloc do so
> > > > > > 
> > > > > > How about fixing this instead?
> > > > > > 
> > > > > > > - Mark the memory as read/write (set_memory_rw)
> > > > > > 
> > > > > > Probably harmless, but see above about fixing it.
> > > > > > 
> > > > > > > - Modify the trampoline
> > > > > > 
> > > > > > Seems reasonable. It’s worth noting that this whole approach is
> > > > > > suboptimal:
> > > > > > the “module” allocator should really be returning a list of pages to
> > > > > > be
> > > > > > written (not at the final address!) with the actual executable
> > > > > > mapping to
> > > > > > be
> > > > > > materialized later, but that’s a bigger project that you’re welcome
> > > > > > to
> > > > > > ignore
> > > > > > for now.  (Concretely, it should produce a vmap address with backing
> > > > > > pages
> > > > > > but
> > > > > > with the vmap alias either entirely unmapped or read-only. A
> > > > > > subsequent
> > > > > > healer
> > > > > > would, all at once, make the direct map pages RO or not-present and
> > > > > > make
> > > > > > the
> > > > > > vmap alias RX.)
> > > > > > > - Mark the memory as read-only (set_memory_ro)
> > > > > > > - Mark the memory as executable (set_memory_x)
> > > > > > 
> > > > > > No, thanks. There’s very little excuse for doing two IPI flushes
> > > > > > when one
> > > > > > would suffice.
> > > > > > 
> > > > > > As far as I know, all architectures can do this with a single flush
> > > > > > without
> > > > > > races  x86 certainly can. The module freeing code gets this sequence
> > > > > > right.
> > > > > > Please reuse its mechanism or, if needed, export the relevant
> > > > > > interfaces.
> > > > 
> > > > So if I understand this right, some trampolines have been added that are
> > > > currently set as RWX at modification time AND left that way during
> > > > runtime?
> > > > The
> > > > discussion on the order of set_memory_() calls in the commit message
> > > > made me
> > > > think that this was just a modification time thing at first.
> > > 
> > > I’m not sure what the status quo is.
> > > 
> > > We really ought to have a genuinely good API for allocation and
> > > initialization
> > > of text.  We can do so much better than set_memory_blahblah.
> > > 
> > > FWIW, I have some ideas about making kernel flushes cheaper. It’s
> > > currently
> > > blocked on finding some time and on tglx’s irqtrace work.
> > > 
> > 
> > Makes sense to me. I guess there are 6 types of text allocations now:
> > - These two BPF trampolines
> > - BPF JITs
> > - Modules
> > - Kprobes
> > - Ftrace
> > 
> > All doing (or should be doing) pretty much the same thing. I believe Jessica
> > had
> > said at one point that she didn't like all the other features using
> > module_alloc() as it was supposed to be just for real modules. Where would
> > the
> > API live?
> 
> New header?  This shouldn’t matter that much.
> 
> Here are two strawman proposals.  All of this is very rough -- the
> actual data structures and signatures are likely problematic for
> multiple reasons.
> 
> --- First proposal ---
> 
> struct text_allocation {
>   void *final_addr;
>   struct page *pages;
>   int npages;
> };
> 
> int text_alloc(struct text_allocation *out, size_t size);
> 
> /* now final_addr is not accessible and pages is writable. */
> 
> int text_freeze(struct text_allocation *alloc);
> 
> /* now pages are not accessible and final_addr is RO.  Alternatively,
> pages are RO and final_addr is unmapped. */
> 
> int text_finish(struct text_allocation *alloc);
> 
> /* now final_addr is RX.  All done. */
> 
> This gets it with just one flush and gives a chance to double-check in
> case of race attacks from other CPUs.  Double-checking is annoying,
> though.
> 
> --- Second proposal ---
> 
> struct text_allocation {
>   void *final_addr;
>   /* lots of opaque stuff including an mm_struct */
>   /* optional: list of struct page, but this isn't obviously useful */
> };
> 
> int text_alloc(struct text_allocation *out, size_t size);
> 
> /* Memory is allocated.  There is no way to access it at all right
> now.  The memory is RO or not present in the direct map. */
> 
> void __user *text_activate_mapping(struct text_allocation *out);
> 
> /* Now the text is RW at *user* address given by return value.
> Preemption is off if required by use_temporary_mm().  Real user memory
> cannot be accessed. */
> 
> void text_deactivate_mapping(struct text_allocation *alloc);
> 
> /* Now the memory is inaccessible again. */
> 
> void text_finalize(struct text_allocation *alloc);
> 
> /* Now it's RX or XO at the final address. */
> 
> 
> Pros of second approach:
> 
>  - Inherently immune to cross-CPU attack.  No double-check.
> 
>  - If we ever implement a cache of non-direct-mapped, unaliased pages,
> then it works with no flushes at all.  We could even relax it a bit to
> allow non-direct-mapped pages that may have RX / XO aliases but no W
> aliases.
> 
>  - Can easily access without worrying about page boundaries.
> 
> Cons:
> 
>  - The use of a temporary mm is annoying -- you can't copy from user
> memory, for example.

Probably the first proposal is better for usages where there is a signature that
can be checked like modules, because you could more easily check the signature
after the text is RO. I guess leaving the direct map as RO could work for the
second option too. Both would probably require significant changes to module
signature verification though.

Just a minor point/clarification, but outside of an enhanced signed module case,
I think the cross-CPU attack mitigation can't be full. For example, attacking
the verified BPF byte code (which is apparently planned to no longer be RO), or
the pointers being loaded into these trampolines. There is always going to be
some writable source or pointer to the source, and unless there is a way to
verify the end RO result, it's an un-winnable game of whack-a-mole to do it in
full. Still the less exposed surfaces the better since the writes we are
worrying about in this case are probably not fully arbitrary.

I don't see why it would be so bad to require copying data to the kernel before
sending it through this process. Nothing copies to final allocation directly
from userspace today, and from a perf perspective, how bad is an extra copy when
we are saving TLB shootdowns? Are you thinking to protect the data that's being
loaded from other CPUs?

Otherwise, could we lazily clone/sync the original mm into the temporary one to
allow this? (possibly totally misguided idea)

FWIW, I really like the idea of a cache of unmapped or RO pages. Potentially
several optimizations we could do there.

If this API should be cross platform, we might want to abstract the copy itself
as well, since other arch's might have non __user solutions for copying data in.

Unless someone else wants to, I can probably take a look at a first cut of this
after I get the current thing I'm working on out. Probably better to let the
dust settle on the ftrace changes as well.
Andy Lutomirski Jan. 9, 2020, 6:48 a.m. UTC | #9
> On Jan 8, 2020, at 10:52 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Wed, 2020-01-08 at 00:41 -0800, Andy Lutomirski wrote:
>>> On Jan 7, 2020, at 9:01 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> CC Nadav and Jessica.
>>>
>>> On Mon, 2020-01-06 at 15:36 -1000, Andy Lutomirski wrote:
>>>>> On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P <
>>>>> rick.p.edgecombe@intel.com>
>>>>> wrote:
>>>>>
>>>>> On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
>>>>>>>>> On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@chromium.org>
>>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> From: KP Singh <kpsingh@google.com>
>>>>>>>>
>>>>>>>> The image for the BPF trampolines is allocated with
>>>>>>>> bpf_jit_alloc_exe_page which marks this allocated page executable.
>>>>>>>> This
>>>>>>>> means that the allocated memory is W and X at the same time making
>>>>>>>> it
>>>>>>>> susceptible to WX based attacks.
>>>>>>>>
>>>>>>>> Since the allocated memory is shared between two trampolines (the
>>>>>>>> current and the next), 2 pages must be allocated to adhere to W^X
>>>>>>>> and
>>>>>>>> the following sequence is obeyed where trampolines are modified:
>>>>>>>
>>>>>>> Can we please do better rather than piling garbage on top of
>>>>>>> garbage?
>>>>>>>
>>>>>>>>
>>>>>>>> - Mark memory as non executable (set_memory_nx). While
>>>>>>>> module_alloc for
>>>>>>>> x86 allocates the memory as PAGE_KERNEL and not PAGE_KERNEL_EXEC,
>>>>>>>> not
>>>>>>>> all implementations of module_alloc do so
>>>>>>>
>>>>>>> How about fixing this instead?
>>>>>>>
>>>>>>>> - Mark the memory as read/write (set_memory_rw)
>>>>>>>
>>>>>>> Probably harmless, but see above about fixing it.
>>>>>>>
>>>>>>>> - Modify the trampoline
>>>>>>>
>>>>>>> Seems reasonable. It’s worth noting that this whole approach is
>>>>>>> suboptimal:
>>>>>>> the “module” allocator should really be returning a list of pages to
>>>>>>> be
>>>>>>> written (not at the final address!) with the actual executable
>>>>>>> mapping to
>>>>>>> be
>>>>>>> materialized later, but that’s a bigger project that you’re welcome
>>>>>>> to
>>>>>>> ignore
>>>>>>> for now.  (Concretely, it should produce a vmap address with backing
>>>>>>> pages
>>>>>>> but
>>>>>>> with the vmap alias either entirely unmapped or read-only. A
>>>>>>> subsequent
>>>>>>> healer
>>>>>>> would, all at once, make the direct map pages RO or not-present and
>>>>>>> make
>>>>>>> the
>>>>>>> vmap alias RX.)
>>>>>>>> - Mark the memory as read-only (set_memory_ro)
>>>>>>>> - Mark the memory as executable (set_memory_x)
>>>>>>>
>>>>>>> No, thanks. There’s very little excuse for doing two IPI flushes
>>>>>>> when one
>>>>>>> would suffice.
>>>>>>>
>>>>>>> As far as I know, all architectures can do this with a single flush
>>>>>>> without
>>>>>>> races  x86 certainly can. The module freeing code gets this sequence
>>>>>>> right.
>>>>>>> Please reuse its mechanism or, if needed, export the relevant
>>>>>>> interfaces.
>>>>>
>>>>> So if I understand this right, some trampolines have been added that are
>>>>> currently set as RWX at modification time AND left that way during
>>>>> runtime?
>>>>> The
>>>>> discussion on the order of set_memory_() calls in the commit message
>>>>> made me
>>>>> think that this was just a modification time thing at first.
>>>>
>>>> I’m not sure what the status quo is.
>>>>
>>>> We really ought to have a genuinely good API for allocation and
>>>> initialization
>>>> of text.  We can do so much better than set_memory_blahblah.
>>>>
>>>> FWIW, I have some ideas about making kernel flushes cheaper. It’s
>>>> currently
>>>> blocked on finding some time and on tglx’s irqtrace work.
>>>>
>>>
>>> Makes sense to me. I guess there are 6 types of text allocations now:
>>> - These two BPF trampolines
>>> - BPF JITs
>>> - Modules
>>> - Kprobes
>>> - Ftrace
>>>
>>> All doing (or should be doing) pretty much the same thing. I believe Jessica
>>> had
>>> said at one point that she didn't like all the other features using
>>> module_alloc() as it was supposed to be just for real modules. Where would
>>> the
>>> API live?
>>
>> New header?  This shouldn’t matter that much.
>>
>> Here are two strawman proposals.  All of this is very rough -- the
>> actual data structures and signatures are likely problematic for
>> multiple reasons.
>>
>> --- First proposal ---
>>
>> struct text_allocation {
>>  void *final_addr;
>>  struct page *pages;
>>  int npages;
>> };
>>
>> int text_alloc(struct text_allocation *out, size_t size);
>>
>> /* now final_addr is not accessible and pages is writable. */
>>
>> int text_freeze(struct text_allocation *alloc);
>>
>> /* now pages are not accessible and final_addr is RO.  Alternatively,
>> pages are RO and final_addr is unmapped. */
>>
>> int text_finish(struct text_allocation *alloc);
>>
>> /* now final_addr is RX.  All done. */
>>
>> This gets it with just one flush and gives a chance to double-check in
>> case of race attacks from other CPUs.  Double-checking is annoying,
>> though.
>>
>> --- Second proposal ---
>>
>> struct text_allocation {
>>  void *final_addr;
>>  /* lots of opaque stuff including an mm_struct */
>>  /* optional: list of struct page, but this isn't obviously useful */
>> };
>>
>> int text_alloc(struct text_allocation *out, size_t size);
>>
>> /* Memory is allocated.  There is no way to access it at all right
>> now.  The memory is RO or not present in the direct map. */
>>
>> void __user *text_activate_mapping(struct text_allocation *out);
>>
>> /* Now the text is RW at *user* address given by return value.
>> Preemption is off if required by use_temporary_mm().  Real user memory
>> cannot be accessed. */
>>
>> void text_deactivate_mapping(struct text_allocation *alloc);
>>
>> /* Now the memory is inaccessible again. */
>>
>> void text_finalize(struct text_allocation *alloc);
>>
>> /* Now it's RX or XO at the final address. */
>>
>>
>> Pros of second approach:
>>
>> - Inherently immune to cross-CPU attack.  No double-check.
>>
>> - If we ever implement a cache of non-direct-mapped, unaliased pages,
>> then it works with no flushes at all.  We could even relax it a bit to
>> allow non-direct-mapped pages that may have RX / XO aliases but no W
>> aliases.
>>
>> - Can easily access without worrying about page boundaries.
>>
>> Cons:
>>
>> - The use of a temporary mm is annoying -- you can't copy from user
>> memory, for example.
>
> Probably the first proposal is better for usages where there is a signature that
> can be checked like modules, because you could more easily check the signature
> after the text is RO. I guess leaving the direct map as RO could work for the
> second option too. Both would probably require significant changes to module
> signature verification though.

This sounds complicated — for decent performance, we want to apply
alternatives before we make the text RO, at which point verifying the
signature is awkward at best.

>
> Just a minor point/clarification, but outside of an enhanced signed module case,
> I think the cross-CPU attack mitigation can't be full. For example, attacking
> the verified BPF byte code (which is apparently planned to no longer be RO), or
> the pointers being loaded into these trampolines. There is always going to be
> some writable source or pointer to the source, and unless there is a way to
> verify the end RO result, it's an un-winnable game of whack-a-mole to do it in
> full. Still the less exposed surfaces the better since the writes we are
> worrying about in this case are probably not fully arbitrary.

We could use hypervisor- or CR3-based protection. But I agree this is
tricky and not strictly on topic :)

>
> I don't see why it would be so bad to require copying data to the kernel before
> sending it through this process. Nothing copies to final allocation directly
> from userspace today, and from a perf perspective, how bad is an extra copy when
> we are saving TLB shootdowns? Are you thinking to protect the data that's being
> loaded from other CPUs?

Hmm. If there’s a way to make loading stall, then the cross-cpu attack
is a nice way to write shell code, so mitigating this has at least
some value.

>
> Otherwise, could we lazily clone/sync the original mm into the temporary one to
> allow this? (possibly totally misguided idea)

That involves allocating a virtual address at a safe position to make
this work. On architectures like s390, I don’t even know if this is
possible. Even on x86, it’s awkward.  I think it’s easier to just say
that, while the temporary mapping is active, user memory is
inaccessible.

>
> FWIW, I really like the idea of a cache of unmapped or RO pages. Potentially
> several optimizations we could do there.
>

I guess we would track these pages by the maximum permissions than any
current or unmapped but unflushed alias has.  This lets us get totally
unmapped or RO pages out of the cache.  Or even RX — we could
potentially allocate, free, and reallocate text without flushing.

> If this API should be cross platform, we might want to abstract the copy itself
> as well, since other arch's might have non __user solutions for copying data in.

Agreed, although maybe all arches would use “user” mappings.

>
> Unless someone else wants to, I can probably take a look at a first cut of this
> after I get the current thing I'm working on out. Probably better to let the
> dust settle on the ftrace changes as well.

That would be great!

Do you know why the change_page_attr code currently does
vm_unmap_aliases?  This is yet more extra expense. I assume the idea
is that, if we’re changing cache attributes on a non-self-snoop
machine, we need to kill stale aliases, and we should also kill them
if we’re reducing permissions.  But we currently do it excessively.

We should also consider improving vm_unmap_aliases().  As a practical
matter, vm_unmap_aliases() often does a global flush, but it can't be
relied on.  On the other hand, a global flush initiated for other
reasons won't tell the vmap code that aliases have been zapped.

If the locking is okay, we could maybe get away with zapping aliases
from the normal flush code.  Alternatively, we could do something
lockless, e.g.:

atomic64_t kernel_tlb_gen, flushed_kernel_tlb_gen;

flush_tlb_kernel_range(), etc increment kernel_tlb_gen before flushing
and then update flushed_kernel_tlb_gen to match after flushing.

The vmap code immediately removes PTEs when unmaps occur (which it may
very well do right now -- I haven't checked) but also tracks the
kernel_tlb_gen associated with each record of an
unmapped-but-not-zapped area.  Then we split vm_unmap_aliases() into a
variant that unmaps all aliases and a variant that merely promises to
unmap at least one alias.  The former does what the current code does
except that it skips the IPI if all areas in question have tlb_gen <
flushed_kernel_tlb_gen.  The latter clears all areas with tlb_gen <
flushed_kernel_tlb_gen and, if there weren't any, does
flush_tlb_kernel_range() and flushes everything.

(Major caveat: this is wrong for the case where
flush_tlb_kernel_range() only flushes some but not all of the kernel.
So this needs considerable work if it's actually going to me useful.
The plain old "take locks and clean up" approach might be a better
bet.)

--Andy
Edgecombe, Rick P Jan. 10, 2020, 1 a.m. UTC | #10
On Wed, 2020-01-08 at 22:48 -0800, Andy Lutomirski wrote:
> > On Jan 8, 2020, at 10:52 AM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> > 
> > On Wed, 2020-01-08 at 00:41 -0800, Andy Lutomirski wrote:
> > > > On Jan 7, 2020, at 9:01 AM, Edgecombe, Rick P <
> > > > rick.p.edgecombe@intel.com>
> > > > wrote:
> > > > 
> > > > CC Nadav and Jessica.
> > > > 
> > > > On Mon, 2020-01-06 at 15:36 -1000, Andy Lutomirski wrote:
> > > > > > On Jan 6, 2020, at 12:25 PM, Edgecombe, Rick P <
> > > > > > rick.p.edgecombe@intel.com>
> > > > > > wrote:
> > > > > > 
> > > > > > On Sat, 2020-01-04 at 09:49 +0900, Andy Lutomirski wrote:
> > > > > > > > > > On Jan 4, 2020, at 8:47 AM, KP Singh <kpsingh@chromium.org>
> > > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > From: KP Singh <kpsingh@google.com>
> > > > > > > > > 
> > > > > > > > > The image for the BPF trampolines is allocated with
> > > > > > > > > bpf_jit_alloc_exe_page which marks this allocated page
> > > > > > > > > executable.
> > > > > > > > > This
> > > > > > > > > means that the allocated memory is W and X at the same time
> > > > > > > > > making
> > > > > > > > > it
> > > > > > > > > susceptible to WX based attacks.
> > > > > > > > > 
> > > > > > > > > Since the allocated memory is shared between two trampolines
> > > > > > > > > (the
> > > > > > > > > current and the next), 2 pages must be allocated to adhere to
> > > > > > > > > W^X
> > > > > > > > > and
> > > > > > > > > the following sequence is obeyed where trampolines are
> > > > > > > > > modified:
> > > > > > > > 
> > > > > > > > Can we please do better rather than piling garbage on top of
> > > > > > > > garbage?
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > - Mark memory as non executable (set_memory_nx). While
> > > > > > > > > module_alloc for
> > > > > > > > > x86 allocates the memory as PAGE_KERNEL and not
> > > > > > > > > PAGE_KERNEL_EXEC,
> > > > > > > > > not
> > > > > > > > > all implementations of module_alloc do so
> > > > > > > > 
> > > > > > > > How about fixing this instead?
> > > > > > > > 
> > > > > > > > > - Mark the memory as read/write (set_memory_rw)
> > > > > > > > 
> > > > > > > > Probably harmless, but see above about fixing it.
> > > > > > > > 
> > > > > > > > > - Modify the trampoline
> > > > > > > > 
> > > > > > > > Seems reasonable. It’s worth noting that this whole approach is
> > > > > > > > suboptimal:
> > > > > > > > the “module” allocator should really be returning a list of
> > > > > > > > pages to
> > > > > > > > be
> > > > > > > > written (not at the final address!) with the actual executable
> > > > > > > > mapping to
> > > > > > > > be
> > > > > > > > materialized later, but that’s a bigger project that you’re
> > > > > > > > welcome
> > > > > > > > to
> > > > > > > > ignore
> > > > > > > > for now.  (Concretely, it should produce a vmap address with
> > > > > > > > backing
> > > > > > > > pages
> > > > > > > > but
> > > > > > > > with the vmap alias either entirely unmapped or read-only. A
> > > > > > > > subsequent
> > > > > > > > healer
> > > > > > > > would, all at once, make the direct map pages RO or not-present
> > > > > > > > and
> > > > > > > > make
> > > > > > > > the
> > > > > > > > vmap alias RX.)
> > > > > > > > > - Mark the memory as read-only (set_memory_ro)
> > > > > > > > > - Mark the memory as executable (set_memory_x)
> > > > > > > > 
> > > > > > > > No, thanks. There’s very little excuse for doing two IPI flushes
> > > > > > > > when one
> > > > > > > > would suffice.
> > > > > > > > 
> > > > > > > > As far as I know, all architectures can do this with a single
> > > > > > > > flush
> > > > > > > > without
> > > > > > > > races  x86 certainly can. The module freeing code gets this
> > > > > > > > sequence
> > > > > > > > right.
> > > > > > > > Please reuse its mechanism or, if needed, export the relevant
> > > > > > > > interfaces.
> > > > > > 
> > > > > > So if I understand this right, some trampolines have been added that
> > > > > > are
> > > > > > currently set as RWX at modification time AND left that way during
> > > > > > runtime?
> > > > > > The
> > > > > > discussion on the order of set_memory_() calls in the commit message
> > > > > > made me
> > > > > > think that this was just a modification time thing at first.
> > > > > 
> > > > > I’m not sure what the status quo is.
> > > > > 
> > > > > We really ought to have a genuinely good API for allocation and
> > > > > initialization
> > > > > of text.  We can do so much better than set_memory_blahblah.
> > > > > 
> > > > > FWIW, I have some ideas about making kernel flushes cheaper. It’s
> > > > > currently
> > > > > blocked on finding some time and on tglx’s irqtrace work.
> > > > > 
> > > > 
> > > > Makes sense to me. I guess there are 6 types of text allocations now:
> > > > - These two BPF trampolines
> > > > - BPF JITs
> > > > - Modules
> > > > - Kprobes
> > > > - Ftrace
> > > > 
> > > > All doing (or should be doing) pretty much the same thing. I believe
> > > > Jessica
> > > > had
> > > > said at one point that she didn't like all the other features using
> > > > module_alloc() as it was supposed to be just for real modules. Where
> > > > would
> > > > the
> > > > API live?
> > > 
> > > New header?  This shouldn’t matter that much.
> > > 
> > > Here are two strawman proposals.  All of this is very rough -- the
> > > actual data structures and signatures are likely problematic for
> > > multiple reasons.
> > > 
> > > --- First proposal ---
> > > 
> > > struct text_allocation {
> > >  void *final_addr;
> > >  struct page *pages;
> > >  int npages;
> > > };
> > > 
> > > int text_alloc(struct text_allocation *out, size_t size);
> > > 
> > > /* now final_addr is not accessible and pages is writable. */
> > > 
> > > int text_freeze(struct text_allocation *alloc);
> > > 
> > > /* now pages are not accessible and final_addr is RO.  Alternatively,
> > > pages are RO and final_addr is unmapped. */
> > > 
> > > int text_finish(struct text_allocation *alloc);
> > > 
> > > /* now final_addr is RX.  All done. */
> > > 
> > > This gets it with just one flush and gives a chance to double-check in
> > > case of race attacks from other CPUs.  Double-checking is annoying,
> > > though.
> > > 
> > > --- Second proposal ---
> > > 
> > > struct text_allocation {
> > >  void *final_addr;
> > >  /* lots of opaque stuff including an mm_struct */
> > >  /* optional: list of struct page, but this isn't obviously useful */
> > > };
> > > 
> > > int text_alloc(struct text_allocation *out, size_t size);
> > > 
> > > /* Memory is allocated.  There is no way to access it at all right
> > > now.  The memory is RO or not present in the direct map. */
> > > 
> > > void __user *text_activate_mapping(struct text_allocation *out);
> > > 
> > > /* Now the text is RW at *user* address given by return value.
> > > Preemption is off if required by use_temporary_mm().  Real user memory
> > > cannot be accessed. */
> > > 
> > > void text_deactivate_mapping(struct text_allocation *alloc);
> > > 
> > > /* Now the memory is inaccessible again. */
> > > 
> > > void text_finalize(struct text_allocation *alloc);
> > > 
> > > /* Now it's RX or XO at the final address. */
> > > 
> > > 
> > > Pros of second approach:
> > > 
> > > - Inherently immune to cross-CPU attack.  No double-check.
> > > 
> > > - If we ever implement a cache of non-direct-mapped, unaliased pages,
> > > then it works with no flushes at all.  We could even relax it a bit to
> > > allow non-direct-mapped pages that may have RX / XO aliases but no W
> > > aliases.
> > > 
> > > - Can easily access without worrying about page boundaries.
> > > 
> > > Cons:
> > > 
> > > - The use of a temporary mm is annoying -- you can't copy from user
> > > memory, for example.
> > 
> > Probably the first proposal is better for usages where there is a signature
> > that
> > can be checked like modules, because you could more easily check the
> > signature
> > after the text is RO. I guess leaving the direct map as RO could work for
> > the
> > second option too. Both would probably require significant changes to module
> > signature verification though.
> 
> This sounds complicated — for decent performance, we want to apply
> alternatives before we make the text RO, at which point verifying the
> signature is awkward at best.
> 
> > 
> > Just a minor point/clarification, but outside of an enhanced signed module
> > case,
> > I think the cross-CPU attack mitigation can't be full. For example,
> > attacking
> > the verified BPF byte code (which is apparently planned to no longer be RO),
> > or
> > the pointers being loaded into these trampolines. There is always going to
> > be
> > some writable source or pointer to the source, and unless there is a way to
> > verify the end RO result, it's an un-winnable game of whack-a-mole to do it
> > in
> > full. Still the less exposed surfaces the better since the writes we are
> > worrying about in this case are probably not fully arbitrary.
> 
> We could use hypervisor- or CR3-based protection. But I agree this is
> tricky and not strictly on topic :)
> 
> > 
> > I don't see why it would be so bad to require copying data to the kernel
> > before
> > sending it through this process. Nothing copies to final allocation directly
> > from userspace today, and from a perf perspective, how bad is an extra copy
> > when
> > we are saving TLB shootdowns? Are you thinking to protect the data that's
> > being
> > loaded from other CPUs?
> 
> Hmm. If there’s a way to make loading stall, then the cross-cpu attack
> is a nice way to write shell code, so mitigating this has at least
> some value.
> 
> > 
> > Otherwise, could we lazily clone/sync the original mm into the temporary one
> > to
> > allow this? (possibly totally misguided idea)
> 
> That involves allocating a virtual address at a safe position to make
> this work. On architectures like s390, I don’t even know if this is
> possible. Even on x86, it’s awkward.  I think it’s easier to just say
> that, while the temporary mapping is active, user memory is
> inaccessible.
> 
> > 
> > FWIW, I really like the idea of a cache of unmapped or RO pages. Potentially
> > several optimizations we could do there.
> > 
> 
> I guess we would track these pages by the maximum permissions than any
> current or unmapped but unflushed alias has.  This lets us get totally
> unmapped or RO pages out of the cache.  Or even RX — we could
> potentially allocate, free, and reallocate text without flushing.
> 
> > If this API should be cross platform, we might want to abstract the copy
> > itself
> > as well, since other arch's might have non __user solutions for copying data
> > in.
> 
> Agreed, although maybe all arches would use “user” mappings.
> 
> > 
> > Unless someone else wants to, I can probably take a look at a first cut of
> > this
> > after I get the current thing I'm working on out. Probably better to let the
> > dust settle on the ftrace changes as well.
> 
> That would be great!
> 
> Do you know why the change_page_attr code currently does
> vm_unmap_aliases?  This is yet more extra expense. I assume the idea
> is that, if we’re changing cache attributes on a non-self-snoop
> machine, we need to kill stale aliases, and we should also kill them
> if we’re reducing permissions.  But we currently do it excessively.

I had assumed the RO/RW alias reason, but now I went and looked. I guess it was
first motivated by:
"XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often."

https://github.com/torvalds/linux/commit/db64fe02258f1507e13fe5212a989922323685ce

For PAT, it must be like you are saying with changes in cache properties...

For XEN page table paravirt not sure if the pageattr.c call was involved or not,
but in any case it looks like it is no longer relevant. First it was changed to
eagerly flushing when under XEN pv (d2cb214551de8180542a04ec8c86c0c9412c5124),
and then just eagerly marking vmalloc PTEs as unmapped so I guess XEN can see
this and do what it needs to do (64141da587241301ce8638cc945f8b67853156ec).

For "and such"...not sure.

But flushing RW aliases for new RO memory seems valid too even if it wasn't the
intention.

> We should also consider improving vm_unmap_aliases().  As a practical
> matter, vm_unmap_aliases() often does a global flush, but it can't be
> relied on.  On the other hand, a global flush initiated for other
> reasons won't tell the vmap code that aliases have been zapped.
> 
> If the locking is okay, we could maybe get away with zapping aliases
> from the normal flush code.  Alternatively, we could do something
> lockless, e.g.:
> 
> atomic64_t kernel_tlb_gen, flushed_kernel_tlb_gen;
> 
> flush_tlb_kernel_range(), etc increment kernel_tlb_gen before flushing
> and then update flushed_kernel_tlb_gen to match after flushing.
> 
> The vmap code immediately removes PTEs when unmaps occur (which it may
> very well do right now -- I haven't checked) but also tracks the
> kernel_tlb_gen associated with each record of an
> unmapped-but-not-zapped area.  Then we split vm_unmap_aliases() into a
> variant that unmaps all aliases and a variant that merely promises to
> unmap at least one alias.  The former does what the current code does
> except that it skips the IPI if all areas in question have tlb_gen <
> flushed_kernel_tlb_gen.  The latter clears all areas with tlb_gen <
> flushed_kernel_tlb_gen and, if there weren't any, does
> flush_tlb_kernel_range() and flushes everything.
> 
> (Major caveat: this is wrong for the case where
> flush_tlb_kernel_range() only flushes some but not all of the kernel.
> So this needs considerable work if it's actually going to me useful.
> The plain old "take locks and clean up" approach might be a better
> bet.)
> 

Hmm. In normal usage (!DEBUG_PAGE_ALLOC), are kernel range tlb shootdowns common
outside of module space users and lazy vmap stuff? A tlb_gen solution might only
be worth it in cases where something other than vm_unmap_aliases() and helpers
was doing this frequently.
Andy Lutomirski Jan. 10, 2020, 6:35 p.m. UTC | #11
> On Jan 9, 2020, at 3:01 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:

>> The vmap code immediately removes PTEs when unmaps occur (which it may
>> very well do right now -- I haven't checked) but also tracks the
>> kernel_tlb_gen associated with each record of an
>> unmapped-but-not-zapped area.  Then we split vm_unmap_aliases() into a
>> variant that unmaps all aliases and a variant that merely promises to
>> unmap at least one alias.  The former does what the current code does
>> except that it skips the IPI if all areas in question have tlb_gen <
>> flushed_kernel_tlb_gen.  The latter clears all areas with tlb_gen <
>> flushed_kernel_tlb_gen and, if there weren't any, does
>> flush_tlb_kernel_range() and flushes everything.
>>
>> (Major caveat: this is wrong for the case where
>> flush_tlb_kernel_range() only flushes some but not all of the kernel.
>> So this needs considerable work if it's actually going to me useful.
>> The plain old "take locks and clean up" approach might be a better
>> bet.)
>>
>
> Hmm. In normal usage (!DEBUG_PAGE_ALLOC), are kernel range tlb shootdowns common
> outside of module space users and lazy vmap stuff? A tlb_gen solution might only
> be worth it in cases where something other than vm_unmap_aliases() and helpers
> was doing this frequently.

I suspect that the two bug users aside from vunmap() will be eBPF and,
eventually, XPFO / “exclusive pages” / less crappy SEV-like
implementations / actual high quality MKTME stuff / KVM
side-channel-proof memory.  The latter doesn’t actually exist yet (the
SEV implementation sidesteps this with a horrible hack involving
incoherent mappings that are left active with fingers crossed), but it
really seems like it’s coming.

In general, if we’re going to have a pool of non-RW-direct-mapped
pages, we also want some moderately efficient way to produce such
pages.

Right now, creating and freeing eBPF programs in a loop is probably a
performance disaster on large systems.

Patch
diff mbox series

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4c8a2d1f8470..a510f8260322 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1527,7 +1527,7 @@  int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags
 	 * Another half is an area for next trampoline.
 	 * Make sure the trampoline generation logic doesn't overflow.
 	 */
-	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY))
+	if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE - BPF_INSN_SAFETY))
 		return -EFAULT;
 	return 0;
 }
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b14e51d56a82..3be8b1b0166d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -502,7 +502,6 @@  struct bpf_trampoline *bpf_trampoline_lookup(u64 key);
 int bpf_trampoline_link_prog(struct bpf_prog *prog);
 int bpf_trampoline_unlink_prog(struct bpf_prog *prog);
 void bpf_trampoline_put(struct bpf_trampoline *tr);
-void *bpf_jit_alloc_exec_page(void);
 #define BPF_DISPATCHER_INIT(name) {			\
 	.mutex = __MUTEX_INITIALIZER(name.mutex),	\
 	.func = &name##func,				\
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
index 204ee61a3904..f4589da3bb34 100644
--- a/kernel/bpf/dispatcher.c
+++ b/kernel/bpf/dispatcher.c
@@ -93,13 +93,34 @@  int __weak arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs)
 static int bpf_dispatcher_prepare(struct bpf_dispatcher *d, void *image)
 {
 	s64 ips[BPF_DISPATCHER_MAX] = {}, *ipsp = &ips[0];
-	int i;
+	int i, ret;
 
 	for (i = 0; i < BPF_DISPATCHER_MAX; i++) {
 		if (d->progs[i].prog)
 			*ipsp++ = (s64)(uintptr_t)d->progs[i].prog->bpf_func;
 	}
-	return arch_prepare_bpf_dispatcher(image, &ips[0], d->num_progs);
+
+	/* First make the page non-executable and then make it RW to avoid it
+	 * from being W+X. While x86's implementation of module_alloc
+	 * allocates memory as non-executable, not all implementations do so.
+	 * Till these are fixed, explicitly mark the memory as NX.
+	 */
+	set_memory_nx((unsigned long)image, 1);
+	set_memory_rw((unsigned long)image, 1);
+
+	ret = arch_prepare_bpf_dispatcher(image, &ips[0], d->num_progs);
+	if (ret)
+		return ret;
+
+	/* First make the page read-only, and only then make it executable to
+	 * prevent it from being W+X in between.
+	 */
+	set_memory_ro((unsigned long)image, 1);
+	/* More checks can be done here to ensure that nothing was changed
+	 * between arch_prepare_bpf_dispatcher and set_memory_ro.
+	 */
+	set_memory_x((unsigned long)image, 1);
+	return 0;
 }
 
 static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs)
@@ -113,7 +134,7 @@  static void bpf_dispatcher_update(struct bpf_dispatcher *d, int prev_num_progs)
 		noff = 0;
 	} else {
 		old = d->image + d->image_off;
-		noff = d->image_off ^ (PAGE_SIZE / 2);
+		noff = d->image_off ^ PAGE_SIZE;
 	}
 
 	new = d->num_progs ? d->image + noff : NULL;
@@ -140,10 +161,11 @@  void bpf_dispatcher_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from,
 
 	mutex_lock(&d->mutex);
 	if (!d->image) {
-		d->image = bpf_jit_alloc_exec_page();
+		d->image = bpf_jit_alloc_exec(2 * PAGE_SIZE);
 		if (!d->image)
 			goto out;
 	}
+	set_vm_flush_reset_perms(d->image);
 
 	prev_num_progs = d->num_progs;
 	changed |= bpf_dispatcher_remove_prog(d, from);
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 505f4e4b31d2..ff6a92ef8945 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -14,22 +14,6 @@  static struct hlist_head trampoline_table[TRAMPOLINE_TABLE_SIZE];
 /* serializes access to trampoline_table */
 static DEFINE_MUTEX(trampoline_mutex);
 
-void *bpf_jit_alloc_exec_page(void)
-{
-	void *image;
-
-	image = bpf_jit_alloc_exec(PAGE_SIZE);
-	if (!image)
-		return NULL;
-
-	set_vm_flush_reset_perms(image);
-	/* Keep image as writeable. The alternative is to keep flipping ro/rw
-	 * everytime new program is attached or detached.
-	 */
-	set_memory_x((long)image, 1);
-	return image;
-}
-
 struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
 {
 	struct bpf_trampoline *tr;
@@ -50,12 +34,13 @@  struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
 		goto out;
 
 	/* is_root was checked earlier. No need for bpf_jit_charge_modmem() */
-	image = bpf_jit_alloc_exec_page();
+	image = bpf_jit_alloc_exec(2 * PAGE_SIZE);
 	if (!image) {
 		kfree(tr);
 		tr = NULL;
 		goto out;
 	}
+	set_vm_flush_reset_perms(image);
 
 	tr->key = key;
 	INIT_HLIST_NODE(&tr->hlist);
@@ -125,14 +110,14 @@  static int register_fentry(struct bpf_trampoline *tr, void *new_addr)
 }
 
 /* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50
- * bytes on x86.  Pick a number to fit into PAGE_SIZE / 2
+ * bytes on x86.  Pick a number to fit into PAGE_SIZE.
  */
 #define BPF_MAX_TRAMP_PROGS 40
 
 static int bpf_trampoline_update(struct bpf_trampoline *tr)
 {
-	void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE/2;
-	void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE/2;
+	void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE;
+	void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE;
 	struct bpf_prog *progs_to_run[BPF_MAX_TRAMP_PROGS];
 	int fentry_cnt = tr->progs_cnt[BPF_TRAMP_FENTRY];
 	int fexit_cnt = tr->progs_cnt[BPF_TRAMP_FEXIT];
@@ -160,6 +145,13 @@  static int bpf_trampoline_update(struct bpf_trampoline *tr)
 	if (fexit_cnt)
 		flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
 
+
+	/* First make the page non-executable and then make it RW to avoid it
+	 * from being being W+X.
+	 */
+	set_memory_nx((unsigned long)new_image, 1);
+	set_memory_rw((unsigned long)new_image, 1);
+
 	err = arch_prepare_bpf_trampoline(new_image, &tr->func.model, flags,
 					  fentry, fentry_cnt,
 					  fexit, fexit_cnt,
@@ -167,6 +159,15 @@  static int bpf_trampoline_update(struct bpf_trampoline *tr)
 	if (err)
 		goto out;
 
+	/* First make the page read-only, and only then make it executable to
+	 * prevent it from being W+X in between.
+	 */
+	set_memory_ro((unsigned long)new_image, 1);
+	/* More checks can be done here to ensure that nothing was changed
+	 * between arch_prepare_bpf_trampoline and set_memory_ro.
+	 */
+	set_memory_x((unsigned long)new_image, 1);
+
 	if (tr->selector)
 		/* progs already running at this address */
 		err = modify_fentry(tr, old_image, new_image);