diff mbox series

[11/18] maccess: remove strncpy_from_unsafe

Message ID 20200513160038.2482415-12-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series [01/18] maccess: unexport probe_kernel_write and probe_user_write | expand

Commit Message

Christoph Hellwig May 13, 2020, 4 p.m. UTC
All three callers really should try the explicit kernel and user
copies instead.  One has already deprecated the somewhat dangerous
either kernel or user address concept, the other two still need to
follow up eventually.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 include/linux/uaccess.h     |  1 -
 kernel/trace/bpf_trace.c    | 39 +++++++++++++++++++++++++------------
 kernel/trace/trace_kprobe.c |  5 ++++-
 mm/maccess.c                | 39 +------------------------------------
 4 files changed, 32 insertions(+), 52 deletions(-)

Comments

Linus Torvalds May 13, 2020, 7:11 p.m. UTC | #1
On Wed, May 13, 2020 at 9:01 AM Christoph Hellwig <hch@lst.de> wrote:
>
> +static void bpf_strncpy(char *buf, long unsafe_addr)
> +{
> +       buf[0] = 0;
> +       if (strncpy_from_kernel_nofault(buf, (void *)unsafe_addr,
> +                       BPF_STRNCPY_LEN))
> +               strncpy_from_user_nofault(buf, (void __user *)unsafe_addr,
> +                               BPF_STRNCPY_LEN);
> +}

This seems buggy when I look at it.

It seems to think that strncpy_from_kernel_nofault() returns an error code.

Not so, unless I missed where you changed the rules.

It returns the length of the string for a successful copy. 0 is
actually an error case (for count being <= 0).

So the test for success seems entirely wrong.

Also, I do wonder if we shouldn't gate this on TASK_SIZE, and do the
user trial first. On architectures where this thing is valid in the
first place (ie kernel and user addresses are separate), the test for
address size would allow us to avoid a pointless fault due to an
invalid kernel access to user space.

So I think this function should look something like

  static void bpf_strncpy(char *buf, long unsafe_addr)
  {
          /* Try user address */
          if (unsafe_addr < TASK_SIZE) {
                  void __user *ptr = (void __user *)unsafe_addr;
                  if (strncpy_from_user_nofault(buf, ptr, BPF_STRNCPY_LEN) >= 0)
                          return;
          }

          /* .. fall back on trying kernel access */
          buf[0] = 0;
          strncpy_from_kernel_nofault(buf, (void *)unsafe_addr,
BPF_STRNCPY_LEN);
  }

or similar. No?

                   Linus
Christoph Hellwig May 13, 2020, 7:28 p.m. UTC | #2
On Wed, May 13, 2020 at 12:11:27PM -0700, Linus Torvalds wrote:
> On Wed, May 13, 2020 at 9:01 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > +static void bpf_strncpy(char *buf, long unsafe_addr)
> > +{
> > +       buf[0] = 0;
> > +       if (strncpy_from_kernel_nofault(buf, (void *)unsafe_addr,
> > +                       BPF_STRNCPY_LEN))
> > +               strncpy_from_user_nofault(buf, (void __user *)unsafe_addr,
> > +                               BPF_STRNCPY_LEN);
> > +}
> 
> This seems buggy when I look at it.
> 
> It seems to think that strncpy_from_kernel_nofault() returns an error code.
> 
> Not so, unless I missed where you changed the rules.

I didn't change the rules, so yes, this is wrong.

> Also, I do wonder if we shouldn't gate this on TASK_SIZE, and do the
> user trial first. On architectures where this thing is valid in the
> first place (ie kernel and user addresses are separate), the test for
> address size would allow us to avoid a pointless fault due to an
> invalid kernel access to user space.
> 
> So I think this function should look something like
> 
>   static void bpf_strncpy(char *buf, long unsafe_addr)
>   {
>           /* Try user address */
>           if (unsafe_addr < TASK_SIZE) {
>                   void __user *ptr = (void __user *)unsafe_addr;
>                   if (strncpy_from_user_nofault(buf, ptr, BPF_STRNCPY_LEN) >= 0)
>                           return;
>           }
> 
>           /* .. fall back on trying kernel access */
>           buf[0] = 0;
>           strncpy_from_kernel_nofault(buf, (void *)unsafe_addr,
> BPF_STRNCPY_LEN);
>   }
> 
> or similar. No?

So on say s390 TASK_SIZE_USUALLy is (-PAGE_SIZE), which means we'd alway
try the user copy first, which seems odd.

I'd really like to here from the bpf folks what the expected use case
is here, and if the typical argument is kernel or user memory.
Linus Torvalds May 13, 2020, 11:03 p.m. UTC | #3
On Wed, May 13, 2020 at 3:36 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> It's used for both.

Daniel, BPF real;ly needs to make up its mind about that.

You *cannot* use ti for both.

Yes, it happens to work on x86 and some other architectures.

But on other architectures, the exact same pointer value can be a
kernel pointer or a user pointer.

> Given this is enabled on pretty much all program types, my
> assumption would be that usage is still more often on kernel memory than user one.

You need to pick one.

If you know it is a user pointer, use strncpy_from_user() (possibly
with disable_pagefault() aka strncpy_from_user_nofault()).

And if you know it is a kernel pointer, use strncpy_from_unsafe() (aka
strncpy_from_kernel_nofault()).

You really can't pick the "randomly one or the other guess what I mean " option.

                  Linus
Masami Hiramatsu (Google) May 13, 2020, 11:20 p.m. UTC | #4
On Thu, 14 May 2020 00:36:28 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 5/13/20 9:28 PM, Christoph Hellwig wrote:
> > On Wed, May 13, 2020 at 12:11:27PM -0700, Linus Torvalds wrote:
> >> On Wed, May 13, 2020 at 9:01 AM Christoph Hellwig <hch@lst.de> wrote:
> >>>
> >>> +static void bpf_strncpy(char *buf, long unsafe_addr)
> >>> +{
> >>> +       buf[0] = 0;
> >>> +       if (strncpy_from_kernel_nofault(buf, (void *)unsafe_addr,
> >>> +                       BPF_STRNCPY_LEN))
> >>> +               strncpy_from_user_nofault(buf, (void __user *)unsafe_addr,
> >>> +                               BPF_STRNCPY_LEN);
> >>> +}
> >>
> >> This seems buggy when I look at it.
> >>
> >> It seems to think that strncpy_from_kernel_nofault() returns an error code.
> >>
> >> Not so, unless I missed where you changed the rules.
> > 
> > I didn't change the rules, so yes, this is wrong.
> > 
> >> Also, I do wonder if we shouldn't gate this on TASK_SIZE, and do the
> >> user trial first. On architectures where this thing is valid in the
> >> first place (ie kernel and user addresses are separate), the test for
> >> address size would allow us to avoid a pointless fault due to an
> >> invalid kernel access to user space.
> >>
> >> So I think this function should look something like
> >>
> >>    static void bpf_strncpy(char *buf, long unsafe_addr)
> >>    {
> >>            /* Try user address */
> >>            if (unsafe_addr < TASK_SIZE) {
> >>                    void __user *ptr = (void __user *)unsafe_addr;
> >>                    if (strncpy_from_user_nofault(buf, ptr, BPF_STRNCPY_LEN) >= 0)
> >>                            return;
> >>            }
> >>
> >>            /* .. fall back on trying kernel access */
> >>            buf[0] = 0;
> >>            strncpy_from_kernel_nofault(buf, (void *)unsafe_addr,
> >> BPF_STRNCPY_LEN);
> >>    }
> >>
> >> or similar. No?
> > 
> > So on say s390 TASK_SIZE_USUALLy is (-PAGE_SIZE), which means we'd alway
> > try the user copy first, which seems odd.
> > 
> > I'd really like to here from the bpf folks what the expected use case
> > is here, and if the typical argument is kernel or user memory.
> 
> It's used for both. Given this is enabled on pretty much all program types, my
> assumption would be that usage is still more often on kernel memory than user one.

For trace_kprobe.c current order (kernel -> user fallback) is preferred
because it has another function dedicated for user memory.

Thank you,
Al Viro May 13, 2020, 11:28 p.m. UTC | #5
On Thu, May 14, 2020 at 12:36:28AM +0200, Daniel Borkmann wrote:

> > So on say s390 TASK_SIZE_USUALLy is (-PAGE_SIZE), which means we'd alway
> > try the user copy first, which seems odd.
> > 
> > I'd really like to here from the bpf folks what the expected use case
> > is here, and if the typical argument is kernel or user memory.
> 
> It's used for both. Given this is enabled on pretty much all program types, my
> assumption would be that usage is still more often on kernel memory than user one.

Then it needs an argument telling it which one to use.  Look at sparc64.
Or s390.  Or parisc.  Et sodding cetera.

The underlying model is that the kernel lives in a separate address space.
Yes, on x86 it's actually sharing the page tables with userland, but that's
not universal.  The same address can be both a valid userland one _and_
a valid kernel one.  You need to tell which one do you want.
Linus Torvalds May 13, 2020, 11:59 p.m. UTC | #6
On Wed, May 13, 2020 at 4:21 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
>
> For trace_kprobe.c current order (kernel -> user fallback) is preferred
> because it has another function dedicated for user memory.

Well, then it should just use the "strict" kernel-only one for the
non-user memory.

But yes, if there are legacy interfaces, then we might want to say
"these continue to work for the legacy case on platforms where we can
tell which kind of pointer it is from the bit pattern".

But we should likely at least disallow it entirely on platforms where
we really can't - or pick one hardcoded choice. On sparc, you really
_have_ to specify one or the other.

                  Linus
Masami Hiramatsu (Google) May 14, 2020, 1 a.m. UTC | #7
On Wed, 13 May 2020 16:59:40 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, May 13, 2020 at 4:21 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> >
> > For trace_kprobe.c current order (kernel -> user fallback) is preferred
> > because it has another function dedicated for user memory.
> 
> Well, then it should just use the "strict" kernel-only one for the
> non-user memory.
> 
> But yes, if there are legacy interfaces, then we might want to say
> "these continue to work for the legacy case on platforms where we can
> tell which kind of pointer it is from the bit pattern".

Yes, that was why I changed my mind and send reviewed-by last time.

https://lore.kernel.org/bpf/20200511142716.f1ff6fc55220012982c47fec@kernel.org/

> But we should likely at least disallow it entirely on platforms where
> we really can't - or pick one hardcoded choice. On sparc, you really
> _have_ to specify one or the other.

OK. BTW, is there any way to detect the kernel/user space overlap on
memory layout statically? If there, I can do it. (I don't like
"if (CONFIG_X86)" thing....)
Or, maybe we need CONFIG_ARCH_OVERLAP_ADDRESS_SPACE?

Thank you,
Linus Torvalds May 14, 2020, 2:43 a.m. UTC | #8
On Wed, May 13, 2020 at 6:00 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> > But we should likely at least disallow it entirely on platforms where
> > we really can't - or pick one hardcoded choice. On sparc, you really
> > _have_ to specify one or the other.
>
> OK. BTW, is there any way to detect the kernel/user space overlap on
> memory layout statically? If there, I can do it. (I don't like
> "if (CONFIG_X86)" thing....)
> Or, maybe we need CONFIG_ARCH_OVERLAP_ADDRESS_SPACE?

I think it would be better to have a CONFIG variable that
architectures can just 'select' to show that they are ok with separate
kernel and user addresses.

Because I don't think we have any way to say that right now as-is. You
can probably come up with hacky ways to approximate it, ie something
like

    if (TASK_SIZE_MAX > PAGE_OFFSET)
        .... they overlap ..

which would almost work, but..

                 Linus
Masami Hiramatsu (Google) May 14, 2020, 9:44 a.m. UTC | #9
On Wed, 13 May 2020 19:43:24 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, May 13, 2020 at 6:00 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > > But we should likely at least disallow it entirely on platforms where
> > > we really can't - or pick one hardcoded choice. On sparc, you really
> > > _have_ to specify one or the other.
> >
> > OK. BTW, is there any way to detect the kernel/user space overlap on
> > memory layout statically? If there, I can do it. (I don't like
> > "if (CONFIG_X86)" thing....)
> > Or, maybe we need CONFIG_ARCH_OVERLAP_ADDRESS_SPACE?
> 
> I think it would be better to have a CONFIG variable that
> architectures can just 'select' to show that they are ok with separate
> kernel and user addresses.
> 
> Because I don't think we have any way to say that right now as-is. You
> can probably come up with hacky ways to approximate it, ie something
> like
> 
>     if (TASK_SIZE_MAX > PAGE_OFFSET)
>         .... they overlap ..
> 
> which would almost work, but..

It seems TASK_SIZE_MAX is defined only on x86 and s390, what about
comparing STACK_TOP_MAX with PAGE_OFFSET ?
Anyway, I agree that the best way is introducing a CONFIG.

Thank you,
David Laight May 14, 2020, 10:01 a.m. UTC | #10
From: Daniel Borkmann
> Sent: 14 May 2020 00:59
> On 5/14/20 1:28 AM, Al Viro wrote:
> > On Thu, May 14, 2020 at 12:36:28AM +0200, Daniel Borkmann wrote:
> >
> >>> So on say s390 TASK_SIZE_USUALLy is (-PAGE_SIZE), which means we'd alway
> >>> try the user copy first, which seems odd.
> >>>
> >>> I'd really like to here from the bpf folks what the expected use case
> >>> is here, and if the typical argument is kernel or user memory.
> >>
> >> It's used for both. Given this is enabled on pretty much all program types, my
> >> assumption would be that usage is still more often on kernel memory than user one.
> >
> > Then it needs an argument telling it which one to use.  Look at sparc64.
> > Or s390.  Or parisc.  Et sodding cetera.
> >
> > The underlying model is that the kernel lives in a separate address space.
> > Yes, on x86 it's actually sharing the page tables with userland, but that's
> > not universal.  The same address can be both a valid userland one _and_
> > a valid kernel one.  You need to tell which one do you want.
> 
> Yes, see also 6ae08ae3dea2 ("bpf: Add probe_read_{user, kernel} and probe_read_{user,
> kernel}_str helpers"), and my other reply wrt bpf_trace_printk() on how to address
> this. All I'm trying to say is that both bpf_probe_read() and bpf_trace_printk() do
> exist in this form since early [e]bpf days for ~5yrs now and while broken on non-x86
> there are a lot of users on x86 for this in the wild, so they need to have a chance
> to migrate over to the new facilities before they are fully removed.

If it's not a stupid question why is a BPF program allowed to get
into a situation where it might have an invalid kernel address.

It all stinks of a hole that allows all of kernel memory to be read
and copied to userspace.

Now you might want to something special so that BPF programs just
abort on OOPS instead of possibly paniking the kernel.
But that is different from a copy that expects to be passed garbage.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
diff mbox series

Patch

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 7cfc10eb09c60..28944a14e0534 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -311,7 +311,6 @@  extern long probe_user_read(void *dst, const void __user *src, size_t size);
 extern long notrace probe_kernel_write(void *dst, const void *src, size_t size);
 extern long notrace probe_user_write(void __user *dst, const void *src, size_t size);
 
-extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr,
 		long count);
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3dd4763c195bb..0d849acc9de38 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -226,12 +226,14 @@  static __always_inline int
 bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr,
 				 const bool compat)
 {
+	const void __user *user_ptr = (__force const void __user *)unsafe_ptr;
 	int ret = security_locked_down(LOCKDOWN_BPF_READ);
 
 	if (unlikely(ret < 0))
-		goto out;
+		goto fail;
+
 	/*
-	 * The strncpy_from_unsafe_*() call will likely not fill the entire
+	 * The strncpy_from_*_nofault() calls will likely not fill the entire
 	 * buffer, but that's okay in this circumstance as we're probing
 	 * arbitrary memory anyway similar to bpf_probe_read_*() and might
 	 * as well probe the stack. Thus, memory is explicitly cleared
@@ -239,11 +241,16 @@  bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr,
 	 * code altogether don't copy garbage; otherwise length of string
 	 * is returned that can be used for bpf_perf_event_output() et al.
 	 */
-	ret = compat ? strncpy_from_unsafe(dst, unsafe_ptr, size) :
-	      strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
-	if (unlikely(ret < 0))
-out:
-		memset(dst, 0, size);
+	ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0)) {
+		if (compat)
+			ret = strncpy_from_user_nofault(dst, user_ptr, size);
+		if (unlikely(ret < 0))
+			goto fail;
+	}
+	return 0;
+fail:
+	memset(dst, 0, size);
 	return ret;
 }
 
@@ -321,6 +328,17 @@  static const struct bpf_func_proto *bpf_get_probe_write_proto(void)
 	return &bpf_probe_write_user_proto;
 }
 
+#define BPF_STRNCPY_LEN 64
+
+static void bpf_strncpy(char *buf, long unsafe_addr)
+{
+	buf[0] = 0;
+	if (strncpy_from_kernel_nofault(buf, (void *)unsafe_addr,
+			BPF_STRNCPY_LEN))
+		strncpy_from_user_nofault(buf, (void __user *)unsafe_addr,
+				BPF_STRNCPY_LEN);
+}
+
 /*
  * Only limited trace_printk() conversion specifiers allowed:
  * %d %i %u %x %ld %li %lu %lx %lld %lli %llu %llx %p %s
@@ -332,7 +350,7 @@  BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1,
 	int mod[3] = {};
 	int fmt_cnt = 0;
 	u64 unsafe_addr;
-	char buf[64];
+	char buf[BPF_STRNCPY_LEN];
 	int i;
 
 	/*
@@ -387,10 +405,7 @@  BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1,
 					arg3 = (long) buf;
 					break;
 				}
-				buf[0] = 0;
-				strncpy_from_unsafe(buf,
-						    (void *) (long) unsafe_addr,
-						    sizeof(buf));
+				bpf_strncpy(buf, unsafe_addr);
 			}
 			continue;
 		}
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 4325f9e7fadaa..8c456e30933d3 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1244,7 +1244,10 @@  fetch_store_string(unsigned long addr, void *dest, void *base)
 	 * Try to get string again, since the string can be changed while
 	 * probing.
 	 */
-	ret = strncpy_from_unsafe(__dest, (void *)addr, maxlen);
+	ret = strncpy_from_kernel_nofault(__dest, (void *)addr, maxlen);
+	if (ret < 0)
+		ret = strncpy_from_user_nofault(__dest, (void __user *)addr,
+				maxlen);
 	if (ret >= 0)
 		*(u32 *)dest = make_data_loc(ret, __dest - base);
 
diff --git a/mm/maccess.c b/mm/maccess.c
index 483a933b7d241..3d85e48013e6b 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -8,8 +8,6 @@ 
 
 static long __probe_kernel_read(void *dst, const void *src, size_t size,
 		bool strict);
-static long __strncpy_from_unsafe(char *dst, const void *unsafe_addr,
-		long count, bool strict);
 
 bool __weak probe_kernel_read_allowed(void *dst, const void *unsafe_src,
 		size_t size, bool strict)
@@ -156,35 +154,6 @@  long probe_user_write(void __user *dst, const void *src, size_t size)
 	return 0;
 }
 
-/**
- * strncpy_from_unsafe: - Copy a NUL terminated string from unsafe address.
- * @dst:   Destination address, in kernel space.  This buffer must be at
- *         least @count bytes long.
- * @unsafe_addr: Unsafe address.
- * @count: Maximum number of bytes to copy, including the trailing NUL.
- *
- * Copies a NUL-terminated string from unsafe address to kernel buffer.
- *
- * On success, returns the length of the string INCLUDING the trailing NUL.
- *
- * If access fails, returns -EFAULT (some data may have been copied
- * and the trailing NUL added).
- *
- * If @count is smaller than the length of the string, copies @count-1 bytes,
- * sets the last byte of @dst buffer to NUL and returns @count.
- *
- * Same as strncpy_from_kernel_nofault() except that for architectures with
- * not fully separated user and kernel address spaces this function also works
- * for user address tanges.
- *
- * DO NOT USE THIS FUNCTION - it is broken on architectures with entirely
- * separate kernel and user address spaces, and also a bad idea otherwise.
- */
-long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
-{
-	return __strncpy_from_unsafe(dst, unsafe_addr, count, false);
-}
-
 /**
  * strncpy_from_kernel_nofault: - Copy a NUL terminated string from unsafe
  *				 address.
@@ -204,12 +173,6 @@  long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count)
  * sets the last byte of @dst buffer to NUL and returns @count.
  */
 long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count)
-{
-	return __strncpy_from_unsafe(dst, unsafe_addr, count, true);
-}
-
-static long __strncpy_from_unsafe(char *dst, const void *unsafe_addr,
-		long count, bool strict)
 {
 	mm_segment_t old_fs = get_fs();
 	const void *src = unsafe_addr;
@@ -217,7 +180,7 @@  static long __strncpy_from_unsafe(char *dst, const void *unsafe_addr,
 
 	if (unlikely(count <= 0))
 		return 0;
-	if (!probe_kernel_read_allowed(dst, unsafe_addr, count, strict))
+	if (!probe_kernel_read_allowed(dst, unsafe_addr, count, true))
 		return -EFAULT;
 
 	set_fs(KERNEL_DS);