diff mbox series

[14/18] maccess: allow architectures to provide kernel probing directly

Message ID 20200513160038.2482415-15-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series [01/18] maccess: unexport probe_kernel_write and probe_user_write | expand

Commit Message

Christoph Hellwig May 13, 2020, 4 p.m. UTC
Provide alternative versions of probe_kernel_read, probe_kernel_write
and strncpy_from_kernel_unsafe that don't need set_fs magic, but instead
use arch hooks that are modelled after unsafe_{get,put}_user to access
kernel memory in an exception safe way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 mm/maccess.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

Comments

Linus Torvalds May 13, 2020, 7:36 p.m. UTC | #1
On Wed, May 13, 2020 at 9:01 AM Christoph Hellwig <hch@lst.de> wrote:
>
> +               arch_kernel_read(dst, src, type, err_label);            \

I'm wondering if

 (a) we shouldn't expose this as an interface in general

 (b) it wouldn't be named differently..

The reason for (a) is that several users of the
"copy_from_kernel_nofault()" interfaces just seem to want a single
access from kernel mode.

The reason for (b) is that if we do expose this as a normal interface,
it shouldn't be called "arch_kernel_read", and it should have the same
semantics as "get_user_unsafe()".

IOW, maybe we should simply do exactly that: have a
"get_kernel_nofault()" thing that looks exactly like
unsafe_get_user().

On x86, it would basically be identical to unsafe_get_user().

And on architectures that only have the copy function, you'd just have
a fallback something like this:

  #define get_kernel_nofault(dst, src, err_label) do {  \
        typeof (*src) __gkn_result;                     \
        if (probe_kernel_read(&__gkn_result, src) < 0)  \
                goto err_label;                         \
        (dst) = __gkn_result;                           \
  } while (0)

and now the people who want to read a single kernel word can just do

        get_kernel_nofault(n, untrusted_pointer, error);

and they're done.

And some day - when we get reliably "asm goto" wiith outputs - that
"get_kernel_fault()" will literally be a single instruction asm with
the proper exception handler marker, the way "put_user_unsafe()"
already works (and the way "put_kernel_nofault()" would already work
if it does the above).

             Linus
Christoph Hellwig May 13, 2020, 7:40 p.m. UTC | #2
On Wed, May 13, 2020 at 12:36:18PM -0700, Linus Torvalds wrote:
> > +               arch_kernel_read(dst, src, type, err_label);            \
> 
> I'm wondering if
> 
>  (a) we shouldn't expose this as an interface in general

We do export something like it, currently it is called
probe_kernel_address, and the last patch renames it to
get_kernel_nofault.  However it is implemented as a wrapper
around probe_kernel_address / copy_from_kernel_nofault and thus
not quite as efficient and without the magic goto semantics.

>  (b) it wouldn't be named differently..

It probably should with all the renaming..
Linus Torvalds May 13, 2020, 7:48 p.m. UTC | #3
On Wed, May 13, 2020 at 12:40 PM Christoph Hellwig <hch@lst.de> wrote:
>
> We do export something like it, currently it is called
> probe_kernel_address, and the last patch renames it to
> get_kernel_nofault.  However it is implemented as a wrapper
> around probe_kernel_address / copy_from_kernel_nofault and thus
> not quite as efficient and without the magic goto semantics.

Looking at the current users of "probe_kernel_read()", it looks like
it's almost mostly things that just want a single byte or word.

It's not 100% that: we definitely do several things that want the
"copy" semantics vs the "get" semantics: on the x86 side we have
CALL_INSN_SIZE and MAX_INSN_SIZE, and the ldttss_desc.

But the bulk of them do seem to be a single value.

I don't know if performance really matters here, but to me the whole
"most users seem to want to read a single value" is what makes me
think that maybe that should be the primary model, rather than have
the copy model be the primary one and then we implement the single
value case (badly) with a copy.

It probably doesn't matter that much. I certainly wouldn't hold this
series up over it - it can be a future thing.

         Linus
Christoph Hellwig May 13, 2020, 7:54 p.m. UTC | #4
On Wed, May 13, 2020 at 12:48:54PM -0700, Linus Torvalds wrote:
> Looking at the current users of "probe_kernel_read()", it looks like
> it's almost mostly things that just want a single byte or word.
> 
> It's not 100% that: we definitely do several things that want the
> "copy" semantics vs the "get" semantics: on the x86 side we have
> CALL_INSN_SIZE and MAX_INSN_SIZE, and the ldttss_desc.
> 
> But the bulk of them do seem to be a single value.
> 
> I don't know if performance really matters here, but to me the whole
> "most users seem to want to read a single value" is what makes me
> think that maybe that should be the primary model, rather than have
> the copy model be the primary one and then we implement the single
> value case (badly) with a copy.
> 
> It probably doesn't matter that much. I certainly wouldn't hold this
> series up over it - it can be a future thing.

I can make the get_kernel_nofault implementation suck a little less :)

Note that the arch helper (we could call it unsafe_get_kernel_nofault)
we still need to have a pagefault_disable / pagefault_enable pair
around the calls.  So maybe keep the get_kernel_nofault interface
as-is (without the goto label), and prepare the arch helpers for
being used similar to unsafe_get_user once all architectures are
converted.  And I can throw in a few patches to convert callers
from the copy semantics to the get semantics.
Masami Hiramatsu (Google) May 16, 2020, 3:42 a.m. UTC | #5
Hi Christoph,

On Wed, 13 May 2020 18:00:34 +0200
Christoph Hellwig <hch@lst.de> wrote:

> Provide alternative versions of probe_kernel_read, probe_kernel_write
> and strncpy_from_kernel_unsafe that don't need set_fs magic, but instead
> use arch hooks that are modelled after unsafe_{get,put}_user to access
> kernel memory in an exception safe way.

This patch seems to introduce new implementation of probe_kernel_read/write()
and strncpy_from_kernel_unsafe(), but also drops copy_from/to_kernel_nofault()
and strncpy_from_kernel_nofault() if HAVE_ARCH_PROBE_KERNEL is defined.
In the result, this cause a link error with BPF and kprobe events.

BTW, what is the difference of *_unsafe() and *_nofault()?
(maybe we make those to *_nofault() finally?)

Thank you,

> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  mm/maccess.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 76 insertions(+)
> 
> diff --git a/mm/maccess.c b/mm/maccess.c
> index 9773e2253b495..e9efe2f98e34a 100644
> --- a/mm/maccess.c
> +++ b/mm/maccess.c
> @@ -12,6 +12,81 @@ bool __weak probe_kernel_read_allowed(void *dst, const void *unsafe_src,
>  	return true;
>  }
>  
> +#ifdef HAVE_ARCH_PROBE_KERNEL
> +
> +#define probe_kernel_read_loop(dst, src, len, type, err_label)		\
> +	while (len >= sizeof(type)) {					\
> +		arch_kernel_read(dst, src, type, err_label);		\
> +		dst += sizeof(type);					\
> +		src += sizeof(type);					\
> +		len -= sizeof(type);					\
> +	}
> +
> +long probe_kernel_read(void *dst, const void *src, size_t size)
> +{
> +	if (!probe_kernel_read_allowed(dst, src, size))
> +		return -EFAULT;
> +
> +	pagefault_disable();
> +	probe_kernel_read_loop(dst, src, size, u64, Efault);
> +	probe_kernel_read_loop(dst, src, size, u32, Efault);
> +	probe_kernel_read_loop(dst, src, size, u16, Efault);
> +	probe_kernel_read_loop(dst, src, size, u8, Efault);
> +	pagefault_enable();
> +	return 0;
> +Efault:
> +	pagefault_enable();
> +	return -EFAULT;
> +}
> +EXPORT_SYMBOL_GPL(probe_kernel_read);
> +
> +#define probe_kernel_write_loop(dst, src, len, type, err_label)		\
> +	while (len >= sizeof(type)) {					\
> +		arch_kernel_write(dst, src, type, err_label);		\
> +		dst += sizeof(type);					\
> +		src += sizeof(type);					\
> +		len -= sizeof(type);					\
> +	}
> +
> +long probe_kernel_write(void *dst, const void *src, size_t size)
> +{
> +	pagefault_disable();
> +	probe_kernel_write_loop(dst, src, size, u64, Efault);
> +	probe_kernel_write_loop(dst, src, size, u32, Efault);
> +	probe_kernel_write_loop(dst, src, size, u16, Efault);
> +	probe_kernel_write_loop(dst, src, size, u8, Efault);
> +	pagefault_enable();
> +	return 0;
> +Efault:
> +	pagefault_enable();
> +	return -EFAULT;
> +}
> +
> +long strncpy_from_kernel_unsafe(char *dst, const void *unsafe_addr, long count)
> +{
> +	const void *src = unsafe_addr;
> +
> +	if (unlikely(count <= 0))
> +		return 0;
> +	if (!probe_kernel_read_allowed(dst, unsafe_addr, count))
> +		return -EFAULT;
> +
> +	pagefault_disable();
> +	do {
> +		arch_kernel_read(dst, src, u8, Efault);
> +		dst++;
> +		src++;
> +	} while (dst[-1] && src - unsafe_addr < count);
> +	pagefault_enable();
> +
> +	dst[-1] = '\0';
> +	return src - unsafe_addr;
> +Efault:
> +	pagefault_enable();
> +	dst[-1] = '\0';
> +	return -EFAULT;
> +}
> +#else /* HAVE_ARCH_PROBE_KERNEL */
>  /**
>   * probe_kernel_read(): safely attempt to read from kernel-space
>   * @dst: pointer to the buffer that shall take the data
> @@ -114,6 +189,7 @@ long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count)
>  
>  	return ret ? -EFAULT : src - unsafe_addr;
>  }
> +#endif /* HAVE_ARCH_PROBE_KERNEL */
>  
>  /**
>   * probe_user_read(): safely attempt to read from a user-space location
> -- 
> 2.26.2
>
Christoph Hellwig May 18, 2020, 3:09 p.m. UTC | #6
On Sat, May 16, 2020 at 12:42:59PM +0900, Masami Hiramatsu wrote:
> > Provide alternative versions of probe_kernel_read, probe_kernel_write
> > and strncpy_from_kernel_unsafe that don't need set_fs magic, but instead
> > use arch hooks that are modelled after unsafe_{get,put}_user to access
> > kernel memory in an exception safe way.
> 
> This patch seems to introduce new implementation of probe_kernel_read/write()
> and strncpy_from_kernel_unsafe(), but also drops copy_from/to_kernel_nofault()
> and strncpy_from_kernel_nofault() if HAVE_ARCH_PROBE_KERNEL is defined.
> In the result, this cause a link error with BPF and kprobe events.

That was just a bug as I didn't commit the changes to switch everything
to _nofault and remove _unsafe entirely, sorry.
diff mbox series

Patch

diff --git a/mm/maccess.c b/mm/maccess.c
index 9773e2253b495..e9efe2f98e34a 100644
--- a/mm/maccess.c
+++ b/mm/maccess.c
@@ -12,6 +12,81 @@  bool __weak probe_kernel_read_allowed(void *dst, const void *unsafe_src,
 	return true;
 }
 
+#ifdef HAVE_ARCH_PROBE_KERNEL
+
+#define probe_kernel_read_loop(dst, src, len, type, err_label)		\
+	while (len >= sizeof(type)) {					\
+		arch_kernel_read(dst, src, type, err_label);		\
+		dst += sizeof(type);					\
+		src += sizeof(type);					\
+		len -= sizeof(type);					\
+	}
+
+long probe_kernel_read(void *dst, const void *src, size_t size)
+{
+	if (!probe_kernel_read_allowed(dst, src, size))
+		return -EFAULT;
+
+	pagefault_disable();
+	probe_kernel_read_loop(dst, src, size, u64, Efault);
+	probe_kernel_read_loop(dst, src, size, u32, Efault);
+	probe_kernel_read_loop(dst, src, size, u16, Efault);
+	probe_kernel_read_loop(dst, src, size, u8, Efault);
+	pagefault_enable();
+	return 0;
+Efault:
+	pagefault_enable();
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(probe_kernel_read);
+
+#define probe_kernel_write_loop(dst, src, len, type, err_label)		\
+	while (len >= sizeof(type)) {					\
+		arch_kernel_write(dst, src, type, err_label);		\
+		dst += sizeof(type);					\
+		src += sizeof(type);					\
+		len -= sizeof(type);					\
+	}
+
+long probe_kernel_write(void *dst, const void *src, size_t size)
+{
+	pagefault_disable();
+	probe_kernel_write_loop(dst, src, size, u64, Efault);
+	probe_kernel_write_loop(dst, src, size, u32, Efault);
+	probe_kernel_write_loop(dst, src, size, u16, Efault);
+	probe_kernel_write_loop(dst, src, size, u8, Efault);
+	pagefault_enable();
+	return 0;
+Efault:
+	pagefault_enable();
+	return -EFAULT;
+}
+
+long strncpy_from_kernel_unsafe(char *dst, const void *unsafe_addr, long count)
+{
+	const void *src = unsafe_addr;
+
+	if (unlikely(count <= 0))
+		return 0;
+	if (!probe_kernel_read_allowed(dst, unsafe_addr, count))
+		return -EFAULT;
+
+	pagefault_disable();
+	do {
+		arch_kernel_read(dst, src, u8, Efault);
+		dst++;
+		src++;
+	} while (dst[-1] && src - unsafe_addr < count);
+	pagefault_enable();
+
+	dst[-1] = '\0';
+	return src - unsafe_addr;
+Efault:
+	pagefault_enable();
+	dst[-1] = '\0';
+	return -EFAULT;
+}
+#else /* HAVE_ARCH_PROBE_KERNEL */
 /**
  * probe_kernel_read(): safely attempt to read from kernel-space
  * @dst: pointer to the buffer that shall take the data
@@ -114,6 +189,7 @@  long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count)
 
 	return ret ? -EFAULT : src - unsafe_addr;
 }
+#endif /* HAVE_ARCH_PROBE_KERNEL */
 
 /**
  * probe_user_read(): safely attempt to read from a user-space location