
[RFC,v2,0/12] hardening: statically allocated protected memory

Message ID 20181219213338.26619-1-igor.stoppa@huawei.com (mailing list archive)

Message

Igor Stoppa Dec. 19, 2018, 9:33 p.m. UTC
Patch-set implementing write-rare memory protection for statically
allocated data.
Its purpose is to keep write-protected kernel data which is seldom modified.
There is no read overhead; however, writing requires special operations that
are probably unsuitable for often-changing data.
The use is opt-in, by applying the modifier __wr_after_init to a variable
declaration.

As the name implies, the write protection kicks in only after init() is
completed; before that moment, the data is modifiable in the usual way.
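
For example, the IMA patch later in this series does roughly the following
(a sketch, not the exact hunks):

	/* opt in at declaration time */
	int ima_policy_flag __wr_after_init;

	/* after init, direct writes would fault; updates go through the helpers */
	wr_assign(ima_policy_flag, ima_policy_flag | IMA_APPRAISE);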

Current Limitations:
* supports only data which is allocated statically, at build time.
* supports only x86_64; other architectures need to provide their own backend

Some notes:
- there is a part of generic code which is basically a NOP, but which should
  allow using the write protection unconditionally. It will automatically
  fall back to non-protected functionality, if the specific architecture
  doesn't support write-rare
- to avoid the risk of weakening __ro_after_init, __wr_after_init data is
  in a separate set of pages, and any invocation will confirm that the
  memory affected falls within this range.
  rodata_test is modified accordingly, to also check this case.
- for now, the patchset addresses only x86_64, as each architecture seems
  to have its own way of dealing with user space. Once a few are implemented,
  it should be more obvious what code can be refactored as common.
- the memset_user() assembly function seems to work, but I'm not too sure
  it's really ok
- I've added a simple example: the protection of ima_policy_flags
- the last patch is optional, but it seemed worthwhile to do the refactoring

Changelog:

v1->v2

* introduce cleaner split between generic and arch code
* add x86_64 specific memset_user()
* replace kernel-space memset()/memcpy() with their userspace counterparts
* randomize the base address for the alternate map across the entire
  available address range from user space (128TB - 64TB)
* convert BUG() to WARN()
* turn verification of written data into debugging option
* wr_rcu_assign_pointer() as special case of wr_assign()
* example with protection of ima_policy_flags
* documentation

CC: Andy Lutomirski <luto@amacapital.net>
CC: Nadav Amit <nadav.amit@gmail.com>
CC: Matthew Wilcox <willy@infradead.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Kees Cook <keescook@chromium.org>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Mimi Zohar <zohar@linux.vnet.ibm.com>
CC: linux-integrity@vger.kernel.org
CC: kernel-hardening@lists.openwall.com
CC: linux-mm@kvack.org
CC: linux-kernel@vger.kernel.org

Igor Stoppa (12):
	[PATCH 01/12] x86_64: memset_user()
	[PATCH 02/12] __wr_after_init: linker section and label
	[PATCH 03/12] __wr_after_init: generic header
	[PATCH 04/12] __wr_after_init: x86_64: __wr_op
	[PATCH 05/12] __wr_after_init: x86_64: debug writes
	[PATCH 06/12] __wr_after_init: Documentation: self-protection
	[PATCH 07/12] __wr_after_init: lkdtm test
	[PATCH 08/12] rodata_test: refactor tests
	[PATCH 09/12] rodata_test: add verification for __wr_after_init
	[PATCH 10/12] __wr_after_init: test write rare functionality
	[PATCH 11/12] IMA: turn ima_policy_flags into __wr_after_init
	[PATCH 12/12] x86_64: __clear_user as case of __memset_user


Documentation/security/self-protection.rst |  14 ++-
arch/Kconfig                               |  15 +++
arch/x86/Kconfig                           |   1 +
arch/x86/include/asm/uaccess_64.h          |   6 +
arch/x86/lib/usercopy_64.c                 |  41 +++++--
arch/x86/mm/Makefile                       |   2 +
arch/x86/mm/prmem.c                        | 127 +++++++++++++++++++++
drivers/misc/lkdtm/core.c                  |   3 +
drivers/misc/lkdtm/lkdtm.h                 |   3 +
drivers/misc/lkdtm/perms.c                 |  29 +++++
include/asm-generic/vmlinux.lds.h          |  25 +++++
include/linux/cache.h                      |  21 ++++
include/linux/prmem.h                      | 142 ++++++++++++++++++++++++
init/main.c                                |   2 +
mm/Kconfig.debug                           |  16 +++
mm/Makefile                                |   1 +
mm/rodata_test.c                           |  69 ++++++++----
mm/test_write_rare.c                       | 135 ++++++++++++++++++++++
security/integrity/ima/ima.h               |   3 +-
security/integrity/ima/ima_init.c          |   5 +-
security/integrity/ima/ima_policy.c        |   9 +-
21 files changed, 629 insertions(+), 40 deletions(-)

Comments

Igor Stoppa Dec. 20, 2018, 4:53 p.m. UTC | #1
On 19/12/2018 23:33, Igor Stoppa wrote:

> +	if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
> +	    WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
> +		return (void *)dst;
> +
> +	offset = dst - (unsigned long)&__start_wr_after_init;

I forgot to remove the offset.
If the whole kernel memory is remapped, it is shifted by wr_poking_base.
I'll fix it in the next iteration.

> +	wr_poking_addr = wr_poking_base + offset;

	wr_poking_addr = wr_poking_base + dst;

--
igor
Thiago Jung Bauermann Dec. 20, 2018, 5:20 p.m. UTC | #2
Hello Igor,

> +/*
> + * The following two variables are statically allocated by the linker
> + * script at the the boundaries of the memory region (rounded up to
> + * multiples of PAGE_SIZE) reserved for __wr_after_init.
> + */
> +extern long __start_wr_after_init;
> +extern long __end_wr_after_init;
> +
> +static inline bool is_wr_after_init(unsigned long ptr, __kernel_size_t size)
> +{
> +	unsigned long start = (unsigned long)&__start_wr_after_init;
> +	unsigned long end = (unsigned long)&__end_wr_after_init;
> +	unsigned long low = ptr;
> +	unsigned long high = ptr + size;
> +
> +	return likely(start <= low && low <= high && high <= end);
> +}
> +
> +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
> +	      enum wr_op_type op)
> +{
> +	temporary_mm_state_t prev;
> +	unsigned long offset;
> +	unsigned long wr_poking_addr;
> +
> +	/* Confirm that the writable mapping exists. */
> +	if (WARN_ONCE(!wr_ready, "No writable mapping available"))
> +		return (void *)dst;
> +
> +	if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
> +	    WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
> +		return (void *)dst;
> +
> +	offset = dst - (unsigned long)&__start_wr_after_init;
> +	wr_poking_addr = wr_poking_base + offset;
> +	local_irq_disable();
> +	prev = use_temporary_mm(wr_poking_mm);
> +
> +	if (op == WR_MEMCPY)
> +		copy_to_user((void __user *)wr_poking_addr, (void *)src, len);
> +	else if (op == WR_MEMSET)
> +		memset_user((void __user *)wr_poking_addr, (u8)src, len);
> +
> +	unuse_temporary_mm(prev);
> +	local_irq_enable();
> +	return (void *)dst;
> +}

There's a lot of casting back and forth between unsigned long and void *
(also in the previous patch). Is there a reason for that? My impression
is that there would be fewer casts if variables holding addresses were
declared as void * in the first place. In that case, it wouldn't hurt to
have an additional argument in __wr_op() to carry the byte value for the
WR_MEMSET operation.
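
For example, something along these lines (just a sketch of what I mean,
not tested):

	void *__wr_op(void *dst, const void *src, int c, __kernel_size_t len,
		      enum wr_op_type op);

	static inline void *wr_memcpy(void *dst, const void *src,
				      __kernel_size_t len)
	{
		return __wr_op(dst, src, 0, len, WR_MEMCPY);
	}

	static inline void *wr_memset(void *dst, int c, __kernel_size_t len)
	{
		return __wr_op(dst, NULL, c, len, WR_MEMSET);
	}

The casts to (void __user *) could then be confined to the body of
__wr_op().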

> +
> +#define TB (1UL << 40)
> +
> +struct mm_struct *copy_init_mm(void);
> +void __init wr_poking_init(void)
> +{
> +	unsigned long start = (unsigned long)&__start_wr_after_init;
> +	unsigned long end = (unsigned long)&__end_wr_after_init;
> +	unsigned long i;
> +	unsigned long wr_range;
> +
> +	wr_poking_mm = copy_init_mm();
> +	if (WARN_ONCE(!wr_poking_mm, "No alternate mapping available."))
> +		return;
> +
> +	wr_range = round_up(end - start, PAGE_SIZE);
> +
> +	/* Randomize the poking address base*/
> +	wr_poking_base = TASK_UNMAPPED_BASE +
> +		(kaslr_get_random_long("Write Rare Poking") & PAGE_MASK) %
> +		(TASK_SIZE - (TASK_UNMAPPED_BASE + wr_range));
> +
> +	/*
> +	 * Place 64TB of kernel address space within 128TB of user address
> +	 * space, at a random page aligned offset.
> +	 */
> +	wr_poking_base = (((unsigned long)kaslr_get_random_long("WR Poke")) &
> +			  PAGE_MASK) % (64 * _BITUL(40));

You're setting wr_poking_base twice in a row? Is this an artifact from
rebase?

--
Thiago Jung Bauermann
IBM Linux Technology Center
Thiago Jung Bauermann Dec. 20, 2018, 5:30 p.m. UTC | #3
Hello Igor,

Igor Stoppa <igor.stoppa@gmail.com> writes:

> diff --git a/security/integrity/ima/ima_init.c b/security/integrity/ima/ima_init.c
> index 59d834219cd6..5f4e13e671bf 100644
> --- a/security/integrity/ima/ima_init.c
> +++ b/security/integrity/ima/ima_init.c
> @@ -21,6 +21,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/slab.h>
>  #include <linux/err.h>
> +#include <linux/prmem.h>
>
>  #include "ima.h"
>
> @@ -98,9 +99,9 @@ void __init ima_load_x509(void)
>  {
>  	int unset_flags = ima_policy_flag & IMA_APPRAISE;
>
> -	ima_policy_flag &= ~unset_flags;
> +	wr_assign(ima_policy_flag, ima_policy_flag & ~unset_flags);
>  	integrity_load_x509(INTEGRITY_KEYRING_IMA, CONFIG_IMA_X509_PATH);
> -	ima_policy_flag |= unset_flags;
> +	wr_assign(ima_policy_flag, ima_policy_flag | unset_flags);
>  }
>  #endif

In the cover letter, you said:

> As the name implies, the write protection kicks in only after init()
> is completed; before that moment, the data is modifiable in the usual
> way.

Given that, is it still necessary or useful to use wr_assign() in a
function marked with __init?

--
Thiago Jung Bauermann
IBM Linux Technology Center
Igor Stoppa Dec. 20, 2018, 5:46 p.m. UTC | #4
Hi,

On 20/12/2018 19:20, Thiago Jung Bauermann wrote:
> 
> Hello Igor,
> 
>> +/*
>> + * The following two variables are statically allocated by the linker
>> + * script at the the boundaries of the memory region (rounded up to
>> + * multiples of PAGE_SIZE) reserved for __wr_after_init.
>> + */
>> +extern long __start_wr_after_init;
>> +extern long __end_wr_after_init;
>> +
>> +static inline bool is_wr_after_init(unsigned long ptr, __kernel_size_t size)
>> +{
>> +	unsigned long start = (unsigned long)&__start_wr_after_init;
>> +	unsigned long end = (unsigned long)&__end_wr_after_init;
>> +	unsigned long low = ptr;
>> +	unsigned long high = ptr + size;
>> +
>> +	return likely(start <= low && low <= high && high <= end);
>> +}
>> +
>> +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
>> +	      enum wr_op_type op)
>> +{
>> +	temporary_mm_state_t prev;
>> +	unsigned long offset;
>> +	unsigned long wr_poking_addr;
>> +
>> +	/* Confirm that the writable mapping exists. */
>> +	if (WARN_ONCE(!wr_ready, "No writable mapping available"))
>> +		return (void *)dst;
>> +
>> +	if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
>> +	    WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
>> +		return (void *)dst;
>> +
>> +	offset = dst - (unsigned long)&__start_wr_after_init;
>> +	wr_poking_addr = wr_poking_base + offset;
>> +	local_irq_disable();
>> +	prev = use_temporary_mm(wr_poking_mm);
>> +
>> +	if (op == WR_MEMCPY)
>> +		copy_to_user((void __user *)wr_poking_addr, (void *)src, len);
>> +	else if (op == WR_MEMSET)
>> +		memset_user((void __user *)wr_poking_addr, (u8)src, len);
>> +
>> +	unuse_temporary_mm(prev);
>> +	local_irq_enable();
>> +	return (void *)dst;
>> +}
> 
> There's a lot of casting back and forth between unsigned long and void *
> (also in the previous patch). Is there a reason for that?

The intention is to ensure that algebraic operations between addresses 
are performed as intended, rather than gcc applying some incorrect 
optimization, wrongly assuming that two addresses belong to the same object.

That said, I can certainly have a further look at the code and see if I 
can reduce the amount of casts. I do not like them either.

But I'm not sure how much can be dropped: if I start from (void *), then 
I have to cast them to unsigned long for the math.

And the xxx_user() operations require a (void __user *).

> My impression
> is that there would be less casts if variables holding addresses were
> declared as void * in the first place. 

It might save 1 or 2 casts. I'll do the count.

> In that case, it wouldn't hurt to
> have an additional argument in __rw_op() to carry the byte value for the
> WR_MEMSET operation.

Wouldn't it clobber one more register? Or can gcc figure out that it's 
not used? __wr_op() is not inline.

>> +
>> +#define TB (1UL << 40)

^^^^^^^^^^^^^^^^^^^^^^^^^^spurious

>> +
>> +struct mm_struct *copy_init_mm(void);
>> +void __init wr_poking_init(void)
>> +{
>> +	unsigned long start = (unsigned long)&__start_wr_after_init;
>> +	unsigned long end = (unsigned long)&__end_wr_after_init;
>> +	unsigned long i;
>> +	unsigned long wr_range;
>> +
>> +	wr_poking_mm = copy_init_mm();
>> +	if (WARN_ONCE(!wr_poking_mm, "No alternate mapping available."))
>> +		return;
>> +
>> +	wr_range = round_up(end - start, PAGE_SIZE);
>> +
>> +	/* Randomize the poking address base*/
>> +	wr_poking_base = TASK_UNMAPPED_BASE +
>> +		(kaslr_get_random_long("Write Rare Poking") & PAGE_MASK) %
>> +		(TASK_SIZE - (TASK_UNMAPPED_BASE + wr_range));
>> +
>> +	/*
>> +	 * Place 64TB of kernel address space within 128TB of user address
>> +	 * space, at a random page aligned offset.
>> +	 */
>> +	wr_poking_base = (((unsigned long)kaslr_get_random_long("WR Poke")) &
>> +			  PAGE_MASK) % (64 * _BITUL(40));
> 
> You're setting wr_poking_base twice in a row? Is this an artifact from
> rebase?

Yes, the first is a leftover. Thanks for spotting it.

--
igor
Igor Stoppa Dec. 20, 2018, 5:49 p.m. UTC | #5
Hi,

On 20/12/2018 19:30, Thiago Jung Bauermann wrote:
> 
> Hello Igor,
> 
> Igor Stoppa <igor.stoppa@gmail.com> writes:
> 
>> diff --git a/security/integrity/ima/ima_init.c b/security/integrity/ima/ima_init.c
>> index 59d834219cd6..5f4e13e671bf 100644
>> --- a/security/integrity/ima/ima_init.c
>> +++ b/security/integrity/ima/ima_init.c
>> @@ -21,6 +21,7 @@
>>   #include <linux/scatterlist.h>
>>   #include <linux/slab.h>
>>   #include <linux/err.h>
>> +#include <linux/prmem.h>
>>
>>   #include "ima.h"
>>
>> @@ -98,9 +99,9 @@ void __init ima_load_x509(void)
>>   {
>>   	int unset_flags = ima_policy_flag & IMA_APPRAISE;
>>
>> -	ima_policy_flag &= ~unset_flags;
>> +	wr_assign(ima_policy_flag, ima_policy_flag & ~unset_flags);
>>   	integrity_load_x509(INTEGRITY_KEYRING_IMA, CONFIG_IMA_X509_PATH);
>> -	ima_policy_flag |= unset_flags;
>> +	wr_assign(ima_policy_flag, ima_policy_flag | unset_flags);
>>   }
>>   #endif
> 
> In the cover letter, you said:
> 
>> As the name implies, the write protection kicks in only after init()
>> is completed; before that moment, the data is modifiable in the usual
>> way.
> 
> Given that, is it still necessary or useful to use wr_assign() in a
> function marked with __init?

I might have been over-enthusiastic in using the wr interface.
You are right, I can drop these two. Thank you.

--
igor
Matthew Wilcox (Oracle) Dec. 20, 2018, 6:49 p.m. UTC | #6
On Wed, Dec 19, 2018 at 11:33:30PM +0200, Igor Stoppa wrote:
> +void *__wr_op(unsigned long dst, unsigned long src, __kernel_size_t len,
> +	      enum wr_op_type op)
> +{
> +	temporary_mm_state_t prev;
> +	unsigned long offset;
> +	unsigned long wr_poking_addr;
> +
> +	/* Confirm that the writable mapping exists. */
> +	if (WARN_ONCE(!wr_ready, "No writable mapping available"))
> +		return (void *)dst;
> +
> +	if (WARN_ONCE(op >= WR_OPS_NUMBER, "Invalid WR operation.") ||
> +	    WARN_ONCE(!is_wr_after_init(dst, len), "Invalid WR range."))
> +		return (void *)dst;
> +
> +	offset = dst - (unsigned long)&__start_wr_after_init;
> +	wr_poking_addr = wr_poking_base + offset;
> +	local_irq_disable();
> +	prev = use_temporary_mm(wr_poking_mm);
> +
> +	if (op == WR_MEMCPY)
> +		copy_to_user((void __user *)wr_poking_addr, (void *)src, len);
> +	else if (op == WR_MEMSET)
> +		memset_user((void __user *)wr_poking_addr, (u8)src, len);
> +
> +	unuse_temporary_mm(prev);
> +	local_irq_enable();
> +	return (void *)dst;
> +}

I think you're causing yourself more headaches by implementing this "op"
function.  Here's some generic code:

void *wr_memcpy(void *dst, void *src, unsigned int len)
{
	wr_state_t wr_state;
	void *wr_poking_addr = __wr_addr(dst);

	local_irq_disable();
	wr_enable(&wr_state);
	__wr_memcpy(wr_poking_addr, src, len);
	wr_disable(&wr_state);
	local_irq_enable();

	return dst;
}

Now, x86 can define appropriate macros and functions to use the temporary_mm
functionality, and other architectures can do what makes sense to them.
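
On x86 that could boil down to something like this (untested sketch,
reusing the names already used in this patch):

	typedef temporary_mm_state_t wr_state_t;

	static inline void wr_enable(wr_state_t *state)
	{
		*state = use_temporary_mm(wr_poking_mm);
	}

	static inline void wr_disable(wr_state_t *state)
	{
		unuse_temporary_mm(*state);
	}

	static inline void *__wr_addr(void *dst)
	{
		return (void *)(wr_poking_base +
				((unsigned long)dst -
				 (unsigned long)&__start_wr_after_init));
	}

	static inline void __wr_memcpy(void *poking_addr, const void *src,
				       unsigned int len)
	{
		copy_to_user((void __user *)poking_addr, src, len);
	}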
Igor Stoppa Dec. 20, 2018, 7:19 p.m. UTC | #7
On 20/12/2018 20:49, Matthew Wilcox wrote:

> I think you're causing yourself more headaches by implementing this "op"
> function.  

I probably misinterpreted the initial criticism on my first patchset, 
about duplication. Somehow, I'm still thinking of the endgame of having 
higher-level functions, like list management.

> Here's some generic code:

thank you, I have one question, below

> void *wr_memcpy(void *dst, void *src, unsigned int len)
> {
> 	wr_state_t wr_state;
> 	void *wr_poking_addr = __wr_addr(dst);
> 
> 	local_irq_disable();
> 	wr_enable(&wr_state);
> 	__wr_memcpy(wr_poking_addr, src, len);

Is __wr_addr() invoked inside wr_memcpy() instead of being invoked 
privately within __wr_memcpy() because the code is generic, or is there 
some other reason?

> 	wr_disable(&wr_state);
> 	local_irq_enable();
> 
> 	return dst;
> }
> 
> Now, x86 can define appropriate macros and functions to use the temporary_mm
> functionality, and other architectures can do what makes sense to them.
> 

--
igor
Matthew Wilcox (Oracle) Dec. 20, 2018, 7:27 p.m. UTC | #8
On Thu, Dec 20, 2018 at 09:19:15PM +0200, Igor Stoppa wrote:
> On 20/12/2018 20:49, Matthew Wilcox wrote:
> > I think you're causing yourself more headaches by implementing this "op"
> > function.
> 
> I probably misinterpreted the initial criticism on my first patchset, about
> duplication. Somehow, I'm still thinking to the endgame of having
> higher-level functions, like list management.
> 
> > Here's some generic code:
> 
> thank you, I have one question, below
> 
> > void *wr_memcpy(void *dst, void *src, unsigned int len)
> > {
> > 	wr_state_t wr_state;
> > 	void *wr_poking_addr = __wr_addr(dst);
> > 
> > 	local_irq_disable();
> > 	wr_enable(&wr_state);
> > 	__wr_memcpy(wr_poking_addr, src, len);
> 
> Is __wraddr() invoked inside wm_memcpy() instead of being invoked privately
> within __wr_memcpy() because the code is generic, or is there some other
> reason?

I was assuming that __wr_addr() might be costly, and we were trying to
minimise the number of instructions executed while write-rare was enabled.
Andy Lutomirski Dec. 21, 2018, 5:23 p.m. UTC | #9
On Thu, Dec 20, 2018 at 11:19 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>
>
>
> On 20/12/2018 20:49, Matthew Wilcox wrote:
>
> > I think you're causing yourself more headaches by implementing this "op"
> > function.
>
> I probably misinterpreted the initial criticism on my first patchset,
> about duplication. Somehow, I'm still thinking to the endgame of having
> higher-level functions, like list management.
>
> > Here's some generic code:
>
> thank you, I have one question, below
>
> > void *wr_memcpy(void *dst, void *src, unsigned int len)
> > {
> >       wr_state_t wr_state;
> >       void *wr_poking_addr = __wr_addr(dst);
> >
> >       local_irq_disable();
> >       wr_enable(&wr_state);
> >       __wr_memcpy(wr_poking_addr, src, len);
>
> Is __wraddr() invoked inside wm_memcpy() instead of being invoked
> privately within __wr_memcpy() because the code is generic, or is there
> some other reason?
>
> >       wr_disable(&wr_state);
> >       local_irq_enable();
> >
> >       return dst;
> > }
> >
> > Now, x86 can define appropriate macros and functions to use the temporary_mm
> > functionality, and other architectures can do what makes sense to them.
> >

I suspect that most architectures will want to do this exactly like
x86, though, but sure, it could be restructured like this.

On x86, I *think* that __wr_memcpy() will want to special-case len ==
1, 2, 4, and (on 64-bit) 8 byte writes to keep them atomic. I'm
guessing this is the same on most or all architectures.
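
Something like this, perhaps (untested, just to illustrate, reusing the
names from the current __wr_op()):

	/* naturally aligned 1/2/4/8-byte writes can stay atomic */
	switch (len) {
	case 1:
		put_user(*(u8 *)src, (u8 __user *)wr_poking_addr);
		break;
	case 2:
		put_user(*(u16 *)src, (u16 __user *)wr_poking_addr);
		break;
	case 4:
		put_user(*(u32 *)src, (u32 __user *)wr_poking_addr);
		break;
	case 8:
		put_user(*(u64 *)src, (u64 __user *)wr_poking_addr);
		break;
	default:
		copy_to_user((void __user *)wr_poking_addr, (void *)src, len);
	}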

>
> --
> igor
Igor Stoppa Dec. 21, 2018, 5:42 p.m. UTC | #10
On 21/12/2018 19:23, Andy Lutomirski wrote:
> On Thu, Dec 20, 2018 at 11:19 AM Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>
>>
>>
>> On 20/12/2018 20:49, Matthew Wilcox wrote:
>>
>>> I think you're causing yourself more headaches by implementing this "op"
>>> function.
>>
>> I probably misinterpreted the initial criticism on my first patchset,
>> about duplication. Somehow, I'm still thinking to the endgame of having
>> higher-level functions, like list management.
>>
>>> Here's some generic code:
>>
>> thank you, I have one question, below
>>
>>> void *wr_memcpy(void *dst, void *src, unsigned int len)
>>> {
>>>        wr_state_t wr_state;
>>>        void *wr_poking_addr = __wr_addr(dst);
>>>
>>>        local_irq_disable();
>>>        wr_enable(&wr_state);
>>>        __wr_memcpy(wr_poking_addr, src, len);
>>
>> Is __wraddr() invoked inside wm_memcpy() instead of being invoked
>> privately within __wr_memcpy() because the code is generic, or is there
>> some other reason?
>>
>>>        wr_disable(&wr_state);
>>>        local_irq_enable();
>>>
>>>        return dst;
>>> }
>>>
>>> Now, x86 can define appropriate macros and functions to use the temporary_mm
>>> functionality, and other architectures can do what makes sense to them.

> I suspect that most architectures will want to do this exactly like
> x86, though, but sure, it could be restructured like this.

In spirit, I think yes, but I already couldn't find a clean way to do a 
multi-arch wr_enable(&wr_state), so I made that too arch-dependent.

Maybe after implementing write rare for a few archs, it will become 
clearer (to me, any advice is welcome) which parts can be considered common.

> On x86, I *think* that __wr_memcpy() will want to special-case len ==
> 1, 2, 4, and (on 64-bit) 8 byte writes to keep them atomic. i'm
> guessing this is the same on most or all architectures.

I switched to the xxx_user() approach, as you suggested.
For x86_64 I'm using copy_user() and I added a memset_user(), based on 
copy_user().

It's already assembly code optimized for dealing with multiples of 
8-byte words or subsets. You can see this in the first patch of the 
patchset, even this one.

I'll send out the v3 patchset in a short while.

--
igor
Matthew Wilcox (Oracle) Dec. 21, 2018, 7:45 p.m. UTC | #11
On Fri, Dec 21, 2018 at 11:38:16AM -0800, Nadav Amit wrote:
> > On Dec 19, 2018, at 1:33 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
> > 
> > +static inline void *wr_memset(void *p, int c, __kernel_size_t len)
> > +{
> > +	return __wr_op((unsigned long)p, (unsigned long)c, len, WR_MEMSET);
> > +}
> 
> What do you think about doing something like:
> 
> #define __wr          __attribute__((address_space(5)))
> 
> And then make all the pointers to write-rarely memory to use this attribute?
> It might require more changes to the code, but can prevent bugs.

I like this idea.  It was something I was considering suggesting.
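
Wiring it up the way __user/__iomem/__percpu are handled by sparse would
be something like this (sketch; I'm also adding noderef, so that plain
dereferences get flagged, not just address space mismatches):

	#ifdef __CHECKER__
	# define __wr	__attribute__((noderef, address_space(5)))
	#else
	# define __wr
	#endif

	extern int __wr ima_policy_flag;

	static void example(void)
	{
		int __wr *p = &ima_policy_flag;

		*p = 0;		/* sparse: dereference of noderef expression */
	}

The wr_*() helpers would then internally drop the attribute with __force
casts, the same way copy_to_user() and friends do for __user pointers.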
Igor Stoppa Dec. 23, 2018, 2:28 a.m. UTC | #12
On 21/12/2018 21:45, Matthew Wilcox wrote:
> On Fri, Dec 21, 2018 at 11:38:16AM -0800, Nadav Amit wrote:
>>> On Dec 19, 2018, at 1:33 PM, Igor Stoppa <igor.stoppa@gmail.com> wrote:
>>>
>>> +static inline void *wr_memset(void *p, int c, __kernel_size_t len)
>>> +{
>>> +	return __wr_op((unsigned long)p, (unsigned long)c, len, WR_MEMSET);
>>> +}
>>
>> What do you think about doing something like:
>>
>> #define __wr          __attribute__((address_space(5)))
>>
>> And then make all the pointers to write-rarely memory to use this attribute?
>> It might require more changes to the code, but can prevent bugs.
> 
> I like this idea.  It was something I was considering suggesting.

I have been thinking about this sort of problem, although from a slightly 
different angle:

1) enforcing alignment for pointers
This can be implemented in a similar way, by creating a multi-attribute 
that would define section, address space (as said here) and alignment.
However, I'm not sure if it's possible to do anything to enforce the 
alignment of a pointer field within a structure. I haven't had time to 
look into this yet.

2) validation of the correctness of the actual value
Inside the kernel code, a function is not supposed to sanitize its 
arguments, as long as they come from some other trusted part of the 
kernel, rather than, say, from userspace or from some HW interface.
However, ROP/JOP should be considered.

I am aware of various efforts to make it harder to exploit these 
techniques, like signed pointers, CFI plugins, LTO.

But they are not necessarily available on every platform and, mostly, 
afaik, they focus on specific types of attacks.


LTO can help with global optimizations, for example inlining functions 
across different objects.

CFI can detect jumps into the middle of a function, rather than proper 
function invocation from its natural entry point.

Signed pointers can prevent data-based attacks on the execution flow, 
and they might have a role in preventing the attack I have in mind, but 
they are not available on all platforms.

What I'd like to do is to verify, at runtime, that the pointer belongs 
to the type that the receiving function is meant for.

Ex: legitimate __wr_after_init data must lie between 
__start_wr_after_init and __end_wr_after_init

That is easier and cleaner to test, imho.

But dynamically allocated memory doesn't have any such constraint.
If it was possible to introduce, for example, a flag to pass to vmalloc, 
to get the vmap_area from within a specific address range, it would 
reduce the attack surface.

In the implementation I have right now, I'm using extra flags for the 
pmalloc pages, which means the metadata is the new target for an attack.

But by adding the constraint that a dynamically allocated protected 
memory page must be within a certain range, the attacker must change the 
underlying PTE. And if a region of PTEs is entirely part of protected 
memory, it is possible to make the PMD write rare.

--
igor