
[1/1] Sealable memory support

Message ID 20170519103811.2183-2-igor.stoppa@huawei.com (mailing list archive)
State New, archived

Commit Message

Igor Stoppa May 19, 2017, 10:38 a.m. UTC
Dynamically allocated variables can be made read only,
after they have been initialized, provided that they reside in memory
pages devoid of any RW data.

The implementation supplies the means to create independent pools of
memory, which can be individually sealed/unsealed and destroyed.

A global pool is made available for those kernel modules that do not
need to manage an independent pool.
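
For illustration, a minimal sketch of the intended lifecycle (struct
my_data and the module below are placeholders, not part of this patch):

#include <linux/module.h>
#include <linux/init.h>
#include "smalloc.h"

struct my_data {                                /* placeholder payload */
        unsigned long flags;
        void (*handler)(void);
};

static struct smalloc_pool *pool;
static struct my_data *data;

static int __init my_init(void)
{
        pool = smalloc_create();
        if (!pool)
                return -ENOMEM;
        data = smalloc(sizeof(*data), pool);    /* pool still unsealed */
        if (!data) {
                smalloc_destroy(pool);
                return -ENOMEM;
        }
        data->flags = 1;                        /* initialize while writable */
        smalloc_seal(pool);                     /* *data is now read-only */
        return 0;
}

static void __exit my_exit(void)
{
        smalloc_unseal(pool);                   /* unseal only to tear down */
        smalloc_destroy(pool);
}

module_init(my_init);
module_exit(my_exit);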

Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
---
 mm/Makefile  |   2 +-
 mm/smalloc.c | 200 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/smalloc.h |  61 ++++++++++++++++++
 3 files changed, 262 insertions(+), 1 deletion(-)
 create mode 100644 mm/smalloc.c
 create mode 100644 mm/smalloc.h

Comments

Greg Kroah-Hartman May 20, 2017, 8:51 a.m. UTC | #1
On Fri, May 19, 2017 at 01:38:11PM +0300, Igor Stoppa wrote:
> Dynamically allocated variables can be made read only,
> after they have been initialized, provided that they reside in memory
> pages devoid of any RW data.
> 
> The implementation supplies the means to create independent pools of
> memory, which can be individually sealed/unsealed and destroyed.
> 
> A global pool is made available for those kernel modules that do not
> need to manage an independent pool.
> 
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> ---
>  mm/Makefile  |   2 +-
>  mm/smalloc.c | 200 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/smalloc.h |  61 ++++++++++++++++++
>  3 files changed, 262 insertions(+), 1 deletion(-)
>  create mode 100644 mm/smalloc.c
>  create mode 100644 mm/smalloc.h

This is really nice, do you have a follow-on patch showing how any of
the kernel can be changed to use this new subsystem?  Without that, it
might be hard to get this approved (we don't like adding new apis
without users.)

thanks,

greg k-h
Igor Stoppa May 22, 2017, 7:45 p.m. UTC | #2
On 20/05/17 11:51, Greg KH wrote:
> On Fri, May 19, 2017 at 01:38:11PM +0300, Igor Stoppa wrote:
>> Dynamically allocated variables can be made read only,

[...]

> This is really nice, do you have a follow-on patch showing how any of
> the kernel can be changed to use this new subsystem?

Yes, actually I started from the need to turn some data structures
read-only, in both LSM Hooks and SE Linux.

> Without that, it
> might be hard to get this approved (we don't like adding new apis
> without users.)

Yes, I just wanted to give an early preview of the current
implementation, since it is significantly different from what I
initially proposed. So I was looking for early feedback.

Right now, I'm fixing it up and adding some more structured debugging.

Then I'll re-submit it together with the LSM Hooks example.

---
thanks, igor
Kees Cook May 22, 2017, 9:38 p.m. UTC | #3
On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
> Dynamically allocated variables can be made read only,
> after they have been initialized, provided that they reside in memory
> pages devoid of any RW data.
>
> The implementation supplies the means to create independent pools of
> memory, which can be individually sealed/unsealed and destroyed.

This would be a welcome addition, thanks for posting it! I have a
bunch of feedback, here and below:

For the first bit of bikeshedding, should this really be called
seal/unseal? My mind is probably just broken from having read TPM
documentation, but this isn't really "sealing" as I'd understand it
(it's not tied to a credential, for example). It's "only" rw/ro.
Perhaps "protect/unprotect" or just simply "readonly/writable", and
call the base function "romalloc"?

This is fundamentally a heap allocator, with linked lists, etc. I'd
like to see as much attention as possible given to hardening it
against attacks, especially adding redzoning around the metadata at
least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled. And as
part of that, I'd like hardened usercopy to grow knowledge of these
allocations so we can bounds-check objects. Right now, mm/usercopy.c
just looks at PageSlab(page) to decide if it should do slab checks. I
think adding a check for this type of object would be very important
there.
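
As a rough, hypothetical sketch of what such a check could look like
(PageSmalloc() and smalloc_object_size() are invented names, and since
smalloc memory comes from vmalloc(), the page has to be looked up with
vmalloc_to_page()):

static const char *check_smalloc_object(const void *ptr, unsigned long n)
{
        struct page *page;

        if (!is_vmalloc_addr(ptr))
                return NULL;
        page = vmalloc_to_page(ptr);
        if (!page || !PageSmalloc(page))
                return NULL;            /* not an smalloc page */
        if (n > smalloc_object_size(page, ptr))
                return "smalloc";       /* copy would overrun the object */
        return NULL;                    /* copy stays within the object */
}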

The ro/rw granularity here is the _entire_ pool, not a specific
allocation (or page containing the allocation). I'm concerned that
makes this very open to race conditions where, especially in the
global pool, one thread can be trying to write to ro data in a pool
and another has made the pool writable.

> A global pool is made available for those kernel modules that do not
> need to manage an independent pool.
>
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> ---
>  mm/Makefile  |   2 +-
>  mm/smalloc.c | 200 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/smalloc.h |  61 ++++++++++++++++++
>  3 files changed, 262 insertions(+), 1 deletion(-)
>  create mode 100644 mm/smalloc.c
>  create mode 100644 mm/smalloc.h
>
> diff --git a/mm/Makefile b/mm/Makefile
> index 026f6a8..737c42a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -39,7 +39,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o \
>                            mm_init.o mmu_context.o percpu.o slab_common.o \
>                            compaction.o vmacache.o swap_slots.o \
>                            interval_tree.o list_lru.o workingset.o \
> -                          debug.o $(mmu-y)
> +                          debug.o smalloc.o $(mmu-y)
>
>  obj-y += init-mm.o
>
> diff --git a/mm/smalloc.c b/mm/smalloc.c
> new file mode 100644
> index 0000000..fa04cc5
> --- /dev/null
> +++ b/mm/smalloc.c
> @@ -0,0 +1,200 @@
> +/*
> + * smalloc.c: Sealable Memory Allocator
> + *
> + * (C) Copyright 2017 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa <igor.stoppa@huawei.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/printk.h>
> +#include <linux/kobject.h>
> +#include <linux/sysfs.h>
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/string.h>
> +
> +
> +#include <linux/vmalloc.h>
> +#include <asm/cacheflush.h>
> +#include "smalloc.h"

Shouldn't this just be <linux/smalloc.h> ?

> +#define page_roundup(size) (((size) + !(size) - 1 + PAGE_SIZE) & PAGE_MASK)
> +
> +#define pages_nr(size) (page_roundup(size) / PAGE_SIZE)
> +
> +static struct smalloc_pool *global_pool;
> +
> +struct smalloc_node *__smalloc_create_node(unsigned long words)
> +{
> +       struct smalloc_node *node;
> +       unsigned long size;
> +
> +       /* Calculate the size to ask from vmalloc, page aligned. */
> +       size = page_roundup(NODE_HEADER_SIZE + words * sizeof(align_t));
> +       node = vmalloc(size);
> +       if (!node) {
> +               pr_err("No memory for allocating smalloc node.");
> +               return NULL;
> +       }
> +       /* Initialize the node.*/
> +       INIT_LIST_HEAD(&node->list);
> +       node->free = node->data;
> +       node->available_words = (size - NODE_HEADER_SIZE) / sizeof(align_t);
> +       return node;
> +}
> +
> +static __always_inline
> +void *node_alloc(struct smalloc_node *node, unsigned long words)
> +{
> +       register align_t *old_free = node->free;
> +
> +       node->available_words -= words;
> +       node->free += words;
> +       return old_free;
> +}
> +
> +void *smalloc(unsigned long size, struct smalloc_pool *pool)
> +{
> +       struct list_head *pos;
> +       struct smalloc_node *node;
> +       void *ptr;
> +       unsigned long words;
> +
> +       /* If no pool specified, use the global one. */
> +       if (!pool)
> +               pool = global_pool;

It should be impossible, but this should check for global_pool == NULL too, IMO.

> +       mutex_lock(&pool->lock);
> +
> +       /* If the pool is sealed, then return NULL. */
> +       if (pool->seal == SMALLOC_SEALED) {
> +               mutex_unlock(&pool->lock);
> +               return NULL;
> +       }
> +
> +       /* Calculate minimum number of words required. */
> +       words = (size + sizeof(align_t) - 1) / sizeof(align_t);
> +
> +       /* Look for slot that is large enough, in the existing pool.*/
> +       list_for_each(pos, &pool->list) {
> +               node = list_entry(pos, struct smalloc_node, list);
> +               if (node->available_words >= words) {
> +                       ptr = node_alloc(node, words);
> +                       mutex_unlock(&pool->lock);
> +                       return ptr;
> +               }
> +       }
> +
> +       /* No slot found, get a new chunk of virtual memory. */
> +       node = __smalloc_create_node(words);
> +       if (!node) {
> +               mutex_unlock(&pool->lock);
> +               return NULL;
> +       }
> +
> +       list_add(&node->list, &pool->list);
> +       ptr = node_alloc(node, words);
> +       mutex_unlock(&pool->lock);
> +       return ptr;
> +}
> +
> +static __always_inline
> +unsigned long get_node_size(struct smalloc_node *node)
> +{
> +       if (!node)
> +               return 0;
> +       return page_roundup((((void *)node->free) - (void *)node) +
> +                           node->available_words * sizeof(align_t));
> +}
> +
> +static __always_inline
> +unsigned long get_node_pages_nr(struct smalloc_node *node)
> +{
> +       return pages_nr(get_node_size(node));
> +}
> +void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool)
> +{
> +       struct list_head *pos;
> +       struct smalloc_node *node;
> +
> +       if (!pool)
> +               pool = global_pool;
> +       mutex_lock(&pool->lock);
> +       if (pool->seal == seal) {
> +               mutex_unlock(&pool->lock);
> +               return;

I actually think this should be a BUG condition, since this means a
mismatched seal/unseal happened. The pool should never be left
writable at rest, a user should create a pool, write to it, seal. Any
updates should unseal, write, seal. To attempt an unseal and find it
already unsealed seems bad.

Finding a user for this would help clarify its protection properties,
too. (The LSM example is likely not the best starting point for that,
as it would depend on other changes that are under discussion.)

> +       }
> +       list_for_each(pos, &pool->list) {
> +               node = list_entry(pos, struct smalloc_node, list);
> +               if (seal == SMALLOC_SEALED)
> +                       set_memory_ro((unsigned long)node,
> +                                     get_node_pages_nr(node));
> +               else if (seal == SMALLOC_UNSEALED)
> +                       set_memory_rw((unsigned long)node,
> +                                     get_node_pages_nr(node));
> +       }
> +       pool->seal = seal;
> +       mutex_unlock(&pool->lock);
> +}
> +
> +int smalloc_initialize(struct smalloc_pool *pool)
> +{
> +       if (!pool)
> +               return -EINVAL;
> +       INIT_LIST_HEAD(&pool->list);
> +       pool->seal = SMALLOC_UNSEALED;
> +       mutex_init(&pool->lock);
> +       return 0;
> +}
> +
> +struct smalloc_pool *smalloc_create(void)
> +{
> +       struct smalloc_pool *pool = vmalloc(sizeof(struct smalloc_pool));
> +
> +       if (!pool) {
> +               pr_err("No memory for allocating pool.");

It might be handy to have pools named like they are for the slab allocator.

> +               return NULL;
> +       }
> +       smalloc_initialize(pool);
> +       return pool;
> +}
> +
> +int smalloc_destroy(struct smalloc_pool *pool)
> +{
> +       struct list_head *pos, *q;
> +       struct smalloc_node *node;
> +
> +       if (!pool)
> +               return -EINVAL;
> +       list_for_each_safe(pos, q, &pool->list) {
> +               node = list_entry(pos, struct smalloc_node, list);
> +               list_del(pos);
> +               vfree(node);
> +       }
> +       return 0;
> +}
> +
> +static int __init smalloc_init(void)
> +{
> +       global_pool = smalloc_create();
> +       if (!global_pool) {
> +               pr_err("Module smalloc initialization failed: no memory.\n");
> +               return -ENOMEM;
> +       }
> +       pr_info("Module smalloc initialized successfully.\n");
> +       return 0;
> +}
> +
> +static void __exit smalloc_exit(void)
> +{
> +       pr_info("Module smalloc un initialized successfully.\n");

typo: space after "un"

> +}
> +
> +module_init(smalloc_init);
> +module_exit(smalloc_exit);
> +MODULE_LICENSE("GPL");
> diff --git a/mm/smalloc.h b/mm/smalloc.h
> new file mode 100644
> index 0000000..344d962
> --- /dev/null
> +++ b/mm/smalloc.h
> @@ -0,0 +1,61 @@
> +/*
> + * smalloc.h: Header for Sealable Memory Allocator
> + *
> + * (C) Copyright 2017 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa <igor.stoppa@huawei.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.
> + */
> +
> +#ifndef _SMALLOC_H
> +#define _SMALLOC_H
> +
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +
> +typedef uint64_t align_t;
> +
> +enum seal_t {
> +       SMALLOC_UNSEALED,
> +       SMALLOC_SEALED,
> +};
> +
> +#define __SMALLOC_ALIGNED__ __aligned(sizeof(align_t))
> +
> +#define NODE_HEADER                                    \
> +       struct {                                        \
> +               __SMALLOC_ALIGNED__ struct {            \
> +                       struct list_head list;          \
> +                       align_t *free;                  \
> +                       unsigned long available_words;  \
> +               };                                      \
> +       }
> +
> +#define NODE_HEADER_SIZE sizeof(NODE_HEADER)
> +
> +struct smalloc_pool {
> +       struct list_head list;
> +       struct mutex lock;
> +       enum seal_t seal;
> +};
> +
> +struct smalloc_node {
> +       NODE_HEADER;
> +       __SMALLOC_ALIGNED__ align_t data[];
> +};
> +
> +#define smalloc_seal(pool) \
> +       smalloc_seal_set(SMALLOC_SEALED, pool)
> +
> +#define smalloc_unseal(pool) \
> +       smalloc_seal_set(SMALLOC_UNSEALED, pool)
> +
> +struct smalloc_pool *smalloc_create(void);
> +int smalloc_destroy(struct smalloc_pool *pool);
> +int smalloc_initialize(struct smalloc_pool *pool);
> +void *smalloc(unsigned long size, struct smalloc_pool *pool);
> +void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool);

I'd really like to see kernel-doc for the API functions (likely in the .c file).

> +#endif
> --
> 2.9.3

Thanks again for working on this! If you can find examples of file
operations living in the heap, those would be great examples for using
this API (assuming the other properties can be improved).

-Kees
Igor Stoppa May 23, 2017, 9:43 a.m. UTC | #4
On 23/05/17 00:38, Kees Cook wrote:
> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

> For the first bit of bikeshedding, should this really be called
> seal/unseal? My mind is probably just broken from having read TPM
> documentation, but this isn't really "sealing" as I'd understand it
> (it's not tied to a credential, for example). It's "only" rw/ro.
> Perhaps "protect/unprotect" or just simply "readonly/writable", and
> call the base function "romalloc"?

I was not aware of the specific meaning of "seal" in this context.
The term was implicitly proposed by Michal Hocko, while discussing
the mechanism, and I liked it more than what I was using initially:
"lockable".

tbh I like the sound of "smalloc" better than "romalloc"

But this is really the least of my worries :-P

> This is fundamentally a heap allocator, with linked lists, etc. I'd
> like to see as much attention as possible given to hardening it
> against attacks, especially adding redzoning around the metadata at
> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.

My initial goal was to provide something that is useful without
affecting performance.

You seem to be pushing for a more extreme approach.
While I have nothing against it and I actually agree that it can be
useful, I would not make it mandatory.

More on this later.

> And as
> part of that, I'd like hardened usercopy to grow knowledge of these
> allocations so we can bounds-check objects. Right now, mm/usercopy.c
> just looks at PageSlab(page) to decide if it should do slab checks. I
> think adding a check for this type of object would be very important
> there.

I am not familiar with this and I need to study it; however, I still
think that if there is a significant trade-off in terms of performance
vs resilience, it should be optional, for those who want it.

Maybe there could be a master toggle for these options, if it makes
sense to group them logically. How does this sound?

> The ro/rw granularity here is the _entire_ pool, not a specific
> allocation (or page containing the allocation). I'm concerned that
> makes this very open to race conditions where, especially in the
> global pool, one thread can be trying to write to ro data in a pool
> and another has made the pool writable.

I have the impression we are thinking of something different.
Close, but different enough.

First of all, how can using a mutex create races?
Do you mean with respect of other resources that might be held by
competing users of the pool?

That, imho, is a locking problem that cannot be solved here.
You can try to mitigate it, by reducing the chances it will happen, but
basically you are trying to make do with a user of the API that is not
implementing locking correctly.
I'd say that it's better to fix the user.

If you meant something else, I really didn't get it :-)

More about the frequency of access: you seem to expect seal/unseal -
or lock/unlock - to happen very often, while I don't.

What I envision as primary case for unlocking is the tear down of the
entire pool, for example preceding the unloading of the module.

Think about __ro_after_init : once it is r/o, it stays r/o.

Which also means that there would possibly be a busy transient, when
allocations are made. That is true.
But only for shared pools. If a module uses one pool, I would expect the
initialization of the module to be mostly sequential.
So, no chance of races. Do you agree?

Furthermore, if a module _really_ wants to do parallel allocation, then
maybe it's simpler and cleaner to have one pool per "thread" or whatever
is the mechanism used.


The global pool is mostly there for completing the offering of
__ro_after_init. If one wants to use it, it's available.
It saves the trouble of having one's own pool, but it means competing
with the rest of the kernel for locking.

If it seems a bad idea, it can be easily removed.

[...]

>> +#include "smalloc.h"
> 
> Shouldn't this just be <linux/smalloc.h> ?

yes, I have fixed it in the next revision

[...]

>> +       /* If no pool specified, use the global one. */
>> +       if (!pool)
>> +               pool = global_pool;
> 
> It should be impossible, but this should check for global_pool == NULL too, IMO.

I'm curious: why do you think it could happen?
I'd rather throw an exception, if I cannot initialize global_pool.

[...]

>> +       if (pool->seal == seal) {
>> +               mutex_unlock(&pool->lock);
>> +               return;
> 
> I actually think this should be a BUG condition, since this means a
> mismatched seal/unseal happened. The pool should never be left
> writable at rest, a user should create a pool, write to it, seal. 

On this I agree.

> Any
> updates should unseal, write, seal. To attempt an unseal and find it
> already unsealed seems bad.

But what you are describing seems exactly the case where there could be
a race.
EX: 2 writers try to unseal and add/change something.
They could have a collision both when trying to seal and unseal.

One could argue that there should be an atomic unseal-update-seal and
maybe unseal-allocate-seal.

I would understand the need for that - and maybe that's really how it
should be implemented, for doing what you seem to envision.
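
(A hypothetical sketch of such a wrapper - nothing like it is in this
patch, and note that it only narrows the writable window, it does not
remove it:)

static inline void smalloc_update(struct smalloc_pool *pool,
                                  void *dst, const void *src, size_t n)
{
        smalloc_unseal(pool);   /* pool-wide write window opens here */
        memcpy(dst, src, n);
        smalloc_seal(pool);     /* window closes again */
}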

But that's where I think we have different use-cases in mind.

I'd rather not add extra locking to something that doesn't need it:
Allocate - write - seal - read, read, read, ... - unseal - destroy.

What you describe is more complex and does need extra locking, but I
would advocate a different implementation, at least initially.
Later, if it makes sense, fold them into one.

> Finding a user for this would help clarify its protection properties,
> too. (The LSM example is likely not the best starting point for that,
> as it would depend on other changes that are under discussion.)

Well, I *have* to do the LSM - and SE Linux too.

That's where this started and why I'm here with this patch.

From what I could understand about the discussion, the debate about the
changes is settled (is it?).

Casey seemed to be ok with this.

If something else is needed, ok, I can do that too, but I didn't see any
specific reason why LSM should not be good enough as first example.

If you can provide a specific reason why it's not suitable, I can
reconsider.

[...]

>> +       if (!pool) {
>> +               pr_err("No memory for allocating pool.");
> 
> It might be handy to have pools named like they are for the slab allocator.

I considered this.
The only reason I could come up with why it might be desirable is if the
same pool needs to be accessed from two or more places that do not share
the pointer. It doesn't seem particularly useful.
The only upside I can think of is that it would save a memory page, vs
the case of creating 2 separate pools.
It would also introduce locking cross-dependency.

Did I overlook some reason why it would be desirable?

[...]

>> +int smalloc_destroy(struct smalloc_pool *pool)
>> +{
>> +       struct list_head *pos, *q;
>> +       struct smalloc_node *node;
>> +
>> +       if (!pool)
>> +               return -EINVAL;

locking was missing, I added it in the new version
also I moved to goto-like error handling, since there were several
similar exit paths.

>> +       list_for_each_safe(pos, q, &pool->list) {
>> +               node = list_entry(pos, struct smalloc_node, list);

btw, here and in other places I have switched to list_for_each_entry

[...]

> typo: space after "un"

ok

[...]

>> +typedef uint64_t align_t;
>> +
>> +enum seal_t {
>> +       SMALLOC_UNSEALED,
>> +       SMALLOC_SEALED,
>> +};
>> +
>> +#define __SMALLOC_ALIGNED__ __aligned(sizeof(align_t))

How about the alignment? Is it a desirable feature?
Did I overlook some reason why it would not work?

>> +#define NODE_HEADER                                    \
>> +       struct {                                        \
>> +               __SMALLOC_ALIGNED__ struct {            \
>> +                       struct list_head list;          \
>> +                       align_t *free;                  \
>> +                       unsigned long available_words;  \
>> +               };                                      \
>> +       }

Does this look ok? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[...]

> I'd really like to see kernel-doc for the API functions (likely in the .c file).

Yes, I just see no point right now, since the API doesn't seem to be
agreed/finalized yet.

> Thanks again for working on this! If you can find examples of file
> operations living in the heap, those would be great examples for using
> this API (assuming the other properties can be improved).

As I explained above, I would prefer to continue with LSM, unless you
have some specific reason against it.

If you have some example in mind about file operations living on the
heap - which I suspect you do have :) - we could also discuss those,
and whether the locking needs to be modified, but from my perspective
the use case is significantly different and I wouldn't pile it up with this.
Of course I might be missing the point you are trying to make.
In that case, I'm afraid that further explanation is needed from you, as
I do not get it :-)


Thanks a lot for the review, this is exactly the sort of early feedback
I was hoping to receive.

--
igor
Kees Cook May 23, 2017, 8:11 p.m. UTC | #5
On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
> On 23/05/17 00:38, Kees Cook wrote:
>> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> [...]
>
>> For the first bit of bikeshedding, should this really be called
>> seal/unseal? My mind is probably just broken from having read TPM
>> documentation, but this isn't really "sealing" as I'd understand it
>> (it's not tied to a credential, for example). It's "only" rw/ro.
>> Perhaps "protect/unprotect" or just simply "readonly/writable", and
>> call the base function "romalloc"?
>
> I was not aware of the specific meaning of "seal" in this context.
> The term was implicitly proposed by Michal Hocko, while discussing
> the mechanism, and I liked it more than what I was using initially:
> "lockable".
>
> tbh I like the sound of "smalloc" better than "romalloc"
>
> But this is really the least of my worries :-P
>
>> This is fundamentally a heap allocator, with linked lists, etc. I'd
>> like to see as much attention as possible given to hardening it
>> against attacks, especially adding redzoning around the metadata at
>> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.
>
> My initial goal was to provide something that is useful without
> affecting performance.
>
> You seem to be pushing for a more extreme approach.

I don't know about pushing, but it seemed like the API provided for
arbitrary unsealing, etc. More below...

> While I have nothing against it and I actually agree that it can be
> useful, I would not make it mandatory.
>
> More on this later.
>
>> And as
>> part of that, I'd like hardened usercopy to grow knowledge of these
>> allocations so we can bounds-check objects. Right now, mm/usercopy.c
>> just looks at PageSlab(page) to decide if it should do slab checks. I
>> think adding a check for this type of object would be very important
>> there.
>
> I am not familiar with this and I need to study it; however, I still
> think that if there is a significant trade-off in terms of performance
> vs resilience, it should be optional, for those who want it.

I would want hardened usercopy support as a requirement for using
smalloc(). Without it, we're regressing the over-read protection that
already exists for slab objects, if kernel code switched from slab to
smalloc. It should be very similar to the existing slab checks. "Is
this a smalloc object? Have we read beyond the end of a given object?"
etc. The metadata is all there, except for an efficient way to mark a
page as a smalloc page, but I think that just requires a new Page$Name
bit test, as done for slab.
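
Hypothetically, mirroring how PageSlab() is wired up in
include/linux/page-flags.h (PG_smalloc would be a new, invented page
flag):

/* in enum pageflags: PG_smalloc, then: */
PAGEFLAG(Smalloc, smalloc, PF_NO_TAIL)

/* and __smalloc_create_node() could mark each page it got, roughly: */
for (i = 0; i < pages_nr(size); i++)
        SetPageSmalloc(vmalloc_to_page((void *)node + i * PAGE_SIZE));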

> Maybe there could be a master toggle for these options, if it makes
> sense to group them logically. How does this sound?
>
>> The ro/rw granularity here is the _entire_ pool, not a specific
>> allocation (or page containing the allocation). I'm concerned that
>> makes this very open to race conditions where, especially in the
>> global pool, one thread can be trying to write to ro data in a pool
>> and another has made the pool writable.
>
> I have the impression we are thinking of something different.
> Close, but different enough.
>
> First of all, how can using a mutex create races?
> Do you mean with respect of other resources that might be held by
> competing users of the pool?

I meant this:

CPU 1     CPU 2
create
alloc
write
seal
...
unseal
                write
write
seal

The CPU 2 write would be, for example, an attacker using a
vulnerability to attempt to write to memory in the sealed area. All it
would need to do to succeed would be to trigger an action in the
kernel that would do a "legitimate" write (which requires the unseal),
and race it. Unsealing should be CPU-local, if the API is going to
support this kind of access.

> That, imho, is a locking problem that cannot be solved here.
> You can try to mitigate it, by reducing the chances it will happen, but
> basically you are trying to make do with a user of the API that is not
> implementing locking correctly.
> I'd say that it's better to fix the user.
>
> If you meant something else, I really didn't get it :-)
>
> More about the frequency of access: you seem to expect seal/unseal -
> or lock/unlock - to happen very often, while I don't.

I am more concerned about _any_ unseal after initial seal. And even
then, it'd be nice to keep things CPU-local. My concerns are related
to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
which is kind of like this, but focused on the .data section, not
dynamic memory. It has similar concerns about CPU-locality.
Additionally, even writing to memory and then making it read-only
later runs risks (see threads about BPF JIT races vs making things
read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
doesn't change the risk this series is fixing: races with attacker
writes during assignment but before read-only marking).

So, while smalloc would hugely reduce the window an attacker has
available to change data contents, this API doesn't eliminate it. (To
eliminate it, there would need to be a CPU-local page permission view
that let only the current CPU to the page, and then restore it to
read-only to match the global read-only view.)

> What I envision as primary case for unlocking is the tear down of the
> entire pool, for example preceding the unloading of the module.
>
> Think about __ro_after_init : once it is r/o, it stays r/o.

Ah! In that case, sure. This isn't what the proposed API provided,
though, so let's adjust it to only perform the unseal at destroy time.
That makes it much saner, IMO. "Write once" dynamic allocations, or
"read-only after seal". woalloc? :P yay naming

> Which also means that there would possibly be a busy transient, when
> allocations are made. That is true.
> But only for shared pools. If a module uses one pool, I would expect the
> initialization of the module to be mostly sequential.
> So, no chance of races. Do you agree?

I think a shared global pool would need to be eliminated for true
write-once semantics.

> Furthermore, if a module _really_ wants to do parallel allocation, then
> maybe it's simpler and cleaner to have one pool per "thread" or whatever
> is the mechanism used.

Right.

> The global pool is mostly there for completing the offering of
> __ro_after_init. If one wants to use it, it's available.
> It saves the trouble of having one's own pool, but it means competing
> with the rest of the kernel for locking.
>
> If it seems a bad idea, it can be easily removed.

I think they need to be explicit, yes.

> I'd rather not add extra locking to something that doesn't need it:
> Allocate - write - seal - read, read, read, ... - unseal - destroy.

Yup, I would be totally fine with this. It still has a race between
allocate and seal, but it's a huge improvement over the existing state
of the world where all dynamic memory is writable. :)

>> Finding a user for this would help clarify its protection properties,
>> too. (The LSM example is likely not the best starting point for that,
>> as it would depend on other changes that are under discussion.)
>
> Well, I *have* to do the LSM - and SE Linux too.
>
> That's where this started and why I'm here with this patch.

Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
could just drop the runtime disabling of SELinux, we'd be fine.

I'd like to see what you have in mind for LSM, since the only runtime
change would be the SELinux piece, and that would violate the
alloc-write-seal-read-read-destroy lifetime (since you'd need to
unseal-write-seal to change the state of SELinux during runtime).

> From what I could understand about the discussion, the debate about the
> changes is settled (is it?).
>
> Casey seemed to be ok with this.
>
> If something else is needed, ok, I can do that too, but I didn't see any
> specific reason why LSM should not be good enough as first example.
>
> If you can provide a specific reason why it's not suitable, I can
> reconsider.

Hopefully we're on the same page now? I love being able to work off a
patch example, so if you agree about the changes in lifetime
management, and you've got some examples of how/where to use smalloc,
I'd love to review those too. I think there are a LOT of places this
could get used.

> [...]
>
>>> +       if (!pool) {
>>> +               pr_err("No memory for allocating pool.");
>>
>> It might be handy to have pools named like they are for the slab allocator.
>
> I considered this.
> The only reason I could come up with why it might be desirable is if the
> same pool needs to be accessed from two or more places that do not share
> the pointer. It doesn't seem particularly useful.
> The only upside I can think of is that it would save a memory page, vs
> the case of creating 2 separate pools.
> It would also introduce locking cross-dependency.
>
> Did I overlook some reason why it would be desirable?

Hm, I just meant add a char[] to the metadata and pass it in during
create(). Then it's possible to report which smalloc cache is being
examined during hardened usercopy checks.
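
Something like this, purely illustrative (the name[] field and the
_named variant are invented on top of the posted API):

struct smalloc_pool *smalloc_create_named(const char *name)
{
        struct smalloc_pool *pool = smalloc_create();

        if (pool)       /* name[] would be a new field in smalloc_pool */
                strlcpy(pool->name, name, sizeof(pool->name));
        return pool;
}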

>
> [...]
>
>>> +int smalloc_destroy(struct smalloc_pool *pool)
>>> +{
>>> +       struct list_head *pos, *q;
>>> +       struct smalloc_node *node;
>>> +
>>> +       if (!pool)
>>> +               return -EINVAL;
>
> locking was missing, I added it in the new version
> also I moved to goto-like error handling, since there were several
> similar exit paths.

It seems like smalloc pools could also be refcounted?

> [...]
>
>>> +typedef uint64_t align_t;
>>> +
>>> +enum seal_t {
>>> +       SMALLOC_UNSEALED,
>>> +       SMALLOC_SEALED,
>>> +};
>>> +
>>> +#define __SMALLOC_ALIGNED__ __aligned(sizeof(align_t))
>
> How about the alignment? Is it a desirable feature?
> Did I overlook some reason why it would not work?

I think alignment makes sense. It makes sure there aren't crazy
problems with structs allocated after the header.

>>> +#define NODE_HEADER                                    \
>>> +       struct {                                        \
>>> +               __SMALLOC_ALIGNED__ struct {            \
>>> +                       struct list_head list;          \
>>> +                       align_t *free;                  \
>>> +                       unsigned long available_words;  \
>>> +               };                                      \
>>> +       }
>
> Does this look ok? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It's probably a sufficient starting point, depending on how the API
shakes out. Without unseal-write-seal properties, I care much less
about redzoning, etc.

> [...]
>
>> I'd really like to see kernel-doc for the API functions (likely in the .c file).
>
> Yes, I just see no point right now, since the API doesn't seem to be
> agreed/finalized yet.

Sure thing. I guess I was looking for more docs because of my
questions/concerns about lifetime management, etc. :)

>> Thanks again for working on this! If you can find examples of file
>> operations living in the heap, those would be great examples for using
>> this API (assuming the other properties can be improved).
>
> As I explained above, I would prefer to continue with LSM, unless you
> have some specific reason against it.

I think I just lack the imagination for what you're trying to do that
isn't already covered by the existing __ro_after_init changes. :)

> If you have some example in mind about file operations living on the
> heap - which I suspect you do have :) - we could also discuss those,
> and whether the locking needs to be modified, but from my perspective
> the use case is significantly different and I wouldn't pile it up with this.
> Of course I might be missing the point you are trying to make.
> In that case, I'm afraid that further explanation is needed from you, as
> I do not get it :-)

Well, a poor example would be struct sock, since it needs to be
regularly written to, but it has function pointers near the end which
have been a very common target for attackers. (Though this is less so
now that INET_DIAG no longer exposes the kernel addresses to allocated
struct socks.)

> Thanks a lot for the review, this is exactly the sort of early feedback
> I was hoping to receive.

Great! Gaining a ro_after_init-like API for dynamic memory would be
quite valuable, I think. And once we have a write-rarely
infrastructure for the .rodata segment, perhaps we can extend it to
the dynamic memory areas too (in the future).

Thanks!

-Kees
Igor Stoppa May 24, 2017, 5:45 p.m. UTC | #6
On 23/05/17 23:11, Kees Cook wrote:
> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

> I would want hardened usercopy support as a requirement for using
> smalloc(). Without it, we're regressing the over-read protection that
> already exists for slab objects, if kernel code switched from slab to
> smalloc. It should be very similar to the existing slab checks. "Is
> this a smalloc object? Have we read beyond the end of a given object?"
> etc. The metadata is all there, except for an efficient way to mark a
> page as a smalloc page, but I think that just requires a new Page$Name
> bit test, as done for slab.

ok

[...]

> I meant this:
> 
> CPU 1     CPU 2
> create
> alloc
> write
> seal
> ...
> unseal
>                 write
> write
> seal
> 
> The CPU 2 write would be, for example, an attacker using a
> vulnerability to attempt to write to memory in the sealed area. All it
> would need to do to succeed would be to trigger an action in the
> kernel that would do a "legitimate" write (which requires the unseal),
> and race it. Unsealing should be CPU-local, if the API is going to
> support this kind of access.

I see.
If the CPU1 were to forcibly halt anything that can race with it, then
it would be sure that there was no interference.
A reactive approach could be, instead, to re-validate the content after
the sealing, assuming that it is possible.

[...]

> I am more concerned about _any_ unseal after initial seal. And even
> then, it'd be nice to keep things CPU-local. My concerns are related
> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
> which is kind of like this, but focused on the .data section, not
> dynamic memory. It has similar concerns about CPU-locality.
> Additionally, even writing to memory and then making it read-only
> later runs risks (see threads about BPF JIT races vs making things
> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
> doesn't change the risk this series is fixing: races with attacker
> writes during assignment but before read-only marking).

If you are talking about an attacker, rather than protection against
accidental overwrites, how can hashing be enough?
Couldn't the attacker compromise that too?

> So, while smalloc would hugely reduce the window an attacker has
> available to change data contents, this API doesn't eliminate it. (To
> eliminate it, there would need to be a CPU-local page permission view
> that let only the current CPU to the page, and then restore it to
> read-only to match the global read-only view.)

That or, if one is ready to take the hit, freeze every other possible
attack vector. But I'm not sure this could be justifiable.

[...]

> Ah! In that case, sure. This isn't what the proposed API provided,
> though, so let's adjust it to only perform the unseal at destroy time.
> That makes it much saner, IMO. "Write once" dynamic allocations, or
> "read-only after seal". woalloc? :P yay naming

For now I'm still using smalloc.
Anything that is either [x]malloc or [yz]malloc is fine, lengthwise.
Other options might require some re-formatting.

[...]

> I think a shared global pool would need to be eliminated for true
> write-once semantics.

ok

[...]

>> I'd rather not add extra locking to something that doesn't need it:
>> Allocate - write - seal - read, read, read, ... - unseal - destroy.
> 
> Yup, I would be totally fine with this. It still has a race between
> allocate and seal, but it's a huge improvement over the existing state
> of the world where all dynamic memory is writable. :)

Great!


[...]

> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
> could just drop the runtime disabling of SELinux, we'd be fine.

I am not sure I understand this point.
If the kernel is properly configured, the master toggle variable
disappears, right?
Or do you mean the disabling through modifications of the linked list of
the hooks?

[...]


> Hm, I just meant add a char[] to the metadata and pass it in during
> create(). Then it's possible to report which smalloc cache is being
> examined during hardened usercopy checks.

Ok, that is not a big deal.
wrt this, I have spent some time writing a debug module, which currently
dumps into a debugfs entry a bunch of info about the various pools.
I could split it across multiple entries, using the label to generate
their names.
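
Roughly like this, using the label as the entry name (hypothetical,
since the debug module is not part of the posted patch; pool->name and
pool_dump_fops are invented):

/* one debugfs entry per pool, named after the proposed pool label */
static void smalloc_debugfs_add(struct smalloc_pool *pool,
                                struct dentry *root)
{
        debugfs_create_file(pool->name, 0400, root, pool, &pool_dump_fops);
}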

[...]

> It seems like smalloc pools could also be refcounted?

I am not sure what you mean.
What do you want to count?
Number of pools? Nodes per pool? Allocations per node?

And what for?

At least in the case of tearing down a pool, when a module is unloaded,
nobody needs to free anything that was allocated with smalloc.
The teardown function will free the pages from each node.

Is this the place where you think there should be a check on the number
of pages freed?

[...]

>>>> +#define NODE_HEADER                                    \
>>>> +       struct {                                        \
>>>> +               __SMALLOC_ALIGNED__ struct {            \
>>>> +                       struct list_head list;          \
>>>> +                       align_t *free;                  \
>>>> +                       unsigned long available_words;  \
>>>> +               };                                      \
>>>> +       }
>>
>> Does this look ok? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> It's probably a sufficient starting point, depending on how the API
> shakes out. Without unseal-write-seal properties, I care much less
> about redzoning, etc.

ok, but my question (I am not sure if it was clear) was about the use of
a macro for the nameless structure that contains the header.

[...]

> Well, a poor example would be struct sock, since it needs to be
> regularly written to, but it has function pointers near the end which
> have been a very common target for attackers. (Though this is less so
> now that INET_DIAG no longer exposes the kernel addresses to allocated
> struct socks.)

Ok, this could be the scope for a further set of patches, after this one
is done.



One more thing: how should I tie this allocator to the rest?
I have verified that it seems to work with both SLUB and SLAB.
Can I make it depend on either of them being enabled?

Should it be optionally enabled?
What to default to, if it's not enabled? vmalloc?

---
thanks, igor
Kees Cook May 28, 2017, 6:23 p.m. UTC | #7
On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
> On 23/05/17 23:11, Kees Cook wrote:
>> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
>> I meant this:
>>
>> CPU 1     CPU 2
>> create
>> alloc
>> write
>> seal
>> ...
>> unseal
>>                 write
>> write
>> seal
>>
>> The CPU 2 write would be, for example, an attacker using a
>> vulnerability to attempt to write to memory in the sealed area. All it
>> would need to do to succeed would be to trigger an action in the
>> kernel that would do a "legitimate" write (which requires the unseal),
>> and race it. Unsealing should be CPU-local, if the API is going to
>> support this kind of access.
>
> I see.
> If the CPU1 were to forcibly halt anything that can race with it, then
> it would be sure that there was no interference.

Correct. This is actually what ARM does for doing kernel memory
writing when poking stuff for kprobes, etc. It's rather dramatic,
though. :)

> A reactive approach could be, instead, to re-validate the content after
> the sealing, assuming that it is possible.

I would prefer to avoid this, as that allows an attacker to still have
made the changes (which could even result in them then disabling the
re-validation during the attack).

>> I am more concerned about _any_ unseal after initial seal. And even
>> then, it'd be nice to keep things CPU-local. My concerns are related
>> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
>> which is kind of like this, but focused on the .data section, not
>> dynamic memory. It has similar concerns about CPU-locality.
>> Additionally, even writing to memory and then making it read-only
>> later runs risks (see threads about BPF JIT races vs making things
>> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
>> doesn't change the risk this series is fixing: races with attacker
>> writes during assignment but before read-only marking).
>
> If you are talking about an attacker, rather than protection against
>> accidental overwrites, how can hashing be enough?
> Couldn't the attacker compromise that too?

In theory, yes, though the goal was to dedicate a register to the
hash, which would make it hard/impossible for an attacker to reach.
(The BPF JIT situation is just an example of this kind of race,
though. I'm still in favor of reducing the write window to init-time
from full run-time.)

>> So, while smalloc would hugely reduce the window an attacker has
>> available to change data contents, this API doesn't eliminate it. (To
>> eliminate it, there would need to be a CPU-local page permission view
>> that let only the current CPU to the page, and then restore it to
>> read-only to match the global read-only view.)
>
> That or, if one is ready to take the hit, freeze every other possible
> attack vector. But I'm not sure this could be justifiable.

I would expect other people would NAK using "stop all other CPUs"
approach. Better to have the CPU-local writes.

>> Ah! In that case, sure. This isn't what the proposed API provided,
>> though, so let's adjust it to only perform the unseal at destroy time.
>> That makes it much saner, IMO. "Write once" dynamic allocations, or
>> "read-only after seal". woalloc? :P yay naming
>
> For now I'm still using smalloc.
> Anything that is either [x]malloc or [yz]malloc is fine, lengthwise.
> Other options might require some re-formatting.

Yeah, I don't have any better idea for names. :)

>> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
>> could just drop the runtime disabling of SELinux, we'd be fine.
>
> I am not sure I understand this point.
> If the kernel is properly configured, the master toggle variable
> disappears, right?
> Or do you mean the disabling through modifications of the linked list of
> the hooks?

We might be talking past each-other. Right now, the LSM is marked with
__ro_after_init, which will make all the list structures, entries, etc
read-only already. There is one case where this is not true, and
that's for CONFIG_SECURITY_WRITABLE_HOOKS for
CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of
SELinux. Are you talking about the SELinux policy installed during
early boot? Which things did you want to use smalloc() on?

>> It seems like smalloc pools could also be refcounted?
>
> I am not sure what you mean.
> What do you want to count?
> Number of pools? Nodes per pool? Allocations per node?

I meant things that point into an smalloc() pool could keep a refcount
and when nothing was pointing into it, it could be destroyed. (i.e.
using refcount garbage collection instead of explicit destruction.)

> And what for?

It might be easier to reason about later if allocations get complex.
It's certainly not required for the first version of this.
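
Sketching the idea (the refs field is invented; the refcount_t API is
from <linux/refcount.h>):

static inline void smalloc_pool_get(struct smalloc_pool *pool)
{
        refcount_inc(&pool->refs);      /* a new user points into the pool */
}

static inline void smalloc_pool_put(struct smalloc_pool *pool)
{
        if (refcount_dec_and_test(&pool->refs)) {
                smalloc_unseal(pool);   /* make pages writable for teardown */
                smalloc_destroy(pool);
        }
}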

> At least in the case of tearing down a pool, when a module is unloaded,
> nobody needs to free anything that was allocated with smalloc.
> The teardown function will free the pages from each node.

Right, yeah.

>>>>> +#define NODE_HEADER                                    \
>>>>> +       struct {                                        \
>>>>> +               __SMALLOC_ALIGNED__ struct {            \
>>>>> +                       struct list_head list;          \
>>>>> +                       align_t *free;                  \
>>>>> +                       unsigned long available_words;  \
>>>>> +               };                                      \
>>>>> +       }
>>>
>>> Does this look ok? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> It's probably a sufficient starting point, depending on how the API
>> shakes out. Without unseal-write-seal properties, I care much less
>> about redzoning, etc.
>
> ok, but my question (I am not sure if it was clear) was about the use of
> a macro for the nameless structure that contains the header.

I don't really have an opinion on this. It might be more readable with
a named structure?

> One more thing: how should I tie this allocator to the rest?
>> I have verified that it seems to work with both SLUB and SLAB.
> Can I make it depend on either of them being enabled?

It seems totally unrelated. The only relationship I see would be
interaction with hardened usercopy. In a perfect world, none of the
smalloc pools would be allowed to be copied to/from userspace, which
would make integration really easy: if smalloc_pool(ptr) return NOPE;
:P

> Should it be optionally enabled?
> What to default to, if it's not enabled? vmalloc?

I don't see any reason to make it optional.

-Kees
Boris Lukashev May 28, 2017, 6:56 p.m. UTC | #8
One-time sealable memory makes the most sense from a defensive
perspective - red team reads this stuff, the races mentioned will be
implemented as described to win the day, and probably in other
innovative ways. If a gap is left in the implementation, without
explicit coverage by an adjacent function, it will be used no matter
how small the chances of it occurring in the real world are - grooming
systems to create unlikely conditions is fair play (look at
eternalblue's SMB pool machinations).
However, out of tree modules will likely not appreciate this - third
party LSMs, tpe module, SCST, etc. I don't want to get into the "NIH"
debate; they're real functional components, used by real companies,
often enough that they need to be considered a member of the
ecosystem, even if not a first-order member.
So what about a middle ground where CoW semantics are used to enforce
the state of these allocations as RO, but provide a strictly
controlled pathway to read the RO data, copy and modify it, then write
and seal into a new allocation. Successful return from this process
should permit the page table to change the pointer to where the object
now resides, and initiate freeing of the original memory so long as a
refcount is kept for accesses. That way, sealable memory is sealed,
and any consumers reading it will be using the original ptr to the
original smalloc region. Attackers who do manage to change the
allocation by writing a new one still have to figure out how to change
the region used by an existing consumer. New consumers will get hit by
whatever they changed, but the change will be tracked and require them
to gain execution control over the allocator itself in order to affect
the change (or ROP chain something else into doing it, but thats's a
discussion on RAP/CFI).
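
(A very rough sketch of that pathway - every name below is invented,
and the pool-wide write window is still open during the copy/modify
step:)

void *smalloc_cow_update(struct smalloc_pool *pool, const void *old,
                         size_t size, void (*modify)(void *))
{
        void *new;

        smalloc_unseal(pool);           /* open the write window */
        new = smalloc(size, pool);
        if (new) {
                memcpy(new, old, size); /* copy the sealed original... */
                modify(new);            /* ...and mutate the private copy */
        }
        smalloc_seal(pool);             /* everything read-only again */
        return new;     /* caller swaps consumers over to the new object */
}
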
CPU-local ops are great, if they don't halt the other cores. Stopping
all other CPUs is going to be DoA in HPC and other CPU intensive
workloads - think what ZFS would do if its pipelines kept getting
halted by something running a lot of smallocs (they get
non-preemptible often enough, requiring waits on both sides of the
op), or how LIO would behave - iSCSI waits for no man or woman, or
their allocator strategies. I'm all for "security should be more of a
concern than performance - it's easier to build a faster car later if
you're still alive to do it," but keeping in mind who the consumers
are, I can easily see this functionality staying disabled in most
distributions and thus receiving much less testing and beatdown than
it should.
Lastly, my meager understanding is that PAX set the entire kernel as
RO, and implemented writeable access via pax_open/close. How were they
fighting against race conditions, and what is the benefit of specific
regions being allocated this way as opposed to the RO-all-the-things
approach which makes writes a specialized set of operations?

On Sun, May 28, 2017 at 2:23 PM, Kees Cook <keescook@google.com> wrote:
> On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
>> On 23/05/17 23:11, Kees Cook wrote:
>>> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
>>> I meant this:
>>>
>>> CPU 1     CPU 2
>>> create
>>> alloc
>>> write
>>> seal
>>> ...
>>> unseal
>>>                 write
>>> write
>>> seal
>>>
>>> The CPU 2 write would be, for example, an attacker using a
>>> vulnerability to attempt to write to memory in the sealed area. All it
>>> would need to do to succeed would be to trigger an action in the
>>> kernel that would do a "legitimate" write (which requires the unseal),
>>> and race it. Unsealing should be CPU-local, if the API is going to
>>> support this kind of access.
>>
>> I see.
>> If the CPU1 were to forcibly halt anything that can race with it, then
>> it would be sure that there was no interference.
>
> Correct. This is actually what ARM does for doing kernel memory
> writing when poking stuff for kprobes, etc. It's rather dramatic,
> though. :)
>
>> A reactive approach could be, instead, to re-validate the content after
>> the sealing, assuming that it is possible.
>
> I would prefer to avoid this, as that allows an attacker to still have
> made the changes (which could even result in them then disabling the
> re-validation during the attack).
>
>>> I am more concerned about _any_ unseal after initial seal. And even
>>> then, it'd be nice to keep things CPU-local. My concerns are related
>>> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
>>> which is kind of like this, but focused on the .data section, not
>>> dynamic memory. It has similar concerns about CPU-locality.
>>> Additionally, even writing to memory and then making it read-only
>>> later runs risks (see threads about BPF JIT races vs making things
>>> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
>>> doesn't change the risk this series is fixing: races with attacker
>>> writes during assignment but before read-only marking).
>>
>> If you are talking about an attacker, rather than protection against
>> accidental overwrites, how can hashing be enough?
>> Couldn't the attacker compromise that too?
>
> In theory, yes, though the goal was to dedicate a register to the
> hash, which would make it hard/impossible for an attacker to reach.
> (The BPF JIT situation is just an example of this kind of race,
> though. I'm still in favor of reducing the write window to init-time
> from full run-time.)
>
>>> So, while smalloc would hugely reduce the window an attacker has
>>> available to change data contents, this API doesn't eliminate it. (To
>>> eliminate it, there would need to be a CPU-local page permission view
>>> that let only the current CPU to the page, and then restore it to
>>> read-only to match the global read-only view.)
>>
>> That or, if one is ready to take the hit, freeze every other possible
>> attack vector. But I'm not sure this could be justifiable.
>
> I would expect other people would NAK using "stop all other CPUs"
> approach. Better to have the CPU-local writes.
>
>>> Ah! In that case, sure. This isn't what the proposed API provided,
>>> though, so let's adjust it to only perform the unseal at destroy time.
>>> That makes it much saner, IMO. "Write once" dynamic allocations, or
>>> "read-only after seal". woalloc? :P yay naming
>>
>> For now I'm still using smalloc.
>> Anything that is either [x]malloc or [yz]malloc is fine, lengthwise.
>> Other options might require some re-formatting.
>
> Yeah, I don't have any better idea for names. :)
>
>>> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
>>> could just drop the runtime disabling of SELinux, we'd be fine.
>>
>> I am not sure I understand this point.
>> If the kernel is properly configured, the master toggle variable
>> disappears, right?
>> Or do you mean the disabling through modifications of the linked list of
>> the hooks?
>
> We might be talking past each-other. Right now, the LSM is marked with
> __ro_after_init, which will make all the list structures, entries, etc
> read-only already. There is one case where this is not true, and
> that's for CONFIG_SECURITY_WRITABLE_HOOKS for
> CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of
> SELinux. Are you talking about the SELinux policy installed during
> early boot? Which things did you want to use smalloc() on?
>
>>> It seems like smalloc pools could also be refcounted?
>>
>> I am not sure what you mean.
>> What do you want to count?
>> Number of pools? Nodes per pool? Allocations per node?
>
> I meant things that point into an smalloc() pool could keep a refcount
> and when nothing was pointing into it, it could be destroyed. (i.e.
> using refcount garbage collection instead of explicit destruction.)
>
>> And what for?
>
> It might be easier to reason about later if allocations get complex.
> It's certainly not required for the first version of this.
>
>> At least in the case of tearing down a pool, when a module is unloaded,
>> nobody needs to free anything that was allocated with smalloc.
>> The teardown function will free the pages from each node.
>
> Right, yeah.
>
>>>>>> +#define NODE_HEADER                                    \
>>>>>> +       struct {                                        \
>>>>>> +               __SMALLOC_ALIGNED__ struct {            \
>>>>>> +                       struct list_head list;          \
>>>>>> +                       align_t *free;                  \
>>>>>> +                       unsigned long available_words;  \
>>>>>> +               };                                      \
>>>>>> +       }
>>>>
>>>> Does this look ok? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> It's probably a sufficient starting point, depending on how the API
>>> shakes out. Without unseal-write-seal properties, I care much less
>>> about redzoning, etc.
>>
>> ok, but my question (I am not sure if it was clear) was about the use of
>> a macro for the nameless structure that contains the header.
>
> I don't really have an opinion on this. It might be more readable with
> a named structure?
>
>> One more thing: how should I tie this allocator to the rest?
>> I have verified that it seems to work with both SLUB and SLAB.
>> Can I make it depend on either of them being enabled?
>
> It seems totally unrelated. The only relationship I see would be
> interaction with hardened usercopy. In a perfect world, none of the
> smalloc pools would be allowed to be copied to/from userspace, which
> would make integration really easy: if smalloc_pool(ptr) return NOPE;
> :P
>
>> Should it be optionally enabled?
>> What to default to, if it's not enabled? vmalloc?
>
> I don't see any reason to make it optional.
>
> -Kees
>
> --
> Kees Cook
> Pixel Security
Kees Cook May 28, 2017, 9:32 p.m. UTC | #9
On Sun, May 28, 2017 at 11:56 AM, Boris Lukashev
<blukashev@sempervictus.com> wrote:
> So what about a middle ground where CoW semantics are used to enforce
> the state of these allocations as RO, but provide a strictly
> controlled pathway to read the RO data, copy and modify it, then write
> and seal into a new allocation. Successful return from this process
> should permit the page table to change the pointer to where the object
> now resides, and initiate freeing of the original memory so long as a
> refcount is kept for accesses. That way, sealable memory is sealed,
> and any consumers reading it will be using the original ptr to the
> original smalloc region. Attackers who do manage to change the

This could be another way to do it, yeah, and it helps that smalloc()
is built on vmalloc(). It'd require some careful design, but it could
be a way forward after this initial sealed-after-init version goes in.
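
Hand-waving the reclaim of the old copy, the update path could look
roughly like this (sketch only; smalloc_cow_update() and the modify
callback are hypothetical, not part of the posted patch):

#include <linux/string.h>
#include "smalloc.h"

static void *smalloc_cow_update(void **slot, unsigned long size,
				void (*modify)(void *copy))
{
	struct smalloc_pool *pool = smalloc_create();
	void *copy;

	if (!pool)
		return NULL;
	copy = smalloc(size, pool);
	if (!copy) {
		smalloc_destroy(pool);
		return NULL;
	}
	memcpy(copy, *slot, size);	/* read from the sealed original */
	modify(copy);			/* edit while the new pool is RW */
	smalloc_seal(pool);		/* the copy becomes read-only */
	*slot = copy;			/* publish; reclaiming the old copy
					 * (refcount/RCU) is left out here */
	return copy;
}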

> Lastly, my meager understanding is that PAX set the entire kernel as
> RO, and implemented writeable access via pax_open/close. How were they
> fighting against race conditions, and what is the benefit of specific
> regions being allocated this way as opposed to the RO-all-the-things
> approach which makes writes a specialized set of operations?

My understanding is that PaX's KERNEXEC with the constification plugin
moves a substantial portion of the kernel's .data section
(effectively) into the .rodata section. It's not the "entire" kernel.
(Well, depending on how you count. The .text section is already
read-only upstream.) PaX, as far as I know, provided no dynamic memory
allocation protections, like smalloc() would provide.

-Kees
Boris Lukashev May 29, 2017, 6:04 a.m. UTC | #10
If I understand the current direction for smalloc, it's to implement
it without the ability to "unseal," which has implications for how LSM
implementations and other users of these dynamic allocations handle
things. If it's implemented without a writeable interface for modules
which need one, then switched to a CoW (or other
controlled/isolated/hardened) mechanism for creating writeable
segments in dynamic allocations, that would create another set of
breaking changes for the consumers adapting to the first set. In the
spirit of faster/smoother adoption, maybe it makes sense to permit the
smalloc re-writing method suggested in the first implementation, with
an API which consumers can later use for a more secure approach that
reduces the attack surface without breaking consumer interfaces (or
internal semantics). A sysctl selecting the behavior, ro-only or
ro-and-sometimes-rw, would give users the flexibility to tune their
environment per their needs, which should also reduce potential
conflict/complaints.

On Sun, May 28, 2017 at 5:32 PM, Kees Cook <keescook@google.com> wrote:
> On Sun, May 28, 2017 at 11:56 AM, Boris Lukashev
> <blukashev@sempervictus.com> wrote:
>> So what about a middle ground where CoW semantics are used to enforce
>> the state of these allocations as RO, but provide a strictly
>> controlled pathway to read the RO data, copy and modify it, then write
>> and seal into a new allocation. Successful return from this process
>> should permit the page table to change the pointer to where the object
>> now resides, and initiate freeing of the original memory so long as a
>> refcount is kept for accesses. That way, sealable memory is sealed,
>> and any consumers reading it will be using the original ptr to the
>> original smalloc region. Attackers who do manage to change the
>
> This could be another way to do it, yeah, and it helps that smalloc()
> is built on vmalloc(). It'd require some careful design, but it could
> be a way forward after this initial sealed-after-init version goes in.
>
>> Lastly, my meager understanding is that PAX set the entire kernel as
>> RO, and implemented writeable access via pax_open/close. How were they
>> fighting against race conditions, and what is the benefit of specific
>> regions being allocated this way as opposed to the RO-all-the-things
>> approach which makes writes a specialized set of operations?
>
> My understanding is that PaX's KERNEXEC with the constification plugin
> moves a substantial portion of the kernel's .data section
> (effectively) into the .rodata section. It's not the "entire" kernel.
> (Well, depending on how you count. The .text section is already
> read-only upstream.) PaX, as far as I know, provided no dynamic memory
> allocation protections, like smalloc() would provide.
>
> -Kees
>
> --
> Kees Cook
> Pixel Security
kernel test robot May 31, 2017, 1:55 p.m. UTC | #11
Hi Igor,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc3 next-20170531]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Igor-Stoppa/Sealable-memory-support/20170522-163525
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: x86_64-randconfig-v0-05311736 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from mm/smalloc.c:24:
>> mm/smalloc.h:47: error: flexible array member in otherwise empty struct
   mm/smalloc.c: In function 'smalloc_seal_set':
   mm/smalloc.c:135: error: implicit declaration of function 'set_memory_ro'
   mm/smalloc.c:138: error: implicit declaration of function 'set_memory_rw'

vim +47 mm/smalloc.h

    41		struct mutex lock;
    42		enum seal_t seal;
    43	};
    44	
    45	struct smalloc_node {
    46		NODE_HEADER;
  > 47		__SMALLOC_ALIGNED__ align_t data[];
    48	};
    49	
    50	#define smalloc_seal(pool) \

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Igor Stoppa May 31, 2017, 9:22 p.m. UTC | #12
On 28/05/17 21:23, Kees Cook wrote:
> On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

>> If CPU 1 were to forcibly halt anything that can race with it, then
>> it would be sure that there was no interference.
> 
> Correct. This is actually what ARM does for doing kernel memory
> writing when poking stuff for kprobes, etc. It's rather dramatic,
> though. :)

ok

>> A reactive approach could be, instead, to re-validate the content after
>> the sealing, assuming that it is possible.
> 
> I would prefer to avoid this, as that allows an attacker to still have
> made the changes (which could even result in them then disabling the
> re-validation during the attack).

ok


[...]

>> If you are talking about an attacker, rather than protection against
>> accidental overwrites, how can hashing be enough?
>> Couldn't the attacker compromise that too?
> 
> In theory, yes, though the goal was to dedicate a register to the
> hash, which would make it hard/impossible for an attacker to reach.
> (The BPF JIT situation is just an example of this kind of race,
> though. I'm still in favor of reducing the write window to init-time
> from full run-time.)

It looks like some compromise is needed between reliability and
performance ...

[...]

> I would expect other people would NAK using "stop all other CPUs"
> approach. Better to have the CPU-local writes.

ok, that could be a feature for the follow-up?
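
In the meantime, maybe a short-lived writable alias could shrink the
window? A sketch, admittedly not CPU-local, since a vmap() alias is
visible to all CPUs (single-page writes only, for brevity):

#include <linux/mm.h>
#include <linux/string.h>
#include <linux/vmalloc.h>

static int write_via_alias(void *dst, const void *src, size_t len)
{
	struct page *page = vmalloc_to_page(dst);
	void *alias;

	if (!page || offset_in_page(dst) + len > PAGE_SIZE)
		return -EINVAL;
	/* writable alias; the permanent mapping stays read-only */
	alias = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
	if (!alias)
		return -ENOMEM;
	memcpy(alias + offset_in_page(dst), src, len);
	vunmap(alias);
	return 0;
}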

[...]

> Yeah, I don't have any better idea for names. :)

still clinging to smalloc for now, how about pmalloc,
as "protected malloc"?


> We might be talking past each-other. Right now, the LSM is marked with
> __ro_after_init, which will make all the list structures, entries, etc
> read-only already. There is one case where this is not true, and
> that's for CONFIG_SECURITY_WRITABLE_HOOKS for
> CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of
> SELinux. 

On this I'll just shut up and provide the patch. It seems easier :-)

> Are you talking about the SELinux policy installed during
> early boot? Which things did you want to use smalloc() on?

The policy DB.
And of course the policy cache has to go, meaning that it must be
possible to disable it completely, in favor of re-evaluating the
policies when needed.

I'm pretty sure the cache is absolutely needed in certain cases, but
in others I think the cost of re-evaluation is negligible in
comparison to the overall duration of the affected use cases.

So it should be ok to let the system integrator weigh speed (if any)
against resilience, by enabling/disabling the SE Linux policy cache.


>>> It seems like smalloc pools could also be refcounted?
> 
> I meant things that point into an smalloc() pool could keep a refcount
> and when nothing was pointing into it, it could be destroyed. (i.e.
> using refcount garbage collection instead of explicit destruction.)

ok, but it depends on the level of paranoia that we are talking about.

Counting smalloc vs sfree is easy.

Verifying that the free invocations are actually aligned with previous
allocations is a bit more onerous.

Validating that the free is done from the same entity that invoked the
smalloc might be hard.
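
For the easy case, a rough sketch: it assumes a "struct kref refcount"
member added to struct smalloc_pool (initialized with kref_init() in
smalloc_initialize()); sfree() does not exist in the current patch.

#include <linux/kref.h>
#include <linux/vmalloc.h>
#include "smalloc.h"

static void pool_release(struct kref *kref)
{
	struct smalloc_pool *pool =
		container_of(kref, struct smalloc_pool, refcount);

	smalloc_destroy(pool);	/* frees the pool's nodes */
	vfree(pool);		/* and the pool itself */
}

/* get on allocation ... */
void *smalloc_get(unsigned long size, struct smalloc_pool *pool)
{
	void *ptr = smalloc(size, pool);

	if (ptr)
		kref_get(&pool->refcount);
	return ptr;
}

/* ... put on free; individual objects are never reclaimed */
void sfree(struct smalloc_pool *pool)
{
	kref_put(&pool->refcount, pool_release);
}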

>> And what for?
> 
> It might be easier to reason about later if allocations get complex.
> It's certainly not required for the first version of this.

good :-)
more clarifications are needed

[...]

> I don't really have an opinion on this. It might be more readable with
> a named structure?

Eventually I got rid of the macro.
The structure is intentionally unnamed to stress the grouping, without
adding a layer of indirection. But maybe it could be removed. I need to
test it.
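
For comparison, the named variant would be something like this, with
the extra ->hdr step at every access:

struct smalloc_header {
	struct list_head list;
	align_t *free;
	unsigned long available_words;
} __SMALLOC_ALIGNED__;

struct smalloc_node {
	struct smalloc_header hdr;	/* node->hdr.list, node->hdr.free... */
	__SMALLOC_ALIGNED__ align_t data[];
};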

>> One more thing: how should I tie this allocator to the rest?
>> I have verified that it seems to work with both SLUB and SLAB.
>> Can I make it depend on either of them being enabled?
> 
> It seems totally unrelated. The only relationship I see would be
> interaction with hardened usercopy. In a perfect world, none of the
> smalloc pools would be allowed to be copied to/from userspace, which
> would make integration really easy: if smalloc_pool(ptr) return NOPE;
> :P

I went down the harder path and implemented the check.
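
Roughly along these lines; smalloc_contains() is a placeholder for
the actual walk of the pool/node lists:

/* called from the hardened usercopy path */
const char *smalloc_usercopy_check(const void *ptr, unsigned long n)
{
	if (smalloc_contains(ptr, n))
		return "<smalloc>";	/* name the offending region */
	return NULL;			/* not ours: let other checks run */
}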

>> Should it be optionally enabled?
>> What to default to, if it's not enabled? vmalloc?
> 
> I don't see any reason to make it optional.

great \o/

So now I'm back to the LSM use case.
It shouldn't take too long.
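
The consumer side should end up looking more or less like this (the
hook structure and its setup function are placeholders):

#include "smalloc.h"

static struct smalloc_pool *hooks_pool;

static int __init protected_hooks_init(void)
{
	struct my_hooks *hooks;		/* placeholder consumer struct */

	hooks_pool = smalloc_create();
	if (!hooks_pool)
		return -ENOMEM;
	hooks = smalloc(sizeof(*hooks), hooks_pool);
	if (!hooks) {
		smalloc_destroy(hooks_pool);
		return -ENOMEM;
	}
	populate_hooks(hooks);		/* write while still unsealed */
	smalloc_seal(hooks_pool);	/* read-only from here on */
	return 0;
}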

--
thanks, igor
kernel test robot June 4, 2017, 2:18 a.m. UTC | #13
Hi Igor,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc3 next-20170602]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Igor-Stoppa/Sealable-memory-support/20170522-163525
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-n0-06032349 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   mm//smalloc.c: In function 'smalloc_seal_set':
>> mm//smalloc.c:135:4: error: implicit declaration of function 'set_memory_ro' [-Werror=implicit-function-declaration]
       set_memory_ro((unsigned long)node,
       ^~~~~~~~~~~~~
>> mm//smalloc.c:138:4: error: implicit declaration of function 'set_memory_rw' [-Werror=implicit-function-declaration]
       set_memory_rw((unsigned long)node,
       ^~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/set_memory_ro +135 mm//smalloc.c

   129			mutex_unlock(&pool->lock);
   130			return;
   131		}
   132		list_for_each(pos, &pool->list) {
   133			node = list_entry(pos, struct smalloc_node, list);
   134			if (seal == SMALLOC_SEALED)
 > 135				set_memory_ro((unsigned long)node,
   136					      get_node_pages_nr(node));
   137			else if (seal == SMALLOC_UNSEALED)
 > 138				set_memory_rw((unsigned long)node,
   139					      get_node_pages_nr(node));
   140		}
   141		pool->seal = seal;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
diff mbox

Patch

diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..737c42a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@  obj-y			:= filemap.o mempool.o oom_kill.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   debug.o smalloc.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/smalloc.c b/mm/smalloc.c
new file mode 100644
index 0000000..fa04cc5
--- /dev/null
+++ b/mm/smalloc.c
@@ -0,0 +1,200 @@ 
+/*
+ * smalloc.c: Sealable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/string.h>
+
+
+#include <linux/vmalloc.h>
+#include <asm/cacheflush.h>
+#include "smalloc.h"
+
+#define page_roundup(size) (((size) + !(size) - 1 + PAGE_SIZE) & PAGE_MASK)
+
+#define pages_nr(size) (page_roundup(size) / PAGE_SIZE)
+
+static struct smalloc_pool *global_pool;
+
+struct smalloc_node *__smalloc_create_node(unsigned long words)
+{
+	struct smalloc_node *node;
+	unsigned long size;
+
+	/* Calculate the size to ask from vmalloc, page aligned. */
+	size = page_roundup(NODE_HEADER_SIZE + words * sizeof(align_t));
+	node = vmalloc(size);
+	if (!node) {
+		pr_err("No memory for allocating smalloc node.");
+		return NULL;
+	}
+	/* Initialize the node.*/
+	INIT_LIST_HEAD(&node->list);
+	node->free = node->data;
+	node->available_words = (size - NODE_HEADER_SIZE) / sizeof(align_t);
+	return node;
+}
+
+static __always_inline
+void *node_alloc(struct smalloc_node *node, unsigned long words)
+{
+	register align_t *old_free = node->free;
+
+	node->available_words -= words;
+	node->free += words;
+	return old_free;
+}
+
+void *smalloc(unsigned long size, struct smalloc_pool *pool)
+{
+	struct list_head *pos;
+	struct smalloc_node *node;
+	void *ptr;
+	unsigned long words;
+
+	/* If no pool specified, use the global one. */
+	if (!pool)
+		pool = global_pool;
+
+	mutex_lock(&pool->lock);
+
+	/* If the pool is sealed, then return NULL. */
+	if (pool->seal == SMALLOC_SEALED) {
+		mutex_unlock(&pool->lock);
+		return NULL;
+	}
+
+	/* Calculate minimum number of words required. */
+	words = (size + sizeof(align_t) - 1) / sizeof(align_t);
+
+	/* Look for slot that is large enough, in the existing pool.*/
+	list_for_each(pos, &pool->list) {
+		node = list_entry(pos, struct smalloc_node, list);
+		if (node->available_words >= words) {
+			ptr = node_alloc(node, words);
+			mutex_unlock(&pool->lock);
+			return ptr;
+		}
+	}
+
+	/* No slot found, get a new chunk of virtual memory. */
+	node = __smalloc_create_node(words);
+	if (!node) {
+		mutex_unlock(&pool->lock);
+		return NULL;
+	}
+
+	list_add(&node->list, &pool->list);
+	ptr = node_alloc(node, words);
+	mutex_unlock(&pool->lock);
+	return ptr;
+}
+
+static __always_inline
+unsigned long get_node_size(struct smalloc_node *node)
+{
+	if (!node)
+		return 0;
+	return page_roundup((((void *)node->free) - (void *)node) +
+			    node->available_words * sizeof(align_t));
+}
+
+static __always_inline
+unsigned long get_node_pages_nr(struct smalloc_node *node)
+{
+	return pages_nr(get_node_size(node));
+}
+void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool)
+{
+	struct list_head *pos;
+	struct smalloc_node *node;
+
+	if (!pool)
+		pool = global_pool;
+	mutex_lock(&pool->lock);
+	if (pool->seal == seal) {
+		mutex_unlock(&pool->lock);
+		return;
+	}
+	list_for_each(pos, &pool->list) {
+		node = list_entry(pos, struct smalloc_node, list);
+		if (seal == SMALLOC_SEALED)
+			set_memory_ro((unsigned long)node,
+				      get_node_pages_nr(node));
+		else if (seal == SMALLOC_UNSEALED)
+			set_memory_rw((unsigned long)node,
+				      get_node_pages_nr(node));
+	}
+	pool->seal = seal;
+	mutex_unlock(&pool->lock);
+}
+
+int smalloc_initialize(struct smalloc_pool *pool)
+{
+	if (!pool)
+		return -EINVAL;
+	INIT_LIST_HEAD(&pool->list);
+	pool->seal = SMALLOC_UNSEALED;
+	mutex_init(&pool->lock);
+	return 0;
+}
+
+struct smalloc_pool *smalloc_create(void)
+{
+	struct smalloc_pool *pool = vmalloc(sizeof(struct smalloc_pool));
+
+	if (!pool) {
+		pr_err("No memory for allocating pool.");
+		return NULL;
+	}
+	smalloc_initialize(pool);
+	return pool;
+}
+
+int smalloc_destroy(struct smalloc_pool *pool)
+{
+	struct list_head *pos, *q;
+	struct smalloc_node *node;
+
+	if (!pool)
+		return -EINVAL;
+	list_for_each_safe(pos, q, &pool->list) {
+		node = list_entry(pos, struct smalloc_node, list);
+		list_del(pos);
+		vfree(node);
+	}
+	return 0;
+}
+
+static int __init smalloc_init(void)
+{
+	global_pool = smalloc_create();
+	if (!global_pool) {
+		pr_err("Module smalloc initialization failed: no memory.\n");
+		return -ENOMEM;
+	}
+	pr_info("Module smalloc initialized successfully.\n");
+	return 0;
+}
+
+static void __exit smalloc_exit(void)
+{
+	pr_info("Module smalloc un initialized successfully.\n");
+}
+
+module_init(smalloc_init);
+module_exit(smalloc_exit);
+MODULE_LICENSE("GPL");
diff --git a/mm/smalloc.h b/mm/smalloc.h
new file mode 100644
index 0000000..344d962
--- /dev/null
+++ b/mm/smalloc.h
@@ -0,0 +1,61 @@ 
+/*
+ * smalloc.h: Header for Sealable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _SMALLOC_H
+#define _SMALLOC_H
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+typedef uint64_t align_t;
+
+enum seal_t {
+	SMALLOC_UNSEALED,
+	SMALLOC_SEALED,
+};
+
+#define __SMALLOC_ALIGNED__ __aligned(sizeof(align_t))
+
+#define NODE_HEADER					\
+	struct {					\
+		__SMALLOC_ALIGNED__ struct {		\
+			struct list_head list;		\
+			align_t *free;			\
+			unsigned long available_words;	\
+		};					\
+	}
+
+#define NODE_HEADER_SIZE sizeof(NODE_HEADER)
+
+struct smalloc_pool {
+	struct list_head list;
+	struct mutex lock;
+	enum seal_t seal;
+};
+
+struct smalloc_node {
+	NODE_HEADER;
+	__SMALLOC_ALIGNED__ align_t data[];
+};
+
+#define smalloc_seal(pool) \
+	smalloc_seal_set(SMALLOC_SEALED, pool)
+
+#define smalloc_unseal(pool) \
+	smalloc_seal_set(SMALLOC_UNSEALED, pool)
+
+struct smalloc_pool *smalloc_create(void);
+int smalloc_destroy(struct smalloc_pool *pool);
+int smalloc_initialize(struct smalloc_pool *pool);
+void *smalloc(unsigned long size, struct smalloc_pool *pool);
+void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool);
+#endif