Message ID: 20170519103811.2183-2-igor.stoppa@huawei.com (mailing list archive)
State: New, archived
On Fri, May 19, 2017 at 01:38:11PM +0300, Igor Stoppa wrote:
> Dynamically allocated variables can be made read only,
> after they have been initialized, provided that they reside in memory
> pages devoid of any RW data.
>
> The implementation supplies means to create independent pools of memory,
> which can be individually created, sealed/unsealed and destroyed.
>
> A global pool is made available for those kernel modules that do not
> need to manage an independent pool.
>
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> ---
>  mm/Makefile  |   2 +-
>  mm/smalloc.c | 200 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/smalloc.h |  61 ++++++++++++++++++
>  3 files changed, 262 insertions(+), 1 deletion(-)
>  create mode 100644 mm/smalloc.c
>  create mode 100644 mm/smalloc.h

This is really nice, do you have a follow-on patch showing how any of
the kernel can be changed to use this new subsystem?  Without that, it
might be hard to get this approved (we don't like adding new apis
without users.)

thanks,

greg k-h
On 20/05/17 11:51, Greg KH wrote:
> On Fri, May 19, 2017 at 01:38:11PM +0300, Igor Stoppa wrote:
>> Dynamically allocated variables can be made read only,

[...]

> This is really nice, do you have a follow-on patch showing how any of
> the kernel can be changed to use this new subsystem?

Yes, actually I started from the need of turning into R/O some data
structures in both LSM Hooks and SE Linux.

> Without that, it
> might be hard to get this approved (we don't like adding new apis
> without users.)

Yes, I just wanted to give an early preview of the current
implementation, since it is significantly different from what I
initially proposed. So I was looking for early feedback.

Right now, I'm fixing it up and adding some more structured debugging.
Then I'll re-submit it together with the LSM Hooks example.

---
thanks, igor
On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
> Dynamically allocated variables can be made read only,
> after they have been initialized, provided that they reside in memory
> pages devoid of any RW data.
>
> The implementation supplies means to create independent pools of memory,
> which can be individually created, sealed/unsealed and destroyed.

This would be a welcome addition, thanks for posting it! I have a
bunch of feedback, here and below:

For the first bit of bikeshedding, should this really be called
seal/unseal? My mind is probably just broken from having read TPM
documentation, but this isn't really "sealing" as I'd understand it
(it's not tied to a credential, for example). It's "only" rw/ro.
Perhaps "protect/unprotect" or just simply "readonly/writable", and
call the base function "romalloc"?

This is fundamentally a heap allocator, with linked lists, etc. I'd
like to see as much attention as possible given to hardening it
against attacks, especially adding redzoning around the metadata at
least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.

And as part of that, I'd like hardened usercopy to grow knowledge of
these allocations so we can bounds-check objects. Right now,
mm/usercopy.c just looks at PageSlab(page) to decide if it should do
slab checks. I think adding a check for this type of object would be
very important there.

The ro/rw granularity here is the _entire_ pool, not a specific
allocation (or page containing the allocation). I'm concerned that
makes this very open to race conditions where, especially in the
global pool, one thread can be trying to write to ro data in a pool
and another has made the pool writable.

> A global pool is made available for those kernel modules that do not
> need to manage an independent pool.
>
> Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com>
> ---
>  mm/Makefile  |   2 +-
>  mm/smalloc.c | 200 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/smalloc.h |  61 ++++++++++++++++++
>  3 files changed, 262 insertions(+), 1 deletion(-)
>  create mode 100644 mm/smalloc.c
>  create mode 100644 mm/smalloc.h
>
> diff --git a/mm/Makefile b/mm/Makefile
> index 026f6a8..737c42a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -39,7 +39,7 @@ obj-y := filemap.o mempool.o oom_kill.o \
>                          mm_init.o mmu_context.o percpu.o slab_common.o \
>                          compaction.o vmacache.o swap_slots.o \
>                          interval_tree.o list_lru.o workingset.o \
> -                        debug.o $(mmu-y)
> +                        debug.o smalloc.o $(mmu-y)
>
>  obj-y += init-mm.o
>
> diff --git a/mm/smalloc.c b/mm/smalloc.c
> new file mode 100644
> index 0000000..fa04cc5
> --- /dev/null
> +++ b/mm/smalloc.c
> @@ -0,0 +1,200 @@
> +/*
> + * smalloc.c: Sealable Memory Allocator
> + *
> + * (C) Copyright 2017 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa <igor.stoppa@huawei.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/printk.h>
> +#include <linux/kobject.h>
> +#include <linux/sysfs.h>
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/string.h>
> +
> +
> +#include <linux/vmalloc.h>
> +#include <asm/cacheflush.h>
> +#include "smalloc.h"

Shouldn't this just be <linux/smalloc.h> ?

> +#define page_roundup(size) (((size) + !(size) - 1 + PAGE_SIZE) & PAGE_MASK)
> +
> +#define pages_nr(size) (page_roundup(size) / PAGE_SIZE)
> +
> +static struct smalloc_pool *global_pool;
> +
> +struct smalloc_node *__smalloc_create_node(unsigned long words)
> +{
> +        struct smalloc_node *node;
> +        unsigned long size;
> +
> +        /* Calculate the size to ask from vmalloc, page aligned. */
> +        size = page_roundup(NODE_HEADER_SIZE + words * sizeof(align_t));
> +        node = vmalloc(size);
> +        if (!node) {
> +                pr_err("No memory for allocating smalloc node.");
> +                return NULL;
> +        }
> +        /* Initialize the node.*/
> +        INIT_LIST_HEAD(&node->list);
> +        node->free = node->data;
> +        node->available_words = (size - NODE_HEADER_SIZE) / sizeof(align_t);
> +        return node;
> +}
> +
> +static __always_inline
> +void *node_alloc(struct smalloc_node *node, unsigned long words)
> +{
> +        register align_t *old_free = node->free;
> +
> +        node->available_words -= words;
> +        node->free += words;
> +        return old_free;
> +}
> +
> +void *smalloc(unsigned long size, struct smalloc_pool *pool)
> +{
> +        struct list_head *pos;
> +        struct smalloc_node *node;
> +        void *ptr;
> +        unsigned long words;
> +
> +        /* If no pool specified, use the global one. */
> +        if (!pool)
> +                pool = global_pool;

It should be impossible, but this should check for global_pool == NULL
too, IMO.

> +        mutex_lock(&pool->lock);
> +
> +        /* If the pool is sealed, then return NULL. */
> +        if (pool->seal == SMALLOC_SEALED) {
> +                mutex_unlock(&pool->lock);
> +                return NULL;
> +        }
> +
> +        /* Calculate minimum number of words required. */
> +        words = (size + sizeof(align_t) - 1) / sizeof(align_t);
> +
> +        /* Look for slot that is large enough, in the existing pool.*/
> +        list_for_each(pos, &pool->list) {
> +                node = list_entry(pos, struct smalloc_node, list);
> +                if (node->available_words >= words) {
> +                        ptr = node_alloc(node, words);
> +                        mutex_unlock(&pool->lock);
> +                        return ptr;
> +                }
> +        }
> +
> +        /* No slot found, get a new chunk of virtual memory. */
> +        node = __smalloc_create_node(words);
> +        if (!node) {
> +                mutex_unlock(&pool->lock);
> +                return NULL;
> +        }
> +
> +        list_add(&node->list, &pool->list);
> +        ptr = node_alloc(node, words);
> +        mutex_unlock(&pool->lock);
> +        return ptr;
> +}
> +
> +static __always_inline
> +unsigned long get_node_size(struct smalloc_node *node)
> +{
> +        if (!node)
> +                return 0;
> +        return page_roundup((((void *)node->free) - (void *)node) +
> +                            node->available_words * sizeof(align_t));
> +}
> +
> +static __always_inline
> +unsigned long get_node_pages_nr(struct smalloc_node *node)
> +{
> +        return pages_nr(get_node_size(node));
> +}
> +void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool)
> +{
> +        struct list_head *pos;
> +        struct smalloc_node *node;
> +
> +        if (!pool)
> +                pool = global_pool;
> +        mutex_lock(&pool->lock);
> +        if (pool->seal == seal) {
> +                mutex_unlock(&pool->lock);
> +                return;

I actually think this should be a BUG condition, since this means a
mismatched seal/unseal happened. The pool should never be left
writable at rest, a user should create a pool, write to it, seal. Any
updates should unseal, write, seal. To attempt an unseal and find it
already unsealed seems bad.

Finding a user for this would help clarify its protection properties,
too. (The LSM example is likely not the best starting point for that,
as it would depend on other changes that are under discussion.)

> +        }
> +        list_for_each(pos, &pool->list) {
> +                node = list_entry(pos, struct smalloc_node, list);
> +                if (seal == SMALLOC_SEALED)
> +                        set_memory_ro((unsigned long)node,
> +                                      get_node_pages_nr(node));
> +                else if (seal == SMALLOC_UNSEALED)
> +                        set_memory_rw((unsigned long)node,
> +                                      get_node_pages_nr(node));
> +        }
> +        pool->seal = seal;
> +        mutex_unlock(&pool->lock);
> +}
> +
> +int smalloc_initialize(struct smalloc_pool *pool)
> +{
> +        if (!pool)
> +                return -EINVAL;
> +        INIT_LIST_HEAD(&pool->list);
> +        pool->seal = SMALLOC_UNSEALED;
> +        mutex_init(&pool->lock);
> +        return 0;
> +}
> +
> +struct smalloc_pool *smalloc_create(void)
> +{
> +        struct smalloc_pool *pool = vmalloc(sizeof(struct smalloc_pool));
> +
> +        if (!pool) {
> +                pr_err("No memory for allocating pool.");

It might be handy to have pools named like they are for the slab
allocator.

> +                return NULL;
> +        }
> +        smalloc_initialize(pool);
> +        return pool;
> +}
> +
> +int smalloc_destroy(struct smalloc_pool *pool)
> +{
> +        struct list_head *pos, *q;
> +        struct smalloc_node *node;
> +
> +        if (!pool)
> +                return -EINVAL;
> +        list_for_each_safe(pos, q, &pool->list) {
> +                node = list_entry(pos, struct smalloc_node, list);
> +                list_del(pos);
> +                vfree(node);
> +        }
> +        return 0;
> +}
> +
> +static int __init smalloc_init(void)
> +{
> +        global_pool = smalloc_create();
> +        if (!global_pool) {
> +                pr_err("Module smalloc initialization failed: no memory.\n");
> +                return -ENOMEM;
> +        }
> +        pr_info("Module smalloc initialized successfully.\n");
> +        return 0;
> +}
> +
> +static void __exit smalloc_exit(void)
> +{
> +        pr_info("Module smalloc un initialized successfully.\n");

typo: space after "un"

> +}
> +
> +module_init(smalloc_init);
> +module_exit(smalloc_exit);
> +MODULE_LICENSE("GPL");
> diff --git a/mm/smalloc.h b/mm/smalloc.h
> new file mode 100644
> index 0000000..344d962
> --- /dev/null
> +++ b/mm/smalloc.h
> @@ -0,0 +1,61 @@
> +/*
> + * smalloc.h: Header for Sealable Memory Allocator
> + *
> + * (C) Copyright 2017 Huawei Technologies Co. Ltd.
> + * Author: Igor Stoppa <igor.stoppa@huawei.com>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.
> + */
> +
> +#ifndef _SMALLOC_H
> +#define _SMALLOC_H
> +
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +
> +typedef uint64_t align_t;
> +
> +enum seal_t {
> +        SMALLOC_UNSEALED,
> +        SMALLOC_SEALED,
> +};
> +
> +#define __SMALLOC_ALIGNED__ __aligned(sizeof(align_t))
> +
> +#define NODE_HEADER                                     \
> +        struct {                                        \
> +                __SMALLOC_ALIGNED__ struct {            \
> +                        struct list_head list;          \
> +                        align_t *free;                  \
> +                        unsigned long available_words;  \
> +                };                                      \
> +        }
> +
> +#define NODE_HEADER_SIZE sizeof(NODE_HEADER)
> +
> +struct smalloc_pool {
> +        struct list_head list;
> +        struct mutex lock;
> +        enum seal_t seal;
> +};
> +
> +struct smalloc_node {
> +        NODE_HEADER;
> +        __SMALLOC_ALIGNED__ align_t data[];
> +};
> +
> +#define smalloc_seal(pool) \
> +        smalloc_seal_set(SMALLOC_SEALED, pool)
> +
> +#define smalloc_unseal(pool) \
> +        smalloc_seal_set(SMALLOC_UNSEALED, pool)
> +
> +struct smalloc_pool *smalloc_create(void);
> +int smalloc_destroy(struct smalloc_pool *pool);
> +int smalloc_initialize(struct smalloc_pool *pool);
> +void *smalloc(unsigned long size, struct smalloc_pool *pool);
> +void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool);

I'd really like to see kernel-doc for the API functions (likely in the
.c file).

> +#endif
> --
> 2.9.3

Thanks again for working on this! If you can find examples of file
operations living in the heap, those would be great examples for using
this API (assuming the other properties can be improved).

-Kees
On 23/05/17 00:38, Kees Cook wrote:
> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:

[...]

> For the first bit of bikeshedding, should this really be called
> seal/unseal? My mind is probably just broken from having read TPM
> documentation, but this isn't really "sealing" as I'd understand it
> (it's not tied to a credential, for example). It's "only" rw/ro.
> Perhaps "protect/unprotect" or just simply "readonly/writable", and
> call the base function "romalloc"?

I was not aware of the specific meaning of "seal" in this context.
The term was implicitly proposed by Michal Hocko, while discussing the
mechanism, and I liked it more than what I was using initially:
"lockable".

tbh I like the sound of "smalloc" better than "romalloc".
But this is really the least of my worries :-P

> This is fundamentally a heap allocator, with linked lists, etc. I'd
> like to see as much attention as possible given to hardening it
> against attacks, especially adding redzoning around the metadata at
> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.

My initial goal was to provide something that is useful without
affecting performance. You seem to be pushing for a more extreme
approach. While I have nothing against it, and I actually agree that
it can be useful, I would not make it mandatory. More on this later.

> And as part of that, I'd like hardened usercopy to grow knowledge of
> these allocations so we can bounds-check objects. Right now,
> mm/usercopy.c just looks at PageSlab(page) to decide if it should do
> slab checks. I think adding a check for this type of object would be
> very important there.

I am not familiar with this and I need to study it; however, I still
think that if there is a significant trade-off in terms of performance
vs resilience, it should be optional, for those who want it.

Maybe there could be a master toggle for these options, if it makes
sense to group them logically. How does this sound?
> The ro/rw granularity here is the _entire_ pool, not a specific
> allocation (or page containing the allocation). I'm concerned that
> makes this very open to race conditions where, especially in the
> global pool, one thread can be trying to write to ro data in a pool
> and another has made the pool writable.

I have the impression we are thinking of something different. Close,
but different enough.

First of all, how can using a mutex create races? Do you mean with
respect to other resources that might be held by competing users of
the pool? That, imho, is a locking problem that cannot be solved here.
You can try to mitigate it, by reducing the chances it will happen,
but basically you are trying to make do with a user of the API that is
not implementing locking correctly. I'd say that it's better to fix
the user. If you meant something else, I really didn't get it :-)

More about the frequency of the access: you seem to expect very
frequent seal/unseal - or lock/unlock - while I don't. What I envision
as the primary case for unlocking is the tear-down of the entire pool,
for example preceding the unloading of the module. Think about
__ro_after_init: once it is r/o, it stays r/o.

Which also means that there would possibly be a busy transient, when
allocations are made. That is true. But only for shared pools. If a
module uses one pool, I would expect the initialization of the module
to be mostly sequential. So, no chance of races. Do you agree?
Furthermore, if a module _really_ wants to do parallel allocation,
then maybe it's simpler and cleaner to have one pool per "thread" or
whatever mechanism is used.

The global pool is mostly there for completing the offering of
__ro_after_init. If one wants to use it, it's available. It saves the
trouble of having one's own pool, but it means competing with the rest
of the kernel for locking. If it seems a bad idea, it can be easily
removed.

[...]

>> +#include "smalloc.h"
>
> Shouldn't this just be <linux/smalloc.h> ?
yes, I have fixed it in the next revision

[...]

>> +        /* If no pool specified, use the global one. */
>> +        if (!pool)
>> +                pool = global_pool;
>
> It should be impossible, but this should check for global_pool == NULL
> too, IMO.

I'm curious: why do you think it could happen? I'd rather throw an
exception, if I cannot initialize global_pool.

[...]

>> +        if (pool->seal == seal) {
>> +                mutex_unlock(&pool->lock);
>> +                return;
>
> I actually think this should be a BUG condition, since this means a
> mismatched seal/unseal happened. The pool should never be left
> writable at rest, a user should create a pool, write to it, seal.

On this I agree.

> Any updates should unseal, write, seal. To attempt an unseal and find
> it already unsealed seems bad.

But what you are describing seems exactly the case where there could
be a race. Ex: 2 writers try to unseal and add/change something. They
could have a collision both when trying to seal and unseal. One could
argue that there should be an atomic unseal-update-seal and maybe
unseal-allocate-seal. I would understand the need for that - and maybe
that's really how it should be implemented, for doing what you seem to
envision. But that's where I think we have different use-cases in
mind.

I'd rather not add extra locking to something that doesn't need it:
allocate - write - seal - read, read, read, ... - unseal - destroy.
What you describe is more complex and does need extra locking, but I
would advocate a different implementation, at least initially. Later,
if it makes sense, fold them into one.

> Finding a user for this would help clarify its protection properties,
> too. (The LSM example is likely not the best starting point for that,
> as it would depend on other changes that are under discussion.)

Well, I *have* to do the LSM - and SE Linux too. That's where this
started and why I'm here with this patch. From what I could understand
about the discussion, the debate about the changes is settled (is
it?).
Casey seemed to be ok with this.

If something else is needed, ok, I can do that too, but I didn't see
any specific reason why LSM should not be good enough as a first
example. If you can provide a specific reason why it's not suitable, I
can reconsider.

[...]

>> +        if (!pool) {
>> +                pr_err("No memory for allocating pool.");
>
> It might be handy to have pools named like they are for the slab
> allocator.

I considered this. The only reason I could come up with why it might
be desirable is if the same pool needs to be accessed from two or more
places that do not share the pointer. It doesn't seem particularly
useful. The only upside I can think of is that it would save a memory
page, vs the case of creating 2 separate pools. It would also
introduce locking cross-dependency. Did I overlook some reason why it
would be desirable?

[...]

>> +int smalloc_destroy(struct smalloc_pool *pool)
>> +{
>> +        struct list_head *pos, *q;
>> +        struct smalloc_node *node;
>> +
>> +        if (!pool)
>> +                return -EINVAL;

locking was missing, I added it in the new version;
also I moved to goto-like error handling, since there were several
similar exit paths.

>> +        list_for_each_safe(pos, q, &pool->list) {
>> +                node = list_entry(pos, struct smalloc_node, list);

btw, here and in other places I have switched to list_for_each_entry

[...]

> typo: space after "un"

ok

[...]

>> +typedef uint64_t align_t;
>> +
>> +enum seal_t {
>> +        SMALLOC_UNSEALED,
>> +        SMALLOC_SEALED,
>> +};
>> +
>> +#define __SMALLOC_ALIGNED__ __aligned(sizeof(align_t))

How about the alignment? Is it a desirable feature? Did I overlook
some reason why it would not work?

>> +#define NODE_HEADER                                     \
>> +        struct {                                        \
>> +                __SMALLOC_ALIGNED__ struct {            \
>> +                        struct list_head list;          \
>> +                        align_t *free;                  \
>> +                        unsigned long available_words;  \
>> +                };                                      \
>> +        }

Does this look ok?

[...]

> I'd really like to see kernel-doc for the API functions (likely in
> the .c file).
Yes, I just see no point right now, since the API doesn't seem to be
agreed/finalized yet.

> Thanks again for working on this! If you can find examples of file
> operations living in the heap, those would be great examples for
> using this API (assuming the other properties can be improved).

As I explained above, I would prefer to continue with LSM, unless you
have some specific reason against it.

If you have some example in mind about file operations living on the
heap - which I suspect you do have :) - we could also discuss those,
and whether the locking needs to be modified, but from my perspective
the use case is significantly different and I wouldn't pile it up with
this. Of course I might be missing the point you are trying to make.
In that case, I'm afraid that further explanation is needed from you,
as I do not get it :-)

Thanks a lot for the review, this is exactly the sort of early
feedback I was hoping to receive.

--
igor
On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
> On 23/05/17 00:38, Kees Cook wrote:
>> On Fri, May 19, 2017 at 3:38 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote:
>
> [...]
>
>> For the first bit of bikeshedding, should this really be called
>> seal/unseal? My mind is probably just broken from having read TPM
>> documentation, but this isn't really "sealing" as I'd understand it
>> (it's not tied to a credential, for example). It's "only" rw/ro.
>> Perhaps "protect/unprotect" or just simply "readonly/writable", and
>> call the base function "romalloc"?
>
> I was not aware of the specific meaning of "seal" in this context.
> The term was implicitly proposed by Michal Hocko, while discussing
> the mechanism, and I liked it more than what I was using initially:
> "lockable".
>
> tbh I like the sound of "smalloc" better than "romalloc"
>
> But this is really the least of my worries :-P
>
>> This is fundamentally a heap allocator, with linked lists, etc. I'd
>> like to see as much attention as possible given to hardening it
>> against attacks, especially adding redzoning around the metadata at
>> least, and perhaps requiring that CONFIG_DEBUG_LIST be enabled.
>
> My initial goal was to provide something that is useful without
> affecting performance.
>
> You seem to be pushing for a more extreme approach.

I don't know about pushing, but it seemed like the API provided for
arbitrary unsealing, etc. More below...

> While I have nothing against it and I actually agree that it can be
> useful, I would not make it mandatory.
>
> More on this later.
>
>> And as part of that, I'd like hardened usercopy to grow knowledge of
>> these allocations so we can bounds-check objects. Right now,
>> mm/usercopy.c just looks at PageSlab(page) to decide if it should do
>> slab checks. I think adding a check for this type of object would be
>> very important there.
>
> I am not familiar with this and I need to study it, however I still
> think that if there is a significant trade-off in terms of
> performance vs resilience, it should be optional, for those who want
> it.

I would want hardened usercopy support as a requirement for using
smalloc(). Without it, we're regressing the over-read protection that
already exists for slab objects, if kernel code switched from slab to
smalloc. It should be very similar to the existing slab checks. "Is
this a smalloc object? Have we read beyond the end of a given object?"
etc. The metadata is all there, except for an efficient way to mark a
page as a smalloc page, but I think that just requires a new Page$Name
bit test, as done for slab.

> Maybe there could be a master toggle for these options, if it makes
> sense to group them logically. How does this sound?
>
>> The ro/rw granularity here is the _entire_ pool, not a specific
>> allocation (or page containing the allocation). I'm concerned that
>> makes this very open to race conditions where, especially in the
>> global pool, one thread can be trying to write to ro data in a pool
>> and another has made the pool writable.
>
> I have the impression we are thinking of something different.
> Close, but different enough.
>
> First of all, how can using a mutex create races?
> Do you mean with respect to other resources that might be held by
> competing users of the pool?

I meant this:

    CPU 1          CPU 2
    create
    alloc
    write
    seal
    ...
    unseal
                   write
    write
    seal

The CPU 2 write would be, for example, an attacker using a
vulnerability to attempt to write to memory in the sealed area. All it
would need to do to succeed would be to trigger an action in the
kernel that would do a "legitimate" write (which requires the unseal),
and race it. Unsealing should be CPU-local, if the API is going to
support this kind of access.

> That, imho, is a locking problem that cannot be solved here.
> You can try to mitigate it, by reducing the chances it will happen,
> but basically you are trying to make do with a user of the API that
> is not implementing locking correctly.
> I'd say that it's better to fix the user.
>
> If you meant something else, I really didn't get it :-)
>
> More about the frequency of the access: you seem to expect very
> frequent seal/unseal - or lock/unlock - while I don't.

I am more concerned about _any_ unseal after initial seal. And even
then, it'd be nice to keep things CPU-local. My concerns are related
to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704)
which is kind of like this, but focused on the .data section, not
dynamic memory. It has similar concerns about CPU-locality.

Additionally, even writing to memory and then making it read-only
later runs risks (see threads about BPF JIT races vs making things
read-only: https://patchwork.kernel.org/patch/9662653/ - Alexei's NAK
doesn't change the risk this series is fixing: races with attacker
writes during assignment but before read-only marking). So, while
smalloc would hugely reduce the window an attacker has available to
change data contents, this API doesn't eliminate it. (To eliminate it,
there would need to be a CPU-local page permission view that let only
the current CPU write to the page, and then restored it to read-only
to match the global read-only view.)

> What I envision as the primary case for unlocking is the tear-down of
> the entire pool, for example preceding the unloading of the module.
>
> Think about __ro_after_init: once it is r/o, it stays r/o.

Ah! In that case, sure. This isn't what the proposed API provided,
though, so let's adjust it to only perform the unseal at destroy time.
That makes it much saner, IMO. "Write once" dynamic allocations, or
"read-only after seal". woalloc? :P yay naming

> Which also means that there would possibly be a busy transient, when
> allocations are made. That is true.
> But only for shared pools.
> If a module uses one pool, I would expect the initialization of the
> module to be mostly sequential.
> So, no chance of races. Do you agree?

I think a shared global pool would need to be eliminated for true
write-once semantics.

> Furthermore, if a module _really_ wants to do parallel allocation,
> then maybe it's simpler and cleaner to have one pool per "thread" or
> whatever mechanism is used.

Right.

> The global pool is mostly there for completing the offering of
> __ro_after_init. If one wants to use it, it's available.
> It saves the trouble of having one's own pool, but it means competing
> with the rest of the kernel for locking.
>
> If it seems a bad idea, it can be easily removed.

I think they need to be explicit, yes.

> I'd rather not add extra locking to something that doesn't need it:
> allocate - write - seal - read, read, read, ... - unseal - destroy.

Yup, I would be totally fine with this. It still has a race between
allocate and seal, but it's a huge improvement over the existing state
of the world, where all dynamic memory is writable. :)

>> Finding a user for this would help clarify its protection
>> properties, too. (The LSM example is likely not the best starting
>> point for that, as it would depend on other changes that are under
>> discussion.)
>
> Well, I *have* to do the LSM - and SE Linux too.
>
> That's where this started and why I'm here with this patch.

Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
could just drop the runtime disabling of SELinux, we'd be fine. I'd
like to see what you have in mind for LSM, since the only runtime
change would be the SELinux piece, and that would violate the
alloc-write-seal-read-read-destroy lifetime (since you'd need to
unseal-write-seal to change the state of SELinux during runtime).

> From what I could understand about the discussion, the debate about
> the changes is settled (is it?).
>
> Casey seemed to be ok with this.
>
> If something else is needed, ok, I can do that too, but I didn't see
> any specific reason why LSM should not be good enough as a first
> example.
>
> If you can provide a specific reason why it's not suitable, I can
> reconsider.

Hopefully we're on the same page now? I love being able to work off a
patch example, so if you agree about the changes in lifetime
management, and you've got some examples of how/where to use smalloc,
I'd love to review those too. I think there are a LOT of places this
could get used.

> [...]
>
>>> +        if (!pool) {
>>> +                pr_err("No memory for allocating pool.");
>>
>> It might be handy to have pools named like they are for the slab
>> allocator.
>
> I considered this.
> The only reason I could come up with why it might be desirable is if
> the same pool needs to be accessed from two or more places that do
> not share the pointer. It doesn't seem particularly useful.
> The only upside I can think of is that it would save a memory page,
> vs the case of creating 2 separate pools.
> It would also introduce locking cross-dependency.
>
> Did I overlook some reason why it would be desirable?

Hm, I just meant add a char[] to the metadata and pass it in during
create(). Then it's possible to report which smalloc cache is being
examined during hardened usercopy checks.

> [...]
>
>>> +int smalloc_destroy(struct smalloc_pool *pool)
>>> +{
>>> +        struct list_head *pos, *q;
>>> +        struct smalloc_node *node;
>>> +
>>> +        if (!pool)
>>> +                return -EINVAL;
>
> locking was missing, I added it in the new version;
> also I moved to goto-like error handling, since there were several
> similar exit paths.

It seems like smalloc pools could also be refcounted?

> [...]
>
>>> +typedef uint64_t align_t;
>>> +
>>> +enum seal_t {
>>> +        SMALLOC_UNSEALED,
>>> +        SMALLOC_SEALED,
>>> +};
>>> +
>>> +#define __SMALLOC_ALIGNED__ __aligned(sizeof(align_t))
>
> How about the alignment? Is it a desirable feature?
> Did I overlook some reason why it would not work?
I think alignment makes sense. It makes sure there aren't crazy
problems with structs allocated after the header.

>>> +#define NODE_HEADER                                     \
>>> +        struct {                                        \
>>> +                __SMALLOC_ALIGNED__ struct {            \
>>> +                        struct list_head list;          \
>>> +                        align_t *free;                  \
>>> +                        unsigned long available_words;  \
>>> +                };                                      \
>>> +        }
>
> Does this look ok?

It's probably a sufficient starting point, depending on how the API
shakes out. Without unseal-write-seal properties, I care much less
about redzoning, etc.

> [...]
>
>> I'd really like to see kernel-doc for the API functions (likely in
>> the .c file).
>
> Yes, I just see no point right now, since the API doesn't seem to be
> agreed/finalized yet.

Sure thing. I guess I was looking for more docs because of my
questions/concerns about lifetime management, etc. :)

>> Thanks again for working on this! If you can find examples of file
>> operations living in the heap, those would be great examples for
>> using this API (assuming the other properties can be improved).
>
> As I explained above, I would prefer to continue with LSM, unless you
> have some specific reason against it.

I think I just lack the imagination for what you're trying to do that
isn't already covered by the existing __ro_after_init changes. :)

> If you have some example in mind about file operations living on the
> heap - which I suspect you do have :) - we could also discuss those,
> and whether the locking needs to be modified, but from my perspective
> the use case is significantly different and I wouldn't pile it up
> with this.
> Of course I might be missing the point you are trying to make.
> In that case, I'm afraid that further explanation is needed from you,
> as I do not get it :-)

Well, a poor example would be struct sock, since it needs to be
regularly written to, but it has function pointers near the end which
have been a very common target for attackers.
(Though this is less so now that INET_DIAG no longer exposes the kernel addresses to allocated struct socks.) > Thanks a lot for the review, this is exactly the sort of early feedback > I was hoping to receive. Great! Gaining a ro_after_init-like API for dynamic memory would be quite valuable, I think. And once we have a write-rarely infrastructure for the .rodata segment, perhaps we can extend it to the dynamic memory areas too (in the future). Thanks! -Kees
On 23/05/17 23:11, Kees Cook wrote: > On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote: [...] > I would want hardened usercopy support as a requirement for using > smalloc(). Without it, we're regressing the over-read protection that > already exists for slab objects, if kernel code switched from slab to > smalloc. It should be very similar to the existing slab checks. "Is > this a smalloc object? Have we read beyond the end of a given object?" > etc. The metadata is all there, except for an efficient way to mark a > page as a smalloc page, but I think that just requires a new Page$Name > bit test, as done for slab. ok [...] > I meant this: > > CPU 1 CPU 2 > create > alloc > write > seal > ... > unseal > write > write > seal > > The CPU 2 write would be, for example, an attacker using a > vulnerability to attempt to write to memory in the sealed area. All it > would need to do to succeed would be to trigger an action in the > kernel that would do a "legitimate" write (which requires the unseal), > and race it. Unsealing should be CPU-local, if the API is going to > support this kind of access. I see. If the CPU1 were to forcibly halt anything that can race with it, then it would be sure that there was no interference. A reactive approach could be, instead, to re-validate the content after the sealing, assuming that it is possible. [...] > I am more concerned about _any_ unseal after initial seal. And even > then, it'd be nice to keep things CPU-local. My concerns are related > to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704) > which is kind of like this, but focused on the .data section, not > dynamic memory. It has similar concerns about CPU-locality. 
> Additionally, even writing to memory and then making it read-only
> later runs risks (see threads about BPF JIT races vs making things
> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK
> doesn't change the risk this series is fixing: races with attacker
> writes during assignment but before read-only marking).

If you are talking about an attacker, rather than protection against accidental overwrites, how can hashing be enough? Couldn't the attacker compromise that too?

> So, while smalloc would hugely reduce the window an attacker has
> available to change data contents, this API doesn't eliminate it. (To
> eliminate it, there would need to be a CPU-local page permission view
> that let only the current CPU to the page, and then restore it to
> read-only to match the global read-only view.)

That or, if one is ready to take the hit, freeze every other possible attack vector. But I'm not sure this could be justifiable.

[...]

> Ah! In that case, sure. This isn't what the proposed API provided,
> though, so let's adjust it to only perform the unseal at destroy time.
> That makes it much saner, IMO. "Write once" dynamic allocations, or
> "read-only after seal". woalloc? :P yay naming

For now I'm still using smalloc.
Anything that is either [x]malloc or [yz]malloc is fine, lengthwise.
Other options might require some re-formatting.

[...]

> I think a shared global pool would need to be eliminated for true
> write-once semantics.

ok

[...]

>> I'd rather not add extra locking to something that doesn't need it:
>> Allocate - write - seal - read, read, read, ... - unseal - destroy.
>
> Yup, I would be totally fine with this. It still has a race between
> allocate and seal, but it's a huge improvement over the existing state
> of the world where all dynamic memory is writable. :)

Great!

[...]

> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we
> could just drop the runtime disabling of SELinux, we'd be fine.
I am not sure I understand this point. If the kernel is properly configured, the master toggle variable disappears, right? Or do you mean the disabling through modifications of the linked list of the hooks? [...] > Hm, I just meant add a char[] to the metadata and pass it in during > create(). Then it's possible to report which smalloc cache is being > examined during hardened usercopy checks. Ok, that is not a big deal. wrt this, I have spent some time writing a debug module, which currently dumps into a debugfs entry a bunch of info about the various pools. I could split it across multiple entries, using the label to generate their names. [...] > It seems like smalloc pools could also be refcounted? I am not sure what you mean. What do you want to count? Number of pools? Nodes per pool? Allocations per node? And what for? At least in the case of tearing down a pool, when a module is unloaded, nobody needs to free anything that was allocated with smalloc. The teardown function will free the pages from each node. Is this the place where you think there should be a check on the number of pages freed? [...] >>>> +#define NODE_HEADER \ >>>> + struct { \ >>>> + __SMALLOC_ALIGNED__ struct { \ >>>> + struct list_head list; \ >>>> + align_t *free; \ >>>> + unsigned long available_words; \ >>>> + }; \ >>>> + } >> >> Does this look ok? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > It's probably a sufficient starting point, depending on how the API > shakes out. Without unseal-write-seal properties, I case much less > about redzoning, etc. ok, but my question (I am not sure if it was clear) was about the use of a macro for the nameless structure that contains the header. [...] > Well, a poor example would be struct sock, since it needs to be > regularly written to, but it has function pointers near the end which > have been a very common target for attackers. (Though this is less so > now that INET_DIAG no longer exposes the kernel addresses to allocated > struct socks.) 
Ok, this could be the scope for a further set of patches, after this one is done.

One more thing: how should I tie this allocator to the rest?
I have verified that it seems to work with both SLUB and SLAB.
Can I make it depend on either of them being enabled?
Should it be optionally enabled?
What to default to, if it's not enabled? vmalloc?

---
thanks, igor
On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote: > On 23/05/17 23:11, Kees Cook wrote: >> On Tue, May 23, 2017 at 2:43 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote: >> I meant this: >> >> CPU 1 CPU 2 >> create >> alloc >> write >> seal >> ... >> unseal >> write >> write >> seal >> >> The CPU 2 write would be, for example, an attacker using a >> vulnerability to attempt to write to memory in the sealed area. All it >> would need to do to succeed would be to trigger an action in the >> kernel that would do a "legitimate" write (which requires the unseal), >> and race it. Unsealing should be CPU-local, if the API is going to >> support this kind of access. > > I see. > If the CPU1 were to forcibly halt anything that can race with it, then > it would be sure that there was no interference. Correct. This is actually what ARM does for doing kernel memory writing when poking stuff for kprobes, etc. It's rather dramatic, though. :) > A reactive approach could be, instead, to re-validate the content after > the sealing, assuming that it is possible. I would prefer to avoid this, as that allows an attacker to still have made the changes (which could even result in them then disabling the re-validation during the attack). >> I am more concerned about _any_ unseal after initial seal. And even >> then, it'd be nice to keep things CPU-local. My concerns are related >> to the write-rarely proposal (https://lkml.org/lkml/2017/3/29/704) >> which is kind of like this, but focused on the .data section, not >> dynamic memory. It has similar concerns about CPU-locality. >> Additionally, even writing to memory and then making it read-only >> later runs risks (see threads about BPF JIT races vs making things >> read-only: https://patchwork.kernel.org/patch/9662653/ Alexei's NAK >> doesn't change the risk this series is fixing: races with attacker >> writes during assignment but before read-only marking). 
> > If you are talking about an attacker, rather than protection against > accidental overwrites, how hashing can be enough? > Couldn't the attacker compromise that too? In theory, yes, though the goal was to dedicate a register to the hash, which would make it hard/impossible for an attacker to reach. (The BPF JIT situation is just an example of this kind of race, though. I'm still in favor of reducing the write window to init-time from full run-time.) >> So, while smalloc would hugely reduce the window an attacker has >> available to change data contents, this API doesn't eliminate it. (To >> eliminate it, there would need to be a CPU-local page permission view >> that let only the current CPU to the page, and then restore it to >> read-only to match the global read-only view.) > > That or, if one is ready to take the hit, freeze every other possible > attack vector. But I'm not sure this could be justifiable. I would expect other people would NAK using "stop all other CPUs" approach. Better to have the CPU-local writes. >> Ah! In that case, sure. This isn't what the proposed API provided, >> though, so let's adjust it to only perform the unseal at destroy time. >> That makes it much saner, IMO. "Write once" dynamic allocations, or >> "read-only after seal". woalloc? :P yay naming > > For now I'm still using smalloc. > Anything that is either [x]malloc or [yz]malloc is fine, lengthwise. > Other options might require some re-formatting. Yeah, I don't have any better idea for names. :) >> Ah, okay. Most of the LSM is happily covered by __ro_after_init. If we >> could just drop the runtime disabling of SELinux, we'd be fine. > > I am not sure I understand this point. > If the kernel is properly configured, the master toggle variable > disappears, right? > Or do you mean the disabling through modifications of the linked list of > the hooks? We might be talking past each-other. 
Right now, the LSM is marked with __ro_after_init, which will make all the list structures, entries, etc read-only already. There is one case where this is not true, and that's for CONFIG_SECURITY_WRITABLE_HOOKS for CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of SELinux. Are you talking about the SELinux policy installed during early boot? Which things did you want to use smalloc() on? >> It seems like smalloc pools could also be refcounted? > > I am not sure what you mean. > What do you want to count? > Number of pools? Nodes per pool? Allocations per node? I meant things that point into an smalloc() pool could keep a refcount and when nothing was pointing into it, it could be destroyed. (i.e. using refcount garbage collection instead of explicit destruction.) > And what for? It might be easier to reason about later if allocations get complex. It's certainly not required for the first version of this. > At least in the case of tearing down a pool, when a module is unloaded, > nobody needs to free anything that was allocated with smalloc. > The teardown function will free the pages from each node. Right, yeah. >>>>> +#define NODE_HEADER \ >>>>> + struct { \ >>>>> + __SMALLOC_ALIGNED__ struct { \ >>>>> + struct list_head list; \ >>>>> + align_t *free; \ >>>>> + unsigned long available_words; \ >>>>> + }; \ >>>>> + } >>> >>> Does this look ok? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> >> It's probably a sufficient starting point, depending on how the API >> shakes out. Without unseal-write-seal properties, I case much less >> about redzoning, etc. > > ok, but my question (I am not sure if it was clear) was about the use of > a macro for the nameless structure that contains the header. I don't really have an opinion on this. It might be more readable with a named structure? > One more thing: how should I tie this allocator to the rest? > I have verified that is seems to work with both SLUB and SLAB. 
> Can I make it depend on either of them being enabled? It seems totally unrelated. The only relationship I see would be interaction with hardened usercopy. In a perfect world, none of the smalloc pools would be allowed to be copied to/from userspace, which would make integration really easy: if smalloc_pool(ptr) return NOPE; :P > Should it be optionally enabled? > What to default to, if it's not enabled? vmalloc? I don't see any reason to make it optional. -Kees
One-time sealable memory makes the most sense from a defensive perspective - red team reads this stuff, the races mentioned will be implemented as described to win the day, and probably in other innovative ways. If a gap is left in the implementation, without explicit coverage by an adjacent function, it will be used no matter how small the chances of it occurring in the real world are - grooming systems to create unlikely conditions is fair play (look at eternalblue's SMB pool machinations).

However, out-of-tree modules will likely not appreciate this - third-party LSMs, tpe module, SCST, etc. I don't want to get into the "NIH" debate, they're real functional components, used by real companies, often enough that they need to be considered a member of the ecosystem, even if not a first-order member.

So what about a middle ground where CoW semantics are used to enforce the state of these allocations as RO, but provide a strictly controlled pathway to read the RO data, copy and modify it, then write and seal into a new allocation. Successful return from this process should permit the page table to change the pointer to where the object now resides, and initiate freeing of the original memory so long as a refcount is kept for accesses. That way, sealable memory is sealed, and any consumers reading it will be using the original ptr to the original smalloc region. Attackers who do manage to change the allocation by writing a new one still have to figure out how to change the region used by an existing consumer. New consumers will get hit by whatever they changed, but the change will be tracked and require them to gain execution control over the allocator itself in order to effect the change (or ROP chain something else into doing it, but that's a discussion on RAP/CFI).

CPU-local ops are great, if they don't halt the other cores.
Stopping all other CPUs is going to be DoA in HPC and other CPU-intensive workloads - think what ZFS would do if its pipelines kept getting halted by something running a lot of smallocs (they get non-preemptible often enough, requiring waits on both sides of the op), or how LIO would behave - iSCSI waits for no man or woman, or their allocator strategies. I'm all for "security should be more of a concern than performance - it's easier to build a faster car later if you're still alive to do it," but keeping in mind who the consumers are, I can easily see this functionality staying disabled in most distributions and thus receiving much less testing and beatdown than it should.

Lastly, my meager understanding is that PaX set the entire kernel as RO, and implemented writeable access via pax_open/close. How were they fighting against race conditions, and what is the benefit of specific regions being allocated this way, as opposed to the RO-all-the-things approach which makes writes a specialized set of operations?

On Sun, May 28, 2017 at 2:23 PM, Kees Cook <keescook@google.com> wrote:
> [...]
On Sun, May 28, 2017 at 11:56 AM, Boris Lukashev <blukashev@sempervictus.com> wrote: > So what about a middle ground where CoW semantics are used to enforce > the state of these allocations as RO, but provide a strictly > controlled pathway to read the RO data, copy and modify it, then write > and seal into a new allocation. Successful return from this process > should permit the page table to change the pointer to where the object > now resides, and initiate freeing of the original memory so long as a > refcount is kept for accesses. That way, sealable memory is sealed, > and any consumers reading it will be using the original ptr to the > original smalloc region. Attackers who do manage to change the This could be another way to do it, yeah, and it helps that smalloc() is built on vmalloc(). It'd require some careful design, but it could be a way forward after this initial sealed-after-init version goes in. > Lastly, my meager understanding is that PAX set the entire kernel as > RO, and implemented writeable access via pax_open/close. How were they > fighting against race conditions, and what is the benefit of specific > regions being allocated this way as opposed to the RO-all-the-things > approach which makes writes a specialized set of operations? My understanding is that PaX's KERNEXEC with the constification plugin moves a substantial portion of the kernel's .data section (effectively) into the .rodata section. It's not the "entire" kernel. (Well, depending on how you count. The .text section is already read-only upstream.) PaX, as far as I know, provided no dynamic memory allocation protections, like smalloc() would provide. -Kees
If I understand the current direction for smalloc, it's to implement it without the ability to "unseal," which has implications on how LSM implementations and other users of these dynamic allocations handle things. If it's implemented without a writeable interface for modules which need it, then switched to a CoW (or other controlled/isolated/hardened) mechanism for creating writeable segments in dynamic allocations, that would create another set of breaking changes for the consumers adapting to the first set.

In the spirit of adopting things faster/smoother, maybe it makes sense to permit the smalloc re-writing method suggested in the first implementation, with an API which consumers can later use for a more secure approach that reduces the attack surface without breaking consumer interfaces (or internal semantics). A sysctl affecting the behavior as ro-only or ro-and-sometimes-rw would give users the flexibility to tune their environment per their needs, which should also reduce potential conflict/complaints.

On Sun, May 28, 2017 at 5:32 PM, Kees Cook <keescook@google.com> wrote:
> [...]
Hi Igor, [auto build test ERROR on mmotm/master] [also build test ERROR on v4.12-rc3 next-20170531] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/Igor-Stoppa/Sealable-memory-support/20170522-163525 base: git://git.cmpxchg.org/linux-mmotm.git master config: x86_64-randconfig-v0-05311736 (attached as .config) compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All errors (new ones prefixed by >>): In file included from mm/smalloc.c:24: >> mm/smalloc.h:47: error: flexible array member in otherwise empty struct mm/smalloc.c: In function 'smalloc_seal_set': mm/smalloc.c:135: error: implicit declaration of function 'set_memory_ro' mm/smalloc.c:138: error: implicit declaration of function 'set_memory_rw' vim +47 mm/smalloc.h 41 struct mutex lock; 42 enum seal_t seal; 43 }; 44 45 struct smalloc_node { 46 NODE_HEADER; > 47 __SMALLOC_ALIGNED__ align_t data[]; 48 }; 49 50 #define smalloc_seal(pool) \ --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
On 28/05/17 21:23, Kees Cook wrote: > On Wed, May 24, 2017 at 10:45 AM, Igor Stoppa <igor.stoppa@huawei.com> wrote: [...] >> If the CPU1 were to forcibly halt anything that can race with it, then >> it would be sure that there was no interference. > > Correct. This is actually what ARM does for doing kernel memory > writing when poking stuff for kprobes, etc. It's rather dramatic, > though. :) ok >> A reactive approach could be, instead, to re-validate the content after >> the sealing, assuming that it is possible. > > I would prefer to avoid this, as that allows an attacker to still have > made the changes (which could even result in them then disabling the > re-validation during the attack). ok [...] >> If you are talking about an attacker, rather than protection against >> accidental overwrites, how hashing can be enough? >> Couldn't the attacker compromise that too? > > In theory, yes, though the goal was to dedicate a register to the > hash, which would make it hard/impossible for an attacker to reach. > (The BPF JIT situation is just an example of this kind of race, > though. I'm still in favor of reducing the write window to init-time > from full run-time.) It looks like some compromise is needed between reliability and performance ... [...] > I would expect other people would NAK using "stop all other CPUs" > approach. Better to have the CPU-local writes. ok, that could be a feature for the follow-up? [...] > Yeah, I don't have any better idea for names. :) still clinging to smalloc for now, how about pmalloc, as "protected malloc"? > We might be talking past each-other. Right now, the LSM is marked with > __ro_after_init, which will make all the list structures, entries, etc > read-only already. There is one case where this is not true, and > that's for CONFIG_SECURITY_WRITABLE_HOOKS for > CONFIG_SECURITY_SELINUX_DISABLE, which wants to do run-time removal of > SELinux. On this I'll just shut up and provide the patch. 
It seems easier :-)

> Are you talking about the SELinux policy installed during
> early boot? Which things did you want to use smalloc() on?

The policy DB. And of course the policy cache has to go, as in: it must be possible to disable it completely, in favor of re-evaluating the policies, when needed.

I'm pretty sure that it is absolutely needed in certain cases, but in others I think the re-evaluation is negligible in comparison to the overall duration of the use cases affected. So it should be ok to let the system integrator gauge speed (if any) vs resilience, by enabling/disabling the SE Linux policy cache.

>>> It seems like smalloc pools could also be refcounted?
>
> I meant things that point into an smalloc() pool could keep a refcount
> and when nothing was pointing into it, it could be destroyed. (i.e.
> using refcount garbage collection instead of explicit destruction.)

ok, but it depends on the level of paranoia that we are talking about. Counting smalloc vs sfree is easy. Verifying that the free invocations are actually aligned with previous allocations is a bit more onerous. Validating that the free is done from the same entity that invoked the smalloc might be hard.

>> And what for?
>
> It might be easier to reason about later if allocations get complex.
> It's certainly not required for the first version of this.

good :-) more clarifications are needed

[...]

> I don't really have an opinion on this. It might be more readable with
> a named structure?

Eventually I got rid of the macro. The structure is intentionally unnamed to stress the grouping, without adding a layer of indirection. But maybe it could be removed. I need to test it.

>> One more thing: how should I tie this allocator to the rest?
>> I have verified that is seems to work with both SLUB and SLAB.
>> Can I make it depend on either of them being enabled?
>
> It seems totally unrelated. The only relationship I see would be
> interaction with hardened usercopy.
In a perfect world, none of the > smalloc pools would be allowed to be copied to/from userspace, which > would make integration really easy: if smalloc_pool(ptr) return NOPE; > :P I went down the harder path and implemented the check. >> Should it be optionally enabled? >> What to default to, if it's not enabled? vmalloc? > > I don't see any reason to make it optional. great \o/ So now I'm back to the LSM use case. It shouldn't take too long. -- thanks, igor
Hi Igor,

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.12-rc3 next-20170602]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Igor-Stoppa/Sealable-memory-support/20170522-163525
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-randconfig-n0-06032349 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   mm//smalloc.c: In function 'smalloc_seal_set':
>> mm//smalloc.c:135:4: error: implicit declaration of function 'set_memory_ro' [-Werror=implicit-function-declaration]
       set_memory_ro((unsigned long)node,
       ^~~~~~~~~~~~~
>> mm//smalloc.c:138:4: error: implicit declaration of function 'set_memory_rw' [-Werror=implicit-function-declaration]
       set_memory_rw((unsigned long)node,
       ^~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/set_memory_ro +135 mm//smalloc.c

   129  		mutex_unlock(&pool->lock);
   130  		return;
   131  	}
   132  	list_for_each(pos, &pool->list) {
   133  		node = list_entry(pos, struct smalloc_node, list);
   134  		if (seal == SMALLOC_SEALED)
 > 135  			set_memory_ro((unsigned long)node,
   136  				      get_node_pages_nr(node));
   137  		else if (seal == SMALLOC_UNSEALED)
 > 138  			set_memory_rw((unsigned long)node,
   139  				      get_node_pages_nr(node));
   140  	}
   141  	pool->seal = seal;

---
0-DAY kernel test infrastructure              Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a8..737c42a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@
 obj-y			:= filemap.o mempool.o oom_kill.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o swap_slots.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   debug.o smalloc.o $(mmu-y)
 
 obj-y += init-mm.o
diff --git a/mm/smalloc.c b/mm/smalloc.c
new file mode 100644
index 0000000..fa04cc5
--- /dev/null
+++ b/mm/smalloc.c
@@ -0,0 +1,200 @@
+/*
+ * smalloc.c: Sealable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/string.h>
+
+
+#include <linux/vmalloc.h>
+#include <asm/cacheflush.h>
+#include "smalloc.h"
+
+#define page_roundup(size) (((size) + !(size) - 1 + PAGE_SIZE) & PAGE_MASK)
+
+#define pages_nr(size) (page_roundup(size) / PAGE_SIZE)
+
+static struct smalloc_pool *global_pool;
+
+struct smalloc_node *__smalloc_create_node(unsigned long words)
+{
+	struct smalloc_node *node;
+	unsigned long size;
+
+	/* Calculate the size to ask from vmalloc, page aligned. */
+	size = page_roundup(NODE_HEADER_SIZE + words * sizeof(align_t));
+	node = vmalloc(size);
+	if (!node) {
+		pr_err("No memory for allocating smalloc node.");
+		return NULL;
+	}
+	/* Initialize the node.*/
+	INIT_LIST_HEAD(&node->list);
+	node->free = node->data;
+	node->available_words = (size - NODE_HEADER_SIZE) / sizeof(align_t);
+	return node;
+}
+
+static __always_inline
+void *node_alloc(struct smalloc_node *node, unsigned long words)
+{
+	register align_t *old_free = node->free;
+
+	node->available_words -= words;
+	node->free += words;
+	return old_free;
+}
+
+void *smalloc(unsigned long size, struct smalloc_pool *pool)
+{
+	struct list_head *pos;
+	struct smalloc_node *node;
+	void *ptr;
+	unsigned long words;
+
+	/* If no pool specified, use the global one. */
+	if (!pool)
+		pool = global_pool;
+
+	mutex_lock(&pool->lock);
+
+	/* If the pool is sealed, then return NULL. */
+	if (pool->seal == SMALLOC_SEALED) {
+		mutex_unlock(&pool->lock);
+		return NULL;
+	}
+
+	/* Calculate minimum number of words required. */
+	words = (size + sizeof(align_t) - 1) / sizeof(align_t);
+
+	/* Look for slot that is large enough, in the existing pool.*/
+	list_for_each(pos, &pool->list) {
+		node = list_entry(pos, struct smalloc_node, list);
+		if (node->available_words >= words) {
+			ptr = node_alloc(node, words);
+			mutex_unlock(&pool->lock);
+			return ptr;
+		}
+	}
+
+	/* No slot found, get a new chunk of virtual memory. */
+	node = __smalloc_create_node(words);
+	if (!node) {
+		mutex_unlock(&pool->lock);
+		return NULL;
+	}
+
+	list_add(&node->list, &pool->list);
+	ptr = node_alloc(node, words);
+	mutex_unlock(&pool->lock);
+	return ptr;
+}
+
+static __always_inline
+unsigned long get_node_size(struct smalloc_node *node)
+{
+	if (!node)
+		return 0;
+	return page_roundup((((void *)node->free) - (void *)node) +
+			    node->available_words * sizeof(align_t));
+}
+
+static __always_inline
+unsigned long get_node_pages_nr(struct smalloc_node *node)
+{
+	return pages_nr(get_node_size(node));
+}
+void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool)
+{
+	struct list_head *pos;
+	struct smalloc_node *node;
+
+	if (!pool)
+		pool = global_pool;
+	mutex_lock(&pool->lock);
+	if (pool->seal == seal) {
+		mutex_unlock(&pool->lock);
+		return;
+	}
+	list_for_each(pos, &pool->list) {
+		node = list_entry(pos, struct smalloc_node, list);
+		if (seal == SMALLOC_SEALED)
+			set_memory_ro((unsigned long)node,
+				      get_node_pages_nr(node));
+		else if (seal == SMALLOC_UNSEALED)
+			set_memory_rw((unsigned long)node,
+				      get_node_pages_nr(node));
+	}
+	pool->seal = seal;
+	mutex_unlock(&pool->lock);
+}
+
+int smalloc_initialize(struct smalloc_pool *pool)
+{
+	if (!pool)
+		return -EINVAL;
+	INIT_LIST_HEAD(&pool->list);
+	pool->seal = SMALLOC_UNSEALED;
+	mutex_init(&pool->lock);
+	return 0;
+}
+
+struct smalloc_pool *smalloc_create(void)
+{
+	struct smalloc_pool *pool = vmalloc(sizeof(struct smalloc_pool));
+
+	if (!pool) {
+		pr_err("No memory for allocating pool.");
+		return NULL;
+	}
+	smalloc_initialize(pool);
+	return pool;
+}
+
+int smalloc_destroy(struct smalloc_pool *pool)
+{
+	struct list_head *pos, *q;
+	struct smalloc_node *node;
+
+	if (!pool)
+		return -EINVAL;
+	list_for_each_safe(pos, q, &pool->list) {
+		node = list_entry(pos, struct smalloc_node, list);
+		list_del(pos);
+		vfree(node);
+	}
+	return 0;
+}
+
+static int __init smalloc_init(void)
+{
+	global_pool = smalloc_create();
+	if (!global_pool) {
+		pr_err("Module smalloc initialization failed: no memory.\n");
+		return -ENOMEM;
+	}
+	pr_info("Module smalloc initialized successfully.\n");
+	return 0;
+}
+
+static void __exit smalloc_exit(void)
+{
+	pr_info("Module smalloc un initialized successfully.\n");
+}
+
+module_init(smalloc_init);
+module_exit(smalloc_exit);
+MODULE_LICENSE("GPL");
diff --git a/mm/smalloc.h b/mm/smalloc.h
new file mode 100644
index 0000000..344d962
--- /dev/null
+++ b/mm/smalloc.h
@@ -0,0 +1,61 @@
+/*
+ * smalloc.h: Header for Sealable Memory Allocator
+ *
+ * (C) Copyright 2017 Huawei Technologies Co. Ltd.
+ * Author: Igor Stoppa <igor.stoppa@huawei.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#ifndef _SMALLOC_H
+#define _SMALLOC_H
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+typedef uint64_t align_t;
+
+enum seal_t {
+	SMALLOC_UNSEALED,
+	SMALLOC_SEALED,
+};
+
+#define __SMALLOC_ALIGNED__ __aligned(sizeof(align_t))
+
+#define NODE_HEADER					\
+	struct {					\
+		__SMALLOC_ALIGNED__ struct {		\
+			struct list_head list;		\
+			align_t *free;			\
+			unsigned long available_words;	\
+		};					\
+	}
+
+#define NODE_HEADER_SIZE sizeof(NODE_HEADER)
+
+struct smalloc_pool {
+	struct list_head list;
+	struct mutex lock;
+	enum seal_t seal;
+};
+
+struct smalloc_node {
+	NODE_HEADER;
+	__SMALLOC_ALIGNED__ align_t data[];
+};
+
+#define smalloc_seal(pool) \
+	smalloc_seal_set(SMALLOC_SEALED, pool)
+
+#define smalloc_unseal(pool) \
+	smalloc_seal_set(SMALLOC_UNSEALED, pool)
+
+struct smalloc_pool *smalloc_create(void);
+int smalloc_destroy(struct smalloc_pool *pool);
+int smalloc_initialize(struct smalloc_pool *pool);
+void *smalloc(unsigned long size, struct smalloc_pool *pool);
+void smalloc_seal_set(enum seal_t seal, struct smalloc_pool *pool);
+#endif