Message ID | 20240305100933.it.923-kees@kernel.org (mailing list archive)
---|---
Series | slab: Introduce dedicated bucket allocator
On 2024/03/05 18:10, Kees Cook wrote:
> Hi,
>
> Repeating the commit logs for patch 4 here:
>
> Dedicated caches are available for fixed size allocations via
> kmem_cache_alloc(), but for dynamically sized allocations there is only
> the global kmalloc API's set of buckets available. This means it isn't
> possible to separate specific sets of dynamically sized allocations into
> a separate collection of caches.
>
> This leads to a use-after-free exploitation weakness in the Linux
> kernel since many heap memory spraying/grooming attacks depend on using
> userspace-controllable dynamically sized allocations to collide with
> fixed size allocations that end up in the same cache.
>
> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> against these kinds of "type confusion" attacks, including for fixed
> same-size heap objects, we can create a complementary deterministic
> defense for dynamically sized allocations.
>
> In order to isolate user-controllable sized allocations from system
> allocations, introduce kmem_buckets_create(), which behaves like
> kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
> which behaves like kmem_cache_alloc().)

So can I say the vision here would be to make all the kernel interfaces
that handle user space input use separate caches? That would look like
creating a "grey zone" between kernel space (trusted) and user space
(untrusted) memory. I've also thought that hardening at this "border"
could be more efficient and targeted than a mitigation that applies
globally, e.g. CONFIG_RANDOM_KMALLOC_CACHES.
On Wed, Mar 06, 2024 at 09:47:36AM +0800, GONG, Ruiqi wrote: > > > On 2024/03/05 18:10, Kees Cook wrote: > > Hi, > > > > Repeating the commit logs for patch 4 here: > > > > Dedicated caches are available For fixed size allocations via > > kmem_cache_alloc(), but for dynamically sized allocations there is only > > the global kmalloc API's set of buckets available. This means it isn't > > possible to separate specific sets of dynamically sized allocations into > > a separate collection of caches. > > > > This leads to a use-after-free exploitation weakness in the Linux > > kernel since many heap memory spraying/grooming attacks depend on using > > userspace-controllable dynamically sized allocations to collide with > > fixed size allocations that end up in same cache. > > > > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense > > against these kinds of "type confusion" attacks, including for fixed > > same-size heap objects, we can create a complementary deterministic > > defense for dynamically sized allocations. > > > > In order to isolate user-controllable sized allocations from system > > allocations, introduce kmem_buckets_create(), which behaves like > > kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(), > > which behaves like kmem_cache_alloc().) > > So can I say the vision here would be to make all the kernel interfaces > that handles user space input to use separated caches? Which looks like > creating a "grey zone" in the middle of kernel space (trusted) and user > space (untrusted) memory. I've also thought that maybe hardening on the > "border" could be more efficient and targeted than a mitigation that > affects globally, e.g. CONFIG_RANDOM_KMALLOC_CACHES. I think it ends up having a similar effect, yes. The more copies that move to memdup_user(), the more coverage is created. The main point is to just not share caches between different kinds of allocations. The most abused version of this is the userspace size-controllable allocations, which this targets. The existing caches (which could still be used for type confusion attacks when the sizes are sufficiently similar) have a good chance of being mitigated by CONFIG_RANDOM_KMALLOC_CACHES already, so this proposed change is just complementary, IMO. -Kees
On 2024/03/08 4:31, Kees Cook wrote: > On Wed, Mar 06, 2024 at 09:47:36AM +0800, GONG, Ruiqi wrote: >> >> >> On 2024/03/05 18:10, Kees Cook wrote: >>> Hi, >>> >>> Repeating the commit logs for patch 4 here: >>> >>> Dedicated caches are available For fixed size allocations via >>> kmem_cache_alloc(), but for dynamically sized allocations there is only >>> the global kmalloc API's set of buckets available. This means it isn't >>> possible to separate specific sets of dynamically sized allocations into >>> a separate collection of caches. >>> >>> This leads to a use-after-free exploitation weakness in the Linux >>> kernel since many heap memory spraying/grooming attacks depend on using >>> userspace-controllable dynamically sized allocations to collide with >>> fixed size allocations that end up in same cache. >>> >>> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense >>> against these kinds of "type confusion" attacks, including for fixed >>> same-size heap objects, we can create a complementary deterministic >>> defense for dynamically sized allocations. >>> >>> In order to isolate user-controllable sized allocations from system >>> allocations, introduce kmem_buckets_create(), which behaves like >>> kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(), >>> which behaves like kmem_cache_alloc().) >> >> So can I say the vision here would be to make all the kernel interfaces >> that handles user space input to use separated caches? Which looks like >> creating a "grey zone" in the middle of kernel space (trusted) and user >> space (untrusted) memory. I've also thought that maybe hardening on the >> "border" could be more efficient and targeted than a mitigation that >> affects globally, e.g. CONFIG_RANDOM_KMALLOC_CACHES. > > I think it ends up having a similar effect, yes. The more copies that > move to memdup_user(), the more coverage is created. The main point is to > just not share caches between different kinds of allocations. The most > abused version of this is the userspace size-controllable allocations, > which this targets. I agree. Currently if we want to fulfill a more strict separation between user-space manageable memory and other memory in kernel space, technically speaking for fixed size allocations we could transform them into using dedicated caches (i.e. kmem_cache_create()), but for dynamic size allocations I don't think of any solution. With the APIs provided by this patch set, we've got something that works. > ... The existing caches (which could still be used for > type confusion attacks when the sizes are sufficiently similar) have a > good chance of being mitigated by CONFIG_RANDOM_KMALLOC_CACHES already, > so this proposed change is just complementary, IMO. Maybe in the future we could require that all user-kernel interfaces that make use of SLAB caches should use either kmem_cache_create() or kmem_buckets_create()? ;) > > -Kees >
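For readers who want to see the shape of the API being discussed, here is a minimal sketch of pairing kmem_buckets_create() with kmem_buckets_alloc() to isolate a userspace-size-controlled allocation. The kmem_buckets_create() argument list, the "foo" names, and the flag choices are illustrative assumptions based on the cover letter's statement that these helpers mirror kmem_cache_create()/kmem_cache_alloc(); they are not copied from the patches.

#include <linux/err.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

/* Illustrative sketch only: the kmem_buckets_create() argument order and
 * flag choices follow the series' description, not a verified signature. */
static kmem_buckets *foo_buckets __ro_after_init;

static int __init foo_buckets_init(void)
{
        /* One private set of kmalloc-style size buckets for this subsystem. */
        foo_buckets = kmem_buckets_create("foo_user", SLAB_ACCOUNT,
                                          0, UINT_MAX, NULL);
        return foo_buckets ? 0 : -ENOMEM;
}
subsys_initcall(foo_buckets_init);

static void *foo_copy_from_user(const void __user *src, size_t len)
{
        /* len is user-controlled, so allocate from the dedicated buckets
         * instead of the shared global kmalloc caches. */
        void *buf = kmem_buckets_alloc(foo_buckets, len, GFP_KERNEL);

        if (!buf)
                return ERR_PTR(-ENOMEM);
        if (copy_from_user(buf, src, len)) {
                kfree(buf);
                return ERR_PTR(-EFAULT);
        }
        return buf;
}

The point of the separation is only the allocation site: a use-after-free on buf can then only collide with other "foo_user" allocations, not with arbitrary same-sized kernel objects.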
On 3/5/24 11:10 AM, Kees Cook wrote: > Hi, > > Repeating the commit logs for patch 4 here: > > Dedicated caches are available For fixed size allocations via > kmem_cache_alloc(), but for dynamically sized allocations there is only > the global kmalloc API's set of buckets available. This means it isn't > possible to separate specific sets of dynamically sized allocations into > a separate collection of caches. > > This leads to a use-after-free exploitation weakness in the Linux > kernel since many heap memory spraying/grooming attacks depend on using > userspace-controllable dynamically sized allocations to collide with > fixed size allocations that end up in same cache. > > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense > against these kinds of "type confusion" attacks, including for fixed > same-size heap objects, we can create a complementary deterministic > defense for dynamically sized allocations. > > In order to isolate user-controllable sized allocations from system > allocations, introduce kmem_buckets_create(), which behaves like > kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(), > which behaves like kmem_cache_alloc().) > > Allows for confining allocations to a dedicated set of sized caches > (which have the same layout as the kmalloc caches). > > This can also be used in the future once codetag allocation annotations > exist to implement per-caller allocation cache isolation[0] even for > dynamic allocations. > > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0] > > After the implemetation are 2 example patches of how this could be used > for some repeat "offenders" that get used in exploits. There are more to > be isolated beyond just these. Repeating the commit log for patch 8 here: > > The msg subsystem is a common target for exploiting[1][2][3][4][5][6] > use-after-free type confusion flaws in the kernel for both read and > write primitives. Avoid having a user-controlled size cache share the > global kmalloc allocator by using a separate set of kmalloc buckets. > > Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1] > Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2] > Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3] > Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4] > Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5] > Link: https://zplin.me/papers/ELOISE.pdf [6] Hi Kees, after reading [1] I think the points should be addressed, mainly about the feasibility of converting users manually. On a related technical note I worry what will become of /proc/slabinfo when we convert non-trivial amounts of users. Also would interested to hear Jann Horn et al.'s opinion, and whether the SLAB_VIRTUAL effort will continue? 
Thanks,
Vlastimil

[1] https://dustri.org/b/notes-on-the-slab-introduce-dedicated-bucket-allocator-series.html

> -Kees
>
> v2: significant rewrite, generalized the buckets type, added kvmalloc style
> v1: https://lore.kernel.org/lkml/20240304184252.work.496-kees@kernel.org/
>
> Kees Cook (9):
>   slab: Introduce kmem_buckets typedef
>   slub: Plumb kmem_buckets into __do_kmalloc_node()
>   util: Introduce __kvmalloc_node() that can take kmem_buckets argument
>   slab: Introduce kmem_buckets_create()
>   slab: Introduce kmem_buckets_alloc()
>   slub: Introduce kmem_buckets_alloc_track_caller()
>   slab: Introduce kmem_buckets_valloc()
>   ipc, msg: Use dedicated slab buckets for alloc_msg()
>   mm/util: Use dedicated slab buckets for memdup_user()
>
>  include/linux/slab.h | 50 +++++++++++++++++++++-------
>  ipc/msgutil.c        | 13 +++++++-
>  lib/fortify_kunit.c  |  2 +-
>  mm/slab.h            |  6 ++--
>  mm/slab_common.c     | 77 ++++++++++++++++++++++++++++++++++++++++++--
>  mm/slub.c            | 14 ++++----
>  mm/util.c            | 23 +++++++++----
>  7 files changed, 154 insertions(+), 31 deletions(-)
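Since patch 8's commit log is quoted above, here is roughly what that conversion looks like in ipc/msgutil.c terms. struct msg_msg and DATALEN_MSG are real ipc internals; the helper name, bucket setup, and exact kmem_buckets_create() arguments are reconstructed from the description and should be read as a sketch, not the actual patch.

static kmem_buckets *msg_buckets __ro_after_init;

static int __init init_msg_buckets(void)
{
        /* Assumed arguments; the real patch may differ. */
        msg_buckets = kmem_buckets_create("msg_msg", SLAB_ACCOUNT,
                                          0, UINT_MAX, NULL);
        return 0;
}
subsys_initcall(init_msg_buckets);

/* Hypothetical helper standing in for the allocation inside alloc_msg(). */
static struct msg_msg *msg_alloc_head(size_t len)
{
        size_t alloc_len = min_t(size_t, len, DATALEN_MSG);

        /* The message length comes straight from msgsnd(), so pull this
         * allocation out of the shared kmalloc caches:
         *
         *   before: kmalloc(sizeof(struct msg_msg) + alloc_len, GFP_KERNEL_ACCOUNT);
         */
        return kmem_buckets_alloc(msg_buckets,
                                  sizeof(struct msg_msg) + alloc_len,
                                  GFP_KERNEL_ACCOUNT);
}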
On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote: > On 3/5/24 11:10 AM, Kees Cook wrote: > > Hi, > > > > Repeating the commit logs for patch 4 here: > > > > Dedicated caches are available For fixed size allocations via > > kmem_cache_alloc(), but for dynamically sized allocations there is only > > the global kmalloc API's set of buckets available. This means it isn't > > possible to separate specific sets of dynamically sized allocations into > > a separate collection of caches. > > > > This leads to a use-after-free exploitation weakness in the Linux > > kernel since many heap memory spraying/grooming attacks depend on using > > userspace-controllable dynamically sized allocations to collide with > > fixed size allocations that end up in same cache. > > > > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense > > against these kinds of "type confusion" attacks, including for fixed > > same-size heap objects, we can create a complementary deterministic > > defense for dynamically sized allocations. > > > > In order to isolate user-controllable sized allocations from system > > allocations, introduce kmem_buckets_create(), which behaves like > > kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(), > > which behaves like kmem_cache_alloc().) > > > > Allows for confining allocations to a dedicated set of sized caches > > (which have the same layout as the kmalloc caches). > > > > This can also be used in the future once codetag allocation annotations > > exist to implement per-caller allocation cache isolation[0] even for > > dynamic allocations. > > > > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0] > > > > After the implemetation are 2 example patches of how this could be used > > for some repeat "offenders" that get used in exploits. There are more to > > be isolated beyond just these. Repeating the commit log for patch 8 here: > > > > The msg subsystem is a common target for exploiting[1][2][3][4][5][6] > > use-after-free type confusion flaws in the kernel for both read and > > write primitives. Avoid having a user-controlled size cache share the > > global kmalloc allocator by using a separate set of kmalloc buckets. > > > > Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1] > > Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2] > > Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3] > > Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4] > > Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5] > > Link: https://zplin.me/papers/ELOISE.pdf [6] > > Hi Kees, > > after reading [1] I think the points should be addressed, mainly about the > feasibility of converting users manually. Sure, I can do that. Adding Julien to this thread... Julien can you please respond to LKML patches in email? It's much easier to keep things in a single thread. :) ] This is playing wack-a-mole Kind of, but not really. These patches provide a mechanism for having dedicated dynamically-sized slab caches (to match kmem_cache_create(), which only works for fixed-size allocations). This is needed to expand the codetag work into doing per-call-site allocations, as I detailed here[1]. Also, adding uses manually isn't very difficult, as can be seen in the examples I included. 
In fact, my examples between v1 and v2 collapsed
from 3 to 2, because covering memdup_user() actually covered 2 known
allocation paths (attrs and vma names), and given its usage pattern,
will cover more in the future without changes.

] something like AUTOSLAB would be better

Yes, that's the goal of [1]. This is a prerequisite for that, as
mentioned in the cover letter.

] The slabs needs to be pinned

Yes, and this is a general problem[2] with all kmalloc allocations, though.
This isn't unique to this patch series. SLAB_VIRTUAL solves it, and
is under development.

] Lacks guard pages

Yes, and again, this is a general problem with all kmalloc allocations.
Solving it, like SLAB_VIRTUAL, would be a complementary hardening
improvement to the allocator generally.

] PAX_USERCOPY has been marking these sites since 2012

Either it's whack-a-mole or it's not. :) PAX_USERCOPY shows that it _is_
possible to mark all sites. Regardless, like AUTOSLAB, PAX_USERCOPY isn't
upstream, and its current implementation is an unpublished modification
to a GPL project. I look forward to someone proposing it for inclusion
in Linux, but for now we can work with the patches where an effort _has_
been made to upstream them for the benefit of the entire ecosystem.

] What about CONFIG_KMALLOC_SPLIT_VARSIZE

This proposed improvement is hampered by not having dedicated
_dynamically_ sized kmem caches, which this series provides. And with
codetag-split allocations[1], the goals of CONFIG_KMALLOC_SPLIT_VARSIZE
are more fully realized, providing much more complete coverage.

] I have no idea how the community around the Linux kernel works with
] their email-based workflows

Step 1: reply to the proposal in email instead of (or perhaps in
addition to) making blog posts. :)

> On a related technical note I
> worry what will become of /proc/slabinfo when we convert non-trivial amounts
> of users.

It gets longer. :) And potentially makes the codetag /proc file
redundant. All that said, there are very few APIs in the kernel where
userspace can control both the size and contents of an allocation.

> Also would interested to hear Jann Horn et al.'s opinion, and whether the
> SLAB_VIRTUAL effort will continue?

SLAB_VIRTUAL is needed to address the reclamation UAF gap, and is
still being developed. I don't intend to let it fall off the radar.
(Which is why I included Jann and Matteo in CC originally.)

In the meantime, adding this series as-is kills two long-standing
exploitation methodologies, and paves the way to providing very
fine-grained caches using codetags (which I imagine would be entirely
optional and trivial to control with a boot param).

-Kees

[1] https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook/
[2] https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html
On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote: > On 3/5/24 11:10 AM, Kees Cook wrote: > > Hi, > > > > Repeating the commit logs for patch 4 here: > > > > Dedicated caches are available For fixed size allocations via > > kmem_cache_alloc(), but for dynamically sized allocations there is only > > the global kmalloc API's set of buckets available. This means it isn't > > possible to separate specific sets of dynamically sized allocations into > > a separate collection of caches. > > > > This leads to a use-after-free exploitation weakness in the Linux > > kernel since many heap memory spraying/grooming attacks depend on using > > userspace-controllable dynamically sized allocations to collide with > > fixed size allocations that end up in same cache. > > > > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense > > against these kinds of "type confusion" attacks, including for fixed > > same-size heap objects, we can create a complementary deterministic > > defense for dynamically sized allocations. > > > > In order to isolate user-controllable sized allocations from system > > allocations, introduce kmem_buckets_create(), which behaves like > > kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(), > > which behaves like kmem_cache_alloc().) > > > > Allows for confining allocations to a dedicated set of sized caches > > (which have the same layout as the kmalloc caches). > > > > This can also be used in the future once codetag allocation annotations > > exist to implement per-caller allocation cache isolation[0] even for > > dynamic allocations. > > > > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0] > > > > After the implemetation are 2 example patches of how this could be used > > for some repeat "offenders" that get used in exploits. There are more to > > be isolated beyond just these. Repeating the commit log for patch 8 here: > > > > The msg subsystem is a common target for exploiting[1][2][3][4][5][6] > > use-after-free type confusion flaws in the kernel for both read and > > write primitives. Avoid having a user-controlled size cache share the > > global kmalloc allocator by using a separate set of kmalloc buckets. > > > > Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1] > > Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2] > > Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3] > > Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4] > > Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5] > > Link: https://zplin.me/papers/ELOISE.pdf [6] > > Hi Kees, > > after reading [1] I think the points should be addressed, mainly about the > feasibility of converting users manually. On a related technical note I > worry what will become of /proc/slabinfo when we convert non-trivial amounts > of users. There shouldn't be any need to convert users to this interface - just leverage the alloc_hooks() macro.
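For context on that suggestion: alloc_hooks() is the call-site wrapper macro from the (then in-flight) allocation tagging / codetag series, which gives every wrapped allocation expression its own static per-site tag. The sketch below is a deliberately simplified, conceptual illustration of that pattern; struct site_tag and the set/clear helpers are hypothetical stand-ins, not the real alloc_tag API.

/* Conceptual sketch of call-site tagging. Each expansion defines one
 * static tag per call site; an allocator could key per-site behavior
 * (statistics, or dedicated buckets) off that tag. */
struct site_tag {
        const char *file;
        int line;
};

/* Hypothetical plumbing; the real series tracks the active tag in
 * task/percpu state. No-op stubs keep the sketch self-contained. */
static inline void site_tag_set_current(struct site_tag *tag) { }
static inline void site_tag_clear_current(struct site_tag *tag) { }

#define ALLOC_HOOKS(_do_alloc)                                          \
({                                                                      \
        static struct site_tag _tag = { __FILE__, __LINE__ };           \
        site_tag_set_current(&_tag);                                    \
        typeof(_do_alloc) _res = (_do_alloc);                           \
        site_tag_clear_current(&_tag);                                  \
        _res;                                                           \
})

/* Usage: buf = ALLOC_HOOKS(kmalloc(len, GFP_KERNEL)); */

Because the wrapping happens at the macro layer, existing callers would not need manual conversion, which is the point being made here.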
On Mon, Mar 25, 2024 at 03:32:12PM -0400, Kent Overstreet wrote: > On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote: > > On 3/5/24 11:10 AM, Kees Cook wrote: > > > Hi, > > > > > > Repeating the commit logs for patch 4 here: > > > > > > Dedicated caches are available For fixed size allocations via > > > kmem_cache_alloc(), but for dynamically sized allocations there is only > > > the global kmalloc API's set of buckets available. This means it isn't > > > possible to separate specific sets of dynamically sized allocations into > > > a separate collection of caches. > > > > > > This leads to a use-after-free exploitation weakness in the Linux > > > kernel since many heap memory spraying/grooming attacks depend on using > > > userspace-controllable dynamically sized allocations to collide with > > > fixed size allocations that end up in same cache. > > > > > > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense > > > against these kinds of "type confusion" attacks, including for fixed > > > same-size heap objects, we can create a complementary deterministic > > > defense for dynamically sized allocations. > > > > > > In order to isolate user-controllable sized allocations from system > > > allocations, introduce kmem_buckets_create(), which behaves like > > > kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(), > > > which behaves like kmem_cache_alloc().) > > > > > > Allows for confining allocations to a dedicated set of sized caches > > > (which have the same layout as the kmalloc caches). > > > > > > This can also be used in the future once codetag allocation annotations > > > exist to implement per-caller allocation cache isolation[0] even for > > > dynamic allocations. > > > > > > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0] > > > > > > After the implemetation are 2 example patches of how this could be used > > > for some repeat "offenders" that get used in exploits. There are more to > > > be isolated beyond just these. Repeating the commit log for patch 8 here: > > > > > > The msg subsystem is a common target for exploiting[1][2][3][4][5][6] > > > use-after-free type confusion flaws in the kernel for both read and > > > write primitives. Avoid having a user-controlled size cache share the > > > global kmalloc allocator by using a separate set of kmalloc buckets. > > > > > > Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1] > > > Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2] > > > Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3] > > > Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4] > > > Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5] > > > Link: https://zplin.me/papers/ELOISE.pdf [6] > > > > Hi Kees, > > > > after reading [1] I think the points should be addressed, mainly about the > > feasibility of converting users manually. On a related technical note I > > worry what will become of /proc/slabinfo when we convert non-trivial amounts > > of users. > > There shouldn't be any need to convert users to this interface - just > leverage the alloc_hooks() macro. I expect to do both -- using the alloc_hooks() macro to do per-call-site-allocation caches will certainly have a non-trivial amount of memory usage overhead, and not all systems will want it. 
We can have a boot param to choose between per-site and normal, though normal can include a handful of these manually identified places.
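A sketch of what such a boot-time switch could look like, using the standard __setup() early-parameter hook. The parameter name, the modes, and how they would be consumed are hypothetical; only the __setup() plumbing itself is existing kernel idiom.

/* Hypothetical "slab_bucket_mode=" boot parameter; names and modes are
 * invented for illustration. */
enum slab_bucket_mode {
        SLAB_BUCKETS_MANUAL,    /* only explicitly converted sites (msg, memdup_user, ...) */
        SLAB_BUCKETS_PER_SITE,  /* per-call-site buckets via codetags */
};

static enum slab_bucket_mode bucket_mode __ro_after_init = SLAB_BUCKETS_MANUAL;

static int __init setup_slab_bucket_mode(char *str)
{
        if (!strcmp(str, "per-site"))
                bucket_mode = SLAB_BUCKETS_PER_SITE;
        else if (!strcmp(str, "manual"))
                bucket_mode = SLAB_BUCKETS_MANUAL;
        return 1;
}
__setup("slab_bucket_mode=", setup_slab_bucket_mode);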
25 March 2024 at 19:24, "Kees Cook" <keescook@chromium.org> wrote: > > On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote: > > > > > On 3/5/24 11:10 AM, Kees Cook wrote: > > > > Hi, > > > > > > > > Repeating the commit logs for patch 4 here: > > > > > > > > Dedicated caches are available For fixed size allocations via > > > > kmem_cache_alloc(), but for dynamically sized allocations there is only > > > > the global kmalloc API's set of buckets available. This means it isn't > > > > possible to separate specific sets of dynamically sized allocations into > > > > a separate collection of caches. > > > > > > > > This leads to a use-after-free exploitation weakness in the Linux > > > > kernel since many heap memory spraying/grooming attacks depend on using > > > > userspace-controllable dynamically sized allocations to collide with > > > > fixed size allocations that end up in same cache. > > > > > > > > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense > > > > against these kinds of "type confusion" attacks, including for fixed > > > > same-size heap objects, we can create a complementary deterministic > > > > defense for dynamically sized allocations. > > > > > > > > In order to isolate user-controllable sized allocations from system > > > > allocations, introduce kmem_buckets_create(), which behaves like > > > > kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(), > > > > which behaves like kmem_cache_alloc().) > > > > > > > > Allows for confining allocations to a dedicated set of sized caches > > > > (which have the same layout as the kmalloc caches). > > > > > > > > This can also be used in the future once codetag allocation annotations > > > > exist to implement per-caller allocation cache isolation[0] even for > > > > dynamic allocations. > > > > > > > > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0] > > > > > > > > After the implemetation are 2 example patches of how this could be used > > > > for some repeat "offenders" that get used in exploits. There are more to > > > > be isolated beyond just these. Repeating the commit log for patch 8 here: > > > > > > > > The msg subsystem is a common target for exploiting[1][2][3][4][5][6] > > > > use-after-free type confusion flaws in the kernel for both read and > > > > write primitives. Avoid having a user-controlled size cache share the > > > > global kmalloc allocator by using a separate set of kmalloc buckets. > > > > > > > > Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1] > > > > Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2] > > > > Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3] > > > > Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4] > > > > Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5] > > > > Link: https://zplin.me/papers/ELOISE.pdf [6] > > > > > > > > Hi Kees, > > > > > > > > after reading [1] I think the points should be addressed, mainly about the > > > > feasibility of converting users manually. > > > > Sure, I can do that. > > Adding Julien to this thread... Julien can you please respond to LKML > > patches in email? It's much easier to keep things in a single thread. :) > > ] This is playing wack-a-mole > > Kind of, but not really. 
> > These patches provide a mechanism for having
> > dedicated dynamically-sized slab caches (to match kmem_cache_create(),
> > which only works for fixed-size allocations). This is needed to expand
> > the codetag work into doing per-call-site allocations, as I detailed
> > here[1].
> >
> > Also, adding uses manually isn't very difficult, as can be seen in the
> > examples I included. In fact, my examples between v1 and v2 collapsed
> > from 3 to 2, because covering memdup_user() actually covered 2 known
> > allocation paths (attrs and vma names), and given its usage pattern,
> > will cover more in the future without changes.

It's not about difficulty, it's about scale. There are hundreds of
interesting structures: I'm worried that no one will take the time to add
a separate bucket for each of them, chase their call-sites down, and
monitor every single newly added structure to check if it is "interesting"
and should benefit from its own bucket as well.

> > ] something like AUTOSLAB would be better
> >
> > Yes, that's the goal of [1]. This is a prerequisite for that, as
> > mentioned in the cover letter.

This series looks unrelated to [1] to me: the former adds a mechanism to
add buckets and expects developers to manually make use of them, while the
latter is about adding infrastructure to automate call-site-based
segregation.

> > ] The slabs needs to be pinned
> >
> > Yes, and this is a general problem[2] with all kmalloc allocations, though.
> > This isn't unique to this patch series. SLAB_VIRTUAL solves it, and
> > is under development.

Then it would be nice to mention it in the series, as an acknowledged
limitation.

> > ] Lacks guard pages
> >
> > Yes, and again, this is a general problem with all kmalloc allocations.
> > Solving it, like SLAB_VIRTUAL, would be a complementary hardening
> > improvement to the allocator generally.

Then it would also be nice to mention it, because currently it's unclear
that those limitations are both known and will be properly addressed.

> > ] PAX_USERCOPY has been marking these sites since 2012
> >
> > Either it's whack-a-mole or it's not. :)

This annotation was added 12 years ago in PaX, and while it was state of
the art back then, I think that in 2024 we can do better than this.

> > PAX_USERCOPY shows that it _is_ possible to mark all sites.

It shows that it's possible to annotate some sites (17 in
grsecurity-3.1-4.9.9-201702122044.patch), and while it has a similar
approach to your series, its annotations don't convey the same meaning.

> > Regardless, like AUTOSLAB, PAX_USERCOPY isn't
> > upstream, and its current implementation is an unpublished modification
> > to a GPL project.
> >
> > ] What about CONFIG_KMALLOC_SPLIT_VARSIZE
> >
> > This proposed improvement is hampered by not having dedicated
> > _dynamically_ sized kmem caches, which this series provides. And with
> > codetag-split allocations[1], the goals of CONFIG_KMALLOC_SPLIT_VARSIZE
> > are more fully realized, providing much more complete coverage.

CONFIG_KMALLOC_SPLIT_VARSIZE has been bypassed dozens of times in various
ways as part of Google's kernelCTF. Your series is, to my understanding,
a weaker form of it. So I'm not super-convinced that it's the right
approach to mitigate UAF.
Do you think it would be possible for Google to add this series to its
kernelCTF, to gather empirical data on how feasible/easy it is to bypass it?

> > ] I have no idea how the community around the Linux kernel works with
> > ] their email-based workflows
> >
> > Step 1: reply to the proposal in email instead of (or perhaps in
> > addition to) making blog posts. :)
> >
> > > > On a related technical note I
> > > > worry what will become of /proc/slabinfo when we convert non-trivial amounts
> > > > of users.
> >
> > It gets longer. :) And potentially makes the codetag /proc file
> > redundant. All that said, there are very few APIs in the kernel where
> > userspace can control both the size and contents of an allocation.
> >
> > > > Also would interested to hear Jann Horn et al.'s opinion, and whether the
> > > > SLAB_VIRTUAL effort will continue?
> >
> > SLAB_VIRTUAL is needed to address the reclamation UAF gap, and is
> > still being developed. I don't intend to let it fall off the radar.
> > (Which is why I included Jann and Matteo in CC originally.)
> >
> > In the meantime, adding this series as-is kills two long-standing
> > exploitation methodologies, and paves the way to providing very
> > fine-grained caches using codetags (which I imagine would be entirely
> > optional and trivial to control with a boot param).
> >
> > -Kees
> >
> > [1] https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook/
> > [2] https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html
> >
> > --
> > Kees Cook
On Tue, Mar 26, 2024 at 06:07:07PM +0000, julien.voisin@dustri.org wrote: > 25 March 2024 at 19:24, "Kees Cook" <keescook@chromium.org> wrote: > > On Mon, Mar 25, 2024 at 10:03:23AM +0100, Vlastimil Babka wrote: > > > On 3/5/24 11:10 AM, Kees Cook wrote: > > > Hi, > > > > > > Repeating the commit logs for patch 4 here: > > > > > > Dedicated caches are available For fixed size allocations via > > > kmem_cache_alloc(), but for dynamically sized allocations there is only > > > the global kmalloc API's set of buckets available. This means it isn't > > > possible to separate specific sets of dynamically sized allocations into > > > a separate collection of caches. > > > > > > This leads to a use-after-free exploitation weakness in the Linux > > > kernel since many heap memory spraying/grooming attacks depend on using > > > userspace-controllable dynamically sized allocations to collide with > > > fixed size allocations that end up in same cache. > > > > > > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense > > > against these kinds of "type confusion" attacks, including for fixed > > > same-size heap objects, we can create a complementary deterministic > > > defense for dynamically sized allocations. > > > > > > In order to isolate user-controllable sized allocations from system > > > allocations, introduce kmem_buckets_create(), which behaves like > > > kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(), > > > which behaves like kmem_cache_alloc().) > > > > > > Allows for confining allocations to a dedicated set of sized caches > > > (which have the same layout as the kmalloc caches). > > > > > > This can also be used in the future once codetag allocation annotations > > > exist to implement per-caller allocation cache isolation[0] even for > > > dynamic allocations. > > > > > > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [0] > > > > > > After the implemetation are 2 example patches of how this could be used > > > for some repeat "offenders" that get used in exploits. There are more to > > > be isolated beyond just these. Repeating the commit log for patch 8 here: > > > > > > The msg subsystem is a common target for exploiting[1][2][3][4][5][6] > > > use-after-free type confusion flaws in the kernel for both read and > > > write primitives. Avoid having a user-controlled size cache share the > > > global kmalloc allocator by using a separate set of kmalloc buckets. > > > > > > Link: https://blog.hacktivesecurity.com/index.php/2022/06/13/linux-kernel-exploit-development-1day-case-study/ [1] > > > Link: https://hardenedvault.net/blog/2022-11-13-msg_msg-recon-mitigation-ved/ [2] > > > Link: https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html [3] > > > Link: https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html [4] > > > Link: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html [5] > > > Link: https://zplin.me/papers/ELOISE.pdf [6] > > > > > > Hi Kees, > > > > > > after reading [1] I think the points should be addressed, mainly about the > > > feasibility of converting users manually. > > > > Sure, I can do that. > > Adding Julien to this thread... Julien can you please respond to LKML > > patches in email? It's much easier to keep things in a single thread. :) > > > > ] This is playing wack-a-mole > > Kind of, but not really. 
> > These patches provide a mechanism for having
> > dedicated dynamically-sized slab caches (to match kmem_cache_create(),
> > which only works for fixed-size allocations). This is needed to expand
> > the codetag work into doing per-call-site allocations, as I detailed
> > here[1].
> >
> > Also, adding uses manually isn't very difficult, as can be seen in the
> > examples I included. In fact, my examples between v1 and v2 collapsed
> > from 3 to 2, because covering memdup_user() actually covered 2 known
> > allocation paths (attrs and vma names), and given its usage pattern,
> > will cover more in the future without changes.
>
> It's not about difficulty, it's about scale. There are hundreds of
> interesting structures: I'm worried that no one will take the time to add
> a separate bucket for each of them, chase their call-sites down, and
> monitor every single newly added structure to check if it is "interesting"
> and should benefit from its own bucket as well.

Very few are both: 1) dynamically sized, and 2) coming from userspace, so
I think the scale is fine.

> > ] something like AUTOSLAB would be better
> >
> > Yes, that's the goal of [1]. This is a prerequisite for that, as
> > mentioned in the cover letter.
>
> This series looks unrelated to [1] to me: the former adds a mechanism to
> add buckets and expects developers to manually make use of them, while the
> latter is about adding infrastructure to automate call-site-based
> segregation.

Right -- but for call-site-based separation, there is currently no way to
separate _dynamically_ sized allocations; only fixed size (via
kmem_cache_create()). This series adds the ability for call-site-based
separation to also use kmem_buckets_create(). Call-site-based separation
isn't possible without this series.

> > ] The slabs needs to be pinned
> >
> > Yes, and this is a general problem[2] with all kmalloc allocations, though.
> > This isn't unique to this patch series. SLAB_VIRTUAL solves it, and
> > is under development.
>
> Then it would be nice to mention it in the series, as an acknowledged
> limitation.

Sure, I can update the cover letter.

> > ] Lacks guard pages
> >
> > Yes, and again, this is a general problem with all kmalloc allocations.
> > Solving it, like SLAB_VIRTUAL, would be a complementary hardening
> > improvement to the allocator generally.
>
> Then it would also be nice to mention it, because currently it's unclear
> that those limitations are both known and will be properly addressed.

Sure. For both this and pinning, the issues are orthogonal, so it didn't
seem useful to distract from what the series was doing, but I can
explicitly mention them going forward.

> > ] PAX_USERCOPY has been marking these sites since 2012
> >
> > Either it's whack-a-mole or it's not. :)
>
> This annotation was added 12 years ago in PaX, and while it was state of
> the art back then, I think that in 2024 we can do better than this.

Agreed. Here's my series to start that. :)

> > PAX_USERCOPY shows that it _is_ possible to mark all sites.
>
> It shows that it's possible to annotate some sites (17 in
> grsecurity-3.1-4.9.9-201702122044.patch), and while it has a similar
> approach to your series, its annotations don't convey the same meaning.

Sure, GFP_USERCOPY is separate.

> > Regardless, like AUTOSLAB, PAX_USERCOPY isn't
> > upstream, and its current implementation is an unpublished modification
> > to a GPL project.
I look forward to someone proposing it for inclusion > > in Linux, but for now we can work with the patches where an effort _has_ > > been made to upstream them for the benefit of the entire ecosystem. > > ] What about CONFIG_KMALLOC_SPLIT_VARSIZE > > This proposed improvement is hampered by not having dedicated > > _dynamically_ sized kmem caches, which this series provides. And with > > codetag-split allocations[1], the goals of CONFIG_KMALLOC_SPLIT_VARSIZE > > are more fully realized, providing much more complete coverage. > > CONFIG_KMALLOC_SPLIT_VARSIZE has been bypassed dozen of times in various ways as part of Google's kernelCTF. > Your series is, to my understanding, a weaker form of it. So I'm not super-convinced that it's the right approach to mitigate UAF. This series doesn't do anything that CONFIG_KMALLOC_SPLIT_VARSIZE does. The call-site-separation series (which would depend on this series) would do that work. > Do you think it would be possible for Google to add this series to its kernelCTF, so gather empirical data on how feasible/easy it is to bypass it? Sure, feel free to make that happen. :) But again, I'm less interested in this series as a _standalone_ solution. It's a prerequisite for call-site-based allocation separation. As part of it, though, we can plug the blatant exploitation methods that currently exist. -Kees