Message ID | 20180802132133.23999-1-ard.biesheuvel@linaro.org
Series | arm64: basic ROP mitigation
On 08/02/2018 03:21 PM, Ard Biesheuvel wrote: > The idea is that we can significantly limit the kernel's attack surface > for ROP based attacks by clearing the stack pointer's sign bit before > returning from a function, and setting it again right after proceeding > from the [expected] return address. This should make it much more difficult > to return to arbitrary gadgets, given that they rely on being chained to > the next via a return address popped off the stack, and this is difficult > when the stack pointer is invalid. Doesn't this break stack unwinding? Thanks, Florian
On 6 August 2018 at 12:07, Florian Weimer <fweimer@redhat.com> wrote: > On 08/02/2018 03:21 PM, Ard Biesheuvel wrote: >> >> The idea is that we can significantly limit the kernel's attack surface >> for ROP based attacks by clearing the stack pointer's sign bit before >> returning from a function, and setting it again right after proceeding >> from the [expected] return address. This should make it much more >> difficult >> to return to arbitrary gadgets, given that they rely on being chained to >> the next via a return address popped off the stack, and this is difficult >> when the stack pointer is invalid. > > > Doesn't this break stack unwinding? > Any exception that is taken between clearing and setting the SP bit will first set the bit again. Since arm64 does not rely on hardware to preserve the exception context, we can do this in code before preserving the registers (which is why we need the 'little dance' in the last patch). So any time the stack unwinding code runs, the stack pointer should be valid.
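To make the scheme easier to picture, here is a minimal sketch of what an instrumented call/return pair under the bit-55 scheme might look like. This is not the code generated by the plugin in the series (the cover letter only says the real sequences are movs and adds); the use of and/orr with a logical immediate, and of x16/x30 as scratch registers, are assumptions for illustration only.

        .text
        .globl  callee
callee:
        stp     x29, x30, [sp, #-16]!
        mov     x29, sp
        // ... function body ...
        ldp     x29, x30, [sp], #16
        mov     x16, sp                         // x16 chosen arbitrarily as scratch
        and     sp, x16, #0xff7fffffffffffff    // clear bit 55: SP is no longer a valid
                                                // TTBR1 address until the caller fixes it
        ret

        .globl  caller
caller:
        stp     x29, x30, [sp, #-16]!
        mov     x29, sp
        bl      callee
        mov     x30, sp                         // x30 is dead after the call, reuse it
        orr     sp, x30, #0x0080000000000000    // set bit 55 again, right after the
                                                // [expected] return address
        ldp     x29, x30, [sp], #16
        // (a fully instrumented caller would clear bit 55 again here, before its ret)
        ret

A gadget reached by overwriting a saved x30 on the stack therefore runs with a stack pointer that faults on its first load or store, which is the property the series is after.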
On 02/08/18 14:21, Ard Biesheuvel wrote: > This is a proof of concept I cooked up, primarily to trigger a discussion > about whether there is a point to doing anything like this, and if there > is, what the pitfalls are. Also, while I am not aware of any similar > implementations, the idea is so simple that I would be surprised if nobody > else thought of the same thing way before I did. So, "TTBR0 PAN: Pointer Auth edition"? :P > The idea is that we can significantly limit the kernel's attack surface > for ROP based attacks by clearing the stack pointer's sign bit before > returning from a function, and setting it again right after proceeding > from the [expected] return address. This should make it much more difficult > to return to arbitrary gadgets, given that they rely on being chained to > the next via a return address popped off the stack, and this is difficult > when the stack pointer is invalid. > > Of course, 4 additional instructions per function return is not exactly > for free, but they are just movs and adds, and leaf functions are > disregarded unless they allocate a stack frame (this comes for free > because simple_return insns are disregarded by the plugin) > > Please shoot, preferably with better ideas ... Actually, on the subject of PAN, shouldn't this at least have a very hard dependency on that? AFAICS without PAN clearing bit 55 of SP is effectively giving userspace direct control of the kernel stack (thanks to TBI). Ouch. I wonder if there's a little more mileage in using "{add,sub} sp, sp, #1" sequences to rely on stack alignment exceptions instead, with the added bonus that that's about as low as the instruction-level overhead can get. Robin. > > Ard Biesheuvel (3): > arm64: use wrapper macro for bl/blx instructions from asm code > gcc: plugins: add ROP shield plugin for arm64 > arm64: enable ROP protection by clearing SP bit #55 across function > returns > > arch/Kconfig | 4 + > arch/arm64/Kconfig | 10 ++ > arch/arm64/include/asm/assembler.h | 21 +++- > arch/arm64/kernel/entry-ftrace.S | 6 +- > arch/arm64/kernel/entry.S | 104 +++++++++------- > arch/arm64/kernel/head.S | 4 +- > arch/arm64/kernel/probes/kprobes_trampoline.S | 2 +- > arch/arm64/kernel/sleep.S | 6 +- > drivers/firmware/efi/libstub/Makefile | 3 +- > scripts/Makefile.gcc-plugins | 7 ++ > scripts/gcc-plugins/arm64_rop_shield_plugin.c | 116 ++++++++++++++++++ > 11 files changed, 228 insertions(+), 55 deletions(-) > create mode 100644 scripts/gcc-plugins/arm64_rop_shield_plugin.c >
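If I read Robin's suggestion right, the alignment-exception variant would look roughly like the sketch below. This is an interpretation rather than a worked-out proposal, and it assumes SP alignment checking (SCTLR_EL1.SA) is enabled at EL1 so that any SP-relative access with a misaligned SP faults.

        .text
callee:
        stp     x29, x30, [sp, #-16]!
        mov     x29, sp
        // ... function body ...
        ldp     x29, x30, [sp], #16
        add     sp, sp, #1              // leave SP misaligned across the return
        ret

call_site:
        bl      callee
        sub     sp, sp, #1              // realign right after the expected return
                                        // address; note that a plain sub is not
                                        // idempotent, which is the concern Ard
                                        // raises further down the thread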
On 6 August 2018 at 15:55, Robin Murphy <robin.murphy@arm.com> wrote: > On 02/08/18 14:21, Ard Biesheuvel wrote: >> >> This is a proof of concept I cooked up, primarily to trigger a discussion >> about whether there is a point to doing anything like this, and if there >> is, what the pitfalls are. Also, while I am not aware of any similar >> implementations, the idea is so simple that I would be surprised if nobody >> else thought of the same thing way before I did. > > > So, "TTBR0 PAN: Pointer Auth edition"? :P > >> The idea is that we can significantly limit the kernel's attack surface >> for ROP based attacks by clearing the stack pointer's sign bit before >> returning from a function, and setting it again right after proceeding >> from the [expected] return address. This should make it much more >> difficult >> to return to arbitrary gadgets, given that they rely on being chained to >> the next via a return address popped off the stack, and this is difficult >> when the stack pointer is invalid. >> >> Of course, 4 additional instructions per function return is not exactly >> for free, but they are just movs and adds, and leaf functions are >> disregarded unless they allocate a stack frame (this comes for free >> because simple_return insns are disregarded by the plugin) >> >> Please shoot, preferably with better ideas ... > > > Actually, on the subject of PAN, shouldn't this at least have a very hard > dependency on that? AFAICS without PAN clearing bit 55 of SP is effectively > giving userspace direct control of the kernel stack (thanks to TBI). Ouch. > How's that? Bits 52 .. 54 will still be set, so SP will never contain a valid userland address in any case. Or am I missing something? > I wonder if there's a little more mileage in using "{add,sub} sp, sp, #1" > sequences to rely on stack alignment exceptions instead, with the added > bonus that that's about as low as the instruction-level overhead can get. > Good point. I did consider that, but couldn't convince myself that it isn't easier to defeat: loads via x29 occur reasonably often, and you can simply offset your doctored stack frame by a single byte. >> >> Ard Biesheuvel (3): >> arm64: use wrapper macro for bl/blx instructions from asm code >> gcc: plugins: add ROP shield plugin for arm64 >> arm64: enable ROP protection by clearing SP bit #55 across function >> returns >> >> arch/Kconfig | 4 + >> arch/arm64/Kconfig | 10 ++ >> arch/arm64/include/asm/assembler.h | 21 +++- >> arch/arm64/kernel/entry-ftrace.S | 6 +- >> arch/arm64/kernel/entry.S | 104 +++++++++------- >> arch/arm64/kernel/head.S | 4 +- >> arch/arm64/kernel/probes/kprobes_trampoline.S | 2 +- >> arch/arm64/kernel/sleep.S | 6 +- >> drivers/firmware/efi/libstub/Makefile | 3 +- >> scripts/Makefile.gcc-plugins | 7 ++ >> scripts/gcc-plugins/arm64_rop_shield_plugin.c | 116 ++++++++++++++++++ >> 11 files changed, 228 insertions(+), 55 deletions(-) >> create mode 100644 scripts/gcc-plugins/arm64_rop_shield_plugin.c >> >
On 6 August 2018 at 16:04, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 6 August 2018 at 15:55, Robin Murphy <robin.murphy@arm.com> wrote: >> On 02/08/18 14:21, Ard Biesheuvel wrote: >>> >>> This is a proof of concept I cooked up, primarily to trigger a discussion >>> about whether there is a point to doing anything like this, and if there >>> is, what the pitfalls are. Also, while I am not aware of any similar >>> implementations, the idea is so simple that I would be surprised if nobody >>> else thought of the same thing way before I did. >> >> >> So, "TTBR0 PAN: Pointer Auth edition"? :P >> >>> The idea is that we can significantly limit the kernel's attack surface >>> for ROP based attacks by clearing the stack pointer's sign bit before >>> returning from a function, and setting it again right after proceeding >>> from the [expected] return address. This should make it much more >>> difficult >>> to return to arbitrary gadgets, given that they rely on being chained to >>> the next via a return address popped off the stack, and this is difficult >>> when the stack pointer is invalid. >>> >>> Of course, 4 additional instructions per function return is not exactly >>> for free, but they are just movs and adds, and leaf functions are >>> disregarded unless they allocate a stack frame (this comes for free >>> because simple_return insns are disregarded by the plugin) >>> >>> Please shoot, preferably with better ideas ... >> >> >> Actually, on the subject of PAN, shouldn't this at least have a very hard >> dependency on that? AFAICS without PAN clearing bit 55 of SP is effectively >> giving userspace direct control of the kernel stack (thanks to TBI). Ouch. >> > > How's that? Bits 52 .. 54 will still be set, so SP will never contain > a valid userland address in any case. Or am I missing something? > >> I wonder if there's a little more mileage in using "{add,sub} sp, sp, #1" >> sequences to rely on stack alignment exceptions instead, with the added >> bonus that that's about as low as the instruction-level overhead can get. >> > > Good point. I did consider that, but couldn't convince myself that it > isn't easier to defeat: loads via x29 occur reasonably often, and you > can simply offset your doctored stack frame by a single byte. > Also, the restore has to be idempotent: only functions that modify sp set the bit, so it cannot be reset unconditionally. Also, when taking an exception in the middle, we'll return with the bit set even if it was clear when the exception was taken.
On 06/08/18 15:04, Ard Biesheuvel wrote: > On 6 August 2018 at 15:55, Robin Murphy <robin.murphy@arm.com> wrote: >> On 02/08/18 14:21, Ard Biesheuvel wrote: >>> >>> This is a proof of concept I cooked up, primarily to trigger a discussion >>> about whether there is a point to doing anything like this, and if there >>> is, what the pitfalls are. Also, while I am not aware of any similar >>> implementations, the idea is so simple that I would be surprised if nobody >>> else thought of the same thing way before I did. >> >> >> So, "TTBR0 PAN: Pointer Auth edition"? :P >> >>> The idea is that we can significantly limit the kernel's attack surface >>> for ROP based attacks by clearing the stack pointer's sign bit before >>> returning from a function, and setting it again right after proceeding >>> from the [expected] return address. This should make it much more >>> difficult >>> to return to arbitrary gadgets, given that they rely on being chained to >>> the next via a return address popped off the stack, and this is difficult >>> when the stack pointer is invalid. >>> >>> Of course, 4 additional instructions per function return is not exactly >>> for free, but they are just movs and adds, and leaf functions are >>> disregarded unless they allocate a stack frame (this comes for free >>> because simple_return insns are disregarded by the plugin) >>> >>> Please shoot, preferably with better ideas ... >> >> >> Actually, on the subject of PAN, shouldn't this at least have a very hard >> dependency on that? AFAICS without PAN clearing bit 55 of SP is effectively >> giving userspace direct control of the kernel stack (thanks to TBI). Ouch. >> > > How's that? Bits 52 .. 54 will still be set, so SP will never contain > a valid userland address in any case. Or am I missing something? Ah, yes, I'd managed to forget about the address hole, but I think that only makes it a bit trickier, rather than totally safe - it feels like you just need to chain one or two returns through "valid" targets until you can hit an epilogue with a "mov sp, x29" (at first glance there are a fair few of those in my vmlinux), after which we're back to the bit 55 scheme alone giving no protection against retargeting the stack to a valid TTBR0 address. >> I wonder if there's a little more mileage in using "{add,sub} sp, sp, #1" >> sequences to rely on stack alignment exceptions instead, with the added >> bonus that that's about as low as the instruction-level overhead can get. >> > > Good point. I did consider that, but couldn't convince myself that it > isn't easier to defeat: loads via x29 occur reasonably often, and you > can simply offset your doctored stack frame by a single byte. True; in theory there are 3072 possible unaligned offsets to choose from, but compile-time randomisation doesn't seem much use, and hotpatching just about every function call in the kernel isn't a nice thought either. Robin.
On 6 August 2018 at 17:38, Robin Murphy <robin.murphy@arm.com> wrote: > On 06/08/18 15:04, Ard Biesheuvel wrote: >> >> On 6 August 2018 at 15:55, Robin Murphy <robin.murphy@arm.com> wrote: >>> >>> On 02/08/18 14:21, Ard Biesheuvel wrote: >>>> >>>> >>>> This is a proof of concept I cooked up, primarily to trigger a >>>> discussion >>>> about whether there is a point to doing anything like this, and if there >>>> is, what the pitfalls are. Also, while I am not aware of any similar >>>> implementations, the idea is so simple that I would be surprised if >>>> nobody >>>> else thought of the same thing way before I did. >>> >>> >>> >>> So, "TTBR0 PAN: Pointer Auth edition"? :P >>> >>>> The idea is that we can significantly limit the kernel's attack surface >>>> for ROP based attacks by clearing the stack pointer's sign bit before >>>> returning from a function, and setting it again right after proceeding >>>> from the [expected] return address. This should make it much more >>>> difficult >>>> to return to arbitrary gadgets, given that they rely on being chained to >>>> the next via a return address popped off the stack, and this is >>>> difficult >>>> when the stack pointer is invalid. >>>> >>>> Of course, 4 additional instructions per function return is not exactly >>>> for free, but they are just movs and adds, and leaf functions are >>>> disregarded unless they allocate a stack frame (this comes for free >>>> because simple_return insns are disregarded by the plugin) >>>> >>>> Please shoot, preferably with better ideas ... >>> >>> >>> >>> Actually, on the subject of PAN, shouldn't this at least have a very hard >>> dependency on that? AFAICS without PAN clearing bit 55 of SP is >>> effectively >>> giving userspace direct control of the kernel stack (thanks to TBI). >>> Ouch. >>> >> >> How's that? Bits 52 .. 54 will still be set, so SP will never contain >> a valid userland address in any case. Or am I missing something? > > > Ah, yes, I'd managed to forget about the address hole, but I think that only > makes it a bit trickier, rather than totally safe - it feels like you just > need to chain one or two returns through "valid" targets until you can hit > an epilogue with a "mov sp, x29" (at first glance there are a fair few of > those in my vmlinux), after which we're back to the bit 55 scheme alone > giving no protection against retargeting the stack to a valid TTBR0 address. > Wouldn't such an epilogue clear the SP bit before returning again? >>> I wonder if there's a little more mileage in using "{add,sub} sp, sp, >>> #1" >>> sequences to rely on stack alignment exceptions instead, with the added >>> bonus that that's about as low as the instruction-level overhead can get. >>> >> >> Good point. I did consider that, but couldn't convince myself that it >> isn't easier to defeat: loads via x29 occur reasonably often, and you >> can simply offset your doctored stack frame by a single byte. > > > True; in theory there are 3072 possible unaligned offsets to choose from, > but compile-time randomisation doesn't seem much use, and hotpatching just > about every function call in the kernel isn't a nice thought either. > > Robin.
On 6 August 2018 at 17:50, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 6 August 2018 at 17:38, Robin Murphy <robin.murphy@arm.com> wrote: >> On 06/08/18 15:04, Ard Biesheuvel wrote: >>> >>> On 6 August 2018 at 15:55, Robin Murphy <robin.murphy@arm.com> wrote: >>>> >>>> On 02/08/18 14:21, Ard Biesheuvel wrote: >>>>> >>>>> >>>>> This is a proof of concept I cooked up, primarily to trigger a >>>>> discussion >>>>> about whether there is a point to doing anything like this, and if there >>>>> is, what the pitfalls are. Also, while I am not aware of any similar >>>>> implementations, the idea is so simple that I would be surprised if >>>>> nobody >>>>> else thought of the same thing way before I did. >>>> >>>> >>>> >>>> So, "TTBR0 PAN: Pointer Auth edition"? :P >>>> >>>>> The idea is that we can significantly limit the kernel's attack surface >>>>> for ROP based attacks by clearing the stack pointer's sign bit before >>>>> returning from a function, and setting it again right after proceeding >>>>> from the [expected] return address. This should make it much more >>>>> difficult >>>>> to return to arbitrary gadgets, given that they rely on being chained to >>>>> the next via a return address popped off the stack, and this is >>>>> difficult >>>>> when the stack pointer is invalid. >>>>> >>>>> Of course, 4 additional instructions per function return is not exactly >>>>> for free, but they are just movs and adds, and leaf functions are >>>>> disregarded unless they allocate a stack frame (this comes for free >>>>> because simple_return insns are disregarded by the plugin) >>>>> >>>>> Please shoot, preferably with better ideas ... >>>> >>>> >>>> >>>> Actually, on the subject of PAN, shouldn't this at least have a very hard >>>> dependency on that? AFAICS without PAN clearing bit 55 of SP is >>>> effectively >>>> giving userspace direct control of the kernel stack (thanks to TBI). >>>> Ouch. >>>> >>> >>> How's that? Bits 52 .. 54 will still be set, so SP will never contain >>> a valid userland address in any case. Or am I missing something? >> >> >> Ah, yes, I'd managed to forget about the address hole, but I think that only >> makes it a bit trickier, rather than totally safe - it feels like you just >> need to chain one or two returns through "valid" targets until you can hit >> an epilogue with a "mov sp, x29" (at first glance there are a fair few of >> those in my vmlinux), after which we're back to the bit 55 scheme alone >> giving no protection against retargeting the stack to a valid TTBR0 address. >> > > Wouldn't such an epilogue clear the SP bit before returning again? > ... or are you saying you can play tricks and clear bits 52 .. 54 ? If so, you can already do that, right? And apply it to bit 55 as well?
On 06/08/18 17:04, Ard Biesheuvel wrote: > On 6 August 2018 at 17:50, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >> On 6 August 2018 at 17:38, Robin Murphy <robin.murphy@arm.com> wrote: >>> On 06/08/18 15:04, Ard Biesheuvel wrote: >>>> >>>> On 6 August 2018 at 15:55, Robin Murphy <robin.murphy@arm.com> wrote: >>>>> >>>>> On 02/08/18 14:21, Ard Biesheuvel wrote: >>>>>> >>>>>> >>>>>> This is a proof of concept I cooked up, primarily to trigger a >>>>>> discussion >>>>>> about whether there is a point to doing anything like this, and if there >>>>>> is, what the pitfalls are. Also, while I am not aware of any similar >>>>>> implementations, the idea is so simple that I would be surprised if >>>>>> nobody >>>>>> else thought of the same thing way before I did. >>>>> >>>>> >>>>> >>>>> So, "TTBR0 PAN: Pointer Auth edition"? :P >>>>> >>>>>> The idea is that we can significantly limit the kernel's attack surface >>>>>> for ROP based attacks by clearing the stack pointer's sign bit before >>>>>> returning from a function, and setting it again right after proceeding >>>>>> from the [expected] return address. This should make it much more >>>>>> difficult >>>>>> to return to arbitrary gadgets, given that they rely on being chained to >>>>>> the next via a return address popped off the stack, and this is >>>>>> difficult >>>>>> when the stack pointer is invalid. >>>>>> >>>>>> Of course, 4 additional instructions per function return is not exactly >>>>>> for free, but they are just movs and adds, and leaf functions are >>>>>> disregarded unless they allocate a stack frame (this comes for free >>>>>> because simple_return insns are disregarded by the plugin) >>>>>> >>>>>> Please shoot, preferably with better ideas ... >>>>> >>>>> >>>>> >>>>> Actually, on the subject of PAN, shouldn't this at least have a very hard >>>>> dependency on that? AFAICS without PAN clearing bit 55 of SP is >>>>> effectively >>>>> giving userspace direct control of the kernel stack (thanks to TBI). >>>>> Ouch. >>>>> >>>> >>>> How's that? Bits 52 .. 54 will still be set, so SP will never contain >>>> a valid userland address in any case. Or am I missing something? >>> >>> >>> Ah, yes, I'd managed to forget about the address hole, but I think that only >>> makes it a bit trickier, rather than totally safe - it feels like you just >>> need to chain one or two returns through "valid" targets until you can hit >>> an epilogue with a "mov sp, x29" (at first glance there are a fair few of >>> those in my vmlinux), after which we're back to the bit 55 scheme alone >>> giving no protection against retargeting the stack to a valid TTBR0 address. >>> >> >> Wouldn't such an epilogue clear the SP bit before returning again? >> > > ... or are you saying you can play tricks and clear bits 52 .. 54 ? If > so, you can already do that, right? And apply it to bit 55 as well? Indeed, in this scenario clearing bit 55 immediately before the final ret does nothing because the "valid" return beforehand loaded x29 with an arbitrary userspace address from a doctored stack frame, so the rest of that epilogue beyond that first mov already ran off the fake stack. Admittedly you might have to retain control of the "real" kernel stack and go through much the same dance if the gadget chain ever needs to pass through a real return target (to mitigate bit 55 being unconditionally set again to make an invalid TTBR1 address). Working around the mitigations certainly makes the exploit more difficult, but still seemingly far from impossible. 
And yes, AFAICS an attacker could indeed use the same SP-hijacking trick today (in the same absence of PAN), it's just that without any mitigations to prevent using the kernel stack alone I can't imagine it would be worth the extra complication. I guess what I'm getting at is that if the protection mechanism is "always return with SP outside TTBR1", there seems little point in going through the motions if SP in TTBR0 could still be valid and allow an attack to succeed anyway; this is basically just me working through a justification for saying the proposed scheme needs "depends on ARM64_PAN || ARM64_SW_TTBR0_PAN", making it that much uglier for v8.0 CPUs... Robin.
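For concreteness, the epilogues Robin refers to typically look like the fragment below (illustrative only, not lifted from any particular vmlinux). Once a fake frame has put an attacker-chosen value into x29, everything after the first mov runs off the fake stack, so clearing bit 55 just before the ret no longer restricts where the return can go.

        // epilogue of a function that restores SP from the frame pointer
        mov     sp, x29                 // SP now comes from x29 (attacker-controlled
                                        // if a fake frame was planted earlier)
        ldp     x29, x30, [sp], #48     // x29 and x30 reload from the fake stack
        ret                             // returns to whatever the fake frame holds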
On Mon, Aug 6, 2018 at 10:45 AM, Robin Murphy <robin.murphy@arm.com> wrote: > I guess what I'm getting at is that if the protection mechanism is "always > return with SP outside TTBR1", there seems little point in going through the > motions if SP in TTBR0 could still be valid and allow an attack to succeed > anyway; this is basically just me working through a justification for saying > the proposed scheme needs "depends on ARM64_PAN || ARM64_SW_TTBR0_PAN", > making it that much uglier for v8.0 CPUs... I think anyone with v8.0 CPUs interested in this mitigation would also very much want PAN emulation. If a "depends on" isn't desired, what about "imply" in the Kconfig? -Kees
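In Kconfig terms, the two options being discussed would look roughly like the fragment below; ARM64_ROP_SHIELD is the symbol that appears in Ard's entry.S snippet further down, but the prompt text and placement are only a sketch, not taken from the series.

config ARM64_ROP_SHIELD
	bool "Poison the stack pointer across function returns"
	depends on GCC_PLUGINS
	# hard requirement, as Robin argues:
	depends on ARM64_PAN || ARM64_SW_TTBR0_PAN
	# ...or, as Kees suggests, merely nudge the user towards PAN:
	# imply ARM64_PAN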
On 6 August 2018 at 20:49, Kees Cook <keescook@chromium.org> wrote: > On Mon, Aug 6, 2018 at 10:45 AM, Robin Murphy <robin.murphy@arm.com> wrote: >> I guess what I'm getting at is that if the protection mechanism is "always >> return with SP outside TTBR1", there seems little point in going through the >> motions if SP in TTBR0 could still be valid and allow an attack to succeed >> anyway; this is basically just me working through a justification for saying >> the proposed scheme needs "depends on ARM64_PAN || ARM64_SW_TTBR0_PAN", >> making it that much uglier for v8.0 CPUs... > > I think anyone with v8.0 CPUs interested in this mitigation would also > very much want PAN emulation. If a "depends on" isn't desired, what > about "imply" in the Kconfig? > Yes, but actually, using bit #0 is maybe a better alternative in any case. You can never dereference SP with bit #0 set, regardless of whether the address points to user or kernel space, and my concern about reloading sp from x29 doesn't really make sense, given that x29 is always assigned from sp right after pushing x29 and x30 in the function prologue, and sp only gets restored from x29 in the epilogue when there is a stack frame to begin with, in which case we add #1 to sp again before returning from the function. The other code gets a lot cleaner as well. So for the return we'll have ldp x29, x30, [sp], #nn >>add sp, sp, #0x1 ret and for the function call bl <foo> >>mov x30, sp >>bic sp, x30, #1 The restore sequence in entry.s:96 (which has no spare registers) gets much simpler as well: --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -95,6 +95,15 @@ alternative_else_nop_endif */ add sp, sp, x0 // sp' = sp + x0 sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp +#ifdef CONFIG_ARM64_ROP_SHIELD + tbnz x0, #0, 1f + .subsection 1 +1: sub x0, x0, #1 + sub sp, sp, #1 + b 2f + .previous +2: +#endif tbnz x0, #THREAD_SHIFT, 0f sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0 sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp
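Pulled out of the quoting for readability, the bit-0 sequences Ard describes amount to the following. The frame size and the callee label are placeholders, and the and with an inverted mask is the same operation Ard writes as "bic sp, x30, #1".

        .text
callee:                                 // any instrumented function with a stack frame
        stp     x29, x30, [sp, #-32]!
        mov     x29, sp
        // ... function body ...
        ldp     x29, x30, [sp], #32
        add     sp, sp, #0x1                    // set bit 0: SP unusable for loads and
                                                // stores until the caller clears it
        ret

call_site:                              // every instrumented call site
        bl      callee
        mov     x30, sp                         // x30 is free to clobber after the call
        and     sp, x30, #0xfffffffffffffffe    // clear bit 0 again; idempotent, so it
                                                // stays safe even when the callee was a
                                                // leaf that skipped the instrumentation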
On Mon, Aug 6, 2018 at 12:35 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 6 August 2018 at 20:49, Kees Cook <keescook@chromium.org> wrote: >> On Mon, Aug 6, 2018 at 10:45 AM, Robin Murphy <robin.murphy@arm.com> wrote: >>> I guess what I'm getting at is that if the protection mechanism is "always >>> return with SP outside TTBR1", there seems little point in going through the >>> motions if SP in TTBR0 could still be valid and allow an attack to succeed >>> anyway; this is basically just me working through a justification for saying >>> the proposed scheme needs "depends on ARM64_PAN || ARM64_SW_TTBR0_PAN", >>> making it that much uglier for v8.0 CPUs... >> >> I think anyone with v8.0 CPUs interested in this mitigation would also >> very much want PAN emulation. If a "depends on" isn't desired, what >> about "imply" in the Kconfig? >> > > Yes, but actually, using bit #0 is maybe a better alternative in any > case. You can never dereference SP with bit #0 set, regardless of > whether the address points to user or kernel space, and my concern > about reloading sp from x29 doesn't really make sense, given that x29 > is always assigned from sp right after pushing x29 and x30 in the > function prologue, and sp only gets restored from x29 in the epilogue > when there is a stack frame to begin with, in which case we add #1 to > sp again before returning from the function. Fair enough! :) > The other code gets a lot cleaner as well. > > So for the return we'll have > > ldp x29, x30, [sp], #nn >>>add sp, sp, #0x1 > ret > > and for the function call > > bl <foo> >>>mov x30, sp >>>bic sp, x30, #1 > > The restore sequence in entry.s:96 (which has no spare registers) gets > much simpler as well: > > --- a/arch/arm64/kernel/entry.S > +++ b/arch/arm64/kernel/entry.S > @@ -95,6 +95,15 @@ alternative_else_nop_endif > */ > add sp, sp, x0 // sp' = sp + x0 > sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp > +#ifdef CONFIG_ARM64_ROP_SHIELD > + tbnz x0, #0, 1f > + .subsection 1 > +1: sub x0, x0, #1 > + sub sp, sp, #1 > + b 2f > + .previous > +2: > +#endif > tbnz x0, #THREAD_SHIFT, 0f > sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0 > sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp I get slightly concerned about "add" vs "clear bit", but I don't see a real way to chain a lot of "add"s to get to avoid the unaligned access. Is "or" less efficient than "add"? -Kees
On 6 August 2018 at 21:50, Kees Cook <keescook@chromium.org> wrote: > On Mon, Aug 6, 2018 at 12:35 PM, Ard Biesheuvel > <ard.biesheuvel@linaro.org> wrote: >> On 6 August 2018 at 20:49, Kees Cook <keescook@chromium.org> wrote: >>> On Mon, Aug 6, 2018 at 10:45 AM, Robin Murphy <robin.murphy@arm.com> wrote: >>>> I guess what I'm getting at is that if the protection mechanism is "always >>>> return with SP outside TTBR1", there seems little point in going through the >>>> motions if SP in TTBR0 could still be valid and allow an attack to succeed >>>> anyway; this is basically just me working through a justification for saying >>>> the proposed scheme needs "depends on ARM64_PAN || ARM64_SW_TTBR0_PAN", >>>> making it that much uglier for v8.0 CPUs... >>> >>> I think anyone with v8.0 CPUs interested in this mitigation would also >>> very much want PAN emulation. If a "depends on" isn't desired, what >>> about "imply" in the Kconfig? >>> >> >> Yes, but actually, using bit #0 is maybe a better alternative in any >> case. You can never dereference SP with bit #0 set, regardless of >> whether the address points to user or kernel space, and my concern >> about reloading sp from x29 doesn't really make sense, given that x29 >> is always assigned from sp right after pushing x29 and x30 in the >> function prologue, and sp only gets restored from x29 in the epilogue >> when there is a stack frame to begin with, in which case we add #1 to >> sp again before returning from the function. > > Fair enough! :) > >> The other code gets a lot cleaner as well. >> >> So for the return we'll have >> >> ldp x29, x30, [sp], #nn >>>>add sp, sp, #0x1 >> ret >> >> and for the function call >> >> bl <foo> >>>>mov x30, sp >>>>bic sp, x30, #1 >> >> The restore sequence in entry.s:96 (which has no spare registers) gets >> much simpler as well: >> >> --- a/arch/arm64/kernel/entry.S >> +++ b/arch/arm64/kernel/entry.S >> @@ -95,6 +95,15 @@ alternative_else_nop_endif >> */ >> add sp, sp, x0 // sp' = sp + x0 >> sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp >> +#ifdef CONFIG_ARM64_ROP_SHIELD >> + tbnz x0, #0, 1f >> + .subsection 1 >> +1: sub x0, x0, #1 >> + sub sp, sp, #1 >> + b 2f >> + .previous >> +2: >> +#endif >> tbnz x0, #THREAD_SHIFT, 0f >> sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0 >> sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp > > I get slightly concerned about "add" vs "clear bit", but I don't see a > real way to chain a lot of "add"s to get to avoid the unaligned > access. Is "or" less efficient than "add"? > Yes. The stack pointer is special on arm64, and can only be used with a limited set of ALU instructions. So orring #1 would involve 'mov <reg>, sp ; orr sp, <reg>, #1' like in the 'bic' case above, which requires a scratch register as well.
I think the phrasing of "limit kernel attack surface against ROP attacks" is confusing and misleading. ROP does not describe a class of bugs, vulnerabilities or attacks against the kernel - it's just one of many code-reuse techniques that can be used by an attacker while exploiting a vulnerability. But that's kind of off-topic! I think what this thread is talking about is implementing extremely coarse-grained reverse-edge control-flow-integrity, in that a return can only return to the address following a legitimate call, but it can return to any of those. I suspect there's not much benefit to this, since (as far as I can see) the assumption is that an attacker has the means to direct flow of execution as far as taking complete control of the (el1) stack before executing any ROP payload. At that point, I think it's highly unlikely an attacker needs to chain gadgets through return instructions at all - I suspect there are a few places in the kernel where it is necessary to load the entire register context from a register that is not the stack pointer, and it would likely not be more than a minor inconvenience to an attacker to use these (and chaining through branch register) instructions instead of chaining through return instructions. I'd have to take a closer look at an arm64 kernel image to be sure though - I'll do that when I get a chance and update... Regards, Mark On Mon, 6 Aug 2018 at 19:28, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 6 August 2018 at 21:50, Kees Cook <keescook@chromium.org> wrote: > > On Mon, Aug 6, 2018 at 12:35 PM, Ard Biesheuvel > > <ard.biesheuvel@linaro.org> wrote: > >> On 6 August 2018 at 20:49, Kees Cook <keescook@chromium.org> wrote: > >>> On Mon, Aug 6, 2018 at 10:45 AM, Robin Murphy <robin.murphy@arm.com> > wrote: > >>>> I guess what I'm getting at is that if the protection mechanism is > "always > >>>> return with SP outside TTBR1", there seems little point in going > through the > >>>> motions if SP in TTBR0 could still be valid and allow an attack to > succeed > >>>> anyway; this is basically just me working through a justification for > saying > >>>> the proposed scheme needs "depends on ARM64_PAN || > ARM64_SW_TTBR0_PAN", > >>>> making it that much uglier for v8.0 CPUs... > >>> > >>> I think anyone with v8.0 CPUs interested in this mitigation would also > >>> very much want PAN emulation. If a "depends on" isn't desired, what > >>> about "imply" in the Kconfig? > >>> > >> > >> Yes, but actually, using bit #0 is maybe a better alternative in any > >> case. You can never dereference SP with bit #0 set, regardless of > >> whether the address points to user or kernel space, and my concern > >> about reloading sp from x29 doesn't really make sense, given that x29 > >> is always assigned from sp right after pushing x29 and x30 in the > >> function prologue, and sp only gets restored from x29 in the epilogue > >> when there is a stack frame to begin with, in which case we add #1 to > >> sp again before returning from the function. > > > > Fair enough! :) > > > >> The other code gets a lot cleaner as well. 
> >> > >> So for the return we'll have > >> > >> ldp x29, x30, [sp], #nn > >>>>add sp, sp, #0x1 > >> ret > >> > >> and for the function call > >> > >> bl <foo> > >>>>mov x30, sp > >>>>bic sp, x30, #1 > >> > >> The restore sequence in entry.s:96 (which has no spare registers) gets > >> much simpler as well: > >> > >> --- a/arch/arm64/kernel/entry.S > >> +++ b/arch/arm64/kernel/entry.S > >> @@ -95,6 +95,15 @@ alternative_else_nop_endif > >> */ > >> add sp, sp, x0 // sp' = sp + x0 > >> sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp > >> +#ifdef CONFIG_ARM64_ROP_SHIELD > >> + tbnz x0, #0, 1f > >> + .subsection 1 > >> +1: sub x0, x0, #1 > >> + sub sp, sp, #1 > >> + b 2f > >> + .previous > >> +2: > >> +#endif > >> tbnz x0, #THREAD_SHIFT, 0f > >> sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = > x0 > >> sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp > > > > I get slightly concerned about "add" vs "clear bit", but I don't see a > > real way to chain a lot of "add"s to get to avoid the unaligned > > access. Is "or" less efficient than "add"? > > > > Yes. The stack pointer is special on arm64, and can only be used with > a limited set of ALU instructions. So orring #1 would involve 'mov > <reg>, sp ; orr sp, <reg>, #1' like in the 'bic' case above, which > requires a scratch register as well.
On 7 August 2018 at 05:05, Mark Brand <markbrand@google.com> wrote: > I think the phrasing of "limit kernel attack surface against ROP attacks" is > confusing and misleading. ROP does not describe a class of bugs, > vulnerabilities or attacks against the kernel - it's just one of many > code-reuse techniques that can be used by an attacker while exploiting a > vulnerability. But that's kind of off-topic! > > I think what this thread is talking about is implementing extremely > coarse-grained reverse-edge control-flow-integrity, in that a return can > only return to the address following a legitimate call, but it can return to > any of those. > Indeed. Apologies for not mastering the lingo, but it is indeed about no longer being able to subvert function returns into jumping to arbitrary places in the code. > I suspect there's not much benefit to this, since (as far as I can see) the > assumption is that an attacker has the means to direct flow of execution as > far as taking complete control of the (el1) stack before executing any ROP > payload. > > At that point, I think it's highly unlikely an attacker needs to chain > gadgets through return instructions at all - I suspect there are a few > places in the kernel where it is necessary to load the entire register > context from a register that is not the stack pointer, and it would likely > not be more than a minor inconvenience to an attacker to use these (and > chaining through branch register) instructions instead of chaining through > return instructions. > > I'd have to take a closer look at an arm64 kernel image to be sure though - > I'll do that when I get a chance and update... > Thanks. Reloading all registers from an arbitrary offset register should occur rarely, no? Could we work around that? > On Mon, 6 Aug 2018 at 19:28, Ard Biesheuvel <ard.biesheuvel@linaro.org> > wrote: >> >> On 6 August 2018 at 21:50, Kees Cook <keescook@chromium.org> wrote: >> > On Mon, Aug 6, 2018 at 12:35 PM, Ard Biesheuvel >> > <ard.biesheuvel@linaro.org> wrote: >> >> On 6 August 2018 at 20:49, Kees Cook <keescook@chromium.org> wrote: >> >>> On Mon, Aug 6, 2018 at 10:45 AM, Robin Murphy <robin.murphy@arm.com> >> >>> wrote: >> >>>> I guess what I'm getting at is that if the protection mechanism is >> >>>> "always >> >>>> return with SP outside TTBR1", there seems little point in going >> >>>> through the >> >>>> motions if SP in TTBR0 could still be valid and allow an attack to >> >>>> succeed >> >>>> anyway; this is basically just me working through a justification for >> >>>> saying >> >>>> the proposed scheme needs "depends on ARM64_PAN || >> >>>> ARM64_SW_TTBR0_PAN", >> >>>> making it that much uglier for v8.0 CPUs... >> >>> >> >>> I think anyone with v8.0 CPUs interested in this mitigation would also >> >>> very much want PAN emulation. If a "depends on" isn't desired, what >> >>> about "imply" in the Kconfig? >> >>> >> >> >> >> Yes, but actually, using bit #0 is maybe a better alternative in any >> >> case. You can never dereference SP with bit #0 set, regardless of >> >> whether the address points to user or kernel space, and my concern >> >> about reloading sp from x29 doesn't really make sense, given that x29 >> >> is always assigned from sp right after pushing x29 and x30 in the >> >> function prologue, and sp only gets restored from x29 in the epilogue >> >> when there is a stack frame to begin with, in which case we add #1 to >> >> sp again before returning from the function. >> > >> > Fair enough! 
:) >> > >> >> The other code gets a lot cleaner as well. >> >> >> >> So for the return we'll have >> >> >> >> ldp x29, x30, [sp], #nn >> >>>>add sp, sp, #0x1 >> >> ret >> >> >> >> and for the function call >> >> >> >> bl <foo> >> >>>>mov x30, sp >> >>>>bic sp, x30, #1 >> >> >> >> The restore sequence in entry.s:96 (which has no spare registers) gets >> >> much simpler as well: >> >> >> >> --- a/arch/arm64/kernel/entry.S >> >> +++ b/arch/arm64/kernel/entry.S >> >> @@ -95,6 +95,15 @@ alternative_else_nop_endif >> >> */ >> >> add sp, sp, x0 // sp' = sp + x0 >> >> sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp >> >> +#ifdef CONFIG_ARM64_ROP_SHIELD >> >> + tbnz x0, #0, 1f >> >> + .subsection 1 >> >> +1: sub x0, x0, #1 >> >> + sub sp, sp, #1 >> >> + b 2f >> >> + .previous >> >> +2: >> >> +#endif >> >> tbnz x0, #THREAD_SHIFT, 0f >> >> sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = >> >> x0 >> >> sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = >> >> sp >> > >> > I get slightly concerned about "add" vs "clear bit", but I don't see a >> > real way to chain a lot of "add"s to get to avoid the unaligned >> > access. Is "or" less efficient than "add"? >> > >> >> Yes. The stack pointer is special on arm64, and can only be used with >> a limited set of ALU instructions. So orring #1 would involve 'mov >> <reg>, sp ; orr sp, <reg>, #1' like in the 'bic' case above, which >> requires a scratch register as well.
On Tue, Aug 7, 2018 at 2:22 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On 7 August 2018 at 05:05, Mark Brand <markbrand@google.com> wrote: > > I think the phrasing of "limit kernel attack surface against ROP attacks" is > > confusing and misleading. ROP does not describe a class of bugs, > > vulnerabilities or attacks against the kernel - it's just one of many > > code-reuse techniques that can be used by an attacker while exploiting a > > vulnerability. But that's kind of off-topic! > > > > I think what this thread is talking about is implementing extremely > > coarse-grained reverse-edge control-flow-integrity, in that a return can > > only return to the address following a legitimate call, but it can return to > > any of those. > > > > Indeed. Apologies for not mastering the lingo, but it is indeed about > no longer being able to subvert function returns into jumping to > arbitrary places in the code. > > > I suspect there's not much benefit to this, since (as far as I can see) the > > assumption is that an attacker has the means to direct flow of execution as > > far as taking complete control of the (el1) stack before executing any ROP > > payload. > > > > At that point, I think it's highly unlikely an attacker needs to chain > > gadgets through return instructions at all - I suspect there are a few > > places in the kernel where it is necessary to load the entire register > > context from a register that is not the stack pointer, and it would likely > > not be more than a minor inconvenience to an attacker to use these (and > > chaining through branch register) instructions instead of chaining through > > return instructions. > > > > I'd have to take a closer look at an arm64 kernel image to be sure though - > > I'll do that when I get a chance and update... > > > > Thanks. Reloading all registers from an arbitrary offset register > should occur rarely, no? Could we work around that? I forgot about the gmail-html-by-default... Hopefully everyone else can read the quotes though :-/. I took a look and have put together an example rop chain that doesn't use any return instructions that you could instrument, that will call an arbitrary kernel function with controlled parameters (at least x0 - x4, would have to probably mess with some alignment and add a repetition of the last gadget to get all register control. It assumes that the attacker has control over the memory pointed to by x0 at the point where they get control of pc, and that they know where that memory is located (but it would also work if they just controlled the memory pointed to by x0, and had another chunk of kernel memory they control at a known address. Seems like a pretty reasonable starting assumption, and I'm sure anyone with a little motivation could produce similar chains for other starting conditions, this just seemed the "most likely" reasonable conditions to me. 
There are two basic principles used here - (1) chaining through the mempool_free function, I found this really quickly when searching for useful gadgets based off x0 void mempool_free(void *element, mempool_t *pool) { unsigned long flags; if (unlikely(element == NULL)) return; /* snip */ smp_rmb(); /* snip */ if (unlikely(pool->curr_nr < pool->min_nr)) { spin_lock_irqsave(&pool->lock, flags); if (likely(pool->curr_nr < pool->min_nr)) { add_element(pool, element); spin_unlock_irqrestore(&pool->lock, flags); wake_up(&pool->wait); return; } spin_unlock_irqrestore(&pool->lock, flags); } pool->free(element, pool->pool_data); } Since the callsites for this function usually load the arguments through some registers, and the function to call gets pulled out of one of those arguments, it's easy to get a couple of registers loaded here and then the chain continue. (2) loading complete register state using kernel_exit macro. Since the kernel_exit macro actually loads spsr_el1 and elr_el1 from registers, I think that you can let the eret return to anywhere in el1 without dropping to el0, since the same handler is used for "exiting the kernel" when a hardware interrupt interrupts the kernel itself. I didn't fill out the necessary register values in the chain below, since I don't anyway have a device around to test this on right now. I'm not sure that you could really robustly protect this eret; I suppose that you could try and somehow validate the saved register state, but given that it would be happening on every exception return, I suspect it would be expensive. 0:dispatch_io + yy (mempool_free gadget, appears in plenty of other places.) ffffff8008a340d4 084c41a9 ldp x8, x19, [x0, #0x10] ffffff8008a340d8 190040f9 ldr x25, [x0] ffffff8008a340dc 1a1040f9 ldr x26, [x0, #0x20] ffffff8008a340e0 010140f9 ldr x1, [x8] ffffff8008a340e4 ed80dd97 bl mempool_free mempool_free: ffffff8008194498 f44fbea9 stp x20, x19, [sp, #-0x20 {__saved_x20} {__saved_x19}]! ffffff800819449c fd7b01a9 stp x29, x30, [sp, #0x10 {__saved_x29} {__saved_x30}] ffffff80081944a0 fd430091 add x29, sp, #0x10 {__saved_x29} ffffff80081944a4 f30301aa mov x19, x1 ffffff80081944a8 f40300aa mov x20, x0 ffffff80081944ac 340100b4 cbz x20, 0xffffff80081944d0 ffffff80081944b0 bf3903d5 dmb ishld ffffff80081944b4 68a64029 ldp w8, w9, [x19, #0x4] ffffff80081944b8 3f01086b cmp w9, w8 ffffff80081944bc 0b010054 b.lt 0xffffff80081944dc ffffff80081944c0 681640f9 ldr x8, [x19, #0x28] ffffff80081944c4 610e40f9 ldr x1, [x19, #0x18] ffffff80081944c8 e00314aa mov x0, x20 ffffff80081944cc 00013fd6 blr x8 ffffff80081944d0 fd7b41a9 ldp x29, x30, [sp, #0x10 {__saved_x29} {__saved_x30}] ffffff80081944d4 f44fc2a8 ldp x20, x19, [sp {__saved_x20} {__saved_x19}], #0x20 ffffff80081944d8 c0035fd6 ret ffffff8008a340e8 e00319aa mov x0, x25 ffffff8008a340ec e1031aaa mov x1, x26 ffffff8008a340f0 60023fd6 blr x19 1:el1_irq + xx - (x1, x26) -> sp control ffffff800808314c 5f030091 mov sp, x26 ffffff8008083150 fd4fbfa9 stp x29, x19, [sp, #-0x10]! {__saved_x0} ffffff8008083154 fd030091 mov x29, sp ffffff8008083158 20003fd6 blr x1 2:ipc_log_extract + xx (sp, x19) -> survival ffffff800817c35c e0c30091 add x0, sp, #0x30 {var_170} ffffff800817c360 e1430091 add x1, sp, #0x10 {var_190} ffffff800817c364 60023fd6 blr x19 3:dispatch_io + xx (mempool_free gadget, appears in plenty of other places.) 
ffffff8008a342cc 084c41a9 ldp x8, x19, [x0, #0x10]
ffffff8008a342d0 140040f9 ldr x20, [x0]
ffffff8008a342d4 151040f9 ldr x21, [x0, #0x20]
ffffff8008a342d8 010140f9 ldr x1, [x8]
ffffff8008a342dc 6f80dd97 bl mempool_free

mempool_free:
ffffff8008194498 f44fbea9 stp x20, x19, [sp, #-0x20 {__saved_x20} {__saved_x19}]!
ffffff800819449c fd7b01a9 stp x29, x30, [sp, #0x10 {__saved_x29} {__saved_x30}]
ffffff80081944a0 fd430091 add x29, sp, #0x10 {__saved_x29}
ffffff80081944a4 f30301aa mov x19, x1
ffffff80081944a8 f40300aa mov x20, x0
ffffff80081944ac 340100b4 cbz x20, 0xffffff80081944d0

ffffff80081944b0 bf3903d5 dmb ishld
ffffff80081944b4 68a64029 ldp w8, w9, [x19, #0x4]
ffffff80081944b8 3f01086b cmp w9, w8
ffffff80081944bc 0b010054 b.lt 0xffffff80081944dc

ffffff80081944c0 681640f9 ldr x8, [x19, #0x28]
ffffff80081944c4 610e40f9 ldr x1, [x19, #0x18]
ffffff80081944c8 e00314aa mov x0, x20
ffffff80081944cc 00013fd6 blr x8

ffffff80081944d0 fd7b41a9 ldp x29, x30, [sp, #0x10 {__saved_x29} {__saved_x30}]
ffffff80081944d4 f44fc2a8 ldp x20, x19, [sp {__saved_x20} {__saved_x19}], #0x20
ffffff80081944d8 c0035fd6 ret

ffffff8008a342e0 e00314aa mov x0, x20
ffffff8008a342e4 e10315aa mov x1, x21
ffffff8008a342e8 60023fd6 blr x19

4:bus_sort_breadthfirst + xx - (x26)
ffffff8008683cc8 561740f9 ldr x22, [x26, #0x28]
ffffff8008683ccc e00315aa mov x0, x21
ffffff8008683cd0 e10316aa mov x1, x22
ffffff8008683cd4 80023fd6 blr x20

5:kernel_exit (macro) - (x21, x22, sp) -> full register control & pc control
ffffff8008082f64 354018d5 msr elr_el1, x21
ffffff8008082f68 164018d5 msr spsr_el1, x22
ffffff8008082f6c e00740a9 ldp x0, x1, [sp {var_130} {var_128}]
ffffff8008082f70 e20f41a9 ldp x2, x3, [sp, #0x10 {var_120} {var_118}]
ffffff8008082f74 e41742a9 ldp x4, x5, [sp, #0x20 {var_110} {var_108}]
ffffff8008082f78 e61f43a9 ldp x6, x7, [sp, #0x30 {var_100} {var_f8}]
ffffff8008082f7c e82744a9 ldp x8, x9, [sp, #0x40 {var_f0} {var_e8}]
ffffff8008082f80 ea2f45a9 ldp x10, x11, [sp, #0x50 {var_e0} {var_d8}]
ffffff8008082f84 ec3746a9 ldp x12, x13, [sp, #0x60 {var_d0} {var_c8}]
ffffff8008082f88 ee3f47a9 ldp x14, x15, [sp, #0x70 {var_c0} {var_b8}]
ffffff8008082f8c f04748a9 ldp x16, x17, [sp, #0x80 {var_b0} {var_a8}]
ffffff8008082f90 f24f49a9 ldp x18, x19, [sp, #0x90 {var_a0} {var_98}]
ffffff8008082f94 f4574aa9 ldp x20, x21, [sp, #0xa0 {var_90} {var_88}]
ffffff8008082f98 f65f4ba9 ldp x22, x23, [sp, #0xb0 {var_80} {var_78}]
ffffff8008082f9c f8674ca9 ldp x24, x25, [sp, #0xc0 {var_70} {var_68}]
ffffff8008082fa0 fa6f4da9 ldp x26, x27, [sp, #0xd0 {var_60} {var_58}]
ffffff8008082fa4 fc774ea9 ldp x28, x29, [sp, #0xe0 {var_50} {var_48}]
ffffff8008082fa8 fe7b40f9 ldr x30, [sp, #0xf0 {var_40}]
ffffff8008082fac ffc30491 add sp, sp, #0x130
ffffff8008082fb0 e0039fd6 eret

ptr = 0000414100000000 = initial x0

0000: 2525252525252525 ; (0:40d8) x25
0008: 0000414100000030 ; (0:40e0) x1
0010: 0000414100000000 ; (0:40d4) x8
0018: ffffff8008a342cc ; (0:40d4) x19 -> branch target (2:c364)
0020: 0000414100000070 ; (0:40dc) x26 -> sp (1:314c)
0028:
0030: 8888888899999999 ; (0:44b4) w8, w9
0038:
0040:
0048: ffffff800817c35c ; (0:44c4) x1 -> branch target (1:3158)
0050:
0058: ffffff800808314c ; (0:44c0) x8 -> branch target (0:44c4)
0060: xxxxxxxxxxxxxxxx ; saved x29 <-- sp@(1:3154)
0068: xxxxxxxxxxxxxxxx ; saved x19
0070: ; <-- sp@(1:314c), (5:2f64)
0078:
0080:
0088:
0090:
0098: 2222222222222222 ; (4:3cc8) x22 -> spsr_el1
00a0: ffffff8008082f64 ; (3:42d0) x20 -> branch target (4:3cd4) <-- x0@(2:c35c)
00a8: 00004141000000d0 ; (3:42d8) x1
00b0: 00004141000000a0 ; (3:42cc) x8
00b8: 1919191919191919 ; (3:42cc) x19
00c0: 2121212121212121 ; (3:42d4) x21 -> elr_el1
00c8:
00d0: 8888888899999999 ; (3:44b4) w8, w9
00d8:
00e0:
00e8: 1111111111111111 ; (3:44c4) x1
00f0:
00f8: ffffff800808314c ; (3:44c0) x8 -> branch target (3:44c4)

>
> > On Mon, 6 Aug 2018 at 19:28, Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > wrote:
> >>
> >> On 6 August 2018 at 21:50, Kees Cook <keescook@chromium.org> wrote:
> >> > On Mon, Aug 6, 2018 at 12:35 PM, Ard Biesheuvel
> >> > <ard.biesheuvel@linaro.org> wrote:
> >> >> On 6 August 2018 at 20:49, Kees Cook <keescook@chromium.org> wrote:
> >> >>> On Mon, Aug 6, 2018 at 10:45 AM, Robin Murphy <robin.murphy@arm.com>
> >> >>> wrote:
> >> >>>> I guess what I'm getting at is that if the protection mechanism is
> >> >>>> "always return with SP outside TTBR1", there seems little point in going
> >> >>>> through the motions if SP in TTBR0 could still be valid and allow an
> >> >>>> attack to succeed anyway; this is basically just me working through a
> >> >>>> justification for saying the proposed scheme needs "depends on
> >> >>>> ARM64_PAN || ARM64_SW_TTBR0_PAN", making it that much uglier for
> >> >>>> v8.0 CPUs...
> >> >>>
> >> >>> I think anyone with v8.0 CPUs interested in this mitigation would also
> >> >>> very much want PAN emulation. If a "depends on" isn't desired, what
> >> >>> about "imply" in the Kconfig?
> >> >>>
> >> >>
> >> >> Yes, but actually, using bit #0 is maybe a better alternative in any
> >> >> case. You can never dereference SP with bit #0 set, regardless of
> >> >> whether the address points to user or kernel space, and my concern
> >> >> about reloading sp from x29 doesn't really make sense, given that x29
> >> >> is always assigned from sp right after pushing x29 and x30 in the
> >> >> function prologue, and sp only gets restored from x29 in the epilogue
> >> >> when there is a stack frame to begin with, in which case we add #1 to
> >> >> sp again before returning from the function.
> >> >
> >> > Fair enough! :)
> >> >
> >> >> The other code gets a lot cleaner as well.
> >> >>
> >> >> So for the return we'll have
> >> >>
> >> >> ldp x29, x30, [sp], #nn
> >> >>>>add sp, sp, #0x1
> >> >> ret
> >> >>
> >> >> and for the function call
> >> >>
> >> >> bl <foo>
> >> >>>>mov x30, sp
> >> >>>>bic sp, x30, #1
> >> >>
> >> >> The restore sequence in entry.s:96 (which has no spare registers) gets
> >> >> much simpler as well:
> >> >>
> >> >> --- a/arch/arm64/kernel/entry.S
> >> >> +++ b/arch/arm64/kernel/entry.S
> >> >> @@ -95,6 +95,15 @@ alternative_else_nop_endif
> >> >>  */
> >> >>  add sp, sp, x0 // sp' = sp + x0
> >> >>  sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp
> >> >> +#ifdef CONFIG_ARM64_ROP_SHIELD
> >> >> + tbnz x0, #0, 1f
> >> >> + .subsection 1
> >> >> +1: sub x0, x0, #1
> >> >> + sub sp, sp, #1
> >> >> + b 2f
> >> >> + .previous
> >> >> +2:
> >> >> +#endif
> >> >>  tbnz x0, #THREAD_SHIFT, 0f
> >> >>  sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0
> >> >>  sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp
> >> >
> >> > I get slightly concerned about "add" vs "clear bit", but I don't see a
> >> > real way to chain a lot of "add"s to get to avoid the unaligned
> >> > access. Is "or" less efficient than "add"?
> >> >
> >>
> >> Yes. The stack pointer is special on arm64, and can only be used with
> >> a limited set of ALU instructions. So orring #1 would involve 'mov
> >> <reg>, sp ; orr sp, <reg>, #1' like in the 'bic' case above, which
> >> requires a scratch register as well.
On Wed, Aug 8, 2018 at 9:09 AM, Mark Brand <markbrand@google.com> wrote:
> (1) chaining through the mempool_free function, I found this really
> quickly when searching for useful gadgets based off x0
>
> void mempool_free(void *element, mempool_t *pool)
> {
>         unsigned long flags;
>
>         if (unlikely(element == NULL))
>                 return;
>
>         /* snip */
>         smp_rmb();
>
>         /* snip */
>         if (unlikely(pool->curr_nr < pool->min_nr)) {
>                 spin_lock_irqsave(&pool->lock, flags);
>                 if (likely(pool->curr_nr < pool->min_nr)) {
>                         add_element(pool, element);
>                         spin_unlock_irqrestore(&pool->lock, flags);
>                         wake_up(&pool->wait);
>                         return;
>                 }
>                 spin_unlock_irqrestore(&pool->lock, flags);
>         }
>
>         pool->free(element, pool->pool_data);
> }
>
> Since the callsites for this function usually load the arguments
> through some registers, and the function to call gets pulled out of
> one of those arguments, it's easy to get a couple of registers loaded
> here and then the chain continue.
>
> (2) loading complete register state using kernel_exit macro.
>
> Since the kernel_exit macro actually loads spsr_el1 and elr_el1 from
> registers, I think that you can let the eret return to anywhere in el1
> without dropping to el0, since the same handler is used for "exiting
> the kernel" when a hardware interrupt interrupts the kernel itself. I
> didn't fill out the necessary register values in the chain below,
> since I don't anyway have a device around to test this on right now.
>
> I'm not sure that you could really robustly protect this eret; I
> suppose that you could try and somehow validate the saved register
> state, but given that it would be happening on every exception return,
> I suspect it would be expensive.

This is a wonderful example, thanks! Yeah, Call-Oriented Programming? :P

mempool_free() is quite a nice gadget.

-Kees
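The property Kees is pointing at is worth spelling out: in Mark's chain,
mempool_free() never hands control to the next gadget through its own
(instrumentable) ret. Every hop is a bl/blr through a pointer loaded from
attacker-controlled memory, with mempool_free() acting as a generic "call
this pointer with these two arguments" primitive. A minimal userspace model
of that shape, using made-up names and a simplified stand-in rather than the
real mempool_t layout, looks like this:

	#include <stdio.h>

	struct fake_pool {
		int min_nr;
		int curr_nr;                              /* keep curr_nr >= min_nr so the
		                                             refill/locking path is skipped */
		void *pool_data;                          /* becomes the second argument    */
		void (*free)(void *element, void *data);  /* attacker-chosen call target    */
	};

	/* Shape of mempool_free(): check the element, check the pool counters,
	 * then tail-call a function pointer taken out of the pool object. */
	static void fake_mempool_free(void *element, struct fake_pool *pool)
	{
		if (element == NULL)
			return;
		if (pool->curr_nr < pool->min_nr)
			return;                           /* locking/refill path elided here */
		pool->free(element, pool->pool_data);     /* the indirect call the chain rides */
	}

	static void next_gadget(void *element, void *data)
	{
		printf("called with %p and %p\n", element, data);
	}

	int main(void)
	{
		struct fake_pool pool = {
			.min_nr    = 0,
			.curr_nr   = 1,
			.pool_data = (void *)0xbbbb,      /* controlled argument #2 */
			.free      = next_gadget,         /* "branch target"        */
		};

		/* call an arbitrary function with two controlled arguments */
		fake_mempool_free((void *)0xaaaa, &pool);
		return 0;
	}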
On Wed 8 Aug 2018 at 19:09, Mark Brand <markbrand@google.com> wrote:
> On Tue, Aug 7, 2018 at 2:22 AM Ard Biesheuvel <ard.biesheuvel@linaro.org>
> wrote:
> >
> > On 7 August 2018 at 05:05, Mark Brand <markbrand@google.com> wrote:
> > > I think the phrasing of "limit kernel attack surface against ROP attacks"
> > > is confusing and misleading. ROP does not describe a class of bugs,
> > > vulnerabilities or attacks against the kernel - it's just one of many
> > > code-reuse techniques that can be used by an attacker while exploiting a
> > > vulnerability. But that's kind of off-topic!
> > >
> > > I think what this thread is talking about is implementing extremely
> > > coarse-grained reverse-edge control-flow-integrity, in that a return can
> > > only return to the address following a legitimate call, but it can return
> > > to any of those.
> > >
> >
> > Indeed. Apologies for not mastering the lingo, but it is indeed about
> > no longer being able to subvert function returns into jumping to
> > arbitrary places in the code.
> >
> > > I suspect there's not much benefit to this, since (as far as I can see) the
> > > assumption is that an attacker has the means to direct flow of execution as
> > > far as taking complete control of the (el1) stack before executing any ROP
> > > payload.
> > >
> > > At that point, I think it's highly unlikely an attacker needs to chain
> > > gadgets through return instructions at all - I suspect there are a few
> > > places in the kernel where it is necessary to load the entire register
> > > context from a register that is not the stack pointer, and it would likely
> > > not be more than a minor inconvenience to an attacker to use these (and
> > > chaining through branch register) instructions instead of chaining through
> > > return instructions.
> > >
> > > I'd have to take a closer look at an arm64 kernel image to be sure though -
> > > I'll do that when I get a chance and update...
> > >
> >
> > Thanks. Reloading all registers from an arbitrary offset register
> > should occur rarely, no? Could we work around that?
>
> I forgot about the gmail-html-by-default... Hopefully everyone else
> can read the quotes though :-/.
>
> I took a look and have put together an example rop chain that doesn't
> use any return instructions that you could instrument, that will call
> an arbitrary kernel function with controlled parameters (at least x0 -
> x4, would have to probably mess with some alignment and add a
> repetition of the last gadget to get all register control. It assumes
> that the attacker has control over the memory pointed to by x0 at the
> point where they get control of pc, and that they know where that
> memory is located (but it would also work if they just controlled the
> memory pointed to by x0, and had another chunk of kernel memory they
> control at a known address. Seems like a pretty reasonable starting
> assumption, and I'm sure anyone with a little motivation could produce
> similar chains for other starting conditions, this just seemed the
> "most likely" reasonable conditions to me.
>

Thanks a lot for taking the time to put together this excellent example.
I will study it in more detail after I return from my vacation.

Ard.
> There are two basic principles used here - > > (1) chaining through the mempool_free function, I found this really > quickly when searching for useful gadgets based off x0 > > void mempool_free(void *element, mempool_t *pool) > { > unsigned long flags; > > if (unlikely(element == NULL)) > return; > > /* snip */ > smp_rmb(); > > /* snip */ > if (unlikely(pool->curr_nr < pool->min_nr)) { > spin_lock_irqsave(&pool->lock, flags); > if (likely(pool->curr_nr < pool->min_nr)) { > add_element(pool, element); > spin_unlock_irqrestore(&pool->lock, flags); > wake_up(&pool->wait); > return; > } > spin_unlock_irqrestore(&pool->lock, flags); > } > > pool->free(element, pool->pool_data); > } > > Since the callsites for this function usually load the arguments > through some registers, and the function to call gets pulled out of > one of those arguments, it's easy to get a couple of registers loaded > here and then the chain continue. > > (2) loading complete register state using kernel_exit macro. > > Since the kernel_exit macro actually loads spsr_el1 and elr_el1 from > registers, I think that you can let the eret return to anywhere in el1 > without dropping to el0, since the same handler is used for "exiting > the kernel" when a hardware interrupt interrupts the kernel itself. I > didn't fill out the necessary register values in the chain below, > since I don't anyway have a device around to test this on right now. > > I'm not sure that you could really robustly protect this eret; I > suppose that you could try and somehow validate the saved register > state, but given that it would be happening on every exception return, > I suspect it would be expensive. > > 0:dispatch_io + yy (mempool_free gadget, appears in plenty of other > places.) > ffffff8008a340d4 084c41a9 ldp x8, x19, [x0, #0x10] > ffffff8008a340d8 190040f9 ldr x25, [x0] > ffffff8008a340dc 1a1040f9 ldr x26, [x0, #0x20] > ffffff8008a340e0 010140f9 ldr x1, [x8] > ffffff8008a340e4 ed80dd97 bl mempool_free > > mempool_free: > ffffff8008194498 f44fbea9 stp x20, x19, [sp, #-0x20 > {__saved_x20} {__saved_x19}]! > ffffff800819449c fd7b01a9 stp x29, x30, [sp, #0x10 > {__saved_x29} {__saved_x30}] > ffffff80081944a0 fd430091 add x29, sp, #0x10 {__saved_x29} > ffffff80081944a4 f30301aa mov x19, x1 > ffffff80081944a8 f40300aa mov x20, x0 > ffffff80081944ac 340100b4 cbz x20, 0xffffff80081944d0 > > ffffff80081944b0 bf3903d5 dmb ishld > ffffff80081944b4 68a64029 ldp w8, w9, [x19, #0x4] > ffffff80081944b8 3f01086b cmp w9, w8 > ffffff80081944bc 0b010054 b.lt 0xffffff80081944dc > > ffffff80081944c0 681640f9 ldr x8, [x19, #0x28] > ffffff80081944c4 610e40f9 ldr x1, [x19, #0x18] > ffffff80081944c8 e00314aa mov x0, x20 > ffffff80081944cc 00013fd6 blr x8 > > ffffff80081944d0 fd7b41a9 ldp x29, x30, [sp, #0x10 > {__saved_x29} {__saved_x30}] > ffffff80081944d4 f44fc2a8 ldp x20, x19, [sp {__saved_x20} > {__saved_x19}], #0x20 > ffffff80081944d8 c0035fd6 ret > > ffffff8008a340e8 e00319aa mov x0, x25 > ffffff8008a340ec e1031aaa mov x1, x26 > ffffff8008a340f0 60023fd6 blr x19 > > 1:el1_irq + xx - (x1, x26) -> sp control > ffffff800808314c 5f030091 mov sp, x26 > ffffff8008083150 fd4fbfa9 stp x29, x19, [sp, #-0x10]! 
{__saved_x0} > ffffff8008083154 fd030091 mov x29, sp > ffffff8008083158 20003fd6 blr x1 > > 2:ipc_log_extract + xx (sp, x19) -> survival > ffffff800817c35c e0c30091 add x0, sp, #0x30 {var_170} > ffffff800817c360 e1430091 add x1, sp, #0x10 {var_190} > ffffff800817c364 60023fd6 blr x19 > > 3:dispatch_io + xx (mempool_free gadget, appears in plenty of other > places.) > ffffff8008a342cc 084c41a9 ldp x8, x19, [x0, #0x10] > ffffff8008a342d0 140040f9 ldr x20, [x0] > ffffff8008a342d4 151040f9 ldr x21, [x0, #0x20] > ffffff8008a342d8 010140f9 ldr x1, [x8] > ffffff8008a342dc 6f80dd97 bl mempool_free > > mempool_free: > ffffff8008194498 f44fbea9 stp x20, x19, [sp, #-0x20 > {__saved_x20} {__saved_x19}]! > ffffff800819449c fd7b01a9 stp x29, x30, [sp, #0x10 > {__saved_x29} {__saved_x30}] > ffffff80081944a0 fd430091 add x29, sp, #0x10 {__saved_x29} > ffffff80081944a4 f30301aa mov x19, x1 > ffffff80081944a8 f40300aa mov x20, x0 > ffffff80081944ac 340100b4 cbz x20, 0xffffff80081944d0 > > ffffff80081944b0 bf3903d5 dmb ishld > ffffff80081944b4 68a64029 ldp w8, w9, [x19, #0x4] > ffffff80081944b8 3f01086b cmp w9, w8 > ffffff80081944bc 0b010054 b.lt 0xffffff80081944dc > > ffffff80081944c0 681640f9 ldr x8, [x19, #0x28] > ffffff80081944c4 610e40f9 ldr x1, [x19, #0x18] > ffffff80081944c8 e00314aa mov x0, x20 > ffffff80081944cc 00013fd6 blr x8 > > ffffff80081944d0 fd7b41a9 ldp x29, x30, [sp, #0x10 > {__saved_x29} {__saved_x30}] > ffffff80081944d4 f44fc2a8 ldp x20, x19, [sp {__saved_x20} > {__saved_x19}], #0x20 > ffffff80081944d8 c0035fd6 ret > > ffffff8008a342e0 e00314aa mov x0, x20 > ffffff8008a342e4 e10315aa mov x1, x21 > ffffff8008a342e8 60023fd6 blr x19 > > 4:bus_sort_breadthfirst + xx - (x26) > ffffff8008683cc8 561740f9 ldr x22, [x26, #0x28] > ffffff8008683ccc e00315aa mov x0, x21 > ffffff8008683cd0 e10316aa mov x1, x22 > ffffff8008683cd4 80023fd6 blr x20 > > 5:kernel_exit (macro) - (x21, x22, sp) -> full register control & pc > control > ffffff8008082f64 354018d5 msr elr_el1, x21 > ffffff8008082f68 164018d5 msr spsr_el1, x22 > ffffff8008082f6c e00740a9 ldp x0, x1, [sp {var_130} {var_128}] > ffffff8008082f70 e20f41a9 ldp x2, x3, [sp, #0x10 {var_120} > {var_118}] > ffffff8008082f74 e41742a9 ldp x4, x5, [sp, #0x20 {var_110} > {var_108}] > ffffff8008082f78 e61f43a9 ldp x6, x7, [sp, #0x30 {var_100} {var_f8}] > ffffff8008082f7c e82744a9 ldp x8, x9, [sp, #0x40 {var_f0} {var_e8}] > ffffff8008082f80 ea2f45a9 ldp x10, x11, [sp, #0x50 {var_e0} > {var_d8}] > ffffff8008082f84 ec3746a9 ldp x12, x13, [sp, #0x60 {var_d0} > {var_c8}] > ffffff8008082f88 ee3f47a9 ldp x14, x15, [sp, #0x70 {var_c0} > {var_b8}] > ffffff8008082f8c f04748a9 ldp x16, x17, [sp, #0x80 {var_b0} > {var_a8}] > ffffff8008082f90 f24f49a9 ldp x18, x19, [sp, #0x90 {var_a0} > {var_98}] > ffffff8008082f94 f4574aa9 ldp x20, x21, [sp, #0xa0 {var_90} > {var_88}] > ffffff8008082f98 f65f4ba9 ldp x22, x23, [sp, #0xb0 {var_80} > {var_78}] > ffffff8008082f9c f8674ca9 ldp x24, x25, [sp, #0xc0 {var_70} > {var_68}] > ffffff8008082fa0 fa6f4da9 ldp x26, x27, [sp, #0xd0 {var_60} > {var_58}] > ffffff8008082fa4 fc774ea9 ldp x28, x29, [sp, #0xe0 {var_50} > {var_48}] > ffffff8008082fa8 fe7b40f9 ldr x30, [sp, #0xf0 {var_40}] > ffffff8008082fac ffc30491 add sp, sp, #0x130 > ffffff8008082fb0 e0039fd6 eret > > > ptr = 0000414100000000 = initial x0 > > 0000: 2525252525252525 ; (0:40d8) x25 > 0008: 0000414100000030 ; (0:40e0) x1 > 0010: 0000414100000000 ; (0:40d4) x8 > 0018: ffffff8008a342cc ; (0:40d4) x19 -> branch target (2:c364) > 0020: 0000414100000070 ; (0:40dc) x26 -> sp 
(1:314c) > 0028: > 0030: 8888888899999999 ; (0:44b4) w8, w9 > 0038: > 0040: > 0048: ffffff800817c35c ; (0:44c4) x1 -> branch target (1:3158) > 0050: > 0058: ffffff800808314c ; (0:44c0) x8 -> branch target (0:44c4) > 0060: xxxxxxxxxxxxxxxx ; saved x29 <-- sp@ > (1:3154) > 0068: xxxxxxxxxxxxxxxx ; saved x19 > 0070: ; <-- > sp@(1:314c), (5:2f64) > 0078: > 0080: > 0088: > 0090: > 0098: 2222222222222222 ; (4:3cc8) x22 -> spsr_el1 > 00a0: ffffff8008082f64 ; (3:42d0) x20 -> branch target (4:3cd4) <-- x0@ > (2:c35c) > 00a8: 00004141000000d0 ; (3:42d8) x1 > 00b0: 00004141000000a0 ; (3:42cc) x8 > 00b8: 1919191919191919 ; (3:42cc) x19 > 00c0: 2121212121212121 ; (3:42d4) x21 -> elr_el1 > 00c8: > 00d0: 8888888899999999 ; (3:44b4) w8, w9 > 00d8: > 00e0: > 00e8: 1111111111111111 ; (3:44c4) x1 > 00f0: > 00f8: ffffff800808314c ; (3:44c0) x8 -> branch target (3:44c4) > > > > > On Mon, 6 Aug 2018 at 19:28, Ard Biesheuvel <ard.biesheuvel@linaro.org > > > > > wrote: > > >> > > >> On 6 August 2018 at 21:50, Kees Cook <keescook@chromium.org> wrote: > > >> > On Mon, Aug 6, 2018 at 12:35 PM, Ard Biesheuvel > > >> > <ard.biesheuvel@linaro.org> wrote: > > >> >> On 6 August 2018 at 20:49, Kees Cook <keescook@chromium.org> > wrote: > > >> >>> On Mon, Aug 6, 2018 at 10:45 AM, Robin Murphy < > robin.murphy@arm.com> > > >> >>> wrote: > > >> >>>> I guess what I'm getting at is that if the protection mechanism > is > > >> >>>> "always > > >> >>>> return with SP outside TTBR1", there seems little point in going > > >> >>>> through the > > >> >>>> motions if SP in TTBR0 could still be valid and allow an attack > to > > >> >>>> succeed > > >> >>>> anyway; this is basically just me working through a > justification for > > >> >>>> saying > > >> >>>> the proposed scheme needs "depends on ARM64_PAN || > > >> >>>> ARM64_SW_TTBR0_PAN", > > >> >>>> making it that much uglier for v8.0 CPUs... > > >> >>> > > >> >>> I think anyone with v8.0 CPUs interested in this mitigation would > also > > >> >>> very much want PAN emulation. If a "depends on" isn't desired, > what > > >> >>> about "imply" in the Kconfig? > > >> >>> > > >> >> > > >> >> Yes, but actually, using bit #0 is maybe a better alternative in > any > > >> >> case. You can never dereference SP with bit #0 set, regardless of > > >> >> whether the address points to user or kernel space, and my concern > > >> >> about reloading sp from x29 doesn't really make sense, given that > x29 > > >> >> is always assigned from sp right after pushing x29 and x30 in the > > >> >> function prologue, and sp only gets restored from x29 in the > epilogue > > >> >> when there is a stack frame to begin with, in which case we add #1 > to > > >> >> sp again before returning from the function. > > >> > > > >> > Fair enough! :) > > >> > > > >> >> The other code gets a lot cleaner as well. 
> > >> >>
> > >> >> So for the return we'll have
> > >> >>
> > >> >> ldp x29, x30, [sp], #nn
> > >> >>>>add sp, sp, #0x1
> > >> >> ret
> > >> >>
> > >> >> and for the function call
> > >> >>
> > >> >> bl <foo>
> > >> >>>>mov x30, sp
> > >> >>>>bic sp, x30, #1
> > >> >>
> > >> >> The restore sequence in entry.s:96 (which has no spare registers) gets
> > >> >> much simpler as well:
> > >> >>
> > >> >> --- a/arch/arm64/kernel/entry.S
> > >> >> +++ b/arch/arm64/kernel/entry.S
> > >> >> @@ -95,6 +95,15 @@ alternative_else_nop_endif
> > >> >>  */
> > >> >>  add sp, sp, x0 // sp' = sp + x0
> > >> >>  sub x0, sp, x0 // x0' = sp' - x0 = (sp + x0) - x0 = sp
> > >> >> +#ifdef CONFIG_ARM64_ROP_SHIELD
> > >> >> + tbnz x0, #0, 1f
> > >> >> + .subsection 1
> > >> >> +1: sub x0, x0, #1
> > >> >> + sub sp, sp, #1
> > >> >> + b 2f
> > >> >> + .previous
> > >> >> +2:
> > >> >> +#endif
> > >> >>  tbnz x0, #THREAD_SHIFT, 0f
> > >> >>  sub x0, sp, x0 // x0'' = sp' - x0' = (sp + x0) - sp = x0
> > >> >>  sub sp, sp, x0 // sp'' = sp' - x0 = (sp + x0) - x0 = sp
> > >> >
> > >> > I get slightly concerned about "add" vs "clear bit", but I don't see a
> > >> > real way to chain a lot of "add"s to get to avoid the unaligned
> > >> > access. Is "or" less efficient than "add"?
> > >> >
> > >>
> > >> Yes. The stack pointer is special on arm64, and can only be used with
> > >> a limited set of ALU instructions. So orring #1 would involve 'mov
> > >> <reg>, sp ; orr sp, <reg>, #1' like in the 'bic' case above, which
> > >> requires a scratch register as well.
On 08/02/2018 06:21 AM, Ard Biesheuvel wrote:
> This is a proof of concept I cooked up, primarily to trigger a discussion
> about whether there is a point to doing anything like this, and if there
> is, what the pitfalls are. Also, while I am not aware of any similar
> implementations, the idea is so simple that I would be surprised if nobody
> else thought of the same thing way before I did.
>
> The idea is that we can significantly limit the kernel's attack surface
> for ROP based attacks by clearing the stack pointer's sign bit before
> returning from a function, and setting it again right after proceeding
> from the [expected] return address. This should make it much more difficult
> to return to arbitrary gadgets, given that they rely on being chained to
> the next via a return address popped off the stack, and this is difficult
> when the stack pointer is invalid.
>
> Of course, 4 additional instructions per function return is not exactly
> for free, but they are just movs and adds, and leaf functions are
> disregarded unless they allocate a stack frame (this comes for free
> because simple_return insns are disregarded by the plugin)
>
> Please shoot, preferably with better ideas ...
>
> Ard Biesheuvel (3):
>   arm64: use wrapper macro for bl/blx instructions from asm code
>   gcc: plugins: add ROP shield plugin for arm64
>   arm64: enable ROP protection by clearing SP bit #55 across function
>     returns
>
>  arch/Kconfig                                  |   4 +
>  arch/arm64/Kconfig                            |  10 ++
>  arch/arm64/include/asm/assembler.h            |  21 +++-
>  arch/arm64/kernel/entry-ftrace.S              |   6 +-
>  arch/arm64/kernel/entry.S                     | 104 +++++++++-------
>  arch/arm64/kernel/head.S                      |   4 +-
>  arch/arm64/kernel/probes/kprobes_trampoline.S |   2 +-
>  arch/arm64/kernel/sleep.S                     |   6 +-
>  drivers/firmware/efi/libstub/Makefile         |   3 +-
>  scripts/Makefile.gcc-plugins                  |   7 ++
>  scripts/gcc-plugins/arm64_rop_shield_plugin.c | 116 ++++++++++++++++++
>  11 files changed, 228 insertions(+), 55 deletions(-)
>  create mode 100644 scripts/gcc-plugins/arm64_rop_shield_plugin.c
>

I tried this on the Fedora config and it died in mutex_lock

#0  el1_sync () at arch/arm64/kernel/entry.S:570
#1  0xffff000008c62ed4 in __cmpxchg_case_acq_8 (new=<optimized out>,
    old=<optimized out>, ptr=<optimized out>)
    at ./arch/arm64/include/asm/atomic_lse.h:480
#2  __cmpxchg_acq (size=<optimized out>, new=<optimized out>,
    old=<optimized out>, ptr=<optimized out>)
    at ./arch/arm64/include/asm/cmpxchg.h:141
#3  __mutex_trylock_fast (lock=<optimized out>) at kernel/locking/mutex.c:144
#4  mutex_lock (lock=0xffff0000098dee48 <cgroup_mutex>) at kernel/locking/mutex.c:241
#5  0xffff000008f40978 in kallsyms_token_index ()

ffff000008bda050 <mutex_lock>:
ffff000008bda050:  a9bf7bfd  stp   x29, x30, [sp, #-16]!
ffff000008bda054:  aa0003e3  mov   x3, x0
ffff000008bda058:  d5384102  mrs   x2, sp_el0
ffff000008bda05c:  910003fd  mov   x29, sp
ffff000008bda060:  d2800001  mov   x1, #0x0                   // #0
ffff000008bda064:  97ff85af  bl    ffff000008bbb720 <__ll_sc___cmpxchg_case_acq_8>
ffff000008bda068:  d503201f  nop
ffff000008bda06c:  d503201f  nop
ffff000008bda070:  b50000c0  cbnz  x0, ffff000008bda088 <mutex_lock+0x38>
ffff000008bda074:  a8c17bfd  ldp   x29, x30, [sp], #16
ffff000008bda078:  910003f0  mov   x16, sp
ffff000008bda07c:  9248fa1f  and   sp, x16, #0xff7fffffffffffff
ffff000008bda080:  d65f03c0  ret
ffff000008bda084:  d503201f  nop
ffff000008bda088:  aa0303e0  mov   x0, x3
ffff000008bda08c:  97ffffe7  bl    ffff000008bda028 <__mutex_lock_slowpath>
ffff000008bda090:  910003fe  mov   x30, sp
ffff000008bda094:  b24903df  orr   sp, x30, #0x80000000000000
ffff000008bda098:  a8c17bfd  ldp   x29, x30, [sp], #16
ffff000008bda09c:  910003f0  mov   x16, sp
ffff000008bda0a0:  9248fa1f  and   sp, x16, #0xff7fffffffffffff
ffff000008bda0a4:  d65f03c0  ret

ffff000008bbb720 <__ll_sc___cmpxchg_case_acq_8>:
ffff000008bbb720:  f9800011  prfm  pstl1strm, [x0]
ffff000008bbb724:  c85ffc10  ldaxr x16, [x0]
ffff000008bbb728:  ca010211  eor   x17, x16, x1
ffff000008bbb72c:  b5000071  cbnz  x17, ffff000008bbb738 <__ll_sc___cmpxchg_case_acq_8+0x18>
ffff000008bbb730:  c8117c02  stxr  w17, x2, [x0]
ffff000008bbb734:  35ffff91  cbnz  w17, ffff000008bbb724 <__ll_sc___cmpxchg_case_acq_8+0x4>
ffff000008bbb738:  aa1003e0  mov   x0, x16
ffff000008bbb73c:  910003f0  mov   x16, sp
ffff000008bbb740:  9248fa1f  and   sp, x16, #0xff7fffffffffffff
ffff000008bbb744:  d65f03c0  ret

If I turn off CONFIG_ARM64_LSE_ATOMICS it works

Thanks,
Laura
On 18 August 2018 at 03:27, Laura Abbott <labbott@redhat.com> wrote: > On 08/02/2018 06:21 AM, Ard Biesheuvel wrote: >> >> This is a proof of concept I cooked up, primarily to trigger a discussion >> about whether there is a point to doing anything like this, and if there >> is, what the pitfalls are. Also, while I am not aware of any similar >> implementations, the idea is so simple that I would be surprised if nobody >> else thought of the same thing way before I did. >> >> The idea is that we can significantly limit the kernel's attack surface >> for ROP based attacks by clearing the stack pointer's sign bit before >> returning from a function, and setting it again right after proceeding >> from the [expected] return address. This should make it much more >> difficult >> to return to arbitrary gadgets, given that they rely on being chained to >> the next via a return address popped off the stack, and this is difficult >> when the stack pointer is invalid. >> >> Of course, 4 additional instructions per function return is not exactly >> for free, but they are just movs and adds, and leaf functions are >> disregarded unless they allocate a stack frame (this comes for free >> because simple_return insns are disregarded by the plugin) >> >> Please shoot, preferably with better ideas ... >> >> Ard Biesheuvel (3): >> arm64: use wrapper macro for bl/blx instructions from asm code >> gcc: plugins: add ROP shield plugin for arm64 >> arm64: enable ROP protection by clearing SP bit #55 across function >> returns >> >> arch/Kconfig | 4 + >> arch/arm64/Kconfig | 10 ++ >> arch/arm64/include/asm/assembler.h | 21 +++- >> arch/arm64/kernel/entry-ftrace.S | 6 +- >> arch/arm64/kernel/entry.S | 104 +++++++++------- >> arch/arm64/kernel/head.S | 4 +- >> arch/arm64/kernel/probes/kprobes_trampoline.S | 2 +- >> arch/arm64/kernel/sleep.S | 6 +- >> drivers/firmware/efi/libstub/Makefile | 3 +- >> scripts/Makefile.gcc-plugins | 7 ++ >> scripts/gcc-plugins/arm64_rop_shield_plugin.c | 116 ++++++++++++++++++ >> 11 files changed, 228 insertions(+), 55 deletions(-) >> create mode 100644 scripts/gcc-plugins/arm64_rop_shield_plugin.c >> > > I tried this on the Fedora config and it died in mutex_lock > > #0 el1_sync () at arch/arm64/kernel/entry.S:570 > #1 0xffff000008c62ed4 in __cmpxchg_case_acq_8 (new=<optimized out>, > old=<optimized out>, ptr=<optimized out>) at > ./arch/arm64/include/asm/atomic_lse.h:480 > #2 __cmpxchg_acq (size=<optimized out>, new=<optimized out>, old=<optimized > out>, ptr=<optimized out>) at ./arch/arm64/include/asm/cmpxchg.h:141 > #3 __mutex_trylock_fast (lock=<optimized out>) at > kernel/locking/mutex.c:144 > #4 mutex_lock (lock=0xffff0000098dee48 <cgroup_mutex>) at > kernel/locking/mutex.c:241 > #5 0xffff000008f40978 in kallsyms_token_index () > > ffff000008bda050 <mutex_lock>: > ffff000008bda050: a9bf7bfd stp x29, x30, [sp, #-16]! 
> ffff000008bda054:  aa0003e3  mov   x3, x0
> ffff000008bda058:  d5384102  mrs   x2, sp_el0
> ffff000008bda05c:  910003fd  mov   x29, sp
> ffff000008bda060:  d2800001  mov   x1, #0x0                   // #0
> ffff000008bda064:  97ff85af  bl    ffff000008bbb720 <__ll_sc___cmpxchg_case_acq_8>
> ffff000008bda068:  d503201f  nop
> ffff000008bda06c:  d503201f  nop
> ffff000008bda070:  b50000c0  cbnz  x0, ffff000008bda088 <mutex_lock+0x38>
> ffff000008bda074:  a8c17bfd  ldp   x29, x30, [sp], #16
> ffff000008bda078:  910003f0  mov   x16, sp
> ffff000008bda07c:  9248fa1f  and   sp, x16, #0xff7fffffffffffff
> ffff000008bda080:  d65f03c0  ret
> ffff000008bda084:  d503201f  nop
> ffff000008bda088:  aa0303e0  mov   x0, x3
> ffff000008bda08c:  97ffffe7  bl    ffff000008bda028 <__mutex_lock_slowpath>
> ffff000008bda090:  910003fe  mov   x30, sp
> ffff000008bda094:  b24903df  orr   sp, x30, #0x80000000000000
> ffff000008bda098:  a8c17bfd  ldp   x29, x30, [sp], #16
> ffff000008bda09c:  910003f0  mov   x16, sp
> ffff000008bda0a0:  9248fa1f  and   sp, x16, #0xff7fffffffffffff
> ffff000008bda0a4:  d65f03c0  ret
>
> ffff000008bbb720 <__ll_sc___cmpxchg_case_acq_8>:
> ffff000008bbb720:  f9800011  prfm  pstl1strm, [x0]
> ffff000008bbb724:  c85ffc10  ldaxr x16, [x0]
> ffff000008bbb728:  ca010211  eor   x17, x16, x1
> ffff000008bbb72c:  b5000071  cbnz  x17, ffff000008bbb738 <__ll_sc___cmpxchg_case_acq_8+0x18>
> ffff000008bbb730:  c8117c02  stxr  w17, x2, [x0]
> ffff000008bbb734:  35ffff91  cbnz  w17, ffff000008bbb724 <__ll_sc___cmpxchg_case_acq_8+0x4>
> ffff000008bbb738:  aa1003e0  mov   x0, x16
> ffff000008bbb73c:  910003f0  mov   x16, sp
> ffff000008bbb740:  9248fa1f  and   sp, x16, #0xff7fffffffffffff
> ffff000008bbb744:  d65f03c0  ret
>
> If I turn off CONFIG_ARM64_LSE_ATOMICS it works
>

Thanks Laura. It is unlikely that this series will be resubmitted in a form
that is anywhere close to its current form, but this is a useful data point
nonetheless. Disregarding ll_sc_atomics.o is straight-forward, and I am glad
to hear that it works without issue otherwise.
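To spell out what the disassembly above shows: the bl to
__ll_sc___cmpxchg_case_acq_8 is emitted from inside an inline-asm/alternative
sequence, so the plugin never sees a call it could follow with the usual "set
the SP bit again" fixup, while the out-of-line helper itself is built as
ordinary C and does get the bit-clearing epilogue before its ret. The caller
therefore resumes with an invalid stack pointer and the next stack access
traps, which also explains why disabling CONFIG_ARM64_LSE_ATOMICS (no
out-of-line LL/SC calls at all) makes the problem go away. A stripped-down
userspace sketch of the underlying blind spot, with hypothetical names and
plain AArch64 user code rather than kernel code:

	/* A call that lives inside an inline asm statement is invisible to the
	 * compiler, so a plugin that instruments call sites cannot pair it with
	 * a "make SP valid again" fixup, even though it can still instrument
	 * the callee's return path. */
	unsigned long out_of_line_helper(unsigned long v);

	/* Stand-in for the out-of-line LL/SC helper, defined in asm so the
	 * example links on its own (AArch64 only). */
	asm(
	"	.globl	out_of_line_helper\n"
	"out_of_line_helper:\n"
	"	add	x0, x0, #1\n"
	"	ret\n"
	);

	static unsigned long caller(unsigned long v)
	{
		register unsigned long x0 asm("x0") = v;

		/* The compiler only sees an opaque asm statement here, not a
		 * call, so there is nothing a call-site-instrumenting plugin
		 * could hook after it. */
		asm volatile("bl out_of_line_helper"
			     : "+r" (x0)
			     :
			     : "x30", "memory");

		return x0;
	}

	int main(void)
	{
		return caller(41) == 42 ? 0 : 1;
	}

Building the out-of-line LL/SC object without the plugin, as suggested above,
removes the asymmetry: the helper then returns through an ordinary,
uninstrumented ret, which stays consistent with its uninstrumented call sites.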