| Message ID | 20181005081333.15018-2-ard.biesheuvel@linaro.org (mailing list archive) |
|---|---|
| State | New, archived |
| Series | patchable function pointers for pluggable crypto routines |
On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> Add a function pointer abstraction that can be implemented by the arch
> in a manner that avoids the downsides of function pointers, i.e., the
> fact that they are typically located in a writable data section, and
> their vulnerability to Spectre like defects.
>
> The FFP (or fast function pointer) is callable as a function, since
> the generic incarnation is simply that. However, due to the fact that
> C does not distinguish between functions and function pointers at the
> call site, the architecture can instead emit it as a patchable sequence
> of instructions consisting of ordinary branches.

This is basically a static_key, except for indirect function calls?

So why not call the thing static_func or static_call or something like that?
On 5 October 2018 at 15:57, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>> Add a function pointer abstraction that can be implemented by the arch
>> in a manner that avoids the downsides of function pointers, i.e., the
>> fact that they are typically located in a writable data section, and
>> their vulnerability to Spectre like defects.
>>
>> The FFP (or fast function pointer) is callable as a function, since
>> the generic incarnation is simply that. However, due to the fact that
>> C does not distinguish between functions and function pointers at the
>> call site, the architecture can instead emit it as a patchable sequence
>> of instructions consisting of ordinary branches.
>
> This is basically a static_key, except for indirect function calls?

Yes, that is why I put you on cc :-)

> So why not call the thing static_func or static_call or something like
> that?

Yep that sounds better.
On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
> new file mode 100644
> index 000000000000..8fc3b4c9b38f
> --- /dev/null
> +++ b/include/linux/ffp.h
> @@ -0,0 +1,43 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __LINUX_FFP_H
> +#define __LINUX_FFP_H
> +
> +#include <linux/types.h>
> +#include <linux/compiler.h>
> +
> +#ifdef CONFIG_HAVE_ARCH_FFP
> +#include <asm/ffp.h>
> +#else
> +
> +struct ffp {
> +	void (**fn)(void);
> +	void (*default_fn)(void);
> +};
> +
> +#define DECLARE_FFP(_fn, _def)		\
> +	extern typeof(_def) *_fn;	\
> +	extern struct ffp const __ffp_ ## _fn
> +
> +#define DEFINE_FFP(_fn, _def)		\
> +	typeof(_def) *_fn = &_def;	\
> +	struct ffp const __ffp_ ## _fn	\
> +		= { (void(**)(void))&_fn, (void(*)(void))&_def };	\
> +	EXPORT_SYMBOL(__ffp_ ## _fn)
> +
> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
> +{
> +	WRITE_ONCE(*m->fn, new_fn);
> +}
> +
> +static inline void ffp_reset_target(const struct ffp *m)
> +{
> +	WRITE_ONCE(*m->fn, m->default_fn);
> +}
> +
> +#endif
> +
> +#define SET_FFP(_fn, _new)	ffp_set_target(&__ffp_ ## _fn, _new)
> +#define RESET_FFP(_fn)		ffp_reset_target(&__ffp_ ## _fn)
> +
> +#endif

I don't understand this interface. There is no wrapper for the call site, so how are we going to patch all call-sites when you update the target?
On 5 October 2018 at 16:14, Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote: >> diff --git a/include/linux/ffp.h b/include/linux/ffp.h >> new file mode 100644 >> index 000000000000..8fc3b4c9b38f >> --- /dev/null >> +++ b/include/linux/ffp.h >> @@ -0,0 +1,43 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> + >> +#ifndef __LINUX_FFP_H >> +#define __LINUX_FFP_H >> + >> +#include <linux/types.h> >> +#include <linux/compiler.h> >> + >> +#ifdef CONFIG_HAVE_ARCH_FFP >> +#include <asm/ffp.h> >> +#else >> + >> +struct ffp { >> + void (**fn)(void); >> + void (*default_fn)(void); >> +}; >> + >> +#define DECLARE_FFP(_fn, _def) \ >> + extern typeof(_def) *_fn; \ >> + extern struct ffp const __ffp_ ## _fn >> + >> +#define DEFINE_FFP(_fn, _def) \ >> + typeof(_def) *_fn = &_def; \ >> + struct ffp const __ffp_ ## _fn \ >> + = { (void(**)(void))&_fn, (void(*)(void))&_def }; \ >> + EXPORT_SYMBOL(__ffp_ ## _fn) >> + >> +static inline void ffp_set_target(const struct ffp *m, void *new_fn) >> +{ >> + WRITE_ONCE(*m->fn, new_fn); >> +} >> + >> +static inline void ffp_reset_target(const struct ffp *m) >> +{ >> + WRITE_ONCE(*m->fn, m->default_fn); >> +} >> + >> +#endif >> + >> +#define SET_FFP(_fn, _new) ffp_set_target(&__ffp_ ## _fn, _new) >> +#define RESET_FFP(_fn) ffp_reset_target(&__ffp_ ## _fn) >> + >> +#endif > > I don't understand this interface. There is no wrapper for the call > site, so how are we going to patch all call-sites when you update the > target? This is the generic implementation, and in this case, it is just a function pointer which gets dereferenced and called indirectly at each call site. In the arch specific implementations (for ones where it matters, see 2/9), it is more like a PLT entry, which gets called from each call site, and which can be patched to point to another function. So we replace a single indirect call with a direct call plus a direct jump.
> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote: >> diff --git a/include/linux/ffp.h b/include/linux/ffp.h >> new file mode 100644 >> index 000000000000..8fc3b4c9b38f >> --- /dev/null >> +++ b/include/linux/ffp.h >> @@ -0,0 +1,43 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> + >> +#ifndef __LINUX_FFP_H >> +#define __LINUX_FFP_H >> + >> +#include <linux/types.h> >> +#include <linux/compiler.h> >> + >> +#ifdef CONFIG_HAVE_ARCH_FFP >> +#include <asm/ffp.h> >> +#else >> + >> +struct ffp { >> + void (**fn)(void); >> + void (*default_fn)(void); >> +}; >> + >> +#define DECLARE_FFP(_fn, _def) \ >> + extern typeof(_def) *_fn; \ >> + extern struct ffp const __ffp_ ## _fn >> + >> +#define DEFINE_FFP(_fn, _def) \ >> + typeof(_def) *_fn = &_def; \ >> + struct ffp const __ffp_ ## _fn \ >> + = { (void(**)(void))&_fn, (void(*)(void))&_def }; \ >> + EXPORT_SYMBOL(__ffp_ ## _fn) >> + >> +static inline void ffp_set_target(const struct ffp *m, void *new_fn) >> +{ >> + WRITE_ONCE(*m->fn, new_fn); >> +} >> + >> +static inline void ffp_reset_target(const struct ffp *m) >> +{ >> + WRITE_ONCE(*m->fn, m->default_fn); >> +} >> + >> +#endif >> + >> +#define SET_FFP(_fn, _new) ffp_set_target(&__ffp_ ## _fn, _new) >> +#define RESET_FFP(_fn) ffp_reset_target(&__ffp_ ## _fn) >> + >> +#endif > > I don't understand this interface. There is no wrapper for the call > site, so how are we going to patch all call-sites when you update the > target? I’m also confused. Anyway, we have patchable functions on x86. They’re called PVOPs, and they’re way overcomplicated. I’ve proposed a better way that should generate better code, be more portable, and be more maintainable. It goes like this. To call the function, you literally just call the default implementation. It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. 
At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we’re using the “nonexistent wrapper” approach, we can link in a .S or linker script to alias them to the default implementation.

To patch them, we just patch them. It can’t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment. Or we ask gcc for an extension to align calls to a function.)

Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module.
On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote: > > >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote: >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h >>> new file mode 100644 >>> index 000000000000..8fc3b4c9b38f >>> --- /dev/null >>> +++ b/include/linux/ffp.h >>> @@ -0,0 +1,43 @@ >>> +/* SPDX-License-Identifier: GPL-2.0 */ >>> + >>> +#ifndef __LINUX_FFP_H >>> +#define __LINUX_FFP_H >>> + >>> +#include <linux/types.h> >>> +#include <linux/compiler.h> >>> + >>> +#ifdef CONFIG_HAVE_ARCH_FFP >>> +#include <asm/ffp.h> >>> +#else >>> + >>> +struct ffp { >>> + void (**fn)(void); >>> + void (*default_fn)(void); >>> +}; >>> + >>> +#define DECLARE_FFP(_fn, _def) \ >>> + extern typeof(_def) *_fn; \ >>> + extern struct ffp const __ffp_ ## _fn >>> + >>> +#define DEFINE_FFP(_fn, _def) \ >>> + typeof(_def) *_fn = &_def; \ >>> + struct ffp const __ffp_ ## _fn \ >>> + = { (void(**)(void))&_fn, (void(*)(void))&_def }; \ >>> + EXPORT_SYMBOL(__ffp_ ## _fn) >>> + >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn) >>> +{ >>> + WRITE_ONCE(*m->fn, new_fn); >>> +} >>> + >>> +static inline void ffp_reset_target(const struct ffp *m) >>> +{ >>> + WRITE_ONCE(*m->fn, m->default_fn); >>> +} >>> + >>> +#endif >>> + >>> +#define SET_FFP(_fn, _new) ffp_set_target(&__ffp_ ## _fn, _new) >>> +#define RESET_FFP(_fn) ffp_reset_target(&__ffp_ ## _fn) >>> + >>> +#endif >> >> I don't understand this interface. There is no wrapper for the call >> site, so how are we going to patch all call-sites when you update the >> target? > > I’m also confused. > > Anyway, we have patchable functions on x86. They’re called PVOPs, and they’re way overcomplicated. > > I’ve proposed a better way that should generate better code, be more portable, and be more maintainable. It goes like this. > > To call the function, you literally just call the default implementation. 
It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we’re using the “nonexistent wrapper” approach, we can link in a .S or linker script to alias them to the default implementation. > > To patch them, we just patch them. It can’t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment. Or we ask gcc for an extension to align calls to a function.) > > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module. Yeah nothing is ever simple on x86 :-( So are you saying the approach i use in patch #2 (which would translate to emitting a jmpq instruction pointing to the default implementation, and patching it at runtime to point elsewhere) would not fly on x86?
On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >> > >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote: > >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h > >>> new file mode 100644 > >>> index 000000000000..8fc3b4c9b38f > >>> --- /dev/null > >>> +++ b/include/linux/ffp.h > >>> @@ -0,0 +1,43 @@ > >>> +/* SPDX-License-Identifier: GPL-2.0 */ > >>> + > >>> +#ifndef __LINUX_FFP_H > >>> +#define __LINUX_FFP_H > >>> + > >>> +#include <linux/types.h> > >>> +#include <linux/compiler.h> > >>> + > >>> +#ifdef CONFIG_HAVE_ARCH_FFP > >>> +#include <asm/ffp.h> > >>> +#else > >>> + > >>> +struct ffp { > >>> + void (**fn)(void); > >>> + void (*default_fn)(void); > >>> +}; > >>> + > >>> +#define DECLARE_FFP(_fn, _def) \ > >>> + extern typeof(_def) *_fn; \ > >>> + extern struct ffp const __ffp_ ## _fn > >>> + > >>> +#define DEFINE_FFP(_fn, _def) \ > >>> + typeof(_def) *_fn = &_def; \ > >>> + struct ffp const __ffp_ ## _fn \ > >>> + = { (void(**)(void))&_fn, (void(*)(void))&_def }; \ > >>> + EXPORT_SYMBOL(__ffp_ ## _fn) > >>> + > >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn) > >>> +{ > >>> + WRITE_ONCE(*m->fn, new_fn); > >>> +} > >>> + > >>> +static inline void ffp_reset_target(const struct ffp *m) > >>> +{ > >>> + WRITE_ONCE(*m->fn, m->default_fn); > >>> +} > >>> + > >>> +#endif > >>> + > >>> +#define SET_FFP(_fn, _new) ffp_set_target(&__ffp_ ## _fn, _new) > >>> +#define RESET_FFP(_fn) ffp_reset_target(&__ffp_ ## _fn) > >>> + > >>> +#endif > >> > >> I don't understand this interface. There is no wrapper for the call > >> site, so how are we going to patch all call-sites when you update the > >> target? > > > > I’m also confused. > > > > Anyway, we have patchable functions on x86. They’re called PVOPs, and they’re way overcomplicated. 
> > > > I’ve proposed a better way that should generate better code, be more portable, and be more maintainable. It goes like this. > > > > To call the function, you literally just call the default implementation. It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we’re using the “nonexistent wrapper” approach, we can link in a .S or linker script to alias them to the default implementation. > > > > To patch them, we just patch them. It can’t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment. Or we ask gcc for an extension to align calls to a function.) > > > > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module. > > Yeah nothing is ever simple on x86 :-( > > So are you saying the approach i use in patch #2 (which would > translate to emitting a jmpq instruction pointing to the default > implementation, and patching it at runtime to point elsewhere) would > not fly on x86? After getting some more sleep, I'm obviously wrong. The text_poke_bp() mechanism will work. It's just really slow. Let me try to summarize some of the issues. First, when emitting jumps and calls from inline asm on x86, there are a few considerations that are annoying: 1. Following the x86_64 ABI calling conventions is basically impossible. x86_64 requires a 128-byte redzone and 16-byte stack alignment. After much discussion a while back, we decided that it was flat-out impossible on current gcc to get the stack pointer aligned in a known manner in an inline asm statement. Instead, if we actually need alignment, we need to align manually. 
Fortunately, the kernel is built with an override that forces only 8-byte alignment (on *most* GCC versions). But for crypto in particular, it sucks extra, since the crypto code is basically the only thing in the kernel that actually wants 16-byte alignment. I don't think this is a huge problem in practice, but it's annoying. And the kernel is built without a redzone.

2. On x86_64, depending on config, we either need frame pointers or ORC. ORC is no big deal -- it Just Works (tm). Frame pointers need extra asm hackery. It's doable, but it's still annoying.

3. Actually getting the asm constraints right to do what a C programmer expects is distinctly nontrivial. I just fixed an extremely longstanding bug in the vDSO code in which the asm constraints for the syscall fallback were wrong in such a way that GCC didn't notice that the fallback wrote to its output parameter. Whoops.

And having all this asm hackery per architecture is ugly and annoying.

So my suggestion is to do it like a regular relocation. Call a function the normal way (make it literally be a C function and call it), and rig up the noinline and noclone attributes and whatever else is needed to make sure that it's a *relocatable* call. Then the toolchain emits ELF relocations saying exactly what part of the text needs patching, and we can patch it at runtime. On x86, this is a bit extra annoying because we can't fully reliably parse backwards to find the beginning of the instruction, but objtool could do it.

And then we get something that is mostly arch-neutral! Because surely ARM can also use a relocation-based mechanism.

I will generally object to x86 containing more than one inline-asm-hackery-based patchable call mechanism, which your series will add. I would *love* to see a non-inline-asm one, and then we could move most of the x86 paravirt crap over to use it for a big win in readability and maintainability.

--Andy
On 5 October 2018 at 18:58, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >> >> On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote: >> > >> > >> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> >> >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote: >> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h >> >>> new file mode 100644 >> >>> index 000000000000..8fc3b4c9b38f >> >>> --- /dev/null >> >>> +++ b/include/linux/ffp.h >> >>> @@ -0,0 +1,43 @@ >> >>> +/* SPDX-License-Identifier: GPL-2.0 */ >> >>> + >> >>> +#ifndef __LINUX_FFP_H >> >>> +#define __LINUX_FFP_H >> >>> + >> >>> +#include <linux/types.h> >> >>> +#include <linux/compiler.h> >> >>> + >> >>> +#ifdef CONFIG_HAVE_ARCH_FFP >> >>> +#include <asm/ffp.h> >> >>> +#else >> >>> + >> >>> +struct ffp { >> >>> + void (**fn)(void); >> >>> + void (*default_fn)(void); >> >>> +}; >> >>> + >> >>> +#define DECLARE_FFP(_fn, _def) \ >> >>> + extern typeof(_def) *_fn; \ >> >>> + extern struct ffp const __ffp_ ## _fn >> >>> + >> >>> +#define DEFINE_FFP(_fn, _def) \ >> >>> + typeof(_def) *_fn = &_def; \ >> >>> + struct ffp const __ffp_ ## _fn \ >> >>> + = { (void(**)(void))&_fn, (void(*)(void))&_def }; \ >> >>> + EXPORT_SYMBOL(__ffp_ ## _fn) >> >>> + >> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn) >> >>> +{ >> >>> + WRITE_ONCE(*m->fn, new_fn); >> >>> +} >> >>> + >> >>> +static inline void ffp_reset_target(const struct ffp *m) >> >>> +{ >> >>> + WRITE_ONCE(*m->fn, m->default_fn); >> >>> +} >> >>> + >> >>> +#endif >> >>> + >> >>> +#define SET_FFP(_fn, _new) ffp_set_target(&__ffp_ ## _fn, _new) >> >>> +#define RESET_FFP(_fn) ffp_reset_target(&__ffp_ ## _fn) >> >>> + >> >>> +#endif >> >> >> >> I don't understand this interface. There is no wrapper for the call >> >> site, so how are we going to patch all call-sites when you update the >> >> target? 
>> > >> > I’m also confused. >> > >> > Anyway, we have patchable functions on x86. They’re called PVOPs, and they’re way overcomplicated. >> > >> > I’ve proposed a better way that should generate better code, be more portable, and be more maintainable. It goes like this. >> > >> > To call the function, you literally just call the default implementation. It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we’re using the “nonexistent wrapper” approach, we can link in a .S or linker script to alias them to the default implementation. >> > >> > To patch them, we just patch them. It can’t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment. Or we ask gcc for an extension to align calls to a function.) >> > >> > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module. >> >> Yeah nothing is ever simple on x86 :-( >> >> So are you saying the approach i use in patch #2 (which would >> translate to emitting a jmpq instruction pointing to the default >> implementation, and patching it at runtime to point elsewhere) would >> not fly on x86? > > After getting some more sleep, I'm obviously wrong. The > text_poke_bp() mechanism will work. It's just really slow. > OK > Let me try to summarize some of the issues. First, when emitting > jumps and calls from inline asm on x86, there are a few considerations > that are annoying: > > 1. Following the x86_64 ABI calling conventions is basically > impossible. x86_64 requires a 128-byte redzone and 16-byte stack > alignment. 
> After much discussion a while back, we decided that it was
> flat-out impossible on current gcc to get the stack pointer aligned in
> a known manner in an inline asm statement. Instead, if we actually
> need alignment, we need to align manually. Fortunately, the kernel is
> built with an override that forces only 8-byte alignment (on *most*
> GCC versions). But for crypto in particular, it sucks extra, since
> the crypto code is basically the only thing in the kernel that
> actually wants 16-byte alignment. I don't think this is a huge
> problem in practice, but it's annoying. And the kernel is built
> without a redzone.
>
> 2. On x86_64, depending on config, we either need frame pointers or
> ORC. ORC is no big deal -- it Just Works (tm). Frame pointers need
> extra asm hackery. It's doable, but it's still annoying.
>
> 3. Actually getting the asm constraints right to do what a C
> programmer expects is distinctly nontrivial. I just fixed an
> extremely longstanding bug in the vDSO code in which the asm
> constraints for the syscall fallback were wrong in such a way that GCC
> didn't notice that the fallback wrote to its output parameter.
> Whoops.

OK, so the thing I am missing is why this all matters.

Note that the compiler should take care of all of this. It emits a call to a function with external linkage having prototype X, and all the inline asm does is emit a jmp to some function having that same prototype, either the default one or the one we patched in.

Apologies if I am missing something obvious here: as you know, x86 is not my focus in general.

So in the arm64 case, we have

"	.globl	" #_fn "	\n"	\
#_fn "	:			\n"	\
"	b	" #_def "	\n"	\
"	nop			\n"	\
"	nop			\n"	\
"	nop			\n"	\
"	nop			\n"	\

and all the patching does is replace the target of that branch (the NOPs are there for jumps that are more than 128 MB away, which require an indirect jump on arm64).

> And having all this asm hackery per architecture is ugly and annoying.

True.
But note that only the architectures that cannot tolerate the default instantiation using function pointers will require a special implementation.

> So my suggestion is to do it like a regular relocation. Call a
> function the normal way (make it literally be a C function and call
> it), and rig up the noinline and noclone attributes and whatever else
> is needed to make sure that it's a *relocatable* call. Then the
> toolchain emits ELF relocations saying exactly what part of the text
> needs patching, and we can patch it at runtime. On x86, this is a bit
> extra annoying because we can't fully reliably parse backwards to find
> the beginning of the instruction, but objtool could do it.
>
> And then we get something that is mostly arch-neutral! Because surely
> ARM can also use a relocation-based mechanism.

Not really. We don't have any build time tooling for this, and CONFIG_RELOCATABLE only produces R_AARCH64_RELATIVE relocations for absolute quantities. So it would mean we'd have to start building vmlinux with --emit-relocs, add tooling to parse all of that etc etc.

> I will generally object to x86 containing more than one
> inline-asm-hackery-based patchable call mechanism, which your series
> will add. I would *love* to see a non-inline-asm one, and then we
> could move most of the x86 paravirt crap over to use it for a big win
> in readability and maintainability.

Fair enough.
On Fri, Oct 5, 2018 at 10:11 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On 5 October 2018 at 18:58, Andy Lutomirski <luto@kernel.org> wrote: > > On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > >> > >> On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote: > >> > > >> > > >> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote: > >> >> > >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote: > >> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h > >> >>> new file mode 100644 > >> >>> index 000000000000..8fc3b4c9b38f > >> >>> --- /dev/null > >> >>> +++ b/include/linux/ffp.h > >> >>> @@ -0,0 +1,43 @@ > >> >>> +/* SPDX-License-Identifier: GPL-2.0 */ > >> >>> + > >> >>> +#ifndef __LINUX_FFP_H > >> >>> +#define __LINUX_FFP_H > >> >>> + > >> >>> +#include <linux/types.h> > >> >>> +#include <linux/compiler.h> > >> >>> + > >> >>> +#ifdef CONFIG_HAVE_ARCH_FFP > >> >>> +#include <asm/ffp.h> > >> >>> +#else > >> >>> + > >> >>> +struct ffp { > >> >>> + void (**fn)(void); > >> >>> + void (*default_fn)(void); > >> >>> +}; > >> >>> + > >> >>> +#define DECLARE_FFP(_fn, _def) \ > >> >>> + extern typeof(_def) *_fn; \ > >> >>> + extern struct ffp const __ffp_ ## _fn > >> >>> + > >> >>> +#define DEFINE_FFP(_fn, _def) \ > >> >>> + typeof(_def) *_fn = &_def; \ > >> >>> + struct ffp const __ffp_ ## _fn \ > >> >>> + = { (void(**)(void))&_fn, (void(*)(void))&_def }; \ > >> >>> + EXPORT_SYMBOL(__ffp_ ## _fn) > >> >>> + > >> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn) > >> >>> +{ > >> >>> + WRITE_ONCE(*m->fn, new_fn); > >> >>> +} > >> >>> + > >> >>> +static inline void ffp_reset_target(const struct ffp *m) > >> >>> +{ > >> >>> + WRITE_ONCE(*m->fn, m->default_fn); > >> >>> +} > >> >>> + > >> >>> +#endif > >> >>> + > >> >>> +#define SET_FFP(_fn, _new) ffp_set_target(&__ffp_ ## _fn, _new) > >> >>> +#define RESET_FFP(_fn) ffp_reset_target(&__ffp_ ## _fn) 
> >> >>> + > >> >>> +#endif > >> >> > >> >> I don't understand this interface. There is no wrapper for the call > >> >> site, so how are we going to patch all call-sites when you update the > >> >> target? > >> > > >> > I’m also confused. > >> > > >> > Anyway, we have patchable functions on x86. They’re called PVOPs, and they’re way overcomplicated. > >> > > >> > I’ve proposed a better way that should generate better code, be more portable, and be more maintainable. It goes like this. > >> > > >> > To call the function, you literally just call the default implementation. It *might* be necessary to call a nonexistent wrapper to avoid annoying optimizations. At build time, the kernel is built with relocations, so the object files contain relocation entries for the call. We collect these entries into a table. If we’re using the “nonexistent wrapper” approach, we can link in a .S or linker script to alias them to the default implementation. > >> > > >> > To patch them, we just patch them. It can’t necessarily be done concurrently because nothing forces the right alignment. But we can do it at boot time and module load time. (Maybe we can patch at runtime on architectures with appropriate instruction alignment. Or we ask gcc for an extension to align calls to a function.) > >> > > >> > Most of the machinery already exists: this is roughly how the module loader resolves calls outside of a module. > >> > >> Yeah nothing is ever simple on x86 :-( > >> > >> So are you saying the approach i use in patch #2 (which would > >> translate to emitting a jmpq instruction pointing to the default > >> implementation, and patching it at runtime to point elsewhere) would > >> not fly on x86? > > > > After getting some more sleep, I'm obviously wrong. The > > text_poke_bp() mechanism will work. It's just really slow. > > > > OK > > > Let me try to summarize some of the issues. 
First, when emitting > > jumps and calls from inline asm on x86, there are a few considerations > > that are annoying: > > > > 1. Following the x86_64 ABI calling conventions is basically > > impossible. x86_64 requires a 128-byte redzone and 16-byte stack > > alignment. After much discussion a while back, we decided that it was > > flat-out impossible on current gcc to get the stack pointer aligned in > > a known manner in an inline asm statement. Instead, if we actually > > need alignment, we need to align manually. Fortunately, the kernel is > > built with an override that forces only 8-byte alignment (on *most* > > GCC versions). But for crypto in particular, it sucks extra, since > > the crypto code is basically the only thing in the kernel that > > actually wants 16-byte alignment. I don't think this is a huge > > problem in practice, but it's annoying. And the kernel is built > > without a redzone. > > > > 2. On x86_64, depending on config, we either need frame pointers or > > ORC. ORC is no big deal -- it Just Works (tm). Frame pointers need > > extra asm hackery. It's doable, but it's still annoying. > > > > 3. Actually getting the asm constraints right to do what a C > > programmer expects is distinctly nontrivial. I just fixed an > > extremely longstanding bug in the vDSO code in which the asm > > constraints for the syscall fallback were wrong in such a way that GCC > > didn't notice that the fallback wrote to its output parameter. > > Whoops. > > > > OK, so the thing I am missing is why this all matters. > > Note that the compiler should take care of all of this. It emits a > call a function with external linkage having prototype X, and all the > inline asm does is emit a jmp to some function having that same > prototype, either the default one or the one we patched in. > > Apologies if I am missing something obvious here: as you know, x86 is > not my focus in general. 
The big issue that bothers me isn't the x86-ism so much as the nasty interactions with the optimizer. On x86, we have all of this working. It's in arch/x86/include/asm/paravirt_types.h, and it looks roughly like:

asm volatile(pre					\
	     paravirt_alt(PARAVIRT_CALL)		\
	     post					\
	     : call_clbr, ASM_CALL_CONSTRAINT		\
	     : paravirt_type(op),			\
	       paravirt_clobber(clbr),			\
	       ##__VA_ARGS__				\
	     : "memory", "cc" extra_clbr);

With some extra magic for the constraints. And I don't even think this is strictly correct -- from very recent experience, telling the compiler that "memory" is clobbered and that a bunch of arguments are used as numeric inputs may not actually imply that the asm modifies the target of pointer arguments. Check this lovely bug out:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=715bd9d12f84d8f5cc8ad21d888f9bc304a8eb0b

As far as I can tell, the whole PVOP infrastructure has the same bug. And I don't see how to avoid it generically on x86 or any other architecture. (PeterZ, am I wrong? Are we really just getting lucky that x86 pvop calls actually work? Or do we not have enough of them that take pointers as arguments for this to matter?)

Plus, asm volatile ( ..., "memory" ) is a barrier and makes code generation suck.

Whereas, if we use my suggestion, the semantics are precisely those of any other C function call because, as far as GCC is concerned, it *is* a C function call. So the generated code *is* a function call.

--Andy
On 5 October 2018 at 19:20, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Oct 5, 2018 at 10:11 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 5 October 2018 at 18:58, Andy Lutomirski <luto@kernel.org> wrote:
>> > On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> >> On 5 October 2018 at 17:08, Andy Lutomirski <luto@amacapital.net> wrote:
>> >> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>> >> >>> [include/linux/ffp.h hunk snipped; see the full patch below]
>> >> >>
>> >> >> I don't understand this interface.  There is no wrapper for the call
>> >> >> site, so how are we going to patch all call-sites when you update the
>> >> >> target?
>> >> >
>> >> > I’m also confused.
>> >> >
>> >> > Anyway, we have patchable functions on x86.  They’re called PVOPs,
>> >> > and they’re way overcomplicated.
>> >> >
>> >> > I’ve proposed a better way that should generate better code, be more
>> >> > portable, and be more maintainable.  It goes like this.
>> >> >
>> >> > To call the function, you literally just call the default
>> >> > implementation.  It *might* be necessary to call a nonexistent
>> >> > wrapper to avoid annoying optimizations.  At build time, the kernel
>> >> > is built with relocations, so the object files contain relocation
>> >> > entries for the call.  We collect these entries into a table.  If
>> >> > we’re using the “nonexistent wrapper” approach, we can link in a .S
>> >> > or linker script to alias them to the default implementation.
>> >> >
>> >> > To patch them, we just patch them.  It can’t necessarily be done
>> >> > concurrently because nothing forces the right alignment.  But we can
>> >> > do it at boot time and module load time.  (Maybe we can patch at
>> >> > runtime on architectures with appropriate instruction alignment.  Or
>> >> > we ask gcc for an extension to align calls to a function.)
>> >> >
>> >> > Most of the machinery already exists: this is roughly how the module
>> >> > loader resolves calls outside of a module.
>> >>
>> >> Yeah nothing is ever simple on x86 :-(
>> >>
>> >> So are you saying the approach I use in patch #2 (which would
>> >> translate to emitting a jmpq instruction pointing to the default
>> >> implementation, and patching it at runtime to point elsewhere) would
>> >> not fly on x86?
>> >
>> > After getting some more sleep, I'm obviously wrong.  The
>> > text_poke_bp() mechanism will work.  It's just really slow.
>>
>> OK
>>
>> > Let me try to summarize some of the issues.  First, when emitting
>> > jumps and calls from inline asm on x86, there are a few considerations
>> > that are annoying:
>> >
>> > 1. Following the x86_64 ABI calling conventions is basically
>> > impossible.  x86_64 requires a 128-byte redzone and 16-byte stack
>> > alignment.  After much discussion a while back, we decided that it was
>> > flat-out impossible on current gcc to get the stack pointer aligned in
>> > a known manner in an inline asm statement.  Instead, if we actually
>> > need alignment, we need to align manually.  Fortunately, the kernel is
>> > built with an override that forces only 8-byte alignment (on *most*
>> > GCC versions).  But for crypto in particular, it sucks extra, since
>> > the crypto code is basically the only thing in the kernel that
>> > actually wants 16-byte alignment.  I don't think this is a huge
>> > problem in practice, but it's annoying.  And the kernel is built
>> > without a redzone.
>> >
>> > 2. On x86_64, depending on config, we either need frame pointers or
>> > ORC.  ORC is no big deal -- it Just Works (tm).  Frame pointers need
>> > extra asm hackery.  It's doable, but it's still annoying.
>> >
>> > 3. Actually getting the asm constraints right to do what a C
>> > programmer expects is distinctly nontrivial.  I just fixed an
>> > extremely longstanding bug in the vDSO code in which the asm
>> > constraints for the syscall fallback were wrong in such a way that GCC
>> > didn't notice that the fallback wrote to its output parameter.
>> > Whoops.
>>
>> OK, so the thing I am missing is why this all matters.
>>
>> Note that the compiler should take care of all of this.  It emits a
>> call to a function with external linkage having prototype X, and all
>> the inline asm does is emit a jmp to some function having that same
>> prototype, either the default one or the one we patched in.
>>
>> Apologies if I am missing something obvious here: as you know, x86 is
>> not my focus in general.
>
> The big issue that bothers me isn't the x86-ism so much as the nasty
> interactions with the optimizer.  [full message quoted above; snipped]

But it is the *compiler* that actually emits the call.

Only, the target of that call is a jmpq to another location where some
version of the routine lives, and all have the same prototype.

How is that any different from PLTs in shared libraries?
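Ard's PLT analogy can be sketched in plain C (a hypothetical userspace model with made-up names, not kernel code): callers get an ordinary direct call to a trampoline with external linkage, and only the trampoline's single branch target changes when a new implementation is selected -- modeled here by a pointer, where an arch implementation would rewrite a jmp instruction in place.

```c
/* Two interchangeable implementations (hypothetical names). */
static int impl_default(int x) { return x + 1; }   /* generic C version */
static int impl_fast(int x)    { return x + 100; } /* e.g. a SIMD version */

/* Stand-in for the patchable branch target: in an arch implementation
 * this would be the destination of a jmp instruction rewritten at
 * runtime, not a pointer load from writable data. */
static int (*target)(int) = impl_default;

/* Callers only ever see this ordinary function with a fixed prototype;
 * the tail call through 'target' plays the role of the PLT-style jmpq. */
int my_func(int x) { return target(x); }

/* Retargeting, as ffp_set_target() would do it. */
void my_func_set(int (*fn)(int)) { target = fn; }
```

As with a PLT, the compiler is unaware of the indirection: it emits a plain call to `my_func` under normal C semantics, which sidesteps the asm-constraint problems Andy describes.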
On Fri, Oct 5, 2018 at 10:23 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> [earlier discussion quoted in full above; snipped]
>
> But it is the *compiler* that actually emits the call.
>
> Only, the target of that call is a jmpq to another location where some
> version of the routine lives, and all have the same prototype.
>
> How is that any different from PLTs in shared libraries?

Ah, I see, I misunderstood your code.  See other email.

I think that, if you rework your series a bit to have a generic
version that works on all architectures, then do it like you did on
ARM, and make sure you leave the door open for the inline patching
approach, then it looks pretty good.

(None of this is to say that I disagree with Jason, though -- I'm not
entirely convinced that this makes sense for Zinc.  But maybe it can
be done in a way that makes everyone happy.)
On Fri, Oct 5, 2018 at 7:29 PM Andy Lutomirski <luto@kernel.org> wrote:
> (None of this is to say that I disagree with Jason, though -- I'm not
> entirely convinced that this makes sense for Zinc.  But maybe it can
> be done in a way that makes everyone happy.)

Zinc indeed will continue to push in the simpler and more minimal
direction.  Down the line I'm open to trying and benching a few
different ways of going about it with dynamic patching -- something
that will be pretty easy to experiment with given the lean structure
of Zinc -- but for the initial merge I intend to do it the way it is,
which is super fast and pretty straightforward to follow.

Jason
On Fri, Oct 5, 2018 at 10:38 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Fri, Oct 5, 2018 at 7:29 PM Andy Lutomirski <luto@kernel.org> wrote:
> > (None of this is to say that I disagree with Jason, though -- I'm not
> > entirely convinced that this makes sense for Zinc.  But maybe it can
> > be done in a way that makes everyone happy.)
>
> Zinc indeed will continue to push in the simpler and more minimal
> direction.  [...]

I *think* the only change to Zinc per se would be that the calls like
chacha20_simd() would be static calls or patchable functions or
whatever we want to call them.  And there could be a debugfs to
override the default selection.

Ard, I don't think that sticking this in udev rules makes sense.  The
kernel has basically complete information as to what the right choice
is, and that will change over time as the implementation gets tuned,
and the udev rules will never get updated in sync.
Hey Andy,

On Fri, Oct 5, 2018 at 7:44 PM Andy Lutomirski <luto@kernel.org> wrote:
> I *think* the only change to Zinc per se would be that the calls like
> chacha20_simd() would be static calls or patchable functions or
> whatever we want to call them.  And there could be a debugfs to
> override the default selection.

Yea, right, exactly.  It turns out this is really easy to do with the
way it's structured now.  I'd actually experimented considerably with
using static keys a while back, but couldn't find any performance
difference on any platform at all (four ARM microarchitectures, three
MIPS, various random Intel, an old PowerPC), so went with the simplest
solution.  But we can certainly play with more elaborate patching
mechanisms later on and see how those turn out.  Also, even with the
simple bools as we have now, it's quite easy to make all the
parameters toggle-able.

> Ard, I don't think that sticking this in udev rules makes sense.  The
> kernel has basically complete information as to what the right choice
> is, and that will change over time as the implementation gets tuned,
> and the udev rules will never get updated in sync.

Yes, I agree with this.

Jason
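The "simple bools" dispatch Jason mentions can be illustrated with a minimal sketch (the function and variable names here are made up for illustration, not Zinc's actual ones):

```c
#include <stdbool.h>

/* Hypothetical selector; in practice it would be set once from a CPU
 * feature probe at init time, and could be exposed via debugfs. */
static bool chacha20_use_simd = false;

/* Stand-ins for the generic and accelerated routines. */
static int chacha20_generic(int blocks) { return blocks; }
static int chacha20_simd(int blocks)    { return blocks * 2; }

/* Dispatch on a plain bool: trivially toggle-able, and cheap enough
 * that static keys measured no difference on the platforms listed
 * in the message above. */
static int chacha20(int blocks)
{
	if (chacha20_use_simd)
		return chacha20_simd(blocks);
	return chacha20_generic(blocks);
}
```

The cost is one well-predicted branch per call, which is why benchmarking showed it on par with patched alternatives.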
> On 5 Oct 2018, at 20:28, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hey Andy,
>
> [...]
>
>> Ard, I don't think that sticking this in udev rules makes sense.  The
>> kernel has basically complete information as to what the right choice
>> is, and that will change over time as the implementation gets tuned,
>> and the udev rules will never get updated in sync.
>
> Yes, I agree with this.

I am not referring to udev rules.  I just mean the current way that
udev autoloads modules based on CPU features.
diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..2af3442a61d3 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -862,6 +862,9 @@ config HAVE_ARCH_PREL32_RELOCATIONS
 	  architectures, and don't require runtime relocation on relocatable
 	  kernels.
 
+config HAVE_ARCH_FFP
+	bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/include/linux/ffp.h b/include/linux/ffp.h
new file mode 100644
index 000000000000..8fc3b4c9b38f
--- /dev/null
+++ b/include/linux/ffp.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __LINUX_FFP_H
+#define __LINUX_FFP_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+
+#ifdef CONFIG_HAVE_ARCH_FFP
+#include <asm/ffp.h>
+#else
+
+struct ffp {
+	void (**fn)(void);
+	void (*default_fn)(void);
+};
+
+#define DECLARE_FFP(_fn, _def)						\
+	extern typeof(_def) *_fn;					\
+	extern struct ffp const __ffp_ ## _fn
+
+#define DEFINE_FFP(_fn, _def)						\
+	typeof(_def) *_fn = &_def;					\
+	struct ffp const __ffp_ ## _fn					\
+		= { (void(**)(void))&_fn, (void(*)(void))&_def };	\
+	EXPORT_SYMBOL(__ffp_ ## _fn)
+
+static inline void ffp_set_target(const struct ffp *m, void *new_fn)
+{
+	WRITE_ONCE(*m->fn, new_fn);
+}
+
+static inline void ffp_reset_target(const struct ffp *m)
+{
+	WRITE_ONCE(*m->fn, m->default_fn);
+}
+
+#endif
+
+#define SET_FFP(_fn, _new)	ffp_set_target(&__ffp_ ## _fn, _new)
+#define RESET_FFP(_fn)		ffp_reset_target(&__ffp_ ## _fn)
+
+#endif
Add a function pointer abstraction that can be implemented by the arch
in a manner that avoids the downsides of function pointers, i.e., the
fact that they are typically located in a writable data section, and
their vulnerability to Spectre-like defects.

The FFP (or fast function pointer) is callable as a function, since
the generic incarnation is simply that.  However, because C does not
distinguish between functions and function pointers at the call site,
the architecture can instead emit it as a patchable sequence of
instructions consisting of ordinary branches.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/Kconfig        |  3 ++
 include/linux/ffp.h | 43 ++++++++++++++++++++
 2 files changed, 46 insertions(+)
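For illustration, the generic (non-arch) half of this patch can be exercised in userspace. This is a sketch under stated assumptions: WRITE_ONCE is reduced to a plain volatile store (the kernel macro does more), EXPORT_SYMBOL is dropped, and the implementations (`add_generic`, `add_tuned`) are made-up names; the macros otherwise mirror the header above.

```c
/* Userspace stand-in for the kernel's WRITE_ONCE() (simplified). */
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct ffp {
	void (**fn)(void);
	void (*default_fn)(void);
};

static inline void ffp_set_target(const struct ffp *m, void *new_fn)
{
	WRITE_ONCE(*m->fn, (void (*)(void))new_fn);
}

static inline void ffp_reset_target(const struct ffp *m)
{
	WRITE_ONCE(*m->fn, m->default_fn);
}

/* As in the header above, minus EXPORT_SYMBOL. */
#define DEFINE_FFP(_fn, _def)						\
	__typeof__(_def) *_fn = &_def;					\
	struct ffp const __ffp_ ## _fn					\
		= { (void(**)(void))&_fn, (void(*)(void))&_def }

#define SET_FFP(_fn, _new)	ffp_set_target(&__ffp_ ## _fn, _new)
#define RESET_FFP(_fn)		ffp_reset_target(&__ffp_ ## _fn)

/* Two hypothetical implementations sharing one prototype. */
static int add_generic(int a, int b) { return a + b; }
static int add_tuned(int a, int b)   { return a + b + 1000; }

/* The "fast function pointer": callable directly, e.g. add(3, 4).
 * In this generic form it really is a pointer in writable data; an
 * arch implementation would replace it with patchable branches. */
DEFINE_FFP(add, add_generic);
```

Note how the call site is just `add(3, 4)`: C's equivalence of functions and function pointers at the call site is exactly what lets an arch swap in a patched instruction sequence without touching callers.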