Message ID | 20250303132837.498938-2-dongml2@chinatelecom.cn (mailing list archive)
---|---
State | New
Series | per-function storage support
On Mon, Mar 03, 2025 at 09:28:34PM +0800, Menglong Dong wrote: > For now, the layout of cfi and fineibt is hard coded, and the padding is > fixed on 16 bytes. > > Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is > the offset of cfi, which is the same as FUNCTION_ALIGNMENT when > CALL_PADDING is enabled. And FINEIBT_INSN_OFFSET is the offset where we > put the fineibt preamble on, which is 16 for now. > > When the FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt > preamble on the last 16 bytes of the padding for better performance, which > means the fineibt preamble don't use the space that cfi uses. > > The FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and > fineibt_paranoid_start, as it is always "0x10". Note that we need to > update the offset in fineibt_caller_start and fineibt_paranoid_start if > FINEIBT_INSN_OFFSET changes. > > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> I'm confused as to what exactly you mean. Preamble will have __cfi symbol and some number of NOPs right before actual symbol like: __cfi_foo: mov $0x12345678, %reg nop nop nop ... foo: FineIBT must be at foo-16, has nothing to do with performance. This 16 can also be spelled: fineibt_preamble_size. The total size of the preamble is FUNCTION_PADDING_BYTES + CFI_CLANG*5. If you increase FUNCTION_PADDING_BYTES by another 5, which is what you want I think, then we'll have total preamble of 21 bytes; 5 bytes kCFI, 16 bytes nop. Then kCFI expects hash to be at -20, while FineIBT must be at -16. This then means there is no unambiguous hole for you to stick your meta-data thing (whatever that is). There are two options: make meta data location depend on cfi_mode, or have __apply_fineibt() rewrite kCFI to also be at -16, so that you can have -21 for your 5 bytes. I think I prefer latter. In any case, I don't think we need *_INSN_OFFSET. At most we need PREAMBLE_SIZE. Hmm?
On Tue, Mar 4, 2025 at 12:55 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Mar 03, 2025 at 09:28:34PM +0800, Menglong Dong wrote: > > For now, the layout of cfi and fineibt is hard coded, and the padding is > > fixed on 16 bytes. > > > > Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is > > the offset of cfi, which is the same as FUNCTION_ALIGNMENT when > > CALL_PADDING is enabled. And FINEIBT_INSN_OFFSET is the offset where we > > put the fineibt preamble on, which is 16 for now. > > > > When the FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt > > preamble on the last 16 bytes of the padding for better performance, which > > means the fineibt preamble don't use the space that cfi uses. > > > > The FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and > > fineibt_paranoid_start, as it is always "0x10". Note that we need to > > update the offset in fineibt_caller_start and fineibt_paranoid_start if > > FINEIBT_INSN_OFFSET changes. > > > > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> > > I'm confused as to what exactly you mean. > > Preamble will have __cfi symbol and some number of NOPs right before > actual symbol like: > > __cfi_foo: > mov $0x12345678, %reg > nop > nop > nop > ... > foo: > > FineIBT must be at foo-16, has nothing to do with performance. This 16 > can also be spelled: fineibt_preamble_size. > > The total size of the preamble is FUNCTION_PADDING_BYTES + CFI_CLANG*5. > > If you increase FUNCTION_PADDING_BYTES by another 5, which is what you > want I think, then we'll have total preamble of 21 bytes; 5 bytes kCFI, > 16 bytes nop. Hello, sorry that I forgot to add something to the changelog. In fact, I don't add extra 5-bytes anymore, which you can see in the 3rd patch. The thing is that we can't add extra 5-bytes if CFI is enabled. Without CFI, we can make the padding space any value, such as 5-bytes, and the layout will be like this: __align: nop nop nop nop nop foo: -- __align +5 However, the CFI will always make the cfi insn 16-bytes aligned. When we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be like this: __cfi_foo: nop (11) mov $0x12345678, %reg nop (16) foo: and the padding space is 32-bytes actually. So, we can just select FUNCTION_ALIGNMENT_32B instead, which makes the padding space 32-bytes too, and have the following layout: __cfi_foo: mov $0x12345678, %reg nop (27) foo: And the layout will be like this if fineibt and function metadata is both used: __cfi_foo: mov -- 5 -- cfi, not used anymore if fineibt is used nop nop nop mov -- 5 -- function metadata nop nop nop fineibt -- 16 -- fineibt foo: nopw -- 4 ...... The things that I make in this commit is to make sure that the code in arch/x86/kernel/alternative.c can find the location of cfi hash and fineibt depends on the FUNCTION_ALIGNMENT. the offset of cfi and fineibt is different now, so we need to do some adjustment here. In the beginning, I thought to make the layout like this to ensure that the offset of cfi and fineibt the same: __cfi_foo: fineibt -- 16 -- fineibt mov -- 5 -- function metadata nop(11) foo: nopw -- 4 ...... The adjustment will be easier in this mode. However, it may have impact on the performance. That way I say it doesn't impact the performance in this commit. Sorry that I didn't describe it clearly in the commit log, and I'll add the things above to the commit log too in the next version. Thanks! Menglong Dong > > Then kCFI expects hash to be at -20, while FineIBT must be at -16. 
> > This then means there is no unambiguous hole for you to stick your > meta-data thing (whatever that is). > > There are two options: make meta data location depend on cfi_mode, or > have __apply_fineibt() rewrite kCFI to also be at -16, so that you can > have -21 for your 5 bytes. > > I think I prefer latter. > > In any case, I don't think we need *_INSN_OFFSET. At most we need > PREAMBLE_SIZE. > > Hmm?
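To make the offset arithmetic in the exchange above easier to follow, here is a small illustrative sketch in C of the two layouts being discussed. The SKETCH_* names are made up for this note; the values mirror the CFI_INSN_OFFSET/FINEIBT_INSN_OFFSET idea from the patch and the 32-byte padding layout Menglong describes, and are not the kernel's actual macros.

```c
/*
 * Sketch of the layouts discussed above, assuming CALL_PADDING.
 * Offsets are measured backwards from the function entry ("foo").
 *
 * Today, FUNCTION_ALIGNMENT = 16, 16 bytes of padding:
 *
 *   foo - 16: kCFI "movl $hash, %eax" + 11 NOPs,
 *             or the 16-byte FineIBT preamble after rewriting
 *   foo +  0: function body
 *
 * Proposed, FUNCTION_ALIGNMENT = 32, 32 bytes of padding:
 *
 *   foo - 32: kCFI "movl $hash, %eax"      (CFI_INSN_OFFSET = 32)
 *   foo - 27: NOPs plus the 5-byte metadata slot
 *   foo - 16: 16-byte FineIBT preamble     (FINEIBT_INSN_OFFSET = 16)
 *   foo +  0: function body
 */
#define SKETCH_FUNCTION_ALIGNMENT	32
#define SKETCH_CFI_INSN_OFFSET		SKETCH_FUNCTION_ALIGNMENT
#define SKETCH_FINEIBT_INSN_OFFSET	16

/* The kCFI hash immediate sits one byte into the movl instruction. */
static inline void *sketch_kcfi_hash(void *func)
{
	return (char *)func - SKETCH_CFI_INSN_OFFSET + 1;
}

/* The FineIBT preamble always starts 16 bytes before the function. */
static inline void *sketch_fineibt_preamble(void *func)
{
	return (char *)func - SKETCH_FINEIBT_INSN_OFFSET;
}
```

The key point of the disagreement is visible here: the FineIBT location is pinned at foo-16, while the kCFI hash location moves with the padding size, so the two offsets stop being interchangeable once the padding grows beyond 16 bytes.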
On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote: > Hello, sorry that I forgot to add something to the changelog. In fact, > I don't add extra 5-bytes anymore, which you can see in the 3rd patch. > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without > CFI, we can make the padding space any value, such as 5-bytes, and > the layout will be like this: > > __align: > nop > nop > nop > nop > nop > foo: -- __align +5 > > However, the CFI will always make the cfi insn 16-bytes aligned. When > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be > like this: > > __cfi_foo: > nop (11) > mov $0x12345678, %reg > nop (16) > foo: > > and the padding space is 32-bytes actually. So, we can just select > FUNCTION_ALIGNMENT_32B instead, which makes the padding > space 32-bytes too, and have the following layout: > > __cfi_foo: > mov $0x12345678, %reg > nop (27) > foo: *blink*, wtf is clang smoking. I mean, you're right, this is what it is doing, but that is somewhat unexpected. Let me go look at clang source, this is insane.
On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote: > On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote: > > Hello, sorry that I forgot to add something to the changelog. In fact, > > I don't add extra 5-bytes anymore, which you can see in the 3rd patch. > > > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without > > CFI, we can make the padding space any value, such as 5-bytes, and > > the layout will be like this: > > > > __align: > > nop > > nop > > nop > > nop > > nop > > foo: -- __align +5 > > > > However, the CFI will always make the cfi insn 16-bytes aligned. When > > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be > > like this: > > > > __cfi_foo: > > nop (11) > > mov $0x12345678, %reg > > nop (16) > > foo: > > > > and the padding space is 32-bytes actually. So, we can just select > > FUNCTION_ALIGNMENT_32B instead, which makes the padding > > space 32-bytes too, and have the following layout: > > > > __cfi_foo: > > mov $0x12345678, %reg > > nop (27) > > foo: > > *blink*, wtf is clang smoking. > > I mean, you're right, this is what it is doing, but that is somewhat > unexpected. Let me go look at clang source, this is insane. Bah, this is because assemblers are stupid :/ There is no way to tell them to have foo aligned such that there are at least N bytes free before it. So what kCFI ends up having to do is align the __cfi symbol to the function alignment, and then stuff enough nops in to make the real symbol meet alignment. And the end result is utter insanity. I mean, look at this: 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4 56: 66 2e 0f 1f 84 00 00 00 00 00 cs nopw 0x0(%rax,%rax,1) 0000000000000060 <__cfi___probestub_initcall_level>: 60: 90 nop 61: 90 nop 62: 90 nop 63: 90 nop 64: 90 nop 65: 90 nop 66: 90 nop 67: 90 nop 68: 90 nop 69: 90 nop 6a: 90 nop 6b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax 0000000000000070 <__probestub_initcall_level>: 70: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4 That's 21 bytes wasted, for no reason other than that asm doesn't have a directive to say: get me a place that is M before N alignment. Because ideally the whole above thing would look like: 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4 56: 66 2e 0f 1f 84 cs nopw (%rax,%rax,1) 000000000000005b <__cfi___probestub_initcall_level>: 5b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax 0000000000000060 <__probestub_initcall_level>: 60: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4
On Tue, Mar 4, 2025 at 2:16 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote: > > On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote: > > > Hello, sorry that I forgot to add something to the changelog. In fact, > > > I don't add extra 5-bytes anymore, which you can see in the 3rd patch. > > > > > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without > > > CFI, we can make the padding space any value, such as 5-bytes, and > > > the layout will be like this: > > > > > > __align: > > > nop > > > nop > > > nop > > > nop > > > nop > > > foo: -- __align +5 > > > > > > However, the CFI will always make the cfi insn 16-bytes aligned. When > > > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be > > > like this: > > > > > > __cfi_foo: > > > nop (11) > > > mov $0x12345678, %reg > > > nop (16) > > > foo: > > > > > > and the padding space is 32-bytes actually. So, we can just select > > > FUNCTION_ALIGNMENT_32B instead, which makes the padding > > > space 32-bytes too, and have the following layout: > > > > > > __cfi_foo: > > > mov $0x12345678, %reg > > > nop (27) > > > foo: > > > > *blink*, wtf is clang smoking. > > > > I mean, you're right, this is what it is doing, but that is somewhat > > unexpected. Let me go look at clang source, this is insane. > > Bah, this is because assemblers are stupid :/ > > There is no way to tell them to have foo aligned such that there are at > least N bytes free before it. > > So what kCFI ends up having to do is align the __cfi symbol to the > function alignment, and then stuff enough nops in to make the real > symbol meet alignment. > > And the end result is utter insanity. > > I mean, look at this: > > 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4 > 56: 66 2e 0f 1f 84 00 00 00 00 00 cs nopw 0x0(%rax,%rax,1) > > 0000000000000060 <__cfi___probestub_initcall_level>: > 60: 90 nop > 61: 90 nop > 62: 90 nop > 63: 90 nop > 64: 90 nop > 65: 90 nop > 66: 90 nop > 67: 90 nop > 68: 90 nop > 69: 90 nop > 6a: 90 nop > 6b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax > > 0000000000000070 <__probestub_initcall_level>: > 70: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4 > > > That's 21 bytes wasted, for no reason other than that asm doesn't have a > directive to say: get me a place that is M before N alignment. > > Because ideally the whole above thing would look like: > > 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4 > 56: 66 2e 0f 1f 84 cs nopw (%rax,%rax,1) > > 000000000000005b <__cfi___probestub_initcall_level>: > 5b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax > > 0000000000000060 <__probestub_initcall_level>: > 60: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4 Hi, peter. Thank you for the testing, which is quite helpful to understand the whole thing. I was surprised at this too. Without CALL_PADDING, the cfi is nop(11) + mov; with CALL_PADDING, the cfi is mov + nop(11), which is weird, as it seems that we can select CALL_PADDING if CFI_CLANG to make things consistent. And I thought that it is designed to be this for some reasons :/ Hmm......so what should we do now? Accept and bear it, or do something different? 
I'm good at clang, so the solution that I can think of is how to bear it :/ According to my testing, the text size will increase: ~2.2% if we make FUNCTION_PADDING_BYTES 27 and select FUNCTION_ALIGNMENT_16B. ~3.5% if we make FUNCTION_PADDING_BYTES 27 and select FUNCTION_ALIGNMENT_32B. We don't have to select FUNCTION_ALIGNMENT_32B, so the worst case is to increase ~2.2%. What do you think? Thanks! Menglong Dong > > >
On Tue, Mar 4, 2025 at 3:47 PM Menglong Dong <menglong8.dong@gmail.com> wrote: > > On Tue, Mar 4, 2025 at 2:16 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote: > > > On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote: > > > > Hello, sorry that I forgot to add something to the changelog. In fact, > > > > I don't add extra 5-bytes anymore, which you can see in the 3rd patch. > > > > > > > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without > > > > CFI, we can make the padding space any value, such as 5-bytes, and > > > > the layout will be like this: > > > > > > > > __align: > > > > nop > > > > nop > > > > nop > > > > nop > > > > nop > > > > foo: -- __align +5 > > > > > > > > However, the CFI will always make the cfi insn 16-bytes aligned. When > > > > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be > > > > like this: > > > > > > > > __cfi_foo: > > > > nop (11) > > > > mov $0x12345678, %reg > > > > nop (16) > > > > foo: > > > > > > > > and the padding space is 32-bytes actually. So, we can just select > > > > FUNCTION_ALIGNMENT_32B instead, which makes the padding > > > > space 32-bytes too, and have the following layout: > > > > > > > > __cfi_foo: > > > > mov $0x12345678, %reg > > > > nop (27) > > > > foo: > > > > > > *blink*, wtf is clang smoking. > > > > > > I mean, you're right, this is what it is doing, but that is somewhat > > > unexpected. Let me go look at clang source, this is insane. > > > > Bah, this is because assemblers are stupid :/ > > > > There is no way to tell them to have foo aligned such that there are at > > least N bytes free before it. > > > > So what kCFI ends up having to do is align the __cfi symbol to the > > function alignment, and then stuff enough nops in to make the real > > symbol meet alignment. > > > > And the end result is utter insanity. > > > > I mean, look at this: > > > > 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4 > > 56: 66 2e 0f 1f 84 00 00 00 00 00 cs nopw 0x0(%rax,%rax,1) > > > > 0000000000000060 <__cfi___probestub_initcall_level>: > > 60: 90 nop > > 61: 90 nop > > 62: 90 nop > > 63: 90 nop > > 64: 90 nop > > 65: 90 nop > > 66: 90 nop > > 67: 90 nop > > 68: 90 nop > > 69: 90 nop > > 6a: 90 nop > > 6b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax > > > > 0000000000000070 <__probestub_initcall_level>: > > 70: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4 > > > > > > That's 21 bytes wasted, for no reason other than that asm doesn't have a > > directive to say: get me a place that is M before N alignment. > > > > Because ideally the whole above thing would look like: > > > > 50: 2e e9 00 00 00 00 cs jmp 56 <__traceiter_initcall_level+0x46> 52: R_X86_64_PLT32 __x86_return_thunk-0x4 > > 56: 66 2e 0f 1f 84 cs nopw (%rax,%rax,1) > > > > 000000000000005b <__cfi___probestub_initcall_level>: > > 5b: b8 b1 fd 66 f9 mov $0xf966fdb1,%eax > > > > 0000000000000060 <__probestub_initcall_level>: > > 60: 2e e9 00 00 00 00 cs jmp 76 <__probestub_initcall_level+0x6> 72: R_X86_64_PLT32 __x86_return_thunk-0x4 > > Hi, peter. Thank you for the testing, which is quite helpful > to understand the whole thing. > > I was surprised at this too. Without CALL_PADDING, the cfi is > nop(11) + mov; with CALL_PADDING, the cfi is mov + nop(11), > which is weird, as it seems that we can select CALL_PADDING if > CFI_CLANG to make things consistent. 
And I thought that it is > designed to be this for some reasons :/ > > Hmm......so what should we do now? Accept and bear it, > or do something different? > > I'm good at clang, so the solution that I can think of is how to *not good at* > bear it :/ > > According to my testing, the text size will increase: > > ~2.2% if we make FUNCTION_PADDING_BYTES 27 and select > FUNCTION_ALIGNMENT_16B. > > ~3.5% if we make FUNCTION_PADDING_BYTES 27 and select > FUNCTION_ALIGNMENT_32B. > > We don't have to select FUNCTION_ALIGNMENT_32B, so the > worst case is to increase ~2.2%. > > What do you think? > > Thanks! > Menglong Dong > > > > > > >
On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote: > We don't have to select FUNCTION_ALIGNMENT_32B, so the > worst case is to increase ~2.2%. > > What do you think? Well, since I don't understand what you need this for at all, I'm firmly on the side of not doing this. What actual problem is being solved with this meta data nonsense? Why is it worth blowing up our I$ footprint over. Also note, that if you're going to be explaining this, start from scratch, as I have absolutely 0 clues about BPF and such.
On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote: >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote: >> We don't have to select FUNCTION_ALIGNMENT_32B, so the >> worst case is to increase ~2.2%. >> >> What do you think? > >Well, since I don't understand what you need this for at all, I'm firmly >on the side of not doing this. > >What actual problem is being solved with this meta data nonsense? Why is >it worth blowing up our I$ footprint over. > >Also note, that if you're going to be explaining this, start from >scratch, as I have absolutely 0 clues about BPF and such. I would appreciate such information as well. The idea seems dubious on the surface.
On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote: > > On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote: > >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote: > >> We don't have to select FUNCTION_ALIGNMENT_32B, so the > >> worst case is to increase ~2.2%. > >> > >> What do you think? > > > >Well, since I don't understand what you need this for at all, I'm firmly > >on the side of not doing this. > > > >What actual problem is being solved with this meta data nonsense? Why is > >it worth blowing up our I$ footprint over. > > > >Also note, that if you're going to be explaining this, start from > >scratch, as I have absolutely 0 clues about BPF and such. > > I would appreciate such information as well. The idea seems dubious on the surface. Ok, let me explain it from the beginning. (My English is not good, but I'll try to describe it as clearly as possible :/) Many BPF program types depend on the BPF trampoline, such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT, BPF_PROG_TYPE_LSM, etc. A BPF trampoline is a bridge between a kernel (or BPF) function and a BPF program, and it acts just like the trampoline that ftrace uses. Generally speaking, it is used to hook a function, just like what ftrace does: foo: endbr nop5 --> call trampoline_foo xxxx In short, trampoline_foo can be this: trampoline_foo: prepare an array and store the args of foo to the array call fentry_bpf1 call fentry_bpf2 ...... call foo+4 (origin call) save the return value of foo call fexit_bpf1 (this bpf can get the return value of foo) call fexit_bpf2 ....... return to the caller of foo We can see that trampoline_foo can only be used for the function foo, as different kernel functions can have different BPF programs attached, different argument counts, etc. Therefore, we have to create 1000 BPF trampolines if we want to attach a BPF program to 1000 kernel functions. The creation of a BPF trampoline is expensive. According to my testing, it takes more than 1 second to create 100 BPF trampolines. What's more, it consumes more memory. If we have per-function metadata support, then we can create a global BPF trampoline, like this: trampoline_global: prepare an array and store the args of foo to the array get the metadata by the ip call metadata.fentry_bpf1 call metadata.fentry_bpf2 .... call foo+4 (origin call) save the return value of foo call metadata.fexit_bpf1 (this bpf can get the return value of foo) call metadata.fexit_bpf2 ....... return to the caller of foo (The metadata holds more information for the global trampoline than I described.) Then, we don't need to create a trampoline for every kernel function anymore. Another beneficiary can be ftrace. For now, all the kernel functions that are enabled by dynamic ftrace will be added to a filter hash if there is more than one callback. A hash lookup will then happen when the traced functions are called, which has an impact on performance, see __ftrace_ops_list_func() -> ftrace_ops_test(). With per-function metadata support, we can store in the metadata whether a callback is enabled for the kernel function, which can make the performance much better. Per-function metadata storage is a basic facility, and I think there may be other users that can take advantage of it for better performance in the future too. (Hope that I'm describing it clearly :/) Thanks! Menglong Dong
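To make the "get the metadata by the ip" step in the global trampoline concrete, here is a minimal C sketch of one way per-function metadata could be recovered from the padding in front of a function. Everything here is an assumption for illustration: the 5-byte "movl $index, %eax" encoding, the METADATA_INSN_OFFSET value, and all names are based on the padding layouts discussed earlier in the thread, not on the actual implementation in this series.

```c
#include <string.h>

/*
 * Hypothetical per-function metadata looked up by the global trampoline.
 * The fields shown are only the ones mentioned in this thread.
 */
struct func_metadata {
	void *fentry_bpf;	/* BPF progs to run on function entry */
	void *fexit_bpf;	/* BPF progs to run on function exit  */
	unsigned int nr_args;	/* argument count of the function     */
};

extern struct func_metadata *metadata_table[];

/*
 * Assumption: the index was written as "movl $index, %eax" (opcode 0xb8)
 * a fixed number of bytes before the function entry, inside the padding.
 */
#define METADATA_INSN_OFFSET	24

/* func_ip is the attached function's entry address, e.g. derived from
 * the trampoline's return address. */
static struct func_metadata *get_metadata_by_ip(unsigned long func_ip)
{
	const unsigned char *insn =
		(const unsigned char *)(func_ip - METADATA_INSN_OFFSET);
	unsigned int index;

	if (insn[0] != 0xb8)	/* no metadata recorded for this function */
		return NULL;

	memcpy(&index, insn + 1, sizeof(index));	/* the 32-bit immediate */
	return metadata_table[index];
}
```

The appeal of this scheme is that the lookup is a single load relative to the function address; the cost, as the rest of the thread discusses, is that the padding layout must leave an unambiguous slot for that movl next to the kCFI and FineIBT preambles.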
On March 4, 2025 5:19:09 PM PST, Menglong Dong <menglong8.dong@gmail.com> wrote: >On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote: >> >> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote: >> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote: >> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the >> >> worst case is to increase ~2.2%. >> >> >> >> What do you think? >> > >> >Well, since I don't understand what you need this for at all, I'm firmly >> >on the side of not doing this. >> > >> >What actual problem is being solved with this meta data nonsense? Why is >> >it worth blowing up our I$ footprint over. >> > >> >Also note, that if you're going to be explaining this, start from >> >scratch, as I have absolutely 0 clues about BPF and such. >> >> I would appreciate such information as well. The idea seems dubious on the surface. > >Ok, let me explain it from the beginning. (My English is not good, >but I'll try to describe it as clear as possible :/) > >Many BPF program types need to depend on the BPF trampoline, >such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT, >BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between >the kernel (or bpf) function and BPF program, and it acts just like the >trampoline that ftrace uses. > >Generally speaking, it is used to hook a function, just like what ftrace >do: > >foo: > endbr > nop5 --> call trampoline_foo > xxxx > >In short, the trampoline_foo can be this: > >trampoline_foo: > prepare a array and store the args of foo to the array > call fentry_bpf1 > call fentry_bpf2 > ...... > call foo+4 (origin call) > save the return value of foo > call fexit_bpf1 (this bpf can get the return value of foo) > call fexit_bpf2 > ....... > return to the caller of foo > >We can see that the trampoline_foo can be only used for >the function foo, as different kernel function can be attached >different BPF programs, and have different argument count, >etc. Therefore, we have to create 1000 BPF trampolines if >we want to attach a BPF program to 1000 kernel functions. > >The creation of the BPF trampoline is expensive. According to >my testing, It will spend more than 1 second to create 100 bpf >trampoline. What's more, it consumes more memory. > >If we have the per-function metadata supporting, then we can >create a global BPF trampoline, like this: > >trampoline_global: > prepare a array and store the args of foo to the array > get the metadata by the ip > call metadata.fentry_bpf1 > call metadata.fentry_bpf2 > .... > call foo+4 (origin call) > save the return value of foo > call metadata.fexit_bpf1 (this bpf can get the return value of foo) > call metadata.fexit_bpf2 > ....... > return to the caller of foo > >(The metadata holds more information for the global trampoline than >I described.) > >Then, we don't need to create a trampoline for every kernel function >anymore. > >Another beneficiary can be ftrace. For now, all the kernel functions that >are enabled by dynamic ftrace will be added to a filter hash if there are >more than one callbacks. And hash lookup will happen when the traced >functions are called, which has an impact on the performance, see >__ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function >metadata supporting, we can store the information that if the callback is >enabled on the kernel function to the metadata, which can make the performance >much better. 
> >The per-function metadata storage is a basic function, and I think there >may be other functions that can use it for better performance in the feature >too. > >(Hope that I'm describing it clearly :/) > >Thanks! >Menglong Dong > This is way too cursory. For one thing, you need to start by explaining why you are asking to put this *inline* with the code, which is something that normally would be avoided at all cost.
On Wed, Mar 5, 2025 at 4:30 PM H. Peter Anvin <hpa@zytor.com> wrote: > > On March 4, 2025 5:19:09 PM PST, Menglong Dong <menglong8.dong@gmail.com> wrote: > >On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote: > >> > >> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote: > >> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote: > >> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the > >> >> worst case is to increase ~2.2%. > >> >> > >> >> What do you think? > >> > > >> >Well, since I don't understand what you need this for at all, I'm firmly > >> >on the side of not doing this. > >> > > >> >What actual problem is being solved with this meta data nonsense? Why is > >> >it worth blowing up our I$ footprint over. > >> > > >> >Also note, that if you're going to be explaining this, start from > >> >scratch, as I have absolutely 0 clues about BPF and such. > >> > >> I would appreciate such information as well. The idea seems dubious on the surface. > > > >Ok, let me explain it from the beginning. (My English is not good, > >but I'll try to describe it as clear as possible :/) > > > >Many BPF program types need to depend on the BPF trampoline, > >such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT, > >BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between > >the kernel (or bpf) function and BPF program, and it acts just like the > >trampoline that ftrace uses. > > > >Generally speaking, it is used to hook a function, just like what ftrace > >do: > > > >foo: > > endbr > > nop5 --> call trampoline_foo > > xxxx > > > >In short, the trampoline_foo can be this: > > > >trampoline_foo: > > prepare a array and store the args of foo to the array > > call fentry_bpf1 > > call fentry_bpf2 > > ...... > > call foo+4 (origin call) > > save the return value of foo > > call fexit_bpf1 (this bpf can get the return value of foo) > > call fexit_bpf2 > > ....... > > return to the caller of foo > > > >We can see that the trampoline_foo can be only used for > >the function foo, as different kernel function can be attached > >different BPF programs, and have different argument count, > >etc. Therefore, we have to create 1000 BPF trampolines if > >we want to attach a BPF program to 1000 kernel functions. > > > >The creation of the BPF trampoline is expensive. According to > >my testing, It will spend more than 1 second to create 100 bpf > >trampoline. What's more, it consumes more memory. > > > >If we have the per-function metadata supporting, then we can > >create a global BPF trampoline, like this: > > > >trampoline_global: > > prepare a array and store the args of foo to the array > > get the metadata by the ip > > call metadata.fentry_bpf1 > > call metadata.fentry_bpf2 > > .... > > call foo+4 (origin call) > > save the return value of foo > > call metadata.fexit_bpf1 (this bpf can get the return value of foo) > > call metadata.fexit_bpf2 > > ....... > > return to the caller of foo > > > >(The metadata holds more information for the global trampoline than > >I described.) > > > >Then, we don't need to create a trampoline for every kernel function > >anymore. > > > >Another beneficiary can be ftrace. For now, all the kernel functions that > >are enabled by dynamic ftrace will be added to a filter hash if there are > >more than one callbacks. And hash lookup will happen when the traced > >functions are called, which has an impact on the performance, see > >__ftrace_ops_list_func() -> ftrace_ops_test(). 
With the per-function > >metadata supporting, we can store the information that if the callback is > >enabled on the kernel function to the metadata, which can make the performance > >much better. > > > >The per-function metadata storage is a basic function, and I think there > >may be other functions that can use it for better performance in the feature > >too. > > > >(Hope that I'm describing it clearly :/) > > > >Thanks! > >Menglong Dong > > > > This is way too cursory. For one thing, you need to start by explaining why you are asking to put this *inline* with the code, which is something that normally would be avoided at all cost. Hi, sorry, I don't understand the *inline* here. Do you mean why we put the metadata in the function padding?
On Wed, 5 Mar 2025 09:19:09 +0800 Menglong Dong <menglong8.dong@gmail.com> wrote: > Ok, let me explain it from the beginning. (My English is not good, > but I'll try to describe it as clear as possible :/) I always appreciate those who struggle with English having these conversations. Thank you for that, as I know I am horrible in speaking any other language. (I can get by in German, but even Germans tell me to switch back to English ;-) > > Many BPF program types need to depend on the BPF trampoline, > such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT, > BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between > the kernel (or bpf) function and BPF program, and it acts just like the > trampoline that ftrace uses. > > Generally speaking, it is used to hook a function, just like what ftrace > do: > > foo: > endbr > nop5 --> call trampoline_foo > xxxx > > In short, the trampoline_foo can be this: > > trampoline_foo: > prepare a array and store the args of foo to the array > call fentry_bpf1 > call fentry_bpf2 > ...... > call foo+4 (origin call) Note, I brought up this issue when I first heard about how BPF does this. The calling of the original function from the trampoline. I said this will cause issues, and is only good for a few functions. Once you start doing this for 1000s of functions, it's going to be a nightmare. Looks like you are now in the nightmare phase. My argument was once you have this case, you need to switch over to the kretprobe / function graph way of doing things, which is to have a shadow stack and hijack the return address. Yes, that has slightly more overhead, but it's better than having to add all theses hacks. And function graph has been updated so that it can do this for other users. fprobes uses it now, and bpf can too. > save the return value of foo > call fexit_bpf1 (this bpf can get the return value of foo) > call fexit_bpf2 > ....... > return to the caller of foo > > We can see that the trampoline_foo can be only used for > the function foo, as different kernel function can be attached > different BPF programs, and have different argument count, > etc. Therefore, we have to create 1000 BPF trampolines if > we want to attach a BPF program to 1000 kernel functions. > > The creation of the BPF trampoline is expensive. According to > my testing, It will spend more than 1 second to create 100 bpf > trampoline. What's more, it consumes more memory. > > If we have the per-function metadata supporting, then we can > create a global BPF trampoline, like this: > > trampoline_global: > prepare a array and store the args of foo to the array > get the metadata by the ip > call metadata.fentry_bpf1 > call metadata.fentry_bpf2 > .... > call foo+4 (origin call) So if this is a global trampoline, wouldn't this "call foo" need to be an indirect call? It can't be a direct call, otherwise you need a separate trampoline for that. This means you need to mitigate for spectre here, and you just lost the performance gain from not using function graph. > save the return value of foo > call metadata.fexit_bpf1 (this bpf can get the return value of foo) > call metadata.fexit_bpf2 > ....... > return to the caller of foo > > (The metadata holds more information for the global trampoline than > I described.) > > Then, we don't need to create a trampoline for every kernel function > anymore. > > Another beneficiary can be ftrace. For now, all the kernel functions that > are enabled by dynamic ftrace will be added to a filter hash if there are > more than one callbacks. 
And hash lookup will happen when the traced > functions are called, which has an impact on the performance, see > __ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function > metadata supporting, we can store the information that if the callback is > enabled on the kernel function to the metadata, which can make the performance > much better. Let me say now that ftrace will not use this. Looks like too much work for little gain. The only time this impacts ftrace is when there's two different callbacks tracing the same function, and it only impacts that function. All other functions being traced still call the appropriate trampoline for the callback. -- Steve > > The per-function metadata storage is a basic function, and I think there > may be other functions that can use it for better performance in the feature > too. > > (Hope that I'm describing it clearly :/)
On Wed, Mar 5, 2025 at 11:02 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > On Wed, 5 Mar 2025 09:19:09 +0800 > Menglong Dong <menglong8.dong@gmail.com> wrote: > > > Ok, let me explain it from the beginning. (My English is not good, > > but I'll try to describe it as clear as possible :/) > > I always appreciate those who struggle with English having these > conversations. Thank you for that, as I know I am horrible in speaking any > other language. (I can get by in German, but even Germans tell me to switch > back to English ;-) > > > > > Many BPF program types need to depend on the BPF trampoline, > > such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT, > > BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between > > the kernel (or bpf) function and BPF program, and it acts just like the > > trampoline that ftrace uses. > > > > Generally speaking, it is used to hook a function, just like what ftrace > > do: > > > > foo: > > endbr > > nop5 --> call trampoline_foo > > xxxx > > > > In short, the trampoline_foo can be this: > > > > trampoline_foo: > > prepare a array and store the args of foo to the array > > call fentry_bpf1 > > call fentry_bpf2 > > ...... > > call foo+4 (origin call) > > Note, I brought up this issue when I first heard about how BPF does this. > The calling of the original function from the trampoline. I said this will > cause issues, and is only good for a few functions. Once you start doing > this for 1000s of functions, it's going to be a nightmare. > > Looks like you are now in the nightmare phase. > > My argument was once you have this case, you need to switch over to the > kretprobe / function graph way of doing things, which is to have a shadow > stack and hijack the return address. Yes, that has slightly more overhead, > but it's better than having to add all theses hacks. > > And function graph has been updated so that it can do this for other users. > fprobes uses it now, and bpf can too. Yeah, I heard that the kretprobe is able to get the function arguments too, which benefits from the function graph. Besides the overhead, another problem is that we can't do direct memory access if we use the BPF based on kretprobe. > > > save the return value of foo > > call fexit_bpf1 (this bpf can get the return value of foo) > > call fexit_bpf2 > > ....... > > return to the caller of foo > > > > We can see that the trampoline_foo can be only used for > > the function foo, as different kernel function can be attached > > different BPF programs, and have different argument count, > > etc. Therefore, we have to create 1000 BPF trampolines if > > we want to attach a BPF program to 1000 kernel functions. > > > > The creation of the BPF trampoline is expensive. According to > > my testing, It will spend more than 1 second to create 100 bpf > > trampoline. What's more, it consumes more memory. > > > > If we have the per-function metadata supporting, then we can > > create a global BPF trampoline, like this: > > > > trampoline_global: > > prepare a array and store the args of foo to the array > > get the metadata by the ip > > call metadata.fentry_bpf1 > > call metadata.fentry_bpf2 > > .... > > call foo+4 (origin call) > > So if this is a global trampoline, wouldn't this "call foo" need to be an > indirect call? It can't be a direct call, otherwise you need a separate > trampoline for that. > > This means you need to mitigate for spectre here, and you just lost the > performance gain from not using function graph. Yeah, you are right, this is an indirect call here. 
I haven't done any research on mitigating for spectre yet, and maybe we can convert it into a direct call somehow? Such as, we maintain a trampoline_table: some preparation jmp +%eax (eax is the index of the target function) call foo1 + 4 return call foo2 + 4 return call foo3 + 4 return (Hmm......Is the jmp above also an indirect call?) And in the trampoline_global, we can call it like this: mov metadata.index %eax call trampoline_table I'm not sure if it works. However, indirect call is also used in function graph, so we still have better performance. Isn't it? Let me have a look at the code of the function graph first :/ Thanks! Menglong Dong > > > > save the return value of foo > > call metadata.fexit_bpf1 (this bpf can get the return value of foo) > > call metadata.fexit_bpf2 > > ....... > > return to the caller of foo > > > > (The metadata holds more information for the global trampoline than > > I described.) > > > > Then, we don't need to create a trampoline for every kernel function > > anymore. > > > > Another beneficiary can be ftrace. For now, all the kernel functions that > > are enabled by dynamic ftrace will be added to a filter hash if there are > > more than one callbacks. And hash lookup will happen when the traced > > functions are called, which has an impact on the performance, see > > __ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function > > metadata supporting, we can store the information that if the callback is > > enabled on the kernel function to the metadata, which can make the performance > > much better. > > Let me say now that ftrace will not use this. Looks like too much work for > little gain. The only time this impacts ftrace is when there's two > different callbacks tracing the same function, and it only impacts that > function. All other functions being traced still call the appropriate > trampoline for the callback. > > -- Steve > > > > > The per-function metadata storage is a basic function, and I think there > > may be other functions that can use it for better performance in the feature > > too. > > > > (Hope that I'm describing it clearly :/) >
On Wed, Mar 5, 2025 at 6:59 PM Menglong Dong <menglong8.dong@gmail.com> wrote: > > I'm not sure if it works. However, indirect call is also used > in function graph, so we still have better performance. Isn't it? > > Let me have a look at the code of the function graph first :/ Menglong, Function graph infra isn't going to help. "call foo" isn't a problem either. But we have to step back. per-function metadata is an optimization and feels like we're doing a premature optimization here without collecting performance numbers first. Let's implement multi-fentry with generic get_metadata_by_ip() first. get_metadata_by_ip() will be a hashtable in such a case and then we can compare its performance when it's implemented as a direct lookup from ip-4 (this patch) vs hash table (that does 'ip' to 'metadata' lookup). If/when we decide to do this per-function metadata we can also punt to generic hashtable for cfi, IBT, FineIBT, etc configs. When mitigations are enabled the performance suffers anyway, so hashtable lookup vs direct ip-4 lookup won't make much difference. So we can enable per-function metadata only on non-mitigation configs when FUNCTION_ALIGNMENT=16. There will be some number of bytes available before every function and if we can tell gcc/llvm to leave at least 5 bytes there the growth of vmlinux .text will be within the noise. So let's figure out the design of multi-fentry first with a hashtable for metadata and decide next steps afterwards.
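For comparison with the "read it from ip - offset" approach, here is a rough C sketch of the hashtable variant of get_metadata_by_ip() that Alexei suggests benchmarking first. The table layout, hash function, and names are illustrative only; an in-kernel version would more likely use rhashtable or <linux/hashtable.h>, with proper locking or RCU for updates.

```c
#include <stddef.h>

#define MD_HASH_BITS	10
#define MD_HASH_SIZE	(1u << MD_HASH_BITS)

struct func_metadata {
	struct func_metadata *next;	/* hash chain */
	unsigned long func_ip;		/* function entry address (the key) */
	void *priv;			/* e.g. the attached BPF programs */
};

static struct func_metadata *md_table[MD_HASH_SIZE];

static unsigned int md_hash(unsigned long ip)
{
	/*
	 * Function entries are at least 16-byte aligned here, so drop the
	 * low bits before folding into the table index (golden-ratio hash).
	 */
	return (unsigned int)((ip >> 4) * 0x61C88647u) >> (32 - MD_HASH_BITS);
}

/* Hashtable lookup: the baseline to compare against the ip-4 direct read. */
static struct func_metadata *get_metadata_by_ip(unsigned long ip)
{
	struct func_metadata *md;

	for (md = md_table[md_hash(ip)]; md; md = md->next)
		if (md->func_ip == ip)
			return md;
	return NULL;
}

static void add_metadata(struct func_metadata *md)
{
	unsigned int bucket = md_hash(md->func_ip);

	md->next = md_table[bucket];
	md_table[bucket] = md;
}
```

The comparison Alexei asks for is essentially between this chained lookup (a hash, a load, and a short pointer chase) and a single load at a fixed offset before the function, measured on the multi-fentry workload.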
On Thu, Mar 6, 2025 at 11:39 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Mar 5, 2025 at 6:59 PM Menglong Dong <menglong8.dong@gmail.com> wrote: > > > > I'm not sure if it works. However, indirect call is also used > > in function graph, so we still have better performance. Isn't it? > > > > Let me have a look at the code of the function graph first :/ > > Menglong, > > Function graph infra isn't going to help. > "call foo" isn't a problem either. > > But we have to step back. > per-function metadata is an optimization and feels like > we're doing a premature optimization here without collecting > performance numbers first. > > Let's implement multi-fentry with generic get_metadata_by_ip() first. > get_metadata_by_ip() will be a hashtable in such a case and > then we can compare its performance when it's implemented as > a direct lookup from ip-4 (this patch) vs hash table > (that does 'ip' to 'metadata' lookup). Hi, Alexei. You are right, I should do such a performance comparison. > > If/when we decide to do this per-function metadata we can also > punt to generic hashtable for cfi, IBT, FineIBT, etc configs. > When mitigations are enabled the performance suffers anyway, > so hashtable lookup vs direct ip-4 lookup won't make much difference. > So we can enable per-function metadata only on non-mitigation configs > when FUNCTION_ALIGNMENT=16. > There will be some number of bytes available before every function > and if we can tell gcc/llvm to leave at least 5 bytes there > the growth of vmlinux .text will be within the noise. Sounds great! It's so difficult to make the per-function metadata work in all the cases. Especially, we can't implement it on arm64 if CFI_CLANG is enabled. And falling back to the hash table makes it much easier in these cases. > > So let's figure out the design of multi-fentry first with a hashtable > for metadata and decide next steps afterwards. Ok, I'll develop a version for fentry multi-link with both the hashtable and the function metadata, and do some performance testing. Thank you for your advice :/ Thanks! Menglong Dong
diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h index 2f6a01f098b5..04525f2f6bf2 100644 --- a/arch/x86/include/asm/cfi.h +++ b/arch/x86/include/asm/cfi.h @@ -108,6 +108,14 @@ extern bhi_thunk __bhi_args_end[]; struct pt_regs; +#ifdef CONFIG_CALL_PADDING +#define FINEIBT_INSN_OFFSET 16 +#define CFI_INSN_OFFSET CONFIG_FUNCTION_ALIGNMENT +#else +#define FINEIBT_INSN_OFFSET 0 +#define CFI_INSN_OFFSET 5 +#endif + #ifdef CONFIG_CFI_CLANG enum bug_trap_type handle_cfi_failure(struct pt_regs *regs); #define __bpfcall @@ -118,11 +126,8 @@ static inline int cfi_get_offset(void) { switch (cfi_mode) { case CFI_FINEIBT: - return 16; case CFI_KCFI: - if (IS_ENABLED(CONFIG_CALL_PADDING)) - return 16; - return 5; + return CFI_INSN_OFFSET; default: return 0; } diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 32e4b801db99..0088d2313f33 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -917,7 +917,7 @@ void __init_or_module noinline apply_seal_endbr(s32 *start, s32 *end) poison_endbr(addr); if (IS_ENABLED(CONFIG_FINEIBT)) - poison_cfi(addr - 16); + poison_cfi(addr); } } @@ -980,12 +980,13 @@ u32 cfi_get_func_hash(void *func) { u32 hash; - func -= cfi_get_offset(); switch (cfi_mode) { case CFI_FINEIBT: + func -= FINEIBT_INSN_OFFSET; func += 7; break; case CFI_KCFI: + func -= CFI_INSN_OFFSET; func += 1; break; default: @@ -1372,7 +1373,7 @@ static int cfi_rewrite_preamble(s32 *start, s32 *end) * have determined there are no indirect calls to it and we * don't need no CFI either. */ - if (!is_endbr(addr + 16)) + if (!is_endbr(addr + CFI_INSN_OFFSET)) continue; hash = decode_preamble_hash(addr, &arity); @@ -1380,6 +1381,7 @@ static int cfi_rewrite_preamble(s32 *start, s32 *end) addr, addr, 5, addr)) return -EINVAL; + addr += (CFI_INSN_OFFSET - FINEIBT_INSN_OFFSET); text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size); WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678); text_poke_early(addr + fineibt_preamble_hash, &hash, 4); @@ -1402,10 +1404,10 @@ static void cfi_rewrite_endbr(s32 *start, s32 *end) for (s = start; s < end; s++) { void *addr = (void *)s + *s; - if (!exact_endbr(addr + 16)) + if (!exact_endbr(addr + CFI_INSN_OFFSET)) continue; - poison_endbr(addr + 16); + poison_endbr(addr + CFI_INSN_OFFSET); } } @@ -1543,12 +1545,12 @@ static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline, return; case CFI_FINEIBT: - /* place the FineIBT preamble at func()-16 */ + /* place the FineIBT preamble at func()-FINEIBT_INSN_OFFSET */ ret = cfi_rewrite_preamble(start_cfi, end_cfi); if (ret) goto err; - /* rewrite the callers to target func()-16 */ + /* rewrite the callers to target func()-FINEIBT_INSN_OFFSET */ ret = cfi_rewrite_callers(start_retpoline, end_retpoline); if (ret) goto err; @@ -1588,6 +1590,7 @@ static void poison_cfi(void *addr) */ switch (cfi_mode) { case CFI_FINEIBT: + addr -= FINEIBT_INSN_OFFSET; /* * FineIBT prefix should start with an ENDBR. */ @@ -1607,6 +1610,7 @@ static void poison_cfi(void *addr) break; case CFI_KCFI: + addr -= CFI_INSN_OFFSET; /* * kCFI prefix should start with a valid hash. 
*/ diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 72776dcb75aa..ee86a5df5ffb 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -415,6 +415,12 @@ static int emit_call(u8 **prog, void *func, void *ip); static void emit_fineibt(u8 **pprog, u8 *ip, u32 hash, int arity) { u8 *prog = *pprog; +#ifdef CONFIG_CALL_PADDING + int i; + + for (i = 0; i < CFI_INSN_OFFSET - 16; i++) + EMIT1(0x90); +#endif EMIT_ENDBR(); EMIT3_off32(0x41, 0x81, 0xea, hash); /* subl $hash, %r10d */ @@ -432,20 +438,14 @@ static void emit_fineibt(u8 **pprog, u8 *ip, u32 hash, int arity) static void emit_kcfi(u8 **pprog, u32 hash) { u8 *prog = *pprog; +#ifdef CONFIG_CALL_PADDING + int i; +#endif EMIT1_off32(0xb8, hash); /* movl $hash, %eax */ #ifdef CONFIG_CALL_PADDING - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); - EMIT1(0x90); + for (i = 0; i < CFI_INSN_OFFSET - 5; i++) + EMIT1(0x90); #endif EMIT_ENDBR();
For now, the layout of cfi and fineibt is hard-coded, and the padding is fixed at 16 bytes. Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is the offset of the cfi preamble, which is the same as FUNCTION_ALIGNMENT when CALL_PADDING is enabled. FINEIBT_INSN_OFFSET is the offset at which we put the fineibt preamble, which is 16 for now. When FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt preamble in the last 16 bytes of the padding for better performance, which means the fineibt preamble doesn't use the space that cfi uses. FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and fineibt_paranoid_start, as it is always "0x10". Note that we need to update the offset in fineibt_caller_start and fineibt_paranoid_start if FINEIBT_INSN_OFFSET changes. Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> --- v4: - rebase onto the latest tip/x86/core, which has some fineibt updates --- arch/x86/include/asm/cfi.h | 13 +++++++++---- arch/x86/kernel/alternative.c | 18 +++++++++++------- arch/x86/net/bpf_jit_comp.c | 22 +++++++++++----------- 3 files changed, 31 insertions(+), 22 deletions(-)
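As a reading aid for the diff and changelog above: with CALL_PADDING and, hypothetically, CONFIG_FUNCTION_ALIGNMENT set to 32, the new macros resolve as sketched below. This only restates the patch's arithmetic under that assumption; it is not additional kernel code, and the helper name is made up.

```c
/* Values from the patch, assuming CONFIG_FUNCTION_ALIGNMENT = 32 with CALL_PADDING. */
#define CFI_INSN_OFFSET		32	/* == CONFIG_FUNCTION_ALIGNMENT */
#define FINEIBT_INSN_OFFSET	16

/*
 * cfi_rewrite_preamble() starts from addr = func - CFI_INSN_OFFSET and then
 * does addr += (CFI_INSN_OFFSET - FINEIBT_INSN_OFFSET), i.e. addr = func - 16,
 * before poking in the FineIBT preamble, so the callers that are rewritten to
 * target func - 0x10 keep working unchanged.
 */
static inline void *fineibt_preamble_addr(void *cfi_addr)
{
	/* cfi_addr is func - CFI_INSN_OFFSET, as in cfi_rewrite_preamble() */
	return (char *)cfi_addr + (CFI_INSN_OFFSET - FINEIBT_INSN_OFFSET);
}
```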