diff mbox series

[v4,1/4] x86/ibt: factor out cfi and fineibt offset

Message ID 20250303132837.498938-2-dongml2@chinatelecom.cn (mailing list archive)
State New
Headers show
Series per-function storage support | expand

Commit Message

Menglong Dong March 3, 2025, 1:28 p.m. UTC
For now, the layout of cfi and fineibt is hard coded, and the padding is
fixed at 16 bytes.

Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is
the offset of the cfi hash, which is the same as FUNCTION_ALIGNMENT when
CALL_PADDING is enabled. FINEIBT_INSN_OFFSET is the offset where the
fineibt preamble is placed, which is 16 for now.

When FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt preamble
in the last 16 bytes of the padding for better performance, which means
the fineibt preamble doesn't use the space that cfi uses.

FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and
fineibt_paranoid_start, as the offset there is always "0x10". Note that
we need to update the offset in fineibt_caller_start and
fineibt_paranoid_start if FINEIBT_INSN_OFFSET changes.
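
For illustration, a rough sketch of where the two offsets point with
CALL_PADDING enabled and a 32-byte FUNCTION_ALIGNMENT:

__cfi_foo:                 <- foo - CFI_INSN_OFFSET (32), kcfi hash
  mov $0x12345678, %reg
  nop (11)
                           <- foo - FINEIBT_INSN_OFFSET (16)
  fineibt preamble (16)
foo: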

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
---
v4:
- rebase to the newest tip/x86/core, which has some fineibt updates
---
 arch/x86/include/asm/cfi.h    | 13 +++++++++----
 arch/x86/kernel/alternative.c | 18 +++++++++++-------
 arch/x86/net/bpf_jit_comp.c   | 22 +++++++++++-----------
 3 files changed, 31 insertions(+), 22 deletions(-)

Comments

Peter Zijlstra March 3, 2025, 4:54 p.m. UTC | #1
On Mon, Mar 03, 2025 at 09:28:34PM +0800, Menglong Dong wrote:
> For now, the layout of cfi and fineibt is hard coded, and the padding is
> fixed on 16 bytes.
> 
> Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is
> the offset of cfi, which is the same as FUNCTION_ALIGNMENT when
> CALL_PADDING is enabled. And FINEIBT_INSN_OFFSET is the offset where we
> put the fineibt preamble on, which is 16 for now.
> 
> When the FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt
> preamble on the last 16 bytes of the padding for better performance, which
> means the fineibt preamble don't use the space that cfi uses.
> 
> The FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and
> fineibt_paranoid_start, as it is always "0x10". Note that we need to
> update the offset in fineibt_caller_start and fineibt_paranoid_start if
> FINEIBT_INSN_OFFSET changes.
> 
> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>

I'm confused as to what exactly you mean.

Preamble will have __cfi symbol and some number of NOPs right before
actual symbol like:

__cfi_foo:
  mov $0x12345678, %reg
  nop
  nop
  nop
  ...
foo:

FineIBT must be at foo-16, has nothing to do with performance. This 16
can also be spelled: fineibt_preamble_size.

The total size of the preamble is FUNCTION_PADDING_BYTES + CFI_CLANG*5.

If you increase FUNCTION_PADDING_BYTES by another 5, which is what you
want I think, then we'll have total preamble of 21 bytes; 5 bytes kCFI,
16 bytes nop.

Then kCFI expects hash to be at -20, while FineIBT must be at -16.
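
Spelled out as a sketch (offsets relative to foo, assuming
FUNCTION_PADDING_BYTES is bumped by 5 as above):

__cfi_foo:                              /* foo - 21 */
  b8 xx xx xx xx  mov $hash, %eax       /* kCFI hash immediate at foo - 20 */
  nop (16)                              /* FineIBT preamble gets written at foo - 16 */
foo: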

This then means there is no unambiguous hole for you to stick your
meta-data thing (whatever that is).

There are two options: make meta data location depend on cfi_mode, or
have __apply_fineibt() rewrite kCFI to also be at -16, so that you can
have -21 for your 5 bytes.

I think I prefer latter.

In any case, I don't think we need *_INSN_OFFSET. At most we need
PREAMBLE_SIZE.

Hmm?
Menglong Dong March 4, 2025, 1:10 a.m. UTC | #2
On Tue, Mar 4, 2025 at 12:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 03, 2025 at 09:28:34PM +0800, Menglong Dong wrote:
> > For now, the layout of cfi and fineibt is hard coded, and the padding is
> > fixed on 16 bytes.
> >
> > Factor out FINEIBT_INSN_OFFSET and CFI_INSN_OFFSET. CFI_INSN_OFFSET is
> > the offset of cfi, which is the same as FUNCTION_ALIGNMENT when
> > CALL_PADDING is enabled. And FINEIBT_INSN_OFFSET is the offset where we
> > put the fineibt preamble on, which is 16 for now.
> >
> > When the FUNCTION_ALIGNMENT is bigger than 16, we place the fineibt
> > preamble on the last 16 bytes of the padding for better performance, which
> > means the fineibt preamble don't use the space that cfi uses.
> >
> > The FINEIBT_INSN_OFFSET is not used in fineibt_caller_start and
> > fineibt_paranoid_start, as it is always "0x10". Note that we need to
> > update the offset in fineibt_caller_start and fineibt_paranoid_start if
> > FINEIBT_INSN_OFFSET changes.
> >
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
>
> I'm confused as to what exactly you mean.
>
> Preamble will have __cfi symbol and some number of NOPs right before
> actual symbol like:
>
> __cfi_foo:
>   mov $0x12345678, %reg
>   nop
>   nop
>   nop
>   ...
> foo:
>
> FineIBT must be at foo-16, has nothing to do with performance. This 16
> can also be spelled: fineibt_preamble_size.
>
> The total size of the preamble is FUNCTION_PADDING_BYTES + CFI_CLANG*5.
>
> If you increase FUNCTION_PADDING_BYTES by another 5, which is what you
> want I think, then we'll have total preamble of 21 bytes; 5 bytes kCFI,
> 16 bytes nop.

Hello, sorry that I forgot to add something to the changelog. In fact,
I don't add an extra 5 bytes anymore, which you can see in the 3rd patch.

The thing is that we can't add an extra 5 bytes if CFI is enabled. Without
CFI, we can make the padding space any size, such as 5 bytes, and
the layout will be like this:

__align:
  nop
  nop
  nop
  nop
  nop
foo: -- __align +5

However, CFI will always make the cfi insn 16-byte aligned. When
we set FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
like this:

__cfi_foo:
  nop (11)
  mov $0x12345678, %reg
  nop (16)
foo:

and the padding space is actually 32 bytes. So, we can just select
FUNCTION_ALIGNMENT_32B instead, which makes the padding
space 32 bytes too, and get the following layout:

__cfi_foo:
  mov $0x12345678, %reg
  nop (27)
foo:
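
(As background, and as an assumption about the build plumbing rather
than something stated in this thread, the padding itself comes from the
-fpatchable-function-entry option that CONFIG_CALL_PADDING wires up,
roughly:

  -fpatchable-function-entry=$(CONFIG_FUNCTION_PADDING_BYTES),$(CONFIG_FUNCTION_PADDING_BYTES)

i.e. all of the requested NOPs are placed before the symbol.)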

And the layout will be like this if fineibt and the function metadata are
both used:

__cfi_foo:
        mov     --      5       -- cfi, not used anymore if fineibt is used
        nop
        nop
        nop
        mov     --      5       -- function metadata
        nop
        nop
        nop
        fineibt --      16      -- fineibt
foo:
        nopw    --      4
        ......

What I do in this commit is make sure that the code in
arch/x86/kernel/alternative.c can find the location of the cfi hash and
the fineibt preamble depending on FUNCTION_ALIGNMENT. The offsets of
cfi and fineibt are different now, so we need to do
some adjustment here.
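
To make the adjustment concrete, here is a minimal sketch of the
address math behind the two macros in this patch (CALL_PADDING
enabled; addr is the function entry point, i.e. foo):

	void *kcfi_insn    = addr - CFI_INSN_OFFSET;     /* mov $hash, %reg */
	void *fineibt_insn = addr - FINEIBT_INSN_OFFSET; /* always foo - 16 */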

In the beginning, I thought about making the layout like this to keep
the offsets of cfi and fineibt the same:

__cfi_foo:
        fineibt  --   16  --  fineibt
        mov    --    5    -- function metadata
        nop(11)
foo:
        nopw    --      4
        ......

The adjustment would be easier in this mode. However, it may have an
impact on performance. That's why I say it doesn't impact the
performance in this commit.

Sorry that I didn't describe it clearly in the commit log; I'll
add the information above to the commit log in the next version.

Thanks!
Menglong Dong

>
> Then kCFI expects hash to be at -20, while FineIBT must be at -16.
>
> This then means there is no unambiguous hole for you to stick your
> meta-data thing (whatever that is).
>
> There are two options: make meta data location depend on cfi_mode, or
> have __apply_fineibt() rewrite kCFI to also be at -16, so that you can
> have -21 for your 5 bytes.
>
> I think I prefer latter.
>
> In any case, I don't think we need *_INSN_OFFSET. At most we need
> PREAMBLE_SIZE.
>
> Hmm?
Peter Zijlstra March 4, 2025, 5:38 a.m. UTC | #3
On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> Hello, sorry that I forgot to add something to the changelog. In fact,
> I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
> 
> The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> CFI, we can make the padding space any value, such as 5-bytes, and
> the layout will be like this:
> 
> __align:
>   nop
>   nop
>   nop
>   nop
>   nop
> foo: -- __align +5
> 
> However, the CFI will always make the cfi insn 16-bytes aligned. When
> we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> like this:
> 
> __cfi_foo:
>   nop (11)
>   mov $0x12345678, %reg
>   nop (16)
> foo:
> 
> and the padding space is 32-bytes actually. So, we can just select
> FUNCTION_ALIGNMENT_32B instead, which makes the padding
> space 32-bytes too, and have the following layout:
> 
> __cfi_foo:
>   mov $0x12345678, %reg
>   nop (27)
> foo:

*blink*, wtf is clang smoking.

I mean, you're right, this is what it is doing, but that is somewhat
unexpected. Let me go look at clang source, this is insane.
Peter Zijlstra March 4, 2025, 6:16 a.m. UTC | #4
On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> > Hello, sorry that I forgot to add something to the changelog. In fact,
> > I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
> > 
> > The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> > CFI, we can make the padding space any value, such as 5-bytes, and
> > the layout will be like this:
> > 
> > __align:
> >   nop
> >   nop
> >   nop
> >   nop
> >   nop
> > foo: -- __align +5
> > 
> > However, the CFI will always make the cfi insn 16-bytes aligned. When
> > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> > like this:
> > 
> > __cfi_foo:
> >   nop (11)
> >   mov $0x12345678, %reg
> >   nop (16)
> > foo:
> > 
> > and the padding space is 32-bytes actually. So, we can just select
> > FUNCTION_ALIGNMENT_32B instead, which makes the padding
> > space 32-bytes too, and have the following layout:
> > 
> > __cfi_foo:
> >   mov $0x12345678, %reg
> >   nop (27)
> > foo:
> 
> *blink*, wtf is clang smoking.
> 
> I mean, you're right, this is what it is doing, but that is somewhat
> unexpected. Let me go look at clang source, this is insane.

Bah, this is because assemblers are stupid :/

There is no way to tell them to have foo aligned such that there are at
least N bytes free before it.

So what kCFI ends up having to do is align the __cfi symbol to the
function alignment, and then stuff enough nops in to make the real
symbol meet alignment.

And the end result is utter insanity.

I mean, look at this:

      50:       2e e9 00 00 00 00       cs jmp 56 <__traceiter_initcall_level+0x46>     52: R_X86_64_PLT32      __x86_return_thunk-0x4
      56:       66 2e 0f 1f 84 00 00 00 00 00   cs nopw 0x0(%rax,%rax,1)

0000000000000060 <__cfi___probestub_initcall_level>:
      60:       90                      nop
      61:       90                      nop
      62:       90                      nop
      63:       90                      nop
      64:       90                      nop
      65:       90                      nop
      66:       90                      nop
      67:       90                      nop
      68:       90                      nop
      69:       90                      nop
      6a:       90                      nop
      6b:       b8 b1 fd 66 f9          mov    $0xf966fdb1,%eax

0000000000000070 <__probestub_initcall_level>:
      70:       2e e9 00 00 00 00       cs jmp 76 <__probestub_initcall_level+0x6>      72: R_X86_64_PLT32      __x86_return_thunk-0x4


That's 21 bytes wasted, for no reason other than that asm doesn't have a
directive to say: get me a place that is M before N alignment.

Because ideally the whole above thing would look like:

      50:       2e e9 00 00 00 00       cs jmp 56 <__traceiter_initcall_level+0x46>     52: R_X86_64_PLT32      __x86_return_thunk-0x4
      56:       66 2e 0f 1f 84 		cs nopw (%rax,%rax,1)

000000000000005b <__cfi___probestub_initcall_level>:
      5b:       b8 b1 fd 66 f9          mov    $0xf966fdb1,%eax

0000000000000060 <__probestub_initcall_level>:
      60:       2e e9 00 00 00 00       cs jmp 76 <__probestub_initcall_level+0x6>      72: R_X86_64_PLT32      __x86_return_thunk-0x4
Menglong Dong March 4, 2025, 7:47 a.m. UTC | #5
On Tue, Mar 4, 2025 at 2:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote:
> > On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> > > Hello, sorry that I forgot to add something to the changelog. In fact,
> > > I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
> > >
> > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> > > CFI, we can make the padding space any value, such as 5-bytes, and
> > > the layout will be like this:
> > >
> > > __align:
> > >   nop
> > >   nop
> > >   nop
> > >   nop
> > >   nop
> > > foo: -- __align +5
> > >
> > > However, the CFI will always make the cfi insn 16-bytes aligned. When
> > > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> > > like this:
> > >
> > > __cfi_foo:
> > >   nop (11)
> > >   mov $0x12345678, %reg
> > >   nop (16)
> > > foo:
> > >
> > > and the padding space is 32-bytes actually. So, we can just select
> > > FUNCTION_ALIGNMENT_32B instead, which makes the padding
> > > space 32-bytes too, and have the following layout:
> > >
> > > __cfi_foo:
> > >   mov $0x12345678, %reg
> > >   nop (27)
> > > foo:
> >
> > *blink*, wtf is clang smoking.
> >
> > I mean, you're right, this is what it is doing, but that is somewhat
> > unexpected. Let me go look at clang source, this is insane.
>
> Bah, this is because assemblers are stupid :/
>
> There is no way to tell them to have foo aligned such that there are at
> least N bytes free before it.
>
> So what kCFI ends up having to do is align the __cfi symbol to the
> function alignment, and then stuff enough nops in to make the real
> symbol meet alignment.
>
> And the end result is utter insanity.
>
> I mean, look at this:
>
>       50:       2e e9 00 00 00 00       cs jmp 56 <__traceiter_initcall_level+0x46>     52: R_X86_64_PLT32      __x86_return_thunk-0x4
>       56:       66 2e 0f 1f 84 00 00 00 00 00   cs nopw 0x0(%rax,%rax,1)
>
> 0000000000000060 <__cfi___probestub_initcall_level>:
>       60:       90                      nop
>       61:       90                      nop
>       62:       90                      nop
>       63:       90                      nop
>       64:       90                      nop
>       65:       90                      nop
>       66:       90                      nop
>       67:       90                      nop
>       68:       90                      nop
>       69:       90                      nop
>       6a:       90                      nop
>       6b:       b8 b1 fd 66 f9          mov    $0xf966fdb1,%eax
>
> 0000000000000070 <__probestub_initcall_level>:
>       70:       2e e9 00 00 00 00       cs jmp 76 <__probestub_initcall_level+0x6>      72: R_X86_64_PLT32      __x86_return_thunk-0x4
>
>
> That's 21 bytes wasted, for no reason other than that asm doesn't have a
> directive to say: get me a place that is M before N alignment.
>
> Because ideally the whole above thing would look like:
>
>       50:       2e e9 00 00 00 00       cs jmp 56 <__traceiter_initcall_level+0x46>     52: R_X86_64_PLT32      __x86_return_thunk-0x4
>       56:       66 2e 0f 1f 84          cs nopw (%rax,%rax,1)
>
> 000000000000005b <__cfi___probestub_initcall_level>:
>       5b:       b8 b1 fd 66 f9          mov    $0xf966fdb1,%eax
>
> 0000000000000060 <__probestub_initcall_level>:
>       60:       2e e9 00 00 00 00       cs jmp 76 <__probestub_initcall_level+0x6>      72: R_X86_64_PLT32      __x86_return_thunk-0x4

Hi, Peter. Thank you for the testing, which is quite helpful
for understanding the whole thing.

I was surprised at this too. Without CALL_PADDING, the cfi is
nop(11) + mov; with CALL_PADDING, the cfi is mov + nop(11),
which is weird, as it seems that we could select CALL_PADDING if
CFI_CLANG to make things consistent. And I thought that it was
designed this way for some reason :/

Hmm......so what should we do now? Accept and bear it,
or do something different?

I'm good at clang, so the solution that I can think of is how to
bear it :/

According to my testing, the text size will increase:

~2.2% if we make FUNCTION_PADDING_BYTES 27 and select
FUNCTION_ALIGNMENT_16B.

~3.5% if we make FUNCTION_PADDING_BYTES 27 and select
FUNCTION_ALIGNMENT_32B.

We don't have to select FUNCTION_ALIGNMENT_32B, so the
worst case is an increase of ~2.2%.

What do you think?

Thanks!
Menglong Dong

>
>
>
Menglong Dong March 4, 2025, 8:41 a.m. UTC | #6
On Tue, Mar 4, 2025 at 3:47 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> On Tue, Mar 4, 2025 at 2:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Mar 04, 2025 at 06:38:53AM +0100, Peter Zijlstra wrote:
> > > On Tue, Mar 04, 2025 at 09:10:12AM +0800, Menglong Dong wrote:
> > > > Hello, sorry that I forgot to add something to the changelog. In fact,
> > > > I don't add extra 5-bytes anymore, which you can see in the 3rd patch.
> > > >
> > > > The thing is that we can't add extra 5-bytes if CFI is enabled. Without
> > > > CFI, we can make the padding space any value, such as 5-bytes, and
> > > > the layout will be like this:
> > > >
> > > > __align:
> > > >   nop
> > > >   nop
> > > >   nop
> > > >   nop
> > > >   nop
> > > > foo: -- __align +5
> > > >
> > > > However, the CFI will always make the cfi insn 16-bytes aligned. When
> > > > we set the FUNCTION_PADDING_BYTES to (11 + 5), the layout will be
> > > > like this:
> > > >
> > > > __cfi_foo:
> > > >   nop (11)
> > > >   mov $0x12345678, %reg
> > > >   nop (16)
> > > > foo:
> > > >
> > > > and the padding space is 32-bytes actually. So, we can just select
> > > > FUNCTION_ALIGNMENT_32B instead, which makes the padding
> > > > space 32-bytes too, and have the following layout:
> > > >
> > > > __cfi_foo:
> > > >   mov $0x12345678, %reg
> > > >   nop (27)
> > > > foo:
> > >
> > > *blink*, wtf is clang smoking.
> > >
> > > I mean, you're right, this is what it is doing, but that is somewhat
> > > unexpected. Let me go look at clang source, this is insane.
> >
> > Bah, this is because assemblers are stupid :/
> >
> > There is no way to tell them to have foo aligned such that there are at
> > least N bytes free before it.
> >
> > So what kCFI ends up having to do is align the __cfi symbol to the
> > function alignment, and then stuff enough nops in to make the real
> > symbol meet alignment.
> >
> > And the end result is utter insanity.
> >
> > I mean, look at this:
> >
> >       50:       2e e9 00 00 00 00       cs jmp 56 <__traceiter_initcall_level+0x46>     52: R_X86_64_PLT32      __x86_return_thunk-0x4
> >       56:       66 2e 0f 1f 84 00 00 00 00 00   cs nopw 0x0(%rax,%rax,1)
> >
> > 0000000000000060 <__cfi___probestub_initcall_level>:
> >       60:       90                      nop
> >       61:       90                      nop
> >       62:       90                      nop
> >       63:       90                      nop
> >       64:       90                      nop
> >       65:       90                      nop
> >       66:       90                      nop
> >       67:       90                      nop
> >       68:       90                      nop
> >       69:       90                      nop
> >       6a:       90                      nop
> >       6b:       b8 b1 fd 66 f9          mov    $0xf966fdb1,%eax
> >
> > 0000000000000070 <__probestub_initcall_level>:
> >       70:       2e e9 00 00 00 00       cs jmp 76 <__probestub_initcall_level+0x6>      72: R_X86_64_PLT32      __x86_return_thunk-0x4
> >
> >
> > That's 21 bytes wasted, for no reason other than that asm doesn't have a
> > directive to say: get me a place that is M before N alignment.
> >
> > Because ideally the whole above thing would look like:
> >
> >       50:       2e e9 00 00 00 00       cs jmp 56 <__traceiter_initcall_level+0x46>     52: R_X86_64_PLT32      __x86_return_thunk-0x4
> >       56:       66 2e 0f 1f 84          cs nopw (%rax,%rax,1)
> >
> > 000000000000005b <__cfi___probestub_initcall_level>:
> >       5b:       b8 b1 fd 66 f9          mov    $0xf966fdb1,%eax
> >
> > 0000000000000060 <__probestub_initcall_level>:
> >       60:       2e e9 00 00 00 00       cs jmp 76 <__probestub_initcall_level+0x6>      72: R_X86_64_PLT32      __x86_return_thunk-0x4
>
> Hi, peter. Thank you for the testing, which is quite helpful
> to understand the whole thing.
>
> I was surprised at this too. Without CALL_PADDING, the cfi is
> nop(11) + mov; with CALL_PADDING, the cfi is mov + nop(11),
> which is weird, as it seems that we can select CALL_PADDING if
> CFI_CLANG to make things consistent. And I  thought that it is
> designed to be this for some reasons :/
>
> Hmm......so what should we do now? Accept and bear it,
> or do something different?
>
> I'm good at clang, so the solution that I can think of is how to

*not good at*

> bear it :/
>
> According to my testing, the text size will increase:
>
> ~2.2% if we make FUNCTION_PADDING_BYTES 27 and select
> FUNCTION_ALIGNMENT_16B.
>
> ~3.5% if we make FUNCTION_PADDING_BYTES 27 and select
> FUNCTION_ALIGNMENT_32B.
>
> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> worst case is to increase ~2.2%.
>
> What do you think?
>
> Thanks!
> Menglong Dong
>
> >
> >
> >
Peter Zijlstra March 4, 2025, 9:42 a.m. UTC | #7
On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> worst case is to increase ~2.2%.
> 
> What do you think?

Well, since I don't understand what you need this for at all, I'm firmly
on the side of not doing this.

What actual problem is being solved with this meta data nonsense? Why is
it worth blowing up our I$ footprint over.

Also note, that if you're going to be explaining this, start from
scratch, as I have absolutely 0 clues about BPF and such.
H. Peter Anvin March 4, 2025, 2:52 p.m. UTC | #8
On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
>On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
>> We don't have to select FUNCTION_ALIGNMENT_32B, so the
>> worst case is to increase ~2.2%.
>> 
>> What do you think?
>
>Well, since I don't understand what you need this for at all, I'm firmly
>on the side of not doing this.
>
>What actual problem is being solved with this meta data nonsense? Why is
>it worth blowing up our I$ footprint over.
>
>Also note, that if you're going to be explaining this, start from
>scratch, as I have absolutely 0 clues about BPF and such.

I would appreciate such information as well. The idea seems dubious on the surface.
Menglong Dong March 5, 2025, 1:19 a.m. UTC | #9
On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> >> worst case is to increase ~2.2%.
> >>
> >> What do you think?
> >
> >Well, since I don't understand what you need this for at all, I'm firmly
> >on the side of not doing this.
> >
> >What actual problem is being solved with this meta data nonsense? Why is
> >it worth blowing up our I$ footprint over.
> >
> >Also note, that if you're going to be explaining this, start from
> >scratch, as I have absolutely 0 clues about BPF and such.
>
> I would appreciate such information as well. The idea seems dubious on the surface.

Ok, let me explain it from the beginning. (My English is not good,
but I'll try to describe it as clear as possible :/)

Many BPF program types depend on the BPF trampoline,
such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
BPF_PROG_TYPE_LSM, etc. The BPF trampoline is a bridge between
a kernel (or bpf) function and a BPF program, and it acts just like the
trampoline that ftrace uses.

Generally speaking, it is used to hook a function, just like what ftrace
does:

foo:
    endbr
    nop5  -->  call trampoline_foo
    xxxx

In short, the trampoline_foo can be this:

trampoline_foo:
    prepare an array and store the args of foo in the array
    call fentry_bpf1
    call fentry_bpf2
    ......
    call foo+4 (origin call)
    save the return value of foo
    call fexit_bpf1 (this bpf can get the return value of foo)
    call fexit_bpf2
    .......
    return to the caller of foo

We can see that trampoline_foo can only be used for
the function foo, as different kernel functions can have
different BPF programs attached, different argument counts,
etc. Therefore, we have to create 1000 BPF trampolines if
we want to attach a BPF program to 1000 kernel functions.

The creation of a BPF trampoline is expensive. According to
my testing, it takes more than 1 second to create 100 BPF
trampolines. What's more, it consumes more memory.

If we have per-function metadata support, then we can
create a global BPF trampoline, like this:

trampoline_global:
    prepare an array and store the args of foo in the array
    get the metadata by the ip
    call metadata.fentry_bpf1
    call metadata.fentry_bpf2
    ....
    call foo+4 (origin call)
    save the return value of foo
    call metadata.fexit_bpf1 (this bpf can get the return value of foo)
    call metadata.fexit_bpf2
    .......
    return to the caller of foo

(The metadata holds more information for the global trampoline than
I described.)

Then, we don't need to create a trampoline for every kernel function
anymore.
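
To show the idea, here is a minimal sketch of the lookup only.
func_metadata, METADATA_INSN_OFFSET and metadata_table are hypothetical
names for illustration (not the interfaces of this series), and it
assumes a 32-bit index is encoded as a mov immediate in the padding:

#include <linux/bpf.h>

/* hypothetical values, for illustration only */
#define METADATA_INSN_OFFSET	23	/* offset of the mov imm32 in the padding */
#define MAX_FUNC_METADATA	65536

struct func_metadata {
	unsigned long func_ip;
	int nr_args;
	struct bpf_prog **fentry_progs;
	struct bpf_prog **fexit_progs;
};

static struct func_metadata *metadata_table[MAX_FUNC_METADATA];

static struct func_metadata *get_metadata_by_ip(unsigned long ip)
{
	u32 index = *(u32 *)(ip - METADATA_INSN_OFFSET);

	if (index >= MAX_FUNC_METADATA)
		return NULL;
	return metadata_table[index];
}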

Another beneficiary can be ftrace. For now, all the kernel functions that
are enabled by dynamic ftrace will be added to a filter hash if there is
more than one callback. A hash lookup will then happen when the traced
functions are called, which has an impact on the performance, see
__ftrace_ops_list_func() -> ftrace_ops_test(). With per-function
metadata support, we can store in the metadata whether the callback is
enabled for the kernel function, which can make the performance
much better.

The per-function metadata storage is a basic facility, and I think there
may be other features that can use it for better performance in the future
too.

(Hope that I'm describing it clearly :/)

Thanks!
Menglong Dong
H. Peter Anvin March 5, 2025, 8:29 a.m. UTC | #10
On March 4, 2025 5:19:09 PM PST, Menglong Dong <menglong8.dong@gmail.com> wrote:
>On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
>> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
>> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the
>> >> worst case is to increase ~2.2%.
>> >>
>> >> What do you think?
>> >
>> >Well, since I don't understand what you need this for at all, I'm firmly
>> >on the side of not doing this.
>> >
>> >What actual problem is being solved with this meta data nonsense? Why is
>> >it worth blowing up our I$ footprint over.
>> >
>> >Also note, that if you're going to be explaining this, start from
>> >scratch, as I have absolutely 0 clues about BPF and such.
>>
>> I would appreciate such information as well. The idea seems dubious on the surface.
>
>Ok, let me explain it from the beginning. (My English is not good,
>but I'll try to describe it as clear as possible :/)
>
>Many BPF program types need to depend on the BPF trampoline,
>such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
>BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
>the kernel (or bpf) function and BPF program, and it acts just like the
>trampoline that ftrace uses.
>
>Generally speaking, it is used to hook a function, just like what ftrace
>do:
>
>foo:
>    endbr
>    nop5  -->  call trampoline_foo
>    xxxx
>
>In short, the trampoline_foo can be this:
>
>trampoline_foo:
>    prepare a array and store the args of foo to the array
>    call fentry_bpf1
>    call fentry_bpf2
>    ......
>    call foo+4 (origin call)
>    save the return value of foo
>    call fexit_bpf1 (this bpf can get the return value of foo)
>    call fexit_bpf2
>    .......
>    return to the caller of foo
>
>We can see that the trampoline_foo can be only used for
>the function foo, as different kernel function can be attached
>different BPF programs, and have different argument count,
>etc. Therefore, we have to create 1000 BPF trampolines if
>we want to attach a BPF program to 1000 kernel functions.
>
>The creation of the BPF trampoline is expensive. According to
>my testing, It will spend more than 1 second to create 100 bpf
>trampoline. What's more, it consumes more memory.
>
>If we have the per-function metadata supporting, then we can
>create a global BPF trampoline, like this:
>
>trampoline_global:
>    prepare a array and store the args of foo to the array
>    get the metadata by the ip
>    call metadata.fentry_bpf1
>    call metadata.fentry_bpf2
>    ....
>    call foo+4 (origin call)
>    save the return value of foo
>    call metadata.fexit_bpf1 (this bpf can get the return value of foo)
>    call metadata.fexit_bpf2
>    .......
>    return to the caller of foo
>
>(The metadata holds more information for the global trampoline than
>I described.)
>
>Then, we don't need to create a trampoline for every kernel function
>anymore.
>
>Another beneficiary can be ftrace. For now, all the kernel functions that
>are enabled by dynamic ftrace will be added to a filter hash if there are
>more than one callbacks. And hash lookup will happen when the traced
>functions are called, which has an impact on the performance, see
>__ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function
>metadata supporting, we can store the information that if the callback is
>enabled on the kernel function to the metadata, which can make the performance
>much better.
>
>The per-function metadata storage is a basic function, and I think there
>may be other functions that can use it for better performance in the feature
>too.
>
>(Hope that I'm describing it clearly :/)
>
>Thanks!
>Menglong Dong
>

This is way too cursory. For one thing, you need to start by explaining why you are asking to put this *inline* with the code, which is something that normally would be avoided at all cost.
Menglong Dong March 5, 2025, 8:49 a.m. UTC | #11
On Wed, Mar 5, 2025 at 4:30 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On March 4, 2025 5:19:09 PM PST, Menglong Dong <menglong8.dong@gmail.com> wrote:
> >On Tue, Mar 4, 2025 at 10:53 PM H. Peter Anvin <hpa@zytor.com> wrote:
> >>
> >> On March 4, 2025 1:42:20 AM PST, Peter Zijlstra <peterz@infradead.org> wrote:
> >> >On Tue, Mar 04, 2025 at 03:47:45PM +0800, Menglong Dong wrote:
> >> >> We don't have to select FUNCTION_ALIGNMENT_32B, so the
> >> >> worst case is to increase ~2.2%.
> >> >>
> >> >> What do you think?
> >> >
> >> >Well, since I don't understand what you need this for at all, I'm firmly
> >> >on the side of not doing this.
> >> >
> >> >What actual problem is being solved with this meta data nonsense? Why is
> >> >it worth blowing up our I$ footprint over.
> >> >
> >> >Also note, that if you're going to be explaining this, start from
> >> >scratch, as I have absolutely 0 clues about BPF and such.
> >>
> >> I would appreciate such information as well. The idea seems dubious on the surface.
> >
> >Ok, let me explain it from the beginning. (My English is not good,
> >but I'll try to describe it as clear as possible :/)
> >
> >Many BPF program types need to depend on the BPF trampoline,
> >such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
> >BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
> >the kernel (or bpf) function and BPF program, and it acts just like the
> >trampoline that ftrace uses.
> >
> >Generally speaking, it is used to hook a function, just like what ftrace
> >do:
> >
> >foo:
> >    endbr
> >    nop5  -->  call trampoline_foo
> >    xxxx
> >
> >In short, the trampoline_foo can be this:
> >
> >trampoline_foo:
> >    prepare a array and store the args of foo to the array
> >    call fentry_bpf1
> >    call fentry_bpf2
> >    ......
> >    call foo+4 (origin call)
> >    save the return value of foo
> >    call fexit_bpf1 (this bpf can get the return value of foo)
> >    call fexit_bpf2
> >    .......
> >    return to the caller of foo
> >
> >We can see that the trampoline_foo can be only used for
> >the function foo, as different kernel function can be attached
> >different BPF programs, and have different argument count,
> >etc. Therefore, we have to create 1000 BPF trampolines if
> >we want to attach a BPF program to 1000 kernel functions.
> >
> >The creation of the BPF trampoline is expensive. According to
> >my testing, It will spend more than 1 second to create 100 bpf
> >trampoline. What's more, it consumes more memory.
> >
> >If we have the per-function metadata supporting, then we can
> >create a global BPF trampoline, like this:
> >
> >trampoline_global:
> >    prepare a array and store the args of foo to the array
> >    get the metadata by the ip
> >    call metadata.fentry_bpf1
> >    call metadata.fentry_bpf2
> >    ....
> >    call foo+4 (origin call)
> >    save the return value of foo
> >    call metadata.fexit_bpf1 (this bpf can get the return value of foo)
> >    call metadata.fexit_bpf2
> >    .......
> >    return to the caller of foo
> >
> >(The metadata holds more information for the global trampoline than
> >I described.)
> >
> >Then, we don't need to create a trampoline for every kernel function
> >anymore.
> >
> >Another beneficiary can be ftrace. For now, all the kernel functions that
> >are enabled by dynamic ftrace will be added to a filter hash if there are
> >more than one callbacks. And hash lookup will happen when the traced
> >functions are called, which has an impact on the performance, see
> >__ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function
> >metadata supporting, we can store the information that if the callback is
> >enabled on the kernel function to the metadata, which can make the performance
> >much better.
> >
> >The per-function metadata storage is a basic function, and I think there
> >may be other functions that can use it for better performance in the feature
> >too.
> >
> >(Hope that I'm describing it clearly :/)
> >
> >Thanks!
> >Menglong Dong
> >
>
> This is way too cursory. For one thing, you need to start by explaining why you are asking to put this *inline* with the code, which is something that normally would be avoided at all cost.

Hi,

Sorry, I don't understand the *inline* here; do you mean
why I am putting the metadata in the function padding?
Steven Rostedt March 5, 2025, 3:03 p.m. UTC | #12
On Wed, 5 Mar 2025 09:19:09 +0800
Menglong Dong <menglong8.dong@gmail.com> wrote:

> Ok, let me explain it from the beginning. (My English is not good,
> but I'll try to describe it as clear as possible :/)

I always appreciate those who struggle with English having these
conversations. Thank you for that, as I know I am horrible in speaking any
other language. (I can get by in German, but even Germans tell me to switch
back to English ;-)

> 
> Many BPF program types need to depend on the BPF trampoline,
> such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
> BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
> the kernel (or bpf) function and BPF program, and it acts just like the
> trampoline that ftrace uses.
> 
> Generally speaking, it is used to hook a function, just like what ftrace
> do:
> 
> foo:
>     endbr
>     nop5  -->  call trampoline_foo
>     xxxx
> 
> In short, the trampoline_foo can be this:
> 
> trampoline_foo:
>     prepare a array and store the args of foo to the array
>     call fentry_bpf1
>     call fentry_bpf2
>     ......
>     call foo+4 (origin call)

Note, I brought up this issue when I first heard about how BPF does this.
The calling of the original function from the trampoline. I said this will
cause issues, and is only good for a few functions. Once you start doing
this for 1000s of functions, it's going to be a nightmare.

Looks like you are now in the nightmare phase.

My argument was once you have this case, you need to switch over to the
kretprobe / function graph way of doing things, which is to have a shadow
stack and hijack the return address. Yes, that has slightly more overhead,
but it's better than having to add all theses hacks.

And function graph has been updated so that it can do this for other users.
fprobes uses it now, and bpf can too.

>     save the return value of foo
>     call fexit_bpf1 (this bpf can get the return value of foo)
>     call fexit_bpf2
>     .......
>     return to the caller of foo
> 
> We can see that the trampoline_foo can be only used for
> the function foo, as different kernel function can be attached
> different BPF programs, and have different argument count,
> etc. Therefore, we have to create 1000 BPF trampolines if
> we want to attach a BPF program to 1000 kernel functions.
> 
> The creation of the BPF trampoline is expensive. According to
> my testing, It will spend more than 1 second to create 100 bpf
> trampoline. What's more, it consumes more memory.
> 
> If we have the per-function metadata supporting, then we can
> create a global BPF trampoline, like this:
> 
> trampoline_global:
>     prepare a array and store the args of foo to the array
>     get the metadata by the ip
>     call metadata.fentry_bpf1
>     call metadata.fentry_bpf2
>     ....
>     call foo+4 (origin call)

So if this is a global trampoline, wouldn't this "call foo" need to be an
indirect call? It can't be a direct call, otherwise you need a separate
trampoline for that.

This means you need to mitigate for spectre here, and you just lost the
performance gain from not using function graph.


>     save the return value of foo
>     call metadata.fexit_bpf1 (this bpf can get the return value of foo)
>     call metadata.fexit_bpf2
>     .......
>     return to the caller of foo
> 
> (The metadata holds more information for the global trampoline than
> I described.)
> 
> Then, we don't need to create a trampoline for every kernel function
> anymore.
> 
> Another beneficiary can be ftrace. For now, all the kernel functions that
> are enabled by dynamic ftrace will be added to a filter hash if there are
> more than one callbacks. And hash lookup will happen when the traced
> functions are called, which has an impact on the performance, see
> __ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function
> metadata supporting, we can store the information that if the callback is
> enabled on the kernel function to the metadata, which can make the performance
> much better.

Let me say now that ftrace will not use this. Looks like too much work for
little gain. The only time this impacts ftrace is when there's two
different callbacks tracing the same function, and it only impacts that
function. All other functions being traced still call the appropriate
trampoline for the callback.

-- Steve

> 
> The per-function metadata storage is a basic function, and I think there
> may be other functions that can use it for better performance in the feature
> too.
> 
> (Hope that I'm describing it clearly :/)
Menglong Dong March 6, 2025, 2:58 a.m. UTC | #13
On Wed, Mar 5, 2025 at 11:02 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 5 Mar 2025 09:19:09 +0800
> Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> > Ok, let me explain it from the beginning. (My English is not good,
> > but I'll try to describe it as clear as possible :/)
>
> I always appreciate those who struggle with English having these
> conversations. Thank you for that, as I know I am horrible in speaking any
> other language. (I can get by in German, but even Germans tell me to switch
> back to English ;-)
>
> >
> > Many BPF program types need to depend on the BPF trampoline,
> > such as BPF_PROG_TYPE_TRACING, BPF_PROG_TYPE_EXT,
> > BPF_PROG_TYPE_LSM, etc. BPF trampoline is a bridge between
> > the kernel (or bpf) function and BPF program, and it acts just like the
> > trampoline that ftrace uses.
> >
> > Generally speaking, it is used to hook a function, just like what ftrace
> > do:
> >
> > foo:
> >     endbr
> >     nop5  -->  call trampoline_foo
> >     xxxx
> >
> > In short, the trampoline_foo can be this:
> >
> > trampoline_foo:
> >     prepare a array and store the args of foo to the array
> >     call fentry_bpf1
> >     call fentry_bpf2
> >     ......
> >     call foo+4 (origin call)
>
> Note, I brought up this issue when I first heard about how BPF does this.
> The calling of the original function from the trampoline. I said this will
> cause issues, and is only good for a few functions. Once you start doing
> this for 1000s of functions, it's going to be a nightmare.
>
> Looks like you are now in the nightmare phase.
>
> My argument was once you have this case, you need to switch over to the
> kretprobe / function graph way of doing things, which is to have a shadow
> stack and hijack the return address. Yes, that has slightly more overhead,
> but it's better than having to add all theses hacks.
>
> And function graph has been updated so that it can do this for other users.
> fprobes uses it now, and bpf can too.

Yeah, I heard that kretprobe is able to get the function
arguments too, which benefits from the function graph.

Besides the overhead, another problem is that we can't do
direct memory access if the BPF is based on kretprobe.

>
> >     save the return value of foo
> >     call fexit_bpf1 (this bpf can get the return value of foo)
> >     call fexit_bpf2
> >     .......
> >     return to the caller of foo
> >
> > We can see that the trampoline_foo can be only used for
> > the function foo, as different kernel function can be attached
> > different BPF programs, and have different argument count,
> > etc. Therefore, we have to create 1000 BPF trampolines if
> > we want to attach a BPF program to 1000 kernel functions.
> >
> > The creation of the BPF trampoline is expensive. According to
> > my testing, It will spend more than 1 second to create 100 bpf
> > trampoline. What's more, it consumes more memory.
> >
> > If we have the per-function metadata supporting, then we can
> > create a global BPF trampoline, like this:
> >
> > trampoline_global:
> >     prepare a array and store the args of foo to the array
> >     get the metadata by the ip
> >     call metadata.fentry_bpf1
> >     call metadata.fentry_bpf2
> >     ....
> >     call foo+4 (origin call)
>
> So if this is a global trampoline, wouldn't this "call foo" need to be an
> indirect call? It can't be a direct call, otherwise you need a separate
> trampoline for that.
>
> This means you need to mitigate for spectre here, and you just lost the
> performance gain from not using function graph.

Yeah, you are right, this is an indirect call here. I haven't done
any research on spectre mitigation yet, and maybe we can
convert it into a direct call somehow? For example, we could maintain a
trampoline_table:
    some preparation
    jmp +%eax (eax is the index of the target function)
    call foo1 + 4
    return
    call foo2 + 4
    return
    call foo3 + 4
    return

(Hmm......Is the jmp above also an indirect call?)

And in the trampoline_global, we can call it like this:

    mov metadata.index %eax
    call trampoline_table

I'm not sure if it works. However, an indirect call is also used
in function graph, so we would still have better performance, wouldn't we?

Let me have a look at the code of the function graph first :/

Thanks!
Menglong Dong

>
>
> >     save the return value of foo
> >     call metadata.fexit_bpf1 (this bpf can get the return value of foo)
> >     call metadata.fexit_bpf2
> >     .......
> >     return to the caller of foo
> >
> > (The metadata holds more information for the global trampoline than
> > I described.)
> >
> > Then, we don't need to create a trampoline for every kernel function
> > anymore.
> >
> > Another beneficiary can be ftrace. For now, all the kernel functions that
> > are enabled by dynamic ftrace will be added to a filter hash if there are
> > more than one callbacks. And hash lookup will happen when the traced
> > functions are called, which has an impact on the performance, see
> > __ftrace_ops_list_func() -> ftrace_ops_test(). With the per-function
> > metadata supporting, we can store the information that if the callback is
> > enabled on the kernel function to the metadata, which can make the performance
> > much better.
>
> Let me say now that ftrace will not use this. Looks like too much work for
> little gain. The only time this impacts ftrace is when there's two
> different callbacks tracing the same function, and it only impacts that
> function. All other functions being traced still call the appropriate
> trampoline for the callback.
>
> -- Steve
>
> >
> > The per-function metadata storage is a basic function, and I think there
> > may be other functions that can use it for better performance in the feature
> > too.
> >
> > (Hope that I'm describing it clearly :/)
>
Alexei Starovoitov March 6, 2025, 3:39 a.m. UTC | #14
On Wed, Mar 5, 2025 at 6:59 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> I'm not sure if it works. However, indirect call is also used
> in function graph, so we still have better performance. Isn't it?
>
> Let me have a look at the code of the function graph first :/

Menglong,

Function graph infra isn't going to help.
"call foo" isn't a problem either.

But we have to step back.
per-function metadata is an optimization and feels like
we're doing a premature optimization here without collecting
performance numbers first.

Let's implement multi-fentry with generic get_metadata_by_ip() first.
get_metadata_by_ip() will be a hashtable in such a case and
then we can compare its performance when it's implemented as
a direct lookup from ip-4 (this patch) vs hash table
(that does 'ip' to 'metadata' lookup).
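
For reference, a minimal sketch of the hashtable variant (illustrative
names; only the lookup path is shown and locking/RCU are omitted):

#include <linux/hashtable.h>

struct func_md {
	unsigned long ip;	/* function entry address, used as key */
	void *data;		/* per-function metadata */
	struct hlist_node node;
};

static DEFINE_HASHTABLE(func_md_table, 12);

static struct func_md *get_metadata_by_ip(unsigned long ip)
{
	struct func_md *md;

	hash_for_each_possible(func_md_table, md, node, ip)
		if (md->ip == ip)
			return md;
	return NULL;
}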

If/when we decide to do this per-function metadata we can also
punt to generic hashtable for cfi, IBT, FineIBT, etc configs.
When mitigations are enabled the performance suffers anyway,
so hashtable lookup vs direct ip-4 lookup won't make much difference.
So we can enable per-function metadata only on non-mitigation configs
when FUNCTION_ALIGNMENT=16.
There will be some number of bytes available before every function
and if we can tell gcc/llvm to leave at least 5 bytes there
the growth of vmlinux .text will be within a noise.

So let's figure out the design of multi-fentry first with a hashtable
for metadata and decide next steps afterwards.
Menglong Dong March 6, 2025, 8:50 a.m. UTC | #15
On Thu, Mar 6, 2025 at 11:39 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 5, 2025 at 6:59 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > I'm not sure if it works. However, indirect call is also used
> > in function graph, so we still have better performance. Isn't it?
> >
> > Let me have a look at the code of the function graph first :/
>
> Menglong,
>
> Function graph infra isn't going to help.
> "call foo" isn't a problem either.
>
> But we have to step back.
> per-function metadata is an optimization and feels like
> we're doing a premature optimization here without collecting
> performance numbers first.
>
> Let's implement multi-fentry with generic get_metadata_by_ip() first.
> get_metadata_by_ip() will be a hashtable in such a case and
> then we can compare its performance when it's implemented as
> a direct lookup from ip-4 (this patch) vs hash table
> (that does 'ip' to 'metadata' lookup).

Hi, Alexei

You are right, I should do such a performance comparison.

>
> If/when we decide to do this per-function metadata we can also
> punt to generic hashtable for cfi, IBT, FineIBT, etc configs.
> When mitigations are enabled the performance suffers anyway,
> so hashtable lookup vs direct ip-4 lookup won't make much difference.
> So we can enable per-function metadata only on non-mitigation configs
> when FUNCTION_ALIGNMENT=16.
> There will be some number of bytes available before every function
> and if we can tell gcc/llvm to leave at least 5 bytes there
> the growth of vmlinux .text will be within a noise.

Sounds great! It's quite difficult to make the per-function metadata
work in all the cases. In particular, we can't implement it on arm64
if CFI_CLANG is enabled. Falling back to the hash table makes
it much easier in these cases.

>
> So let's figure out the design of multi-fenty first with a hashtable
> for metadata and decide next steps afterwards.

Ok, I'll develop a version for fentry multi-link with both hashtable
and function metadata, and do some performance testing. Thank
you for your advice :/

Thanks!
Menglong Dong
diff mbox series

Patch

diff --git a/arch/x86/include/asm/cfi.h b/arch/x86/include/asm/cfi.h
index 2f6a01f098b5..04525f2f6bf2 100644
--- a/arch/x86/include/asm/cfi.h
+++ b/arch/x86/include/asm/cfi.h
@@ -108,6 +108,14 @@  extern bhi_thunk __bhi_args_end[];
 
 struct pt_regs;
 
+#ifdef CONFIG_CALL_PADDING
+#define FINEIBT_INSN_OFFSET	16
+#define CFI_INSN_OFFSET		CONFIG_FUNCTION_ALIGNMENT
+#else
+#define FINEIBT_INSN_OFFSET	0
+#define CFI_INSN_OFFSET		5
+#endif
+
 #ifdef CONFIG_CFI_CLANG
 enum bug_trap_type handle_cfi_failure(struct pt_regs *regs);
 #define __bpfcall
@@ -118,11 +126,8 @@  static inline int cfi_get_offset(void)
 {
 	switch (cfi_mode) {
 	case CFI_FINEIBT:
-		return 16;
 	case CFI_KCFI:
-		if (IS_ENABLED(CONFIG_CALL_PADDING))
-			return 16;
-		return 5;
+		return CFI_INSN_OFFSET;
 	default:
 		return 0;
 	}
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 32e4b801db99..0088d2313f33 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -917,7 +917,7 @@  void __init_or_module noinline apply_seal_endbr(s32 *start, s32 *end)
 
 		poison_endbr(addr);
 		if (IS_ENABLED(CONFIG_FINEIBT))
-			poison_cfi(addr - 16);
+			poison_cfi(addr);
 	}
 }
 
@@ -980,12 +980,13 @@  u32 cfi_get_func_hash(void *func)
 {
 	u32 hash;
 
-	func -= cfi_get_offset();
 	switch (cfi_mode) {
 	case CFI_FINEIBT:
+		func -= FINEIBT_INSN_OFFSET;
 		func += 7;
 		break;
 	case CFI_KCFI:
+		func -= CFI_INSN_OFFSET;
 		func += 1;
 		break;
 	default:
@@ -1372,7 +1373,7 @@  static int cfi_rewrite_preamble(s32 *start, s32 *end)
 		 * have determined there are no indirect calls to it and we
 		 * don't need no CFI either.
 		 */
-		if (!is_endbr(addr + 16))
+		if (!is_endbr(addr + CFI_INSN_OFFSET))
 			continue;
 
 		hash = decode_preamble_hash(addr, &arity);
@@ -1380,6 +1381,7 @@  static int cfi_rewrite_preamble(s32 *start, s32 *end)
 			 addr, addr, 5, addr))
 			return -EINVAL;
 
+		addr += (CFI_INSN_OFFSET - FINEIBT_INSN_OFFSET);
 		text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size);
 		WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678);
 		text_poke_early(addr + fineibt_preamble_hash, &hash, 4);
@@ -1402,10 +1404,10 @@  static void cfi_rewrite_endbr(s32 *start, s32 *end)
 	for (s = start; s < end; s++) {
 		void *addr = (void *)s + *s;
 
-		if (!exact_endbr(addr + 16))
+		if (!exact_endbr(addr + CFI_INSN_OFFSET))
 			continue;
 
-		poison_endbr(addr + 16);
+		poison_endbr(addr + CFI_INSN_OFFSET);
 	}
 }
 
@@ -1543,12 +1545,12 @@  static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
 		return;
 
 	case CFI_FINEIBT:
-		/* place the FineIBT preamble at func()-16 */
+		/* place the FineIBT preamble at func()-FINEIBT_INSN_OFFSET */
 		ret = cfi_rewrite_preamble(start_cfi, end_cfi);
 		if (ret)
 			goto err;
 
-		/* rewrite the callers to target func()-16 */
+		/* rewrite the callers to target func()-FINEIBT_INSN_OFFSET */
 		ret = cfi_rewrite_callers(start_retpoline, end_retpoline);
 		if (ret)
 			goto err;
@@ -1588,6 +1590,7 @@  static void poison_cfi(void *addr)
 	 */
 	switch (cfi_mode) {
 	case CFI_FINEIBT:
+		addr -= FINEIBT_INSN_OFFSET;
 		/*
 		 * FineIBT prefix should start with an ENDBR.
 		 */
@@ -1607,6 +1610,7 @@  static void poison_cfi(void *addr)
 		break;
 
 	case CFI_KCFI:
+		addr -= CFI_INSN_OFFSET;
 		/*
 		 * kCFI prefix should start with a valid hash.
 		 */
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 72776dcb75aa..ee86a5df5ffb 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -415,6 +415,12 @@  static int emit_call(u8 **prog, void *func, void *ip);
 static void emit_fineibt(u8 **pprog, u8 *ip, u32 hash, int arity)
 {
 	u8 *prog = *pprog;
+#ifdef CONFIG_CALL_PADDING
+	int i;
+
+	for (i = 0; i < CFI_INSN_OFFSET - 16; i++)
+		EMIT1(0x90);
+#endif
 
 	EMIT_ENDBR();
 	EMIT3_off32(0x41, 0x81, 0xea, hash);		/* subl $hash, %r10d	*/
@@ -432,20 +438,14 @@  static void emit_fineibt(u8 **pprog, u8 *ip, u32 hash, int arity)
 static void emit_kcfi(u8 **pprog, u32 hash)
 {
 	u8 *prog = *pprog;
+#ifdef CONFIG_CALL_PADDING
+	int i;
+#endif
 
 	EMIT1_off32(0xb8, hash);			/* movl $hash, %eax	*/
 #ifdef CONFIG_CALL_PADDING
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
-	EMIT1(0x90);
+	for (i = 0; i < CFI_INSN_OFFSET - 5; i++)
+		EMIT1(0x90);
 #endif
 	EMIT_ENDBR();