Message ID | 20240812052106.3980303-1-yonghong.song@linux.dev (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | [bpf,1/2] bpf: Fix a kernel verifier crash in stacksafe() |
On Sun, 2024-08-11 at 22:21 -0700, Yonghong Song wrote:
> Daniel Hodges reported a kernel verifier crash when playing with sched-ext.
> The crash dump looks like below:
>
> [...]
>
> Further investigation shows that the crash is due to invalid memory access in stacksafe().
> More specifically, it is the following code:
>
>     if (exact != NOT_EXACT &&
>         old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
>         cur->stack[spi].slot_type[i % BPF_REG_SIZE])
>             return false;
>
> If cur->allocated_stack is 0, cur->stack will be a ZERO_SIZE_PTR. If this happens,
> cur->stack[spi].slot_type[i % BPF_REG_SIZE] will crash the kernel as the memory
> address is illegal. This is exactly what happened in the above crash dump.
> If cur->allocated_stack is not 0, the above code could trigger array out-of-bounds
> access.
>
> The patch added a condition 'i < cur->allocated_stack' to ensure that the
> cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is always legal.
>
> Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks")
> Cc: Eduard Zingerman <eddyz87@gmail.com>
> Reported-by: Daniel Hodges <hodgesd@meta.com>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---

My bad, for some reason I thought that 'if (i >= cur->allocated_stack) return false;'
check below would be sufficient. (Which is obviously not true, sigh...).

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

[...]

On Mon, Aug 12, 2024 at 10:38 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Sun, 2024-08-11 at 22:21 -0700, Yonghong Song wrote:
> > Daniel Hodges reported a kernel verifier crash when playing with sched-ext.
> > The crash dump looks like below:
> >
> > [...]
> >
> > Further investigation shows that the crash is due to invalid memory access in stacksafe().
> > More specifically, it is the following code:
> >
> >     if (exact != NOT_EXACT &&
> >         old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
> >         cur->stack[spi].slot_type[i % BPF_REG_SIZE])
> >             return false;
> >
> > If cur->allocated_stack is 0, cur->stack will be a ZERO_SIZE_PTR. If this happens,
> > cur->stack[spi].slot_type[i % BPF_REG_SIZE] will crash the kernel as the memory
> > address is illegal. This is exactly what happened in the above crash dump.
> > If cur->allocated_stack is not 0, the above code could trigger array out-of-bounds
> > access.
> >
> > The patch added a condition 'i < cur->allocated_stack' to ensure that the
> > cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is always legal.
> >
> > Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks")
> > Cc: Eduard Zingerman <eddyz87@gmail.com>
> > Reported-by: Daniel Hodges <hodgesd@meta.com>
> > Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> > ---
>
> My bad, for some reason I thought that 'if (i >= cur->allocated_stack) return false;'
> check below would be sufficient. (Which is obviously not true, sigh...).
>
> Acked-by: Eduard Zingerman <eddyz87@gmail.com>

Should we move the check up instead?

    if (i >= cur->allocated_stack)
            return false;

Checking it twice looks odd.

On Mon, 2024-08-12 at 10:44 -0700, Alexei Starovoitov wrote:

[...]

> Should we move the check up instead?
>
>     if (i >= cur->allocated_stack)
>             return false;
>
> Checking it twice looks odd.

A few checks before that, namely:

        if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
            && exact == NOT_EXACT) {
                i += BPF_REG_SIZE - 1;
                /* explored state didn't use this */
                continue;
        }

        if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
                continue;

        if (env->allow_uninit_stack &&
            old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
                continue;

Should be done regardless of cur->allocated_stack.

On Mon, Aug 12, 2024 at 10:47 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2024-08-12 at 10:44 -0700, Alexei Starovoitov wrote:
>
> [...]
>
> > Should we move the check up instead?
> >
> >     if (i >= cur->allocated_stack)
> >             return false;
> >
> > Checking it twice looks odd.
>
> A few checks before that, namely:
>
>         if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
>             && exact == NOT_EXACT) {
>                 i += BPF_REG_SIZE - 1;
>                 /* explored state didn't use this */
>                 continue;
>         }
>
>         if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
>                 continue;
>
>         if (env->allow_uninit_stack &&
>             old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
>                 continue;
>
> Should be done regardless of cur->allocated_stack.

Right, but then let's sink old->slot_type != cur->slot_type down?

On Mon, 2024-08-12 at 10:50 -0700, Alexei Starovoitov wrote:
> On Mon, Aug 12, 2024 at 10:47 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >
> > On Mon, 2024-08-12 at 10:44 -0700, Alexei Starovoitov wrote:
> >
> > [...]
> >
> > > Should we move the check up instead?
> > >
> > >     if (i >= cur->allocated_stack)
> > >             return false;
> > >
> > > Checking it twice looks odd.
> >
> > A few checks before that, namely:
> >
> >         if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
> >             && exact == NOT_EXACT) {
> >                 i += BPF_REG_SIZE - 1;
> >                 /* explored state didn't use this */
> >                 continue;
> >         }
> >
> >         if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
> >                 continue;
> >
> >         if (env->allow_uninit_stack &&
> >             old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
> >                 continue;
> >
> > Should be done regardless of cur->allocated_stack.
>
> Right, but then let's sink old->slot_type != cur->slot_type down?

It does not seem correct to swap the order for these two checks:

        if (exact != NOT_EXACT && i < cur->allocated_stack &&
            old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
            cur->stack[spi].slot_type[i % BPF_REG_SIZE])
                return false;

        if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
            && exact == NOT_EXACT) {
                i += BPF_REG_SIZE - 1;
                /* explored state didn't use this */
                continue;
        }

if we do, 'slot_type' won't be checked for 'cur' when 'old' register is not marked live.

On 8/12/24 10:50 AM, Alexei Starovoitov wrote:
> On Mon, Aug 12, 2024 at 10:47 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>> On Mon, 2024-08-12 at 10:44 -0700, Alexei Starovoitov wrote:
>>
>> [...]
>>
>>> Should we move the check up instead?
>>>
>>>     if (i >= cur->allocated_stack)
>>>             return false;
>>>
>>> Checking it twice looks odd.
>> A few checks before that, namely:
>>
>>         if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
>>             && exact == NOT_EXACT) {
>>                 i += BPF_REG_SIZE - 1;
>>                 /* explored state didn't use this */
>>                 continue;
>>         }
>>
>>         if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
>>                 continue;
>>
>>         if (env->allow_uninit_stack &&
>>             old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
>>                 continue;
>>
>> Should be done regardless of cur->allocated_stack.
> Right, but then let's sink old->slot_type != cur->slot_type down?

We could do the following to avoid double comparison: diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index df3be12096cf..1906798f1a3d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -17338,10 +17338,13 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old, */ for (i = 0; i < old->allocated_stack; i++) { struct bpf_reg_state *old_reg, *cur_reg; + bool cur_exceed_bound; spi = i / BPF_REG_SIZE; - if (exact != NOT_EXACT && + cur_exceed_bound = i >= cur->allocated_stack; + + if (exact != NOT_EXACT && !cur_exceed_bound && old->stack[spi].slot_type[i % BPF_REG_SIZE] != cur->stack[spi].slot_type[i % BPF_REG_SIZE]) return false; @@ -17363,7 +17366,7 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old, /* explored stack has more populated slots than current stack * and these slots were used */ - if (i >= cur->allocated_stack) + if (cur_exceed_bound) return false; /* 64-bit scalar spill vs all slots MISC and vice versa. WDYT?

On Mon, 2024-08-12 at 11:26 -0700, Yonghong Song wrote:

[...]

>
> We could do the following to avoid double comparison: diff --git
> a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index
> df3be12096cf..1906798f1a3d 100644 --- a/kernel/bpf/verifier.c +++
> b/kernel/bpf/verifier.c @@ -17338,10 +17338,13 @@ static bool
> stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old, */
> for (i = 0; i < old->allocated_stack; i++) { struct bpf_reg_state
> *old_reg, *cur_reg; + bool cur_exceed_bound; spi = i / BPF_REG_SIZE; -
> if (exact != NOT_EXACT && + cur_exceed_bound = i >=
> cur->allocated_stack; + + if (exact != NOT_EXACT && !cur_exceed_bound &&
> old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
> cur->stack[spi].slot_type[i % BPF_REG_SIZE]) return false; @@ -17363,7
> +17366,7 @@ static bool stacksafe(struct bpf_verifier_env *env, struct
> bpf_func_state *old, /* explored stack has more populated slots than
> current stack * and these slots were used */ - if (i >=
> cur->allocated_stack) + if (cur_exceed_bound) return false; /* 64-bit
> scalar spill vs all slots MISC and vice versa. WDYT?
>

Yonghong, something went wrong with formatting of the above email,
could you please resend?

On 8/12/24 11:30 AM, Eduard Zingerman wrote:
> On Mon, 2024-08-12 at 11:26 -0700, Yonghong Song wrote:
>
> [...]
>
>> We could do the following to avoid double comparison: diff --git
>> a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index
>> df3be12096cf..1906798f1a3d 100644 --- a/kernel/bpf/verifier.c +++
>> b/kernel/bpf/verifier.c @@ -17338,10 +17338,13 @@ static bool
>> stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old, */
>> for (i = 0; i < old->allocated_stack; i++) { struct bpf_reg_state
>> *old_reg, *cur_reg; + bool cur_exceed_bound; spi = i / BPF_REG_SIZE; -
>> if (exact != NOT_EXACT && + cur_exceed_bound = i >=
>> cur->allocated_stack; + + if (exact != NOT_EXACT && !cur_exceed_bound &&
>> old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
>> cur->stack[spi].slot_type[i % BPF_REG_SIZE]) return false; @@ -17363,7
>> +17366,7 @@ static bool stacksafe(struct bpf_verifier_env *env, struct
>> bpf_func_state *old, /* explored stack has more populated slots than
>> current stack * and these slots were used */ - if (i >=
>> cur->allocated_stack) + if (cur_exceed_bound) return false; /* 64-bit
>> scalar spill vs all slots MISC and vice versa. WDYT?
>>
> Yonghong, something went wrong with formatting of the above email,
> could you please resend?

Sorry, I copy-pasted from the 'git diff' result to my email window. Not sure
why it caused the format issue after I sent it out.

Anyway, the following is the patch I suggested:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index df3be12096cf..1906798f1a3d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -17338,10 +17338,13 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
 	 */
 	for (i = 0; i < old->allocated_stack; i++) {
 		struct bpf_reg_state *old_reg, *cur_reg;
+		bool cur_exceed_bound;

 		spi = i / BPF_REG_SIZE;

-		if (exact != NOT_EXACT &&
+		cur_exceed_bound = i >= cur->allocated_stack;
+
+		if (exact != NOT_EXACT && !cur_exceed_bound &&
 		    old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
 		    cur->stack[spi].slot_type[i % BPF_REG_SIZE])
 			return false;
@@ -17363,7 +17366,7 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
 		/* explored stack has more populated slots than current stack
 		 * and these slots were used
 		 */
-		if (i >= cur->allocated_stack)
+		if (cur_exceed_bound)
 			return false;

 		/* 64-bit scalar spill vs all slots MISC and vice versa.

On Mon, 2024-08-12 at 11:36 -0700, Yonghong Song wrote:

[...]

> Sorry, I copy-pasted from the 'git diff' result to my email window. Not sure
> why it caused the format issue after I sent it out.

Sure, no problem

> Anyway, the following is the patch I suggested:
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index df3be12096cf..1906798f1a3d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -17338,10 +17338,13 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
>  	 */
>  	for (i = 0; i < old->allocated_stack; i++) {
>  		struct bpf_reg_state *old_reg, *cur_reg;
> +		bool cur_exceed_bound;
>
>  		spi = i / BPF_REG_SIZE;
>
> -		if (exact != NOT_EXACT &&
> +		cur_exceed_bound = i >= cur->allocated_stack;

idk, I think C compiler would do this anyways,
to me the code is fine both with and without this additional variable.

> +
> +		if (exact != NOT_EXACT && !cur_exceed_bound &&
>  		    old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
>  		    cur->stack[spi].slot_type[i % BPF_REG_SIZE])
>  			return false;
> @@ -17363,7 +17366,7 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
>  		/* explored stack has more populated slots than current stack
>  		 * and these slots were used
>  		 */
> -		if (i >= cur->allocated_stack)
> +		if (cur_exceed_bound)
>  			return false;
>
>  		/* 64-bit scalar spill vs all slots MISC and vice versa.
>

On 8/12/24 11:41 AM, Eduard Zingerman wrote:
> On Mon, 2024-08-12 at 11:36 -0700, Yonghong Song wrote:
>
> [...]
>
>> Sorry, I copy-pasted from the 'git diff' result to my email window. Not sure
>> why it caused the format issue after I sent it out.
> Sure, no problem
>
>> Anyway, the following is the patch I suggested:
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index df3be12096cf..1906798f1a3d 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -17338,10 +17338,13 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
>>  	 */
>>  	for (i = 0; i < old->allocated_stack; i++) {
>>  		struct bpf_reg_state *old_reg, *cur_reg;
>> +		bool cur_exceed_bound;
>>
>>  		spi = i / BPF_REG_SIZE;
>>
>> -		if (exact != NOT_EXACT &&
>> +		cur_exceed_bound = i >= cur->allocated_stack;
> idk, I think C compiler would do this anyways,
> to me the code is fine both with and without this additional variable.

Okay, I will keep the original (simpler) patch then.

>
>> +
>> +		if (exact != NOT_EXACT && !cur_exceed_bound &&
>>  		    old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
>>  		    cur->stack[spi].slot_type[i % BPF_REG_SIZE])
>>  			return false;
>> @@ -17363,7 +17366,7 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,
>>  		/* explored stack has more populated slots than current stack
>>  		 * and these slots were used
>>  		 */
>> -		if (i >= cur->allocated_stack)
>> +		if (cur_exceed_bound)
>>  			return false;
>>
>>  		/* 64-bit scalar spill vs all slots MISC and vice versa.
>>
>

On Mon, Aug 12, 2024 at 10:57 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Mon, 2024-08-12 at 10:50 -0700, Alexei Starovoitov wrote:
> > On Mon, Aug 12, 2024 at 10:47 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > >
> > > On Mon, 2024-08-12 at 10:44 -0700, Alexei Starovoitov wrote:
> > >
> > > [...]
> > >
> > > > Should we move the check up instead?
> > > >
> > > >     if (i >= cur->allocated_stack)
> > > >             return false;
> > > >
> > > > Checking it twice looks odd.
> > >
> > > A few checks before that, namely:
> > >
> > >         if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
> > >             && exact == NOT_EXACT) {
> > >                 i += BPF_REG_SIZE - 1;
> > >                 /* explored state didn't use this */
> > >                 continue;
> > >         }
> > >
> > >         if (old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_INVALID)
> > >                 continue;
> > >
> > >         if (env->allow_uninit_stack &&
> > >             old->stack[spi].slot_type[i % BPF_REG_SIZE] == STACK_MISC)
> > >                 continue;
> > >
> > > Should be done regardless of cur->allocated_stack.
> >
> > Right, but then let's sink old->slot_type != cur->slot_type down?
>
> It does not seem correct to swap the order for these two checks:
>
>         if (exact != NOT_EXACT && i < cur->allocated_stack &&
>             old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
>             cur->stack[spi].slot_type[i % BPF_REG_SIZE])
>                 return false;
>
>         if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
>             && exact == NOT_EXACT) {
>                 i += BPF_REG_SIZE - 1;
>                 /* explored state didn't use this */
>                 continue;
>         }
>
> if we do, 'slot_type' won't be checked for 'cur' when 'old' register is not marked live.

I see. This is to compare states in open coded iter loops when liveness
is not propagated yet, right?

Then when comparing for exact states we should probably do:

        if (exact != NOT_EXACT &&
            (i >= cur->allocated_stack ||
             old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
             cur->stack[spi].slot_type[i % BPF_REG_SIZE]))
                return false;

?

On Mon, 2024-08-12 at 12:29 -0700, Alexei Starovoitov wrote:

[...]

> > It does not seem correct to swap the order for these two checks:
> >
> >         if (exact != NOT_EXACT && i < cur->allocated_stack &&
> >             old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
> >             cur->stack[spi].slot_type[i % BPF_REG_SIZE])
> >                 return false;
> >
> >         if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
> >             && exact == NOT_EXACT) {
> >                 i += BPF_REG_SIZE - 1;
> >                 /* explored state didn't use this */
> >                 continue;
> >         }
> >
> > if we do, 'slot_type' won't be checked for 'cur' when 'old' register is not marked live.
>
> I see. This is to compare states in open coded iter loops when liveness
> is not propagated yet, right?

Yes

>
> Then when comparing for exact states we should probably do:
>
>         if (exact != NOT_EXACT &&
>             (i >= cur->allocated_stack ||
>              old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
>              cur->stack[spi].slot_type[i % BPF_REG_SIZE]))
>                 return false;
>
> ?

Hm, right, otherwise the old slots in the interval
[cur->allocated_stack..old->allocated_stack)
won't be checked using exact rules.

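The difference between the condition in the posted patch and the one proposed above can be sketched in user space. This is only a minimal model, not kernel code: struct slot, struct fstate and both helper names are hypothetical stand-ins for the verifier's structures, and exact != NOT_EXACT is assumed to hold already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BPF_REG_SIZE 8

/* Hypothetical stand-ins for struct bpf_stack_state / struct bpf_func_state. */
struct slot {
	unsigned char slot_type[BPF_REG_SIZE];
};

struct fstate {
	int allocated_stack;	/* tracked stack size in bytes */
	struct slot *stack;	/* allocated_stack / BPF_REG_SIZE entries */
};

/* Posted patch: byte i is compared only if cur has it; otherwise it is skipped. */
static bool exact_mismatch_v1(const struct fstate *old,
			      const struct fstate *cur, int i)
{
	int spi = i / BPF_REG_SIZE;

	return i < cur->allocated_stack &&
	       old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
	       cur->stack[spi].slot_type[i % BPF_REG_SIZE];
}

/* Proposal: a byte that cur does not have at all counts as a mismatch. */
static bool exact_mismatch_v2(const struct fstate *old,
			      const struct fstate *cur, int i)
{
	int spi = i / BPF_REG_SIZE;

	return i >= cur->allocated_stack ||
	       old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
	       cur->stack[spi].slot_type[i % BPF_REG_SIZE];
}
```

With old tracking 16 bytes and cur tracking 8, byte 8 is skipped by the first helper and rejected by the second. In both helpers short-circuit evaluation keeps cur->stack[spi] from being read when i is past cur->allocated_stack.
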
On 8/12/24 12:43 PM, Eduard Zingerman wrote:
> On Mon, 2024-08-12 at 12:29 -0700, Alexei Starovoitov wrote:
>
> [...]
>
>>> It does not seem correct to swap the order for these two checks:
>>>
>>>         if (exact != NOT_EXACT && i < cur->allocated_stack &&
>>>             old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
>>>             cur->stack[spi].slot_type[i % BPF_REG_SIZE])
>>>                 return false;
>>>
>>>         if (!(old->stack[spi].spilled_ptr.live & REG_LIVE_READ)
>>>             && exact == NOT_EXACT) {
>>>                 i += BPF_REG_SIZE - 1;
>>>                 /* explored state didn't use this */
>>>                 continue;
>>>         }
>>>
>>> if we do, 'slot_type' won't be checked for 'cur' when 'old' register is not marked live.
>> I see. This is to compare states in open coded iter loops when liveness
>> is not propagated yet, right?
> Yes
>
>> Then when comparing for exact states we should probably do:
>>
>>         if (exact != NOT_EXACT &&
>>             (i >= cur->allocated_stack ||
>>              old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
>>              cur->stack[spi].slot_type[i % BPF_REG_SIZE]))
>>                 return false;
>>
>> ?
> Hm, right, otherwise the old slots in the interval
> [cur->allocated_stack..old->allocated_stack)
> won't be checked using exact rules.

Okay, for *exact* stack slot_type comparison. Will make the change and send v2 soon.

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4cb5441ad75f..1e3d7794bf13 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -16883,7 +16883,7 @@ static bool stacksafe(struct bpf_verifier_env *env, struct bpf_func_state *old,

 		spi = i / BPF_REG_SIZE;

-		if (exact != NOT_EXACT &&
+		if (exact != NOT_EXACT && i < cur->allocated_stack &&
 		    old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
 		    cur->stack[spi].slot_type[i % BPF_REG_SIZE])
 			return false;

Daniel Hodges reported a kernel verifier crash when playing with sched-ext.
The crash dump looks like below:

  [ 65.874474] BUG: kernel NULL pointer dereference, address: 0000000000000088
  [ 65.888406] #PF: supervisor read access in kernel mode
  [ 65.898682] #PF: error_code(0x0000) - not-present page
  [ 65.908957] PGD 0 P4D 0
  [ 65.914020] Oops: 0000 [#1] SMP
  [ 65.920300] CPU: 19 PID: 9364 Comm: scx_layered Kdump: loaded Tainted: G S E 6.9.5-g93cea04637ea-dirty #7
  [ 65.941874] Hardware name: Quanta Delta Lake MP 29F0EMA01D0/Delta Lake-Class1, BIOS F0E_3A19 04/27/2023
  [ 65.960664] RIP: 0010:states_equal+0x3ee/0x770
  [ 65.969559] Code: 33 85 ed 89 e8 41 0f 48 c7 83 e0 f8 89 e9 29 c1 48 63 c1 4c 89 e9 48 c1 e1 07 49 8d 14 08 0f
                     b6 54 10 78 49 03 8a 58 05 00 00 <3a> 54 08 78 0f 85 60 03 00 00 49 c1 e5 07 43 8b 44 28 70 83 e0 03
  [ 66.007120] RSP: 0018:ffffc9000ebeb8b8 EFLAGS: 00010202
  [ 66.017570] RAX: 0000000000000000 RBX: ffff888149719680 RCX: 0000000000000010
  [ 66.031843] RDX: 0000000000000000 RSI: ffff88907f4e0c08 RDI: ffff8881572f0000
  [ 66.046115] RBP: 0000000000000000 R08: ffff8883d5014000 R09: ffffffff83065d50
  [ 66.060386] R10: ffff8881bf9a1800 R11: 0000000000000002 R12: 0000000000000000
  [ 66.074659] R13: 0000000000000000 R14: ffff888149719a40 R15: 0000000000000007
  [ 66.088932] FS: 00007f5d5da96800(0000) GS:ffff88907f4c0000(0000) knlGS:0000000000000000
  [ 66.105114] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 66.116606] CR2: 0000000000000088 CR3: 0000000388261001 CR4: 00000000007706f0
  [ 66.130873] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 66.145145] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 66.159416] PKRU: 55555554
  [ 66.164823] Call Trace:
  [ 66.169709] <TASK>
  [ 66.173906] ? __die_body+0x66/0xb0
  [ 66.180890] ? page_fault_oops+0x370/0x3d0
  [ 66.189082] ? console_unlock+0xb5/0x140
  [ 66.196926] ? exc_page_fault+0x4f/0xb0
  [ 66.204597] ? asm_exc_page_fault+0x22/0x30
  [ 66.212974] ? states_equal+0x3ee/0x770
  [ 66.220643] ? states_equal+0x529/0x770
  [ 66.228312] do_check+0x60f/0x5240
  [ 66.235114] do_check_common+0x388/0x840
  [ 66.242960] do_check_subprogs+0x101/0x150
  [ 66.251150] bpf_check+0x5d5/0x4b60
  [ 66.258134] ? __mod_memcg_state+0x79/0x110
  [ 66.266506] ? pcpu_alloc+0x892/0xba0
  [ 66.273829] bpf_prog_load+0x5bb/0x660
  [ 66.281324] ? bpf_prog_bind_map+0x1e1/0x290
  [ 66.289862] __sys_bpf+0x29d/0x3a0
  [ 66.296664] __x64_sys_bpf+0x18/0x20
  [ 66.303811] do_syscall_64+0x6a/0x140
  [ 66.311133] entry_SYSCALL_64_after_hwframe+0x4b/0x53

Further investigation shows that the crash is due to invalid memory access in stacksafe().
More specifically, it is the following code:

    if (exact != NOT_EXACT &&
        old->stack[spi].slot_type[i % BPF_REG_SIZE] !=
        cur->stack[spi].slot_type[i % BPF_REG_SIZE])
            return false;

If cur->allocated_stack is 0, cur->stack will be a ZERO_SIZE_PTR. If this happens,
cur->stack[spi].slot_type[i % BPF_REG_SIZE] will crash the kernel as the memory
address is illegal. This is exactly what happened in the above crash dump.
If cur->allocated_stack is not 0, the above code could trigger array out-of-bounds
access.

The patch added a condition 'i < cur->allocated_stack' to ensure that the
cur->stack[spi].slot_type[i % BPF_REG_SIZE] memory access is always legal.

Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks")
Cc: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Daniel Hodges <hodgesd@meta.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
---
 kernel/bpf/verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
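The faulting address in the dump appears consistent with the ZERO_SIZE_PTR explanation. As a rough check, and assuming an x86-64 reading of the 'Code:' bytes in which the faulting instruction compares a byte at RAX + RCX + 0x78 after the slot index was scaled by a shift of 7, the address can be recomputed from ZERO_SIZE_PTR, which include/linux/slab.h defines as ((void *)16). The element size and the slot_type offset below are inferred from the oops, not taken from kernel headers, and slot_type_addr() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define BPF_REG_SIZE		8
#define ZERO_SIZE_PTR_ADDR	16UL	/* ZERO_SIZE_PTR is ((void *)16) */

/* Assumptions inferred from the oops, not taken from kernel headers: */
#define STACK_STATE_SIZE	128UL	/* spi is scaled by a shift of 7 */
#define SLOT_TYPE_OFFSET	0x78UL	/* displacement in the faulting cmp */

/* Address of stack[i / 8].slot_type[i % 8] for a given stack base. */
static uintptr_t slot_type_addr(uintptr_t stack_base, unsigned int i)
{
	uintptr_t spi = i / BPF_REG_SIZE;

	return stack_base + spi * STACK_STATE_SIZE +
	       SLOT_TYPE_OFFSET + i % BPF_REG_SIZE;
}
```

For i = 0 and a base of ZERO_SIZE_PTR this gives 0x10 + 0x78 = 0x88, which matches the CR2 value in the dump; RCX holding 0x10 fits the same reading.
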