
[v2,bpf-next] bpf: increment and use correct thread iterator

Message ID 20201204034302.2123841-1-jonathan.lemon@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series [v2,bpf-next] bpf: increment and use correct thread iterator

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch warning WARNING: From:/Signed-off-by: email address mismatch: 'From: Jonathan Lemon <bsd@fb.com>' != 'Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>'
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success
netdev/stable success Stable not CCed

Commit Message

Jonathan Lemon Dec. 4, 2020, 3:43 a.m. UTC
From: Jonathan Lemon <bsd@fb.com>

If unable to obtain the file structure for the current task,
proceed to the next task number after the one returned from
task_seq_get_next(), instead of the next task number from the
original iterator.

Use thread_group_leader() instead of comparing tgid vs pid, which
may be racy.

Only obtain the task reference count at the end of the RCU section,
instead of repeatedly obtaining/releasing it when iterating through
a thread group.

Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
Fixes: 203d7b054fc7 ("bpf: Avoid iterating duplicated files for task_file iterator")

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 kernel/bpf/task_iter.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Yonghong Song Dec. 4, 2020, 8:01 a.m. UTC | #1
On 12/3/20 7:43 PM, Jonathan Lemon wrote:
> From: Jonathan Lemon <bsd@fb.com>

Could you explain in the commit log what problem this patch
tries to solve? What bad things could happen without this patch?

> 
> If unable to obtain the file structure for the current task,
> proceed to the next task number after the one returned from
> task_seq_get_next(), instead of the next task number from the
> original iterator.
This seems a correct change. The current code should still work
but it may do some redundant/unnecessary work in the kernel.
This only happens when a task does not have any files, so I am
not sure whether this is the culprit for the problem this
patch tries to address.

> 
> Use thread_group_leader() instead of comparing tgid vs pid, which
> might may be racy.

I see

static inline bool thread_group_leader(struct task_struct *p)
{
         return p->exit_signal >= 0;
}

I am not sure whether thread_group_leader(task) is equivalent
to task->tgid == task->pid or not. Any documentation or explanation?

Could you explain why task->tgid != task->pid in the original
code could be racy?

> 
> Only obtain the task reference count at the end of the RCU section
> instead of repeatedly obtaining/releasing it when iterathing though
> a thread group.

I think this is an optimization and not about the correctness.

> 
> Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
> Fixes: 203d7b054fc7 ("bpf: Avoid iterating duplicated files for task_file iterator")
> 
> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>   kernel/bpf/task_iter.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index 0458a40edf10..66a52fcf589a 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -33,17 +33,17 @@ static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
>   	pid = find_ge_pid(*tid, ns);
>   	if (pid) {
>   		*tid = pid_nr_ns(pid, ns);
> -		task = get_pid_task(pid, PIDTYPE_PID);
> +		task = pid_task(pid, PIDTYPE_PID);
>   		if (!task) {
>   			++*tid;
>   			goto retry;
> -		} else if (skip_if_dup_files && task->tgid != task->pid &&
> +		} else if (skip_if_dup_files && !thread_group_leader(task) &&
>   			   task->files == task->group_leader->files) {
> -			put_task_struct(task);
>   			task = NULL;
>   			++*tid;
>   			goto retry;
>   		}
> +		get_task_struct(task);
>   	}
>   	rcu_read_unlock();
>   
> @@ -164,7 +164,7 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info)
>   		curr_files = get_files_struct(curr_task);
>   		if (!curr_files) {
>   			put_task_struct(curr_task);
> -			curr_tid = ++(info->tid);
> +			curr_tid = curr_tid + 1;
>   			info->fd = 0;
>   			goto again;
>   		}
>
Jonathan Lemon Dec. 4, 2020, 5:14 p.m. UTC | #2
On Fri, Dec 04, 2020 at 12:01:53AM -0800, Yonghong Song wrote:
> 
> 
> On 12/3/20 7:43 PM, Jonathan Lemon wrote:
> > From: Jonathan Lemon <bsd@fb.com>
> 
> Could you explain in the commit log what problem this patch
> tries to solve? What bad things could happen without this patch?

Without the patch, on a particular set of systems, RCU will repeatedly
generate stall warnings similar to the trace below.  The common factor
for all the traces seems to be using task_file_seq_next().  With the
patch, all the warnings go away.

 rcu: INFO: rcu_sched self-detected stall on CPU
 rcu: 	10-....: (20666 ticks this GP) idle=4b6/1/0x4000000000000002 softirq=14346773/14346773 fqs=5064
 	(t=21013 jiffies g=25395133 q=154147)
 NMI backtrace for cpu 10
 #1
 Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B17 03/16/2018
 Call Trace:
  <IRQ>
  dump_stack+0x50/0x70
  nmi_cpu_backtrace.cold.6+0x13/0x50
  ? lapic_can_unplug_cpu.cold.30+0x40/0x40
  nmi_trigger_cpumask_backtrace+0xba/0xca
  rcu_dump_cpu_stacks+0x99/0xc7
  rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
  ? tick_sched_do_timer+0x60/0x60
  update_process_times+0x24/0x50
  tick_sched_timer+0x37/0x70
  __hrtimer_run_queues+0xfe/0x270
  hrtimer_interrupt+0xf4/0x210
  smp_apic_timer_interrupt+0x5e/0x120
  apic_timer_interrupt+0xf/0x20
  </IRQ>
 RIP: 0010:find_ge_pid_upd+0x5/0x20
 Code: 80 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 d5 d3 9a 00 48 83 c4 08 c3 0f 1f 44 00 00 <48> 89 f8 48 8d 7e 08 48 89 c6 e9 bc d3 9a 00 cc cc cc cc cc cc cc
 RSP: 0018:ffffc9002b7abdb8 EFLAGS: 00000297 ORIG_RAX: ffffffffffffff13
 RAX: 00000000002ca5cd RBX: ffff889c44c0ba00 RCX: 0000000000000000
 RDX: 0000000000000002 RSI: ffffffff8284eb80 RDI: ffffc9002b7abdc4
 RBP: ffffc9002b7abe0c R08: ffff8895afe93a00 R09: ffff8891388abb50
 R10: 000000000000000c R11: 00000000002ca600 R12: 000000000000003f
 R13: ffffffff8284eb80 R14: 0000000000000001 R15: 00000000ffffffff
  task_seq_get_next+0x53/0x180
  task_file_seq_get_next+0x159/0x220
  task_file_seq_next+0x4f/0xa0
  bpf_seq_read+0x159/0x390
  vfs_read+0x8a/0x140
  ksys_read+0x59/0xd0
  do_syscall_64+0x42/0x110
  entry_SYSCALL_64_after_hwframe+0x44/0xa9


> > If unable to obtain the file structure for the current task,
> > proceed to the next task number after the one returned from
> > task_seq_get_next(), instead of the next task number from the
> > original iterator.
> This seems a correct change. The current code should still work
> but it may do some redundant/unnecessary work in kernel.
> This only happens when a task does not have any file,
> no sure whether this is the culprit for the problem this
> patch tries to address.
> 
> > 
> > Use thread_group_leader() instead of comparing tgid vs pid, which
> > might may be racy.
> 
> I see
> 
> static inline bool thread_group_leader(struct task_struct *p)
> {
>         return p->exit_signal >= 0;
> }
> 
> I am not sure whether thread_group_leader(task) is equivalent
> to task->tgid == task->pid or not. Any documentation or explanation?
> 
> Could you explain why task->tgid != task->pid in the original
> code could be racy?

My understanding is that anything which uses pid_t for comparison
in the kernel is incorrect.  Looking at de_thread(), there is a
section which swaps the pids and tids around, but doesn't seem to
change tgid directly.

There's also this comment in linux/pid.h:
        /*
         * Both old and new leaders may be attached to
         * the same pid in the middle of de_thread().
         */

So the safest thing to do is use the explicit thread_group_leader()
helper rather than trying to open code the test.
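
For illustration only (a sketch, not part of the patch), the two
forms of the check being compared here:

	/* Open-coded test used before the patch: reads two pid_t
	 * fields that de_thread() may be rewriting concurrently.
	 */
	bool leader_open_coded = (task->tgid == task->pid);

	/* Helper from include/linux/sched/signal.h used by the patch:
	 * a single read of task->exit_signal.
	 */
	bool leader_via_helper = thread_group_leader(task);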


> > Only obtain the task reference count at the end of the RCU section
> > instead of repeatedly obtaining/releasing it when iterathing though
> > a thread group.
> 
> I think this is an optimization and not about the correctness.

Yes, but the loop in question can be executed thousands of times, and 
there isn't much point in doing this needless work.  It's unclear 
whether this is a significant time contribution to the RCU stall,
but reducing the amount of refcounting isn't a bad thing.
Yonghong Song Dec. 4, 2020, 6:54 p.m. UTC | #3
On 12/4/20 9:14 AM, Jonathan Lemon wrote:
> On Fri, Dec 04, 2020 at 12:01:53AM -0800, Yonghong Song wrote:
>>
>>
>> On 12/3/20 7:43 PM, Jonathan Lemon wrote:
>>> From: Jonathan Lemon <bsd@fb.com>
>>
>> Could you explain in the commit log what problem this patch
>> tries to solve? What bad things could happen without this patch?
> 
> Without the patch, on a particular set of systems, RCU will repeatedly
> generate stall warnings similar to the trace below.  The common factor
> for all the traces seems to be using task_file_seq_next().  With the
> patch, all the warnings go away.
> 
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu: 	10-....: (20666 ticks this GP) idle=4b6/1/0x4000000000000002 softirq=14346773/14346773 fqs=5064
>   	(t=21013 jiffies g=25395133 q=154147)
>   NMI backtrace for cpu 10
>   #1
>   Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B17 03/16/2018
>   Call Trace:
>    <IRQ>
>    dump_stack+0x50/0x70
>    nmi_cpu_backtrace.cold.6+0x13/0x50
>    ? lapic_can_unplug_cpu.cold.30+0x40/0x40
>    nmi_trigger_cpumask_backtrace+0xba/0xca
>    rcu_dump_cpu_stacks+0x99/0xc7
>    rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
>    ? tick_sched_do_timer+0x60/0x60
>    update_process_times+0x24/0x50
>    tick_sched_timer+0x37/0x70
>    __hrtimer_run_queues+0xfe/0x270
>    hrtimer_interrupt+0xf4/0x210
>    smp_apic_timer_interrupt+0x5e/0x120
>    apic_timer_interrupt+0xf/0x20
>    </IRQ>
>   RIP: 0010:find_ge_pid_upd+0x5/0x20
>   Code: 80 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 d5 d3 9a 00 48 83 c4 08 c3 0f 1f 44 00 00 <48> 89 f8 48 8d 7e 08 48 89 c6 e9 bc d3 9a 00 cc cc cc cc cc cc cc
>   RSP: 0018:ffffc9002b7abdb8 EFLAGS: 00000297 ORIG_RAX: ffffffffffffff13
>   RAX: 00000000002ca5cd RBX: ffff889c44c0ba00 RCX: 0000000000000000
>   RDX: 0000000000000002 RSI: ffffffff8284eb80 RDI: ffffc9002b7abdc4
>   RBP: ffffc9002b7abe0c R08: ffff8895afe93a00 R09: ffff8891388abb50
>   R10: 000000000000000c R11: 00000000002ca600 R12: 000000000000003f
>   R13: ffffffff8284eb80 R14: 0000000000000001 R15: 00000000ffffffff
>    task_seq_get_next+0x53/0x180
>    task_file_seq_get_next+0x159/0x220
>    task_file_seq_next+0x4f/0xa0
>    bpf_seq_read+0x159/0x390
>    vfs_read+0x8a/0x140
>    ksys_read+0x59/0xd0
>    do_syscall_64+0x42/0x110
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> 
>>> If unable to obtain the file structure for the current task,
>>> proceed to the next task number after the one returned from
>>> task_seq_get_next(), instead of the next task number from the
>>> original iterator.
>> This seems a correct change. The current code should still work
>> but it may do some redundant/unnecessary work in kernel.
>> This only happens when a task does not have any file,
>> no sure whether this is the culprit for the problem this
>> patch tries to address.
>>
>>>
>>> Use thread_group_leader() instead of comparing tgid vs pid, which
>>> might may be racy.
>>
>> I see
>>
>> static inline bool thread_group_leader(struct task_struct *p)
>> {
>>          return p->exit_signal >= 0;
>> }
>>
>> I am not sure whether thread_group_leader(task) is equivalent
>> to task->tgid == task->pid or not. Any documentation or explanation?
>>
>> Could you explain why task->tgid != task->pid in the original
>> code could be racy?
> 
> My understanding is that anything which uses pid_t for comparision
> in the kernel is incorrect.  Looking at de_thread(), there is a
> section which swaps the pid and tids around, but doesn't seem to
> change tgid directly.
> 
> There's also this comment in linux/pid.h:
>          /*
>           * Both old and new leaders may be attached to
>           * the same pid in the middle of de_thread().
>           */
> 
> So the safest thing to do is use the explicit thread_group_leader()
> macro rather than trying to open code things.

In de_thread(), I see p->exit_signal (used in thread_group_leader())
has a race as well.

...
                 exchange_tids(tsk, leader);
                 transfer_pid(leader, tsk, PIDTYPE_TGID);
                 transfer_pid(leader, tsk, PIDTYPE_PGID);
                 transfer_pid(leader, tsk, PIDTYPE_SID);

                 list_replace_rcu(&leader->tasks, &tsk->tasks);
                 list_replace_init(&leader->sibling, &tsk->sibling);

                 tsk->group_leader = tsk;
                 leader->group_leader = tsk;

                 tsk->exit_signal = SIGCHLD;
                 leader->exit_signal = -1;

                 BUG_ON(leader->exit_state != EXIT_ZOMBIE);
                 leader->exit_state = EXIT_DEAD;
...

But I am not sure how this race, or the pid race, could impact
task traversal differently.

> 
> 
>>> Only obtain the task reference count at the end of the RCU section
>>> instead of repeatedly obtaining/releasing it when iterathing though
>>> a thread group.
>>
>> I think this is an optimization and not about the correctness.
> 
> Yes, but the loop in question can be executed thousands of times, and
> there isn't much point in doing this needless work.  It's unclear
> whether this is a significant time contribution to the RCU stall,
> but reducing the amount of refcounting isn't a bad thing.

Sure. Totally agree.
Yonghong Song Dec. 9, 2020, 7:02 p.m. UTC | #4
On 12/4/20 9:14 AM, Jonathan Lemon wrote:
> On Fri, Dec 04, 2020 at 12:01:53AM -0800, Yonghong Song wrote:
>>
>>
>> On 12/3/20 7:43 PM, Jonathan Lemon wrote:
>>> From: Jonathan Lemon <bsd@fb.com>
>>
>> Could you explain in the commit log what problem this patch
>> tries to solve? What bad things could happen without this patch?
> 
> Without the patch, on a particular set of systems, RCU will repeatedly
> generate stall warnings similar to the trace below.  The common factor
> for all the traces seems to be using task_file_seq_next().  With the
> patch, all the warnings go away.
> 
>   rcu: INFO: rcu_sched self-detected stall on CPU
>   rcu: 	10-....: (20666 ticks this GP) idle=4b6/1/0x4000000000000002 softirq=14346773/14346773 fqs=5064
>   	(t=21013 jiffies g=25395133 q=154147)
>   NMI backtrace for cpu 10
>   #1
>   Hardware name: Quanta Leopard ORv2-DDR4/Leopard ORv2-DDR4, BIOS F06_3B17 03/16/2018
>   Call Trace:
>    <IRQ>
>    dump_stack+0x50/0x70
>    nmi_cpu_backtrace.cold.6+0x13/0x50
>    ? lapic_can_unplug_cpu.cold.30+0x40/0x40
>    nmi_trigger_cpumask_backtrace+0xba/0xca
>    rcu_dump_cpu_stacks+0x99/0xc7
>    rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
>    ? tick_sched_do_timer+0x60/0x60
>    update_process_times+0x24/0x50
>    tick_sched_timer+0x37/0x70
>    __hrtimer_run_queues+0xfe/0x270
>    hrtimer_interrupt+0xf4/0x210
>    smp_apic_timer_interrupt+0x5e/0x120
>    apic_timer_interrupt+0xf/0x20
>    </IRQ>
>   RIP: 0010:find_ge_pid_upd+0x5/0x20
>   Code: 80 00 00 00 00 0f 1f 44 00 00 48 83 ec 08 89 7c 24 04 48 8d 7e 08 48 8d 74 24 04 e8 d5 d3 9a 00 48 83 c4 08 c3 0f 1f 44 00 00 <48> 89 f8 48 8d 7e 08 48 89 c6 e9 bc d3 9a 00 cc cc cc cc cc cc cc
>   RSP: 0018:ffffc9002b7abdb8 EFLAGS: 00000297 ORIG_RAX: ffffffffffffff13
>   RAX: 00000000002ca5cd RBX: ffff889c44c0ba00 RCX: 0000000000000000
>   RDX: 0000000000000002 RSI: ffffffff8284eb80 RDI: ffffc9002b7abdc4
>   RBP: ffffc9002b7abe0c R08: ffff8895afe93a00 R09: ffff8891388abb50
>   R10: 000000000000000c R11: 00000000002ca600 R12: 000000000000003f
>   R13: ffffffff8284eb80 R14: 0000000000000001 R15: 00000000ffffffff
>    task_seq_get_next+0x53/0x180
>    task_file_seq_get_next+0x159/0x220
>    task_file_seq_next+0x4f/0xa0
>    bpf_seq_read+0x159/0x390
>    vfs_read+0x8a/0x140
>    ksys_read+0x59/0xd0
>    do_syscall_64+0x42/0x110
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Maybe you can post v3 of the patch with the above information in the
commit description so people can better understand what problem
you are trying to solve here?

Also, could you send it to bpf@vger.kernel.org?

> 
> 
>>> If unable to obtain the file structure for the current task,
>>> proceed to the next task number after the one returned from
>>> task_seq_get_next(), instead of the next task number from the
>>> original iterator.
>> This seems a correct change. The current code should still work
>> but it may do some redundant/unnecessary work in kernel.
>> This only happens when a task does not have any file,
>> no sure whether this is the culprit for the problem this
>> patch tries to address.
>>
>>>
>>> Use thread_group_leader() instead of comparing tgid vs pid, which
>>> might may be racy.
>>
>> I see
>>
>> static inline bool thread_group_leader(struct task_struct *p)
>> {
>>          return p->exit_signal >= 0;
>> }
>>
>> I am not sure whether thread_group_leader(task) is equivalent
>> to task->tgid == task->pid or not. Any documentation or explanation?
>>
>> Could you explain why task->tgid != task->pid in the original
>> code could be racy?
> 
> My understanding is that anything which uses pid_t for comparision
> in the kernel is incorrect.  Looking at de_thread(), there is a
> section which swaps the pid and tids around, but doesn't seem to
> change tgid directly.
> 
> There's also this comment in linux/pid.h:
>          /*
>           * Both old and new leaders may be attached to
>           * the same pid in the middle of de_thread().
>           */
> 
> So the safest thing to do is use the explicit thread_group_leader()
> macro rather than trying to open code things.

I did some limited experiments and did not trigger a case where
(task->tgid != task->pid) disagrees with !thread_group_leader().
I will need more tests in the environment that reproduces the warning
to confirm whether this is the culprit or not.
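
Something like the following is the kind of one-off debug check such
an experiment could use (a sketch only, not what was actually run),
dropped into task_seq_get_next() next to the skip_if_dup_files test:

	/* Fire once if the open-coded leader test and the helper
	 * ever disagree for a task we are about to examine.
	 */
	WARN_ONCE(thread_group_leader(task) != (task->tgid == task->pid),
		  "task_iter: leader check mismatch: pid=%d tgid=%d exit_signal=%d\n",
		  task->pid, task->tgid, task->exit_signal);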

> 
> 
>>> Only obtain the task reference count at the end of the RCU section
>>> instead of repeatedly obtaining/releasing it when iterathing though
>>> a thread group.
>>
>> I think this is an optimization and not about the correctness.
> 
> Yes, but the loop in question can be executed thousands of times, and
> there isn't much point in doing this needless work.  It's unclear
> whether this is a significant time contribution to the RCU stall,
> but reducing the amount of refcounting isn't a bad thing.
>
Jonathan Lemon Dec. 11, 2020, 4:30 p.m. UTC | #5
On Wed, Dec 09, 2020 at 11:02:54AM -0800, Yonghong Song wrote:
> 
> 
> Maybe you can post v3 of the patch with the above information in the
> commit description so people can better understand what the problem
> you are trying to solve here?
> 
> Also, could you also send to bpf@vger.kernel.org?

Sure, I can do that.

> > > > If unable to obtain the file structure for the current task,
> > > > proceed to the next task number after the one returned from
> > > > task_seq_get_next(), instead of the next task number from the
> > > > original iterator.
> > > This seems a correct change. The current code should still work
> > > but it may do some redundant/unnecessary work in kernel.
> > > This only happens when a task does not have any file,
> > > no sure whether this is the culprit for the problem this
> > > patch tries to address.
> > > 
> > > > 
> > > > Use thread_group_leader() instead of comparing tgid vs pid, which
> > > > might may be racy.
> > > 
> > > I see
> > > 
> > > static inline bool thread_group_leader(struct task_struct *p)
> > > {
> > >          return p->exit_signal >= 0;
> > > }
> > > 
> > > I am not sure whether thread_group_leader(task) is equivalent
> > > to task->tgid == task->pid or not. Any documentation or explanation?
> > > 
> > > Could you explain why task->tgid != task->pid in the original
> > > code could be racy?
> > 
> > My understanding is that anything which uses pid_t for comparision
> > in the kernel is incorrect.  Looking at de_thread(), there is a
> > section which swaps the pid and tids around, but doesn't seem to
> > change tgid directly.
> > 
> > There's also this comment in linux/pid.h:
> >          /*
> >           * Both old and new leaders may be attached to
> >           * the same pid in the middle of de_thread().
> >           */
> > 
> > So the safest thing to do is use the explicit thread_group_leader()
> > macro rather than trying to open code things.
> 
> I did some limited experiments and did not trigger a case where
> task->tgid != task->pid not agreeing with !thread_group_leader().
> Will need more tests in the environment to reproduce the warning
> to confirm whether this is the culprit or not.

Perhaps, but on the other hand, the splats disappear with this
patch, so it's doing something right.  If your debug code hasn't
detected any cases where thread_group_leader() gives a different
answer, then there shouldn't be any objection to making the
replacement, right?  It does make the code easier to understand
and matches the rest of the kernel.
Yonghong Song Dec. 14, 2020, 7:01 a.m. UTC | #6
On 12/11/20 8:30 AM, Jonathan Lemon wrote:
> On Wed, Dec 09, 2020 at 11:02:54AM -0800, Yonghong Song wrote:
>>
>>
>> Maybe you can post v3 of the patch with the above information in the
>> commit description so people can better understand what the problem
>> you are trying to solve here?
>>
>> Also, could you also send to bpf@vger.kernel.org?
> 
> Sure, I can do that.
> 
>>>>> If unable to obtain the file structure for the current task,
>>>>> proceed to the next task number after the one returned from
>>>>> task_seq_get_next(), instead of the next task number from the
>>>>> original iterator.
>>>> This seems a correct change. The current code should still work
>>>> but it may do some redundant/unnecessary work in kernel.
>>>> This only happens when a task does not have any file,
>>>> no sure whether this is the culprit for the problem this
>>>> patch tries to address.
>>>>
>>>>>
>>>>> Use thread_group_leader() instead of comparing tgid vs pid, which
>>>>> might may be racy.
>>>>
>>>> I see
>>>>
>>>> static inline bool thread_group_leader(struct task_struct *p)
>>>> {
>>>>           return p->exit_signal >= 0;
>>>> }
>>>>
>>>> I am not sure whether thread_group_leader(task) is equivalent
>>>> to task->tgid == task->pid or not. Any documentation or explanation?
>>>>
>>>> Could you explain why task->tgid != task->pid in the original
>>>> code could be racy?
>>>
>>> My understanding is that anything which uses pid_t for comparision
>>> in the kernel is incorrect.  Looking at de_thread(), there is a
>>> section which swaps the pid and tids around, but doesn't seem to
>>> change tgid directly.
>>>
>>> There's also this comment in linux/pid.h:
>>>           /*
>>>            * Both old and new leaders may be attached to
>>>            * the same pid in the middle of de_thread().
>>>            */
>>>
>>> So the safest thing to do is use the explicit thread_group_leader()
>>> macro rather than trying to open code things.
>>
>> I did some limited experiments and did not trigger a case where
>> task->tgid != task->pid not agreeing with !thread_group_leader().
>> Will need more tests in the environment to reproduce the warning
>> to confirm whether this is the culprit or not.
> 
> Perhaps, but on the other hand, the splats disappear with this
> patch, so it's doing something right.  If your debug code hasn't
> detected any cases where thread_group_leader() isn't making a
> difference, then there shouldn't be any objections in making the
> replacement, right?  It does make the code easier to understand
> and matches the rest of the kernel.

Agree. Let me run a few more experiments to double-check we
did not miss anything with this particular change, and I will
report back later.

Patch

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 0458a40edf10..66a52fcf589a 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -33,17 +33,17 @@  static struct task_struct *task_seq_get_next(struct pid_namespace *ns,
 	pid = find_ge_pid(*tid, ns);
 	if (pid) {
 		*tid = pid_nr_ns(pid, ns);
-		task = get_pid_task(pid, PIDTYPE_PID);
+		task = pid_task(pid, PIDTYPE_PID);
 		if (!task) {
 			++*tid;
 			goto retry;
-		} else if (skip_if_dup_files && task->tgid != task->pid &&
+		} else if (skip_if_dup_files && !thread_group_leader(task) &&
 			   task->files == task->group_leader->files) {
-			put_task_struct(task);
 			task = NULL;
 			++*tid;
 			goto retry;
 		}
+		get_task_struct(task);
 	}
 	rcu_read_unlock();
 
@@ -164,7 +164,7 @@  task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info)
 		curr_files = get_files_struct(curr_task);
 		if (!curr_files) {
 			put_task_struct(curr_task);
-			curr_tid = ++(info->tid);
+			curr_tid = curr_tid + 1;
 			info->fd = 0;
 			goto again;
 		}
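
For context, the code path touched above is driven from userspace by
read()s on a pinned task_file bpf_iter link; below is a minimal sketch
of such a reader (the pin path is only an example, and it assumes a
task_file iterator program has already been loaded and pinned, e.g.
with bpftool iter pin):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		ssize_t n;
		/* Example path: wherever the task_file iterator link was pinned. */
		int fd = open("/sys/fs/bpf/task_file_iter", O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Each read() advances the seq_file, which ends up in
		 * task_file_seq_next() and task_seq_get_next() above. */
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			fwrite(buf, 1, n, stdout);
		close(fd);
		return 0;
	}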