tracing: Fix race when concurrently splice_read trace_pipe

Message ID	20230810123905.1531061-1-zhengyejian1@huawei.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <linux-trace-kernel-owner@vger.kernel.org> From: Zheng Yejian <zhengyejian1@huawei.com> To: <rostedt@goodmis.org>, <mhiramat@kernel.org> CC: <laijs@cn.fujitsu.com>, <linux-kernel@vger.kernel.org>, <linux-trace-kernel@vger.kernel.org>, <zhengyejian1@huawei.com> Subject: [PATCH] tracing: Fix race when concurrently splice_read trace_pipe Date: Thu, 10 Aug 2023 20:39:05 +0800 Message-ID: <20230810123905.1531061-1-zhengyejian1@huawei.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII Precedence: bulk
Series	tracing: Fix race when concurrently splice_read trace_pipe \| expand tracing: Fix race when concurrently splice_read trace_pipe

Zheng Yejian Aug. 10, 2023, 12:39 p.m. UTC

When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
there are more data being read out than expected.

The root cause is that in tracing_splice_read_pipe(), an entry is found
outside locks, it may be read by multiple readers or consumed by other
reader as starting printing it.

To fix it, change to find entry after holding locks.

Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
---
 kernel/trace/trace.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Masami Hiramatsu (Google) Aug. 11, 2023, 11:42 a.m. UTC | #1

On Thu, 10 Aug 2023 20:39:05 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
> there are more data being read out than expected.
> 
> The root cause is that in tracing_splice_read_pipe(), an entry is found
> outside locks, it may be read by multiple readers or consumed by other
> reader as starting printing it.
> 
> To fix it, change to find entry after holding locks.
> 
> Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
> ---
>  kernel/trace/trace.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index b8870078ef58..f169d33b948f 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>  	if (ret <= 0)
>  		goto out_err;
>  
> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
> +	trace_event_read_lock();
> +	trace_access_lock(iter->cpu_file);
> +
> +	if (!trace_find_next_entry_inc(iter)) {

It seems you skips '!iter->ent' check. Is there any reason for this change?

Thank you,

> +		trace_access_unlock(iter->cpu_file);
> +		trace_event_read_unlock();
>  		ret = -EFAULT;
>  		goto out_err;
>  	}
>  
> -	trace_event_read_lock();
> -	trace_access_lock(iter->cpu_file);
> -
>  	/* Fill as many pages as possible. */
>  	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
>  		spd.pages[i] = alloc_page(GFP_KERNEL);
> -- 
> 2.25.1
>

Zheng Yejian Aug. 11, 2023, 12:37 p.m. UTC | #2

On 2023/8/11 19:42, Masami Hiramatsu (Google) wrote:
> On Thu, 10 Aug 2023 20:39:05 +0800
> Zheng Yejian <zhengyejian1@huawei.com> wrote:
> 
>> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
>> there are more data being read out than expected.
>>
>> The root cause is that in tracing_splice_read_pipe(), an entry is found
>> outside locks, it may be read by multiple readers or consumed by other
>> reader as starting printing it.
>>
>> To fix it, change to find entry after holding locks.
>>
>> Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
>> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
>> ---
>>   kernel/trace/trace.c | 10 ++++++----
>>   1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>> index b8870078ef58..f169d33b948f 100644
>> --- a/kernel/trace/trace.c
>> +++ b/kernel/trace/trace.c
>> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>>   	if (ret <= 0)
>>   		goto out_err;
>>   
>> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
>> +	trace_event_read_lock();
>> +	trace_access_lock(iter->cpu_file);
>> +
>> +	if (!trace_find_next_entry_inc(iter)) {
> 
> It seems you skips '!iter->ent' check. Is there any reason for this change?

IIUC, 'iter->ent' may be the entry that was found but not consumed
in last call tracing_splice_read_pipe(), and in this call, 'iter->ent'
may have being consumed, so we may should find a new 'iter->ent' before
printing it in tracing_fill_pipe_page(), see following reduced codes:

   tracing_splice_read_pipe() {
     if (!iter->ent && !trace_find_next_entry_inc(iter)) {  // 1. find 
entry here
         ... ...
     }
     tracing_fill_pipe_page() {
       for (;;) {
         ... ...
         ret = print_trace_line(iter);  // 2. print entry
         ... ...
         if (!trace_find_next_entry_inc()) {  // 3. find next entry
           ... ...
           break;
         }
       }

--

Thanks,
Zheng Yejian

> 
> Thank you,
> 
>> +		trace_access_unlock(iter->cpu_file);
>> +		trace_event_read_unlock();
>>   		ret = -EFAULT;
>>   		goto out_err;
>>   	}
>>   
>> -	trace_event_read_lock();
>> -	trace_access_lock(iter->cpu_file);
>> -
>>   	/* Fill as many pages as possible. */
>>   	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
>>   		spd.pages[i] = alloc_page(GFP_KERNEL);
>> -- 
>> 2.25.1
>>
> 
>

Steven Rostedt Aug. 11, 2023, 7:24 p.m. UTC | #3

On Fri, 11 Aug 2023 20:37:07 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> On 2023/8/11 19:42, Masami Hiramatsu (Google) wrote:
> > On Thu, 10 Aug 2023 20:39:05 +0800
> > Zheng Yejian <zhengyejian1@huawei.com> wrote:
> >   
> >> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
> >> there are more data being read out than expected.
> >>
> >> The root cause is that in tracing_splice_read_pipe(), an entry is found
> >> outside locks, it may be read by multiple readers or consumed by other
> >> reader as starting printing it.
> >>
> >> To fix it, change to find entry after holding locks.
> >>
> >> Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
> >> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
> >> ---
> >>   kernel/trace/trace.c | 10 ++++++----
> >>   1 file changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> >> index b8870078ef58..f169d33b948f 100644
> >> --- a/kernel/trace/trace.c
> >> +++ b/kernel/trace/trace.c
> >> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
> >>   	if (ret <= 0)
> >>   		goto out_err;
> >>   
> >> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
> >> +	trace_event_read_lock();
> >> +	trace_access_lock(iter->cpu_file);
> >> +
> >> +	if (!trace_find_next_entry_inc(iter)) {  
> > 
> > It seems you skips '!iter->ent' check. Is there any reason for this change?  
> 
> IIUC, 'iter->ent' may be the entry that was found but not consumed
> in last call tracing_splice_read_pipe(), and in this call, 'iter->ent'
> may have being consumed, so we may should find a new 'iter->ent' before
> printing it in tracing_fill_pipe_page(), see following reduced codes:

And if it wasn't consumed? We just lost it?

> 
>    tracing_splice_read_pipe() {
>      if (!iter->ent && !trace_find_next_entry_inc(iter)) {  // 1. find 
> entry here
>          ... ...
>      }
>      tracing_fill_pipe_page() {
>        for (;;) {
>          ... ...
>          ret = print_trace_line(iter);  // 2. print entry
>          ... ...

You missed:

           count = trace_seq_used(&iter->seq) - save_len;
           if (rem < count) {
                rem = 0;
                iter->seq.seq.len = save_len;

Where the above just threw away what was printed in the above
"print_trace_line()", and it never went to console.

                break;
           }

-- Steve


>          if (!trace_find_next_entry_inc()) {  // 3. find next entry
>            ... ...
>            break;
>          }
>        }
> 
> --
> 
> Thanks,
> Zheng Yejian
> 
> > 
> > Thank you,
> >   
> >> +		trace_access_unlock(iter->cpu_file);
> >> +		trace_event_read_unlock();
> >>   		ret = -EFAULT;
> >>   		goto out_err;
> >>   	}
> >>   
> >> -	trace_event_read_lock();
> >> -	trace_access_lock(iter->cpu_file);
> >> -
> >>   	/* Fill as many pages as possible. */
> >>   	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
> >>   		spd.pages[i] = alloc_page(GFP_KERNEL);
> >> -- 
> >> 2.25.1
> >>  
> > 
> >

Steven Rostedt Aug. 11, 2023, 7:25 p.m. UTC | #4

On Thu, 10 Aug 2023 20:39:05 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
> there are more data being read out than expected.

Honestly the real fix is to prevent that use case. We should probably have
access to trace_pipe lock all the per_cpu trace_pipes too.

-- Steve

Zheng Yejian Aug. 12, 2023, 1:45 a.m. UTC | #5

On 2023/8/12 03:25, Steven Rostedt wrote:
> On Thu, 10 Aug 2023 20:39:05 +0800
> Zheng Yejian <zhengyejian1@huawei.com> wrote:
> 
>> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
>> there are more data being read out than expected.

Sorry, I didn't make clear here. It not just read more but also lost
some data. My case is that, for example:
   1) Inject 3 events into ring_buffer: event1, event2, event3;
   2) Concurrently splice_read through trace_pipes;
   3) Then actually read out: event1, event3, event3. No event2, but 2 
event3.

> 
> Honestly the real fix is to prevent that use case. We should probably have
> access to trace_pipe lock all the per_cpu trace_pipes too.

Yes, we could do that, but would it seem not that effective?
because per_cpu trace_pipe only read its own ring_buffer and not race
with ring_buffers in other cpus.

> 
> -- Steve
>

Zheng Yejian Aug. 12, 2023, 2:22 a.m. UTC | #6

On 2023/8/12 03:24, Steven Rostedt wrote:
> On Fri, 11 Aug 2023 20:37:07 +0800
> Zheng Yejian <zhengyejian1@huawei.com> wrote:
> 
>> On 2023/8/11 19:42, Masami Hiramatsu (Google) wrote:
>>> On Thu, 10 Aug 2023 20:39:05 +0800
>>> Zheng Yejian <zhengyejian1@huawei.com> wrote:
>>>    
>>>> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
>>>> there are more data being read out than expected.
>>>>
>>>> The root cause is that in tracing_splice_read_pipe(), an entry is found
>>>> outside locks, it may be read by multiple readers or consumed by other
>>>> reader as starting printing it.
>>>>
>>>> To fix it, change to find entry after holding locks.
>>>>
>>>> Fixes: 7e53bd42d14c ("tracing: Consolidate protection of reader access to the ring buffer")
>>>> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
>>>> ---
>>>>    kernel/trace/trace.c | 10 ++++++----
>>>>    1 file changed, 6 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>>>> index b8870078ef58..f169d33b948f 100644
>>>> --- a/kernel/trace/trace.c
>>>> +++ b/kernel/trace/trace.c
>>>> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
>>>>    	if (ret <= 0)
>>>>    		goto out_err;
>>>>    
>>>> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
>>>> +	trace_event_read_lock();
>>>> +	trace_access_lock(iter->cpu_file);
>>>> +
>>>> +	if (!trace_find_next_entry_inc(iter)) {
>>>
>>> It seems you skips '!iter->ent' check. Is there any reason for this change?
>>
>> IIUC, 'iter->ent' may be the entry that was found but not consumed
>> in last call tracing_splice_read_pipe(), and in this call, 'iter->ent'
>> may have being consumed, so we may should find a new 'iter->ent' before
>> printing it in tracing_fill_pipe_page(), see following reduced codes:
> 
> And if it wasn't consumed? We just lost it?

If 'iter->ent' was not consumed, trace_find_next_entry_inc() will find
it again, will it?

-- Zheng Yejian

> 
>>
>>     tracing_splice_read_pipe() {
>>       if (!iter->ent && !trace_find_next_entry_inc(iter)) {  // 1. find
>> entry here
>>           ... ...
>>       }
>>       tracing_fill_pipe_page() {
>>         for (;;) {
>>           ... ...
>>           ret = print_trace_line(iter);  // 2. print entry
>>           ... ...
> 
> You missed:
> 
>             count = trace_seq_used(&iter->seq) - save_len;
>             if (rem < count) {
>                  rem = 0;
>                  iter->seq.seq.len = save_len;
> 
> Where the above just threw away what was printed in the above
> "print_trace_line()", and it never went to console.
> 
>                  break;
>             }
> 

Thanks for pointing this out!

-- Zheng Yejian

> -- Steve
> 
> 
>>           if (!trace_find_next_entry_inc()) {  // 3. find next entry
>>             ... ...
>>             break;
>>           }
>>         }
>>
>> --
>>
>> Thanks,
>> Zheng Yejian
>>
>>>
>>> Thank you,
>>>    
>>>> +		trace_access_unlock(iter->cpu_file);
>>>> +		trace_event_read_unlock();
>>>>    		ret = -EFAULT;
>>>>    		goto out_err;
>>>>    	}
>>>>    
>>>> -	trace_event_read_lock();
>>>> -	trace_access_lock(iter->cpu_file);
>>>> -
>>>>    	/* Fill as many pages as possible. */
>>>>    	for (i = 0, rem = len; i < spd.nr_pages_max && rem; i++) {
>>>>    		spd.pages[i] = alloc_page(GFP_KERNEL);
>>>> -- 
>>>> 2.25.1
>>>>   
>>>
>>>    
> 
>

Zheng Yejian Aug. 12, 2023, 7:38 a.m. UTC | #7

On 2023/8/12 03:25, Steven Rostedt wrote:
> On Thu, 10 Aug 2023 20:39:05 +0800
> Zheng Yejian <zhengyejian1@huawei.com> wrote:
> 
>> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
>> there are more data being read out than expected.
> 
> Honestly the real fix is to prevent that use case. We should probably have
> access to trace_pipe lock all the per_cpu trace_pipes too.
> 
> -- Steve
> 

Hi~

Reproduction testcase is show as below, it can always reproduce the
issue in v5.10, and after this patch, the testcase passed.

In v5.10, when run `cat trace_pipe > /tmp/myfile &`, it call
sendfile() to transmit data from trace_pipe into /tmp/myfile. And in
kernel, .splice_read() of trace_pipe is called then the issue is
reproduced.

However in the newest v6.5, this reproduction case didn't run into the
.splice_read() of trace_pipe, because after commit 97ef77c52b78 ("fs:
check FMODE_LSEEK to control internal pipe splicing"), non-seekable
trace_pipe cannot be sendfile-ed.

``` repro.sh
#!/bin/bash


do_test()
{
         local trace_dir=/sys/kernel/tracing
         local trace=${trace_dir}/trace
         local old_trace_lines
         local new_trace_lines
         local tempfiles
         local testlog="trace pipe concurrency issue"
         local pipe_pids
         local i
         local write_cnt=1000
         local read_cnt=0
         local nr_cpu=`nproc`

         # 1. At first, clear all ring buffer
         echo > ${trace}

         # 2. Count how many lines in trace file now
         old_trace_lines=`cat ${trace} | wc -l`

         # 3. Close water mark so that reader can read as event comes
         echo 0 > ${trace_dir}/buffer_percent

         # 4. Read percpu trace_pipes into local file on background.
         #    Splice read must be used under command 'cat' so that the racy
         #    issue can be reproduced !!!
         i=0
         while [ ${i} -lt ${nr_cpu} ]; do
                 tempfiles[${i}]=/tmp/percpu_trace_pipe_${i}
                 cat ${trace_dir}/per_cpu/cpu${i}/trace_pipe > 
${tempfiles[${i}]} &
                 pipe_pids[${i}]=$!
                 let i=i+1
         done

         # 5. Read main trace_pipe into local file on background.
         #    The same, splice read must be used to reproduce the issue !!!
         tempfiles[${i}]=/tmp/main_trace_pipe
         cat ${trace_dir}/trace_pipe > ${tempfiles[${i}]} &
         pipe_pids[${i}]=$!

         echo "Take a break, let readers run."
         sleep 3

         # 6. Write events into ring buffer through trace_marker, so that
         #    hungry readers start racing these events.
         i=0
         while [ ${i} -lt ${write_cnt} ]; do
                 echo "${testlog} <${i}>" > ${trace_dir}/trace_marker
                 let i=i+1
         done

         # 7. Wait until all events being consumed
         new_trace_lines=`cat ${trace} | wc -l`
         while [ "${new_trace_lines}" != "${old_trace_lines}" ]; do
                 new_trace_lines=`cat ${trace} | wc -l`
                 sleep 1
         done
         echo "All written events have been consumed."

         # 8. Kill all readers and count the events readed
         i=0
         while [ ${i} -lt ${#pipe_pids[*]} ]; do
                 local num

                 kill -9 ${pipe_pids[${i}]}
                 wait ${pipe_pids[${i}]}
                 num=`cat ${tempfiles[${i}]} | grep "${testlog}" | wc -l`
                 let read_cnt=read_cnt+num
                 let i=i+1
         done

         # 9. Expect to read events as much as write
         if [ "${read_cnt}" != "${write_cnt}" ]; then
                 echo "Test fail: write ${write_cnt} but read 
${read_cnt} !!!"
                 return 1
         fi

         # 10. Clean temp files if test success
         i=0
         while [ ${i} -lt ${#tempfiles[*]} ]; do
                 rm ${tempfiles[${i}]}
                 let i=i+1
         done
         return 0
}

do_test
```

-- Zheng Yejian

Masami Hiramatsu (Google) Aug. 12, 2023, 8:47 p.m. UTC | #8

On Sat, 12 Aug 2023 09:45:52 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> On 2023/8/12 03:25, Steven Rostedt wrote:
> > On Thu, 10 Aug 2023 20:39:05 +0800
> > Zheng Yejian <zhengyejian1@huawei.com> wrote:
> > 
> >> When concurrently splice_read file trace_pipe and per_cpu/cpu*/trace_pipe,
> >> there are more data being read out than expected.
> 
> Sorry, I didn't make clear here. It not just read more but also lost
> some data. My case is that, for example:
>    1) Inject 3 events into ring_buffer: event1, event2, event3;
>    2) Concurrently splice_read through trace_pipes;
>    3) Then actually read out: event1, event3, event3. No event2, but 2 
> event3.
> 
> > 
> > Honestly the real fix is to prevent that use case. We should probably have
> > access to trace_pipe lock all the per_cpu trace_pipes too.
> 
> Yes, we could do that, but would it seem not that effective?
> because per_cpu trace_pipe only read its own ring_buffer and not race
> with ring_buffers in other cpus.

I think Steve said that only one of below is usable.

- Read trace_pipe

or

- Read per_cpu/cpu*/trace_pipe concurrently

And I think this makes sence, especially if you use splice (this *moves*
the page from the ring_buffer to other pipe).

Thank you,


> 
> > 
> > -- Steve
> > 
>

Steven Rostedt Aug. 12, 2023, 9:08 p.m. UTC | #9

On Sat, 12 Aug 2023 10:22:43 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> > And if it wasn't consumed? We just lost it?  
> 
> If 'iter->ent' was not consumed, trace_find_next_entry_inc() will find
> it again, will it?

No it will not, because it did the "inc". That means the next time it
is called, it will find the next item to read, thinking it already
"consumed" the last one.

-- Steve

Steven Rostedt Aug. 13, 2023, 1:13 a.m. UTC | #10

On Sat, 12 Aug 2023 15:38:12 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

> Reproduction testcase is show as below, it can always reproduce the
> issue in v5.10, and after this patch, the testcase passed.
> 
> In v5.10, when run `cat trace_pipe > /tmp/myfile &`, it call
> sendfile() to transmit data from trace_pipe into /tmp/myfile. And in
> kernel, .splice_read() of trace_pipe is called then the issue is
> reproduced.
> 
> However in the newest v6.5, this reproduction case didn't run into the
> .splice_read() of trace_pipe, because after commit 97ef77c52b78 ("fs:
> check FMODE_LSEEK to control internal pipe splicing"), non-seekable
> trace_pipe cannot be sendfile-ed.

So the test case cannot be run because the "sendfile" on the
trace_pipe now fails?

I'm not sure if this is considered a user space regression or not, but
I figured I'd let the interested parties know.

I don't know what tools out there records trace_pipe via sendfile, but
there might be some.

-- Steve

Linus Torvalds Aug. 13, 2023, 4:41 p.m. UTC | #11

On Sat, 12 Aug 2023 at 18:13, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> So the test case cannot be run because the "sendfile" on the
> trace_pipe now fails?

So sendfile() has always required a seekable source, because it -
intentionally - doesn't do the "splice to pipe and then splice from
pipe to destination".

It just does a "splice from source, splice result to destination".

That sounds like "obviously the right thing to do", but it means that
there is now no buffer with any extended lifetime between the two
operations. And that's a big deal.

 Without that buffer (ie pipe) in the middle, if the splice to
destination fails - or is partial - and the system call is interrupted
by a signal, then the source splice data is now gone, gone, gone.

So it only works if the source can then re-create the data - ie if the
source is seekable.

In that case, if the destination cannot accept all the data, the
sendfile can just go back and read again from the last successfully
written position.

And if you are a data stream that can't seek, then that "from last
successfully written position" doesn't work, and any partial writes
will just have dropped the data.

This seekability test *used* to be to test that the source was either
a regular file or a block device.

Now it literally checks "can I seek".

It looks like the trace_pipe fiel is *claiming* to be a regular file -
so sendfile() used to be ok with it - but cannot actually seek - so
sendfile would silently drop data.

Now sendfile says "I'm sorry, Dave, I'm afraid I can't do that" rather
than silently causing data loss.

Now, this is not a user-visible regression in this case, because "cat"
will always fall back on just regular read/write when sendfile fails.

So all that changed for 'trace_pipe' is that a buggy splice now no
longer triggers the bug that it used to (which the patch in question
was also trying to fix).

End result: things in many ways work better this way.

So it really looks like it never really worked before either. There
was *both* the dropped data bug because "trace_pipe" lied about being
a regular file, *and* apparently this trace_pipe race bug that Zheng
Yejian tried to fix.

Of course, some destinations always accept a full write, so maybe we
could relax the "source must be seekable" to be "source must be
seekable, _OR_ destination must be something that never returns
partial writes".

So sendfile to a POSIX-compliant regular file could always work
(ignoring filesystem full situations, and at that point I think we can
say "yeah, we're done, no recovering from that in sendfile()").

So if somebody really *really* want sendfile to work for this case, then you

 (a) do need to fix the race in tracing_splice_read_pipe (which you
should do anyway, since you can obviously always use splice() itself,
not sendfile()).

AND

 (b) figure out when 'splice_write()' will always succeed fully and
convince people that we can do that "extended version" of sendfile().

Hmm?

               Linus

Steven Rostedt Aug. 14, 2023, 8:16 p.m. UTC | #12

On Sun, 13 Aug 2023 09:41:33 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So if somebody really *really* want sendfile to work for this case, then you
> 
>  (a) do need to fix the race in tracing_splice_read_pipe (which you
> should do anyway, since you can obviously always use splice() itself,
> not sendfile()).
> 
> AND
> 
>  (b) figure out when 'splice_write()' will always succeed fully and
> convince people that we can do that "extended version" of sendfile().
> 

No big deal. I really don't care about splicing trace_pipe anyway. That was
added by others, but the trace_pipe_raw is what really should be used.

Zheng's race needs to be fixed regardless, but I just wanted to make sure
this wasn't some kind of hidden regression, as there were patches to
trace_pipe to make sendfile get fixed in the past.

a29054d9478d0 ("tracing: Fix crash from reading trace_pipe with sendfile")

But that too was triggered by cat. If cat no longer uses sendfile for
trace_pipe, I'm fine with it.

-- Steve

Steven Rostedt Aug. 16, 2023, 7:23 p.m. UTC | #13

On Sat, 12 Aug 2023 10:22:43 +0800
Zheng Yejian <zhengyejian1@huawei.com> wrote:

>
> >>>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> >>>> index b8870078ef58..f169d33b948f 100644
> >>>> --- a/kernel/trace/trace.c
> >>>> +++ b/kernel/trace/trace.c
> >>>> @@ -7054,14 +7054,16 @@ static ssize_t tracing_splice_read_pipe(struct file *filp,
> >>>>    	if (ret <= 0)
> >>>>    		goto out_err;
> >>>>    
> >>>> -	if (!iter->ent && !trace_find_next_entry_inc(iter)) {
> >>>> +	trace_event_read_lock();
> >>>> +	trace_access_lock(iter->cpu_file);
> >>>> +
> >>>> +	if (!trace_find_next_entry_inc(iter)) {  
> >>>
> >>> It seems you skips '!iter->ent' check. Is there any reason for this change?  
> >>
> >> IIUC, 'iter->ent' may be the entry that was found but not consumed
> >> in last call tracing_splice_read_pipe(), and in this call, 'iter->ent'
> >> may have being consumed, so we may should find a new 'iter->ent' before
> >> printing it in tracing_fill_pipe_page(), see following reduced codes:  
> > 
> > And if it wasn't consumed? We just lost it?  
> 
> If 'iter->ent' was not consumed, trace_find_next_entry_inc() will find
> it again, will it?
> 
> -- Zheng Yejian
> 
> >   
> >>
> >>     tracing_splice_read_pipe() {
> >>       if (!iter->ent && !trace_find_next_entry_inc(iter)) {  // 1. find
> >> entry here
> >>           ... ...
> >>       }
> >>       tracing_fill_pipe_page() {
> >>         for (;;) {
> >>           ... ...
> >>           ret = print_trace_line(iter);  // 2. print entry
> >>           ... ...  
> > 
> > You missed:
> > 
> >             count = trace_seq_used(&iter->seq) - save_len;
> >             if (rem < count) {
> >                  rem = 0;
> >                  iter->seq.seq.len = save_len;
> > 
> > Where the above just threw away what was printed in the above
> > "print_trace_line()", and it never went to console.
> > 
> >                  break;
> >             }
> >   
> 
> Thanks for pointing this out!

Just to get this moving again, I believe we should add a ref count to
trace_pipe and the per_cpu trace_pipes, where if they are opened, nothing else can read it.

Opening trace_pipe locks all per_cpu ref counts, if any of them are open,
then the trace_pipe open will fail (and releases any ref counts it had
taken).

Opening a per_cpu trace_pipe will up the ref count for just that CPU
buffer. This will allow multiple tasks to read different per_cpu trace_pipe
files, but will prevent the main trace_pipe file from being opened.

Does that work for this?

-- Steve

tracing: Fix race when concurrently splice_read trace_pipe

Commit Message

Comments

Patch