
splice: infinite busy loop lockup bug

Message ID 29de15ff-15e9-5c52-cf87-e0ebdfa1a001@I-love.SAKURA.ne.jp (mailing list archive)
State New, archived
Series splice: infinite busy loop lockup bug

Commit Message

Tetsuo Handa Aug. 7, 2020, 10:35 a.m. UTC
syzbot is reporting a hung task at pipe_release() [1], because for_each_bvec() from
iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
call_write_iter() from do_iter_readv_writev() from do_iter_write() from
vfs_iter_write() from iter_file_splice_write() falls into an infinite busy loop
with pipe->mutex held.

The reason for falling into the infinite busy loop is that iter_file_splice_write()
for some reason generates a "struct bio_vec" entry with .bv_len=0 and .bv_offset=0,
while for_each_bvec() cannot handle .bv_len == 0.
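
To illustrate the mechanism, here is a simplified userspace model (an illustration
only, not the kernel code): the iterator advances by the length of the segment it
just produced, so once it reaches an entry with .bv_len == 0 it requests a zero-byte
advance, nothing changes, and the same empty entry is produced forever.

----------
#include <stdio.h>

struct bio_vec { unsigned int bv_len; };
struct bvec_iter { unsigned int bi_size; unsigned int bi_idx; };

/*
 * Crude stand-in for bvec_iter_advance(): consume "bytes" and step past
 * fully consumed segments.  With bytes == 0 it is a no-op.
 */
static void advance(const struct bio_vec *bv, struct bvec_iter *iter,
		    unsigned int bytes)
{
	iter->bi_size -= bytes;
	while (bytes && bytes >= bv[iter->bi_idx].bv_len) {
		bytes -= bv[iter->bi_idx].bv_len;
		iter->bi_idx++;
	}
}

int main(void)
{
	/* Same shape as the lockup case: array[0] is the empty page. */
	struct bio_vec array[2] = { { .bv_len = 0 }, { .bv_len = 4096 } };
	struct bvec_iter iter = { .bi_size = 4096, .bi_idx = 0 };
	int rounds = 0;

	while (iter.bi_size) {		/* for_each_bvec() loop condition */
		unsigned int len = array[iter.bi_idx].bv_len;

		advance(array, &iter, len);	/* len == 0 => no progress */
		if (++rounds > 16) {
			printf("stuck on array[%u] forever\n", iter.bi_idx);
			return 1;
		}
	}
	printf("done after %d rounds\n", rounds);
	return 0;
}
----------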


When splice() from the pipe to the file works:

[   31.704915][ T6552] Starting vfs_write_iter from.type=17 from.iov_offset=0 from.count=4096 n=1 sd.total_len=65504 left=61408
[   31.709098][ T6552]   array[0]: bv_page=ffffea000870a7c0 bv_len=4096 bv_offset=0

When splice() from the pipe to the file falls into the infinite busy loop:

[   31.717178][ T6553] Starting vfs_write_iter from.type=17 from.iov_offset=0 from.count=4096 n=2 sd.total_len=65504 left=61408
[   31.720983][ T6553]   array[0]: bv_page=ffffea0008706680 bv_len=0 bv_offset=0
[   31.723565][ T6553]   array[1]: bv_page=ffffea00086f4e80 bv_len=4096 bv_offset=0

Is it normal behavior that an empty page is linked to the pipe's array?
If yes, don't we need to skip empty pages when iter_file_splice_write() fills in "struct bio_vec *array"?
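
For example, something along these lines might do (a hypothetical, untested sketch
only, not the actual fs/splice.c loop and not necessarily the right fix; the local
names array/n/nbufs/left/tail/mask are assumed to match those used by
iter_file_splice_write()):

----------
	/*
	 * Hypothetical sketch: skip empty pipe buffers while building the
	 * bio_vec array, so the iterator never sees bv_len == 0.
	 */
	while (!pipe_empty(pipe->head, tail) && left && n < nbufs) {
		struct pipe_buffer *buf = &pipe->bufs[tail & mask];
		size_t this_len = min_t(size_t, buf->len, left);

		tail++;
		if (!this_len)
			continue;	/* empty page: emit no bio_vec for it */

		array[n].bv_page = buf->page;
		array[n].bv_offset = buf->offset;
		array[n].bv_len = this_len;
		left -= this_len;
		n++;
	}
----------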

[1] https://syzkaller.appspot.com/bug?id=2ccac875e85dc852911a0b5b788ada82dc0a081e

Comments

Al Viro Aug. 7, 2020, 12:27 p.m. UTC | #1
On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> with pipe->mutex held.
> 
> The reason of falling into infinite busy loop is that iter_file_splice_write()
> for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> while for_each_bvec() cannot handle .bv_len == 0.

broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
unless I'm misreading it...

Zero-length segments are not disallowed; it's not all that hard to filter them
out in iter_file_splice_write(), but the intent had always been to have
iterate_all_kinds() et.al. able to cope with those.

How are these pipe_buffers with ->len == 0 generated in that reproducer, BTW?
There might be something else fishy going on...
Tetsuo Handa Aug. 7, 2020, 12:34 p.m. UTC | #2
On 2020/08/07 21:27, Al Viro wrote:
> On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
>> syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
>> iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
>> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
>> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
>> vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
>> with pipe->mutex held.
>>
>> The reason of falling into infinite busy loop is that iter_file_splice_write()
>> for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
>> while for_each_bvec() cannot handle .bv_len == 0.
> 
> broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> unless I'm misreading it...
> 
> Zero-length segments are not disallowed; it's not all that hard to filter them
> out in iter_file_splice_write(), but the intent had always been to have
> iterate_all_kinds() et.al. able to cope with those.
> 
> How are these pipe_buffers with ->len == 0 generated in that reproducer, BTW?
> There might be something else fishy going on...
> 

OK. Indeed, a write() to an empty pipe which returns -EFAULT allows an empty
page to be linked to the pipe's array.

Now, I've just found a simple reproducer, and confirmed that this bug is
a local lockup DoS by an unprivileged user. Please fix.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        static char buffer[4096];
        const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
        int pipe_fd[2] = { EOF, EOF };
        pipe(pipe_fd);
        /* Fails with -EFAULT, but leaves an empty page linked to the pipe. */
        write(pipe_fd[1], NULL, 4096);
        /* Queues a normal 4096-byte buffer behind the empty one. */
        write(pipe_fd[1], buffer, 4096);
        /* Spins forever in iter_file_splice_write() while draining the pipe. */
        splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
        return 0;
}
----------

[  125.598898][    C0] rcu: INFO: rcu_sched self-detected stall on CPU
[  125.601072][    C0] rcu: 	0-....: (20171 ticks this GP) idle=526/1/0x4000000000000000 softirq=7918/7918 fqs=5136 
[  125.604874][    C0] 	(t=21006 jiffies g=9341 q=30)
[  125.606512][    C0] NMI backtrace for cpu 0
[  125.607931][    C0] CPU: 0 PID: 2792 Comm: a.out Not tainted 5.8.0+ #793
[  125.610948][    C0] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
[  125.614938][    C0] Call Trace:
[  125.616049][    C0]  <IRQ>
[  125.617010][    C0]  dump_stack+0x5e/0x7a
[  125.618370][    C0]  nmi_cpu_backtrace.cold.7+0x14/0x52
[  125.620148][    C0]  ? lapic_can_unplug_cpu.cold.39+0x3a/0x3a
[  125.622074][    C0]  nmi_trigger_cpumask_backtrace+0x92/0x9f
[  125.624154][    C0]  arch_trigger_cpumask_backtrace+0x14/0x20
[  125.626102][    C0]  rcu_dump_cpu_stacks+0xa0/0xd0
[  125.627919][    C0]  rcu_sched_clock_irq.cold.95+0x121/0x39c
[  125.629833][    C0]  ? acct_account_cputime+0x17/0x20
[  125.631534][    C0]  ? account_system_index_time+0x8a/0xa0
[  125.633422][    C0]  update_process_times+0x23/0x60
[  125.635070][    C0]  tick_sched_handle.isra.22+0x20/0x60
[  125.636870][    C0]  tick_sched_timer+0x68/0x80
[  125.638403][    C0]  ? tick_sched_handle.isra.22+0x60/0x60
[  125.640588][    C0]  __hrtimer_run_queues+0xf9/0x1a0
[  125.642591][    C0]  hrtimer_interrupt+0xfc/0x210
[  125.645033][    C0]  __sysvec_apic_timer_interrupt+0x4c/0x60
[  125.647292][    C0]  asm_call_on_stack+0xf/0x20
[  125.649192][    C0]  </IRQ>
[  125.650501][    C0]  sysvec_apic_timer_interrupt+0x75/0x80
[  125.652900][    C0]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  125.655487][    C0] RIP: 0010:iov_iter_copy_from_user_atomic+0x19b/0x350
[  125.658124][    C0] Code: 89 45 d0 48 c1 e6 06 48 03 37 4d 8d 3c 09 4c 89 cf e8 d9 e5 ff ff 48 8b 45 d0 45 39 eb 0f 87 35 01 00 00 49 8b 4a 18 4d 89 f9 <45> 29 dd 45 01 d8 75 12 eb 19 41 83 c4 01 41 29 c0 74 10 44 89 e0
[  125.666132][    C0] RSP: 0018:ffffa6cdc1237aa8 EFLAGS: 00000246
[  125.668557][    C0] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff945035a25100
[  125.671576][    C0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff945035a25100
[  125.674851][    C0] RBP: ffffa6cdc1237ad8 R08: 0000000000000000 R09: ffff945028a80000
[  125.677989][    C0] R10: ffffa6cdc1237de0 R11: 0000000000000000 R12: 0000000000000000
[  125.680990][    C0] R13: 0000000000001000 R14: 0000000000001000 R15: 0000000000001000
[  125.684006][    C0]  iomap_write_actor+0xbe/0x190
[  125.685982][    C0]  ? iomap_write_begin+0x460/0x460
[  125.688031][    C0]  iomap_apply+0xf4/0x1a0
[  125.689810][    C0]  ? iomap_write_begin+0x460/0x460
[  125.691826][    C0]  iomap_file_buffered_write+0x69/0x90
[  125.698598][    C0]  ? iomap_write_begin+0x460/0x460
[  125.705341][    C0]  xfs_file_buffered_aio_write+0xc2/0x2c0
[  125.707780][    C0]  xfs_file_write_iter+0xa3/0xc0
[  125.709802][    C0]  do_iter_readv_writev+0x15b/0x1c0
[  125.712496][    C0]  do_iter_write+0x81/0x190
[  125.715245][    C0]  vfs_iter_write+0x14/0x20
[  125.717221][    C0]  iter_file_splice_write+0x288/0x3e0
[  125.719340][    C0]  do_splice_from+0x1a/0x40
[  125.721175][    C0]  do_splice+0x2e5/0x620
[  125.722950][    C0]  __x64_sys_splice+0x97/0x100
[  125.724937][    C0]  do_syscall_64+0x31/0x40
[  125.726766][    C0]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  125.729388][    C0] RIP: 0033:0x7f7515dd91c3
[  125.731246][    C0] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 83 3d fd dd 2c 00 00 75 13 49 89 ca b8 13 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b d4 00 00 48 89 04 24
[  125.749632][    C0] RSP: 002b:00007ffd553cde18 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
[  125.753713][    C0] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f7515dd91c3
[  125.756867][    C0] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000004
[  125.759872][    C0] RBP: 0000000000000000 R08: 0000000000010000 R09: 0000000000000000
[  125.763023][    C0] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004005a6
[  125.766225][    C0] R13: 00007ffd553cdf10 R14: 0000000000000000 R15: 0000000000000000
Al Viro Aug. 7, 2020, 12:38 p.m. UTC | #3
On Fri, Aug 07, 2020 at 01:27:27PM +0100, Al Viro wrote:
> On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> > syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> > iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> > ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> > call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> > vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> > with pipe->mutex held.
> > 
> > The reason of falling into infinite busy loop is that iter_file_splice_write()
> > for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> > while for_each_bvec() cannot handle .bv_len == 0.
> 
> broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> unless I'm misreading it...
> 
> Zero-length segments are not disallowed; it's not all that hard to filter them
> out in iter_file_splice_write(), but the intent had always been to have
> iterate_all_kinds() et.al. able to cope with those.
> 
> How are these pipe_buffers with ->len == 0 generated in that reproducer, BTW?
> There might be something else fishy going on...

FWIW, my preference would be to have for_each_bvec() advance past zero-length
segments; I'll need to go through its uses elsewhere in the tree first, though
(after I grab some sleep),
Ming Lei Aug. 7, 2020, 1:41 p.m. UTC | #4
On Fri, Aug 07, 2020 at 01:38:54PM +0100, Al Viro wrote:
> On Fri, Aug 07, 2020 at 01:27:27PM +0100, Al Viro wrote:
> > On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> > > syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> > > iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> > > ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> > > call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> > > vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> > > with pipe->mutex held.
> > > 
> > > The reason of falling into infinite busy loop is that iter_file_splice_write()
> > > for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> > > while for_each_bvec() cannot handle .bv_len == 0.
> > 
> > broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> > unless I'm misreading it...
> > 
> > Zero-length segments are not disallowed; it's not all that hard to filter them
> > out in iter_file_splice_write(), but the intent had always been to have
> > iterate_all_kinds() et.al. able to cope with those.
> > 
> > How are these pipe_buffers with ->len == 0 generated in that reproducer, BTW?
> > There might be something else fishy going on...
> 
> FWIW, my preference would be to have for_each_bvec() advance past zero-length
> segments; I'll need to go through its uses elsewhere in the tree first, though
> (after I grab some sleep),

Usually the block layer doesn't allow/support zero-length bvecs; however, we can make
only for_each_bvec() support them.

Tetsuo, can you try the following patch?

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ac0c7299d5b8..b03c793dd28d 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -117,11 +117,19 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	return true;
 }
 
+static inline void bvec_iter_skip_zero_vec(const struct bio_vec *bv,
+		struct bvec_iter *iter)
+{
+	iter->bi_idx++;
+	iter->bi_bvec_done = 0;
+}
+
 #define for_each_bvec(bvl, bio_vec, iter, start)			\
 	for (iter = (start);						\
 	     (iter).bi_size &&						\
-		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);	\
-	     bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
+		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);		\
+	  (bvl).bv_len ? bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len) : \
+			bvec_iter_skip_zero_vec((bio_vec), &(iter)))
 
 /* for iterating one bio from start to end */
 #define BVEC_ITER_ALL_INIT (struct bvec_iter)				\

Thanks,
Ming
Matthew Wilcox Aug. 7, 2020, 2:11 p.m. UTC | #5
On Fri, Aug 07, 2020 at 09:41:14PM +0800, Ming Lei wrote:
> On Fri, Aug 07, 2020 at 01:38:54PM +0100, Al Viro wrote:
> > FWIW, my preference would be to have for_each_bvec() advance past zero-length
> > segments; I'll need to go through its uses elsewhere in the tree first, though
> > (after I grab some sleep),
> 
> Usually block layer doesn't allow/support zero bvec, however we can make
> for_each_bvec() to support it only.
> 
> Tetsuo, can you try the following patch?
> 
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index ac0c7299d5b8..b03c793dd28d 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -117,11 +117,19 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
>  	return true;
>  }
>  
> +static inline void bvec_iter_skip_zero_vec(const struct bio_vec *bv,
> +		struct bvec_iter *iter)
> +{
> +	iter->bi_idx++;
> +	iter->bi_bvec_done = 0;
> +}
> +
>  #define for_each_bvec(bvl, bio_vec, iter, start)			\
>  	for (iter = (start);						\
>  	     (iter).bi_size &&						\
> -		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);	\
> -	     bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
> +		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);		\
> +	  (bvl).bv_len ? bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len) : \
> +			bvec_iter_skip_zero_vec((bio_vec), &(iter)))

Uhm, bvec_iter_advance() already skips over zero length bio_vecs.

        while (bytes && bytes >= bv[idx].bv_len) {
                bytes -= bv[idx].bv_len;
                idx++;
        }

The problem is when the _first_ bio_vec is zero length.  Maybe something more
like this (which doesn't even compile, but hopefully makes my point):

@@ -86,12 +86,24 @@ struct bvec_iter_all {
        (mp_bvec_iter_page((bvec), (iter)) +                    \
         mp_bvec_iter_page_idx((bvec), (iter)))
 
-#define bvec_iter_bvec(bvec, iter)                             \
-((struct bio_vec) {                                            \
-       .bv_page        = bvec_iter_page((bvec), (iter)),       \
-       .bv_len         = bvec_iter_len((bvec), (iter)),        \
-       .bv_offset      = bvec_iter_offset((bvec), (iter)),     \
-})
+static inline bool bvec_iter_bvec(struct bio_vec *bv, struct bio_vec *bvec,
+               struct bvec_iter *iter)
+{
+       unsigned int idx = iter->bi_idx;
+
+       if (!iter->bi_size)
+               return false;
+
+       while (!bv[idx].bv_len)
+               idx++;
+       iter->bi_idx = idx;
+
+       bv->bv_page = bvec_iter_page(bvec, *iter);
+       bv->bv_len = bvec_iter_len(bvec, *iter);
+       bv->bv_offset = bvec_iter_offset(bvec, *iter);
+
+       return true;
+}
 
 static inline bool bvec_iter_advance(const struct bio_vec *bv,
                struct bvec_iter *iter, unsigned bytes)
@@ -119,8 +131,7 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 
 #define for_each_bvec(bvl, bio_vec, iter, start)                       \
        for (iter = (start);                                            \
-            (iter).bi_size &&                                          \
-               ((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
+            bvec_iter_bvec(&(bvl), (bio_vec), &(iter));                \
             bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
 
 /* for iterating one bio from start to end */

(I find the whole bvec handling a mess of confusing macros and would
welcome more of it being inline functions, in general).
Tetsuo Handa Aug. 7, 2020, 2:17 p.m. UTC | #6
On 2020/08/07 22:41, Ming Lei wrote:
>> FWIW, my preference would be to have for_each_bvec() advance past zero-length
>> segments; I'll need to go through its uses elsewhere in the tree first, though
>> (after I grab some sleep),
> 
> Usually block layer doesn't allow/support zero bvec, however we can make
> for_each_bvec() to support it only.
> 
> Tetsuo, can you try the following patch?

Yes, this patch solves the lockup. Thank you.
Tetsuo Handa Aug. 7, 2020, 3:11 p.m. UTC | #7
On 2020/08/07 23:11, Matthew Wilcox wrote:
> (I find the whole bvec handling a mess of confusing macros and would
> welcome more of it being inline functions, in general).

Indeed. Inline functions will be more useful than macros when syzbot
calculates the source code location from an address for reporting.
I spent a lot of time figuring out where

  RIP: 0010:iov_iter_alignment+0x39e/0x850 lib/iov_iter.c:1236

is within the complicated macros. If inlined line numbers were available,
I could have narrowed down the location of the infinite loop faster...
Ming Lei Aug. 9, 2020, 2:31 a.m. UTC | #8
On Fri, Aug 07, 2020 at 03:11:48PM +0100, Matthew Wilcox wrote:
> On Fri, Aug 07, 2020 at 09:41:14PM +0800, Ming Lei wrote:
> > On Fri, Aug 07, 2020 at 01:38:54PM +0100, Al Viro wrote:
> > > FWIW, my preference would be to have for_each_bvec() advance past zero-length
> > > segments; I'll need to go through its uses elsewhere in the tree first, though
> > > (after I grab some sleep),
> > 
> > Usually block layer doesn't allow/support zero bvec, however we can make
> > for_each_bvec() to support it only.
> > 
> > Tetsuo, can you try the following patch?
> > 
> > diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> > index ac0c7299d5b8..b03c793dd28d 100644
> > --- a/include/linux/bvec.h
> > +++ b/include/linux/bvec.h
> > @@ -117,11 +117,19 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
> >  	return true;
> >  }
> >  
> > +static inline void bvec_iter_skip_zero_vec(const struct bio_vec *bv,
> > +		struct bvec_iter *iter)
> > +{
> > +	iter->bi_idx++;
> > +	iter->bi_bvec_done = 0;
> > +}
> > +
> >  #define for_each_bvec(bvl, bio_vec, iter, start)			\
> >  	for (iter = (start);						\
> >  	     (iter).bi_size &&						\
> > -		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);	\
> > -	     bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
> > +		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);		\
> > +	  (bvl).bv_len ? bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len) : \
> > +			bvec_iter_skip_zero_vec((bio_vec), &(iter)))
> 
> Uhm, bvec_iter_advance() already skips over zero length bio_vecs.
> 
>         while (bytes && bytes >= bv[idx].bv_len) {
>                 bytes -= bv[idx].bv_len;
>                 idx++;
>         }

The issue is that a zero (bvl).bv_len is passed to bvec_iter_advance(), so
the iterator can't move on.

And I tried to avoid changing bvec_iter_advance() since this exact
issue only exists in for_each_bvec(), and the block layer won't support/allow
zero-length bvecs.

> 
> The problem is when the _first_ bio_vec is zero length.

It can be any zero-length bvec during the iteration.
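
To make that concrete, here is a hand-worked trace (simplified; ->bi_bvec_done and
error handling omitted) of the bvec_iter_advance() loop quoted above, with the empty
entry in the middle rather than first:

  bv[] = { {.bv_len=512}, {.bv_len=0}, {.bv_len=4096} },  iter.bi_size = 4608

  round 1: bvl = bv[0] (len 512) -> bvec_iter_advance(..., 512): bi_idx 0 -> 1, bi_size 4608 -> 4096
  round 2: bvl = bv[1] (len 0)   -> bvec_iter_advance(..., 0):   the while loop never runs,
                                    bi_idx stays 1, bi_size stays 4096
  round 3: bvl = bv[1] (len 0)   -> same as round 2, and so on forever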

> Maybe something more
> like this (which doesn't even compile, but hopefully makes my point):
> 
> @@ -86,12 +86,24 @@ struct bvec_iter_all {
>         (mp_bvec_iter_page((bvec), (iter)) +                    \
>          mp_bvec_iter_page_idx((bvec), (iter)))
>  
> -#define bvec_iter_bvec(bvec, iter)                             \
> -((struct bio_vec) {                                            \
> -       .bv_page        = bvec_iter_page((bvec), (iter)),       \
> -       .bv_len         = bvec_iter_len((bvec), (iter)),        \
> -       .bv_offset      = bvec_iter_offset((bvec), (iter)),     \
> -})
> +static inline bool bvec_iter_bvec(struct bio_vec *bv, struct bio_vec *bvec,
> +               struct bvec_iter *iter)
> +{
> +       unsigned int idx = iter->bi_idx;
> +
> +       if (!iter->bi_size)
> +               return false;
> +
> +       while (!bv[idx].bv_len)
> +               idx++;
> +       iter->bi_idx = idx;
> +
> +       bv->bv_page = bvec_iter_page(bvec, *iter);
> +       bv->bv_len = bvec_iter_len(bvec, *iter);
> +       bv->bv_offset = bvec_iter_offset(bvec, *iter);
> +
> +       return true;
> +}
>  
>  static inline bool bvec_iter_advance(const struct bio_vec *bv,
>                 struct bvec_iter *iter, unsigned bytes)
> @@ -119,8 +131,7 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
>  
>  #define for_each_bvec(bvl, bio_vec, iter, start)                       \
>         for (iter = (start);                                            \
> -            (iter).bi_size &&                                          \
> -               ((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
> +            bvec_iter_bvec(&(bvl), (bio_vec), &(iter));                \
>              bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
>  
>  /* for iterating one bio from start to end */
> 
> (I find the whole bvec handling a mess of confusing macros and would
> welcome more of it being inline functions, in general).

The above change may bring more code duplication. Meanwhile, it can't
work because (bvl).bv_len isn't taken into account in bvec_iter_bvec(),
so how can the iterator advance?


Thanks,
Ming
Ming Lei Aug. 9, 2020, 2:49 a.m. UTC | #9
On Sun, Aug 09, 2020 at 10:31:23AM +0800, Ming Lei wrote:
> On Fri, Aug 07, 2020 at 03:11:48PM +0100, Matthew Wilcox wrote:
> > On Fri, Aug 07, 2020 at 09:41:14PM +0800, Ming Lei wrote:
> > > On Fri, Aug 07, 2020 at 01:38:54PM +0100, Al Viro wrote:
> > > > FWIW, my preference would be to have for_each_bvec() advance past zero-length
> > > > segments; I'll need to go through its uses elsewhere in the tree first, though
> > > > (after I grab some sleep),
> > > 
> > > Usually block layer doesn't allow/support zero bvec, however we can make
> > > for_each_bvec() to support it only.
> > > 
> > > Tetsuo, can you try the following patch?
> > > 
> > > diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> > > index ac0c7299d5b8..b03c793dd28d 100644
> > > --- a/include/linux/bvec.h
> > > +++ b/include/linux/bvec.h
> > > @@ -117,11 +117,19 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
> > >  	return true;
> > >  }
> > >  
> > > +static inline void bvec_iter_skip_zero_vec(const struct bio_vec *bv,
> > > +		struct bvec_iter *iter)
> > > +{
> > > +	iter->bi_idx++;
> > > +	iter->bi_bvec_done = 0;
> > > +}
> > > +
> > >  #define for_each_bvec(bvl, bio_vec, iter, start)			\
> > >  	for (iter = (start);						\
> > >  	     (iter).bi_size &&						\
> > > -		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);	\
> > > -	     bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
> > > +		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);		\
> > > +	  (bvl).bv_len ? bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len) : \
> > > +			bvec_iter_skip_zero_vec((bio_vec), &(iter)))
> > 
> > Uhm, bvec_iter_advance() already skips over zero length bio_vecs.
> > 
> >         while (bytes && bytes >= bv[idx].bv_len) {
> >                 bytes -= bv[idx].bv_len;
> >                 idx++;
> >         }
> 
> The issue is that zero (bvl).bv_len passed to bvec_iter_advance(), so
> the iterator can't move on.
> 
> And I tried to avoid change to bvec_iter_advance() since this exact
> issue only exists on for_each_bvec, and block layer won't support/allow
> zero-length bvec.
> 
> > 
> > The problem is when the _first_ bio_vec is zero length.
> 
> It can be any zero-length bvec during the iterating. 
> 
> > Maybe something more
> > like this (which doesn't even compile, but hopefully makes my point):
> > 
> > @@ -86,12 +86,24 @@ struct bvec_iter_all {
> >         (mp_bvec_iter_page((bvec), (iter)) +                    \
> >          mp_bvec_iter_page_idx((bvec), (iter)))
> >  
> > -#define bvec_iter_bvec(bvec, iter)                             \
> > -((struct bio_vec) {                                            \
> > -       .bv_page        = bvec_iter_page((bvec), (iter)),       \
> > -       .bv_len         = bvec_iter_len((bvec), (iter)),        \
> > -       .bv_offset      = bvec_iter_offset((bvec), (iter)),     \
> > -})
> > +static inline bool bvec_iter_bvec(struct bio_vec *bv, struct bio_vec *bvec,
> > +               struct bvec_iter *iter)
> > +{
> > +       unsigned int idx = iter->bi_idx;
> > +
> > +       if (!iter->bi_size)
> > +               return false;
> > +
> > +       while (!bv[idx].bv_len)
> > +               idx++;
> > +       iter->bi_idx = idx;
> > +
> > +       bv->bv_page = bvec_iter_page(bvec, *iter);
> > +       bv->bv_len = bvec_iter_len(bvec, *iter);
> > +       bv->bv_offset = bvec_iter_offset(bvec, *iter);
> > +
> > +       return true;
> > +}
> >  
> >  static inline bool bvec_iter_advance(const struct bio_vec *bv,
> >                 struct bvec_iter *iter, unsigned bytes)
> > @@ -119,8 +131,7 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
> >  
> >  #define for_each_bvec(bvl, bio_vec, iter, start)                       \
> >         for (iter = (start);                                            \
> > -            (iter).bi_size &&                                          \
> > -               ((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
> > +            bvec_iter_bvec(&(bvl), (bio_vec), &(iter));                \
> >              bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
> >  
> >  /* for iterating one bio from start to end */
> > 
> > (I find the whole bvec handling a mess of confusing macros and would
> > welcome more of it being inline functions, in general).
> 
> The above change may bring more code duplication. Meantime, it can't
> work because (bvl).bv_len isn't taken into account into bvec_iter_bvec(),
> then how can the iterator advance?

Oops, it looks like you change bvec_iter_bvec() only for skipping zero-length bvecs,
and this way might work (though ->bi_bvec_done still isn't reset). However, the
change is ugly, because the iterator is not supposed to be updated in
bvec_iter_bvec(). Also, block layer code doesn't require such a change.

BTW, I agree on switching to inline if performance isn't affected.


Thanks,
Ming
Qian Cai Sept. 1, 2020, 12:51 a.m. UTC | #10
On Fri, Aug 07, 2020 at 09:34:08PM +0900, Tetsuo Handa wrote:
> On 2020/08/07 21:27, Al Viro wrote:
> > On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> >> syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> >> iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> >> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> >> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> >> vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> >> with pipe->mutex held.
> >>
> >> The reason of falling into infinite busy loop is that iter_file_splice_write()
> >> for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> >> while for_each_bvec() cannot handle .bv_len == 0.
> > 
> > broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> > unless I'm misreading it...

I have been chasing something similar for a while as in,

https://lore.kernel.org/linux-fsdevel/89F418A9-EB20-48CB-9AE0-52C700E6BB74@lca.pw/

In my case, it seems the endless loop happens in iterate_iovec() instead where
I put a debug patch here,

--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -33,6 +33,7 @@
                if (unlikely(!__v.iov_len))             \
                        continue;                       \
                __v.iov_base = __p->iov_base;           \
+               printk_ratelimited("ITER_IOVEC left = %zu, n = %zu\n", left, n); \
                left = (STEP);                          \
                __v.iov_len -= left;                    \
                skip = __v.iov_len;                     \

and end up seeing overflows ("n" is supposed to be less than PAGE_SIZE) before the
soft lockups and a dead system,

[ 4300.249180][T470195] ITER_IOVEC left = 0, n = 48566423

Thoughts?

> > 
> > Zero-length segments are not disallowed; it's not all that hard to filter them
> > out in iter_file_splice_write(), but the intent had always been to have
> > iterate_all_kinds() et.al. able to cope with those.
> > 
> > How are these pipe_buffers with ->len == 0 generated in that reproducer, BTW?
> > There might be something else fishy going on...
> > 
> 
> OK. Indeed writing to empty pipe which returns -EFAULT allows an empty
> page to be linked to pipe's array.
> 
> Now, I've just found a simple reproducer, and confirmed that this bug is
> a local lockup DoS by unprivileged user. Please fix.
> 
> ----------
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> 
> int main(int argc, char *argv[])
> {
>         static char buffer[4096];
>         const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
>         int pipe_fd[2] = { EOF, EOF };
>         pipe(pipe_fd);
>         write(pipe_fd[1], NULL, 4096);
>         write(pipe_fd[1], buffer, 4096);
>         splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
>         return 0;
> }
> ----------
> 
> [  125.598898][    C0] rcu: INFO: rcu_sched self-detected stall on CPU
> [  125.601072][    C0] rcu: 	0-....: (20171 ticks this GP) idle=526/1/0x4000000000000000 softirq=7918/7918 fqs=5136 
> [  125.604874][    C0] 	(t=21006 jiffies g=9341 q=30)
> [  125.606512][    C0] NMI backtrace for cpu 0
> [  125.607931][    C0] CPU: 0 PID: 2792 Comm: a.out Not tainted 5.8.0+ #793
> [  125.610948][    C0] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
> [  125.614938][    C0] Call Trace:
> [  125.616049][    C0]  <IRQ>
> [  125.617010][    C0]  dump_stack+0x5e/0x7a
> [  125.618370][    C0]  nmi_cpu_backtrace.cold.7+0x14/0x52
> [  125.620148][    C0]  ? lapic_can_unplug_cpu.cold.39+0x3a/0x3a
> [  125.622074][    C0]  nmi_trigger_cpumask_backtrace+0x92/0x9f
> [  125.624154][    C0]  arch_trigger_cpumask_backtrace+0x14/0x20
> [  125.626102][    C0]  rcu_dump_cpu_stacks+0xa0/0xd0
> [  125.627919][    C0]  rcu_sched_clock_irq.cold.95+0x121/0x39c
> [  125.629833][    C0]  ? acct_account_cputime+0x17/0x20
> [  125.631534][    C0]  ? account_system_index_time+0x8a/0xa0
> [  125.633422][    C0]  update_process_times+0x23/0x60
> [  125.635070][    C0]  tick_sched_handle.isra.22+0x20/0x60
> [  125.636870][    C0]  tick_sched_timer+0x68/0x80
> [  125.638403][    C0]  ? tick_sched_handle.isra.22+0x60/0x60
> [  125.640588][    C0]  __hrtimer_run_queues+0xf9/0x1a0
> [  125.642591][    C0]  hrtimer_interrupt+0xfc/0x210
> [  125.645033][    C0]  __sysvec_apic_timer_interrupt+0x4c/0x60
> [  125.647292][    C0]  asm_call_on_stack+0xf/0x20
> [  125.649192][    C0]  </IRQ>
> [  125.650501][    C0]  sysvec_apic_timer_interrupt+0x75/0x80
> [  125.652900][    C0]  asm_sysvec_apic_timer_interrupt+0x12/0x20
> [  125.655487][    C0] RIP: 0010:iov_iter_copy_from_user_atomic+0x19b/0x350
> [  125.658124][    C0] Code: 89 45 d0 48 c1 e6 06 48 03 37 4d 8d 3c 09 4c 89 cf e8 d9 e5 ff ff 48 8b 45 d0 45 39 eb 0f 87 35 01 00 00 49 8b 4a 18 4d 89 f9 <45> 29 dd 45 01 d8 75 12 eb 19 41 83 c4 01 41 29 c0 74 10 44 89 e0
> [  125.666132][    C0] RSP: 0018:ffffa6cdc1237aa8 EFLAGS: 00000246
> [  125.668557][    C0] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff945035a25100
> [  125.671576][    C0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff945035a25100
> [  125.674851][    C0] RBP: ffffa6cdc1237ad8 R08: 0000000000000000 R09: ffff945028a80000
> [  125.677989][    C0] R10: ffffa6cdc1237de0 R11: 0000000000000000 R12: 0000000000000000
> [  125.680990][    C0] R13: 0000000000001000 R14: 0000000000001000 R15: 0000000000001000
> [  125.684006][    C0]  iomap_write_actor+0xbe/0x190
> [  125.685982][    C0]  ? iomap_write_begin+0x460/0x460
> [  125.688031][    C0]  iomap_apply+0xf4/0x1a0
> [  125.689810][    C0]  ? iomap_write_begin+0x460/0x460
> [  125.691826][    C0]  iomap_file_buffered_write+0x69/0x90
> [  125.698598][    C0]  ? iomap_write_begin+0x460/0x460
> [  125.705341][    C0]  xfs_file_buffered_aio_write+0xc2/0x2c0
> [  125.707780][    C0]  xfs_file_write_iter+0xa3/0xc0
> [  125.709802][    C0]  do_iter_readv_writev+0x15b/0x1c0
> [  125.712496][    C0]  do_iter_write+0x81/0x190
> [  125.715245][    C0]  vfs_iter_write+0x14/0x20
> [  125.717221][    C0]  iter_file_splice_write+0x288/0x3e0
> [  125.719340][    C0]  do_splice_from+0x1a/0x40
> [  125.721175][    C0]  do_splice+0x2e5/0x620
> [  125.722950][    C0]  __x64_sys_splice+0x97/0x100
> [  125.724937][    C0]  do_syscall_64+0x31/0x40
> [  125.726766][    C0]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  125.729388][    C0] RIP: 0033:0x7f7515dd91c3
> [  125.731246][    C0] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 83 3d fd dd 2c 00 00 75 13 49 89 ca b8 13 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b d4 00 00 48 89 04 24
> [  125.749632][    C0] RSP: 002b:00007ffd553cde18 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
> [  125.753713][    C0] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f7515dd91c3
> [  125.756867][    C0] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000004
> [  125.759872][    C0] RBP: 0000000000000000 R08: 0000000000010000 R09: 0000000000000000
> [  125.763023][    C0] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004005a6
> [  125.766225][    C0] R13: 00007ffd553cdf10 R14: 0000000000000000 R15: 0000000000000000
>
Al Viro Sept. 1, 2020, 1:09 a.m. UTC | #11
On Mon, Aug 31, 2020 at 08:51:32PM -0400, Qian Cai wrote:
> On Fri, Aug 07, 2020 at 09:34:08PM +0900, Tetsuo Handa wrote:
> > On 2020/08/07 21:27, Al Viro wrote:
> > > On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> > >> syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> > >> iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> > >> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> > >> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> > >> vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> > >> with pipe->mutex held.
> > >>
> > >> The reason of falling into infinite busy loop is that iter_file_splice_write()
> > >> for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> > >> while for_each_bvec() cannot handle .bv_len == 0.
> > > 
> > > broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> > > unless I'm misreading it...
> 
> I have been chasing something similar for a while as in,
> 
> https://lore.kernel.org/linux-fsdevel/89F418A9-EB20-48CB-9AE0-52C700E6BB74@lca.pw/
> 
> In my case, it seems the endless loop happens in iterate_iovec() instead where
> I put a debug patch here,
> 
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -33,6 +33,7 @@
>                 if (unlikely(!__v.iov_len))             \
>                         continue;                       \
>                 __v.iov_base = __p->iov_base;           \
> +               printk_ratelimited("ITER_IOVEC left = %zu, n = %zu\n", left, n); \
>                 left = (STEP);                          \
>                 __v.iov_len -= left;                    \
>                 skip = __v.iov_len;                     \
> 
> and end up seeing overflows ("n" supposes to be less than PAGE_SIZE) before the
> soft-lockups and a dead system,
> 
> [ 4300.249180][T470195] ITER_IOVEC left = 0, n = 48566423
> 
> Thoughts?

Er...  Where does that size come from?  If that's generic_perform_write(),
I'd like to see pos, offset and bytes at the time of call...  ->iov_offset would
also be interesting to see (along with the entire iovec array, really).
Ming Lei Sept. 1, 2020, 1:10 a.m. UTC | #12
On Tue, Sep 1, 2020 at 8:53 AM Qian Cai <cai@lca.pw> wrote:
>
> On Fri, Aug 07, 2020 at 09:34:08PM +0900, Tetsuo Handa wrote:
> > On 2020/08/07 21:27, Al Viro wrote:
> > > On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> > >> syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> > >> iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> > >> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> > >> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> > >> vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> > >> with pipe->mutex held.
> > >>
> > >> The reason of falling into infinite busy loop is that iter_file_splice_write()
> > >> for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> > >> while for_each_bvec() cannot handle .bv_len == 0.
> > >
> > > broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> > > unless I'm misreading it...
>
> I have been chasing something similar for a while as in,
>
> https://lore.kernel.org/linux-fsdevel/89F418A9-EB20-48CB-9AE0-52C700E6BB74@lca.pw/
>
> In my case, it seems the endless loop happens in iterate_iovec() instead where
> I put a debug patch here,
>
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -33,6 +33,7 @@
>                 if (unlikely(!__v.iov_len))             \
>                         continue;                       \
>                 __v.iov_base = __p->iov_base;           \
> +               printk_ratelimited("ITER_IOVEC left = %zu, n = %zu\n", left, n); \
>                 left = (STEP);                          \
>                 __v.iov_len -= left;                    \
>                 skip = __v.iov_len;                     \
>
> and end up seeing overflows ("n" supposes to be less than PAGE_SIZE) before the
> soft-lockups and a dead system,
>
> [ 4300.249180][T470195] ITER_IOVEC left = 0, n = 48566423
>
> Thoughts?

Does the following patch make a difference for you?

https://lore.kernel.org/linux-block/20200817100055.2495905-1-ming.lei@redhat.com/

thanks,
Ming Lei
Qian Cai Sept. 1, 2020, 1:47 a.m. UTC | #13
On Tue, Sep 01, 2020 at 02:09:28AM +0100, Al Viro wrote:
> On Mon, Aug 31, 2020 at 08:51:32PM -0400, Qian Cai wrote:
> > On Fri, Aug 07, 2020 at 09:34:08PM +0900, Tetsuo Handa wrote:
> > > On 2020/08/07 21:27, Al Viro wrote:
> > > > On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> > > >> syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> > > >> iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> > > >> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> > > >> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> > > >> vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> > > >> with pipe->mutex held.
> > > >>
> > > >> The reason of falling into infinite busy loop is that iter_file_splice_write()
> > > >> for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> > > >> while for_each_bvec() cannot handle .bv_len == 0.
> > > > 
> > > > broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> > > > unless I'm misreading it...
> > 
> > I have been chasing something similar for a while as in,
> > 
> > https://lore.kernel.org/linux-fsdevel/89F418A9-EB20-48CB-9AE0-52C700E6BB74@lca.pw/
> > 
> > In my case, it seems the endless loop happens in iterate_iovec() instead where
> > I put a debug patch here,
> > 
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -33,6 +33,7 @@
> >                 if (unlikely(!__v.iov_len))             \
> >                         continue;                       \
> >                 __v.iov_base = __p->iov_base;           \
> > +               printk_ratelimited("ITER_IOVEC left = %zu, n = %zu\n", left, n); \
> >                 left = (STEP);                          \
> >                 __v.iov_len -= left;                    \
> >                 skip = __v.iov_len;                     \
> > 
> > and end up seeing overflows ("n" supposes to be less than PAGE_SIZE) before the
> > soft-lockups and a dead system,
> > 
> > [ 4300.249180][T470195] ITER_IOVEC left = 0, n = 48566423
> > 
> > Thoughts?
> 
> Er...  Where does that size come from?  If that's generic_perform_write(),
> I'd like to see pos, offset and bytes at the time of call...  ->iov_offset would
> also be interesting to see (along with the entire iovec array, really).

Yes, generic_perform_write(). I'll see if I can capture more information.

[ 2867.463013][T217919] ITER_IOVEC left = 0, n = 2209 
[ 2867.466154][T217971] ITER_IOVEC left = 0, n = 2093279 
[ 2867.903689][T217971] ITER_IOVEC left = 0, n = 2093259 
[ 2867.928514][T217971] ITER_IOVEC left = 0, n = 2093239 
[ 2867.952450][T217971] ITER_IOVEC left = 0, n = 2090980 
[ 2867.976585][T217971] ITER_IOVEC left = 0, n = 2090960 
[ 2869.219459][T217774] futex_wake_op: trinity-c61 tries to shift op by -1; fix this program 
[ 2870.005178][T218110] futex_wake_op: trinity-c9 tries to shift op by -1; fix this program 
[ 2870.297607][T218213] futex_wake_op: trinity-c4 tries to shift op by -836; fix this program 
[ 2870.338118][T218213] futex_wake_op: trinity-c4 tries to shift op by -836; fix this program 
[-- MARK -- Mon Aug 31 18:50:00 2020] 
[ 2893.870387][   C62] watchdog: BUG: soft lockup - CPU#62 stuck for 23s! [trinity-c5:218155] 
[ 2893.912341][   C62] Modules linked in: nls_ascii nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables sd_mod ahci bnx2x libahci mdio libata firmware_class dm_mirror dm_region_hash dm_log dm_mod efivarfs 
[ 2894.003787][   C62] irq event stamp: 55378 
[ 2894.022398][   C62] hardirqs last  enabled at (55377): [<ffffffff95a00c42>] asm_sysvec_apic_timer_interrupt+0x12/0x20 
[ 2894.070770][   C62] hardirqs last disabled at (55378): [<ffffffff959b111d>] irqentry_enter+0x1d/0x50 
[ 2894.112794][   C62] softirqs last  enabled at (50602): [<ffffffff95c0070f>] __do_softirq+0x70f/0xa9f 
[ 2894.154580][   C62] softirqs last disabled at (49393): [<ffffffff95a00ec2>] asm_call_on_stack+0x12/0x20 
[ 2894.197654][   C62] CPU: 62 PID: 218155 Comm: trinity-c5 Not tainted 5.9.0-rc2-next-20200828+ #4 
[ 2894.239807][   C62] Hardware name: HP Synergy 480 Gen9/Synergy 480 Gen9 Compute Module, BIOS I37 10/21/2019 
[ 2894.284894][   C62] RIP: 0010:iov_iter_copy_from_user_atomic+0x598/0xab0 
[ 2894.316037][   C62] Code: 42 0f b6 14 08 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 7c 03 00 00 48 8d 7e 08 8b 46 0c 48 89 fa 48 c1 ea 03 01 e8 <42> 0f b6 14 0a 84 d2 74 09 80 fa 03 0f 8e 0e 03 00 00 44 8b 6e 08 
[ 2894.408763][   C62] RSP: 0018:ffffc9000fb7f848 EFLAGS: 00000246 
[ 2894.440034][   C62] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff888fce295100 
[ 2894.480312][   C62] RDX: 1ffff111f9c52a21 RSI: ffff888fce295100 RDI: ffff888fce295108 
[ 2894.517551][   C62] RBP: 0000000000000000 R08: fffff52001f6ffa1 R09: dffffc0000000000 
[ 2894.556460][   C62] R10: ffff889055c1f000 R11: 0000000000000400 R12: ffffc9000fb7fcf0 
[ 2894.593928][   C62] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000400 
[ 2894.630979][   C62] FS:  00007ff89140a740(0000) GS:ffff88905fd80000(0000) knlGS:0000000000000000 
[ 2894.673104][   C62] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[ 2894.702914][   C62] CR2: 00007ff8906242fc CR3: 000000101bb76004 CR4: 00000000003706e0 
[ 2894.740628][   C62] DR0: 00007ff891250000 DR1: 0000000000000000 DR2: 0000000000000000 
[ 2894.778016][   C62] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 
[ 2894.815393][   C62] Call Trace: 
[ 2894.830531][   C62]  ? shmem_write_begin+0xc4/0x1b0 
[ 2894.853643][   C62]  generic_perform_write+0x2ac/0x410 
[ 2894.878266][   C62]  ? filemap_check_errors+0xe0/0xe0 
[ 2894.901557][   C62]  ? file_update_time+0x215/0x380 
[ 2894.925234][   C62]  ? update_time+0xa0/0xa0 
[ 2894.947683][   C62]  ? down_write+0xdb/0x150 
[ 2894.969843][   C62]  __generic_file_write_iter+0x2fe/0x4f0 
[ 2894.995527][   C62]  ? rcu_read_unlock+0x50/0x50 
[ 2895.017683][   C62]  generic_file_write_iter+0x2ee/0x520 
[ 2895.042981][   C62]  ? __generic_file_write_iter+0x4f0/0x4f0 
[ 2895.069612][   C62]  ? __mutex_lock+0x4af/0x1390 
[ 2895.092240][   C62]  do_iter_readv_writev+0x388/0x6f0 
[ 2895.115794][   C62]  ? lockdep_hardirqs_on_prepare+0x33e/0x4e0 
[ 2895.143303][   C62]  ? default_llseek+0x240/0x240 
[ 2895.165522][   C62]  ? rcu_read_lock_bh_held+0xc0/0xc0 
[ 2895.190122][   C62]  do_iter_write+0x130/0x5f0 
[ 2895.211226][   C62]  iter_file_splice_write+0x54c/0xa40 
[ 2895.235917][   C62]  ? page_cache_pipe_buf_try_steal+0x1e0/0x1e0 
[ 2895.264415][   C62]  ? rcu_read_lock_any_held+0xdb/0x100 
[ 2895.290249][   C62]  do_splice+0x86c/0x1440 
[ 2895.310155][   C62]  ? syscall_enter_from_user_mode+0x1b/0x230 
[ 2895.337531][   C62]  ? direct_splice_actor+0x100/0x100 
[ 2895.362348][   C62]  __x64_sys_splice+0x151/0x200 
[ 2895.384677][   C62]  do_syscall_64+0x33/0x40 
[ 2895.405223][   C62]  entry_SYSCALL_64_after_hwframe+0x44/0xa9 
[ 2895.433800][   C62] RIP: 0033:0x7ff890d246ed 
[ 2895.456812][   C62] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48 
[ 2895.553070][   C62] RSP: 002b:00007fff8a6acfb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113 
[ 2895.592859][   C62] RAX: ffffffffffffffda RBX: 0000000000000113 RCX: 00007ff890d246ed 
[ 2895.630744][   C62] RDX: 000000000000015b RSI: 0000000000000000 RDI: 000000000000016b 
[ 2895.668870][   C62] RBP: 0000000000000113 R08: 0000000000000400 R09: 0000000000000000 
[ 2895.707052][   C62] R10: 00007ff88f30a000 R11: 0000000000000246 R12: 0000000000000002 
[ 2895.744716][   C62] R13: 00007ff8913ee058 R14: 00007ff89140a6c0 R15: 00007ff8913ee000
Qian Cai Sept. 1, 2020, 3:32 a.m. UTC | #14
On Tue, Sep 01, 2020 at 02:09:28AM +0100, Al Viro wrote:
> On Mon, Aug 31, 2020 at 08:51:32PM -0400, Qian Cai wrote:
> > On Fri, Aug 07, 2020 at 09:34:08PM +0900, Tetsuo Handa wrote:
> > > On 2020/08/07 21:27, Al Viro wrote:
> > > > On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> > > >> syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> > > >> iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> > > >> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> > > >> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> > > >> vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> > > >> with pipe->mutex held.
> > > >>
> > > >> The reason of falling into infinite busy loop is that iter_file_splice_write()
> > > >> for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> > > >> while for_each_bvec() cannot handle .bv_len == 0.
> > > > 
> > > > broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> > > > unless I'm misreading it...
> > 
> > I have been chasing something similar for a while as in,
> > 
> > https://lore.kernel.org/linux-fsdevel/89F418A9-EB20-48CB-9AE0-52C700E6BB74@lca.pw/
> > 
> > In my case, it seems the endless loop happens in iterate_iovec() instead where
> > I put a debug patch here,
> > 
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -33,6 +33,7 @@
> >                 if (unlikely(!__v.iov_len))             \
> >                         continue;                       \
> >                 __v.iov_base = __p->iov_base;           \
> > +               printk_ratelimited("ITER_IOVEC left = %zu, n = %zu\n", left, n); \
> >                 left = (STEP);                          \
> >                 __v.iov_len -= left;                    \
> >                 skip = __v.iov_len;                     \
> > 
> > and end up seeing overflows ("n" supposes to be less than PAGE_SIZE) before the
> > soft-lockups and a dead system,
> > 
> > [ 4300.249180][T470195] ITER_IOVEC left = 0, n = 48566423
> > 
> > Thoughts?
> 
> Er...  Where does that size come from?  If that's generic_perform_write(),
> I'd like to see pos, offset and bytes at the time of call...  ->iov_offset would
> also be interesting to see (along with the entire iovec array, really).

I used a new debug patch, but I am not sure how to capture this without
printk_ratelimited() because the call sites are large,

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d812eef23a32..214b93c3d8a8 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -33,6 +33,7 @@
                if (unlikely(!__v.iov_len))             \
                        continue;                       \
                __v.iov_base = __p->iov_base;           \
+               printk_ratelimited("ITER_IOVEC left = %zu, n = %zu, iov_offset = %zu, iov_base = %px, iov_len = %lu\n", left, n, i->iov_offset, __p->iov_base, __p->iov_len); \
                left = (STEP);                          \
                __v.iov_len -= left;                    \
                skip = __v.iov_len;                     \
diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc..202b0ab28adf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3408,6 +3408,7 @@ ssize_t generic_perform_write(struct file *file,
                if (mapping_writably_mapped(mapping))
                        flush_dcache_page(page);
 
+               printk_ratelimited("pos = %lld, offset = %ld, bytes = %ld\n", pos, offset, bytes);
                copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
                flush_dcache_page(page);

Al, does it help? If not, can you cook a debug patch instead?

[  587.765400][T21348] pos = 468152746240, offset = 256, bytes = 3840
[  587.765424][T21348] pos = 468152750080, offset = 0, bytes = 4096
[  587.765439][T21348] pos = 468152754176, offset = 0, bytes = 4096
[  587.765455][T21348] pos = 468152758272, offset = 0, bytes = 4096
[  591.235409][T22038] ITER_IOVEC left = 0, n = 22476968, iov_offset = 0, iov_base = 00007ff12570c000, iov_len = 20
[  591.313456][T22038] ITER_IOVEC left = 0, n = 22476948, iov_offset = 0, iov_base = 00007ff123250000, iov_len = 20
[  591.363679][T22038] ITER_IOVEC left = 0, n = 22476928, iov_offset = 0, iov_base = 00007ff123650000, iov_len = 6370
[  591.413217][T22038] ITER_IOVEC left = 0, n = 22470558, iov_offset = 0, iov_base = 00007ff123850000, iov_len = 20
[  591.459754][T22038] ITER_IOVEC left = 0, n = 22470538, iov_offset = 0, iov_base = 00007ff12570c000, iov_len = 376
[  591.507748][T22038] ITER_IOVEC left = 0, n = 22470162, iov_offset = 0, iov_base = 00007ff12570c000, iov_len = 3473
[  591.557395][T22038] ITER_IOVEC left = 0, n = 22466689, iov_offset = 0, iov_base = 00007ff123250000, iov_len = 20
[  591.605295][T22038] ITER_IOVEC left = 0, n = 22466669, iov_offset = 0, iov_base = 00007ff12570a000, iov_len = 3392
[  591.653045][T22038] ITER_IOVEC left = 0, n = 22463277, iov_offset = 0, iov_base = 00007ff123250000, iov_len = 1329038
[  591.705324][T22038] ITER_IOVEC left = 0, n = 21134239, iov_offset = 0, iov_base = 00007ff123850000, iov_len = 20
Al Viro Sept. 1, 2020, 4:07 a.m. UTC | #15
On Mon, Aug 31, 2020 at 11:32:28PM -0400, Qian Cai wrote:

> I used a new debug patch but not sure how to capture without
> printk_ratelimited() because the call sites are large,

	if (!strcmp(current->comm, "bugger"))
		printk(KERN_ERR....
and call the binary you are running ./bugger...  And I'd slap such
printk into the beginning of iterate_iovec() as well, if not into
the entry of iov_iter_copy_from_user_atomic().  That BS value of
n must've come from somewhere; it should expand to 'bytes'.
What we have in the beginning is

	const struct iovec *iov;
	struct iovec v;
        size_t skip = i->iov_offset;
        size_t left;
        size_t wanted = bytes;

        iov = i->iov;

        __v.iov_len = min(bytes, iov->iov_len - skip);
        if (likely(__v.iov_len)) {
                __v.iov_base = iov->iov_base + skip;
                left = copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len);
                __v.iov_len -= left;
                skip += __v.iov_len;
                bytes -= __v.iov_len;
        } else {
                left = 0;
	}

and something leaves you with bytes bumped to 22476968.  What was in that first
iovec?  Incidentally, what's in 'wanted'?  And...  Timestamps don't look like
that crap has come from generic_perform_write() - it's about 4 seconds later.

While we are at it, there are other users of iterate_all_kinds(), and some of them
can very well get large sizes; they are not copying anything (iov_iter_advance(),
for starters). There, that kind of value would be just fine; are you sure those
printks came from iov_iter_copy_from_user_atomic()?
Qian Cai Sept. 1, 2020, 2:22 p.m. UTC | #16
On Tue, Sep 01, 2020 at 09:10:05AM +0800, Ming Lei wrote:
> On Tue, Sep 1, 2020 at 8:53 AM Qian Cai <cai@lca.pw> wrote:
> >
> > On Fri, Aug 07, 2020 at 09:34:08PM +0900, Tetsuo Handa wrote:
> > > On 2020/08/07 21:27, Al Viro wrote:
> > > > On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> > > >> syzbot is reporting hung task at pipe_release() [1], for for_each_bvec() from
> > > >> iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> > > >> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> > > >> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> > > >> vfs_iter_write() from iter_file_splice_write() falls into infinite busy loop
> > > >> with pipe->mutex held.
> > > >>
> > > >> The reason of falling into infinite busy loop is that iter_file_splice_write()
> > > >> for some reason generates "struct bio_vec" entry with .bv_len=0 and .bv_offset=0
> > > >> while for_each_bvec() cannot handle .bv_len == 0.
> > > >
> > > > broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> > > > unless I'm misreading it...
> >
> > I have been chasing something similar for a while as in,
> >
> > https://lore.kernel.org/linux-fsdevel/89F418A9-EB20-48CB-9AE0-52C700E6BB74@lca.pw/
> >
> > In my case, it seems the endless loop happens in iterate_iovec() instead where
> > I put a debug patch here,
> >
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -33,6 +33,7 @@
> >                 if (unlikely(!__v.iov_len))             \
> >                         continue;                       \
> >                 __v.iov_base = __p->iov_base;           \
> > +               printk_ratelimited("ITER_IOVEC left = %zu, n = %zu\n", left, n); \
> >                 left = (STEP);                          \
> >                 __v.iov_len -= left;                    \
> >                 skip = __v.iov_len;                     \
> >
> > and end up seeing overflows ("n" supposes to be less than PAGE_SIZE) before the
> > soft-lockups and a dead system,
> >
> > [ 4300.249180][T470195] ITER_IOVEC left = 0, n = 48566423
> >
> > Thoughts?
> 
> Does the following patch make a difference for you?
> 
> https://lore.kernel.org/linux-block/20200817100055.2495905-1-ming.lei@redhat.com/

Yes, it does. I am no longer able to reproduce the soft lockups with the
patch applied.

Patch

--- a/fs/splice.c
+++ b/fs/splice.c
@@ -747,6 +747,14 @@  iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		}
 
 		iov_iter_bvec(&from, WRITE, array, n, sd.total_len - left);
+		if (!strncmp(current->comm, "syz-executor", 12)) {
+			int i;
+			printk("Starting vfs_write_iter from.type=%d from.iov_offset=%zu from.count=%zu n=%u sd.total_len=%zu left=%zu\n",
+			       from.type, from.iov_offset, from.count, n, sd.total_len, left);
+			for (i = 0; i < n; i++)
+				printk("  array[%u]: bv_page=%px bv_len=%u bv_offset=%u\n",
+				       i, array[i].bv_page, array[i].bv_len, array[i].bv_offset);
+		}
 		ret = vfs_iter_write(out, &from, &sd.pos, 0);
 		if (ret <= 0)
 			break;