
Btrfs: fix compressed write corruption on enospc

Message ID 1406213285-19607-1-git-send-email-bo.li.liu@oracle.com (mailing list archive)
State Accepted

Commit Message

Liu Bo July 24, 2014, 2:48 p.m. UTC
When failing to allocate space for the whole compressed extent, we fall
back to uncompressed IO, but we have forgotten to redirty the pages
which belong to this compressed extent. These 'clean' pages then simply
skip the 'submit' part and go to endio directly, so in the end we get
data corruption because we write nothing.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/inode.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
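
As a rough illustration of the failure mode (a toy userspace model in
plain C, not btrfs code; page counts and names are invented): writeback
only submits pages that are flagged dirty, and the compression path
clears that flag before compressing, so after the ENOSPC fallback the
pages look clean and nothing reaches the disk unless they are redirtied
first.

#include <stdio.h>
#include <stdbool.h>

/* toy stand-in for a page cache page */
struct toy_page {
        bool dirty;
        bool written;
};

/* toy stand-in for extent_write_cache_pages(): only dirty pages get IO */
static void toy_writeback(struct toy_page *pages, int n)
{
        for (int i = 0; i < n; i++) {
                if (!pages[i].dirty)
                        continue;       /* clean pages are skipped entirely */
                pages[i].dirty = false;
                pages[i].written = true;
        }
}

int main(void)
{
        struct toy_page pages[4] = { { true }, { true }, { true }, { true } };

        /* the compression path clears the dirty flags up front ... */
        for (int i = 0; i < 4; i++)
                pages[i].dirty = false;

        /* ... then hits ENOSPC on the compressed reservation and falls
         * back to uncompressed IO; without redirtying, nothing is
         * submitted at all */
        toy_writeback(pages, 4);
        printf("without redirty: page 0 written = %d\n", pages[0].written);

        /* the fix: redirty the range before retrying,
         * cf. extent_range_redirty_for_io() in the patch */
        for (int i = 0; i < 4; i++)
                pages[i].dirty = true;
        toy_writeback(pages, 4);
        printf("with redirty:    page 0 written = %d\n", pages[0].written);
        return 0;
}

Compiled with gcc, the first printf reports 0 (the data is silently
dropped) and the second reports 1.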

Comments

Chris Mason July 24, 2014, 2:55 p.m. UTC | #1
On 07/24/2014 10:48 AM, Liu Bo wrote:
> When failing to allocate space for the whole compressed extent, we'll
> fallback to uncompressed IO, but we've forgotten to redirty the pages
> which belong to this compressed extent, and these 'clean' pages will
> simply skip 'submit' part and go to endio directly, at last we got data
> corruption as we write nothing.

This fallback code was my #1 suspect for the hangs people have been
seeing since 3.15.  I changed things around to trigger the fallback
randomly and wasn't able to trigger problems, but I was looking for
hangs and not corruptions.

-chris


Liu Bo July 25, 2014, 1 a.m. UTC | #2
On Thu, Jul 24, 2014 at 10:55:47AM -0400, Chris Mason wrote:
> On 07/24/2014 10:48 AM, Liu Bo wrote:
> > When failing to allocate space for the whole compressed extent, we'll
> > fallback to uncompressed IO, but we've forgotten to redirty the pages
> > which belong to this compressed extent, and these 'clean' pages will
> > simply skip 'submit' part and go to endio directly, at last we got data
> > corruption as we write nothing.
> 
> This fallback code was my #1 suspect for the hangs people have been
> seeing since 3.15.  I changed things around to trigger the fallback
> randomly and wasn't able to trigger problems, but I was looking for
> hangs and not corruptions.
> 

So now you're able to trigger the hang without changing the fallback code?

I tried raid1 and raid0 with fsmark and rsync in different ways but still fail
to reproduce the hang :-(

The weirdest thing is who the hell holds the free space inode's page; is it
possible to share pages with another inode? (My answer is NO, but I'm not sure
now...)

thanks,
-liubo

> -chris
> 
> 
Wang Shilong July 25, 2014, 1:53 a.m. UTC | #3
On 07/24/2014 10:48 PM, Liu Bo wrote:
> When failing to allocate space for the whole compressed extent, we'll
> fallback to uncompressed IO, but we've forgotten to redirty the pages
> which belong to this compressed extent, and these 'clean' pages will
> simply skip 'submit' part and go to endio directly, at last we got data
> corruption as we write nothing.
>
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>   fs/btrfs/inode.c | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3668048..8ea7610 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -709,6 +709,18 @@ retry:
>   				unlock_extent(io_tree, async_extent->start,
>   					      async_extent->start +
>   					      async_extent->ram_size - 1);
> +
> +				/*
> +				 * we need to redirty the pages if we decide to
> +				 * fallback to uncompressed IO, otherwise we
> +				 * will not submit these pages down to lower
> +				 * layers.
> +				 */
> +				extent_range_redirty_for_io(inode,
> +						async_extent->start,
> +						async_extent->start +
> +						async_extent->ram_size - 1);
> +
>   				goto retry;
BTW, if such an ENOSPC happens, it means we could not reserve space for the
compressed extent. So when we retry with the no-compression code path, it will
try to reserve even more space. Any reason we do things this way?

Thanks,
Wang
>   			}
>   			goto out_free;

Liu Bo July 25, 2014, 2:08 a.m. UTC | #4
On Fri, Jul 25, 2014 at 09:53:43AM +0800, Wang Shilong wrote:
> On 07/24/2014 10:48 PM, Liu Bo wrote:
> >When failing to allocate space for the whole compressed extent, we'll
> >fallback to uncompressed IO, but we've forgotten to redirty the pages
> >which belong to this compressed extent, and these 'clean' pages will
> >simply skip 'submit' part and go to endio directly, at last we got data
> >corruption as we write nothing.
> >
> >Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> >---
> >  fs/btrfs/inode.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> >
> >diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >index 3668048..8ea7610 100644
> >--- a/fs/btrfs/inode.c
> >+++ b/fs/btrfs/inode.c
> >@@ -709,6 +709,18 @@ retry:
> >  				unlock_extent(io_tree, async_extent->start,
> >  					      async_extent->start +
> >  					      async_extent->ram_size - 1);
> >+
> >+				/*
> >+				 * we need to redirty the pages if we decide to
> >+				 * fallback to uncompressed IO, otherwise we
> >+				 * will not submit these pages down to lower
> >+				 * layers.
> >+				 */
> >+				extent_range_redirty_for_io(inode,
> >+						async_extent->start,
> >+						async_extent->start +
> >+						async_extent->ram_size - 1);
> >+
> >  				goto retry;
> BTW, if such ENOSPC happens, it means we could not reserve compressed space.
> So we retry with nocompression codes, it will try to reserve more
> space. Any reason
> we do such things?

Compressed extents need contiguous space, while uncompressed extents have
more choices.
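
As a toy illustration of that point (sizes invented, not the real
allocator): with the free space split into three 48K holes, a 64K
compressed extent cannot be placed because it needs a single contiguous
chunk, while the same data written uncompressed (128K here) can be
carved out of several holes, roughly the way cow_file_range() loops
over allocations.

#include <stdio.h>

#define NHOLES 3
static unsigned long holes[NHOLES] = { 48 << 10, 48 << 10, 48 << 10 };

/* a compressed extent needs one contiguous chunk or nothing */
static int alloc_contiguous(unsigned long len)
{
        for (int i = 0; i < NHOLES; i++) {
                if (holes[i] >= len) {
                        holes[i] -= len;
                        return 0;
                }
        }
        return -1;      /* ENOSPC */
}

/* the uncompressed range may be carved out of several holes */
static int alloc_split(unsigned long len)
{
        for (int i = 0; i < NHOLES && len > 0; i++) {
                unsigned long got = holes[i] < len ? holes[i] : len;
                holes[i] -= got;
                len -= got;
        }
        return len ? -1 : 0;
}

int main(void)
{
        printf("compressed, 64K contiguous:  %d\n", alloc_contiguous(64 << 10));
        printf("uncompressed, 128K in parts: %d\n", alloc_split(128 << 10));
        return 0;
}

This prints -1 (ENOSPC) for the contiguous case and 0 for the split one.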

thanks,
-liubo

> 
> Thanks,
> Wang
> >  			}
> >  			goto out_free;
> 
Wang Shilong July 25, 2014, 2:11 a.m. UTC | #5
On 07/25/2014 10:08 AM, Liu Bo wrote:
> On Fri, Jul 25, 2014 at 09:53:43AM +0800, Wang Shilong wrote:
>> On 07/24/2014 10:48 PM, Liu Bo wrote:
>>> When failing to allocate space for the whole compressed extent, we'll
>>> fallback to uncompressed IO, but we've forgotten to redirty the pages
>>> which belong to this compressed extent, and these 'clean' pages will
>>> simply skip 'submit' part and go to endio directly, at last we got data
>>> corruption as we write nothing.
>>>
>>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>>> ---
>>>   fs/btrfs/inode.c | 12 ++++++++++++
>>>   1 file changed, 12 insertions(+)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 3668048..8ea7610 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -709,6 +709,18 @@ retry:
>>>   				unlock_extent(io_tree, async_extent->start,
>>>   					      async_extent->start +
>>>   					      async_extent->ram_size - 1);
>>> +
>>> +				/*
>>> +				 * we need to redirty the pages if we decide to
>>> +				 * fallback to uncompressed IO, otherwise we
>>> +				 * will not submit these pages down to lower
>>> +				 * layers.
>>> +				 */
>>> +				extent_range_redirty_for_io(inode,
>>> +						async_extent->start,
>>> +						async_extent->start +
>>> +						async_extent->ram_size - 1);
>>> +
>>>   				goto retry;
>> BTW, if such ENOSPC happens, it means we could not reserve compressed space.
>> So we retry with nocompression codes, it will try to reserve more
>> space. Any reason
>> we do such things?
> Compressed extents needs continuous space while uncompressed extents can have
> more choices.
Yeah, that is reasonable. Thanks for your answer.^_^

>
> thanks,
> -liubo
>
>> Thanks,
>> Wang
>>>   			}
>>>   			goto out_free;
> .
>

Martin Steigerwald July 25, 2014, 9:54 a.m. UTC | #6
On Thursday, 24 July 2014, 22:48:05, you wrote:
> When failing to allocate space for the whole compressed extent, we'll
> fallback to uncompressed IO, but we've forgotten to redirty the pages
> which belong to this compressed extent, and these 'clean' pages will
> simply skip 'submit' part and go to endio directly, at last we got data
> corruption as we write nothing.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>  fs/btrfs/inode.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3668048..8ea7610 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -709,6 +709,18 @@ retry:
>  				unlock_extent(io_tree, async_extent->start,
>  					      async_extent->start +
>  					      async_extent->ram_size - 1);
> +
> +				/*
> +				 * we need to redirty the pages if we decide to
> +				 * fallback to uncompressed IO, otherwise we
> +				 * will not submit these pages down to lower
> +				 * layers.
> +				 */
> +				extent_range_redirty_for_io(inode,
> +						async_extent->start,
> +						async_extent->start +
> +						async_extent->ram_size - 1);
> +
>  				goto retry;
>  			}
>  			goto out_free;

I am testing this currently. So far no lockup. Let's see. It still has not
filled the block device with trees completely after I balanced them:

Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 125.57GiB
        devid    1 size 160.00GiB used 153.00GiB path /dev/dm-0
        devid    2 size 160.00GiB used 153.00GiB path /dev/mapper/sata-home

I believe the lockups happen more easily if the trees occupy all of the disk
space. Well, I will do some compiling of some KDE components, which may let
BTRFS fill all the space again.

Does this patch mean that when it can't make enough free space in the
(fragmented) tree, it will write uncompressed?

This would mean that one would have to defragment the trees regularly to allow
writes to happen compressed at all times.

Well… of course still better than lockup or corruption.

Ciao,
Martin Steigerwald July 25, 2014, 9:58 a.m. UTC | #7
On Friday, 25 July 2014, 09:00:19, Liu Bo wrote:
> On Thu, Jul 24, 2014 at 10:55:47AM -0400, Chris Mason wrote:
> > On 07/24/2014 10:48 AM, Liu Bo wrote:
> > > When failing to allocate space for the whole compressed extent, we'll
> > > fallback to uncompressed IO, but we've forgotten to redirty the pages
> > > which belong to this compressed extent, and these 'clean' pages will
> > > simply skip 'submit' part and go to endio directly, at last we got data
> > > corruption as we write nothing.
> > 
> > This fallback code was my #1 suspect for the hangs people have been
> > seeing since 3.15.  I changed things around to trigger the fallback
> > randomly and wasn't able to trigger problems, but I was looking for
> > hangs and not corruptions.
> 
> So now you're able to trigger the hang without changing the fallback code?
> 
> I tried raid1 and raid0 with fsmark and rsync in different ways but still
> fails to reproduce the hang :-(
> 
> The most weird thing is who the hell holds the free space inode's page, is
> it possible to share pages with other inode? (My answer is NO, but I'm not
> sure now...)

Can you try doing this on a BTRFS filesystem that already has all of the block
device's free space filled with its trees? My suspicion is that this highly
contributes to the likelihood of this hang happening (see my other mail).

This really seems hard to hit in a test scenario, while it happens regularly
on some production filesystems.

Making a kernel package with make-kpkg (from the kernel-package package in
Debian) triggers the lockup quite reliably here.

That said, with 3.16.0-rc6-tp520-fixcompwrite+ there has been no lockup so far,
but after balancing the trees do not yet fill the whole device again.
Martin Steigerwald Aug. 4, 2014, 12:50 p.m. UTC | #8
On Friday, 25 July 2014, 11:54:37, Martin Steigerwald wrote:
> On Thursday, 24 July 2014, 22:48:05, you wrote:
> > When failing to allocate space for the whole compressed extent, we'll
> > fallback to uncompressed IO, but we've forgotten to redirty the pages
> > which belong to this compressed extent, and these 'clean' pages will
> > simply skip 'submit' part and go to endio directly, at last we got data
> > corruption as we write nothing.
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> > 
> >  fs/btrfs/inode.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3668048..8ea7610 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > 
> > @@ -709,6 +709,18 @@ retry:
> >  				unlock_extent(io_tree, async_extent->start,
> >  				
> >  					      async_extent->start +
> >  					      async_extent->ram_size - 1);
> > 
> > +
> > +				/*
> > +				 * we need to redirty the pages if we decide to
> > +				 * fallback to uncompressed IO, otherwise we
> > +				 * will not submit these pages down to lower
> > +				 * layers.
> > +				 */
> > +				extent_range_redirty_for_io(inode,
> > +						async_extent->start,
> > +						async_extent->start +
> > +						async_extent->ram_size - 1);
> > +
> > 
> >  				goto retry;
> >  			
> >  			}
> >  			goto out_free;
> 
> I am testing this currently. So far no lockup. Lets see. Still has not
> filled the the block device with trees completely after I balanced them:
> 
> Label: 'home'  uuid: […]
>         Total devices 2 FS bytes used 125.57GiB
>         devid    1 size 160.00GiB used 153.00GiB path /dev/dm-0
>         devid    2 size 160.00GiB used 153.00GiB path /dev/mapper/sata-home
> 
> I believe the lockups happen more easily if the trees occupy all of disk
> space. Well I will do some compiling of some KDE components which may let
> BTRFS fill all space again.
> 
> This patch will mean it when it can´t make enough free space in the
> (fragmented) tree it will write uncompressed?
> 
> This would mean that one would have a defragment trees regularily to allow
> for writes to happen compressed at all times.
> 
> Well… of course still better than lockup or corruption.

So lookups so far anymore.

Tested with 3.16-rc5, 3.16-rc7, now running 3.16 final.

The /home BTRFS only got filled completely tree-wise today:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 127.35GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/dm-3

But I had KDE and kernel compiles running full throttle at that time and it
was still good.

Tested-by: Martin Steigerwald <martin@lichtvoll.de>


I think this should go to stable. Thanks, Liu.

Ciao,
Martin Steigerwald Aug. 4, 2014, 12:52 p.m. UTC | #9
On Monday, 4 August 2014, 14:50:29, Martin Steigerwald wrote:
> On Friday, 25 July 2014, 11:54:37, Martin Steigerwald wrote:
> > On Thursday, 24 July 2014, 22:48:05, you wrote:
> > > When failing to allocate space for the whole compressed extent, we'll
> > > fallback to uncompressed IO, but we've forgotten to redirty the pages
> > > which belong to this compressed extent, and these 'clean' pages will
> > > simply skip 'submit' part and go to endio directly, at last we got data
> > > corruption as we write nothing.
> > >
> > > 
> > >
> > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > > ---
> > >
> > > 
> > >  fs/btrfs/inode.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > > 
> > >
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 3668048..8ea7610 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > >
> > > 
> > >
> > > @@ -709,6 +709,18 @@ retry:
> > >                             unlock_extent(io_tree, async_extent->start,
> > >                             
> > >                                           async_extent->start +
> > >                                           async_extent->ram_size - 1);
> > > 
> > >
> > > +
> > > +                           /*
> > > +                            * we need to redirty the pages if we decide
> > > to
> > > +                            * fallback to uncompressed IO, otherwise we
> > > +                            * will not submit these pages down to lower
> > > +                            * layers.
> > > +                            */
> > > +                           extent_range_redirty_for_io(inode,
> > > +                                           async_extent->start,
> > > +                                           async_extent->start +
> > > +                                           async_extent->ram_size - 1);
> > > +
> > >
> > > 
> > >                             goto retry;
> > >                     
> > >                     }
> > >                     goto out_free;
> >
> > 
> >
> > I am testing this currently. So far no lockup. Lets see. Still has not
> >
> > filled the the block device with trees completely after I balanced them:
> > 
> >
> > Label: 'home'  uuid: […]
> >
> >         Total devices 2 FS bytes used 125.57GiB
> >         devid    1 size 160.00GiB used 153.00GiB path /dev/dm-0
> >         devid    2 size 160.00GiB used 153.00GiB path
> >/dev/mapper/sata-home 
> >
> > I believe the lockups happen more easily if the trees occupy all of disk
> > space. Well I will do some compiling of some KDE components which may let
> > BTRFS fill all space again.
> >
> > 
> >
> > This patch will mean it when it can´t make enough free space in the
> > (fragmented) tree it will write uncompressed?
> >
> > 
> >
> > This would mean that one would have a defragment trees regularily to allow
> > for writes to happen compressed at all times.
> >
> > 
> >
> > Well… of course still better than lockup or corruption.
> 
> So lookups so far anymore.

No lockups, of course.
Martin Steigerwald Aug. 6, 2014, 10:21 a.m. UTC | #10
On Monday, 4 August 2014, 14:50:29, Martin Steigerwald wrote:
> On Friday, 25 July 2014, 11:54:37, Martin Steigerwald wrote:
> > On Thursday, 24 July 2014, 22:48:05, you wrote:
> > > When failing to allocate space for the whole compressed extent, we'll
> > > fallback to uncompressed IO, but we've forgotten to redirty the pages
> > > which belong to this compressed extent, and these 'clean' pages will
> > > simply skip 'submit' part and go to endio directly, at last we got data
> > > corruption as we write nothing.
> > > 
> > > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > > ---
> > > 
> > >  fs/btrfs/inode.c | 12 ++++++++++++
> > >  1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 3668048..8ea7610 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > 
> > > @@ -709,6 +709,18 @@ retry:
> > >  				unlock_extent(io_tree, async_extent->start,
> > >  				
> > >  					      async_extent->start +
> > >  					      async_extent->ram_size - 1);
> > > 
> > > +
> > > +				/*
> > > +				 * we need to redirty the pages if we decide to
> > > +				 * fallback to uncompressed IO, otherwise we
> > > +				 * will not submit these pages down to lower
> > > +				 * layers.
> > > +				 */
> > > +				extent_range_redirty_for_io(inode,
> > > +						async_extent->start,
> > > +						async_extent->start +
> > > +						async_extent->ram_size - 1);
> > > +
> > > 
> > >  				goto retry;
> > >  			
> > >  			}
> > >  			goto out_free;
> > 
> > I am testing this currently. So far no lockup. Lets see. Still has not
> > filled the the block device with trees completely after I balanced them:
> > 
> > Label: 'home'  uuid: […]
> > 
> >         Total devices 2 FS bytes used 125.57GiB
> >         devid    1 size 160.00GiB used 153.00GiB path /dev/dm-0
> >         devid    2 size 160.00GiB used 153.00GiB path
> >         /dev/mapper/sata-home
> > 
> > I believe the lockups happen more easily if the trees occupy all of disk
> > space. Well I will do some compiling of some KDE components which may let
> > BTRFS fill all space again.
> > 
> > This patch will mean it when it can´t make enough free space in the
> > (fragmented) tree it will write uncompressed?
> > 
> > This would mean that one would have a defragment trees regularily to allow
> > for writes to happen compressed at all times.
> > 
> > Well… of course still better than lockup or corruption.
> 
> So lookups so far anymore.
> 
> Tested with 3.16-rc5, 3.16-rc7, now running 3-16 final.
> 
> /home BTRFS only got today filled completely tree-wise:
> 
> merkaba:~> btrfs fi sh /home
> Label: 'home'  uuid: […]
>         Total devices 2 FS bytes used 127.35GiB
>         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
> devid    2 size 160.00GiB used 160.00GiB path /dev/dm-3
> 
> But I had KDE and kernel compile running full throttle at that time and
> still good.
> 
> Tested-By: Martin Steigerwald <martin@lichtvoll.de>
> 
> 
> I think this should go to stable. Thanks, Liu.

Unfortunately this fix does not seem to fix all lockups.

Just had a hard lockup again while the Java-based CrashPlanPROe app was backing
up company data, which is stored on BTRFS via ecryptfs, to a central backup
server.

It basically happened on about the first occasion of heavy write I/O after
the BTRFS trees filled the complete device:

I am now balancing the trees down to lower sizes manually with

btrfs balance start -dusage=10 /home

btrfs balance start -musage=10 /home

and raising the values. BTW, I ran out of space when trying both at the same time:

merkaba:~#1> btrfs balance start -dusage=10 -musage=10 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail

merkaba:~#1> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 128.76GiB
        devid    1 size 160.00GiB used 146.00GiB path /dev/dm-0
        devid    2 size 160.00GiB used 146.00GiB path /dev/mapper/sata-home

So I am by now pretty sure that the hangs can best be triggered *if* the BTRFS
trees fill the complete device.

I will try to keep tree sizes down as a work-around for now, even if it means
additional write access to the SSD devices.

And I will make sure tree sizes stay down on my first server BTRFS as well,
although it uses the Debian backport kernel 3.14 and thus may not be affected.

Are there any other fixes to try out? I would really like to see this resolved.
It is in two stable kernel releases already: 3.15 and 3.16, which means that if
it is not fixed, the next Debian stable (Jessie) will be affected by it.


Some of the kern.log (I have stored the complete file):

Aug  6 12:01:16 merkaba kernel: [90496.262084] INFO: task java:21301 blocked for more than 120 seconds.
Aug  6 12:01:16 merkaba kernel: [90496.263626]       Tainted: G           O  3.16.0-tp520-fixcompwrite+ #3
Aug  6 12:01:16 merkaba kernel: [90496.265159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 12:01:16 merkaba kernel: [90496.266756] java            D ffff880044e3cef0     0 21301      1 0x00000000
Aug  6 12:01:16 merkaba kernel: [90496.268353]  ffff8801960e3bd8 0000000000000002 ffff880407f649b0 ffff8801960e3fd8
Aug  6 12:01:16 merkaba kernel: [90496.269980]  ffff880044e3c9b0 0000000000013040 ffff880044e3c9b0 ffff88041e293040
Aug  6 12:01:16 merkaba kernel: [90496.271766]  ffff88041e5c6868 ffff8801960e3c70 0000000000000002 ffffffff810db1d9
Aug  6 12:01:16 merkaba kernel: [90496.273383] Call Trace:
Aug  6 12:01:16 merkaba kernel: [90496.275017]  [<ffffffff810db1d9>] ? wait_on_page_read+0x37/0x37
Aug  6 12:01:16 merkaba kernel: [90496.276630]  [<ffffffff81470fd0>] schedule+0x64/0x66
Aug  6 12:01:16 merkaba kernel: [90496.278209]  [<ffffffff81471157>] io_schedule+0x57/0x76
Aug  6 12:01:16 merkaba kernel: [90496.279817]  [<ffffffff810db1e2>] sleep_on_page+0x9/0xd
Aug  6 12:01:16 merkaba kernel: [90496.281403]  [<ffffffff8147152d>] __wait_on_bit_lock+0x41/0x85
Aug  6 12:01:16 merkaba kernel: [90496.282991]  [<ffffffff810db29f>] __lock_page+0x70/0x7c
Aug  6 12:01:16 merkaba kernel: [90496.284550]  [<ffffffff81070f3a>] ? autoremove_wake_function+0x2f/0x2f
Aug  6 12:01:16 merkaba kernel: [90496.286156]  [<ffffffffc0476617>] lock_page+0x1e/0x21 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.287742]  [<ffffffffc0476617>] ? lock_page+0x1e/0x21 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.289344]  [<ffffffffc047a8f4>] extent_write_cache_pages.isra.21.constprop.42+0x1a7/0x2d9 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.290955]  [<ffffffff810dba82>] ? find_get_pages_tag+0xfc/0x123
Aug  6 12:01:16 merkaba kernel: [90496.292574]  [<ffffffffc047ae1c>] extent_writepages+0x46/0x57 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.294154]  [<ffffffffc04650b2>] ? btrfs_submit_direct+0x3ef/0x3ef [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.295760]  [<ffffffffc046361f>] btrfs_writepages+0x23/0x25 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.297492]  [<ffffffff810e556b>] do_writepages+0x1b/0x24
Aug  6 12:01:16 merkaba kernel: [90496.299035]  [<ffffffff810dc88a>] __filemap_fdatawrite_range+0x50/0x52
Aug  6 12:01:16 merkaba kernel: [90496.300561]  [<ffffffff810dc8f1>] filemap_fdatawrite_range+0xe/0x10
Aug  6 12:01:16 merkaba kernel: [90496.302118]  [<ffffffffc046ea30>] btrfs_sync_file+0x67/0x2bd [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.303630]  [<ffffffff810dc88a>] ? __filemap_fdatawrite_range+0x50/0x52
Aug  6 12:01:16 merkaba kernel: [90496.305158]  [<ffffffff81156289>] vfs_fsync_range+0x1c/0x1e
Aug  6 12:01:16 merkaba kernel: [90496.306669]  [<ffffffff811562a2>] vfs_fsync+0x17/0x19
Aug  6 12:01:16 merkaba kernel: [90496.308197]  [<ffffffffc04f732a>] ecryptfs_fsync+0x2f/0x34 [ecryptfs]
Aug  6 12:01:16 merkaba kernel: [90496.309711]  [<ffffffff81156289>] vfs_fsync_range+0x1c/0x1e
Aug  6 12:01:16 merkaba kernel: [90496.311249]  [<ffffffff811562a2>] vfs_fsync+0x17/0x19
Aug  6 12:01:16 merkaba kernel: [90496.312771]  [<ffffffff811564c9>] do_fsync+0x2c/0x45
Aug  6 12:01:16 merkaba kernel: [90496.314288]  [<ffffffff811566af>] SyS_fsync+0xb/0xf
Aug  6 12:01:16 merkaba kernel: [90496.315800]  [<ffffffff8147420b>] tracesys+0xdd/0xe2



Aug  6 12:01:16 merkaba kernel: [90496.380221] INFO: task java:21563 blocked for more than 120 seconds.
Aug  6 12:01:16 merkaba kernel: [90496.381691]       Tainted: G           O  3.16.0-tp520-fixcompwrite+ #3
Aug  6 12:01:16 merkaba kernel: [90496.383192] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 12:01:16 merkaba kernel: [90496.384687] java            D ffff880038111dd0     0 21563      1 0x00000000
Aug  6 12:01:16 merkaba kernel: [90496.386203]  ffff88006df0fbd8 0000000000000002 ffffffff81a15500 ffff88006df0ffd8
Aug  6 12:01:16 merkaba kernel: [90496.387843]  ffff880038111890 0000000000013040 ffff880038111890 ffff88041e213040
Aug  6 12:01:16 merkaba kernel: [90496.389414]  ffff88041e5cc568 ffff88006df0fc70 0000000000000002 ffffffff810db1d9
Aug  6 12:01:16 merkaba kernel: [90496.391031] Call Trace:
Aug  6 12:01:16 merkaba kernel: [90496.392574]  [<ffffffff810db1d9>] ? wait_on_page_read+0x37/0x37
Aug  6 12:01:16 merkaba kernel: [90496.394154]  [<ffffffff81470fd0>] schedule+0x64/0x66
Aug  6 12:01:16 merkaba kernel: [90496.395686]  [<ffffffff81471157>] io_schedule+0x57/0x76
Aug  6 12:01:16 merkaba kernel: [90496.397218]  [<ffffffff810db1e2>] sleep_on_page+0x9/0xd
Aug  6 12:01:16 merkaba kernel: [90496.398723]  [<ffffffff8147152d>] __wait_on_bit_lock+0x41/0x85
Aug  6 12:01:16 merkaba kernel: [90496.400232]  [<ffffffff810db29f>] __lock_page+0x70/0x7c
Aug  6 12:01:16 merkaba kernel: [90496.401895]  [<ffffffff81070f3a>] ? autoremove_wake_function+0x2f/0x2f
Aug  6 12:01:16 merkaba kernel: [90496.403440]  [<ffffffffc0476617>] lock_page+0x1e/0x21 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.404942]  [<ffffffffc0476617>] ? lock_page+0x1e/0x21 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.406433]  [<ffffffffc047a8f4>] extent_write_cache_pages.isra.21.constprop.42+0x1a7/0x2d9 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.407950]  [<ffffffff810dba82>] ? find_get_pages_tag+0xfc/0x123
Aug  6 12:01:16 merkaba kernel: [90496.409474]  [<ffffffffc047ae1c>] extent_writepages+0x46/0x57 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.411020]  [<ffffffffc04650b2>] ? btrfs_submit_direct+0x3ef/0x3ef [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.412558]  [<ffffffffc046361f>] btrfs_writepages+0x23/0x25 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.414102]  [<ffffffff810e556b>] do_writepages+0x1b/0x24
Aug  6 12:01:16 merkaba kernel: [90496.415621]  [<ffffffff810dc88a>] __filemap_fdatawrite_range+0x50/0x52
Aug  6 12:01:16 merkaba kernel: [90496.417184]  [<ffffffff810dc8f1>] filemap_fdatawrite_range+0xe/0x10
Aug  6 12:01:16 merkaba kernel: [90496.418753]  [<ffffffffc046ea30>] btrfs_sync_file+0x67/0x2bd [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.420344]  [<ffffffff810dc88a>] ? __filemap_fdatawrite_range+0x50/0x52
Aug  6 12:01:16 merkaba kernel: [90496.421914]  [<ffffffff81156289>] vfs_fsync_range+0x1c/0x1e
Aug  6 12:01:16 merkaba kernel: [90496.423467]  [<ffffffff811562a2>] vfs_fsync+0x17/0x19
Aug  6 12:01:16 merkaba kernel: [90496.425051]  [<ffffffffc04f732a>] ecryptfs_fsync+0x2f/0x34 [ecryptfs]
Aug  6 12:01:16 merkaba kernel: [90496.426593]  [<ffffffff81156289>] vfs_fsync_range+0x1c/0x1e
Aug  6 12:01:16 merkaba kernel: [90496.428280]  [<ffffffff811562a2>] vfs_fsync+0x17/0x19
Aug  6 12:01:16 merkaba kernel: [90496.429853]  [<ffffffff811564c9>] do_fsync+0x2c/0x45
Aug  6 12:01:16 merkaba kernel: [90496.431351]  [<ffffffff811566af>] SyS_fsync+0xb/0xf
Aug  6 12:01:16 merkaba kernel: [90496.432841]  [<ffffffff8147420b>] tracesys+0xdd/0xe2



Aug  6 12:01:16 merkaba kernel: [90496.434306] INFO: task kworker/u8:3:21401 blocked for more than 120 seconds.
Aug  6 12:01:16 merkaba kernel: [90496.435814]       Tainted: G           O  3.16.0-tp520-fixcompwrite+ #3
Aug  6 12:01:16 merkaba kernel: [90496.437328] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  6 12:01:16 merkaba kernel: [90496.438885] kworker/u8:3    D ffff880133ebe780     0 21401      2 0x00000000
Aug  6 12:01:16 merkaba kernel: [90496.440464] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.442037]  ffff88003e953b18 0000000000000002 ffffffff81a15500 ffff88003e953fd8
Aug  6 12:01:16 merkaba kernel: [90496.443639]  ffff880133ebe240 0000000000013040 ffff880133ebe240 ffff88041e213040
Aug  6 12:01:16 merkaba kernel: [90496.445246]  ffff88041e5bc968 ffff88003e953bb0 0000000000000002 ffffffff810db1d9
Aug  6 12:01:16 merkaba kernel: [90496.446901] Call Trace:
Aug  6 12:01:16 merkaba kernel: [90496.448485]  [<ffffffff810db1d9>] ? wait_on_page_read+0x37/0x37
Aug  6 12:01:16 merkaba kernel: [90496.450081]  [<ffffffff81470fd0>] schedule+0x64/0x66
Aug  6 12:01:16 merkaba kernel: [90496.451682]  [<ffffffff81471157>] io_schedule+0x57/0x76
Aug  6 12:01:16 merkaba kernel: [90496.453271]  [<ffffffff810db1e2>] sleep_on_page+0x9/0xd
Aug  6 12:01:16 merkaba kernel: [90496.455037]  [<ffffffff8147152d>] __wait_on_bit_lock+0x41/0x85
Aug  6 12:01:16 merkaba kernel: [90496.456617]  [<ffffffff810db29f>] __lock_page+0x70/0x7c
Aug  6 12:01:16 merkaba kernel: [90496.458203]  [<ffffffff81070f3a>] ? autoremove_wake_function+0x2f/0x2f
Aug  6 12:01:16 merkaba kernel: [90496.459793]  [<ffffffffc0476617>] lock_page+0x1e/0x21 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.461353]  [<ffffffffc0476617>] ? lock_page+0x1e/0x21 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.462917]  [<ffffffffc047a8f4>] extent_write_cache_pages.isra.21.constprop.42+0x1a7/0x2d9 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.464479]  [<ffffffff81009e60>] ? native_sched_clock+0x3a/0x3c
Aug  6 12:01:16 merkaba kernel: [90496.466036]  [<ffffffffc047ae1c>] extent_writepages+0x46/0x57 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.467632]  [<ffffffff81066fbf>] ? task_group_account_field+0x3b/0x40
Aug  6 12:01:16 merkaba kernel: [90496.469168]  [<ffffffffc04650b2>] ? btrfs_submit_direct+0x3ef/0x3ef [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.470737]  [<ffffffffc046361f>] btrfs_writepages+0x23/0x25 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.472307]  [<ffffffff810e556b>] do_writepages+0x1b/0x24
Aug  6 12:01:16 merkaba kernel: [90496.473885]  [<ffffffff810dc88a>] __filemap_fdatawrite_range+0x50/0x52
Aug  6 12:01:16 merkaba kernel: [90496.475458]  [<ffffffff810dd1b3>] filemap_flush+0x17/0x19
Aug  6 12:01:16 merkaba kernel: [90496.477041]  [<ffffffffc0465df9>] btrfs_run_delalloc_work+0x2e/0x64 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.478624]  [<ffffffffc04863d7>] normal_work_helper+0xdf/0x224 [btrfs]
Aug  6 12:01:16 merkaba kernel: [90496.480257]  [<ffffffff81052d8c>] process_one_work+0x16f/0x2b8
Aug  6 12:01:16 merkaba kernel: [90496.481977]  [<ffffffff81053636>] worker_thread+0x27b/0x32e
Aug  6 12:01:16 merkaba kernel: [90496.483544]  [<ffffffff810533bb>] ? cancel_delayed_work_sync+0x10/0x10
Aug  6 12:01:16 merkaba kernel: [90496.485082]  [<ffffffff81058012>] kthread+0xb2/0xba
Aug  6 12:01:16 merkaba kernel: [90496.486624]  [<ffffffff81470000>] ? ap_handle_dropped_data+0xf/0xc8
Aug  6 12:01:16 merkaba kernel: [90496.488148]  [<ffffffff81057f60>] ? __kthread_parkme+0x62/0x62
Aug  6 12:01:16 merkaba kernel: [90496.489719]  [<ffffffff81473f6c>] ret_from_fork+0x7c/0xb0
Aug  6 12:01:16 merkaba kernel: [90496.491265]  [<ffffffff81057f60>] ? __kthread_parkme+0x62/0x62


Ciao,
Hugo Mills Aug. 6, 2014, 10:29 a.m. UTC | #11
On Wed, Aug 06, 2014 at 12:21:59PM +0200, Martin Steigerwald wrote:
> It basically happened on about the first heavy write I/O occasion after
> the BTRFS trees filled the complete device:
> 
> I am now balancing the trees down to lower sizes manually with
> 
> btrfs balance start -dusage=10 /home
> 
> btrfs balance start -musage=10 /home

   Note that balance has nothing to do with balancing the metadata
trees. The tree structures are automatically balanced as part of their
normal operation. A "btrfs balance start" is a much higher-level
operation. It's called balance because the overall effect is to
balance the data usage evenly across multiple devices. (Actually, to
balance the available space evenly).

   Also note that the data part isn't tree-structured, so referring to
"balancing the trees" with a -d flag is doubly misleading. :)

   Hugo.
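
To make the distinction concrete, a command-level sketch (illustrative
only; the file path is a placeholder, and the exact option semantics are
as documented in btrfs-balance(8) and btrfs-filesystem(8)):

  # rewrite/compact block groups ("chunks"); the usage filters restrict
  # this to data or metadata block groups that are roughly at most 10% full
  btrfs balance start -dusage=10 /home
  btrfs balance start -musage=10 /home

  # defragmenting file contents is a separate operation entirely
  btrfs filesystem defragment /home/some/large/file
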
Martin Steigerwald Aug. 6, 2014, 12:28 p.m. UTC | #12
On Wednesday, 6 August 2014, 11:29:19, Hugo Mills wrote:
> On Wed, Aug 06, 2014 at 12:21:59PM +0200, Martin Steigerwald wrote:
> > It basically happened on about the first heavy write I/O occasion after
> > the BTRFS trees filled the complete device:
> > 
> > I am now balancing the trees down to lower sizes manually with
> > 
> > btrfs balance start -dusage=10 /home
> > 
> > btrfs balance start -musage=10 /home
> 
>    Note that balance has nothing to do with balancing the metadata
> trees. The tree structures are automatically balanced as part of their
> normal operation. A "btrfs balance start" is a much higher-level
> operation. It's called balance because the overall effect is to
> balance the data usage evenly across multiple devices. (Actually, to
> balance the available space evenly).
> 
>    Also note that the data part isn't tree-structured, so referring to
> "balancing the trees" with a -d flag is doubly misleading. :)

Hmm, it changes the used size in

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 129.12GiB
        devid    1 size 160.00GiB used 142.03GiB path /dev/dm-0
        devid    2 size 160.00GiB used 142.03GiB path /dev/mapper/sata-home

and I thought this is the size used by the trees BTRFS creates.

So you are saying it does not balance shortest versus longest path, since the
tree algorithm does this automatically, but just the *data* in the tree?

In any case: I should not be required to do this kind of manual maintenance in
order to prevent BTRFS from locking up hard on write accesses.

Ciao,
Chris Mason Aug. 6, 2014, 1:35 p.m. UTC | #13
On 08/06/2014 06:21 AM, Martin Steigerwald wrote:

>> I think this should go to stable. Thanks, Liu.

I'm definitely tagging this for stable.

> 
> Unfortunately this fix does not seem to fix all lockups.

The traces below are a little different; could you please send the whole
file?

-chris

> 
> Just had a hard lockup again during java-bases CrashPlanPROe app backuping
> company data which is stored on BTRFS via ecryptfs to central Backup server.
> 
> It basically happened on about the first heavy write I/O occasion after
> the BTRFS trees filled the complete device:
> 
> I am now balancing the trees down to lower sizes manually with
> 
> btrfs balance start -dusage=10 /home
> 
> btrfs balance start -musage=10 /home
> 
> and raising values. BTW I got out of space with trying both at the same time:
> 
> merkaba:~#1> btrfs balance start -dusage=10 -musage=10 /home
> ERROR: error during balancing '/home' - No space left on device
> There may be more info in syslog - try dmesg | tail
> 
> merkaba:~#1> btrfs fi sh /home
> Label: 'home'  uuid: […]
>         Total devices 2 FS bytes used 128.76GiB
>         devid    1 size 160.00GiB used 146.00GiB path /dev/dm-0
>         devid    2 size 160.00GiB used 146.00GiB path /dev/mapper/sata-home
> 
> So I am pretty sure meanwhile that hangs can best be trigger *if* BTRFS
> trees fill the complete device.
> 
> I will try to keep tree sizes down as a work-around for now even it if means
> additional write access towards the SSD devices.
> 
> And make sure tree sizes stay down on my first server BTRFS as well although
> this uses debian backport kernel 3.14 and thus may not be affected.
> 
> Are there any other fixes to try out? I really like to see this resolved. Its
> in two stable kernel revisions already: 3.15 and 3.16. And by this it means
> if not fixed next Debian stable (Jessie) will be affected by it.
> 
> 
> Some kern.log (have stored the complete file)
> 
> Aug  6 12:01:16 merkaba kernel: [90496.262084] INFO: task java:21301 blocked for more than 120 seconds.
> Aug  6 12:01:16 merkaba kernel: [90496.263626]       Tainted: G           O  3.16.0-tp520-fixcompwrite+ #3
> Aug  6 12:01:16 merkaba kernel: [90496.265159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug  6 12:01:16 merkaba kernel: [90496.266756] java            D ffff880044e3cef0     0 21301      1 0x00000000
> Aug  6 12:01:16 merkaba kernel: [90496.268353]  ffff8801960e3bd8 0000000000000002 ffff880407f649b0 ffff8801960e3fd8
> Aug  6 12:01:16 merkaba kernel: [90496.269980]  ffff880044e3c9b0 0000000000013040 ffff880044e3c9b0 ffff88041e293040
> Aug  6 12:01:16 merkaba kernel: [90496.271766]  ffff88041e5c6868 ffff8801960e3c70 0000000000000002 ffffffff810db1d9
> Aug  6 12:01:16 merkaba kernel: [90496.273383] Call Trace:
> Aug  6 12:01:16 merkaba kernel: [90496.275017]  [<ffffffff810db1d9>] ? wait_on_page_read+0x37/0x37
> Aug  6 12:01:16 merkaba kernel: [90496.276630]  [<ffffffff81470fd0>] schedule+0x64/0x66
> Aug  6 12:01:16 merkaba kernel: [90496.278209]  [<ffffffff81471157>] io_schedule+0x57/0x76
> Aug  6 12:01:16 merkaba kernel: [90496.279817]  [<ffffffff810db1e2>] sleep_on_page+0x9/0xd
> Aug  6 12:01:16 merkaba kernel: [90496.281403]  [<ffffffff8147152d>] __wait_on_bit_lock+0x41/0x85
> Aug  6 12:01:16 merkaba kernel: [90496.282991]  [<ffffffff810db29f>] __lock_page+0x70/0x7c
> Aug  6 12:01:16 merkaba kernel: [90496.284550]  [<ffffffff81070f3a>] ? autoremove_wake_function+0x2f/0x2f
> Aug  6 12:01:16 merkaba kernel: [90496.286156]  [<ffffffffc0476617>] lock_page+0x1e/0x21 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.287742]  [<ffffffffc0476617>] ? lock_page+0x1e/0x21 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.289344]  [<ffffffffc047a8f4>] extent_write_cache_pages.isra.21.constprop.42+0x1a7/0x2d9 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.290955]  [<ffffffff810dba82>] ? find_get_pages_tag+0xfc/0x123
> Aug  6 12:01:16 merkaba kernel: [90496.292574]  [<ffffffffc047ae1c>] extent_writepages+0x46/0x57 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.294154]  [<ffffffffc04650b2>] ? btrfs_submit_direct+0x3ef/0x3ef [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.295760]  [<ffffffffc046361f>] btrfs_writepages+0x23/0x25 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.297492]  [<ffffffff810e556b>] do_writepages+0x1b/0x24
> Aug  6 12:01:16 merkaba kernel: [90496.299035]  [<ffffffff810dc88a>] __filemap_fdatawrite_range+0x50/0x52
> Aug  6 12:01:16 merkaba kernel: [90496.300561]  [<ffffffff810dc8f1>] filemap_fdatawrite_range+0xe/0x10
> Aug  6 12:01:16 merkaba kernel: [90496.302118]  [<ffffffffc046ea30>] btrfs_sync_file+0x67/0x2bd [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.303630]  [<ffffffff810dc88a>] ? __filemap_fdatawrite_range+0x50/0x52
> Aug  6 12:01:16 merkaba kernel: [90496.305158]  [<ffffffff81156289>] vfs_fsync_range+0x1c/0x1e
> Aug  6 12:01:16 merkaba kernel: [90496.306669]  [<ffffffff811562a2>] vfs_fsync+0x17/0x19
> Aug  6 12:01:16 merkaba kernel: [90496.308197]  [<ffffffffc04f732a>] ecryptfs_fsync+0x2f/0x34 [ecryptfs]
> Aug  6 12:01:16 merkaba kernel: [90496.309711]  [<ffffffff81156289>] vfs_fsync_range+0x1c/0x1e
> Aug  6 12:01:16 merkaba kernel: [90496.311249]  [<ffffffff811562a2>] vfs_fsync+0x17/0x19
> Aug  6 12:01:16 merkaba kernel: [90496.312771]  [<ffffffff811564c9>] do_fsync+0x2c/0x45
> Aug  6 12:01:16 merkaba kernel: [90496.314288]  [<ffffffff811566af>] SyS_fsync+0xb/0xf
> Aug  6 12:01:16 merkaba kernel: [90496.315800]  [<ffffffff8147420b>] tracesys+0xdd/0xe2
> 
> 
> 
> Aug  6 12:01:16 merkaba kernel: [90496.380221] INFO: task java:21563 blocked for more than 120 seconds.
> Aug  6 12:01:16 merkaba kernel: [90496.381691]       Tainted: G           O  3.16.0-tp520-fixcompwrite+ #3
> Aug  6 12:01:16 merkaba kernel: [90496.383192] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug  6 12:01:16 merkaba kernel: [90496.384687] java            D ffff880038111dd0     0 21563      1 0x00000000
> Aug  6 12:01:16 merkaba kernel: [90496.386203]  ffff88006df0fbd8 0000000000000002 ffffffff81a15500 ffff88006df0ffd8
> Aug  6 12:01:16 merkaba kernel: [90496.387843]  ffff880038111890 0000000000013040 ffff880038111890 ffff88041e213040
> Aug  6 12:01:16 merkaba kernel: [90496.389414]  ffff88041e5cc568 ffff88006df0fc70 0000000000000002 ffffffff810db1d9
> Aug  6 12:01:16 merkaba kernel: [90496.391031] Call Trace:
> Aug  6 12:01:16 merkaba kernel: [90496.392574]  [<ffffffff810db1d9>] ? wait_on_page_read+0x37/0x37
> Aug  6 12:01:16 merkaba kernel: [90496.394154]  [<ffffffff81470fd0>] schedule+0x64/0x66
> Aug  6 12:01:16 merkaba kernel: [90496.395686]  [<ffffffff81471157>] io_schedule+0x57/0x76
> Aug  6 12:01:16 merkaba kernel: [90496.397218]  [<ffffffff810db1e2>] sleep_on_page+0x9/0xd
> Aug  6 12:01:16 merkaba kernel: [90496.398723]  [<ffffffff8147152d>] __wait_on_bit_lock+0x41/0x85
> Aug  6 12:01:16 merkaba kernel: [90496.400232]  [<ffffffff810db29f>] __lock_page+0x70/0x7c
> Aug  6 12:01:16 merkaba kernel: [90496.401895]  [<ffffffff81070f3a>] ? autoremove_wake_function+0x2f/0x2f
> Aug  6 12:01:16 merkaba kernel: [90496.403440]  [<ffffffffc0476617>] lock_page+0x1e/0x21 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.404942]  [<ffffffffc0476617>] ? lock_page+0x1e/0x21 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.406433]  [<ffffffffc047a8f4>] extent_write_cache_pages.isra.21.constprop.42+0x1a7/0x2d9 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.407950]  [<ffffffff810dba82>] ? find_get_pages_tag+0xfc/0x123
> Aug  6 12:01:16 merkaba kernel: [90496.409474]  [<ffffffffc047ae1c>] extent_writepages+0x46/0x57 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.411020]  [<ffffffffc04650b2>] ? btrfs_submit_direct+0x3ef/0x3ef [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.412558]  [<ffffffffc046361f>] btrfs_writepages+0x23/0x25 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.414102]  [<ffffffff810e556b>] do_writepages+0x1b/0x24
> Aug  6 12:01:16 merkaba kernel: [90496.415621]  [<ffffffff810dc88a>] __filemap_fdatawrite_range+0x50/0x52
> Aug  6 12:01:16 merkaba kernel: [90496.417184]  [<ffffffff810dc8f1>] filemap_fdatawrite_range+0xe/0x10
> Aug  6 12:01:16 merkaba kernel: [90496.418753]  [<ffffffffc046ea30>] btrfs_sync_file+0x67/0x2bd [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.420344]  [<ffffffff810dc88a>] ? __filemap_fdatawrite_range+0x50/0x52
> Aug  6 12:01:16 merkaba kernel: [90496.421914]  [<ffffffff81156289>] vfs_fsync_range+0x1c/0x1e
> Aug  6 12:01:16 merkaba kernel: [90496.423467]  [<ffffffff811562a2>] vfs_fsync+0x17/0x19
> Aug  6 12:01:16 merkaba kernel: [90496.425051]  [<ffffffffc04f732a>] ecryptfs_fsync+0x2f/0x34 [ecryptfs]
> Aug  6 12:01:16 merkaba kernel: [90496.426593]  [<ffffffff81156289>] vfs_fsync_range+0x1c/0x1e
> Aug  6 12:01:16 merkaba kernel: [90496.428280]  [<ffffffff811562a2>] vfs_fsync+0x17/0x19
> Aug  6 12:01:16 merkaba kernel: [90496.429853]  [<ffffffff811564c9>] do_fsync+0x2c/0x45
> Aug  6 12:01:16 merkaba kernel: [90496.431351]  [<ffffffff811566af>] SyS_fsync+0xb/0xf
> Aug  6 12:01:16 merkaba kernel: [90496.432841]  [<ffffffff8147420b>] tracesys+0xdd/0xe2
> 
> 
> 
> Aug  6 12:01:16 merkaba kernel: [90496.434306] INFO: task kworker/u8:3:21401 blocked for more than 120 seconds.
> Aug  6 12:01:16 merkaba kernel: [90496.435814]       Tainted: G           O  3.16.0-tp520-fixcompwrite+ #3
> Aug  6 12:01:16 merkaba kernel: [90496.437328] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug  6 12:01:16 merkaba kernel: [90496.438885] kworker/u8:3    D ffff880133ebe780     0 21401      2 0x00000000
> Aug  6 12:01:16 merkaba kernel: [90496.440464] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.442037]  ffff88003e953b18 0000000000000002 ffffffff81a15500 ffff88003e953fd8
> Aug  6 12:01:16 merkaba kernel: [90496.443639]  ffff880133ebe240 0000000000013040 ffff880133ebe240 ffff88041e213040
> Aug  6 12:01:16 merkaba kernel: [90496.445246]  ffff88041e5bc968 ffff88003e953bb0 0000000000000002 ffffffff810db1d9
> Aug  6 12:01:16 merkaba kernel: [90496.446901] Call Trace:
> Aug  6 12:01:16 merkaba kernel: [90496.448485]  [<ffffffff810db1d9>] ? wait_on_page_read+0x37/0x37
> Aug  6 12:01:16 merkaba kernel: [90496.450081]  [<ffffffff81470fd0>] schedule+0x64/0x66
> Aug  6 12:01:16 merkaba kernel: [90496.451682]  [<ffffffff81471157>] io_schedule+0x57/0x76
> Aug  6 12:01:16 merkaba kernel: [90496.453271]  [<ffffffff810db1e2>] sleep_on_page+0x9/0xd
> Aug  6 12:01:16 merkaba kernel: [90496.455037]  [<ffffffff8147152d>] __wait_on_bit_lock+0x41/0x85
> Aug  6 12:01:16 merkaba kernel: [90496.456617]  [<ffffffff810db29f>] __lock_page+0x70/0x7c
> Aug  6 12:01:16 merkaba kernel: [90496.458203]  [<ffffffff81070f3a>] ? autoremove_wake_function+0x2f/0x2f
> Aug  6 12:01:16 merkaba kernel: [90496.459793]  [<ffffffffc0476617>] lock_page+0x1e/0x21 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.461353]  [<ffffffffc0476617>] ? lock_page+0x1e/0x21 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.462917]  [<ffffffffc047a8f4>] extent_write_cache_pages.isra.21.constprop.42+0x1a7/0x2d9 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.464479]  [<ffffffff81009e60>] ? native_sched_clock+0x3a/0x3c
> Aug  6 12:01:16 merkaba kernel: [90496.466036]  [<ffffffffc047ae1c>] extent_writepages+0x46/0x57 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.467632]  [<ffffffff81066fbf>] ? task_group_account_field+0x3b/0x40
> Aug  6 12:01:16 merkaba kernel: [90496.469168]  [<ffffffffc04650b2>] ? btrfs_submit_direct+0x3ef/0x3ef [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.470737]  [<ffffffffc046361f>] btrfs_writepages+0x23/0x25 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.472307]  [<ffffffff810e556b>] do_writepages+0x1b/0x24
> Aug  6 12:01:16 merkaba kernel: [90496.473885]  [<ffffffff810dc88a>] __filemap_fdatawrite_range+0x50/0x52
> Aug  6 12:01:16 merkaba kernel: [90496.475458]  [<ffffffff810dd1b3>] filemap_flush+0x17/0x19
> Aug  6 12:01:16 merkaba kernel: [90496.477041]  [<ffffffffc0465df9>] btrfs_run_delalloc_work+0x2e/0x64 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.478624]  [<ffffffffc04863d7>] normal_work_helper+0xdf/0x224 [btrfs]
> Aug  6 12:01:16 merkaba kernel: [90496.480257]  [<ffffffff81052d8c>] process_one_work+0x16f/0x2b8
> Aug  6 12:01:16 merkaba kernel: [90496.481977]  [<ffffffff81053636>] worker_thread+0x27b/0x32e
> Aug  6 12:01:16 merkaba kernel: [90496.483544]  [<ffffffff810533bb>] ? cancel_delayed_work_sync+0x10/0x10
> Aug  6 12:01:16 merkaba kernel: [90496.485082]  [<ffffffff81058012>] kthread+0xb2/0xba
> Aug  6 12:01:16 merkaba kernel: [90496.486624]  [<ffffffff81470000>] ? ap_handle_dropped_data+0xf/0xc8
> Aug  6 12:01:16 merkaba kernel: [90496.488148]  [<ffffffff81057f60>] ? __kthread_parkme+0x62/0x62
> Aug  6 12:01:16 merkaba kernel: [90496.489719]  [<ffffffff81473f6c>] ret_from_fork+0x7c/0xb0
> Aug  6 12:01:16 merkaba kernel: [90496.491265]  [<ffffffff81057f60>] ? __kthread_parkme+0x62/0x62
> 
> 
> Ciao,
> 
Martin Steigerwald Aug. 6, 2014, 2:43 p.m. UTC | #14
On Wednesday, 6 August 2014, 09:35:51, Chris Mason wrote:
> On 08/06/2014 06:21 AM, Martin Steigerwald wrote:
> >> I think this should go to stable. Thanks, Liu.
> 
> I'm definitely tagging this for stable.
> 
> > Unfortunately this fix does not seem to fix all lockups.
> 
> The traces below are a little different, could you please send the whole
> file?

Will paste it at the end.
 
> > Just had a hard lockup again during java-bases CrashPlanPROe app backuping
> > company data which is stored on BTRFS via ecryptfs to central Backup
> > server.
> > 
> > It basically happened on about the first heavy write I/O occasion after
> > the BTRFS trees filled the complete device:
> > 
> > I am now balancing the trees down to lower sizes manually with
> > 
> > btrfs balance start -dusage=10 /home
> > 
> > btrfs balance start -musage=10 /home
> > 
> > and raising values. BTW I got out of space with trying both at the same
> > time:
> > 
> > merkaba:~#1> btrfs balance start -dusage=10 -musage=10 /home
> > ERROR: error during balancing '/home' - No space left on device
> > There may be more info in syslog - try dmesg | tail
> > 
> > merkaba:~#1> btrfs fi sh /home
> > Label: 'home'  uuid: […]
> > 
> >         Total devices 2 FS bytes used 128.76GiB
> >         devid    1 size 160.00GiB used 146.00GiB path /dev/dm-0
> >         devid    2 size 160.00GiB used 146.00GiB path
> >         /dev/mapper/sata-home
> > 
> > So I am pretty sure meanwhile that hangs can best be trigger *if* BTRFS
> > trees fill the complete device.
> > 
> > I will try to keep tree sizes down as a work-around for now even it if
> > means additional write access towards the SSD devices.
> > 
> > And make sure tree sizes stay down on my first server BTRFS as well
> > although this uses debian backport kernel 3.14 and thus may not be
> > affected.
> > 
> > Are there any other fixes to try out? I really like to see this resolved.
> > Its in two stable kernel revisions already: 3.15 and 3.16. And by this it
> > means if not fixed next Debian stable (Jessie) will be affected by it.
> > 
> > 
> > Some kern.log (have stored the complete file)
[…]

Attached, xz compressed, since 180 KB as plaintext in a mail is a bit much.

This is the complete file from today: resuming from hibernation this morning,
the BTRFS hang, the reboot, and the first balancing runs to make BTRFS more
stable again.

Interestingly, it was ecryptfs that reported issues before the BTRFS hang got
reported. There are several slightly different-looking traces in there.

Thanks,
Chris Mason Aug. 6, 2014, 3:18 p.m. UTC | #15
On 08/06/2014 10:43 AM, Martin Steigerwald wrote:
> On Wednesday, 6 August 2014, 09:35:51, Chris Mason wrote:
>> On 08/06/2014 06:21 AM, Martin Steigerwald wrote:
>>>> I think this should go to stable. Thanks, Liu.
>>
>> I'm definitely tagging this for stable.
>>
>>> Unfortunately this fix does not seem to fix all lockups.
>>
>> The traces below are a little different, could you please send the whole
>> file?
> 
> Will paste it at the end.

[90496.156016] kworker/u8:14   D ffff880044e38540     0 21050      2 0x00000000
[90496.157683] Workqueue: btrfs-delalloc normal_work_helper [btrfs]
[90496.159320]  ffff88022880f990 0000000000000002 ffff880407f649b0 ffff88022880ffd8
[90496.160997]  ffff880044e38000 0000000000013040 ffff880044e38000 7fffffffffffffff
[90496.162686]  ffff880301383aa0 0000000000000002 ffffffff814705d0 ffff880301383a98
[90496.164360] Call Trace:
[90496.166028]  [<ffffffff814705d0>] ? michael_mic.part.6+0x21/0x21
[90496.167854]  [<ffffffff81470fd0>] schedule+0x64/0x66
[90496.169574]  [<ffffffff814705ff>] schedule_timeout+0x2f/0x114
[90496.171221]  [<ffffffff8106479a>] ? wake_up_process+0x2f/0x32
[90496.172867]  [<ffffffff81062c3b>] ? get_parent_ip+0xd/0x3c
[90496.174472]  [<ffffffff81062ce5>] ? preempt_count_add+0x7b/0x8e
[90496.176053]  [<ffffffff814717f3>] __wait_for_common+0x11e/0x163
[90496.177619]  [<ffffffff814717f3>] ? __wait_for_common+0x11e/0x163
[90496.179173]  [<ffffffff810647aa>] ? wake_up_state+0xd/0xd
[90496.180728]  [<ffffffff81471857>] wait_for_completion+0x1f/0x21
[90496.182285]  [<ffffffffc044e3b3>] btrfs_async_run_delayed_refs+0xbf/0xd9 [btrfs]
[90496.183833]  [<ffffffffc04624e1>] __btrfs_end_transaction+0x2b6/0x2ec [btrfs]
[90496.185380]  [<ffffffffc0462522>] btrfs_end_transaction+0xb/0xd [btrfs]
[90496.186940]  [<ffffffffc0451742>] find_free_extent+0x8a9/0x976 [btrfs]
[90496.189464]  [<ffffffffc0451990>] btrfs_reserve_extent+0x6f/0x119 [btrfs]
[90496.191326]  [<ffffffffc0466b45>] cow_file_range+0x1a6/0x377 [btrfs]
[90496.193080]  [<ffffffffc047adc4>] ? extent_write_locked_range+0x10c/0x11e [btrfs]
[90496.194659]  [<ffffffffc04677e4>] submit_compressed_extents+0x100/0x412 [btrfs]
[90496.196225]  [<ffffffff8120e344>] ? debug_smp_processor_id+0x17/0x19
[90496.197776]  [<ffffffffc0467b78>] async_cow_submit+0x82/0x87 [btrfs]
[90496.199383]  [<ffffffffc048644b>] normal_work_helper+0x153/0x224 [btrfs]
[90496.200944]  [<ffffffff81052d8c>] process_one_work+0x16f/0x2b8
[90496.202483]  [<ffffffff81053636>] worker_thread+0x27b/0x32e
[90496.204000]  [<ffffffff810533bb>] ? cancel_delayed_work_sync+0x10/0x10
[90496.205514]  [<ffffffff81058012>] kthread+0xb2/0xba
[90496.207040]  [<ffffffff81470000>] ? ap_handle_dropped_data+0xf/0xc8
[90496.208565]  [<ffffffff81057f60>] ? __kthread_parkme+0x62/0x62
[90496.210096]  [<ffffffff81473f6c>] ret_from_fork+0x7c/0xb0
[90496.211618]  [<ffffffff81057f60>] ? __kthread_parkme+0x62/0x62


Ok, this should explain the hang.  submit_compressed_extents is calling
cow_file_range with a locked page.

cow_file_range is trying to find a free extent and in the process is
calling btrfs_end_transaction, which is running the async delayed refs,
which is trying to write dirty pages, which is waiting for your locked
page.

I should be able to reproduce this ;)

-chris

Chris Mason Aug. 7, 2014, 12:52 a.m. UTC | #16
On 08/06/2014 11:18 AM, Chris Mason wrote:
> On 08/06/2014 10:43 AM, Martin Steigerwald wrote:
>> On Wednesday, 6 August 2014, 09:35:51, Chris Mason wrote:
>>> On 08/06/2014 06:21 AM, Martin Steigerwald wrote:
>>>>> I think this should go to stable. Thanks, Liu.
>>>
>>> I'm definitely tagging this for stable.
>>>
>>>> Unfortunately this fix does not seem to fix all lockups.
>>>
>>> The traces below are a little different, could you please send the whole
>>> file?
>>
>> Will paste it at the end.
> 
> [90496.156016] kworker/u8:14   D ffff880044e38540     0 21050      2 0x00000000
> [90496.157683] Workqueue: btrfs-delalloc normal_work_helper [btrfs]
> [90496.159320]  ffff88022880f990 0000000000000002 ffff880407f649b0 ffff88022880ffd8
> [90496.160997]  ffff880044e38000 0000000000013040 ffff880044e38000 7fffffffffffffff
> [90496.162686]  ffff880301383aa0 0000000000000002 ffffffff814705d0 ffff880301383a98
> [90496.164360] Call Trace:
> [90496.166028]  [<ffffffff814705d0>] ? michael_mic.part.6+0x21/0x21
> [90496.167854]  [<ffffffff81470fd0>] schedule+0x64/0x66
> [90496.169574]  [<ffffffff814705ff>] schedule_timeout+0x2f/0x114
> [90496.171221]  [<ffffffff8106479a>] ? wake_up_process+0x2f/0x32
> [90496.172867]  [<ffffffff81062c3b>] ? get_parent_ip+0xd/0x3c
> [90496.174472]  [<ffffffff81062ce5>] ? preempt_count_add+0x7b/0x8e
> [90496.176053]  [<ffffffff814717f3>] __wait_for_common+0x11e/0x163
> [90496.177619]  [<ffffffff814717f3>] ? __wait_for_common+0x11e/0x163
> [90496.179173]  [<ffffffff810647aa>] ? wake_up_state+0xd/0xd
> [90496.180728]  [<ffffffff81471857>] wait_for_completion+0x1f/0x21
> [90496.182285]  [<ffffffffc044e3b3>] btrfs_async_run_delayed_refs+0xbf/0xd9 [btrfs]
> [90496.183833]  [<ffffffffc04624e1>] __btrfs_end_transaction+0x2b6/0x2ec [btrfs]
> [90496.185380]  [<ffffffffc0462522>] btrfs_end_transaction+0xb/0xd [btrfs]
> [90496.186940]  [<ffffffffc0451742>] find_free_extent+0x8a9/0x976 [btrfs]
> [90496.189464]  [<ffffffffc0451990>] btrfs_reserve_extent+0x6f/0x119 [btrfs]
> [90496.191326]  [<ffffffffc0466b45>] cow_file_range+0x1a6/0x377 [btrfs]
> [90496.193080]  [<ffffffffc047adc4>] ? extent_write_locked_range+0x10c/0x11e [btrfs]
> [90496.194659]  [<ffffffffc04677e4>] submit_compressed_extents+0x100/0x412 [btrfs]
> [90496.196225]  [<ffffffff8120e344>] ? debug_smp_processor_id+0x17/0x19
> [90496.197776]  [<ffffffffc0467b78>] async_cow_submit+0x82/0x87 [btrfs]
> [90496.199383]  [<ffffffffc048644b>] normal_work_helper+0x153/0x224 [btrfs]
> [90496.200944]  [<ffffffff81052d8c>] process_one_work+0x16f/0x2b8
> [90496.202483]  [<ffffffff81053636>] worker_thread+0x27b/0x32e
> [90496.204000]  [<ffffffff810533bb>] ? cancel_delayed_work_sync+0x10/0x10
> [90496.205514]  [<ffffffff81058012>] kthread+0xb2/0xba
> [90496.207040]  [<ffffffff81470000>] ? ap_handle_dropped_data+0xf/0xc8
> [90496.208565]  [<ffffffff81057f60>] ? __kthread_parkme+0x62/0x62
> [90496.210096]  [<ffffffff81473f6c>] ret_from_fork+0x7c/0xb0
> [90496.211618]  [<ffffffff81057f60>] ? __kthread_parkme+0x62/0x62
> 
> 
> Ok, this should explain the hang.  submit_compressed_extents is calling
> cow_file_range with a locked page.
> 
> cow_file_range is trying to find a free extent and in the process is
> calling btrfs_end_transaction, which is running the async delayed refs,
> which is trying to write dirty pages, which is waiting for your locked
> page.
> 
> I should be able to reproduce this ;)

This part of the trace is relatively new because Liu Bo's patch made us
redirty the pages, making it more likely that we'd try to write them
during commit.

But, at the end of the day we have a fundamental deadlock with
committing a transaction while holding a locked page from an ordered file.

For now, I'm ripping out the strict ordered file and going back to a
best-effort filemap_flush like ext4 is using.
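For readers following the "best effort" idea: start writeback of the dirty
pages but do not wait for it, instead of forcing a flush-and-wait that can end
up inside a transaction commit.  A rough userspace analogy (not the btrfs
change itself; sync_file_range(SYNC_FILE_RANGE_WRITE) plays the role of
filemap_flush(), fdatasync() the role of a strict wait):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "hello\n", 6) != 6)
		perror("write");

	/* best effort: kick off writeback of the dirty pages, do not wait */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
		perror("sync_file_range");

	/* strict: block until the data has actually been written back */
	if (fdatasync(fd) != 0)
		perror("fdatasync");

	close(fd);
	return 0;
}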

-chris

Liu Bo Aug. 7, 2014, 7:50 a.m. UTC | #17
Hi,

On Wed, Aug 06, 2014 at 08:52:41PM -0400, Chris Mason wrote:
> On 08/06/2014 11:18 AM, Chris Mason wrote:
> > On 08/06/2014 10:43 AM, Martin Steigerwald wrote:
> >> Am Mittwoch, 6. August 2014, 09:35:51 schrieb Chris Mason:
> >>> On 08/06/2014 06:21 AM, Martin Steigerwald wrote:
> >>>>> I think this should go to stable. Thanks, Liu.
> >>>
> >>> I'm definitely tagging this for stable.
> >>>
> >>>> Unfortunately this fix does not seem to fix all lockups.
> >>>
> >>> The traces below are a little different, could you please send the whole
> >>> file?
> >>
> >> Will paste it at the end.
> > 
> > [kworker stack trace snipped -- identical to the trace quoted in full above]
> > 
> > 
> > Ok, this should explain the hang.  submit_compressed_extents is calling
> > cow_file_range with a locked page.
> > 
> > cow_file_range is trying to find a free extent and in the process is
> > calling btrfs_end_transaction, which is running the async delayed refs,
> > which is trying to write dirty pages, which is waiting for your locked
> > page.
> > 
> > I should be able to reproduce this ;)
> 
> This part of the trace is relatively new because Liu Bo's patch made us
> redirty the pages, making it more likely that we'd try to write them
> during commit.
> 
> But, at the end of the day we have a fundamental deadlock with
> committing a transaction while holding a locked page from an ordered file.
> 
> For now, I'm ripping out the strict ordered file and going back to a
> best-effort filemap_flush like ext4 is using.

I think I've figured the deadlock out; this is obviously a race case, and it's really
hard to reproduce :-(

So it turns out to be related to workqueues -- now a kthread can process
work_structs queued on different workqueues, so we can explain the deadlock as
follows:

(1)
"btrfs-delalloc" workqueue gets a compressed extent to process with all its pages
locked during this, and it runs into read free space cache inode, and then wait
on lock_page().

(2)
Reading that free space cache inode reaches the submit part, and its endio is
indirect and happens in two steps: the first endio gets to end_workqueue_bio() and
queues a work on the "btrfs-endio-meta" workqueue, and that work runs the real
endio() for us, but ONLY once it is processed.

So the problem is that a kthread can serve several workqueues, which means
works queued on the "btrfs-endio-meta" workqueue and works queued on the
"btrfs-flush_delalloc" workqueue can end up on the same kthread's processing
list.  When the "btrfs-flush_delalloc" work waits for the compressed page and
the "btrfs-endio-meta" work comes after it, it hangs.

For now, making a "btrfs-endio-meta-high" workqueue can make it pass, but I'm still testing it.

PS:
However, I still have a question on this: a kernel workqueue kthread can wake up
another kthread in the same worker pool to run when it's about to sleep (in
__schedule()), so in our hang case, does it happen that there is no such
'other kthread'?

PPS:
Looking at the above new trace, I think we'd better come up with something new;
making a high-priority workqueue does not seem to be a good solution.

thanks,
-liubo

> 
> -chris
> 
Miao Xie Aug. 7, 2014, 8:20 a.m. UTC | #18
On Thu, 7 Aug 2014 15:50:30 +0800, Liu Bo wrote:
>>> [kworker stack trace snipped -- identical to the trace quoted in full above]
>>>
>>>
>>> Ok, this should explain the hang.  submit_compressed_extents is calling
>>> cow_file_range with a locked page.
>>>
>>> cow_file_range is trying to find a free extent and in the process is
>>> calling btrfs_end_transaction, which is running the async delayed refs,
>>> which is trying to write dirty pages, which is waiting for your locked
>>> page.
>>>
>>> I should be able to reproduce this ;)
>>
>> This part of the trace is relatively new because Liu Bo's patch made us
>> redirty the pages, making it more likely that we'd try to write them
>> during commit.
>>
>> But, at the end of the day we have a fundamental deadlock with
>> committing a transaction while holding a locked page from an ordered file.
>>
>> For now, I'm ripping out the strict ordered file and going back to a
>> best-effort filemap_flush like ext4 is using.
> 
> I think I've figured the deadlock out, this is obviously a race case, really
> hard to reproduce :-(
> 
> So it turns out to be related to workqueues -- now a kthread can process
> work_struct queued in different workqueues, so we can explain the deadlock as
> such,
> 
> (1)
> "btrfs-delalloc" workqueue gets a compressed extent to process with all its pages
> locked during this, and it runs into read free space cache inode, and then wait
> on lock_page().
> 
> (2)
> Reading that free space cache inode comes to submit part, and we have a
> indirect twice endio way for it, with the first endio we come to end_workqueue_bio()
> and queue a work in "btrfs-endio-meta" workqueue, and it will run the real
> endio() for us, but ONLY when it's processed.
> 
> So the problem is a kthread can serve several workqueues, which means
> works in "btrfs-endio-meta" workqueues and works in "btrfs-flush_delalloc"
> workqueues can be in the same processing list of a kthread.  When
> "btrfs-flush_delalloc" waits for the compressed page and "btrfs-endio-meta"
> comes after it, it hangs.

I don't think that is right. All the btrfs workqueues have the RECLAIM flag, which
means each btrfs workqueue has its own rescue worker. So the problem you describe
should not happen.

Thanks
Miao

> 
> For now, making a "btrfs-endio-meta-high" can make it pass, but still testing it.
> 
> PS:
> However, I still have a question on this, kernel workqueue kthread can wake up
> another kthread in the same worker pool to run when it's going to sleep(in
> __schedule()), so in our hang case, does it happen to be there is no such
> 'another kthread'?
> 
> PPS:
> Looking at the above new trace, I think we'd better come up with something new,
> making a high priority workqueue seems to be not a good solution.
> 
> thanks,
> -liubo
> 
>>
>> -chris
>>

Chris Mason Aug. 7, 2014, 2:02 p.m. UTC | #19
On 08/07/2014 04:20 AM, Miao Xie wrote:
> On Thu, 7 Aug 2014 15:50:30 +0800, Liu Bo wrote:
>>>> [kworker stack trace snipped -- identical to the trace quoted in full above]
>>>>
>>>>
>>>> Ok, this should explain the hang.  submit_compressed_extents is calling
>>>> cow_file_range with a locked page.
>>>>
>>>> cow_file_range is trying to find a free extent and in the process is
>>>> calling btrfs_end_transaction, which is running the async delayed refs,
>>>> which is trying to write dirty pages, which is waiting for your locked
>>>> page.
>>>>
>>>> I should be able to reproduce this ;)
>>>
>>> This part of the trace is relatively new because Liu Bo's patch made us
>>> redirty the pages, making it more likely that we'd try to write them
>>> during commit.
>>>
>>> But, at the end of the day we have a fundamental deadlock with
>>> committing a transaction while holding a locked page from an ordered file.
>>>
>>> For now, I'm ripping out the strict ordered file and going back to a
>>> best-effort filemap_flush like ext4 is using.
>>
>> I think I've figured the deadlock out, this is obviously a race case, really
>> hard to reproduce :-(
>>
>> So it turns out to be related to workqueues -- now a kthread can process
>> work_struct queued in different workqueues, so we can explain the deadlock as
>> such,
>>
>> (1)
>> "btrfs-delalloc" workqueue gets a compressed extent to process with all its pages
>> locked during this, and it runs into read free space cache inode, and then wait
>> on lock_page().
>>
>> (2)
>> Reading that free space cache inode comes to submit part, and we have a
>> indirect twice endio way for it, with the first endio we come to end_workqueue_bio()
>> and queue a work in "btrfs-endio-meta" workqueue, and it will run the real
>> endio() for us, but ONLY when it's processed.
>>
>> So the problem is a kthread can serve several workqueues, which means
>> works in "btrfs-endio-meta" workqueues and works in "btrfs-flush_delalloc"
>> workqueues can be in the same processing list of a kthread.  When
>> "btrfs-flush_delalloc" waits for the compressed page and "btrfs-endio-meta"
>> comes after it, it hangs.
> 
> I don't think it is right. All the btrfs workqueue has RECLAIM flag, which means
> each btrfs workqueue has its own rescue worker. So the problem you said should
> not happen.

Right, I traded some emails with Tejun about this and spent a few days
trying to prove the workqueues were doing the wrong thing.   It will end
up spawning another worker thread for the new work, and it won't get
queued up behind the existing thread.

If both work items went to the same workqueue, you'd definitely be right.

I've got a patch to change the flush-delalloc code so we don't do the
file writes during commit.  It seems like the only choice right now.

-chris
Liu Bo Aug. 10, 2014, 2:55 p.m. UTC | #20
On Thu, Aug 07, 2014 at 10:02:15AM -0400, Chris Mason wrote:
> 
> 
> On 08/07/2014 04:20 AM, Miao Xie wrote:
> > On Thu, 7 Aug 2014 15:50:30 +0800, Liu Bo wrote:
> >>>> [kworker stack trace snipped -- identical to the trace quoted in full above]
> >>>>
> >>>>
> >>>> Ok, this should explain the hang.  submit_compressed_extents is calling
> >>>> cow_file_range with a locked page.
> >>>>
> >>>> cow_file_range is trying to find a free extent and in the process is
> >>>> calling btrfs_end_transaction, which is running the async delayed refs,
> >>>> which is trying to write dirty pages, which is waiting for your locked
> >>>> page.
> >>>>
> >>>> I should be able to reproduce this ;)
> >>>
> >>> This part of the trace is relatively new because Liu Bo's patch made us
> >>> redirty the pages, making it more likely that we'd try to write them
> >>> during commit.
> >>>
> >>> But, at the end of the day we have a fundamental deadlock with
> >>> committing a transaction while holding a locked page from an ordered file.
> >>>
> >>> For now, I'm ripping out the strict ordered file and going back to a
> >>> best-effort filemap_flush like ext4 is using.
> >>
> >> I think I've figured the deadlock out, this is obviously a race case, really
> >> hard to reproduce :-(
> >>
> >> So it turns out to be related to workqueues -- now a kthread can process
> >> work_struct queued in different workqueues, so we can explain the deadlock as
> >> such,
> >>
> >> (1)
> >> "btrfs-delalloc" workqueue gets a compressed extent to process with all its pages
> >> locked during this, and it runs into read free space cache inode, and then wait
> >> on lock_page().
> >>
> >> (2)
> >> Reading that free space cache inode comes to submit part, and we have a
> >> indirect twice endio way for it, with the first endio we come to end_workqueue_bio()
> >> and queue a work in "btrfs-endio-meta" workqueue, and it will run the real
> >> endio() for us, but ONLY when it's processed.
> >>
> >> So the problem is a kthread can serve several workqueues, which means
> >> works in "btrfs-endio-meta" workqueues and works in "btrfs-flush_delalloc"
> >> workqueues can be in the same processing list of a kthread.  When
> >> "btrfs-flush_delalloc" waits for the compressed page and "btrfs-endio-meta"
> >> comes after it, it hangs.
> > 
> > I don't think it is right. All the btrfs workqueue has RECLAIM flag, which means
> > each btrfs workqueue has its own rescue worker. So the problem you said should
> > not happen.
> 
> Right, I traded some emails with Tejun about this and spent a few days
> trying to prove the workqueues were doing the wrong thing.   It will end
> up spawning another worker thread for the new work, and it won't get
> queued up behind the existing thread.
> 
> If both work items went to the same workqueue, you'd definitely be right.
> 
> I've got a patch to change the flush-delalloc code so we don't do the
> file writes during commit.  It seems like the only choice right now.

Not the only choice any more ;)

It turns out to be related to async_cow's ordered list.  Say we have two
async_cow works on the wq->ordered_list: the first work (named A) finishes its
->ordered_func() and ->ordered_free(), and the second work (B) starts B's
->ordered_func(), which gets to read the free space cache inode, where it queues
a work on @endio_meta_workers -- but this work happens to have the same address
as A's work.

So now the situation is:

(1) in the kthread's worker_thread() loop, work A is actually running its job;
(2) however, work A has already freed its memory, but the kthread still refers
    to that memory address, which means worker->current_work is still A's address;
(3) B's readahead of the free space cache inode happens to queue a work whose
    memory address is exactly the old address of A's work, which means another
    worker's ->current_work is also A's address;
(4) since in btrfs we all use the function normal_work_helper(),
    worker->current_func is the same everywhere;
(5) worker_thread()
           ->process_one_work()
                    ->find_worker_executing_work()
                    (finds a collision, so the new work never gets run)

Then we saw the hang.
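
To make the collision concrete, here is a hedged userspace sketch (not kernel
code; it only paraphrases the pointer-plus-function check described above, and
malloc() handing back the freed address is likely but not guaranteed):

#include <stdio.h>
#include <stdlib.h>

struct work {
	void (*func)(struct work *);
};

/* every btrfs work item points at the same helper, as described above */
static void normal_work_helper(struct work *w)
{
	(void)w;
}

/* what a worker remembers about the work it is (or was) running */
struct worker {
	void *current_work;
	void (*current_func)(struct work *);
};

/* paraphrase of the pointer + function test used to spot "already running" */
static int looks_like_running(const struct worker *wk, const struct work *w)
{
	return wk->current_work == w && wk->current_func == w->func;
}

int main(void)
{
	struct worker stale = { 0 };

	struct work *a = malloc(sizeof(*a));		/* work A */
	a->func = normal_work_helper;
	stale.current_work = a;				/* worker is "running" A */
	stale.current_func = a->func;

	free(a);		/* ->ordered_free(): A is gone, the pointer goes stale */

	struct work *b = malloc(sizeof(*b));		/* B's endio work may reuse A's slot */
	b->func = normal_work_helper;

	printf("address reused: %s, false collision: %s\n",
	       stale.current_work == (void *)b ? "yes" : "no",
	       looks_like_running(&stale, b) ? "yes" : "no");

	free(b);
	return 0;
}

One way to break a false match like this is to give each workqueue its own
helper function, so current_func differs even when a heap address is reused.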

Still testing my patch; I will run it for about 2 days to check whether it works.

thanks,
-liubo
Chris Mason Aug. 11, 2014, 8:35 p.m. UTC | #21
On 08/10/2014 10:55 AM, Liu Bo wrote:
> On Thu, Aug 07, 2014 at 10:02:15AM -0400, Chris Mason wrote:
>>
>>
>> On 08/07/2014 04:20 AM, Miao Xie wrote:
>>> On Thu, 7 Aug 2014 15:50:30 +0800, Liu Bo wrote:
>>>>>> [kworker stack trace snipped -- identical to the trace quoted in full above]
>>>>>>
>>>>>>
>>>>>> Ok, this should explain the hang.  submit_compressed_extents is calling
>>>>>> cow_file_range with a locked page.
>>>>>>
>>>>>> cow_file_range is trying to find a free extent and in the process is
>>>>>> calling btrfs_end_transaction, which is running the async delayed refs,
>>>>>> which is trying to write dirty pages, which is waiting for your locked
>>>>>> page.
>>>>>>
>>>>>> I should be able to reproduce this ;)
>>>>>
>>>>> This part of the trace is relatively new because Liu Bo's patch made us
>>>>> redirty the pages, making it more likely that we'd try to write them
>>>>> during commit.
>>>>>
>>>>> But, at the end of the day we have a fundamental deadlock with
>>>>> committing a transaction while holding a locked page from an ordered file.
>>>>>
>>>>> For now, I'm ripping out the strict ordered file and going back to a
>>>>> best-effort filemap_flush like ext4 is using.
>>>>
>>>> I think I've figured the deadlock out, this is obviously a race case, really
>>>> hard to reproduce :-(
>>>>
>>>> So it turns out to be related to workqueues -- now a kthread can process
>>>> work_struct queued in different workqueues, so we can explain the deadlock as
>>>> such,
>>>>
>>>> (1)
>>>> "btrfs-delalloc" workqueue gets a compressed extent to process with all its pages
>>>> locked during this, and it runs into read free space cache inode, and then wait
>>>> on lock_page().
>>>>
>>>> (2)
>>>> Reading that free space cache inode comes to submit part, and we have a
>>>> indirect twice endio way for it, with the first endio we come to end_workqueue_bio()
>>>> and queue a work in "btrfs-endio-meta" workqueue, and it will run the real
>>>> endio() for us, but ONLY when it's processed.
>>>>
>>>> So the problem is a kthread can serve several workqueues, which means
>>>> works in "btrfs-endio-meta" workqueues and works in "btrfs-flush_delalloc"
>>>> workqueues can be in the same processing list of a kthread.  When
>>>> "btrfs-flush_delalloc" waits for the compressed page and "btrfs-endio-meta"
>>>> comes after it, it hangs.
>>>
>>> I don't think it is right. All the btrfs workqueue has RECLAIM flag, which means
>>> each btrfs workqueue has its own rescue worker. So the problem you said should
>>> not happen.
>>
>> Right, I traded some emails with Tejun about this and spent a few days
>> trying to prove the workqueues were doing the wrong thing.   It will end
>> up spawning another worker thread for the new work, and it won't get
>> queued up behind the existing thread.
>>
>> If both work items went to the same workqueue, you'd definitely be right.
>>
>> I've got a patch to change the flush-delalloc code so we don't do the
>> file writes during commit.  It seems like the only choice right now.
> 
> Not the only choice any more ;)
> 
> It turns out to be related to async_cow's ordered list, say we have two
> async_cow works on the wq->ordered_list, and the first work(named A) finishes its
> ->ordered_func() and ->ordered_free(), and the second work(B) starts B's
> ->ordered_func() which gets to read free space cache inode, where it queues a
> work on @endio_meta_workers, but this work happens to be the same address with
> A's work.
> 
> So now the situation is,
> 
> (1) in kthread's looping worker_thread(), work A is actually running its job,
> (2) however, work A has freed its memory but kthread still want to use this
>     address of memory, which means worker->current_work is still A's address.
> (3) B's readahead for free space cache inode happens to queue a work whose
>     address of memory is just the previous address of A's work, which means
>     another worker's ->current_work is also A's address.
> (4) as in btrfs we all use function normal_work_helper(), so
>     worker->current_func is fixed here.

^^^^^^^^^^^^^

Ugh!  You're completely right.  But this doesn't change the ABBA in the
ordered code ;)  We need both patches.

-chris
Miao Xie Aug. 12, 2014, 2:55 a.m. UTC | #22
On Sun, 10 Aug 2014 22:55:44 +0800, Liu Bo wrote:
>>>>> This part of the trace is relatively new because Liu Bo's patch made us
>>>>> redirty the pages, making it more likely that we'd try to write them
>>>>> during commit.
>>>>>
>>>>> But, at the end of the day we have a fundamental deadlock with
>>>>> committing a transaction while holding a locked page from an ordered file.
>>>>>
>>>>> For now, I'm ripping out the strict ordered file and going back to a
>>>>> best-effort filemap_flush like ext4 is using.
>>>>
>>>> I think I've figured the deadlock out, this is obviously a race case, really
>>>> hard to reproduce :-(
>>>>
>>>> So it turns out to be related to workqueues -- now a kthread can process
>>>> work_struct queued in different workqueues, so we can explain the deadlock as
>>>> such,
>>>>
>>>> (1)
>>>> "btrfs-delalloc" workqueue gets a compressed extent to process with all its pages
>>>> locked during this, and it runs into read free space cache inode, and then wait
>>>> on lock_page().
>>>>
>>>> (2)
>>>> Reading that free space cache inode comes to submit part, and we have a
>>>> indirect twice endio way for it, with the first endio we come to end_workqueue_bio()
>>>> and queue a work in "btrfs-endio-meta" workqueue, and it will run the real
>>>> endio() for us, but ONLY when it's processed.
>>>>
>>>> So the problem is a kthread can serve several workqueues, which means
>>>> works in "btrfs-endio-meta" workqueues and works in "btrfs-flush_delalloc"
>>>> workqueues can be in the same processing list of a kthread.  When
>>>> "btrfs-flush_delalloc" waits for the compressed page and "btrfs-endio-meta"
>>>> comes after it, it hangs.
>>>
>>> I don't think it is right. All the btrfs workqueue has RECLAIM flag, which means
>>> each btrfs workqueue has its own rescue worker. So the problem you said should
>>> not happen.
>>
>> Right, I traded some emails with Tejun about this and spent a few days
>> trying to prove the workqueues were doing the wrong thing.   It will end
>> up spawning another worker thread for the new work, and it won't get
>> queued up behind the existing thread.
>>
>> If both work items went to the same workqueue, you'd definitely be right.
>>
>> I've got a patch to change the flush-delalloc code so we don't do the
>> file writes during commit.  It seems like the only choice right now.
> 
> Not the only choice any more ;)
> 
> It turns out to be related to async_cow's ordered list, say we have two
> async_cow works on the wq->ordered_list, and the first work(named A) finishes its
> ->ordered_func() and ->ordered_free(), and the second work(B) starts B's
> ->ordered_func() which gets to read free space cache inode, where it queues a
> work on @endio_meta_workers, but this work happens to be the same address with
> A's work.
> 
> So now the situation is,
> 
> (1) in kthread's looping worker_thread(), work A is actually running its job,
> (2) however, work A has freed its memory but kthread still want to use this
>     address of memory, which means worker->current_work is still A's address.
> (3) B's readahead for free space cache inode happens to queue a work whose
>     address of memory is just the previous address of A's work, which means
>     another worker's ->current_work is also A's address.
> (4) as in btrfs we all use function normal_work_helper(), so
>     worker->current_func is fixed here.
> (5) worker_thread()
>            ->process_one_work()
>                     ->find_worker_executing_work()    
>                     (find a collision, another work returns)
> 
> Then we saw the hang.

Here is my understanding of what you said:
the same worker dealt with work A and work B, and the 3rd work, which was introduced by
work B and has the same virtual memory address as work A, was also inserted into the work
list of that worker. But work B was waiting for the 3rd work at that time, so the deadlock happened.
Am I right?

If I'm right, I think what you said is impossible. Before we deal with work B, we have
already invoked spin_unlock_irq(&pool->lock), which implies a memory barrier: all changes
made before the unlock must complete before the unlock. That is, the address in current_work
should already be the address of work B, so when we insert the 3rd work introduced by work B,
we should not find the address of work A in the current_work of work B's worker.

I cannot reproduce the problem on my machine, so I haven't verified whether what I said is
right or not. Please correct me if I am wrong.

Thanks
Miao
Liu Bo Aug. 12, 2014, 7:51 a.m. UTC | #23
On Tue, Aug 12, 2014 at 10:55:51AM +0800, Miao Xie wrote:
> On Sun, 10 Aug 2014 22:55:44 +0800, Liu Bo wrote:
> >>>>> This part of the trace is relatively new because Liu Bo's patch made us
> >>>>> redirty the pages, making it more likely that we'd try to write them
> >>>>> during commit.
> >>>>>
> >>>>> But, at the end of the day we have a fundamental deadlock with
> >>>>> committing a transaction while holding a locked page from an ordered file.
> >>>>>
> >>>>> For now, I'm ripping out the strict ordered file and going back to a
> >>>>> best-effort filemap_flush like ext4 is using.
> >>>>
> >>>> I think I've figured the deadlock out, this is obviously a race case, really
> >>>> hard to reproduce :-(
> >>>>
> >>>> So it turns out to be related to workqueues -- now a kthread can process
> >>>> work_struct queued in different workqueues, so we can explain the deadlock as
> >>>> such,
> >>>>
> >>>> (1)
> >>>> "btrfs-delalloc" workqueue gets a compressed extent to process with all its pages
> >>>> locked during this, and it runs into read free space cache inode, and then wait
> >>>> on lock_page().
> >>>>
> >>>> (2)
> >>>> Reading that free space cache inode comes to submit part, and we have a
> >>>> indirect twice endio way for it, with the first endio we come to end_workqueue_bio()
> >>>> and queue a work in "btrfs-endio-meta" workqueue, and it will run the real
> >>>> endio() for us, but ONLY when it's processed.
> >>>>
> >>>> So the problem is a kthread can serve several workqueues, which means
> >>>> works in "btrfs-endio-meta" workqueues and works in "btrfs-flush_delalloc"
> >>>> workqueues can be in the same processing list of a kthread.  When
> >>>> "btrfs-flush_delalloc" waits for the compressed page and "btrfs-endio-meta"
> >>>> comes after it, it hangs.
> >>>
> >>> I don't think it is right. All the btrfs workqueue has RECLAIM flag, which means
> >>> each btrfs workqueue has its own rescue worker. So the problem you said should
> >>> not happen.
> >>
> >> Right, I traded some emails with Tejun about this and spent a few days
> >> trying to prove the workqueues were doing the wrong thing.   It will end
> >> up spawning another worker thread for the new work, and it won't get
> >> queued up behind the existing thread.
> >>
> >> If both work items went to the same workqueue, you'd definitely be right.
> >>
> >> I've got a patch to change the flush-delalloc code so we don't do the
> >> file writes during commit.  It seems like the only choice right now.
> > 
> > Not the only choice any more ;)
> > 
> > It turns out to be related to async_cow's ordered list, say we have two
> > async_cow works on the wq->ordered_list, and the first work(named A) finishes its
> > ->ordered_func() and ->ordered_free(), and the second work(B) starts B's
> > ->ordered_func() which gets to read free space cache inode, where it queues a
> > work on @endio_meta_workers, but this work happens to be the same address with
> > A's work.
> > 
> > So now the situation is,
> > 
> > (1) in kthread's looping worker_thread(), work A is actually running its job,
> > (2) however, work A has freed its memory but kthread still want to use this
> >     address of memory, which means worker->current_work is still A's address.
> > (3) B's readahead for free space cache inode happens to queue a work whose
> >     address of memory is just the previous address of A's work, which means
> >     another worker's ->current_work is also A's address.
> > (4) as in btrfs we all use function normal_work_helper(), so
> >     worker->current_func is fixed here.
> > (5) worker_thread()
> >            ->process_one_work()
> >                     ->find_worker_executing_work()    
> >                     (find a collision, another work returns)
> > 
> > Then we saw the hang.
> 
> Here is my understand of what you said:
> The same worker dealt with work A and work B, and the 3rd work which was introduced by
> work B and has the same virtual memory address as work A was also inserted into the work
> list of that worker. But work B was wait for the 3rd work at that time, so deadlock happened.
> Am I right?
> 
> If I'm right, I think what you said is impossible. Before we dealt with work B, we should
> already invoke spin_unlock_irq(&pool->lock), which implies a memory barrier that all changes
> happens before unlock should complete before unlock, that is the address in current_work should
> be the address of work B, when we inserted the 3rd work which was introduced by work B, we
> should not find the address of work A in current_work of work B's worker.
> 
> I can not reproduce the problem on my machine, so I don't verify whether what I said is right
> or not. Please correct me if I am wrong.

Hmm, my fault... the problem is complex, so I didn't describe it 100% precisely
in the reply above.

I've sent a patch about this which has the truth you may want :-)

-liubo
diff mbox

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3668048..8ea7610 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -709,6 +709,18 @@  retry:
 				unlock_extent(io_tree, async_extent->start,
 					      async_extent->start +
 					      async_extent->ram_size - 1);
+
+				/*
+				 * we need to redirty the pages if we decide to
+				 * fallback to uncompressed IO, otherwise we
+				 * will not submit these pages down to lower
+				 * layers.
+				 */
+				extent_range_redirty_for_io(inode,
+						async_extent->start,
+						async_extent->start +
+						async_extent->ram_size - 1);
+
 				goto retry;
 			}
 			goto out_free;