Patchwork BUG: soft lockup - is this XFS problem?

Submitter Nick Piggin
Date Jan. 5, 2009, 6:48 a.m.
Message ID <20090105064838.GA5209@wotan.suse.de>
Permalink /patch/789/
State New, archived

Comments

Nick Piggin - Jan. 5, 2009, 6:48 a.m.
On Mon, Jan 05, 2009 at 05:19:59AM +0100, Nick Piggin wrote:
> On Mon, Jan 05, 2009 at 02:48:21AM +0100, Nick Piggin wrote:
> > 
> > OK.. Hmm, well here is a modification to your patch which might help further.
> > I'll see if I can reproduce it here meanwhile.
> 
> I have reproduced it. It seems like it might be a livelock condition
> because the system ended up recovering after I terminated the dd (and
> did so before I collected any real info, oops, hopefully I can
> reproduce it again).
> 
> This would fit with the problem going away when the debugging patch
> was applied. Timing changes...

No, I was wrong. The problem goes away with the patch applied because
a function call == a compiler barrier. The problem randomly recovered
for me because of something more subtle.

I believe this patch should solve it. Please test and confirm before
I send it upstream.

---

An XFS workload showed up a bug in the lockless pagecache patch. Basically it
would go into an "infinite" loop, although it would sometimes be able to break
out of the loop! The reason is a missing compiler barrier in the "increment
reference count unless it was zero" case of the lockless pagecache protocol in
the gang lookup functions.

This would cause the compiler to retry the operation with a cached value of the
struct page pointer rather than reloading it. So the page might have been
removed from the pagecache and freed (refcount == 0), but the lookup would not
notice that the page is no longer in the pagecache, and would keep attempting to
increment the refcount and failing, until the page got reallocated for something
else. This is not a data corruption, because the condition is detected if the
page has been reallocated. However, it can result in a lockup.

Add the required compiler barrier and a comment to fix this.

Assembly snippet from find_get_pages, before:
.L220:
        movq    (%rbx), %rax    #* ivtmp.1162, tmp82
        movq    (%rax), %rdi    #, prephitmp.1149
.L218:
        testb   $1, %dil        #, prephitmp.1149
        jne     .L217   #,
        testq   %rdi, %rdi      # prephitmp.1149
        je      .L203   #,
        cmpq    $-1, %rdi       #, prephitmp.1149
        je      .L217   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L218   #,

after:
.L212:
        movq    (%rbx), %rax    #* ivtmp.1109, tmp81
        movq    (%rax), %rdi    #, ret
        testb   $1, %dil        #, ret
        jne     .L211   #,
        testq   %rdi, %rdi      # ret
        je      .L197   #,
        cmpq    $-1, %rdi       #, ret
        je      .L211   #,
        movl    8(%rdi), %esi   # <variable>._count.counter, c
        testl   %esi, %esi      # c
        je      .L212   #,

(notice the obvious infinite loop in the first example, if page->count remains 0)


---
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Roman Kononov - Jan. 5, 2009, 2:25 p.m.
On 2009-01-05 00:48 Nick Piggin said the following:
> I believe this patch should solve it. Please test and confirm before
> I send it upstream.

3 systems with 2.6.27.10 have worked overnight with dd running 
continuously. They all failed within 20 minutes without the patch.

Thank you,

Roman
Peter Klotz - Jan. 5, 2009, 4:21 p.m.
Nick Piggin wrote:
> I believe this patch should solve it. Please test and confirm before
> I send it upstream.


My test successfully completed two times, writing a 900GB file in each run.
I used a patched 2.6.27.10 x86_64 kernel.

On an unpatched system this test usually fails before reaching 100GB.

Thank you for fixing this issue that quickly.

Regards, Peter.
Guus Sliepen - July 14, 2011, 11:23 a.m.
Hello,

I'm having a problem with a system having an XFS filesystem on RAID locking up
fairly consistently when writing large amounts of data to it, with several
kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
processors. The kernel always logs this several times:

BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]

With different CPU# numbers, but always in kswapd0. Eventually the system will
really lock up, requiring a reset. During soft lockups (when file transfer
apparently stalled), merely typing "ps aux" would often cause the lockup to end
immediately. After googling I found this page:

https://patchwork.kernel.org/patch/789/

An unpatched vanilla 2.6.39.3 consistently locked up. However, after patching it
(adding a barrier() after all 4 instances of
if (!page_cache_get_speculative(page))), the lockups never happened anymore, and
file transfer has been steady.

I also tested it with ext4, which doesn't give lockups on unpatched kernels,
but unfortunately mkfs.ext4 cannot create filesystems larger than 16TB yet, so
I have to use XFS instead.

On Mon, Jan 05, 2009 at 06:48:38AM -0000, Nick Piggin wrote:

> I believe this patch should solve it. Please test and confirm before
> I send it upstream.

Further comments on that thread in 2009 indicated the patch was very useful,
but it doesn't seem to have been applied upstream. Is there any reason this
patch should not be applied?

If necessary I can submit a reworked patch for 2.6.39.3 or 3.0 when that comes
out.

> ---
> An XFS workload showed up a bug in the lockless pagecache patch. Basically it
> would go into an "infinite" loop, although it would sometimes be able to break
> out of the loop! The reason is a missing compiler barrier in the "increment
> reference count unless it was zero" case of the lockless pagecache protocol in
> the gang lookup functions.
> 
> This would cause the compiler to use a cached value of struct page pointer to
> retry the operation with, rather than reload it. So the page might have been
> removed from pagecache and freed (refcount==0) but the lookup would not correctly
> notice the page is no longer in pagecache, and keep attempting to increment the
> refcount and failing, until the page gets reallocated for something else. This
> isn't a data corruption because the condition will be detected if the page has
> been reallocated. However it can result in a lockup. 
> 
> Add the required compiler barrier and a comment to fix this.
[...]
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
> +++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
> @@ -794,8 +794,19 @@ repeat:
>  		if (unlikely(page == RADIX_TREE_RETRY))
>  			goto restart;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/*
> +			 * A failed page_cache_get_speculative operation does
> +			 * not imply any barriers (Documentation/atomic_ops.txt),
> +			 * and as such, we must force the compiler to deref the
> +			 * radix-tree slot again rather than using the cached
> +			 * value (because we need to give up if the page has been
> +			 * removed from the radix-tree, rather than looping until
> +			 * it gets reused for something else).
> +			 */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {
> @@ -850,8 +861,11 @@ repeat:
>  		if (page->mapping == NULL || page->index != index)
>  			break;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/* barrier: see find_get_pages() */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {
> @@ -904,8 +918,11 @@ repeat:
>  		if (unlikely(page == RADIX_TREE_RETRY))
>  			goto restart;
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(page)) {
> +			/* barrier: see find_get_pages() */
> +			barrier();
>  			goto repeat;
> +		}
>  
>  		/* Has the page moved? */
>  		if (unlikely(page != *((void **)pages[i]))) {
Peter Klotz - July 14, 2011, 6:03 p.m.
On 07/14/2011 01:23 PM, Guus Sliepen wrote:

> I'm having a problem with a system having an XFS filesystem on RAID locking up
> fairly consistently when writing large amounts of data to it, with several
> kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
> processors. The kernel always logs this several times:
>
> BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]
...
>> I believe this patch should solve it. Please test and confirm before
>> I send it upstream.
>
> Further comments on that thread in 2009 indicated the patch was very useful,
> but it doesn't seem to have been applied upstream. Is there any reason this
> patch should not be applied?

Hello Guus

This Bugzilla entry documents the XFS bug from 2009 in detail including 
links:

http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

The problem was finally solved by a patch proposed by Linus. This is the 
reason the original patch developed by Nick never made it into the kernel.

My tests back then showed that both patches fixed the problem.

It seems you have found a test case where just Nick's patch helps.

Regards, Peter.
Guus Sliepen - July 14, 2011, 7:29 p.m.
On Thu, Jul 14, 2011 at 08:03:09PM +0200, Peter Klotz wrote:

> On 07/14/2011 01:23 PM, Guus Sliepen wrote:
> 
> >I'm having a problem with a system having an XFS filesystem on RAID locking up
> >fairly consistently when writing large amounts of data to it, with several
> >kernels, including 2.6.38.2 and 2.6.39.3, on both AMD and Intel multi-core
> >processors. The kernel always logs this several times:
> >
> >BUG: soft lockup - CPU#2 stuck for 67s! [kswapd0:33]
[...]
> This Bugzilla entry documents the XFS bug from 2009 in detail
> including links:
> 
> http://oss.sgi.com/bugzilla/show_bug.cgi?id=805

Aha, I did not look at that before.

> The problem was finally solved by a patch proposed by Linus. This is
> the reason the original patch developed by Nick never made it into
> the kernel.
> 
> My tests back then showed that both patches fixed the problem.
> 
> It seems you have found a test case where just Nick's patch helps.

Yes. I agree with Linus that the root cause should be fixed, not the symptoms.
I don't have time to dive into the kernel code myself, but I do have several
nearly identical machines on which I can test things. I will be happy to test
patches and/or different kernel versions or kernel configurations, and I can
provide dmesg output and other information if necessary.

Patch

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2009-01-05 17:22:57.000000000 +1100
+++ linux-2.6/mm/filemap.c	2009-01-05 17:28:40.000000000 +1100
@@ -794,8 +794,19 @@  repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/*
+			 * A failed page_cache_get_speculative operation does
+			 * not imply any barriers (Documentation/atomic_ops.txt),
+			 * and as such, we must force the compiler to deref the
+			 * radix-tree slot again rather than using the cached
+			 * value (because we need to give up if the page has been
+			 * removed from the radix-tree, rather than looping until
+			 * it gets reused for something else).
+			 */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -850,8 +861,11 @@  repeat:
 		if (page->mapping == NULL || page->index != index)
 			break;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {
@@ -904,8 +918,11 @@  repeat:
 		if (unlikely(page == RADIX_TREE_RETRY))
 			goto restart;
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(page)) {
+			/* barrier: see find_get_pages() */
+			barrier();
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *((void **)pages[i]))) {