From patchwork Tue Mar 1 11:56:45 2011 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Xin Zhong X-Patchwork-Id: 598811 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by demeter1.kernel.org (8.14.4/8.14.3) with ESMTP id p21ButMe031456 for ; Tue, 1 Mar 2011 11:56:56 GMT Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756328Ab1CAL4v (ORCPT ); Tue, 1 Mar 2011 06:56:51 -0500 Received: from mga09.intel.com ([134.134.136.24]:57276 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756291Ab1CAL4u convert rfc822-to-8bit (ORCPT ); Tue, 1 Mar 2011 06:56:50 -0500 Received: from orsmga001.jf.intel.com ([10.7.209.18]) by orsmga102.jf.intel.com with ESMTP; 01 Mar 2011 03:56:49 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.62,246,1297065600"; d="scan'208";a="714629174" Received: from pgsmsx602.gar.corp.intel.com ([10.221.43.81]) by orsmga001.jf.intel.com with ESMTP; 01 Mar 2011 03:56:48 -0800 Received: from shsmsx602.ccr.corp.intel.com (10.239.4.104) by pgsmsx602.gar.corp.intel.com (10.221.43.81) with Microsoft SMTP Server (TLS) id 8.2.254.0; Tue, 1 Mar 2011 19:56:47 +0800 Received: from shsmsx502.ccr.corp.intel.com ([10.239.4.96]) by SHSMSX602.ccr.corp.intel.com ([10.239.4.104]) with mapi; Tue, 1 Mar 2011 19:56:46 +0800 From: "Zhong, Xin" To: "Zhong, Xin" , Mitch Harder , =?iso-8859-1?Q?Maria_Wikstr=F6m?= CC: Josef Bacik , Johannes Hirte , Chris Mason , "linux-btrfs@vger.kernel.org" Date: Tue, 1 Mar 2011 19:56:45 +0800 Subject: RE: [PATCH] btrfs file write debugging patch Thread-Topic: [PATCH] btrfs file write debugging patch Thread-Index: AcvXhPFOxmkBMQjfTyitdOpM7tJ2zQAc9/vwAANZRBA= Message-ID: <1865303E0DED764181A9D882DEF65FB68662A5EA7B@shsmsx502.ccr.corp.intel.com> References: <1298857223-sup-5612@think> <201102281114.00018.johannes.hirte@fem.tu-ilmenau.de> <20110228161056.GA2769@localhost.localdomain> <1298911556.11118.8.camel@mainframe> <1865303E0DED764181A9D882DEF65FB68662A5EA63@shsmsx502.ccr.corp.intel.com> In-Reply-To: <1865303E0DED764181A9D882DEF65FB68662A5EA63@shsmsx502.ccr.corp.intel.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US MIME-Version: 1.0 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Greylist: IP, sender and recipient auto-whitelisted, not delayed by milter-greylist-4.2.6 (demeter1.kernel.org [140.211.167.41]); Tue, 01 Mar 2011 11:56:57 +0000 (UTC) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 65338a1..b9d0929 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1007,17 +1007,16 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, goto out; } - ret = btrfs_delalloc_reserve_space(inode, - num_pages << PAGE_CACHE_SHIFT); - if (ret) - goto out; - ret = prepare_pages(root, file, pages, num_pages, pos, first_index, last_index, write_bytes); - if (ret) { - btrfs_delalloc_release_space(inode, + if (ret) + goto out; + + ret = btrfs_delalloc_reserve_space(inode, num_pages << PAGE_CACHE_SHIFT); + if (ret) { + btrfs_drop_pages(pages, num_pages); goto out; } -----Original Message----- From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Zhong, Xin Sent: Tuesday, March 01, 2011 6:15 PM To: Mitch Harder; Maria Wikström Cc: Josef Bacik; Johannes Hirte; Chris Mason; linux-btrfs@vger.kernel.org Subject: RE: [PATCH] btrfs file write debugging patch Is your system running out of memory or is there any other thread like flush-btrfs competing for the same page? I can only see one process in your ftrace log. You may need to trace all btrfs.ko function calls instead of a single process. Thanks! -----Original Message----- From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-owner@vger.kernel.org] On Behalf Of Mitch Harder Sent: Tuesday, March 01, 2011 4:20 AM To: Maria Wikström Cc: Josef Bacik; Johannes Hirte; Chris Mason; Zhong, Xin; linux-btrfs@vger.kernel.org Subject: Re: [PATCH] btrfs file write debugging patch 2011/2/28 Mitch Harder : > 2011/2/28 Maria Wikström : >> mån 2011-02-28 klockan 11:10 -0500 skrev Josef Bacik: >>> On Mon, Feb 28, 2011 at 11:13:59AM +0100, Johannes Hirte wrote: >>> > On Monday 28 February 2011 02:46:05 Chris Mason wrote: >>> > > Excerpts from Mitch Harder's message of 2011-02-25 13:43:37 -0500: >>> > > > Some clarification on my previous message... >>> > > > >>> > > > After looking at my ftrace log more closely, I can see where Btrfs is >>> > > > trying to release the allocated pages.  However, the calculation for >>> > > > the number of dirty_pages is equal to 1 when "copied == 0". >>> > > > >>> > > > So I'm seeing at least two problems: >>> > > > (1)  It keeps looping when "copied == 0". >>> > > > (2)  One dirty page is not being released on every loop even though >>> > > > "copied == 0" (at least this problem keeps it from being an infinite >>> > > > loop by eventually exhausting reserveable space on the disk). >>> > > >>> > > Hi everyone, >>> > > >>> > > There are actually tow bugs here.  First the one that Mitch hit, and a >>> > > second one that still results in bad file_write results with my >>> > > debugging hunks (the first two hunks below) in place. >>> > > >>> > > My patch fixes Mitch's bug by checking for copied == 0 after >>> > > btrfs_copy_from_user and going the correct delalloc accounting.  This >>> > > one looks solved, but you'll notice the patch is bigger. >>> > > >>> > > First, I add some random failures to btrfs_copy_from_user() by failing >>> > > everyone once and a while.  This was much more reliable than trying to >>> > > use memory pressure than making copy_from_user fail. >>> > > >>> > > If copy_from_user fails and we partially update a page, we end up with a >>> > > page that may go away due to memory pressure.  But, btrfs_file_write >>> > > assumes that only the first and last page may have good data that needs >>> > > to be read off the disk. >>> > > >>> > > This patch ditches that code and puts it into prepare_pages instead. >>> > > But I'm still having some errors during long stress.sh runs.  Ideas are >>> > > more than welcome, hopefully some other timezones will kick in ideas >>> > > while I sleep. >>> > >>> > At least it doesn't fix the emerge-problem for me. The behavior is now the same >>> > as with 2.6.38-rc3. It needs a 'emerge --oneshot dev-libs/libgcrypt' with no >>> > further interaction to get the emerge-process hang with a svn-process >>> > consuming 100% CPU. I can cancel the emerge-process with ctrl-c but the >>> > spawned svn-process stays and it needs a reboot to get rid of it. >>> >>> Can you cat /proc/$pid/wchan a few times so we can get an idea of where it's >>> looping?  Thanks, >>> >>> Josef >> >> It behaves the same way here with btrfs-unstable. >> The output of "cat /proc/$pid/wchan" is 0. >> >> // Maria >> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html >>> >> >> >> > > I've applied the patch at the head of this thread (with the jiffies > debugging commented out) and I'm attaching a ftrace using the > function_graph tracer when I'm stuck in the loop.  I've just snipped > out a couple of the loops (the full trace file is quite large, and > mostly repititious). > > I'm going to try to modify file.c with some trace_printk debugging to > show the values of several of the relevant variables at various > stages. > > I'm going to try to exit the loop after 256 tries with an EFAULT so I > can stop the tracing at that point and capture a trace of the entry > into the problem (the ftrace ring buffer fills up too fast for me to > capture the entry point). > As promised, I'm put together a modified file.c with many trace_printk debugging statements to augment the ftrace. The trace is ~128K compressed (about 31,600 lines or 2.6MB uncompressed), so I'm putting it up on my local server instead of attaching. Let me know if it would be more appropriate to send to the list as an attachment. http://dontpanic.dyndns.org/ftrace-btrfs-file-write-debug-v2.gz I preface all my trace_printk comments with "TPK:" to make skipping through the trace easier. The trace contains the trace of about 3 or 4 successful passes through the btrfs_file_aio_write() function to show what a successful trace looks like. The pass through the btrfs_file_aio_write() that breaks begins on line 1088. I let it loop through the while (iov_iter_count(&i) > 0) {} loop for 256 times when copied==0 (otherwise it would loop infinitely). Then exit out and stop the trace. For reference, here's a diff file for the debugging statements I've added to file.c: Let me know if you would like me to re-run this trial with different debugging lines. fs/btrfs/file.c /usr/src/linux/fs/btrfs/file.c --- fs/btrfs/file.c 2011-02-28 10:13:45.000000000 -0600 +++ /usr/src/linux/fs/btrfs/file.c 2011-02-28 13:07:11.000000000 -0600 @@ -53,12 +53,14 @@ int offset = pos & (PAGE_CACHE_SIZE - 1); int total_copied = 0; + /*************************** if ((jiffies % 10) == 0) return 0; if ((jiffies % 25) == 0) { write_bytes /= 2; } + **************************/ while (write_bytes > 0) { size_t count = min_t(size_t, @@ -82,10 +84,13 @@ /* Return to btrfs_file_aio_write to fault page */ if (unlikely(copied == 0)) { + trace_printk("TPK: unlikely copied == 0 in btrfs_copy_from_user (total_copied=%i)\n", + total_copied); break; } if (unlikely(copied < PAGE_CACHE_SIZE - offset)) { + trace_printk("TPK: unlikely copied < PAGE_CACHE_SIZE - offset in btrfs_copy_from_user\n"); offset += copied;