From patchwork Tue Jan 3 23:36:29 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Liu Bo X-Patchwork-Id: 9495923 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id E8E3060413 for ; Tue, 3 Jan 2017 23:37:17 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DAD09267EC for ; Tue, 3 Jan 2017 23:37:17 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id CF71127D13; Tue, 3 Jan 2017 23:37:17 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00, RCVD_IN_DNSWL_HI, UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CD4C927AB2 for ; Tue, 3 Jan 2017 23:37:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935005AbdACXhF (ORCPT ); Tue, 3 Jan 2017 18:37:05 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:23516 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1762065AbdACXhD (ORCPT ); Tue, 3 Jan 2017 18:37:03 -0500 Received: from userv0022.oracle.com (userv0022.oracle.com [156.151.31.74]) by userp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2) with ESMTP id v03NaXav029955 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 3 Jan 2017 23:36:33 GMT Received: from aserv0122.oracle.com (aserv0122.oracle.com [141.146.126.236]) by userv0022.oracle.com (8.14.4/8.14.4) with ESMTP id v03NaWn8027276 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 3 Jan 2017 23:36:33 GMT Received: from abhmp0018.oracle.com (abhmp0018.oracle.com [141.146.116.24]) by aserv0122.oracle.com (8.14.4/8.14.4) with ESMTP id v03NaUAG023446; Tue, 3 Jan 2017 23:36:30 GMT Received: from localhost.localdomain (/10.159.161.182) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Tue, 03 Jan 2017 15:36:30 -0800 Date: Tue, 3 Jan 2017 15:36:29 -0800 From: Liu Bo To: linux-btrfs@vger.kernel.org Cc: clm@fb.com, jbacik@fb.com, dsterba@suse.com, holger@applied-asynchrony.com, s.priebe@profihost.ag, quwenruo@cn.fujitsu.com Subject: Re: [PATCH 1/3] btrfs: improve inode's outstanding_extents computation Message-ID: <20170103233629.GD4651@localhost.localdomain> Reply-To: bo.li.liu@oracle.com References: <1478853587-28717-1-git-send-email-wangxg.fnst@cn.fujitsu.com> <1478853587-28717-2-git-send-email-wangxg.fnst@cn.fujitsu.com> <20170103210045.GA4651@localhost.localdomain> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20170103210045.GA4651@localhost.localdomain> User-Agent: Mutt/1.6.1 (2016-04-27) X-Source-IP: userv0022.oracle.com [156.151.31.74] Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP (Resend this reply due to a message that there is an invalid email address.) On Tue, Jan 03, 2017 at 01:00:45PM -0800, Liu Bo wrote: > On Fri, Nov 11, 2016 at 04:39:45PM +0800, Wang Xiaoguang wrote: > > This issue was revealed by modifying BTRFS_MAX_EXTENT_SIZE(128MB) to 64KB, > > When modifying BTRFS_MAX_EXTENT_SIZE(128MB) to 64KB, fsstress test often > > gets these warnings from btrfs_destroy_inode(): > > WARN_ON(BTRFS_I(inode)->outstanding_extents); > > WARN_ON(BTRFS_I(inode)->reserved_extents); > > > > Simple test program below can reproduce this issue steadily. > > Note: you need to modify BTRFS_MAX_EXTENT_SIZE to 64KB to have test, > > otherwise there won't be such WARNING. > > #include > > #include > > #include > > #include > > #include > > > > int main(void) > > { > > int fd; > > char buf[68 *1024]; > > > > memset(buf, 0, 68 * 1024); > > fd = open("testfile", O_CREAT | O_EXCL | O_RDWR); pwrite(fd, buf, 64 * 1024, 64 * 1024); (During my test, I had to add the above in order to reproduce the warning, another alternative way is to use truncate and pread.) > > pwrite(fd, buf, 68 * 1024, 64 * 1024); > > return; > > } > > > > When BTRFS_MAX_EXTENT_SIZE is 64KB, and buffered data range is: > > 64KB 128K 132KB > > |-----------------------------------------------|---------------| > > 64 + 4KB > > > > 1) for above data range, btrfs_delalloc_reserve_metadata() will reserve > > metadata and set BTRFS_I(inode)->outstanding_extents to 2. > > (68KB + 64KB - 1) / 64KB == 2 > > > > Outstanding_extents: 2 > > > > 2) then btrfs_dirty_page() will be called to dirty pages and set > > EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called > > twice. > > The 1st set_bit_hook() call will set DEALLOC flag for the first 64K. > > 64KB 128KB > > |-----------------------------------------------| > > 64KB DELALLOC > > Outstanding_extents: 2 > I think that here the 1st extent [64k, 128k] is only set with EXTENT_UPTODATE since other bits like DELALLOC has been cleared by lock_and_cleanup_extent_if_need. It would work if we check EXTENT_UPTODATE on deciding whether to clear EXTENT_FIRST_DELALLOC, i.e. Thanks, -liubo > > > > > Set_bit_hooks() uses FIRST_DELALLOC flag to avoid re-increase > > outstanding_extents counter. > > So for 1st set_bit_hooks() call, it won't modify outstanding_extents, > > it's still 2. > > > > Then FIRST_DELALLOC flag is *CLEARED*. > > > > 3) 2nd btrfs_set_bit_hook() call. > > Because FIRST_DELALLOC have been cleared by previous set_bit_hook(), > > btrfs_set_bit_hook() will increase BTRFS_I(inode)->outstanding_extents by > > one, so now BTRFS_I(inode)->outstanding_extents is 3. > > 64KB 128KB 132KB > > |-----------------------------------------------|----------------| > > 64K DELALLOC 4K DELALLOC > > Outstanding_extents: 3 > > > > But the correct outstanding_extents number should be 2, not 3. > > The 2nd btrfs_set_bit_hook() call just screwed up this, and leads to the > > WARN_ON(). > > > > Normally, we can solve it by only increasing outstanding_extents in > > set_bit_hook(). > > But the problem is for delalloc_reserve/release_metadata(), we only have > > a 'length' parameter, and calculate in-accurate outstanding_extents. > > If we only rely on set_bit_hook() release_metadata() will crew things up > > as it will decrease inaccurate number. > > > > So the fix we use is: > > 1) Increase *INACCURATE* outstanding_extents at delalloc_reserve_meta > > Just as a place holder. > > 2) Increase *accurate* outstanding_extents at set_bit_hooks() > > This is the real increaser. > > 3) Decrease *INACCURATE* outstanding_extents before returning > > This makes outstanding_extents to correct value. > > > > For 128M BTRFS_MAX_EXTENT_SIZE, due to limitation of > > __btrfs_buffered_write(), each iteration will only handle about 2MB > > data. > > So btrfs_dirty_pages() won't need to handle cases cross 2 extents. > > > > Signed-off-by: Wang Xiaoguang > > --- > > fs/btrfs/ctree.h | 2 ++ > > fs/btrfs/inode.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++------ > > fs/btrfs/ioctl.c | 6 ++---- > > 3 files changed, 62 insertions(+), 11 deletions(-) > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > index 9d8edcb..766d152 100644 > > --- a/fs/btrfs/ctree.h > > +++ b/fs/btrfs/ctree.h > > @@ -3139,6 +3139,8 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput, > > int nr); > > int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, > > struct extent_state **cached_state, int dedupe); > > +int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end, > > + struct extent_state **cached_state); > > int btrfs_create_subvol_root(struct btrfs_trans_handle *trans, > > struct btrfs_root *new_root, > > struct btrfs_root *parent_root, > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index 1f980ef..25e0083 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -1601,6 +1601,9 @@ static void btrfs_split_extent_hook(struct inode *inode, > > if (!(orig->state & EXTENT_DELALLOC)) > > return; > > > > + if (btrfs_is_free_space_inode(inode)) > > + return; > > + > > size = orig->end - orig->start + 1; > > if (size > BTRFS_MAX_EXTENT_SIZE) { > > u64 num_extents; > > @@ -1643,6 +1646,9 @@ static void btrfs_merge_extent_hook(struct inode *inode, > > if (!(other->state & EXTENT_DELALLOC)) > > return; > > > > + if (btrfs_is_free_space_inode(inode)) > > + return; > > + > > if (new->start > other->start) > > new_size = new->end - other->start + 1; > > else > > @@ -1738,7 +1744,6 @@ static void btrfs_del_delalloc_inode(struct btrfs_root *root, > > static void btrfs_set_bit_hook(struct inode *inode, > > struct extent_state *state, unsigned *bits) > > { > > - > > if ((*bits & EXTENT_DEFRAG) && !(*bits & EXTENT_DELALLOC)) > > WARN_ON(1); > > /* > > @@ -1749,13 +1754,16 @@ static void btrfs_set_bit_hook(struct inode *inode, > > if (!(state->state & EXTENT_DELALLOC) && (*bits & EXTENT_DELALLOC)) { > > struct btrfs_root *root = BTRFS_I(inode)->root; > > u64 len = state->end + 1 - state->start; > > + u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE - 1, > > + BTRFS_MAX_EXTENT_SIZE); > > bool do_list = !btrfs_is_free_space_inode(inode); > > > > - if (*bits & EXTENT_FIRST_DELALLOC) { > > + if (*bits & EXTENT_FIRST_DELALLOC) > > *bits &= ~EXTENT_FIRST_DELALLOC; > > - } else { > > + > > + if (do_list) { > > spin_lock(&BTRFS_I(inode)->lock); > > - BTRFS_I(inode)->outstanding_extents++; > > + BTRFS_I(inode)->outstanding_extents += num_extents; > > spin_unlock(&BTRFS_I(inode)->lock); > > } > > > > @@ -1803,7 +1811,7 @@ static void btrfs_clear_bit_hook(struct inode *inode, > > > > if (*bits & EXTENT_FIRST_DELALLOC) { > > *bits &= ~EXTENT_FIRST_DELALLOC; > > - } else if (!(*bits & EXTENT_DO_ACCOUNTING)) { > > + } else if (!(*bits & EXTENT_DO_ACCOUNTING) && do_list) { > > spin_lock(&BTRFS_I(inode)->lock); > > BTRFS_I(inode)->outstanding_extents -= num_extents; > > spin_unlock(&BTRFS_I(inode)->lock); > > @@ -2001,9 +2009,52 @@ static noinline int add_pending_csums(struct btrfs_trans_handle *trans, > > int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, > > struct extent_state **cached_state, int dedupe) > > { > > + int ret; > > + u64 num_extents = div64_u64(end - start + BTRFS_MAX_EXTENT_SIZE, > > + BTRFS_MAX_EXTENT_SIZE); > > + > > + WARN_ON((end & (PAGE_SIZE - 1)) == 0); > > + ret = set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end, > > + cached_state); > > + > > + /* > > + * btrfs_delalloc_reserve_metadata() will first add number of > > + * outstanding extents according to data length, which is inaccurate > > + * for case like dirtying already dirty pages. > > + * so here we will decrease such inaccurate numbers, to make > > + * outstanding_extents only rely on the correct values added by > > + * set_bit_hook() > > + * > > + * Also, we skipped the metadata space reserve for space cache inodes, > > + * so don't modify the outstanding_extents value. > > + */ > > + if (ret == 0 && !btrfs_is_free_space_inode(inode)) { > > + spin_lock(&BTRFS_I(inode)->lock); > > + BTRFS_I(inode)->outstanding_extents -= num_extents; > > + spin_unlock(&BTRFS_I(inode)->lock); > > + } > > + > > + return ret; > > +} > > + > > +int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end, > > + struct extent_state **cached_state) > > +{ > > + int ret; > > + u64 num_extents = div64_u64(end - start + BTRFS_MAX_EXTENT_SIZE, > > + BTRFS_MAX_EXTENT_SIZE); > > + > > WARN_ON((end & (PAGE_SIZE - 1)) == 0); > > - return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end, > > - cached_state); > > + ret = set_extent_defrag(&BTRFS_I(inode)->io_tree, start, end, > > + cached_state); > > + > > + if (ret == 0 && !btrfs_is_free_space_inode(inode)) { > > + spin_lock(&BTRFS_I(inode)->lock); > > + BTRFS_I(inode)->outstanding_extents -= num_extents; > > + spin_unlock(&BTRFS_I(inode)->lock); > > + } > > + > > + return ret; > > } > > > > /* see btrfs_writepage_start_hook for details on why this is required */ > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > > index 7163457..eff5eae 100644 > > --- a/fs/btrfs/ioctl.c > > +++ b/fs/btrfs/ioctl.c > > @@ -1235,10 +1235,8 @@ again: > > (page_cnt - i_done) << PAGE_SHIFT); > > } > > > > - > > - set_extent_defrag(&BTRFS_I(inode)->io_tree, page_start, page_end - 1, > > - &cached_state); > > - > > + btrfs_set_extent_defrag(inode, page_start, > > + page_end - 1, &cached_state); > > unlock_extent_cached(&BTRFS_I(inode)->io_tree, > > page_start, page_end - 1, &cached_state, > > GFP_NOFS); > > -- > > 2.7.4 > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html --- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4175987..4eace1a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1752,7 +1752,12 @@ static void btrfs_set_bit_hook(struct inode *inode, bool do_list = !btrfs_is_free_space_inode(inode); if (*bits & EXTENT_FIRST_DELALLOC) { - *bits &= ~EXTENT_FIRST_DELALLOC; + /* + * With EXTENT_UPTODATE, the state gets rewritten + * before writing it back, or gets read from disk. + */ + if (!(state->state & EXTENT_UPTODATE)) + *bits &= ~EXTENT_FIRST_DELALLOC; } else { spin_lock(&BTRFS_I(inode)->lock); BTRFS_I(inode)->outstanding_extents++;