From patchwork Mon Jan 8 01:36:11 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Changwei Ge X-Patchwork-Id: 10148547 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 14B6D60210 for ; Mon, 8 Jan 2018 01:38:52 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 07DDA28727 for ; Mon, 8 Jan 2018 01:38:52 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F0F4028874; Mon, 8 Jan 2018 01:38:51 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID autolearn=ham version=3.3.1 Received: from userp2130.oracle.com (userp2130.oracle.com [156.151.31.86]) (using TLSv1.2 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 0677228727 for ; Mon, 8 Jan 2018 01:38:49 +0000 (UTC) Received: from pps.filterd (userp2130.oracle.com [127.0.0.1]) by userp2130.oracle.com (8.16.0.22/8.16.0.22) with SMTP id w081ajCY000955; Mon, 8 Jan 2018 01:38:11 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=from : to : date : message-id : mime-version : cc : subject : list-id : list-unsubscribe : list-archive : list-post : list-help : list-subscribe : content-type : content-transfer-encoding : sender; s=corp-2017-10-26; bh=yxSL1Yw3Sa5kZiRfcMW0ZZ+kIUawdSaNWjJOWjf1LiM=; b=SnUbEweRr23SoF8k42g13AJdMSvgu7crZntkBkhoF2+1ImoNPRNCwWnLERJcmfOYr7xf kD7tH6cUT20vnSz75hM2fD+xK4Sezb6ZGuxEzEeh14Yzfv/xni+ptVnjHsEFzOJhBC6R /K97LkPp372gEh2Nilwf0A4JfG6doqP/cBITUCGEb9uij8QpY8qZiJId8XF5dsvSNNHH cnNGmdlPknrvW8CHQYJ4jk9JQk1nRDKYQaLVy88FahC7hFDUkGrvW5MqzdG3j/oweD4P wPd24qo8NREhZW1wtOy8xqu4NAjitnFFuBnNsy1hvRwhUMB02Dbjy+7MatXegMKoQ2lw Gg== Received: from aserv0022.oracle.com (aserv0022.oracle.com [141.146.126.234]) by userp2130.oracle.com with ESMTP id 2fbxr9019v-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 08 Jan 2018 01:38:10 +0000 Received: from oss.oracle.com (oss-old-reserved.oracle.com [137.254.22.2]) by aserv0022.oracle.com (8.14.4/8.14.4) with ESMTP id w081cAj2010465 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Mon, 8 Jan 2018 01:38:10 GMT Received: from localhost ([127.0.0.1] helo=lb-oss.oracle.com) by oss.oracle.com with esmtp (Exim 4.63) (envelope-from ) id 1eYMO9-000560-VA; Sun, 07 Jan 2018 17:38:09 -0800 Received: from aserv0021.oracle.com ([141.146.126.233]) by oss.oracle.com with esmtp (Exim 4.63) (envelope-from ) id 1eYMO8-00055n-2c for ocfs2-devel@oss.oracle.com; Sun, 07 Jan 2018 17:38:08 -0800 Received: from userp2030.oracle.com (userp2030.oracle.com [156.151.31.89]) by aserv0021.oracle.com (8.14.4/8.14.4) with ESMTP id w081c7HG026400 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=FAIL) for ; Mon, 8 Jan 2018 01:38:07 GMT Received: from pps.filterd (userp2030.oracle.com [127.0.0.1]) by userp2030.oracle.com (8.16.0.22/8.16.0.21) with SMTP id w081c70Z019067 for ; Mon, 8 Jan 2018 01:38:07 GMT Received: from h3cmg01-ex.h3c.com (smtp.h3c.com [60.191.123.56]) by userp2030.oracle.com with ESMTP id 2fbnvy5uwf-1 for ; Mon, 08 Jan 2018 01:38:06 +0000 Received: from BJHUB02-EX.srv.huawei-3com.com (unknown [10.63.20.170]) by h3cmg01-ex.h3c.com with smtp id 04a8_038e_47d242b0_9591_451c_9b5f_af7466c5c44b; Mon, 08 Jan 2018 09:36:25 +0800 Received: from H3CMLB12-EX.srv.huawei-3com.com ([fe80::10fe:abde:731b:fdde]) by BJHUB02-EX.srv.huawei-3com.com ([::1]) with mapi id 14.03.0248.002; Mon, 8 Jan 2018 09:36:13 +0800 From: Changwei Ge To: Junxiao Bi , Joseph Qi , "Mark Fasheh" , Joel Becker Thread-Topic: [PATCH 2/2 v3] ocfs2: try to reuse extent block in dealloc without Thread-Index: AdOIIRAi7U9C0VR3RvSzgVlFF+TC5g== Date: Mon, 8 Jan 2018 01:36:11 +0000 Message-ID: <63ADC13FD55D6546B7DECE290D39E373F290CEFE@H3CMLB12-EX.srv.huawei-3com.com> Accept-Language: en-US, zh-CN Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.125.136.231] MIME-Version: 1.0 X-CLX-Shades: MLX X-CLX-Response: 1TFkXGxIaEQpMehcaEQpZTRdnZnIRCllJFxpxGhAadwYbHBpxGRMQGncGGBo GGhEKWV4XaG55EQpJRhdFWEtJRk91WlhFTl9JXkNFRBl1T0sRCkNOF0ROcxpsGRxuY31JSWxGUn sfcF4aHWxIXAdPRh1Jem0HEQpYXBcfBBoEGxgYBxxLSEtPHhwaBRsaBBsaGgQeEgQfEBseGh8aE QpeWRd4Q2JrSxEKTVwXGB0aEQpMWhdoaUJNXREKTU4XaBEKTEYXY2sRCkNaFxwaBBsTGwQbGBkE HxwRCkJeFxsRCkReFx4RCkRJFxkRCkJGF29pBRpIaUhTZl5vEQpCXBcaEQpCRRduGVhMXmEBcFJ MYREKQk4XZEJ8WkVEQWIdZFARCkJMF29+XU0YBV1mGlJ7EQpCbBdkYU9LYEJIEngdZxEKQkAXYU VQRRNNHF5gfXARCkJYF2J9b3kBTxgZcHB7EQpNXhcbEQpwZxdtAWRaH15LcH1nfhAaEQpwaBdjY FsSHGFOT01cUBAaEQpwaBdicHxtZ2dSExNheBAaEQpwaBdvQGlcXUJDH3BkTBAaEQpwaBdoeh0S XktSYR8cYhAaEQpwaBdiXXxdWk1zXUNMQRAaEQpwbBdtThtvUwFHUkgdcxAZGhEKbX4XGxEKWE0 XSxEg X-PDR: PASS X-Source-IP: 60.191.123.56 X-ServerName: smtp.h3c.com X-Proofpoint-SPF-Result: pass X-Proofpoint-SPF-Record: v=spf1 ip4:60.191.123.56 ip4:60.191.123.50 ip4:221.12.31.13 ip4:221.12.31.56 X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=8767 signatures=668652 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 priorityscore=0 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=180 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1711220000 definitions=main-1801080019 X-Spam: Clean Cc: "ocfs2-devel@oss.oracle.com" Subject: [Ocfs2-devel] [PATCH 2/2 v3] ocfs2: try to reuse extent block in dealloc without X-BeenThere: ocfs2-devel@oss.oracle.com X-Mailman-Version: 2.1.9 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: ocfs2-devel-bounces@oss.oracle.com Errors-To: ocfs2-devel-bounces@oss.oracle.com X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=8767 signatures=668652 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1711220000 definitions=main-1801080019 X-Virus-Scanned: ClamAV using ClamSMTP A crash issue was reported by John. The call trace follows: ocfs2_split_extent+0x1ad3/0x1b40 [ocfs2] ocfs2_change_extent_flag+0x33a/0x470 [ocfs2] ocfs2_mark_extent_written+0x172/0x220 [ocfs2] ocfs2_dio_end_io+0x62d/0x910 [ocfs2] dio_complete+0x19a/0x1a0 do_blockdev_direct_IO+0x19dd/0x1eb0 __blockdev_direct_IO+0x43/0x50 ocfs2_direct_IO+0x8f/0xa0 [ocfs2] generic_file_direct_write+0xb2/0x170 __generic_file_write_iter+0xc3/0x1b0 ocfs2_file_write_iter+0x4bb/0xca0 [ocfs2] __vfs_write+0xae/0xf0 vfs_write+0xb8/0x1b0 SyS_write+0x4f/0xb0 system_call_fastpath+0x16/0x75 The BUG code told that extent tree wants to grow but no metadata was reserved ahead of time. From my investigation into this issue, the root cause it that although enough metadata is not reserved, there should be enough for following use. Rightmost extent is merged into its left one due to a certain times of marking extent written. Because during marking extent written, we got many physically continuous extents. At last, an empty extent showed up and the rightmost path is removed from extent tree. Add a new mechanism to reuse extent block cached in dealloc which were just unlinked from extent tree to solve this crash issue. Criteria is that during marking extents *written*, if extent rotation and merging results in unlinking extent with growing extent tree later without any metadata reserved ahead of time, try to reuse those extents in dealloc in which deleted extents are cached. Also, this patch addresses the issue John reported that ::dw_zero_count is not calculated properly. After applying this patch, the issue John reported was gone. Thanks for the reproducer provided by John. And this patch has passed ocfs2-test(29 cases) suite running by New H3C Group. Change from v2: 1)Separate improvement for how to estimate metadata into respective patch 2)Solve the issue reported by John in last version of this patch - try to reuse metadata in dealloc ahead of claiming new metadata. 3)Add a method ocfs2_find_preferred_free_list() to find space of local slot at higher priority. Reported-by: John Lightsey Signed-off-by: Changwei Ge --- fs/ocfs2/alloc.c | 208 ++++++++++++++++++++++++++++++++++++++++++++++++++++--- fs/ocfs2/alloc.h | 1 + fs/ocfs2/aops.c | 6 ++ 3 files changed, 205 insertions(+), 10 deletions(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index ab5105f..11c1b53 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -165,6 +165,13 @@ static int ocfs2_dinode_insert_check(struct ocfs2_extent_tree *et, struct ocfs2_extent_rec *rec); static int ocfs2_dinode_sanity_check(struct ocfs2_extent_tree *et); static void ocfs2_dinode_fill_root_el(struct ocfs2_extent_tree *et); + +static int ocfs2_reuse_blk_from_dealloc(handle_t *handle, + struct ocfs2_extent_tree *et, + struct buffer_head **new_eb_bh, + int blk_wanted, int *blk_given); +static int ocfs2_is_dealloc_empty(struct ocfs2_extent_tree *et); + static const struct ocfs2_extent_tree_operations ocfs2_dinode_et_ops = { .eo_set_last_eb_blk = ocfs2_dinode_set_last_eb_blk, .eo_get_last_eb_blk = ocfs2_dinode_get_last_eb_blk, @@ -448,6 +455,7 @@ static void __ocfs2_init_extent_tree(struct ocfs2_extent_tree *et, if (!obj) obj = (void *)bh->b_data; et->et_object = obj; + et->et_dealloc = NULL; et->et_ops->eo_fill_root_el(et); if (!et->et_ops->eo_fill_max_leaf_clusters) @@ -1158,7 +1166,7 @@ static int ocfs2_add_branch(handle_t *handle, struct buffer_head **last_eb_bh, struct ocfs2_alloc_context *meta_ac) { - int status, new_blocks, i; + int status, new_blocks, i, block_given = 0; u64 next_blkno, new_last_eb_blk; struct buffer_head *bh; struct buffer_head **new_eb_bhs = NULL; @@ -1213,11 +1221,31 @@ static int ocfs2_add_branch(handle_t *handle, goto bail; } - status = ocfs2_create_new_meta_bhs(handle, et, new_blocks, - meta_ac, new_eb_bhs); - if (status < 0) { - mlog_errno(status); - goto bail; + /* Firstyly, try to reuse dealloc since we have already estimated how + * many extent blocks we may use. + */ + if (!ocfs2_is_dealloc_empty(et)) { + status = ocfs2_reuse_blk_from_dealloc(handle, et, + new_eb_bhs, new_blocks, + &block_given); + if (status < 0) { + mlog_errno(status); + goto bail; + } + } + + BUG_ON(block_given > new_blocks); + + if (block_given < new_blocks) { + BUG_ON(!meta_ac); + status = ocfs2_create_new_meta_bhs(handle, et, + new_blocks - block_given, + meta_ac, + &new_eb_bhs[block_given]); + if (status < 0) { + mlog_errno(status); + goto bail; + } } /* Note: new_eb_bhs[new_blocks - 1] is the guy which will be @@ -1340,15 +1368,25 @@ static int ocfs2_shift_tree_depth(handle_t *handle, struct ocfs2_alloc_context *meta_ac, struct buffer_head **ret_new_eb_bh) { - int status, i; + int status, i, block_given = 0; u32 new_clusters; struct buffer_head *new_eb_bh = NULL; struct ocfs2_extent_block *eb; struct ocfs2_extent_list *root_el; struct ocfs2_extent_list *eb_el; - status = ocfs2_create_new_meta_bhs(handle, et, 1, meta_ac, - &new_eb_bh); + if (!ocfs2_is_dealloc_empty(et)) { + status = ocfs2_reuse_blk_from_dealloc(handle, et, + &new_eb_bh, 1, + &block_given); + } else if (meta_ac) { + status = ocfs2_create_new_meta_bhs(handle, et, 1, meta_ac, + &new_eb_bh); + + } else { + BUG(); + } + if (status < 0) { mlog_errno(status); goto bail; @@ -1511,7 +1549,7 @@ static int ocfs2_grow_tree(handle_t *handle, struct ocfs2_extent_tree *et, int depth = le16_to_cpu(el->l_tree_depth); struct buffer_head *bh = NULL; - BUG_ON(meta_ac == NULL); + BUG_ON(meta_ac == NULL && ocfs2_is_dealloc_empty(et)); shift = ocfs2_find_branch_target(et, &bh); if (shift < 0) { @@ -6585,6 +6623,156 @@ ocfs2_find_per_slot_free_list(int type, return fl; } +static struct ocfs2_per_slot_free_list * +ocfs2_find_preferred_free_list(int type, + int preferred_slot, + int *real_slot, + struct ocfs2_cached_dealloc_ctxt *ctxt) +{ + struct ocfs2_per_slot_free_list *fl = ctxt->c_first_suballocator; + + while (fl) { + if (fl->f_inode_type == type && fl->f_slot == preferred_slot) { + *real_slot = fl->f_slot; + return fl; + } + + fl = fl->f_next_suballocator; + } + + /* If we can't find any free list matching preferred slot, just use + * the first one. + */ + fl = ctxt->c_first_suballocator; + *real_slot = fl->f_slot; + + return fl; +} + +/* Return Value 1 indicates empty */ +static int ocfs2_is_dealloc_empty(struct ocfs2_extent_tree *et) +{ + struct ocfs2_per_slot_free_list *fl = NULL; + + if (!et->et_dealloc) + return 1; + + fl = et->et_dealloc->c_first_suballocator; + if (!fl) + return 1; + + if (!fl->f_first) + return 1; + + return 0; +} + +/* If extent was deleted from tree due to extent rotation and merging, and + * no metadata is reserved ahead of time. Try to reuse some extents + * just deleted. This is only used to reuse extent blocks. + * It is supposed to find enough extent blocks in dealloc if our estimation + * on metadata is accurate. + */ +static int ocfs2_reuse_blk_from_dealloc(handle_t *handle, + struct ocfs2_extent_tree *et, + struct buffer_head **new_eb_bh, + int blk_wanted, int *blk_given) +{ + int i, status = 0, real_slot; + struct ocfs2_cached_dealloc_ctxt *dealloc; + struct ocfs2_per_slot_free_list *fl; + struct ocfs2_cached_block_free *bf; + struct ocfs2_extent_block *eb; + struct ocfs2_super *osb = + OCFS2_SB(ocfs2_metadata_cache_get_super(et->et_ci)); + + *blk_given = 0; + + /* If extent tree doesn't have a dealloc, this is not faulty. Just + * tell upper caller dealloc can't provide any block and it should + * ask for alloc to claim more space. + */ + dealloc = et->et_dealloc; + if (!dealloc) + goto bail; + + for (i = 0; i < blk_wanted; i++) { + /* Prefer to use local slot */ + fl = ocfs2_find_preferred_free_list(EXTENT_ALLOC_SYSTEM_INODE, + osb->slot_num, &real_slot, + dealloc); + /* If no more block can be reused, we should claim more + * from alloc. Just return here normally. + */ + if (!fl) { + status = 0; + break; + } + + bf = fl->f_first; + fl->f_first = bf->free_next; + + new_eb_bh[i] = sb_getblk(osb->sb, bf->free_blk); + if (new_eb_bh[i] == NULL) { + status = -ENOMEM; + mlog_errno(status); + goto bail; + } + + mlog(0, "Reusing block(%llu) from " + "dealloc(local slot:%d, real slot:%d)\n", + bf->free_blk, osb->slot_num, real_slot); + + ocfs2_set_new_buffer_uptodate(et->et_ci, new_eb_bh[i]); + + status = ocfs2_journal_access_eb(handle, et->et_ci, + new_eb_bh[i], + OCFS2_JOURNAL_ACCESS_CREATE); + if (status < 0) { + mlog_errno(status); + goto bail; + } + + memset(new_eb_bh[i]->b_data, 0, osb->sb->s_blocksize); + eb = (struct ocfs2_extent_block *) new_eb_bh[i]->b_data; + + /* We can't guarantee that buffer head is still cached, so + * polutlate the extent block again. + */ + strcpy(eb->h_signature, OCFS2_EXTENT_BLOCK_SIGNATURE); + eb->h_blkno = cpu_to_le64(bf->free_blk); + eb->h_fs_generation = cpu_to_le32(osb->fs_generation); + eb->h_suballoc_slot = cpu_to_le16(real_slot); + eb->h_suballoc_loc = cpu_to_le64(bf->free_bg); + eb->h_suballoc_bit = cpu_to_le16(bf->free_bit); + eb->h_list.l_count = + cpu_to_le16(ocfs2_extent_recs_per_eb(osb->sb)); + + /* We'll also be dirtied by the caller, so + * this isn't absolutely necessary. + */ + ocfs2_journal_dirty(handle, new_eb_bh[i]); + + if (!fl->f_first) { + dealloc->c_first_suballocator = fl->f_next_suballocator; + kfree(fl); + } + kfree(bf); + } + + *blk_given = i; + +bail: + if (unlikely(status)) { + for (; i <= 0; i--) { + if (new_eb_bh[i]) + brelse(new_eb_bh[i]); + } + } + + return status; +} + int ocfs2_cache_block_dealloc(struct ocfs2_cached_dealloc_ctxt *ctxt, int type, int slot, u64 suballoc, u64 blkno, unsigned int bit) diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h index 27b75cf..250bcac 100644 --- a/fs/ocfs2/alloc.h +++ b/fs/ocfs2/alloc.h @@ -61,6 +61,7 @@ struct ocfs2_extent_tree { ocfs2_journal_access_func et_root_journal_access; void *et_object; unsigned int et_max_leaf_clusters; + struct ocfs2_cached_dealloc_ctxt *et_dealloc; }; /* diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 256986a..e8e205b 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -2332,6 +2332,12 @@ static int ocfs2_dio_end_io_write(struct inode *inode, ocfs2_init_dinode_extent_tree(&et, INODE_CACHE(inode), di_bh); + /* Attach dealloc with extent tree in case that we may reuse extents + * which are already unlinked from current extent tree due to extent + * rotation and merging. + */ + et.et_dealloc = &dealloc; + ret = ocfs2_lock_allocators(inode, &et, 0, dwc->dw_zero_count*2, &data_ac, &meta_ac); if (ret) {