From patchwork Sun May 10 07:08:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ritesh Harjani X-Patchwork-Id: 11538679 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 71C50159A for ; Sun, 10 May 2020 07:09:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 63D3B208DB for ; Sun, 10 May 2020 07:09:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728564AbgEJHI5 (ORCPT ); Sun, 10 May 2020 03:08:57 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:18520 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728332AbgEJHI5 (ORCPT ); Sun, 10 May 2020 03:08:57 -0400 Received: from pps.filterd (m0098421.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 04A6cg9i023089; Sun, 10 May 2020 03:08:48 -0400 Received: from ppma04fra.de.ibm.com (6a.4a.5195.ip4.static.sl-reverse.com [149.81.74.106]) by mx0a-001b2d01.pphosted.com with ESMTP id 30wsc248s5-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 10 May 2020 03:08:48 -0400 Received: from pps.filterd (ppma04fra.de.ibm.com [127.0.0.1]) by ppma04fra.de.ibm.com (8.16.0.27/8.16.0.27) with SMTP id 04A75G9x027632; Sun, 10 May 2020 07:08:46 GMT Received: from b06avi18878370.portsmouth.uk.ibm.com (b06avi18878370.portsmouth.uk.ibm.com [9.149.26.194]) by ppma04fra.de.ibm.com with ESMTP id 30wm55gtvd-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 10 May 2020 07:08:46 +0000 Received: from d06av24.portsmouth.uk.ibm.com (d06av24.portsmouth.uk.ibm.com [9.149.105.60]) by b06avi18878370.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 04A78h8D61604298 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sun, 10 May 2020 07:08:43 GMT Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 747BF42045; Sun, 10 May 2020 07:08:43 +0000 (GMT) Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 73DBF42041; Sun, 10 May 2020 07:08:41 +0000 (GMT) Received: from localhost.localdomain.com (unknown [9.199.61.127]) by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP; Sun, 10 May 2020 07:08:41 +0000 (GMT) From: Ritesh Harjani To: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, Jan Kara , tytso@mit.edu, "Aneesh Kumar K . V" , linux-kernel@vger.kernel.org, Ritesh Harjani Subject: [RFCv4 2/6] ext4: mballoc: Use ext4_lock_group() around calculations involving bb_free Date: Sun, 10 May 2020 12:38:22 +0530 Message-Id: <015bd57d5ff409cef6d452dffd4c140848ea5021.1589086820.git.riteshh@linux.ibm.com> X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.676 definitions=2020-05-10_02:2020-05-08,2020-05-10 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 phishscore=0 mlxlogscore=999 impostorscore=0 adultscore=0 mlxscore=0 malwarescore=0 priorityscore=1501 spamscore=0 bulkscore=0 suspectscore=3 lowpriorityscore=0 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2003020000 definitions=main-2005100060 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Currently while doing block allocation grp->bb_free may be getting modified if discard is happening in parallel. For e.g. consider a case where there are lot of threads who have preallocated lot of blocks and there is a thread which is trying to discard all of this group's PA. Now it could happen that we see all of those group's bb_free is zero and fail the allocation while there is sufficient space if we free up all the PA. So this patch takes the ext4_lock_group() around calculations involving grp->bb_free in ext4_mb_good_group() & ext4_mb_good_group_nolock() Signed-off-by: Ritesh Harjani --- fs/ext4/mballoc.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index da11a4a738bd..dcd05ff7c012 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2138,9 +2138,11 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac, ext4_group_t group, int cr) { struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group); + struct super_block *sb = ac->ac_sb; ext4_grpblk_t free; int ret = 0; + ext4_lock_group(sb, group); free = grp->bb_free; if (free == 0) goto out; @@ -2148,6 +2150,7 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac, goto out; if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp))) goto out; + ext4_unlock_group(sb, group); /* We only do this if the grp has never been initialized */ if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { @@ -2156,8 +2159,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac, return ret; } + ext4_lock_group(sb, group); ret = ext4_mb_good_group(ac, group, cr); out: + ext4_unlock_group(sb, group); return ret; } From patchwork Sun May 10 07:08:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ritesh Harjani X-Patchwork-Id: 11538687 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0FB7014B4 for ; Sun, 10 May 2020 07:09:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 022D32137B for ; Sun, 10 May 2020 07:09:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728385AbgEJHI4 (ORCPT ); Sun, 10 May 2020 03:08:56 -0400 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:17806 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725994AbgEJHI4 (ORCPT ); Sun, 10 May 2020 03:08:56 -0400 Received: from pps.filterd (m0098409.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 04A72j5j040838; Sun, 10 May 2020 03:08:51 -0400 Received: from ppma05fra.de.ibm.com (6c.4a.5195.ip4.static.sl-reverse.com [149.81.74.108]) by mx0a-001b2d01.pphosted.com with ESMTP id 30ws5cvmkw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 10 May 2020 03:08:51 -0400 Received: from pps.filterd (ppma05fra.de.ibm.com [127.0.0.1]) by ppma05fra.de.ibm.com (8.16.0.27/8.16.0.27) with SMTP id 04A75xxK011156; Sun, 10 May 2020 07:08:48 GMT Received: from b06cxnps3075.portsmouth.uk.ibm.com (d06relay10.portsmouth.uk.ibm.com [9.149.109.195]) by ppma05fra.de.ibm.com with ESMTP id 30wm55gtqx-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 10 May 2020 07:08:48 +0000 Received: from d06av24.portsmouth.uk.ibm.com (d06av24.portsmouth.uk.ibm.com [9.149.105.60]) by b06cxnps3075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 04A78jXR46727178 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sun, 10 May 2020 07:08:46 GMT Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C6F6142052; Sun, 10 May 2020 07:08:45 +0000 (GMT) Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id D629542041; Sun, 10 May 2020 07:08:43 +0000 (GMT) Received: from localhost.localdomain.com (unknown [9.199.61.127]) by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP; Sun, 10 May 2020 07:08:43 +0000 (GMT) From: Ritesh Harjani To: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, Jan Kara , tytso@mit.edu, "Aneesh Kumar K . V" , linux-kernel@vger.kernel.org, Ritesh Harjani Subject: [RFCv4 3/6] ext4: mballoc: Optimize ext4_mb_good_group_nolock further if grp needs init Date: Sun, 10 May 2020 12:38:23 +0530 Message-Id: X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.676 definitions=2020-05-10_02:2020-05-08,2020-05-10 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1011 suspectscore=3 phishscore=0 spamscore=0 impostorscore=0 malwarescore=0 mlxlogscore=883 lowpriorityscore=0 bulkscore=0 adultscore=0 priorityscore=1501 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2003020000 definitions=main-2005100065 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Currently in case if EXT4_MB_GRP_NEED_INIT(grp) is true, then we first check for few things like grp->bb_free etc with spinlock (ext4_lock_group) held. Then we drop the lock only to initialize the group's buddy cache and then again take the lock and check for ext4_mb_good_group(). Once this step is done we return to ext4_mb_regular_allocator(), load the buddy and then again take the lock only to check ext4_mb_good_group(), which was anyways done in previous step. I believe we can optimize one step here where if the group needs init we can check for only few things and return early. Then recheck for ext4_mb_good_group() only once after loading the buddy cache. Signed-off-by: Ritesh Harjani --- fs/ext4/mballoc.c | 44 +++++++++++++++++++++----------------------- 1 file changed, 21 insertions(+), 23 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index dcd05ff7c012..7d766dc34cdd 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2134,33 +2134,34 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac, * during ext4_mb_init_group(). This should not be called with * ext4_lock_group() held. */ -static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac, +static bool ext4_mb_good_group_nolock(struct ext4_allocation_context *ac, ext4_group_t group, int cr) { struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group); struct super_block *sb = ac->ac_sb; ext4_grpblk_t free; - int ret = 0; + bool ret = false, need_init = EXT4_MB_GRP_NEED_INIT(grp); ext4_lock_group(sb, group); - free = grp->bb_free; - if (free == 0) - goto out; - if (cr <= 2 && free < ac->ac_g_ex.fe_len) - goto out; - if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp))) - goto out; - ext4_unlock_group(sb, group); - - /* We only do this if the grp has never been initialized */ - if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) { - ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS); - if (ret) - return ret; + /* + * If the group needs init then no need to call ext4_mb_init_group() + * after dropping the lock. It's better we check bb_free/other things + * here and if it meets the criteria than return true. Later we + * will anyway check for good group after loading the buddy cache + * which, if required will call ext4_mb_init_group() from within. + */ + if (unlikely(need_init)) { + free = grp->bb_free; + if (free == 0) + goto out; + if (cr <= 2 && free < ac->ac_g_ex.fe_len) + goto out; + if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp))) + goto out; + ret = true; + } else { + ret = ext4_mb_good_group(ac, group, cr); } - - ext4_lock_group(sb, group); - ret = ext4_mb_good_group(ac, group, cr); out: ext4_unlock_group(sb, group); return ret; @@ -2252,11 +2253,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) /* This now checks without needing the buddy page */ ret = ext4_mb_good_group_nolock(ac, group, cr); - if (ret <= 0) { - if (!first_err) - first_err = ret; + if (ret == 0) continue; - } err = ext4_mb_load_buddy(sb, group, &e4b); if (err) From patchwork Sun May 10 07:08:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ritesh Harjani X-Patchwork-Id: 11538685 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8BB5714B4 for ; Sun, 10 May 2020 07:09:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7D192208DB for ; Sun, 10 May 2020 07:09:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728729AbgEJHJE (ORCPT ); Sun, 10 May 2020 03:09:04 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:57318 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728332AbgEJHJA (ORCPT ); Sun, 10 May 2020 03:09:00 -0400 Received: from pps.filterd (m0098419.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 04A73CE5116933; Sun, 10 May 2020 03:08:55 -0400 Received: from ppma03ams.nl.ibm.com (62.31.33a9.ip4.static.sl-reverse.com [169.51.49.98]) by mx0b-001b2d01.pphosted.com with ESMTP id 30ws9xcehs-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 10 May 2020 03:08:54 -0400 Received: from pps.filterd (ppma03ams.nl.ibm.com [127.0.0.1]) by ppma03ams.nl.ibm.com (8.16.0.27/8.16.0.27) with SMTP id 04A75SgU026141; Sun, 10 May 2020 07:08:52 GMT Received: from b06cxnps4076.portsmouth.uk.ibm.com (d06relay13.portsmouth.uk.ibm.com [9.149.109.198]) by ppma03ams.nl.ibm.com with ESMTP id 30wm55j2cw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 10 May 2020 07:08:52 +0000 Received: from d06av24.portsmouth.uk.ibm.com (d06av24.portsmouth.uk.ibm.com [9.149.105.60]) by b06cxnps4076.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 04A78nXw47775772 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sun, 10 May 2020 07:08:50 GMT Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id DBBB842049; Sun, 10 May 2020 07:08:49 +0000 (GMT) Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 3EB5C42041; Sun, 10 May 2020 07:08:48 +0000 (GMT) Received: from localhost.localdomain.com (unknown [9.199.61.127]) by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP; Sun, 10 May 2020 07:08:48 +0000 (GMT) From: Ritesh Harjani To: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, Jan Kara , tytso@mit.edu, "Aneesh Kumar K . V" , linux-kernel@vger.kernel.org, Ritesh Harjani Subject: [RFCv4 5/6] ext4: mballoc: Refactor ext4_mb_discard_preallocations() Date: Sun, 10 May 2020 12:38:25 +0530 Message-Id: X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.676 definitions=2020-05-10_02:2020-05-08,2020-05-10 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 suspectscore=1 bulkscore=0 malwarescore=0 clxscore=1015 priorityscore=1501 adultscore=0 impostorscore=0 phishscore=0 mlxlogscore=999 mlxscore=0 spamscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2003020000 definitions=main-2005100060 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Implement ext4_mb_discard_preallocations_should_retry() which we will need in later patches to add more logic like check for sequence number match to see if we should retry for block allocation or not. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani --- fs/ext4/mballoc.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 574ce400a3b5..c18670924bbe 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4570,6 +4570,17 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed) return freed; } +static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb, + struct ext4_allocation_context *ac) +{ + int freed; + + freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len); + if (freed) + return true; + return false; +} + /* * Main entry point into mballoc to allocate blocks * it tries to use preallocation first, then falls back @@ -4578,7 +4589,6 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed) ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, struct ext4_allocation_request *ar, int *errp) { - int freed; struct ext4_allocation_context *ac = NULL; struct ext4_sb_info *sbi; struct super_block *sb; @@ -4683,8 +4693,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, ar->len = ac->ac_b_ex.fe_len; } } else { - freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len); - if (freed) + if (ext4_mb_discard_preallocations_should_retry(sb, ac)) goto repeat; /* * If block allocation fails then the pa allocated above From patchwork Sun May 10 07:08:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ritesh Harjani X-Patchwork-Id: 11538683 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 33B5214B4 for ; Sun, 10 May 2020 07:09:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1FBAC2137B for ; Sun, 10 May 2020 07:09:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728756AbgEJHJF (ORCPT ); Sun, 10 May 2020 03:09:05 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:43156 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1728575AbgEJHJD (ORCPT ); Sun, 10 May 2020 03:09:03 -0400 Received: from pps.filterd (m0098420.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 04A73mht072448; Sun, 10 May 2020 03:08:57 -0400 Received: from ppma01fra.de.ibm.com (46.49.7a9f.ip4.static.sl-reverse.com [159.122.73.70]) by mx0b-001b2d01.pphosted.com with ESMTP id 30xa2g2uf1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 10 May 2020 03:08:56 -0400 Received: from pps.filterd (ppma01fra.de.ibm.com [127.0.0.1]) by ppma01fra.de.ibm.com (8.16.0.27/8.16.0.27) with SMTP id 04A75lND019199; Sun, 10 May 2020 07:08:55 GMT Received: from b06cxnps3075.portsmouth.uk.ibm.com (d06relay10.portsmouth.uk.ibm.com [9.149.109.195]) by ppma01fra.de.ibm.com with ESMTP id 30wm558u3x-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Sun, 10 May 2020 07:08:54 +0000 Received: from d06av24.portsmouth.uk.ibm.com (d06av24.portsmouth.uk.ibm.com [9.149.105.60]) by b06cxnps3075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 04A78qKl34209958 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sun, 10 May 2020 07:08:52 GMT Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 10B4F4204C; Sun, 10 May 2020 07:08:52 +0000 (GMT) Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 3ED2942041; Sun, 10 May 2020 07:08:50 +0000 (GMT) Received: from localhost.localdomain.com (unknown [9.199.61.127]) by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP; Sun, 10 May 2020 07:08:50 +0000 (GMT) From: Ritesh Harjani To: linux-ext4@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org, Jan Kara , tytso@mit.edu, "Aneesh Kumar K . V" , linux-kernel@vger.kernel.org, Ritesh Harjani Subject: [RFCv4 6/6] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling Date: Sun, 10 May 2020 12:38:26 +0530 Message-Id: X-Mailer: git-send-email 2.21.0 In-Reply-To: References: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.676 definitions=2020-05-10_02:2020-05-08,2020-05-10 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 impostorscore=0 spamscore=0 phishscore=0 priorityscore=1501 adultscore=0 clxscore=1015 mlxscore=0 suspectscore=3 malwarescore=0 bulkscore=0 mlxlogscore=999 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2003020000 definitions=main-2005100065 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org There could be a race in function ext4_mb_discard_group_preallocations() where the 1st thread may iterate through group's bb_prealloc_list and remove all the PAs and add to function's local list head. Now if the 2nd thread comes in to discard the group preallocations, it will see that the group->bb_prealloc_list is empty and will return 0. Consider for a case where we have less number of groups (for e.g. just group 0), this may even return an -ENOSPC error from ext4_mb_new_blocks() (where we call for ext4_mb_discard_group_preallocations()). But that is wrong, since 2nd thread should have waited for 1st thread to release all the PAs and should have retried for allocation. Since 1st thread was anyway going to discard the PAs. The algorithm using this percpu seq counter goes below: 1. We sample the percpu discard_pa_seq counter before trying for block allocation in ext4_mb_new_blocks(). 2. We increment this percpu discard_pa_seq counter when we either allocate or free these blocks i.e. while marking those blocks as used/free in mb_mark_used()/mb_free_blocks(). 3. We also increment this percpu seq counter when we successfully identify that the bb_prealloc_list is not empty and hence proceed for discarding of those PAs inside ext4_mb_discard_group_preallocations(). Now to make sure that the regular fast path of block allocation is not affected, as a small optimization we only sample the percpu seq counter on that cpu. Only when the block allocation fails and when freed blocks found were 0, that is when we sample percpu seq counter for all cpus using below function ext4_get_discard_pa_seq_sum(). This happens after making sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty. It can be well argued that why don't just check for grp->bb_free to see if there are any free blocks to be allocated. So here are the two concerns which were discussed:- 1. If for some reason the blocks available in the group are not appropriate for allocation logic (say for e.g. EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then the retry logic may result into infinte looping since grp->bb_free is non-zero. 2. Also before preallocation was clubbed with block allocation with the same ext4_lock_group() held, there were lot of races where grp->bb_free could not be reliably relied upon. Due to above, this patch considers discard_pa_seq logic to determine if we should retry for block allocation. Say if there are are n threads trying for block allocation and none of those could allocate or discard any of the blocks, then all of those n threads will fail the block allocation and return -ENOSPC error. (Since the seq counter for all of those will match as no block allocation/discard was done during that duration). Signed-off-by: Ritesh Harjani --- fs/ext4/mballoc.c | 56 ++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 51 insertions(+), 5 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index c18670924bbe..ddf2179ed7cf 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -351,6 +351,35 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap, ext4_group_t group); static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac); +/* + * The algorithm using this percpu seq counter goes below: + * 1. We sample the percpu discard_pa_seq counter before trying for block + * allocation in ext4_mb_new_blocks(). + * 2. We increment this percpu discard_pa_seq counter when we either allocate + * or free these blocks i.e. while marking those blocks as used/free in + * mb_mark_used()/mb_free_blocks(). + * 3. We also increment this percpu seq counter when we successfully identify + * that the bb_prealloc_list is not empty and hence proceed for discarding + * of those PAs inside ext4_mb_discard_group_preallocations(). + * + * Now to make sure that the regular fast path of block allocation is not + * affected, as a small optimization we only sample the percpu seq counter + * on that cpu. Only when the block allocation fails and when freed blocks + * found were 0, that is when we sample percpu seq counter for all cpus using + * below function ext4_get_discard_pa_seq_sum(). This happens after making + * sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty. + */ +static DEFINE_PER_CPU(u64, discard_pa_seq); +static inline u64 ext4_get_discard_pa_seq_sum(void) +{ + int __cpu; + u64 __seq = 0; + + for_each_possible_cpu(__cpu) + __seq += per_cpu(discard_pa_seq, __cpu); + return __seq; +} + static inline void *mb_correct_addr_and_bit(int *bit, void *addr) { #if BITS_PER_LONG == 64 @@ -1462,6 +1491,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b, mb_check_buddy(e4b); mb_free_blocks_double(inode, e4b, first, count); + this_cpu_inc(discard_pa_seq); e4b->bd_info->bb_free += count; if (first < e4b->bd_info->bb_first_free) e4b->bd_info->bb_first_free = first; @@ -1603,6 +1633,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex) mb_check_buddy(e4b); mb_mark_used_double(e4b, start, len); + this_cpu_inc(discard_pa_seq); e4b->bd_info->bb_free -= len; if (e4b->bd_info->bb_first_free == start) e4b->bd_info->bb_first_free += len; @@ -3989,6 +4020,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb, INIT_LIST_HEAD(&list); repeat: ext4_lock_group(sb, group); + this_cpu_inc(discard_pa_seq); list_for_each_entry_safe(pa, tmp, &grp->bb_prealloc_list, pa_group_list) { spin_lock(&pa->pa_lock); @@ -4571,14 +4603,26 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed) } static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb, - struct ext4_allocation_context *ac) + struct ext4_allocation_context *ac, u64 *seq) { int freed; + u64 seq_retry = 0; + bool ret = false; freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len); - if (freed) - return true; - return false; + if (freed) { + ret = true; + goto out_dbg; + } + seq_retry = ext4_get_discard_pa_seq_sum(); + if (seq_retry != *seq) { + *seq = seq_retry; + ret = true; + } + +out_dbg: + mb_debug(sb, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no"); + return ret; } /* @@ -4595,6 +4639,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, ext4_fsblk_t block = 0; unsigned int inquota = 0; unsigned int reserv_clstrs = 0; + u64 seq; might_sleep(); sb = ar->inode->i_sb; @@ -4657,6 +4702,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, } ac->ac_op = EXT4_MB_HISTORY_PREALLOC; + seq = *this_cpu_ptr(&discard_pa_seq); if (!ext4_mb_use_preallocated(ac)) { ac->ac_op = EXT4_MB_HISTORY_ALLOC; ext4_mb_normalize_request(ac, ar); @@ -4693,7 +4739,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, ar->len = ac->ac_b_ex.fe_len; } } else { - if (ext4_mb_discard_preallocations_should_retry(sb, ac)) + if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq)) goto repeat; /* * If block allocation fails then the pa allocated above