From patchwork Thu Nov 30 13:53:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ojaswin Mujoo X-Patchwork-Id: 13474418 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="W0R0ufQR" Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8829B170B; Thu, 30 Nov 2023 05:53:36 -0800 (PST) Received: from pps.filterd (m0353729.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDd9qW001806; Thu, 30 Nov 2023 13:53:30 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=dNu5Klycj9CTH42YxtPoVOXxFekkC+Q8QRK1Zv71CMw=; b=W0R0ufQRr1h7eO4seILCjiyHBmegJXEd3o0i7/eDhacv8GcD2fbLv8GZ0g1UYQJJ9UU/ SqVvLD9B2uWEwauDzYdCGx+XzbpewXtPkOoawpAgiG+fv2BmEcsZiw/o8h/QfkPWAPfH 7ItIEWKhqQ9vywNO8WVn8YLjIQMVha2NMRKJek3QyyucUjbI3pWyAeNffYTzIETGPR0r PZLfYsKd2XwygwzGxCby/bs+kXWhEj4fWaEl2ny9eH/f0razqAIAiysd4YXov+tmmdCQ Kvfwm14HKXRwMwiqAhbUXwZGNbme7EZosoV4A4Nf6nKWB2EUwsdlyCXbotAMYtVi2Tq5 QQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upuc30mkr-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:30 +0000 Received: from m0353729.ppops.net (m0353729.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AUDea4V004643; Thu, 30 Nov 2023 13:53:29 GMT Received: from ppma13.dal12v.mail.ibm.com (dd.9e.1632.ip4.static.sl-reverse.com [50.22.158.221]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upuc30mkb-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:29 +0000 Received: from pps.filterd (ppma13.dal12v.mail.ibm.com [127.0.0.1]) by ppma13.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDn9ju029475; Thu, 30 Nov 2023 13:53:28 GMT Received: from smtprelay04.fra02v.mail.ibm.com ([9.218.2.228]) by ppma13.dal12v.mail.ibm.com (PPS) with ESMTPS id 3ukwfke1r8-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:27 +0000 Received: from smtpav01.fra02v.mail.ibm.com (smtpav01.fra02v.mail.ibm.com [10.20.54.100]) by smtprelay04.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AUDrOku42139910 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 30 Nov 2023 13:53:25 GMT Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id DDDAB2004B; Thu, 30 Nov 2023 13:53:24 +0000 (GMT) Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 45D4D20043; Thu, 30 Nov 2023 13:53:22 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.76.38]) by smtpav01.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 30 Nov 2023 13:53:22 +0000 (GMT) From: Ojaswin Mujoo To: linux-ext4@vger.kernel.org, "Theodore Ts'o" Cc: Ritesh Harjani , linux-kernel@vger.kernel.org, "Darrick J . 
Wong" , linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, John Garry , dchinner@redhat.com Subject: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic Date: Thu, 30 Nov 2023 19:23:09 +0530 Message-Id: <09ec4c88b565c85dee91eccf6e894a0c047d9e69.1701339358.git.ojaswin@linux.ibm.com> X-Mailer: git-send-email 2.39.3 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: s7Ig9mp1nPeU4UDRadYI1W8VOuasZx2S X-Proofpoint-ORIG-GUID: -9Q93Ol6p1bj0dK9orlZJjs2_NSBq4RL X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-30_12,2023-11-30_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 lowpriorityscore=0 suspectscore=0 adultscore=0 clxscore=1015 impostorscore=0 phishscore=0 priorityscore=1501 bulkscore=0 mlxlogscore=822 spamscore=0 malwarescore=0 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311300102 Currently, iomap only supports atomic writes for direct IOs and there is no guarantees that a buffered IO will be atomic. Hence, if the user has explicitly requested the direct write to be atomic and there's a failure, return -EIO instead of falling back to buffered IO. Signed-off-by: Ojaswin Mujoo --- fs/iomap/direct-io.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index 6ef25e26f1a1..3e7cd9bc8f4d 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -662,7 +662,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (ret != -EAGAIN) { trace_iomap_dio_invalidate_fail(inode, iomi.pos, iomi.len); - ret = -ENOTBLK; + /* + * if this write was supposed to be atomic, + * return the err rather than trying to fall + * back to buffered IO. 
+ */ + if (!atomic_write) + ret = -ENOTBLK; } goto out_free_dio; } From patchwork Thu Nov 30 13:53:11 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ojaswin Mujoo X-Patchwork-Id: 13474421 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="bNsP7bGw" Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7C273173E; Thu, 30 Nov 2023 05:53:40 -0800 (PST) Received: from pps.filterd (m0356516.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnWpb001333; Thu, 30 Nov 2023 13:53:35 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=GEbojckG87Zkl5i2XvaH4H480xxiwFF4r2OKYlabK54=; b=bNsP7bGwRrqT00WlGk/iKOOA7N4n/iq5dA7a85ARavnDOx+4XDUHbnxWyWJLW6k99Z3r CGgG4d5lpUY0H2eQj8oulrRYCCawLy7oc6FdnMvvt7SuAqoFlO9ZE+ZckdC2Pu9APDA+ fZIyJqW3wXuZvt1lHmpctR1PX/eTluy9SYOsVDAJZQBi5cr8dsV1D0S3qptmq14rIGo2 PLjbhX4nLoxLctKffsZGhRwlhs4rlpY2H8+sY/C3Ch/a/AUfWhU5iyZxhvbZ+sHkBiLR JBRmqy1HIGtnCXeKEkJ2zOY74PoNnNcr4J0QErquGSI/iMu47Iu3r/VZ6B7lnJik2e0s yg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3g4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:34 +0000 Received: from m0356516.ppops.net (m0356516.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AUDowTQ006599; Thu, 30 Nov 2023 13:53:34 GMT Received: from ppma23.wdc07v.mail.ibm.com (5d.69.3da9.ip4.static.sl-reverse.com [169.61.105.93]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3fh-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:34 +0000 Received: from pps.filterd (ppma23.wdc07v.mail.ibm.com [127.0.0.1]) by ppma23.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnBJU008115; Thu, 30 Nov 2023 13:53:33 GMT Received: from smtprelay04.fra02v.mail.ibm.com ([9.218.2.228]) by ppma23.wdc07v.mail.ibm.com (PPS) with ESMTPS id 3ukvrkx98c-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:33 +0000 Received: from smtpav01.fra02v.mail.ibm.com (smtpav01.fra02v.mail.ibm.com [10.20.54.100]) by smtprelay04.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AUDrVvQ18416372 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 30 Nov 2023 13:53:31 GMT Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id E89C22004B; Thu, 30 Nov 2023 13:53:30 +0000 (GMT) Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 5E8CD20043; Thu, 30 Nov 2023 13:53:28 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.76.38]) by smtpav01.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 30 Nov 2023 13:53:28 +0000 (GMT) From: Ojaswin Mujoo To: linux-ext4@vger.kernel.org, "Theodore Ts'o" Cc: Ritesh Harjani , linux-kernel@vger.kernel.org, "Darrick J . 
Wong" , linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, John Garry , dchinner@redhat.com Subject: [RFC 2/7] ext4: Factor out size and start prediction from ext4_mb_normalize_request() Date: Thu, 30 Nov 2023 19:23:11 +0530 Message-Id: X-Mailer: git-send-email 2.39.3 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: qy_SkSW57mW-hjhbyF5qADthnI7ePZDv X-Proofpoint-ORIG-GUID: p6EWJrk5DgzIlF1RQ43yQLDbmbht6XtD X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-30_12,2023-11-30_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 bulkscore=0 mlxlogscore=984 adultscore=0 lowpriorityscore=0 impostorscore=0 suspectscore=0 phishscore=0 malwarescore=0 spamscore=0 priorityscore=1501 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311300102 As a part of trimming down the size of ext4_mb_normalize_request(), factor out the logic to predict normalized start and size to a separate function ext4_mb_pa_predict_size(). This is no functional change in this patch. Signed-off-by: Ojaswin Mujoo --- fs/ext4/mballoc.c | 95 ++++++++++++++++++++++++++++------------------- 1 file changed, 56 insertions(+), 39 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 0b0aff458efd..3eb7b639d36e 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4394,6 +4394,58 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac, *end = new_end; } +static void ext4_mb_pa_predict_size(struct ext4_allocation_context *ac, + loff_t *start, loff_t *size) +{ + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); + loff_t new_size = *size; + loff_t new_start = *start; + int bsbits, max; + + bsbits = ac->ac_sb->s_blocksize_bits; + /* max size of free chunks */ + max = 2 << bsbits; + +#define NRL_CHECK_SIZE(req, size, max, chunk_size) \ + (req <= (size) || max <= (chunk_size)) + + if (new_size <= 16 * 1024) { + new_size = 16 * 1024; + } else if (new_size <= 32 * 1024) { + new_size = 32 * 1024; + } else if (new_size <= 64 * 1024) { + new_size = 64 * 1024; + } else if (new_size <= 128 * 1024) { + new_size = 128 * 1024; + } else if (new_size <= 256 * 1024) { + new_size = 256 * 1024; + } else if (new_size <= 512 * 1024) { + new_size = 512 * 1024; + } else if (new_size <= 1024 * 1024) { + new_size = 1024 * 1024; + } else if (NRL_CHECK_SIZE(new_size, 4 * 1024 * 1024, max, 2 * 1024)) { + new_start = ((loff_t)ac->ac_o_ex.fe_logical >> + (21 - bsbits)) << 21; + new_size = 2 * 1024 * 1024; + } else if (NRL_CHECK_SIZE(new_size, 8 * 1024 * 1024, max, 4 * 1024)) { + new_start = ((loff_t)ac->ac_o_ex.fe_logical >> + (22 - bsbits)) << 22; + new_size = 4 * 1024 * 1024; + } else if (NRL_CHECK_SIZE(EXT4_C2B(sbi, ac->ac_o_ex.fe_len), + (8<<20)>>bsbits, max, 8 * 1024)) { + new_start = ((loff_t)ac->ac_o_ex.fe_logical >> + (23 - bsbits)) << 23; + new_size = 8 * 1024 * 1024; + } else { + new_start = (loff_t) ac->ac_o_ex.fe_logical << bsbits; + new_size = (loff_t) EXT4_C2B(sbi, + ac->ac_o_ex.fe_len) << bsbits; + } + + *size = new_size; + *start = new_start; +} + /* * Normalization means making request better in terms of * size and alignment @@ -4404,7 +4456,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, { struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); struct 
ext4_super_block *es = sbi->s_es; - int bsbits, max; + int bsbits; loff_t size, start_off, end; loff_t orig_size __maybe_unused; ext4_lblk_t start; @@ -4438,47 +4490,12 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, size = i_size_read(ac->ac_inode); orig_size = size; - /* max size of free chunks */ - max = 2 << bsbits; - -#define NRL_CHECK_SIZE(req, size, max, chunk_size) \ - (req <= (size) || max <= (chunk_size)) - /* first, try to predict filesize */ /* XXX: should this table be tunable? */ start_off = 0; - if (size <= 16 * 1024) { - size = 16 * 1024; - } else if (size <= 32 * 1024) { - size = 32 * 1024; - } else if (size <= 64 * 1024) { - size = 64 * 1024; - } else if (size <= 128 * 1024) { - size = 128 * 1024; - } else if (size <= 256 * 1024) { - size = 256 * 1024; - } else if (size <= 512 * 1024) { - size = 512 * 1024; - } else if (size <= 1024 * 1024) { - size = 1024 * 1024; - } else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, 2 * 1024)) { - start_off = ((loff_t)ac->ac_o_ex.fe_logical >> - (21 - bsbits)) << 21; - size = 2 * 1024 * 1024; - } else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, 4 * 1024)) { - start_off = ((loff_t)ac->ac_o_ex.fe_logical >> - (22 - bsbits)) << 22; - size = 4 * 1024 * 1024; - } else if (NRL_CHECK_SIZE(EXT4_C2B(sbi, ac->ac_o_ex.fe_len), - (8<<20)>>bsbits, max, 8 * 1024)) { - start_off = ((loff_t)ac->ac_o_ex.fe_logical >> - (23 - bsbits)) << 23; - size = 8 * 1024 * 1024; - } else { - start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits; - size = (loff_t) EXT4_C2B(sbi, - ac->ac_o_ex.fe_len) << bsbits; - } + + ext4_mb_pa_predict_size(ac, &start_off, &size); + size = size >> bsbits; start = start_off >> bsbits; From patchwork Thu Nov 30 13:53:12 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ojaswin Mujoo X-Patchwork-Id: 13474422 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="ABOUMGVV" Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2E7A719B2; Thu, 30 Nov 2023 05:53:43 -0800 (PST) Received: from pps.filterd (m0353727.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDlTRp023890; Thu, 30 Nov 2023 13:53:38 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=bYDR8LkQhx5OY1pT1Z1ofFxDBMEDbq4D27syqk5XqQU=; b=ABOUMGVVwncvaTZmqhRw6fiMBHpIhDnU6r7kID3adD037aHOcPvVbGPK9F2Qb/VISvAj 4dRnxuimG2qamfS0ravs6RdX5UYiT+i6o8tMPPBesYzi9uvICUP9o0gUqsO4NF+/0E1B CkNJGhe6zp2Zj31fr0TOhJv0QdHfompNh+8vBkcuYWtN3M+ecpqwMoF46YCqjYP/iHzf MArsSA+FumxmYs2GGtlL4dYMmSgcEXCegJhWVbGrwG4hR38QIjlO60l1WXp4l6hoRHRJ c55KeMgNss6oFNlpBX6by7pQeqfnCwf872R+PH6/Tk7pJhn6OEkzHvznr3dOIAVhKh04 pQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upuk2r5dm-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:37 +0000 Received: from m0353727.ppops.net (m0353727.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AUDoT6K001367; Thu, 30 Nov 2023 13:53:37 GMT Received: from ppma13.dal12v.mail.ibm.com (dd.9e.1632.ip4.static.sl-reverse.com [50.22.158.221]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upuk2r5cu-1 (version=TLSv1.2 
cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:37 +0000 Received: from pps.filterd (ppma13.dal12v.mail.ibm.com [127.0.0.1]) by ppma13.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnBZj029489; Thu, 30 Nov 2023 13:53:36 GMT Received: from smtprelay02.fra02v.mail.ibm.com ([9.218.2.226]) by ppma13.dal12v.mail.ibm.com (PPS) with ESMTPS id 3ukwfke1sd-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:36 +0000 Received: from smtpav01.fra02v.mail.ibm.com (smtpav01.fra02v.mail.ibm.com [10.20.54.100]) by smtprelay02.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AUDrYss18350602 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 30 Nov 2023 13:53:34 GMT Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 82CC820043; Thu, 30 Nov 2023 13:53:34 +0000 (GMT) Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 6F9442004B; Thu, 30 Nov 2023 13:53:31 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.76.38]) by smtpav01.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 30 Nov 2023 13:53:31 +0000 (GMT) From: Ojaswin Mujoo To: linux-ext4@vger.kernel.org, "Theodore Ts'o" Cc: Ritesh Harjani , linux-kernel@vger.kernel.org, "Darrick J . Wong" , linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, John Garry , dchinner@redhat.com Subject: [RFC 3/7] ext4: add aligned allocation support in mballoc Date: Thu, 30 Nov 2023 19:23:12 +0530 Message-Id: <7c652ff11d4d52466e0d40fc9bdd1a0c24fc80fa.1701339358.git.ojaswin@linux.ibm.com> X-Mailer: git-send-email 2.39.3 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: IE4ZHkm339pwYeRMJWL0J4jJf1oSVeM2 X-Proofpoint-ORIG-GUID: fkaP7um2Xq0baNu7XtfiAy1OY4Z-VHN- X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-30_12,2023-11-30_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 impostorscore=0 priorityscore=1501 adultscore=0 spamscore=0 bulkscore=0 suspectscore=0 mlxscore=0 lowpriorityscore=0 mlxlogscore=999 malwarescore=0 clxscore=1015 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311300102 Add support in mballoc for allocating blocks that are aligned to a certain power-of-2 offset. 1. We define a new flag EXT4_MB_ALIGNED_ALLOC to indicate that we want an aligned allocation. 2. The alignment is determined by the length of the allocation, for example if we ask for 8192 bytes, then the alignment of physical blocks will also be 8192 bytes aligned (ie 2 blocks aligned on 4k blocksize). 3. We dont yet support arbitrary alignment. For aligned writes, the length/alignment must be power of 2 blocks, ie for 4k blocksize we can get 4k byte aligned, 8k byte aligned, 16k byte aligned ... allocation but not 12k byte aligned. 4. We use CR_POWER2_ALIGNED criteria for aligned allocation which by design allocates in an aligned manner. Since CR_POWER2_ALIGNED needs the ac->ac_g_ex.fe_len to be power of 2, thats where the restriction in point 3 above comes from. 
Since right now aligned allocation support is added mainly for atomic writes use case, this restriction should be fine since atomic write capable devices usually support only power of 2 alignments 5. For ease of review enabling inode preallocation support is done in upcoming patches and is disabled in this patch. 6. In case we can't find anything in CR_POWER2_ALIGNED, we return ENOSPC. Signed-off-by: Ojaswin Mujoo --- fs/ext4/ext4.h | 6 ++-- fs/ext4/mballoc.c | 69 ++++++++++++++++++++++++++++++++++--- include/trace/events/ext4.h | 1 + 3 files changed, 69 insertions(+), 7 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 9418359b1d9d..38a77148b85c 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -216,9 +216,11 @@ enum criteria { /* Large fragment size list lookup succeeded at least once for cr = 0 */ #define EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED 0x8000 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */ -#define EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED 0x00010000 +#define EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED 0x10000 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1.5 */ -#define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED 0x00020000 +#define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED 0x20000 +/* The allocation must respect alignment requirements for physical blocks */ +#define EXT4_MB_ALIGNED_ALLOC 0x40000 struct ext4_allocation_request { /* target inode for block we're allocating */ diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 3eb7b639d36e..b1df531e6db3 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2150,8 +2150,11 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, * user requested originally, we store allocated * space in a special descriptor. */ - if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) + if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) { + /* Aligned allocation doesn't have preallocation support */ + WARN_ON(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC); ext4_mb_new_preallocation(ac); + } } @@ -2784,10 +2787,15 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) BUG_ON(ac->ac_status == AC_STATUS_FOUND); - /* first, try the goal */ - err = ext4_mb_find_by_goal(ac, &e4b); - if (err || ac->ac_status == AC_STATUS_FOUND) - goto out; + /* + * first, try the goal. Skip trying goal for aligned allocations since + * goal determination logic is not alignment aware (yet) + */ + if (!(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC)) { + err = ext4_mb_find_by_goal(ac, &e4b); + if (err || ac->ac_status == AC_STATUS_FOUND) + goto out; + } if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY)) goto out; @@ -2828,9 +2836,26 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) */ if (ac->ac_2order) cr = CR_POWER2_ALIGNED; + else + WARN_ON(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC && + ac->ac_g_ex.fe_len > 1); repeat: for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) { ac->ac_criteria = cr; + + if (ac->ac_criteria > CR_POWER2_ALIGNED && + ac->ac_flags & EXT4_MB_ALIGNED_ALLOC && + ac->ac_g_ex.fe_len > 1 + ) { + /* + * Aligned allocation only supports power 2 alignment + * values which can only be satisfied by + * CR_POWER2_ALIGNED. 
The exception being allocations of + * 1 block which can be done via any criteria + */ + break; + } + /* * searching for the right group start * from the goal value specified @@ -2955,6 +2980,23 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) if (!err && ac->ac_status != AC_STATUS_FOUND && first_err) err = first_err; + if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC && ac->ac_status == AC_STATUS_FOUND) { + ext4_fsblk_t start = ext4_grp_offs_to_block(sb, &ac->ac_b_ex); + ext4_grpblk_t len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len); + + if (!len) { + ext4_warning(sb, "Expected a non zero len extent"); + ac->ac_status = AC_STATUS_BREAK; + goto exit; + } + + WARN_ON(!is_power_of_2(len)); + WARN_ON(start % len); + /* We don't support preallocation yet */ + WARN_ON(ac->ac_b_ex.fe_len != ac->ac_o_ex.fe_len); + } + + exit: mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n", ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status, ac->ac_flags, cr, err); @@ -4475,6 +4517,13 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC) return; + /* + * caller may have strict alignment requirements. In this case, avoid + * normalization since it is not alignment aware. + */ + if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) + return; + if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) { ext4_mb_normalize_group_request(ac); return ; @@ -4790,6 +4839,10 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac) if (!(ac->ac_flags & EXT4_MB_HINT_DATA)) return false; + /* using preallocated blocks is not alignment aware. */ + if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) + return false; + /* * first, try per-file preallocation by searching the inode pa rbtree. * @@ -6069,6 +6122,12 @@ static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb, u64 seq_retry = 0; bool ret = false; + /* No need to retry for aligned allocations */ + if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) { + ret = false; + goto out_dbg; + } + freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len); if (freed) { ret = true; diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index 65029dfb92fb..56895cfb5781 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -36,6 +36,7 @@ struct partial_cluster; { EXT4_MB_STREAM_ALLOC, "STREAM_ALLOC" }, \ { EXT4_MB_USE_ROOT_BLOCKS, "USE_ROOT_BLKS" }, \ { EXT4_MB_USE_RESERVED, "USE_RESV" }, \ + { EXT4_MB_ALIGNED_ALLOC, "ALIGNED_ALLOC" }, \ { EXT4_MB_STRICT_CHECK, "STRICT_CHECK" }) #define show_map_flags(flags) __print_flags(flags, "|", \ From patchwork Thu Nov 30 13:53:13 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ojaswin Mujoo X-Patchwork-Id: 13474423 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="UZJKC68m" Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 736FB170B; Thu, 30 Nov 2023 05:53:47 -0800 (PST) Received: from pps.filterd (m0356516.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnZ3h001557; Thu, 30 Nov 2023 13:53:41 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=OLCZNiRjZ6gnwfClcveUD+5icHyr7gsNTwEW2i0DScM=; 
b=UZJKC68myhv6LTIU+jkYnDf4TRRJduigFshj9QUu0HESNTkq2QqwUqvll0LsG6WRBXuV JctlJ1CTctR5rcDFprEi+Biz+wb/XXZRY+l0FFg8GEpgN91a94xJC3r0ItYQ5CsDwG1A MOt0cQS6B3M1PunIJHF36Rc+QrDyr4yqAJrTP4M9aqR0uYeEbtqWGV7wILcKHFScJuGe sDe/4h07okqb2Xn9qxxe0oQbulchV9V4YjXaqt3gjab/UHscf8XCMNofPHL98s0Ge2J8 jFCGOQPUmpa6Zj/m4aHjtXG7oIcQVx3jGbE13kEHd3Pt1qHwchOaEsjW6Iy73ZDLnJ9S OQ== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3k6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:41 +0000 Received: from m0356516.ppops.net (m0356516.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AUDnb5M001726; Thu, 30 Nov 2023 13:53:40 GMT Received: from ppma22.wdc07v.mail.ibm.com (5c.69.3da9.ip4.static.sl-reverse.com [169.61.105.92]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3ju-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:40 +0000 Received: from pps.filterd (ppma22.wdc07v.mail.ibm.com [127.0.0.1]) by ppma22.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDn2Qi021292; Thu, 30 Nov 2023 13:53:39 GMT Received: from smtprelay03.fra02v.mail.ibm.com ([9.218.2.224]) by ppma22.wdc07v.mail.ibm.com (PPS) with ESMTPS id 3ukumyxkr1-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:39 +0000 Received: from smtpav01.fra02v.mail.ibm.com (smtpav01.fra02v.mail.ibm.com [10.20.54.100]) by smtprelay03.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AUDrbsc19530370 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 30 Nov 2023 13:53:37 GMT Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 93A3E2004B; Thu, 30 Nov 2023 13:53:37 +0000 (GMT) Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0A86B20040; Thu, 30 Nov 2023 13:53:35 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.76.38]) by smtpav01.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 30 Nov 2023 13:53:34 +0000 (GMT) From: Ojaswin Mujoo To: linux-ext4@vger.kernel.org, "Theodore Ts'o" Cc: Ritesh Harjani , linux-kernel@vger.kernel.org, "Darrick J . Wong" , linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, John Garry , dchinner@redhat.com Subject: [RFC 4/7] ext4: allow inode preallocation for aligned alloc Date: Thu, 30 Nov 2023 19:23:13 +0530 Message-Id: <74aceb317593df40539a0a3e109406992600853c.1701339358.git.ojaswin@linux.ibm.com> X-Mailer: git-send-email 2.39.3 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: y7j525rcsX4Mk3vCDTfdNxmNhtFW-PmU X-Proofpoint-ORIG-GUID: JpD_H5u2m3yifwts5EBH1iG38hWe4U9Z X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-30_12,2023-11-30_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 bulkscore=0 mlxlogscore=994 adultscore=0 lowpriorityscore=0 impostorscore=0 suspectscore=0 phishscore=0 malwarescore=0 spamscore=0 priorityscore=1501 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311300102 Enable inode preallocation support for aligned allocations. 
Inode preallocation will only be used if the preallocated blocks are able to satisfy the length and alignment requirements of the allocations, else we disable preallocation for this particular allocation and proceed as usual. Disabling inode preallocation is required otherwise we might end up with overlapping preallocated ranges which can trigger a BUG() later. While normalizing the request, we need to make sure that: 1. start of normalized(goal) request matches original request so it is easier to align it during actual allocations. This prevents various edge cases where the start of goal is different than original start making it trickier to align the original start as requested by user. 2. the length of goal should not be smaller than original and should be a power of 2. For now, group preallocation is disabled for aligned allocations. Signed-off-by: Ojaswin Mujoo --- fs/ext4/mballoc.c | 168 +++++++++++++++++++++++++++++----------------- 1 file changed, 107 insertions(+), 61 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index b1df531e6db3..c21b2758c3f0 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2151,8 +2151,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac, * space in a special descriptor. */ if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) { - /* Aligned allocation doesn't have preallocation support */ - WARN_ON(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC); ext4_mb_new_preallocation(ac); } @@ -2992,8 +2990,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac) WARN_ON(!is_power_of_2(len)); WARN_ON(start % len); - /* We don't support preallocation yet */ - WARN_ON(ac->ac_b_ex.fe_len != ac->ac_o_ex.fe_len); + WARN_ON(ac->ac_b_ex.fe_len < ac->ac_o_ex.fe_len); } exit: @@ -4309,7 +4306,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac, struct ext4_prealloc_space *tmp_pa = NULL, *left_pa = NULL, *right_pa = NULL; struct rb_node *iter; ext4_lblk_t new_start, tmp_pa_start, right_pa_start = -1; - loff_t new_end, tmp_pa_end, left_pa_end = -1; + loff_t size, new_end, tmp_pa_end, left_pa_end = -1; new_start = *start; new_end = *end; @@ -4429,6 +4426,22 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac, } read_unlock(&ei->i_prealloc_lock); + if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) { + /* + * Aligned allocation happens via CR_POWER2_ALIGNED criteria + * hence we must make sure that the new size is a power of 2. + */ + size = new_end - new_start; + size = (loff_t)1 << (fls64(size) - 1); + + if (unlikely(size < ac->ac_o_ex.fe_len)) + size = ac->ac_o_ex.fe_len; + new_end = new_start + size; + + WARN_ON(*start != new_start); + WARN_ON(!is_power_of_2(size)); + } + /* XXX: extra loop to check we really don't overlap preallocations */ ext4_mb_pa_assert_overlap(ac, new_start, new_end); @@ -4484,6 +4497,21 @@ static void ext4_mb_pa_predict_size(struct ext4_allocation_context *ac, ac->ac_o_ex.fe_len) << bsbits; } + /* + * For aligned allocations, we need to ensure 2 things: + * + * 1. The start should remain same as original start so that finding + * aligned physical blocks for it is straight forward. + * + * 2. The new_size should not be less than the original len. This + * can sometimes happen due to the way we predict size above. 
+ */ + if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) { + new_start = ac->ac_o_ex.fe_logical << bsbits; + new_size = max_t(loff_t, new_size, + EXT4_C2B(sbi, ac->ac_o_ex.fe_len) << bsbits); + } + *size = new_size; *start = new_start; } @@ -4517,13 +4545,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC) return; - /* - * caller may have strict alignment requirements. In this case, avoid - * normalization since it is not alignment aware. - */ - if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) - return; - if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) { ext4_mb_normalize_group_request(ac); return ; @@ -4557,8 +4578,13 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac, start = max(start, rounddown(ac->ac_o_ex.fe_logical, (ext4_lblk_t)EXT4_BLOCKS_PER_GROUP(ac->ac_sb))); - /* don't cover already allocated blocks in selected range */ + /* + * don't cover already allocated blocks in selected range. For aligned + * alloc, since we don't change the original start we should ideally not + * enter this if block. + */ if (ar->pleft && start <= ar->lleft) { + WARN_ON(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC); size -= ar->lleft + 1 - start; start = ar->lleft + 1; } @@ -4791,32 +4817,46 @@ ext4_mb_check_group_pa(ext4_fsblk_t goal_block, } /* - * check if found pa meets EXT4_MB_HINT_GOAL_ONLY + * check if found pa meets EXT4_MB_HINT_GOAL_ONLY or EXT4_MB_ALIGNED_ALLOC */ static bool -ext4_mb_pa_goal_check(struct ext4_allocation_context *ac, +ext4_mb_pa_check(struct ext4_allocation_context *ac, struct ext4_prealloc_space *pa) { struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb); ext4_fsblk_t start; - if (likely(!(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))) + if (likely(!(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY || + ac->ac_flags & EXT4_MB_ALIGNED_ALLOC))) return true; - /* - * If EXT4_MB_HINT_GOAL_ONLY is set, ac_g_ex will not be adjusted - * in ext4_mb_normalize_request and will keep same with ac_o_ex - * from ext4_mb_initialize_context. Choose ac_g_ex here to keep - * consistent with ext4_mb_find_by_goal. - */ - start = pa->pa_pstart + - (ac->ac_g_ex.fe_logical - pa->pa_lstart); - if (ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex) != start) - return false; + if (ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY) { + /* + * If EXT4_MB_HINT_GOAL_ONLY is set, ac_g_ex will not be adjusted + * in ext4_mb_normalize_request and will keep same with ac_o_ex + * from ext4_mb_initialize_context. Choose ac_g_ex here to keep + * consistent with ext4_mb_find_by_goal. + */ + start = pa->pa_pstart + + (ac->ac_g_ex.fe_logical - pa->pa_lstart); + if (ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex) != start) + return false; - if (ac->ac_g_ex.fe_len > pa->pa_len - - EXT4_B2C(sbi, ac->ac_g_ex.fe_logical - pa->pa_lstart)) - return false; + if (ac->ac_g_ex.fe_len > + pa->pa_len - EXT4_B2C(sbi, ac->ac_g_ex.fe_logical - + pa->pa_lstart)) + return false; + } else if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) { + start = pa->pa_pstart + + (ac->ac_g_ex.fe_logical - pa->pa_lstart); + if (start % EXT4_C2B(sbi, ac->ac_g_ex.fe_len)) + return false; + + if (EXT4_C2B(sbi, ac->ac_g_ex.fe_len) > + (EXT4_C2B(sbi, pa->pa_len) - + (ac->ac_g_ex.fe_logical - pa->pa_lstart))) + return false; + } return true; } @@ -4839,10 +4879,6 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac) if (!(ac->ac_flags & EXT4_MB_HINT_DATA)) return false; - /* using preallocated blocks is not alignment aware. 
*/ - if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) - return false; - /* * first, try per-file preallocation by searching the inode pa rbtree. * @@ -4948,41 +4984,49 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac) goto try_group_pa; } - if (tmp_pa->pa_free && likely(ext4_mb_pa_goal_check(ac, tmp_pa))) { + if (tmp_pa->pa_free && likely(ext4_mb_pa_check(ac, tmp_pa))) { atomic_inc(&tmp_pa->pa_count); ext4_mb_use_inode_pa(ac, tmp_pa); spin_unlock(&tmp_pa->pa_lock); read_unlock(&ei->i_prealloc_lock); return true; } else { + if (tmp_pa->pa_free == 0) + /* + * We found a valid overlapping pa but couldn't use it because + * it had no free blocks. This should ideally never happen + * because: + * + * 1. When a new inode pa is added to rbtree it must have + * pa_free > 0 since otherwise we won't actually need + * preallocation. + * + * 2. An inode pa that is in the rbtree can only have it's + * pa_free become zero when another thread calls: + * ext4_mb_new_blocks + * ext4_mb_use_preallocated + * ext4_mb_use_inode_pa + * + * 3. Further, after the above calls make pa_free == 0, we will + * immediately remove it from the rbtree in: + * ext4_mb_new_blocks + * ext4_mb_release_context + * ext4_mb_put_pa + * + * 4. Since the pa_free becoming 0 and pa_free getting removed + * from tree both happen in ext4_mb_new_blocks, which is always + * called with i_data_sem held for data allocations, we can be + * sure that another process will never see a pa in rbtree with + * pa_free == 0. + */ + WARN_ON_ONCE(tmp_pa->pa_free == 0); /* - * We found a valid overlapping pa but couldn't use it because - * it had no free blocks. This should ideally never happen - * because: - * - * 1. When a new inode pa is added to rbtree it must have - * pa_free > 0 since otherwise we won't actually need - * preallocation. - * - * 2. An inode pa that is in the rbtree can only have it's - * pa_free become zero when another thread calls: - * ext4_mb_new_blocks - * ext4_mb_use_preallocated - * ext4_mb_use_inode_pa - * - * 3. Further, after the above calls make pa_free == 0, we will - * immediately remove it from the rbtree in: - * ext4_mb_new_blocks - * ext4_mb_release_context - * ext4_mb_put_pa - * - * 4. Since the pa_free becoming 0 and pa_free getting removed - * from tree both happen in ext4_mb_new_blocks, which is always - * called with i_data_sem held for data allocations, we can be - * sure that another process will never see a pa in rbtree with - * pa_free == 0. + * If we come here we need to disable preallocations else we'd + * have multiple preallocations for the same logical offset + * which is not allowed and will cause BUG_ONs to be triggered + * later. 
*/ - WARN_ON_ONCE(tmp_pa->pa_free == 0); + ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC; } spin_unlock(&tmp_pa->pa_lock); try_group_pa: @@ -5818,6 +5862,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac) int bsbits = ac->ac_sb->s_blocksize_bits; loff_t size, isize; bool inode_pa_eligible, group_pa_eligible; + bool is_aligned = (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC); if (!(ac->ac_flags & EXT4_MB_HINT_DATA)) return; @@ -5825,7 +5870,8 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac) if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY)) return; - group_pa_eligible = sbi->s_mb_group_prealloc > 0; + /* Aligned allocation does not support group pa */ + group_pa_eligible = (!is_aligned && sbi->s_mb_group_prealloc > 0); inode_pa_eligible = true; size = extent_logical_end(sbi, &ac->ac_o_ex); isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1) From patchwork Thu Nov 30 13:53:14 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ojaswin Mujoo X-Patchwork-Id: 13474424 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="NeJBBMoZ" Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6B8BD1715; Thu, 30 Nov 2023 05:53:49 -0800 (PST) Received: from pps.filterd (m0356516.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnXKC001388; Thu, 30 Nov 2023 13:53:44 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=I4EWMLFOy1eJZwXfsCVABkVMsuSEuv+Jei1udxbxQWg=; b=NeJBBMoZ2jTcxxtiQqRVxMOnkn9+H9cYcY0rFEwQXLb9TgqLzXZn9yw7hjaj+06nHgh4 Yl4ZIhMeWDhjkrcgpXgf8Bp37Guk7+zx3Rgy6oA9siMVNGMsaqkuNzKTGB8ZYzRBBkka 3hUQ63MV8G0HgOvEFhEegkmvGN11EqJ26LRSLPt3XL/6nxd1CAqcCDA010qaX0u7fLsL /mwwRMJkmoSNhwFnMWhWzMvs/kx9KNleQN3n9BWOwuTwWTdRAmJvaPESFN+R+HT26ZeR W4i5XnFrIEcoGmtr292CzejtyJ47BdxVgUbkaWkG+/oB3mY/ac4MADQpCxVhTPBWxEo+ Zw== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3mv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:44 +0000 Received: from m0356516.ppops.net (m0356516.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AUDowTT006599; Thu, 30 Nov 2023 13:53:43 GMT Received: from ppma13.dal12v.mail.ibm.com (dd.9e.1632.ip4.static.sl-reverse.com [50.22.158.221]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3mj-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:43 +0000 Received: from pps.filterd (ppma13.dal12v.mail.ibm.com [127.0.0.1]) by ppma13.dal12v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnC7r029505; Thu, 30 Nov 2023 13:53:42 GMT Received: from smtprelay01.fra02v.mail.ibm.com ([9.218.2.227]) by ppma13.dal12v.mail.ibm.com (PPS) with ESMTPS id 3ukwfke1sq-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:42 +0000 Received: from smtpav01.fra02v.mail.ibm.com (smtpav01.fra02v.mail.ibm.com [10.20.54.100]) by smtprelay01.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AUDrepf30212398 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 30 Nov 2023 13:53:40 GMT Received: from 
smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 867932004D; Thu, 30 Nov 2023 13:53:40 +0000 (GMT) Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 02BCF20040; Thu, 30 Nov 2023 13:53:38 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.76.38]) by smtpav01.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 30 Nov 2023 13:53:37 +0000 (GMT) From: Ojaswin Mujoo To: linux-ext4@vger.kernel.org, "Theodore Ts'o" Cc: Ritesh Harjani , linux-kernel@vger.kernel.org, "Darrick J . Wong" , linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, John Garry , dchinner@redhat.com Subject: [RFC 5/7] block: export blkdev_atomic_write_valid() and refactor api Date: Thu, 30 Nov 2023 19:23:14 +0530 Message-Id: X-Mailer: git-send-email 2.39.3 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: ZugCvRCajbO69dPqhlk8SBBZ4fODI8KN X-Proofpoint-ORIG-GUID: c1vCyZy5Tdbw7RUuXr_sq8Rtv4ONgRya X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-30_12,2023-11-30_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 bulkscore=0 mlxlogscore=839 adultscore=0 lowpriorityscore=0 impostorscore=0 suspectscore=0 phishscore=0 malwarescore=0 spamscore=0 priorityscore=1501 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311300102 Export the blkdev_atomic_write_valid() function so that other filesystems can call it as a part of validating the atomic write operation. Further, refactor the api to accept a len argument instead of iov_iter to make it easier to call from other places. 
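After this refactor, the validity test reduces to a handful of arithmetic constraints on pos and len. The standalone C sketch below mirrors those constraints in userspace so they are easy to see in isolation; the unit_min/unit_max parameters are hypothetical stand-ins for the per-queue atomic write limits (queue_atomic_write_unit_min_bytes()/..._max_bytes()), and the leading "no atomic write support" check is assumed from context, so treat this as an illustration rather than the block layer implementation itself.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Illustrative userspace copy of the constraints checked by
 * blkdev_atomic_write_valid(bdev, pos, len) after the refactor.
 * unit_min/unit_max stand in for the device's advertised atomic
 * write unit limits (assumed values, not the kernel API).
 */
static bool atomic_write_valid(uint64_t pos, uint64_t len,
			       uint64_t unit_min, uint64_t unit_max)
{
	if (unit_min == 0)		/* assumed: device reports no atomic write support */
		return false;
	if (pos % unit_min)		/* start must be aligned to the minimum unit */
		return false;
	if (len % unit_min)		/* length must be a multiple of the minimum unit */
		return false;
	if (!is_power_of_2(len))	/* length must be a power of 2 */
		return false;
	if (len > unit_max)		/* length must not exceed the maximum unit */
		return false;
	if (pos % len)			/* start must be naturally aligned to the length */
		return false;
	return true;
}

int main(void)
{
	/* 16 KiB write at offset 32 KiB on a device with 4 KiB..64 KiB units: valid */
	printf("%d\n", atomic_write_valid(32768, 16384, 4096, 65536));
	/* the same write at offset 40 KiB is not naturally aligned to its length: invalid */
	printf("%d\n", atomic_write_valid(40960, 16384, 4096, 65536));
	return 0;
}

Passing a plain length instead of an iov_iter keeps these checks usable from filesystem code that has already collapsed the iterator into a byte count.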
Signed-off-by: Ojaswin Mujoo --- block/fops.c | 18 ++++++++++-------- include/linux/blkdev.h | 2 ++ 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/block/fops.c b/block/fops.c index 516669ad69e5..5dae95c49720 100644 --- a/block/fops.c +++ b/block/fops.c @@ -41,8 +41,7 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos, !bdev_iter_is_aligned(bdev, iter); } -static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, - struct iov_iter *iter) +bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, size_t len) { unsigned int atomic_write_unit_min_bytes = queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev)); @@ -53,16 +52,17 @@ static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, return false; if (pos % atomic_write_unit_min_bytes) return false; - if (iov_iter_count(iter) % atomic_write_unit_min_bytes) + if (len % atomic_write_unit_min_bytes) return false; - if (!is_power_of_2(iov_iter_count(iter))) + if (!is_power_of_2(len)) return false; - if (iov_iter_count(iter) > atomic_write_unit_max_bytes) + if (len > atomic_write_unit_max_bytes) return false; - if (pos % iov_iter_count(iter)) + if (pos % len) return false; return true; } +EXPORT_SYMBOL_GPL(blkdev_atomic_write_valid); #define DIO_INLINE_BIO_VECS 4 @@ -81,7 +81,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb, if (blkdev_dio_unaligned(bdev, pos, iter)) return -EINVAL; - if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter)) + if (atomic_write && + !blkdev_atomic_write_valid(bdev, pos, iov_iter_count(iter))) return -EINVAL; if (nr_pages <= DIO_INLINE_BIO_VECS) @@ -348,7 +349,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb, if (blkdev_dio_unaligned(bdev, pos, iter)) return -EINVAL; - if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter)) + if (atomic_write && + !blkdev_atomic_write_valid(bdev, pos, iov_iter_count(iter))) return -EINVAL; if (iocb->ki_flags & IOCB_ALLOC_CACHE) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index f70988083734..5a3124fc191f 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1566,6 +1566,8 @@ static inline int early_lookup_bdev(const char *pathname, dev_t *dev) int freeze_bdev(struct block_device *bdev); int thaw_bdev(struct block_device *bdev); +bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, size_t len); + struct io_comp_batch { struct request *req_list; bool need_ts; From patchwork Thu Nov 30 13:53:15 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ojaswin Mujoo X-Patchwork-Id: 13474425 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="S0BMmChv" Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B4AF91BE9; Thu, 30 Nov 2023 05:53:52 -0800 (PST) Received: from pps.filterd (m0356516.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnWZM001318; Thu, 30 Nov 2023 13:53:47 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=x3P0yh8GCW7CNkQcAyNbHKBz4AByo20w32YCb9GgXDc=; b=S0BMmChv8ihQGtq9Dr2E3FLQ7xqvBQTn/bWpQiBbUl1/HGlCNFq7+kgrQndbnCqeSsiA 
1dDq5WYAe9BeE9rguXAu38Odr81MOO5kH5BCzqLW7qQGWeyeDGlelvxgvLfu4afq287+ FgKGbSrXk8OvxZUZFVfQpgdkJUkZ4Inaqb7iQ3aq3aF3g3MyDzY7yEZKjAyd+PiTIYIb Nasccus7n2pApiZC5PFxjBsVgPEugwGyQPT1YxQM77s/m9V8LbE211Wa6hpniJcPT7Nr vsTZRpf9NVD0UkJayVAY2oSUArIqYr5oFNe5Yma+9nMjMeaAwR+rK/5Icez8zT9ZXA6Y Vg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3p8-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:47 +0000 Received: from m0356516.ppops.net (m0356516.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AUDrG8D012183; Thu, 30 Nov 2023 13:53:46 GMT Received: from ppma23.wdc07v.mail.ibm.com (5d.69.3da9.ip4.static.sl-reverse.com [169.61.105.93]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3upu2vh3nu-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:46 +0000 Received: from pps.filterd (ppma23.wdc07v.mail.ibm.com [127.0.0.1]) by ppma23.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDnEkm008135; Thu, 30 Nov 2023 13:53:45 GMT Received: from smtprelay06.fra02v.mail.ibm.com ([9.218.2.230]) by ppma23.wdc07v.mail.ibm.com (PPS) with ESMTPS id 3ukvrkx99b-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:45 +0000 Received: from smtpav01.fra02v.mail.ibm.com (smtpav01.fra02v.mail.ibm.com [10.20.54.100]) by smtprelay06.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AUDrh9541943586 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 30 Nov 2023 13:53:43 GMT Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 9DE8A20040; Thu, 30 Nov 2023 13:53:43 +0000 (GMT) Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id EF84020043; Thu, 30 Nov 2023 13:53:40 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.76.38]) by smtpav01.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 30 Nov 2023 13:53:40 +0000 (GMT) From: Ojaswin Mujoo To: linux-ext4@vger.kernel.org, "Theodore Ts'o" Cc: Ritesh Harjani , linux-kernel@vger.kernel.org, "Darrick J . Wong" , linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, John Garry , dchinner@redhat.com Subject: [RFC 6/7] ext4: Add aligned allocation support for atomic direct io Date: Thu, 30 Nov 2023 19:23:15 +0530 Message-Id: <12ce535f947babf9fbb61e371e9127d91d9feac0.1701339358.git.ojaswin@linux.ibm.com> X-Mailer: git-send-email 2.39.3 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: ZcgJSjyHH58BRnZA9_Ag4OpYE70IFd08 X-Proofpoint-ORIG-GUID: X6yvcBpGxZPW6v1tnY7tMz3ypETzozu- X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-30_12,2023-11-30_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 bulkscore=0 mlxlogscore=999 adultscore=0 lowpriorityscore=0 impostorscore=0 suspectscore=0 phishscore=0 malwarescore=0 spamscore=0 priorityscore=1501 mlxscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311300102 If the direct IO write is meant to be atomic, ext4 will now try to allocate aligned physical blocks so that atomic write request can be satisfied. 
This patch also makes the ext4_map_blocks() family of functions alignment aware and defines an ext4_map_blocks_aligned() function that allows users to ask for aligned blocks and checks that the returned extent actually follows the alignment requirements. As usual, the alignment requirement is always determined by the length, and the offset should be naturally aligned to this len. Although aligned mapping usually makes sense with EXT4_GET_BLOCKS_CREATE, we can call ext4_map_blocks_aligned() without that flag as well. This can be useful to check: 1. If an aligned extent is already present and can be reused. 2. If a pre-existing extent at the location can't satisfy the alignment, in which case an aligned write of the given len and offset won't be possible. 3. If there is a hole, whether it is big enough that a subsequent map blocks call would be able to allocate the required aligned extent of off and len.
Signed-off-by: Ojaswin Mujoo --- fs/ext4/ext4.h | 4 ++ fs/ext4/extents.c | 14 +++++ fs/ext4/file.c | 49 +++++++++++++++++ fs/ext4/inode.c | 104 +++++++++++++++++++++++++++++++++++- include/trace/events/ext4.h | 1 + 5 files changed, 170 insertions(+), 2 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 38a77148b85c..1a57662e6a7a 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -717,6 +717,8 @@ enum { #define EXT4_GET_BLOCKS_IO_SUBMIT 0x0400 /* Caller is in the atomic contex, find extent if it has been cached */ #define EXT4_GET_BLOCKS_CACHED_NOWAIT 0x0800 +/* Caller wants strictly aligned allocation */ +#define EXT4_GET_BLOCKS_ALIGNED 0x1000 /* * The bit position of these flags must not overlap with any of the @@ -3683,6 +3685,8 @@ extern int ext4_convert_unwritten_io_end_vec(handle_t *handle, ext4_io_end_t *io_end); extern int ext4_map_blocks(handle_t *handle, struct inode *inode, struct ext4_map_blocks *map, int flags); +extern int ext4_map_blocks_aligned(handle_t *handle, struct inode *inode, + struct ext4_map_blocks *map, int flags); extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode, int num, struct ext4_ext_path *path); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 202c76996b62..2334fa767a6b 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4091,6 +4091,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, int err = 0, depth, ret; unsigned int allocated = 0, offset = 0; unsigned int allocated_clusters = 0; + unsigned int orig_mlen = map->m_len; struct ext4_allocation_request ar; ext4_lblk_t cluster_offset; @@ -4282,6 +4283,19 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode, ar.flags |= EXT4_MB_DELALLOC_RESERVED; if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL) ar.flags |= EXT4_MB_USE_RESERVED; + if (flags & EXT4_GET_BLOCKS_ALIGNED) { + /* + * During aligned allocation we dont want to map a length smaller + * than the originally requested length since we use this len to + * determine alignment and changing it can misalign the blocks.
+ */ + if (ar.len != orig_mlen) { + ext4_warning(inode->i_sb, + "Tried to modify requested len of aligned allocation."); + goto out; + } + ar.flags |= EXT4_MB_ALIGNED_ALLOC; + } newblock = ext4_mb_new_blocks(handle, &ar, &err); if (!newblock) goto out; diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 6830ea3a6c59..c928c2e8c067 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -430,6 +430,48 @@ static const struct iomap_dio_ops ext4_dio_write_ops = { .end_io = ext4_dio_write_end_io, }; +/* + * Check loff_t because the iov_iter_count() used in blkdev was size_t + */ +static bool ext4_dio_atomic_write_checks(struct kiocb *iocb, + struct iov_iter *from) +{ + struct inode *inode = iocb->ki_filp->f_inode; + struct block_device *bdev = inode->i_sb->s_bdev; + size_t len = iov_iter_count(from); + loff_t pos = iocb->ki_pos; + u8 blkbits = inode->i_blkbits; + + /* + * Currently aligned alloc, which is needed for atomic IO, is only + * supported with extent based files and non bigalloc file systems + */ + if (EXT4_SB(inode->i_sb)->s_cluster_ratio > 1) { + ext4_warning(inode->i_sb, + "Atomic write not supported with bigalloc"); + return false; + } + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { + ext4_warning(inode->i_sb, + "Atomic write not supported on non-extent files"); + return false; + } + if (len & ((1 << blkbits) - 1)) + /* len should be blocksize aligned */ + return false; + else if (pos % len) + /* pos should be naturally aligned to len */ + return false; + else if (!is_power_of_2(len >> blkbits)) + /* + * len in blocks should be power of 2 for mballoc to ensure + * alignment + */ + return false; + + return blkdev_atomic_write_valid(bdev, pos, len); +} + /* * The intention here is to start with shared lock acquired then see if any * condition requires an exclusive inode lock. If yes, then we restart the @@ -458,12 +500,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from, size_t count; ssize_t ret; bool overwrite, unaligned_io; + bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC); restart: ret = ext4_generic_write_checks(iocb, from); if (ret <= 0) goto out; + if (atomic_write && !ext4_dio_atomic_write_checks(iocb, from)) { + ext4_warning(inode->i_sb, "Atomic write checks failed."); + ret = -EIO; + goto out; + } + offset = iocb->ki_pos; count = ret; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 4ce35f1c8b0a..d185ec54ffa3 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -453,6 +453,77 @@ static void ext4_map_blocks_es_recheck(handle_t *handle, } #endif /* ES_AGGRESSIVE_TEST */ +/* + * This function checks if the map returned by ext4_map_blocks satisfies aligned + * allocation requirements. 
This should be used as the entry point for aligned + * allocations + */ +static bool ext4_map_check_alignment(struct ext4_map_blocks *map, + unsigned int orig_mlen, + ext4_lblk_t orig_mlblk, + int flags) +{ + if (flags & EXT4_GET_BLOCKS_CREATE) { + /* + * A create lookup must be mapped to satisfy alignment + * requirements + */ + if (!(map->m_flags & EXT4_MAP_MAPPED)) + return false; + } else { + /* + * For create=0, if we find a hole, this hole should be big + * enough to accommodate our aligned extent later + */ + if (!(map->m_flags & EXT4_MAP_MAPPED) && + (!(map->m_flags & EXT4_MAP_UNWRITTEN))) { + if (map->m_len < orig_mlen) + return false; + if (map->m_lblk != orig_mlblk) + /* Ideally shouldn't happen */ + return false; + return true; + } + } + + /* + * For all the remaining cases, to satisfy alignment, the extent should + * be exactly as big as requests and be at the right physical block + * alignment + */ + if (map->m_len != orig_mlen) + return false; + if (map->m_lblk != orig_mlblk) + return false; + if (!map->m_len || map->m_pblk % map->m_len) + return false; + + return true; +} + +int ext4_map_blocks_aligned(handle_t *handle, struct inode *inode, + struct ext4_map_blocks *map, int flags) +{ + int ret; + unsigned int orig_mlen = map->m_len; + ext4_lblk_t orig_mlblk = map->m_lblk; + + if (flags & EXT4_GET_BLOCKS_CREATE) + flags |= EXT4_GET_BLOCKS_ALIGNED; + + ret = ext4_map_blocks(handle, inode, map, flags); + + if (ret >= 0 && + !ext4_map_check_alignment(map, orig_mlen, orig_mlblk, flags)) { + ext4_warning( + inode->i_sb, + "Returned extent couldn't satisfy alignment requirements"); + ret = -EIO; + } + + return ret; +} + /* * The ext4_map_blocks() function tries to look up the requested blocks, * and returns if the blocks are already mapped. @@ -474,6 +545,12 @@ static void ext4_map_blocks_es_recheck(handle_t *handle, * indicate the length of a hole starting at map->m_lblk. * * It returns the error in case of allocation failure. + * + * Note for aligned allocations: While most of the alignment related checks are + * done by higher level functions, we do have some optimizations here. When + * trying to *create* a new aligned extent if at any point we are sure that the + * extent won't be as big as the full length, we exit early instead of going for + * the allocation and failing later. 
*/ int ext4_map_blocks(handle_t *handle, struct inode *inode, struct ext4_map_blocks *map, int flags) @@ -481,6 +558,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, struct extent_status es; int retval; int ret = 0; + unsigned int orig_mlen = map->m_len; #ifdef ES_AGGRESSIVE_TEST struct ext4_map_blocks orig_map; @@ -583,6 +661,12 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode, if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) return retval; + /* For aligned allocation, we must not change original alignment */ + if (retval < 0 && (flags & EXT4_GET_BLOCKS_ALIGNED) && + map->m_len != orig_mlen) { + return retval; + } + /* * Returns if the blocks have already allocated * @@ -3307,7 +3391,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map, else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT; - ret = ext4_map_blocks(handle, inode, map, m_flags); + if (flags & IOMAP_ATOMIC_WRITE) + ret = ext4_map_blocks_aligned(handle, inode, map, m_flags); + else + ret = ext4_map_blocks(handle, inode, map, m_flags); /* * We cannot fill holes in indirect tree based inodes as that could @@ -3353,7 +3440,11 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, * especially in multi-threaded overwrite requests. */ if (offset + length <= i_size_read(inode)) { - ret = ext4_map_blocks(NULL, inode, &map, 0); + if (flags & IOMAP_ATOMIC_WRITE) + ret = ext4_map_blocks_aligned(NULL, inode, &map, 0); + else + ret = ext4_map_blocks(NULL, inode, &map, 0); + if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED)) goto out; } @@ -3372,6 +3463,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length, */ map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len); + /* + * Ensure the found extent meets the alignment requirements for aligned + * allocation + */ + if ((flags & IOMAP_ATOMIC_WRITE) && + ((map.m_pblk << blkbits) % length || + (map.m_len << blkbits) != length)) + return -EIO; + ext4_set_iomap(inode, iomap, &map, offset, length, flags); return 0; diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index 56895cfb5781..7bf116021408 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -50,6 +50,7 @@ struct partial_cluster; { EXT4_GET_BLOCKS_CONVERT_UNWRITTEN, "CONVERT_UNWRITTEN" }, \ { EXT4_GET_BLOCKS_ZERO, "ZERO" }, \ { EXT4_GET_BLOCKS_IO_SUBMIT, "IO_SUBMIT" }, \ + { EXT4_GET_BLOCKS_ALIGNED, "ALIGNED" }, \ { EXT4_EX_NOCACHE, "EX_NOCACHE" }) /* From patchwork Thu Nov 30 13:53:16 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ojaswin Mujoo X-Patchwork-Id: 13474426 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=ibm.com header.i=@ibm.com header.b="ZqBLLreS" Received: from mx0b-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 76D861FD4; Thu, 30 Nov 2023 05:53:57 -0800 (PST) Received: from pps.filterd (m0353722.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUCrWA3025556; Thu, 30 Nov 2023 13:53:52 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id : in-reply-to : references : mime-version : content-transfer-encoding; s=pp1; bh=+KBImv1Kwt2AH4T96+jehckjTlPaL/qSymwuwqCxsAg=; b=ZqBLLreSH5NBgrAzh6GPuF4HceTE702NAOA9srBr+eWV7KlJXhG2VYp5UJIm5oy1bxu4 
FqdXTgFGJ5tYRmWp28M9cU6jzpPZIkCJptLE/nlDIr1xIyj5ucxwwgIChr6bx68/kgNu GWaY3DOwHfxSRMbF4DPBrfeMnIgN/gRbaEXv4/gCMm2rSrCxPwWiJV7VNb2t7IwtI78s 3pA8ZuxN8BCKJJjjMpSjtUeDokCOhXO4YG0nkOjXSchbiHmVgDs7efrBrsxKfLehHRof xyEjbfNfhk6pbsvkUPf98R9zxDe5JHLoNzOChT5S0UwLDbJ2e1HVNnqils2vBpK5QpZ6 qg== Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3uptsv9vpp-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:51 +0000 Received: from m0353722.ppops.net (m0353722.ppops.net [127.0.0.1]) by pps.reinject (8.17.1.5/8.17.1.5) with ESMTP id 3AUDVmVJ027422; Thu, 30 Nov 2023 13:53:51 GMT Received: from ppma21.wdc07v.mail.ibm.com (5b.69.3da9.ip4.static.sl-reverse.com [169.61.105.91]) by mx0a-001b2d01.pphosted.com (PPS) with ESMTPS id 3uptsv9vp4-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:50 +0000 Received: from pps.filterd (ppma21.wdc07v.mail.ibm.com [127.0.0.1]) by ppma21.wdc07v.mail.ibm.com (8.17.1.19/8.17.1.19) with ESMTP id 3AUDn0vH022631; Thu, 30 Nov 2023 13:53:50 GMT Received: from smtprelay07.fra02v.mail.ibm.com ([9.218.2.229]) by ppma21.wdc07v.mail.ibm.com (PPS) with ESMTPS id 3ukv8nxdjq-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 30 Nov 2023 13:53:49 +0000 Received: from smtpav01.fra02v.mail.ibm.com (smtpav01.fra02v.mail.ibm.com [10.20.54.100]) by smtprelay07.fra02v.mail.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 3AUDrkVb63701370 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 30 Nov 2023 13:53:48 GMT Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 9695820043; Thu, 30 Nov 2023 13:53:46 +0000 (GMT) Received: from smtpav01.fra02v.mail.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 0E3A120040; Thu, 30 Nov 2023 13:53:44 +0000 (GMT) Received: from li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com.com (unknown [9.43.76.38]) by smtpav01.fra02v.mail.ibm.com (Postfix) with ESMTP; Thu, 30 Nov 2023 13:53:43 +0000 (GMT) From: Ojaswin Mujoo To: linux-ext4@vger.kernel.org, "Theodore Ts'o" Cc: Ritesh Harjani , linux-kernel@vger.kernel.org, "Darrick J . Wong" , linux-block@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, John Garry , dchinner@redhat.com Subject: [RFC 7/7] ext4: Support atomic write for statx Date: Thu, 30 Nov 2023 19:23:16 +0530 Message-Id: X-Mailer: git-send-email 2.39.3 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-GUID: -KeN6sr_sn17w2SrVL9t35CbXj7Q8qNh X-Proofpoint-ORIG-GUID: MXWF82-sNdeumF3xiLVzTrUYOn44tg3T X-Proofpoint-Virus-Version: vendor=baseguard engine=ICAP:2.0.272,Aquarius:18.0.997,Hydra:6.0.619,FMLib:17.11.176.26 definitions=2023-11-30_12,2023-11-30_01,2023-05-22_02 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1015 malwarescore=0 suspectscore=0 mlxlogscore=999 impostorscore=0 adultscore=0 bulkscore=0 priorityscore=1501 mlxscore=0 phishscore=0 spamscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2311060000 definitions=main-2311300102 Support providing info on atomic write unit min and max for an inode. For simplicity, currently we limit the min at the FS block size, but a lower limit could be supported in future. 
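As an illustration of the reporting logic: on a filesystem with 4KiB blocks whose device advertises an atomic write unit min of 512 bytes and max of 64KiB, statx will report a 4KiB minimum and a 64KiB maximum; if the device max were smaller than the filesystem block size, both values would be reported as 0 and STATX_ATTR_WRITE_ATOMIC would not be set.

A minimal userspace sketch of querying this information follows. It assumes the statx mask/attribute bits and the stx_atomic_write_unit_min/max field names proposed in the block layer atomic write series that this set builds on, and that the installed uapi headers already carry those definitions:

	/* atomic-statx.c: print atomic write limits for a file (illustrative only) */
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <fcntl.h>	/* AT_FDCWD */
	#include <sys/stat.h>	/* statx(), struct statx */

	int main(int argc, char **argv)
	{
		struct statx stx;

		if (argc != 2)
			return 1;
		/* Ask specifically for the atomic write fields */
		if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx)) {
			perror("statx");
			return 1;
		}
		if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
			printf("atomic write unit min/max: %u/%u bytes\n",
			       stx.stx_atomic_write_unit_min,
			       stx.stx_atomic_write_unit_max);
		else
			printf("atomic writes not advertised for this file\n");
		return 0;
	}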
Signed-off-by: Ojaswin Mujoo --- fs/ext4/inode.c | 38 ++++++++++++++++++++++++++++++++++++-- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index d185ec54ffa3..c8f974d0f113 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5621,6 +5621,7 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path, struct ext4_inode *raw_inode; struct ext4_inode_info *ei = EXT4_I(inode); unsigned int flags; + struct block_device *bdev = inode->i_sb->s_bdev; if ((request_mask & STATX_BTIME) && EXT4_FITS_IN_INODE(raw_inode, ei, i_crtime)) { @@ -5639,8 +5640,6 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path, stat->result_mask |= STATX_DIOALIGN; if (dio_align == 1) { - struct block_device *bdev = inode->i_sb->s_bdev; - /* iomap defaults */ stat->dio_mem_align = bdev_dma_alignment(bdev) + 1; stat->dio_offset_align = bdev_logical_block_size(bdev); @@ -5650,6 +5649,41 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path, } } + if ((request_mask & STATX_WRITE_ATOMIC)) { + unsigned int awumin, awumax; + unsigned int blocksize = 1 << inode->i_blkbits; + + awumin = queue_atomic_write_unit_min_bytes(bdev->bd_queue); + awumax = queue_atomic_write_unit_max_bytes(bdev->bd_queue); + + if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) || + EXT4_SB(inode->i_sb)->s_cluster_ratio > 1) { + /* + * Currently not supported for non-extent files or + * with bigalloc + */ + stat->atomic_write_unit_min = 0; + stat->atomic_write_unit_max = 0; + } else if (awumin && awumax) { + /* + * For now we only support atomic writes that are at + * least one filesystem block. If that minimum exceeds + * the device's max atomic unit, don't advertise support + */ + stat->atomic_write_unit_min = max(awumin, blocksize); + + if (awumax < stat->atomic_write_unit_min) { + stat->atomic_write_unit_min = 0; + stat->atomic_write_unit_max = 0; + } else { + stat->atomic_write_unit_max = awumax; + stat->attributes |= STATX_ATTR_WRITE_ATOMIC; + } + } + stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC; + stat->result_mask |= STATX_WRITE_ATOMIC; + } + flags = ei->i_flags & EXT4_FL_USER_VISIBLE; if (flags & EXT4_APPEND_FL) stat->attributes |= STATX_ATTR_APPEND;