From patchwork Tue Jul 9 19:10:25 2024
X-Patchwork-Submitter: Wengang Wang <wen.gang.wang@oracle.com>
X-Patchwork-Id: 13728433
From: Wengang Wang <wen.gang.wang@oracle.com>
To: linux-xfs@vger.kernel.org
Cc: wen.gang.wang@oracle.com
Subject: [PATCH 6/9] spaceman/defrag: workaround kernel xfs_reflink_try_clear_inode_flag()
Date: Tue, 9 Jul 2024 12:10:25 -0700
Message-Id: <20240709191028.2329-7-wen.gang.wang@oracle.com>
In-Reply-To: <20240709191028.2329-1-wen.gang.wang@oracle.com>
References: <20240709191028.2329-1-wen.gang.wang@oracle.com>

xfs_reflink_try_clear_inode_flag() takes a very long time when the file
has a huge number of extents and none of them are shared.

Workaround: share the first real extent with the scratch file so that
xfs_reflink_try_clear_inode_flag() returns quickly, saving CPU time and
speeding up defrag significantly.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
---
 spaceman/defrag.c | 174 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 172 insertions(+), 2 deletions(-)

diff --git a/spaceman/defrag.c b/spaceman/defrag.c
index f8e6713c..b5c5b187 100644
--- a/spaceman/defrag.c
+++ b/spaceman/defrag.c
@@ -327,6 +327,155 @@ defrag_fs_limit_hit(int fd)
 	return statfs_s.f_bsize * statfs_s.f_bavail < g_limit_free_bytes;
 }
 
+static bool g_enable_first_ext_share = true;
+
+static int
+defrag_get_first_real_ext(int fd, struct getbmapx *mapx)
+{
+	int err;
+
+	while (1) {
+		err = defrag_get_next_extent(fd, mapx);
+		if (err)
+			break;
+
+		defrag_move_next_extent();
+		if (!(mapx->bmv_oflags & BMV_OF_PREALLOC))
+			break;
+	}
+	return err;
+}
+
+static __u64 g_share_offset = -1ULL;
+static __u64 g_share_len = 0ULL;
+#define SHARE_MAX_SIZE 32768	/* 32KiB */
+
+/* share the first real extent with the scratch file */
+static void
+defrag_share_first_extent(int defrag_fd, int scratch_fd)
+{
+#define OFFSET_1PB 0x4000000000000LL
+	struct file_clone_range clone;
+	struct getbmapx mapx;
+	int err;
+
+	if (g_enable_first_ext_share == false)
+		return;
+
+	err = defrag_get_first_real_ext(defrag_fd, &mapx);
+	if (err)
+		return;
+
+	clone.src_fd = defrag_fd;
+	clone.src_offset = mapx.bmv_offset * 512;
+	clone.src_length = mapx.bmv_length * 512;
+	/* share at most SHARE_MAX_SIZE bytes */
+	if (clone.src_length > SHARE_MAX_SIZE)
+		clone.src_length = SHARE_MAX_SIZE;
+	clone.dest_offset = OFFSET_1PB + clone.src_offset;
+	/* if the first extent reaches EOF, no need to share */
+	if (clone.src_offset + clone.src_length >= g_defrag_file_size)
+		return;
+	err = ioctl(scratch_fd, FICLONERANGE, &clone);
+	if (err != 0) {
+		fprintf(stderr, "cloning first extent failed: %s\n",
+			strerror(errno));
+		return;
+	}
+
+	/* save the offset and length for re-sharing */
+	g_share_offset = clone.src_offset;
+	g_share_len = clone.src_length;
+}
+
+/* re-share the blocks we shared previously if they are no longer shared */
+static void
+defrag_reshare_blocks_in_front(int defrag_fd, int scratch_fd)
+{
+#define NR_GET_EXT 9
+	struct getbmapx mapx[NR_GET_EXT];
+	struct file_clone_range clone;
+	__u64 new_share_len;
+	int idx, err;
+
+	if (g_enable_first_ext_share == false)
+		return;
+
+	if (g_share_len == 0ULL)
+		return;
+
+	/*
+	 * check if the previous sharing still exists;
+	 * we are done if so (even partially).
+	 */
+	mapx[0].bmv_offset = g_share_offset;
+	mapx[0].bmv_length = g_share_len;
+	mapx[0].bmv_count = NR_GET_EXT;
+	mapx[0].bmv_iflags = BMV_IF_NO_HOLES | BMV_IF_PREALLOC;
+	err = ioctl(defrag_fd, XFS_IOC_GETBMAPX, mapx);
+	if (err) {
+		fprintf(stderr, "XFS_IOC_GETBMAPX failed %s\n",
+			strerror(errno));
+		/* won't try to share again */
+		g_share_len = 0ULL;
+		return;
+	}
+
+	if (mapx[0].bmv_entries == 0) {
+		/* shared blocks all became holes, won't try to share again */
+		g_share_len = 0ULL;
+		return;
+	}
+
+	if (g_share_offset != 512 * mapx[1].bmv_offset) {
+		/* first shared block became a hole, won't try to share again */
+		g_share_len = 0ULL;
+		return;
+	}
+
+	/* we only check up to the first NR_GET_EXT - 1 extents */
+	for (idx = 1; idx <= mapx[0].bmv_entries; idx++) {
+		if (mapx[idx].bmv_oflags & BMV_OF_SHARED) {
+			/* some blocks are still shared, done */
+			return;
+		}
+	}
+
+	/*
+	 * The previously shared blocks are no longer shared, re-share.
+	 * Deallocate the blocks in the scratch file first.
+	 */
+	err = fallocate(scratch_fd,
+			FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
+			OFFSET_1PB + g_share_offset, g_share_len);
+	if (err != 0) {
+		fprintf(stderr, "punch hole failed %s\n",
+			strerror(errno));
+		g_share_len = 0;
+		return;
+	}
+
+	new_share_len = 512 * mapx[1].bmv_length;
+	if (new_share_len > SHARE_MAX_SIZE)
+		new_share_len = SHARE_MAX_SIZE;
+
+	clone.src_fd = defrag_fd;
+	/* keep the starting offset unchanged */
+	clone.src_offset = g_share_offset;
+	clone.src_length = new_share_len;
+	clone.dest_offset = OFFSET_1PB + clone.src_offset;
+
+	err = ioctl(scratch_fd, FICLONERANGE, &clone);
+	if (err) {
+		fprintf(stderr, "FICLONERANGE failed %s\n",
+			strerror(errno));
+		g_share_len = 0;
+		return;
+	}
+
+	g_share_len = new_share_len;
+}
+
 /*
  * defragment a file
  * return 0 if successfully done, 1 otherwise
@@ -377,6 +526,12 @@ defrag_xfs_defrag(char *file_path) {
 	signal(SIGINT, defrag_sigint_handler);
 
+	/*
+	 * share the first extent to work around the kernel spending time
+	 * in xfs_reflink_try_clear_inode_flag()
+	 */
+	defrag_share_first_extent(defrag_fd, scratch_fd);
+
 	do {
 		struct timeval t_clone, t_unshare, t_punch_hole;
 		struct defrag_segment segment;
@@ -454,6 +609,15 @@ defrag_xfs_defrag(char *file_path) {
 		if (time_delta > max_unshare_us)
 			max_unshare_us = time_delta;
 
+		/*
+		 * If the unshare took more than 1 second, the time was very
+		 * likely spent checking whether the file still shares extents.
+		 * To avoid that happening again, re-share the blocks at the
+		 * front of the file.
+		 */
+		if (time_delta > 1000000)
+			defrag_reshare_blocks_in_front(defrag_fd, scratch_fd);
+
 		/*
 		 * Punch out the original extents we shared to the
 		 * scratch file so they are returned to free space.
 		 */
@@ -514,6 +678,8 @@ static void defrag_help(void)
 "  -f free_space -- specify shrethod of the XFS free space in MiB, when\n"
 "                   XFS free space is lower than that, shared segments\n"
 "                   are excluded from defragmentation, 1024 by default\n"
+"  -n            -- disable the \"share first extent\" feature, it's\n"
+"                   enabled by default to speed up defragmentation\n"
 ));
 }
 
@@ -525,7 +691,7 @@ defrag_f(int argc, char **argv)
 	int i;
 	int c;
 
-	while ((c = getopt(argc, argv, "s:f:")) != EOF) {
+	while ((c = getopt(argc, argv, "s:f:n")) != EOF) {
 		switch(c) {
 		case 's':
 			g_segment_size_lmt = atoi(optarg) * 1024 * 1024 / 512;
@@ -539,6 +705,10 @@ defrag_f(int argc, char **argv)
 			g_limit_free_bytes = atol(optarg) * 1024 * 1024;
 			break;
 
+		case 'n':
+			g_enable_first_ext_share = false;
+			break;
+
 		default:
 			command_usage(&defrag_cmd);
 			return 1;
@@ -556,7 +726,7 @@ void defrag_init(void)
 	defrag_cmd.cfunc = defrag_f;
 	defrag_cmd.argmin = 0;
 	defrag_cmd.argmax = 4;
-	defrag_cmd.args = "[-s segment_size] [-f free_space]";
+	defrag_cmd.args = "[-s segment_size] [-f free_space] [-n]";
 	defrag_cmd.flags = CMD_FLAG_ONESHOT;
 	defrag_cmd.oneline = _("Defragment XFS files");
 	defrag_cmd.help = defrag_help;
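
For reference, below is a minimal standalone sketch (not part of the patch) of the
FICLONERANGE call that defrag_share_first_extent() relies on: clone a small range
from the file being defragmented into a scratch file at a far offset so the two
files share blocks. The file paths, the offset 0 and the fixed 32 KiB length are
illustrative assumptions; only OFFSET_1PB and the ioctl itself come from the patch.

/* share_sketch.c - minimal FICLONERANGE example (paths/length are assumptions) */
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>		/* struct file_clone_range, FICLONERANGE */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define OFFSET_1PB 0x4000000000000LL	/* scratch-file offset used by the patch */

int main(void)
{
	struct file_clone_range clone = { 0 };
	int src_fd, dst_fd;

	/* both files must live on the same (reflink-capable) XFS filesystem */
	src_fd = open("/mnt/xfs/fragmented_file", O_RDONLY);
	dst_fd = open("/mnt/xfs/.defrag_scratch", O_RDWR | O_CREAT, 0600);
	if (src_fd < 0 || dst_fd < 0) {
		perror("open");
		return 1;
	}

	/* share the first 32 KiB of the source into the scratch file */
	clone.src_fd = src_fd;
	clone.src_offset = 0;
	clone.src_length = 32768;
	clone.dest_offset = OFFSET_1PB;

	if (ioctl(dst_fd, FICLONERANGE, &clone) != 0) {
		fprintf(stderr, "FICLONERANGE failed: %s\n", strerror(errno));
		return 1;
	}

	close(src_fd);
	close(dst_fd);
	return 0;
}

Because the source file now has at least one shared extent, the kernel's
xfs_reflink_try_clear_inode_flag() bails out early instead of walking every
extent, which is the effect the patch exploits.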