From patchwork Wed Mar  7 19:24:51 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Brian Foster <bfoster@redhat.com>
X-Patchwork-Id: 10264781
Return-Path: <linux-xfs-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	24790602C8 for <patchwork-linux-xfs@patchwork.kernel.org>;
	Wed,  7 Mar 2018 19:24:56 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 12C7128ED1
	for <patchwork-linux-xfs@patchwork.kernel.org>;
	Wed,  7 Mar 2018 19:24:56 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 05F5629119; Wed,  7 Mar 2018 19:24:56 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI
	autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5188B28ED1
	for <patchwork-linux-xfs@patchwork.kernel.org>;
	Wed,  7 Mar 2018 19:24:55 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933674AbeCGTYy (ORCPT
	<rfc822;patchwork-linux-xfs@patchwork.kernel.org>);
	Wed, 7 Mar 2018 14:24:54 -0500
Received: from mx1.redhat.com ([209.132.183.28]:38666 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S933415AbeCGTYx (ORCPT <rfc822;linux-xfs@vger.kernel.org>);
	Wed, 7 Mar 2018 14:24:53 -0500
Received: from smtp.corp.redhat.com
	(int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mx1.redhat.com (Postfix) with ESMTPS id BE913804F2
	for <linux-xfs@vger.kernel.org>; Wed,  7 Mar 2018 19:24:53 +0000 (UTC)
Received: from bfoster.bfoster (dhcp-41-20.bos.redhat.com [10.18.41.20])
	by smtp.corp.redhat.com (Postfix) with ESMTP id 8EF2B5D6A3
	for <linux-xfs@vger.kernel.org>; Wed,  7 Mar 2018 19:24:53 +0000 (UTC)
Received: by bfoster.bfoster (Postfix, from userid 1000)
	id 00E88121EC9; Wed,  7 Mar 2018 14:24:51 -0500 (EST)
From: Brian Foster <bfoster@redhat.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH RFC] xfs: convert between packed and unpacked agfls on-demand
Date: Wed,  7 Mar 2018 14:24:51 -0500
Message-Id: <20180307192451.24196-1-bfoster@redhat.com>
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
	(mx1.redhat.com [10.5.110.27]);
	Wed, 07 Mar 2018 19:24:53 +0000 (UTC)
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

Disliked-by: Brian Foster <bfoster@redhat.com>
---

Sent as RFC for the time being. This tests OK on a straight xfstests run
and also seems to pass Darrick's agfl fixup tester (thanks) both on
upstream and on a rhel7 kernel with some minor supporting hacks.

I tried to tighten up the logic a bit to reduce the odds of mistaking
actual corruption for a padding mismatch as much as possible. E.g.,
limit to cases where the agfl is wrapped, make sure we don't mistake a
corruption that looks like an agfl with 120 entries on a packed kernel,
etc.
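
For anyone who wants to poke at the detection heuristic outside the
kernel, here is a rough userspace sketch of the same logic as
xfs_agfl_ondisk_size() in the patch below. It is illustrative only:
agfl_ondisk_size() and its expected_size parameter are hypothetical
stand-ins for the kernel function and XFS_AGFL_SIZE(mp). The 118/119
limits assume 512-byte sectors, where a 36-byte packed header leaves
(512 - 36) / 4 = 119 slots and a 40-byte padded header leaves
(512 - 40) / 4 = 118.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace mirror of xfs_agfl_ondisk_size() from the patch.
 * expected_size stands in for XFS_AGFL_SIZE(mp), i.e. what the running
 * kernel believes the agfl size to be. Assumes 512-byte sectors.
 */
static int agfl_ondisk_size(int expected_size, int first, int last, int count)
{
	int active = count;
	int size = expected_size;
	bool wrapped = first > last;

	/* Assumption of this sketch: only the two known layouts exist. */
	assert(expected_size == 118 || expected_size == 119);

	/* Number of slots spanned by first..last in the expected layout. */
	if (count && last >= first)
		active = last - first + 1;
	else if (count)
		active = size - first + last + 1;

	/* One slot too many: the on-disk agfl is one entry smaller. */
	if (wrapped && active == count + 1)
		size--;
	/* One slot too few, or an index one past the end: one entry larger. */
	else if ((wrapped && active == count - 1) ||
		 first == size || last == size)
		size++;

	/* Anything outside the known sizes is treated as corruption. */
	if (size < 118 || size > 119)
		size = expected_size;

	return size;
}
```

With this sketch, a wrapped list of 4 blocks at first=116/last=1 seen
by a kernel expecting 119 slots is reported as 118, while the same
kernel seeing first=119 computes 120 and falls back to 119, which is
the "looks like 120 entries on a packed kernel" case mentioned above.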

While I do prefer an on-demand fixup approach to a mount time scan, ISTM
that in either case it's impossible to completely eliminate the risk of
confusing corruption with a padding mismatch so long as we're doing a
manual agfl fixup. The more I think about that the more I really dislike
doing this. :(

After some IRC discussion with djwong and sandeen, I'm wondering if the
existence of 'xfs_repair -d' is a good enough last resort for those
users who might be bitten by unexpected padding issues on a typical
upgrade. If so, we could fall back to a safer mount-time detection model
that enforces a read-only mount and let the user run repair. The
supposition is that those who aren't prepared to repair via a ramdisk or
whatever should be able to 'xfs_repair -d' a rootfs that is mounted
read-only provided agfl padding is the only inconsistency.

Eric points out that we can still write an unmount record for a
read-only mount, but I'm not sure that would be a problem if repair only
needs to fix the agfl. xfs_repair shouldn't touch the log unless there's
a recovery issue or it needs to be reformatted to update the LSN, both
of which seem to qualify as "you have more problems than agfl padding
and need to run repair anyway" to me. Thoughts?

Brian

 fs/xfs/libxfs/xfs_alloc.c | 147 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_mount.h        |   1 +
 fs/xfs/xfs_trace.h        |  18 ++++++
 3 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c02781a4c091..31330996e31c 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2054,6 +2054,136 @@ xfs_alloc_space_available(
 }
 
 /*
+ * Estimate the on-disk agfl size based on the agf state. A size mismatch due to
+ * padding is only significant if the agfl wraps around the end or refers to an
+ * invalid first/last value.
+ */
+static int
+xfs_agfl_ondisk_size(
+	struct xfs_mount	*mp,
+	int			first,
+	int			last,
+	int			count)
+{
+	int			active = count;
+	int			agfl_size = XFS_AGFL_SIZE(mp);
+	bool			wrapped = first > last;
+
+	if (count && last >= first)
+		active = last - first + 1;
+	else if (count)
+		active = agfl_size - first + last + 1;
+
+	if (wrapped && active == count + 1)
+		agfl_size--;
+	else if ((wrapped && active == count - 1) ||
+		 first == agfl_size || last == agfl_size)
+		agfl_size++;
+
+	/*
+	 * We can't discern the packing problem from certain forms of corruption
+	 * that may look exactly the same. To minimize the chance of mistaking
+	 * corruption for a size mismatch, clamp the size to known valid values.
+	 * A packed header agfl has 119 entries and the older unpacked format
+	 * has one less.
+	 */
+	if (agfl_size < 118 || agfl_size > 119)
+		agfl_size = XFS_AGFL_SIZE(mp);
+
+	return agfl_size;
+}
+
+static bool
+xfs_agfl_need_padfix(
+	struct xfs_mount	*mp,
+	struct xfs_agf		*agf)
+{
+	int			f = be32_to_cpu(agf->agf_flfirst);
+	int			l = be32_to_cpu(agf->agf_fllast);
+	int			c = be32_to_cpu(agf->agf_flcount);
+
+	if (!xfs_sb_version_hascrc(&mp->m_sb))
+		return false;
+
+	return xfs_agfl_ondisk_size(mp, f, l, c) != XFS_AGFL_SIZE(mp);
+}
+
+static int
+xfs_agfl_check_padfix(
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	struct xfs_buf		*agflbp,
+	struct xfs_perag	*pag)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
+	__be32			*agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp);
+	int			agfl_size = XFS_AGFL_SIZE(mp);
+	int			ofirst, olast, osize;
+	int			nfirst, nlast;
+	int			logflags = 0;
+	int			startoff = 0;
+
+	if (!pag->pagf_needpadfix)
+		return 0;
+
+	ofirst = nfirst = be32_to_cpu(agf->agf_flfirst);
+	olast = nlast = be32_to_cpu(agf->agf_fllast);
+	osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount);
+
+	/*
+	 * If the on-disk agfl is smaller than what the kernel expects, the
+	 * last slot of the on-disk agfl is a gap with bogus data. Move the
+	 * first valid block into the gap and bump the pointer.
+	 */
+	if (osize < agfl_size) {
+		ASSERT(pag->pagf_flcount != 0);
+		agfl_bno[agfl_size - 1] = agfl_bno[ofirst];
+		startoff = (char *) &agfl_bno[agfl_size - 1] - (char *) agflbp->b_addr;
+		nfirst++;
+		goto done;
+	}
+
+	/*
+	 * Otherwise, the on-disk agfl is larger than what the current kernel
+	 * expects. If empty, just fix up the first and last pointers. If not,
+	 * move the inaccessible block to the end of the valid range.
+	 */
+	nfirst = do_mod(nfirst, agfl_size);
+	if (pag->pagf_flcount == 0) {
+		nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1);
+		goto done;
+	}
+	if (nlast != agfl_size)
+		nlast++;
+	nlast = do_mod(nlast, agfl_size);
+	agfl_bno[nlast] = agfl_bno[osize - 1];
+	startoff = (char *) &agfl_bno[nlast] - (char *) agflbp->b_addr;
+
+done:
+	if (nfirst != ofirst) {
+		agf->agf_flfirst = cpu_to_be32(nfirst);
+		logflags |= XFS_AGF_FLFIRST;
+	}
+	if (nlast != olast) {
+		agf->agf_fllast = cpu_to_be32(nlast);
+		logflags |= XFS_AGF_FLLAST;
+	}
+	if (startoff) {
+		xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF);
+		xfs_trans_log_buf(tp, agflbp, startoff,
+				  startoff + sizeof(xfs_agblock_t) - 1);
+	}
+	if (logflags)
+		xfs_alloc_log_agf(tp, agbp, logflags);
+
+	trace_xfs_agfl_padfix(mp, osize, agfl_size);
+	pag->pagf_needpadfix = false;
+
+	return 0;
+}
+
+/*
  * Decide whether to use this allocation group for this allocation.
  * If so, fix up the btree freelist's size.
  */
@@ -2258,6 +2388,12 @@ xfs_alloc_get_freelist(
 	if (error)
 		return error;
 
+	pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno));
+	error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag);
+	if (error) {
+		xfs_perag_put(pag);
+		return error;
+	}
 
 	/*
 	 * Get the block number and update the data structures.
@@ -2269,7 +2405,6 @@ xfs_alloc_get_freelist(
 	if (be32_to_cpu(agf->agf_flfirst) == XFS_AGFL_SIZE(mp))
 		agf->agf_flfirst = 0;
 
-	pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno));
 	be32_add_cpu(&agf->agf_flcount, -1);
 	xfs_trans_agflist_delta(tp, -1);
 	pag->pagf_flcount--;
@@ -2376,11 +2511,18 @@ xfs_alloc_put_freelist(
 	if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp,
 			be32_to_cpu(agf->agf_seqno), &agflbp)))
 		return error;
+
+	pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno));
+	error = xfs_agfl_check_padfix(tp, agbp, agflbp, pag);
+	if (error) {
+		xfs_perag_put(pag);
+		return error;
+	}
+
 	be32_add_cpu(&agf->agf_fllast, 1);
 	if (be32_to_cpu(agf->agf_fllast) == XFS_AGFL_SIZE(mp))
 		agf->agf_fllast = 0;
 
-	pag = xfs_perag_get(mp, be32_to_cpu(agf->agf_seqno));
 	be32_add_cpu(&agf->agf_flcount, 1);
 	xfs_trans_agflist_delta(tp, 1);
 	pag->pagf_flcount++;
@@ -2588,6 +2730,7 @@ xfs_alloc_read_agf(
 		pag->pagb_count = 0;
 		pag->pagb_tree = RB_ROOT;
 		pag->pagf_init = 1;
+		pag->pagf_needpadfix = xfs_agfl_need_padfix(mp, agf);
 	}
 #ifdef DEBUG
 	else if (!XFS_FORCED_SHUTDOWN(mp)) {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e0792d036be2..78a6377a9b38 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -353,6 +353,7 @@ typedef struct xfs_perag {
 	char		pagi_inodeok;	/* The agi is ok for inodes */
 	uint8_t		pagf_levels[XFS_BTNUM_AGF];
 					/* # of levels in bno & cnt btree */
+	bool		pagf_needpadfix; /* agfl needs padding fixup */
 	uint32_t	pagf_flcount;	/* count of blocks in freelist */
 	xfs_extlen_t	pagf_freeblks;	/* total free blocks */
 	xfs_extlen_t	pagf_longest;	/* longest free space */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 945de08af7ba..c7a3bcd6cc4a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3339,6 +3339,24 @@ TRACE_EVENT(xfs_trans_resv_calc,
 		  __entry->logflags)
 );
 
+TRACE_EVENT(xfs_agfl_padfix,
+	TP_PROTO(struct xfs_mount *mp, int osize, int nsize),
+	TP_ARGS(mp, osize, nsize),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, osize)
+		__field(int, nsize)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->osize = osize;
+		__entry->nsize = nsize;
+	),
+	TP_printk("dev %d:%d old size %d new size %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->osize, __entry->nsize)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH