From patchwork Sat Jan 21 08:09:48 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Darrick J. Wong" <darrick.wong@oracle.com>
X-Patchwork-Id: 9530093
Return-Path: <linux-xfs-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	86B86600CA for <patchwork-linux-xfs@patchwork.kernel.org>;
	Sat, 21 Jan 2017 08:10:06 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6B97E2842E
	for <patchwork-linux-xfs@patchwork.kernel.org>;
	Sat, 21 Jan 2017 08:10:06 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 5FD8428620; Sat, 21 Jan 2017 08:10:06 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-4.1 required=2.0 tests=BAYES_00,FUZZY_XPILL,
	RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A89082842E
	for <patchwork-linux-xfs@patchwork.kernel.org>;
	Sat, 21 Jan 2017 08:10:01 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751297AbdAUIKB (ORCPT
	<rfc822;patchwork-linux-xfs@patchwork.kernel.org>);
	Sat, 21 Jan 2017 03:10:01 -0500
Received: from userp1040.oracle.com ([156.151.31.81]:26136 "EHLO
	userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751045AbdAUIJ7 (ORCPT
	<rfc822; linux-xfs@vger.kernel.org>); Sat, 21 Jan 2017 03:09:59 -0500
Received: from aserv0022.oracle.com (aserv0022.oracle.com [141.146.126.234])
	by userp1040.oracle.com (Sentrion-MTA-4.3.2/Sentrion-MTA-4.3.2)
	with ESMTP id v0L89qqK024897
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256
	verify=OK); Sat, 21 Jan 2017 08:09:53 GMT
Received: from aserv0121.oracle.com (aserv0121.oracle.com [141.146.126.235])
	by aserv0022.oracle.com (8.14.4/8.14.4) with ESMTP id
	v0L89qA0028949
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
	Sat, 21 Jan 2017 08:09:52 GMT
Received: from abhmp0016.oracle.com (abhmp0016.oracle.com [141.146.116.22])
	by aserv0121.oracle.com (8.13.8/8.13.8) with ESMTP id
	v0L89ojs004124; Sat, 21 Jan 2017 08:09:51 GMT
Received: from localhost (/24.21.211.40)
	by default (Oracle Beehive Gateway v4.0)
	with ESMTP ; Sat, 21 Jan 2017 00:09:49 -0800
Subject: [PATCH 15/17] xfs_scrub: add XFS-specific scrubbing functionality
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: sandeen@redhat.com, darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Date: Sat, 21 Jan 2017 00:09:48 -0800
Message-ID: <148498618889.16675.2231803665220653597.stgit@birch.djwong.org>
In-Reply-To: <148498608472.16675.14848042961636871812.stgit@birch.djwong.org>
References: <148498608472.16675.14848042961636871812.stgit@birch.djwong.org>
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
X-Source-IP: aserv0022.oracle.com [141.146.126.234]
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

For XFS, we perform sequential scans of each AG's metadata, inodes,
extent maps, and file data.  Being XFS specific, we can work with the
in-kernel scrubbers to perform much stronger metadata checking and
cross-referencing.  We can also take advantage of newer ioctls such as
GETFSMAP to perform faster read verification.
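
The fallback order for read verification can be sketched as below.  This
is only an illustration of the selection logic in xfs_scan_fs(); the
CAP_* names and pick_data_scrubber() are hypothetical stand-ins for the
XFS_SCRUB_CAP_* flags and the open-coded if/else chain in the patch.

```c
#include <stdbool.h>

/* Hypothetical capability bits, modelled on XFS_SCRUB_CAP_*. */
#define CAP_GETFSMAP	(1ULL << 1)	/* kernel has GETFSMAP */
#define CAP_BULKSTAT	(1ULL << 2)	/* kernel has BULKSTAT */
#define CAP_BMAPX	(1ULL << 3)	/* kernel has GETBMAPX */

enum data_scrub_type {
	DS_NOSCRUB,		/* no data scrub requested */
	DS_READ,		/* generic block scan */
	DS_BULKSTAT_READ,	/* bulkstat, then plain file reads */
	DS_BMAPX,		/* bulkstat, getbmapx, and read-verify */
	DS_FSMAP,		/* getfsmap and read-verify */
};

/* Pick the fastest read-verify strategy that the kernel supports. */
enum data_scrub_type
pick_data_scrubber(
	unsigned long long	caps,
	bool			scrub_data)
{
	if (!scrub_data)
		return DS_NOSCRUB;
	if (caps & CAP_GETFSMAP)
		return DS_FSMAP;
	if (caps & CAP_BMAPX)
		return DS_BMAPX;
	if (caps & CAP_BULKSTAT)
		return DS_BULKSTAT_READ;
	return DS_READ;
}
```

A kernel offering both GETFSMAP and GETBMAPX ends up with DS_FSMAP; the
generic reader is used only when none of the XFS ioctls are present.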

In the future we will be able to take advantage of (still unwritten)
features such as parent directory pointers to fully validate all
metadata.  The scrub tool can shut down the filesystem if errors are
found.  This is not a default option since scrubbing is very immature at
this time.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 scrub/Makefile    |    4 
 scrub/scrub.c     |    1 
 scrub/scrub.h     |    1 
 scrub/xfs.c       | 2641 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 scrub/xfs_ioctl.c |  942 +++++++++++++++++++
 scrub/xfs_ioctl.h |  102 ++
 6 files changed, 3689 insertions(+), 2 deletions(-)
 create mode 100644 scrub/xfs.c
 create mode 100644 scrub/xfs_ioctl.c
 create mode 100644 scrub/xfs_ioctl.h
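
Note on the per-AG inode scan: xfs_scan_ag_inodes() derives the inode
number range of an AG by shifting the AG number left by
(inopblog + agblklog) bits.  The sketch below reproduces that arithmetic
in isolation.  The helper names are hypothetical, log2_roundup() stands
in for the libxfs helpers, and block and inode sizes are assumed to be
powers of two.

```c
#include <stdint.h>

/* ceil(log2(v)) for v >= 1; a stand-in for libxfs_log2_roundup(). */
unsigned int
log2_roundup(uint32_t v)
{
	unsigned int	bits = 0;

	while (bits < 32 && ((uint64_t)1 << bits) < v)
		bits++;
	return bits;
}

/* Number of low bits of an inode number that lie below the AG number. */
unsigned int
ag_ino_shift(uint32_t agblocks, uint32_t blocksize, uint32_t inodesize)
{
	unsigned int	agblklog = log2_roundup(agblocks);
	unsigned int	inopblog = log2_roundup(blocksize) -
				   log2_roundup(inodesize);

	return inopblog + agblklog;
}

/* Lowest inode number that can live in allocation group agno. */
uint64_t
ag_first_ino(uint32_t agno, unsigned int shift)
{
	return (uint64_t)agno << shift;
}

/* Highest inode number that can live in allocation group agno. */
uint64_t
ag_last_ino(uint32_t agno, unsigned int shift)
{
	return (((uint64_t)agno + 1) << shift) - 1;
}
```

With 1000 blocks per AG, 4096-byte blocks, and 512-byte inodes the shift
is 13 bits, so AG 2 covers inode numbers 16384 through 24575.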



diff --git a/scrub/Makefile b/scrub/Makefile
index 4639eed..d5d58de 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -12,9 +12,9 @@ LTCOMMAND = xfs_scrub
 INSTALL_SCRUB = install-scrub
 endif
 
-HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h
+HFILES = scrub.h ../repair/threads.h read_verify.h iocmd.h xfs_ioctl.h
 CFILES = ../repair/avl64.c disk.c bitmap.c generic.c iocmd.c \
-	 read_verify.c scrub.c ../repair/threads.c
+	 read_verify.c scrub.c ../repair/threads.c xfs.c xfs_ioctl.c
 
 LLDLIBS += $(LIBBLKID) $(LIBXFS) $(LIBXCMD) $(LIBUUID) $(LIBRT) $(LIBPTHREAD) $(LIBHANDLE)
 LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBHANDLE)
diff --git a/scrub/scrub.c b/scrub/scrub.c
index 7ed5374..1dcca66 100644
--- a/scrub/scrub.c
+++ b/scrub/scrub.c
@@ -308,6 +308,7 @@ __record_preen(
  * generic_scrub_ops will be selected if nothing else is.
  */
 static struct scrub_ops *scrub_impl[] = {
+	&xfs_scrub_ops,
 	NULL
 };
 
diff --git a/scrub/scrub.h b/scrub/scrub.h
index 6ab53c1..e2376a5 100644
--- a/scrub/scrub.h
+++ b/scrub/scrub.h
@@ -151,6 +151,7 @@ debug_tweak_on(
 }
 
 extern struct scrub_ops	generic_scrub_ops;
+extern struct scrub_ops	xfs_scrub_ops;
 
 /* Generic implementations of the ops functions */
 bool generic_cleanup(struct scrub_ctx *ctx);
diff --git a/scrub/xfs.c b/scrub/xfs.c
new file mode 100644
index 0000000..bed6549
--- /dev/null
+++ b/scrub/xfs.c
@@ -0,0 +1,2641 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include <attr/attributes.h>
+#include "disk.h"
+#include "scrub.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+#include "xfs_ioctl.h"
+#include "read_verify.h"
+#include "bitmap.h"
+#include "iocmd.h"
+#include "xfs_fs.h"
+
+/*
+ * XFS Scrubbing Strategy
+ *
+ * The XFS scrubber is much more thorough than the generic scrubber
+ * because we can use custom XFS ioctls to probe more deeply into the
+ * internals of the filesystem.  Furthermore, we can take advantage of
+ * scrubbing ioctls to check all the records stored in a metadata btree
+ * and cross-reference those records against the other btrees.
+ *
+ * The "find geometry" phase queries XFS for the filesystem geometry.
+ * The block devices for the data, realtime, and log devices are opened.
+ * Kernel ioctls are queried to see if they are implemented, and a data
+ * file read-verify strategy is selected.
+ *
+ * In the "check internal metadata" phase, we call the SCRUB_METADATA
+ * ioctl to check the filesystem's internal per-AG btrees.  This
+ * includes the AG superblock, AGF, AGFL, and AGI headers, freespace
+ * btrees, the regular and free inode btrees, the reverse mapping
+ * btrees, and the reference counting btrees.  If the realtime device is
+ * enabled, the realtime bitmap and reverse mapping btrees are checked.
+ * Each AG (and the realtime device) has its metadata checked in a
+ * separate thread for better performance.
+ *
+ * The "scan inodes" phase uses BULKSTAT to scan all the inodes in an
+ * AG in disk order.  From the BULKSTAT information, a file handle is
+ * constructed and the following items are checked:
+ *
+ *     - If it's a symlink, the target is read but not validated.
+ *     - Bulkstat data is checked.
+ *     - If the inode is a file or a directory, a file descriptor is
+ *       opened to pin the inode and for further analysis.
+ *     - Extended attribute names and values are read via the file
+ *       handle.  If this fails and we have a file descriptor open, we
+ *       retry with the generic extended attribute APIs.
+ *     - If the inode is not a file or directory, we're done.
+ *     - Extent maps are scanned to ensure that the records make sense.
+ *       We also use the SCRUB_METADATA ioctl for better checking of the
+ *       block mapping records.
+ *     - If the inode is a directory, open the directory and check that
+ *       the dirent type code and inode numbers match the stat output.
+ *
+ * Multiple threads are started to check the inodes of each AG in
+ * parallel.
+ *
+ * If BULKSTAT is available, we can skip the "check directory structure"
+ * phase because directories were checked during the inode scan.
+ * Otherwise, the generic directory structure check is used.
+ *
+ * In the "verify data file integrity" phase, we can employ multiple
+ * strategies to read-verify the data blocks:
+ *
+ *     - If GETFSMAP is available, use it to read the reverse-mappings of
+ *       all AGs and issue direct-reads of the underlying disk blocks.
+ *       We rely on the underlying storage to have checksummed the data
+ *       blocks appropriately.
+ *     - If GETBMAPX is available, we use BULKSTAT (or a directory tree
+ *       walk) to iterate all inodes and issue direct-reads of the
+ *       underlying data.  Similar to the generic read-verify, the data
+ *       extents are buffered through a bitmap, which is used to issue
+ *       larger IOs.  Errors are recorded and cross-referenced through
+ *       a second BULKSTAT/GETBMAPX run.
+ *     - Otherwise, call the generic handler to verify file data.
+ *
+ * Multiple threads are started to check each AG in parallel.  A
+ * separate thread pool is used to handle the direct reads.
+ *
+ * In the "check summary counters" phase, use GETFSMAP to tally up the
+ * blocks and BULKSTAT to tally up the inodes we saw and compare that to
+ * the statfs output.  This gives the user a rough estimate of how
+ * thorough the scrub was.
+ */
+
+/* Routines to scrub an XFS filesystem. */
+
+enum data_scrub_type {
+	DS_NOSCRUB,		/* no data scrub */
+	DS_READ,		/* generic_scan_blocks */
+	DS_BULKSTAT_READ,	/* bulkstat and generic_file_read */
+	DS_BMAPX,		/* bulkstat, getbmapx, and read_verify */
+	DS_FSMAP,		/* getfsmap and read_verify */
+};
+
+struct xfs_scrub_ctx {
+	struct xfs_fsop_geom	geo;
+	struct fs_path		fsinfo;
+	unsigned int		agblklog;
+	unsigned int		blocklog;
+	unsigned int		inodelog;
+	unsigned int		inopblog;
+	struct disk		datadev;
+	struct disk		logdev;
+	struct disk		rtdev;
+	void			*fshandle;
+	size_t			fshandle_len;
+	unsigned long long	capabilities;	/* see below */
+	struct read_verify_pool	rvp;
+	enum data_scrub_type	data_scrubber;
+	struct list_head	repair_list;
+};
+
+#define XFS_SCRUB_CAP_KSCRUB_FS		(1ULL << 0)	/* can scrub fs meta? */
+#define XFS_SCRUB_CAP_GETFSMAP		(1ULL << 1)	/* have getfsmap? */
+#define XFS_SCRUB_CAP_BULKSTAT		(1ULL << 2)	/* have bulkstat? */
+#define XFS_SCRUB_CAP_BMAPX		(1ULL << 3)	/* have bmapx? */
+#define XFS_SCRUB_CAP_KSCRUB_INODE	(1ULL << 4)	/* can scrub inode? */
+#define XFS_SCRUB_CAP_KSCRUB_BMAP	(1ULL << 5)	/* can scrub bmap? */
+#define XFS_SCRUB_CAP_KSCRUB_DIR	(1ULL << 6)	/* can scrub dirs? */
+#define XFS_SCRUB_CAP_KSCRUB_XATTR	(1ULL << 7)	/* can scrub attrs? */
+#define XFS_SCRUB_CAP_PARENT_PTR	(1ULL << 8)	/* can find parent? */
+/* If the fast xattr checks fail, we have to use the slower generic scan. */
+#define XFS_SCRUB_CAP_SKIP_SLOW_XATTR	(1ULL << 9)
+#define XFS_SCRUB_CAP_KSCRUB_SYMLINK	(1ULL << 10)	/* can scrub symlink? */
+
+#define XFS_SCRUB_CAPABILITY_FUNCS(name, flagname) \
+static inline bool \
+xfs_scrub_can_##name(struct xfs_scrub_ctx *xctx) \
+{ \
+	return xctx->capabilities & XFS_SCRUB_CAP_##flagname; \
+} \
+static inline void \
+xfs_scrub_set_##name(struct xfs_scrub_ctx *xctx) \
+{ \
+	xctx->capabilities |= XFS_SCRUB_CAP_##flagname; \
+} \
+static inline void \
+xfs_scrub_clear_##name(struct xfs_scrub_ctx *xctx) \
+{ \
+	xctx->capabilities &= ~(XFS_SCRUB_CAP_##flagname); \
+}
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_fs,		KSCRUB_FS)
+XFS_SCRUB_CAPABILITY_FUNCS(getfsmap,		GETFSMAP)
+XFS_SCRUB_CAPABILITY_FUNCS(bulkstat,		BULKSTAT)
+XFS_SCRUB_CAPABILITY_FUNCS(bmapx,		BMAPX)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_inode,	KSCRUB_INODE)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_bmap,		KSCRUB_BMAP)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_dir,		KSCRUB_DIR)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_xattr,	KSCRUB_XATTR)
+XFS_SCRUB_CAPABILITY_FUNCS(getparent,		PARENT_PTR)
+XFS_SCRUB_CAPABILITY_FUNCS(skip_slow_xattr,	SKIP_SLOW_XATTR)
+XFS_SCRUB_CAPABILITY_FUNCS(kscrub_symlink,	KSCRUB_SYMLINK)
+
+/* Find the disk for a given device identifier. */
+static struct disk *
+xfs_dev_to_disk(
+	struct xfs_scrub_ctx	*xctx,
+	dev_t			dev)
+{
+	if (dev == xctx->fsinfo.fs_datadev)
+		return &xctx->datadev;
+	else if (dev == xctx->fsinfo.fs_logdev)
+		return &xctx->logdev;
+	else if (dev == xctx->fsinfo.fs_rtdev)
+		return &xctx->rtdev;
+	abort();
+}
+
+/* Find the device major/minor for a given disk. */
+static dev_t
+xfs_disk_to_dev(
+	struct xfs_scrub_ctx	*xctx,
+	struct disk		*disk)
+{
+	if (disk == &xctx->datadev)
+		return xctx->fsinfo.fs_datadev;
+	else if (disk == &xctx->logdev)
+		return xctx->fsinfo.fs_logdev;
+	else if (disk == &xctx->rtdev)
+		return xctx->fsinfo.fs_rtdev;
+	abort();
+}
+
+/* Shortcut to creating a read-verify thread pool. */
+static inline bool
+xfs_read_verify_pool_init(
+	struct scrub_ctx	*ctx,
+	read_verify_ioend_fn_t	ioend_fn)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	return read_verify_pool_init(&xctx->rvp, ctx, ctx->readbuf,
+			IO_MAX_SIZE, xctx->geo.blocksize, ioend_fn,
+			disk_heads(&xctx->datadev));
+}
+
+struct owner_decode {
+	uint64_t		owner;
+	const char		*descr;
+};
+
+static const struct owner_decode special_owners[] = {
+	{FMR_OWN_FREE,		"free space"},
+	{FMR_OWN_UNKNOWN,	"unknown owner"},
+	{FMR_OWN_FS,		"static FS metadata"},
+	{FMR_OWN_LOG,		"journalling log"},
+	{FMR_OWN_AG,		"per-AG metadata"},
+	{FMR_OWN_INOBT,		"inode btree blocks"},
+	{FMR_OWN_INODES,	"inodes"},
+	{FMR_OWN_REFC,		"refcount btree"},
+	{FMR_OWN_COW,		"CoW staging"},
+	{FMR_OWN_DEFECTIVE,	"bad blocks"},
+	{0, NULL},
+};
+
+/* Decode a special owner. */
+static const char *
+xfs_decode_special_owner(
+	uint64_t			owner)
+{
+	const struct owner_decode	*od = special_owners;
+
+	while (od->descr) {
+		if (od->owner == owner)
+			return od->descr;
+		od++;
+	}
+
+	return NULL;
+}
+
+/* BULKSTAT wrapper routines. */
+struct xfs_scan_inodes {
+	xfs_inode_iter_fn	fn;
+	void			*arg;
+	size_t			array_arg_size;
+	bool			moveon;
+};
+
+/* Scan all the inodes in an AG. */
+static void
+xfs_scan_ag_inodes(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct xfs_scan_inodes	*si = arg;
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	void			*fn_arg;
+	char			descr[DESCR_BUFSZ];
+	uint64_t		ag_ino;
+	uint64_t		next_ag_ino;
+	bool			moveon;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u inodes"),
+				major(xctx->fsinfo.fs_datadev),
+				minor(xctx->fsinfo.fs_datadev),
+				agno);
+
+	ag_ino = (__u64)agno << (xctx->inopblog + xctx->agblklog);
+	next_ag_ino = (__u64)(agno + 1) << (xctx->inopblog + xctx->agblklog);
+
+	fn_arg = ((char *)si->arg) + si->array_arg_size * agno;
+	moveon = xfs_iterate_inodes(ctx, descr, xctx->fshandle, ag_ino,
+			next_ag_ino - 1, si->fn, fn_arg);
+	if (!moveon)
+		si->moveon = false;
+}
+
+/* How many array elements should we create to scan all the inodes? */
+static inline size_t
+xfs_scan_all_inodes_array_size(
+	struct xfs_scrub_ctx	*xctx)
+{
+	return xctx->geo.agcount;
+}
+
+/* Scan all the inodes in a filesystem. */
+static bool
+xfs_scan_all_inodes_array_arg(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn,
+	void			*arg,
+	size_t			array_arg_size)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct xfs_scan_inodes	si;
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+
+	if (!xfs_scrub_can_bulkstat(xctx))
+		return true;
+
+	si.moveon = true;
+	si.fn = fn;
+	si.arg = arg;
+	si.array_arg_size = array_arg_size;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	for (agno = 0; agno < xctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_inodes, agno, &si);
+	destroy_work_queue(&wq);
+
+	return si.moveon;
+}
+#define xfs_scan_all_inodes(ctx, fn) \
+	xfs_scan_all_inodes_array_arg((ctx), (fn), NULL, 0)
+#define xfs_scan_all_inodes_arg(ctx, fn, arg) \
+	xfs_scan_all_inodes_array_arg((ctx), (fn), (arg), 0)
+
+/* GETFSMAP wrapper routines. */
+struct xfs_scan_blocks {
+	xfs_fsmap_iter_fn	fn;
+	void			*arg;
+	size_t			array_arg_size;
+	bool			moveon;
+};
+
+/* Iterate all the reverse mappings of an AG. */
+static void
+xfs_scan_ag_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct xfs_scan_blocks	*sbx = arg;
+	void			*fn_arg;
+	char			descr[DESCR_BUFSZ];
+	struct fsmap		keys[2];
+	off64_t			bperag;
+	bool			moveon;
+
+	bperag = (off64_t)xctx->geo.agblocks *
+		 (off64_t)xctx->geo.blocksize;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u fsmap"),
+				major(xctx->fsinfo.fs_datadev),
+				minor(xctx->fsinfo.fs_datadev),
+				agno);
+
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = xctx->fsinfo.fs_datadev;
+	keys->fmr_physical = agno * bperag;
+	(keys + 1)->fmr_device = xctx->fsinfo.fs_datadev;
+	(keys + 1)->fmr_physical = ((agno + 1) * bperag) - 1;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+
+	fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * agno;
+	moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+	if (!moveon)
+		sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of a standalone device. */
+static void
+xfs_scan_dev_blocks(
+	struct scrub_ctx	*ctx,
+	int			idx,
+	dev_t			dev,
+	struct xfs_scan_blocks	*sbx)
+{
+	struct fsmap		keys[2];
+	char			descr[DESCR_BUFSZ];
+	void			*fn_arg;
+	bool			moveon;
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d fsmap"),
+			major(dev), minor(dev));
+
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = dev;
+	(keys + 1)->fmr_device = dev;
+	(keys + 1)->fmr_physical = ULLONG_MAX;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+
+	fn_arg = ((char *)sbx->arg) + sbx->array_arg_size * idx;
+	moveon = xfs_iterate_fsmap(ctx, descr, keys, sbx->fn, fn_arg);
+	if (!moveon)
+		sbx->moveon = false;
+}
+
+/* Iterate all the reverse mappings of the realtime device. */
+static void
+xfs_scan_rt_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	xfs_scan_dev_blocks(ctx, agno, xctx->fsinfo.fs_rtdev, arg);
+}
+
+/* Iterate all the reverse mappings of the log device. */
+static void
+xfs_scan_log_blocks(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	xfs_scan_dev_blocks(ctx, agno, xctx->fsinfo.fs_logdev, arg);
+}
+
+/* Array size for a block scan; rt and log use slots agcount + 1 and + 2. */
+static size_t
+xfs_scan_all_blocks_array_size(
+	struct xfs_scrub_ctx	*xctx)
+{
+	return xctx->geo.agcount + 3;
+}
+
+/* Scan all the blocks in a filesystem. */
+static bool
+xfs_scan_all_blocks_array_arg(
+	struct scrub_ctx	*ctx,
+	xfs_fsmap_iter_fn	fn,
+	void			*arg,
+	size_t			array_arg_size)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+	struct xfs_scan_blocks	sbx;
+
+	sbx.moveon = true;
+	sbx.fn = fn;
+	sbx.arg = arg;
+	sbx.array_arg_size = array_arg_size;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	if (xctx->fsinfo.fs_rt)
+		queue_work(&wq, xfs_scan_rt_blocks, xctx->geo.agcount + 1,
+				&sbx);
+	if (xctx->fsinfo.fs_log)
+		queue_work(&wq, xfs_scan_log_blocks, xctx->geo.agcount + 2,
+				&sbx);
+	for (agno = 0; agno < xctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_blocks, agno, &sbx);
+	destroy_work_queue(&wq);
+
+	return sbx.moveon;
+}
+
+/* Routines to translate bad physical extents into file paths and offsets. */
+
+struct xfs_verify_error_info {
+	struct bitmap			*d_bad;		/* bytes */
+	struct bitmap			*r_bad;		/* bytes */
+};
+
+/* Report if this extent overlaps a bad region. */
+static bool
+xfs_report_verify_inode_bmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	struct xfs_verify_error_info	*vei = arg;
+	struct bitmap			*tree;
+
+	/*
+	 * Only do data scrubbing if the extent is neither unwritten nor
+	 * delalloc.
+	 */
+	if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))
+		return true;
+
+	if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
+		tree = vei->r_bad;
+	else
+		tree = vei->d_bad;
+
+	if (!bitmap_has_extent(tree, bmap->bm_physical, bmap->bm_length))
+		return true;
+
+	str_error(ctx, descr,
+_("offset %llu failed read verification."), bmap->bm_offset);
+	return true;
+}
+
+/* Iterate the extent mappings of a file to report errors. */
+static bool
+xfs_report_verify_fd(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	void				*arg)
+{
+	struct xfs_bmap			key = {0};
+	bool				moveon;
+
+	/* data fork */
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+			xfs_report_verify_inode_bmap, arg);
+	if (!moveon)
+		return false;
+
+	/* attr fork */
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+			xfs_report_verify_inode_bmap, arg);
+	if (!moveon)
+		return false;
+	return true;
+}
+
+/* Report read verify errors in unlinked (but still open) files. */
+static int
+xfs_report_verify_inode(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	char				descr[DESCR_BUFSZ];
+	char				buf[DESCR_BUFSZ];
+	bool				moveon;
+	int				fd;
+	int				error;
+
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu (unlinked)"), bstat->bs_ino);
+
+	/* Ignore linked files and things we can't open. */
+	if (bstat->bs_nlink != 0)
+		return 0;
+	if (!S_ISREG(bstat->bs_mode) && !S_ISDIR(bstat->bs_mode))
+		return 0;
+
+	/* Try to open the inode. */
+	fd = open_by_fshandle(handle, sizeof(*handle),
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0) {
+		error = errno;
+		if (error == ESTALE)
+			return error;
+
+		str_warn(ctx, descr, "%s", strerror_r(error, buf, DESCR_BUFSZ));
+		return error;
+	}
+
+	/* Go find the badness. */
+	moveon = xfs_report_verify_fd(ctx, descr, fd, arg);
+	close(fd);
+
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Scan the inode associated with a directory entry. */
+static bool
+xfs_report_verify_dirent(
+	struct scrub_ctx	*ctx,
+	const char		*path,
+	int			dir_fd,
+	struct dirent		*dirent,
+	struct stat		*sb,
+	void			*arg)
+{
+	bool			moveon;
+	int			fd;
+
+	/* Ignore things we can't open. */
+	if (!S_ISREG(sb->st_mode) && !S_ISDIR(sb->st_mode))
+		return true;
+	/* Ignore . and .. */
+	if (dirent && (!strcmp(".", dirent->d_name) ||
+		       !strcmp("..", dirent->d_name)))
+		return true;
+
+	/* Open the file */
+	fd = dirent_open(dir_fd, dirent);
+	if (fd < 0)
+		return true;
+
+	/* Go find the badness. */
+	moveon = xfs_report_verify_fd(ctx, path, fd, arg);
+	if (moveon)
+		goto out;
+
+out:
+	close(fd);
+
+	return moveon;
+}
+
+/* Given bad extent lists for the data & rtdev, find bad files. */
+static bool
+xfs_report_verify_errors(
+	struct scrub_ctx		*ctx,
+	struct bitmap			*d_bad,
+	struct bitmap			*r_bad)
+{
+	struct xfs_verify_error_info	vei;
+	bool				moveon;
+
+	vei.d_bad = d_bad;
+	vei.r_bad = r_bad;
+
+	/* Scan the directory tree to get file paths. */
+	moveon = scan_fs_tree(ctx, NULL, xfs_report_verify_dirent, &vei);
+	if (!moveon)
+		return false;
+
+	/* Scan for unlinked files. */
+	return xfs_scan_all_inodes_arg(ctx, xfs_report_verify_inode, &vei);
+}
+
+/* Phase 1 */
+
+/* Clean up the XFS-specific state data. */
+static bool
+xfs_cleanup(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	if (!xctx)
+		goto out;
+	if (xctx->fshandle)
+		free_handle(xctx->fshandle, xctx->fshandle_len);
+	disk_close(&xctx->rtdev);
+	disk_close(&xctx->logdev);
+	disk_close(&xctx->datadev);
+	free(ctx->priv);
+	ctx->priv = NULL;
+
+out:
+	return generic_cleanup(ctx);
+}
+
+/* Test what kernel functions we can call for this filesystem. */
+static void
+xfs_test_capability(
+	struct scrub_ctx		*ctx,
+	bool				(*test_fn)(struct scrub_ctx *),
+	void				(*set_fn)(struct xfs_scrub_ctx *),
+	const char			*errmsg)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+
+	if (test_fn(ctx))
+		set_fn(xctx);
+	else
+		str_info(ctx, ctx->mntpoint, "%s", errmsg);
+}
+
+/* Read the XFS geometry. */
+static bool
+xfs_scan_fs(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx;
+	struct fs_path			*fsp;
+	int				error;
+
+	if (!platform_test_xfs_fd(ctx->mnt_fd)) {
+		str_error(ctx, ctx->mntpoint,
+_("Does not appear to be an XFS filesystem!"));
+		return false;
+	}
+
+	/*
+	 * Flush everything out to disk before we start checking.
+	 * This seems to reduce the incidence of stale file handle
+	 * errors when we open things by handle.
+	 */
+	error = syncfs(ctx->mnt_fd);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	xctx = calloc(1, sizeof(struct xfs_scrub_ctx));
+	if (!xctx) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+	INIT_LIST_HEAD(&xctx->repair_list);
+	xctx->datadev.d_fd = xctx->logdev.d_fd = xctx->rtdev.d_fd = -1;
+
+	/* Retrieve XFS geometry. */
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_FSGEOMETRY,
+			&xctx->geo);
+	if (error) {
+		str_errno(ctx, ctx->mntpoint);
+		goto err;
+	}
+	ctx->priv = xctx;
+
+	xctx->agblklog = libxfs_log2_roundup(xctx->geo.agblocks);
+	xctx->blocklog = libxfs_highbit32(xctx->geo.blocksize);
+	xctx->inodelog = libxfs_highbit32(xctx->geo.inodesize);
+	xctx->inopblog = xctx->blocklog - xctx->inodelog;
+
+	error = path_to_fshandle(ctx->mntpoint, &xctx->fshandle,
+			&xctx->fshandle_len);
+	if (error) {
+		perror(_("getting fshandle"));
+		goto err;
+	}
+
+	/* Do we have bulkstat? */
+	xfs_test_capability(ctx, xfs_can_iterate_inodes, xfs_scrub_set_bulkstat,
+_("Kernel lacks BULKSTAT; scrub will be incomplete."));
+
+	/* Do we have getbmapx? */
+	xfs_test_capability(ctx, xfs_can_iterate_bmap, xfs_scrub_set_bmapx,
+_("Kernel lacks GETBMAPX; scrub will be less efficient."));
+
+	/* Do we have getfsmap? */
+	xfs_test_capability(ctx, xfs_can_iterate_fsmap, xfs_scrub_set_getfsmap,
+_("Kernel lacks GETFSMAP; scrub will be less efficient."));
+
+	/* Do we have kernel-assisted metadata scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_fs_metadata,
+			xfs_scrub_set_kscrub_fs,
+_("Kernel cannot help scrub metadata; scrub will be incomplete."));
+
+	/* Do we have kernel-assisted inode scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_inode,
+			xfs_scrub_set_kscrub_inode,
+_("Kernel cannot help scrub inodes; scrub will be incomplete."));
+
+	/* Do we have kernel-assisted bmap scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_bmap,
+			xfs_scrub_set_kscrub_bmap,
+_("Kernel cannot help scrub extent map; scrub will be less efficient."));
+
+	/* Do we have kernel-assisted dir scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_dir,
+			xfs_scrub_set_kscrub_dir,
+_("Kernel cannot help scrub directories; scrub will be less efficient."));
+
+	/* Do we have kernel-assisted xattr scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_attr,
+			xfs_scrub_set_kscrub_xattr,
+_("Kernel cannot help scrub extended attributes; scrub will be less efficient."));
+
+	/* Do we have kernel-assisted symlink scrubbing? */
+	xfs_test_capability(ctx, xfs_can_scrub_symlink,
+			xfs_scrub_set_kscrub_symlink,
+_("Kernel cannot help scrub symbolic links; scrub will be less efficient."));
+
+	/*
+	 * We don't need to use the slow generic xattr scan unless all
+	 * of the fast scanners fail.
+	 */
+	xfs_scrub_set_skip_slow_xattr(xctx);
+
+	/* Go find the XFS devices if we have a usable fsmap. */
+	fs_table_initialise(0, NULL, 0, NULL);
+	errno = 0;
+	fsp = fs_table_lookup(ctx->mntpoint, FS_MOUNT_POINT);
+	if (!fsp) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find XFS information."));
+		goto err;
+	}
+	memcpy(&xctx->fsinfo, fsp, sizeof(struct fs_path));
+
+	/* Did we find the log and rt devices, if they're present? */
+	if (xctx->geo.logstart == 0 && xctx->fsinfo.fs_log == NULL) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find log device path."));
+		goto err;
+	}
+	if (xctx->geo.rtblocks && xctx->fsinfo.fs_rt == NULL) {
+		str_error(ctx, ctx->mntpoint,
+_("Unable to find realtime device path."));
+		goto err;
+	}
+
+	/* Open the raw devices. */
+	error = disk_open(xctx->fsinfo.fs_name, &xctx->datadev);
+	if (error) {
+		str_errno(ctx, xctx->fsinfo.fs_name);
+		xfs_scrub_clear_getfsmap(xctx);
+	}
+	ctx->nr_io_threads = libxfs_nproc();
+
+	if (xctx->fsinfo.fs_log) {
+		error = disk_open(xctx->fsinfo.fs_log, &xctx->logdev);
+		if (error) {
+			str_errno(ctx, xctx->fsinfo.fs_log);
+			xfs_scrub_clear_getfsmap(xctx);
+		}
+	}
+	if (xctx->fsinfo.fs_rt) {
+		error = disk_open(xctx->fsinfo.fs_rt, &xctx->rtdev);
+		if (error) {
+			str_errno(ctx, xctx->fsinfo.fs_rt);
+			xfs_scrub_clear_getfsmap(xctx);
+		}
+	}
+
+	/* Figure out who gets to scrub data extents... */
+	if (scrub_data) {
+		if (xfs_scrub_can_getfsmap(xctx))
+			xctx->data_scrubber = DS_FSMAP;
+		else if (xfs_scrub_can_bmapx(xctx))
+			xctx->data_scrubber = DS_BMAPX;
+		else if (xfs_scrub_can_bulkstat(xctx))
+			xctx->data_scrubber = DS_BULKSTAT_READ;
+		else
+			xctx->data_scrubber = DS_READ;
+	} else
+		xctx->data_scrubber = DS_NOSCRUB;
+
+	return generic_scan_fs(ctx);
+err:
+	xfs_cleanup(ctx);
+	return false;
+}
+
+/* Phase 2 */
+
+/* Defer all the repairs until phase 7. */
+static void
+xfs_defer_repairs(
+	struct scrub_ctx	*ctx,
+	struct list_head	*repairs)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	if (list_empty(repairs))
+		return;
+
+	pthread_mutex_lock(&ctx->lock);
+	list_splice_tail_init(repairs, &xctx->repair_list);
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Repair some AG metadata; broken things are remembered for later. */
+static bool
+xfs_quick_repair(
+	struct scrub_ctx	*ctx,
+	struct list_head	*repairs)
+{
+	bool			moveon;
+
+	moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd, repairs,
+			XRML_REPAIR_ONLY);
+	if (!moveon)
+		return moveon;
+
+	xfs_defer_repairs(ctx, repairs);
+	return true;
+}
+
+/* Scrub each AG's metadata btrees. */
+static void
+xfs_scan_ag_metadata(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	bool				*pmoveon = arg;
+	struct repair_item		*n;
+	struct repair_item		*ri;
+	struct list_head		repairs;
+	struct list_head		repair_now;
+	unsigned int			broken_primaries;
+	unsigned int			broken_secondaries;
+	bool				moveon;
+	char				descr[DESCR_BUFSZ];
+
+	if (!xfs_scrub_can_kscrub_fs(xctx))
+		return;
+
+	INIT_LIST_HEAD(&repairs);
+	INIT_LIST_HEAD(&repair_now);
+	snprintf(descr, DESCR_BUFSZ, _("AG %u"), agno);
+
+	/*
+	 * First we scrub and fix the AG headers, because we need
+	 * them to work well enough to check the AG btrees.
+	 */
+	moveon = xfs_scrub_ag_headers(ctx, agno, &repairs);
+	if (!moveon)
+		goto err;
+
+	/* Repair header damage. */
+	moveon = xfs_quick_repair(ctx, &repairs);
+	if (!moveon)
+		goto err;
+
+	/* Now scrub the AG btrees. */
+	moveon = xfs_scrub_ag_metadata(ctx, agno, &repairs);
+	if (!moveon)
+		goto err;
+
+	/*
+	 * Figure out if we need to perform early fixing.  The only
+	 * reason we need to do this is if the inobt is broken, which
+	 * prevents phase 3 (inode scan) from running.  We can rebuild
+	 * the inobt from rmapbt data, but if the rmapbt is broken even
+	 * at this early phase then we are sunk.
+	 */
+	broken_secondaries = 0;
+	broken_primaries = 0;
+	list_for_each_entry_safe(ri, n, &repairs, list) {
+		switch (ri->op.sm_type) {
+		case XFS_SCRUB_TYPE_RMAPBT:
+			broken_secondaries++;
+			break;
+		case XFS_SCRUB_TYPE_FINOBT:
+		case XFS_SCRUB_TYPE_INOBT:
+			list_del(&ri->list);
+			list_add_tail(&ri->list, &repair_now);
+			/* fall through */
+		case XFS_SCRUB_TYPE_BNOBT:
+		case XFS_SCRUB_TYPE_CNTBT:
+		case XFS_SCRUB_TYPE_REFCNTBT:
+			broken_primaries++;
+			break;
+		default:
+			ASSERT(false);
+			break;
+		}
+	}
+	if (broken_secondaries) {
+		if (broken_primaries)
+			str_warn(ctx, descr,
+_("Corrupt primary and secondary block mapping metadata."));
+		else
+			str_warn(ctx, descr,
+_("Corrupt secondary block mapping metadata."));
+		str_warn(ctx, descr,
+_("Filesystem might not be repairable."));
+	}
+
+	/* Repair (inode) btree damage. */
+	moveon = xfs_quick_repair(ctx, &repair_now);
+	if (!moveon)
+		goto err;
+
+	/* Everything else gets fixed during phase 7. */
+	xfs_defer_repairs(ctx, &repairs);
+
+	return;
+err:
+	*pmoveon = false;
+	return;
+}
+
+/* Scrub whole-FS metadata btrees. */
+static void
+xfs_scan_fs_metadata(
+	struct work_queue		*wq,
+	xfs_agnumber_t			agno,
+	void				*arg)
+{
+	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->mp;
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	bool				*pmoveon = arg;
+	struct list_head		repairs;
+	bool				moveon;
+
+	if (!xfs_scrub_can_kscrub_fs(xctx))
+		return;
+
+	INIT_LIST_HEAD(&repairs);
+	moveon = xfs_scrub_fs_metadata(ctx, &repairs);
+	if (!moveon)
+		*pmoveon = false;
+
+	pthread_mutex_lock(&ctx->lock);
+	list_splice_tail_init(&repairs, &xctx->repair_list);
+	pthread_mutex_unlock(&ctx->lock);
+}
+
+/* Try to scan metadata via sysfs. */
+static bool
+xfs_scan_metadata(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	xfs_agnumber_t		agno;
+	struct work_queue	wq;
+	bool			moveon = true;
+
+	create_work_queue(&wq, (struct xfs_mount *)ctx, scrub_nproc(ctx));
+	queue_work(&wq, xfs_scan_fs_metadata, 0, &moveon);
+	for (agno = 0; agno < xctx->geo.agcount; agno++)
+		queue_work(&wq, xfs_scan_ag_metadata, agno, &moveon);
+	destroy_work_queue(&wq);
+
+	return moveon;
+}
+
+/* Phase 3 */
+
+/* Scrub an inode extent, report if it's bad. */
+static bool
+xfs_scrub_inode_extent(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	unsigned long long		*nextoff = arg;		/* bytes */
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	unsigned long long		eofs;
+	bool				badmap = false;
+
+	if (fsx->fsx_xflags & FS_XFLAG_REALTIME)
+		eofs = xctx->geo.rtblocks;
+	else
+		eofs = xctx->geo.datablocks;
+	eofs <<= xctx->blocklog;
+
+	if (bmap->bm_length == 0) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) has zero length."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length);
+	}
+
+	if (bmap->bm_physical >= eofs) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) starts past end of filesystem at %llu."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length, eofs);
+	}
+
+	if (bmap->bm_offset < *nextoff) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) overlaps another extent."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length);
+	}
+
+	if (bmap->bm_physical + bmap->bm_length < bmap->bm_physical ||
+	    bmap->bm_physical + bmap->bm_length >= eofs) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) ends past end of filesystem at %llu."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length, eofs);
+	}
+
+	if (bmap->bm_offset + bmap->bm_length < bmap->bm_offset) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) overflows file offset."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length);
+	}
+
+	if ((bmap->bm_flags & BMV_OF_SHARED) &&
+	    (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) has conflicting flags 0x%x."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length,
+				bmap->bm_flags);
+	}
+
+	if ((bmap->bm_flags & BMV_OF_SHARED) &&
+	    !(xctx->geo.flags & XFS_FSOP_GEOM_FLAGS_REFLINK)) {
+		badmap = true;
+		str_error(ctx, descr,
+_("extent (%llu/%llu/%llu) is shared but filesystem does not support sharing."),
+				bmap->bm_physical, bmap->bm_offset,
+				bmap->bm_length);
+	}
+
+	if (!badmap)
+		*nextoff = bmap->bm_offset + bmap->bm_length;
+
+	return true;
+}
+
+/* Scrub an inode's data and xattr extent records. */
+static bool
+xfs_scan_inode_extents(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd)
+{
+	struct xfs_bmap			key = {0};
+	bool				moveon;
+	unsigned long long		nextoff;	/* bytes */
+
+	/* data fork */
+	nextoff = 0;
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+			xfs_scrub_inode_extent, &nextoff);
+	if (!moveon)
+		return false;
+
+	/* attr fork */
+	nextoff = 0;
+	return xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+			xfs_scrub_inode_extent, &nextoff);
+}
+
+enum xfs_xattr_ns {
+	RXT_USER	= 0,
+	RXT_ROOT	= ATTR_ROOT,
+	RXT_TRUST	= ATTR_TRUST,
+	RXT_SECURE	= ATTR_SECURE,
+	RXT_MAX		= 4,
+};
+
+static const enum xfs_xattr_ns known_attr_ns[RXT_MAX] = {
+	RXT_USER,
+	RXT_ROOT,
+	RXT_TRUST,
+	RXT_SECURE,
+};
+
+/*
+ * Read all the extended attributes of a file handle.
+ * This function can return nonzero if the get-attr-by-handle function
+ * does not work correctly; callers must be able to work around that.
+ */
+static int
+xfs_read_handle_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct xfs_handle	*handle,
+	enum xfs_xattr_ns	ns)
+{
+	struct attrlist_cursor	cur;
+	struct attr_multiop	mop;
+	char			attrbuf[XFS_XATTR_LIST_MAX];
+	char			*firstname = NULL;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct attrlist		*attrlist = (struct attrlist *)attrbuf;
+	struct attrlist_ent	*ent;
+	bool			moveon = true;
+	int			i;
+	int			flags = 0;
+	int			error;
+
+	flags |= ns;
+	memset(&attrbuf, 0, XFS_XATTR_LIST_MAX);
+	memset(&cur, 0, sizeof(cur));
+	mop.am_opcode = ATTR_OP_GET;
+	mop.am_flags = flags;
+	while ((error = attr_list_by_handle(handle, sizeof(*handle),
+			attrbuf, XFS_XATTR_LIST_MAX, flags, &cur)) == 0) {
+		for (i = 0; i < attrlist->al_count; i++) {
+			ent = ATTR_ENTRY(attrlist, i);
+
+			/*
+			 * XFS has a longstanding bug where the attr cursor
+			 * never gets updated, causing an infinite loop.
+			 * Detect this and bail out.
+			 */
+			if (i == 0 && xfs_scrub_can_skip_slow_xattr(xctx)) {
+				if (firstname == NULL) {
+					/* a_valuelen sizes the value, not the name. */
+					firstname = strdup(ent->a_name);
+				} else if (strcmp(firstname, ent->a_name) == 0) {
+					str_error(ctx, descr,
+_("duplicate extended attribute \"%s\", buggy XFS?"),
+							ent->a_name);
+					moveon = false;
+					goto out;
+				}
+			}
+
+			mop.am_attrname = ent->a_name;
+			mop.am_attrvalue = ctx->readbuf;
+			mop.am_length = IO_MAX_SIZE;
+			error = attr_multi_by_handle(handle, sizeof(*handle),
+					&mop, 1, flags);
+			if (error) {
+				error = errno;
+				goto out;
+			}
+		}
+
+		if (!attrlist->al_more)
+			break;
+	}
+	if (error)
+		error = errno;
+
+	/* ATTR_TRUST doesn't currently work on Linux... */
+	if (ns == RXT_TRUST && error == EINVAL)
+		error = 0;
+
+out:
+	if (firstname)
+		free(firstname);
+	if (error == ESTALE)
+		return ESTALE;
+	if (error) {
+		str_errno(ctx, descr);
+		return error;
+	}
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/*
+ * Scrub part of a file.  If the user passes in a valid fd we assume
+ * that's the file to check; otherwise, pass in the inode number and
+ * let the kernel sort it out.
+ */
+static bool
+xfs_scrub_fd(
+	struct scrub_ctx	*ctx,
+	bool			(*fn)(struct scrub_ctx *, uint64_t,
+				      uint32_t, int, struct list_head *),
+	struct xfs_bstat	*bs,
+	int			fd,
+	struct list_head	*repairs)
+{
+	if (fd < 0)
+		fd = ctx->mnt_fd;
+	return fn(ctx, bs->bs_ino, bs->bs_gen, fd, repairs);
+}
+
+/* Verify the contents, xattrs, and extent maps of an inode. */
+static int
+xfs_scrub_inode(
+	struct scrub_ctx	*ctx,
+	struct xfs_handle	*handle,
+	struct xfs_bstat	*bstat,
+	void			*arg)
+{
+	struct stat		fd_sb;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct list_head	repairs;
+	static char		linkbuf[PATH_MAX];
+	char			descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	int			fd = -1;
+	int			i;
+	int			error = 0;
+
+	INIT_LIST_HEAD(&repairs);
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu"), bstat->bs_ino);
+
+	/* Check block sizes. */
+	if (!S_ISBLK(bstat->bs_mode) && !S_ISCHR(bstat->bs_mode) &&
+	    bstat->bs_blksize != xctx->geo.blocksize)
+		str_error(ctx, descr,
+_("Block size mismatch %u, expected %u"),
+				bstat->bs_blksize, xctx->geo.blocksize);
+	if (bstat->bs_xflags & FS_XFLAG_EXTSIZE) {
+		if (bstat->bs_extsize > (MAXEXTLEN << xctx->blocklog))
+			str_error(ctx, descr,
+_("Extent size hint %u too large"), bstat->bs_extsize);
+		if (!(bstat->bs_xflags & FS_XFLAG_REALTIME) &&
+		    bstat->bs_extsize > (xctx->geo.agblocks << (xctx->blocklog - 1)))
+			str_error(ctx, descr,
+_("Extent size hint %u too large for AG"), bstat->bs_extsize);
+		if (!(bstat->bs_xflags & FS_XFLAG_REALTIME) &&
+		    bstat->bs_extsize % xctx->geo.blocksize)
+			str_error(ctx, descr,
+_("Extent size hint %u not a multiple of blocksize"), bstat->bs_extsize);
+		if ((bstat->bs_xflags & FS_XFLAG_REALTIME) &&
+		    bstat->bs_extsize % (xctx->geo.rtextsize << xctx->blocklog))
+			str_error(ctx, descr,
+_("Extent size hint %u not a multiple of rt extent size"), bstat->bs_extsize);
+	}
+	if ((bstat->bs_xflags & FS_XFLAG_COWEXTSIZE) &&
+	    !(xctx->geo.flags & XFS_FSOP_GEOM_FLAGS_REFLINK))
+		str_error(ctx, descr,
+_("Has a CoW extent size hint on a non-reflink filesystem?"), 0);
+	if (bstat->bs_xflags & FS_XFLAG_COWEXTSIZE) {
+		if (bstat->bs_cowextsize > (MAXEXTLEN << xctx->blocklog))
+			str_error(ctx, descr,
+_("CoW Extent size hint %u too large"), bstat->bs_cowextsize);
+		if (bstat->bs_cowextsize > (xctx->geo.agblocks << (xctx->blocklog - 1)))
+			str_error(ctx, descr,
+_("CoW Extent size hint %u too large for AG"), bstat->bs_cowextsize);
+		if (bstat->bs_cowextsize % xctx->geo.blocksize)
+			str_error(ctx, descr,
+_("CoW Extent size hint %u not a multiple of blocksize"), bstat->bs_cowextsize);
+	}
+
+	/* Try to open the inode to pin it. */
+	if (S_ISREG(bstat->bs_mode) || S_ISDIR(bstat->bs_mode)) {
+		fd = open_by_fshandle(handle, sizeof(*handle),
+				O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+		if (fd < 0) {
+			error = errno;
+			if (error != ESTALE)
+				str_errno(ctx, descr);
+			goto out;
+		}
+	}
+
+	/* Scrub the inode. */
+	if (xfs_scrub_can_kscrub_inode(xctx)) {
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_inode_fields, bstat, fd,
+				&repairs);
+		if (!moveon)
+			goto out;
+
+		moveon = xfs_quick_repair(ctx, &repairs);
+		if (!moveon)
+			goto out;
+	}
+
+	/* Scrub all block mappings. */
+	if (xfs_scrub_can_kscrub_bmap(xctx)) {
+		/* Use the kernel scrubbers. */
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_data_fork, bstat, fd,
+				&repairs);
+		if (!moveon)
+			goto out;
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_attr_fork, bstat, fd,
+				&repairs);
+		if (!moveon)
+			goto out;
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_cow_fork, bstat, fd,
+				&repairs);
+		if (!moveon)
+			goto out;
+
+		moveon = xfs_quick_repair(ctx, &repairs);
+		if (!moveon)
+			goto out;
+	} else if (fd >= 0 && xfs_scrub_can_bmapx(xctx)) {
+		/* Scan the extent maps with GETBMAPX. */
+		moveon = xfs_scan_inode_extents(ctx, descr, fd);
+		if (!moveon)
+			goto out;
+	} else if (fd >= 0) {
+		/* Fall back to the FIEMAP scanner. */
+		error = fstat(fd, &fd_sb);
+		if (error) {
+			str_errno(ctx, descr);
+			goto out;
+		}
+
+		moveon = generic_scan_extents(ctx, descr, fd, &fd_sb, false);
+		if (!moveon)
+			goto out;
+		moveon = generic_scan_extents(ctx, descr, fd, &fd_sb, true);
+		if (!moveon)
+			goto out;
+	} else {
+		/*
+		 * If this is a file or dir, we have no way to scan the
+		 * extent maps.  Complain.
+		 */
+		if (S_ISREG(bstat->bs_mode) || S_ISDIR(bstat->bs_mode))
+			str_error(ctx, descr,
+_("Unable to open inode to scrub extent maps."));
+	}
+
+	/* XXX: Some day, check child -> parent dir -> child. */
+
+	if (S_ISLNK(bstat->bs_mode)) {
+		/* Check symlink contents. */
+		if (xfs_scrub_can_kscrub_symlink(xctx))
+			moveon = xfs_scrub_symlink(ctx, bstat->bs_ino,
+					bstat->bs_gen, ctx->mnt_fd, &repairs);
+		else {
+			error = readlink_by_handle(handle, sizeof(*handle),
+					linkbuf, PATH_MAX);
+			if (error < 0) {
+				error = errno;
+				if (error != ESTALE)
+					str_errno(ctx, descr);
+				goto out;
+			}
+		}
+		if (!moveon)
+			goto out;
+	} else if (S_ISDIR(bstat->bs_mode)) {
+		/* Check the directory entries. */
+		if (xfs_scrub_can_kscrub_dir(xctx))
+			moveon = xfs_scrub_fd(ctx, xfs_scrub_dir, bstat, fd,
+					&repairs);
+		else if (fd >= 0)
+			moveon = generic_check_directory(ctx, descr, &fd);
+		else {
+			str_error(ctx, descr,
+_("Unable to open directory to scrub."));
+			moveon = true;
+		}
+		if (!moveon)
+			goto out;
+	}
+
+	/*
+	 * Read all the extended attributes.  If any of the read
+	 * functions decline to move on, we can try again with the
+	 * VFS functions if we have a file descriptor.
+	 */
+	if (xfs_scrub_can_kscrub_xattr(xctx))
+		moveon = xfs_scrub_fd(ctx, xfs_scrub_attr, bstat, fd, &repairs);
+	else {
+		moveon = true;
+		for (i = 0; i < RXT_MAX; i++) {
+			error = xfs_read_handle_xattrs(ctx, descr, handle,
+					known_attr_ns[i]);
+			if (error == XFS_ITERATE_INODES_ABORT) {
+				moveon = false;
+				error = 0;
+				break;
+			}
+			if (error)
+				break;
+		}
+		if (!moveon && fd >= 0) {
+			moveon = generic_scan_xattrs(ctx, descr, fd);
+			if (!moveon)
+				goto out;
+		}
+		if (!moveon)
+			xfs_scrub_clear_skip_slow_xattr(xctx);
+		moveon = true;
+	}
+	if (!moveon)
+		goto out;
+
+	moveon = xfs_quick_repair(ctx, &repairs);
+
+out:
+	xfs_defer_repairs(ctx, &repairs);
+	if (fd >= 0)
+		close(fd);
+	if (error)
+		return error;
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Verify all the inodes in a filesystem. */
+static bool
+xfs_scan_inodes(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	if (!xfs_scrub_can_bulkstat(xctx))
+		return generic_scan_inodes(ctx);
+
+	return xfs_scan_all_inodes(ctx, xfs_scrub_inode);
+}
+
+/* Phase 4 */
+
+/* Check an inode's extents. */
+static bool
+xfs_scan_extents(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	struct stat		*sb,
+	bool			attr_fork)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	/*
+	 * If we have bulkstat and either bmap or kernel scrubbing,
+	 * we already checked the extents.
+	 */
+	if (xfs_scrub_can_bulkstat(xctx) &&
+	    (xfs_scrub_can_bmapx(xctx) || xfs_scrub_can_kscrub_fs(xctx)))
+		return true;
+
+	return generic_scan_extents(ctx, descr, fd, sb, attr_fork);
+}
+
+/* Try to read all the extended attributes. */
+static bool
+xfs_scan_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	/* If we have bulkstat, we already checked the attributes. */
+	if (xfs_scrub_can_bulkstat(xctx) && xfs_scrub_can_skip_slow_xattr(xctx))
+		return true;
+
+	return generic_scan_xattrs(ctx, descr, fd);
+}
+
+/* Try to read all the extended attributes of things that have no fd. */
+static bool
+xfs_scan_special_xattrs(
+	struct scrub_ctx	*ctx,
+	const char		*path)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	/* If we have bulkstat, we already checked the attributes. */
+	if (xfs_scrub_can_bulkstat(xctx) && xfs_scrub_can_skip_slow_xattr(xctx))
+		return true;
+
+	return generic_scan_special_xattrs(ctx, path);
+}
+
+/* Traverse the directory tree. */
+static bool
+xfs_scan_fs_tree(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	/* If we have bulkstat, we already checked the attributes. */
+	if (xfs_scrub_can_bulkstat(xctx) && xfs_scrub_can_skip_slow_xattr(xctx))
+		return true;
+
+	return generic_scan_fs_tree(ctx);
+}
+
+/* Phase 5 */
+
+/* Verify disk blocks with GETFSMAP */
+
+struct xfs_verify_extent {
+	/* Maintain state for the lazy read verifier. */
+	struct read_verify	rv;
+
+	/* Store bad extents if we don't have parent pointers. */
+	struct bitmap		*d_bad;		/* bytes */
+	struct bitmap		*r_bad;		/* bytes */
+
+	/* Track the last extent we saw. */
+	uint64_t		laststart;	/* bytes */
+	uint64_t		lastlength;	/* bytes */
+	bool			lastshared;	/* was it shared? */
+};
+
+/* Report an IO error resulting from read-verify based off getfsmap. */
+static bool
+xfs_check_rmap_error_report(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fsmap		*map,
+	void			*arg)
+{
+	const char		*type;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	char			buf[32];
+	uint64_t		err_physical = *(uint64_t *)arg;
+	uint64_t		err_off;
+
+	if (err_physical > map->fmr_physical)
+		err_off = err_physical - map->fmr_physical;
+	else
+		err_off = 0;
+
+	snprintf(buf, 32, _("disk offset %llu"),
+			BTOBB(map->fmr_physical + err_off));
+
+	if (map->fmr_flags & FMR_OF_SPECIAL_OWNER) {
+		type = xfs_decode_special_owner(map->fmr_owner);
+		str_error(ctx, buf,
+_("%s failed read verification."),
+				type);
+	} else if (xfs_scrub_can_getparent(xctx)) {
+		/* XXX: go find the parent path */
+		str_error(ctx, buf,
+_("XXX: inode %lld offset %llu failed read verification."),
+				map->fmr_owner, map->fmr_offset + err_off);
+	}
+	return true;
+}
+
+/* Handle a read error in the rmap-based read verify. */
+void
+xfs_check_rmap_ioerr(
+	struct read_verify_pool	*rvp,
+	struct disk		*disk,
+	uint64_t		start,
+	uint64_t		length,
+	int			error,
+	void			*arg)
+{
+	struct fsmap		keys[2];
+	char			descr[DESCR_BUFSZ];
+	struct scrub_ctx	*ctx = rvp->rvp_ctx;
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct xfs_verify_extent	*ve;
+	struct bitmap		*tree;
+	dev_t			dev;
+	bool			moveon;
+
+	ve = arg;
+	dev = xfs_disk_to_dev(xctx, disk);
+
+	/*
+	 * If we don't have parent pointers, save the bad extent for
+	 * later rescanning.
+	 */
+	if (!xfs_scrub_can_getparent(xctx)) {
+		if (dev == xctx->fsinfo.fs_datadev)
+			tree = ve->d_bad;
+		else if (dev == xctx->fsinfo.fs_rtdev)
+			tree = ve->r_bad;
+		else
+			tree = NULL;
+		if (tree) {
+			moveon = bitmap_add(tree, start, length);
+			if (!moveon)
+				str_errno(ctx, ctx->mntpoint);
+		}
+	}
+
+	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d ioerr @ %"PRIu64":%"PRIu64" "),
+			major(dev), minor(dev), start, length);
+
+	/* Go figure out which blocks are bad from the fsmap. */
+	memset(keys, 0, sizeof(struct fsmap) * 2);
+	keys->fmr_device = dev;
+	keys->fmr_physical = start;
+	(keys + 1)->fmr_device = dev;
+	(keys + 1)->fmr_physical = start + length - 1;
+	(keys + 1)->fmr_owner = ULLONG_MAX;
+	(keys + 1)->fmr_offset = ULLONG_MAX;
+	(keys + 1)->fmr_flags = UINT_MAX;
+	xfs_iterate_fsmap(ctx, descr, keys, xfs_check_rmap_error_report,
+			&start);
+}
+
+/* Read verify a (data block) extent. */
+static bool
+xfs_check_rmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fsmap			*map,
+	void				*arg)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	struct xfs_verify_extent	*ve = arg;
+	struct disk			*disk;
+	uint64_t			eofs;
+	uint64_t			min_physical;
+	bool				badflags = false;
+	bool				badmap = false;
+
+	dbg_printf("rmap dev %d:%d phys %llu owner %lld offset %llu "
+			"len %llu flags 0x%x\n", major(map->fmr_device),
+			minor(map->fmr_device), map->fmr_physical,
+			map->fmr_owner, map->fmr_offset,
+			map->fmr_length, map->fmr_flags);
+
+	/* If kernel already checked this... */
+	if (xfs_scrub_can_kscrub_fs(xctx))
+		goto skip_check;
+
+	if (map->fmr_device == xctx->fsinfo.fs_datadev)
+		eofs = xctx->geo.datablocks;
+	else if (map->fmr_device == xctx->fsinfo.fs_rtdev)
+		eofs = xctx->geo.rtblocks;
+	else if (map->fmr_device == xctx->fsinfo.fs_logdev)
+		eofs = xctx->geo.logblocks;
+	else
+		abort();
+	eofs <<= xctx->blocklog;
+
+	/* Don't go past EOFS */
+	if (map->fmr_physical >= eofs) {
+		badmap = true;
+		str_error(ctx, descr,
+_("rmap (%llu/%llu/%llu) starts past end of filesystem at %llu."),
+				map->fmr_physical, map->fmr_offset,
+				map->fmr_length, eofs);
+	}
+
+	if (map->fmr_physical + map->fmr_length < map->fmr_physical ||
+	    map->fmr_physical + map->fmr_length >= eofs) {
+		badmap = true;
+		str_error(ctx, descr,
+_("rmap (%llu/%llu/%llu) ends past end of filesystem at %llu."),
+				map->fmr_physical, map->fmr_offset,
+				map->fmr_length, eofs);
+	}
+
+	/* Check for illegal overlapping. */
+	if (ve->lastshared && (map->fmr_flags & FMR_OF_SHARED))
+		min_physical = ve->laststart;
+	else
+		min_physical = ve->laststart + ve->lastlength;
+
+	if (map->fmr_physical < min_physical) {
+		badmap = true;
+		str_error(ctx, descr,
+_("rmap (%llu/%llu/%llu) overlaps another rmap."),
+				map->fmr_physical, map->fmr_offset,
+				map->fmr_length);
+	}
+
+	/* can't have shared on non-reflink */
+	if ((map->fmr_flags & FMR_OF_SHARED) &&
+	    !(xctx->geo.flags & XFS_FSOP_GEOM_FLAGS_REFLINK))
+		badflags = true;
+
+	/* unwritten can't have any of the other flags */
+	if ((map->fmr_flags & FMR_OF_PREALLOC) &&
+	     (map->fmr_flags & (FMR_OF_ATTR_FORK | FMR_OF_EXTENT_MAP |
+				 FMR_OF_SHARED | FMR_OF_SPECIAL_OWNER)))
+		badflags = true;
+
+	/* attr fork can't be shared or unwritten or special */
+	if ((map->fmr_flags & FMR_OF_ATTR_FORK) &&
+	     (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_SHARED |
+				 FMR_OF_SPECIAL_OWNER)))
+		badflags = true;
+
+	/* extent maps can only have attrfork */
+	if ((map->fmr_flags & FMR_OF_EXTENT_MAP) &&
+	     (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_SHARED |
+				 FMR_OF_SPECIAL_OWNER)))
+		badflags = true;
+
+	/* shared maps can't have any of the other flags */
+	if ((map->fmr_flags & FMR_OF_SHARED) &&
+	    (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+				FMR_OF_EXTENT_MAP | FMR_OF_SPECIAL_OWNER)))
+		badflags = true;
+
+	/* special owners can't have any of the other flags */
+	if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	     (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+				 FMR_OF_EXTENT_MAP | FMR_OF_SHARED)))
+		badflags = true;
+
+	if (badflags) {
+		badmap = true;
+		str_error(ctx, descr,
+_("rmap (%llu/%llu/%llu) has conflicting flags 0x%x."),
+				map->fmr_physical, map->fmr_offset,
+				map->fmr_length, map->fmr_flags);
+	}
+
+	/* If this rmap is suspect, don't bother verifying it. */
+	if (badmap)
+		goto out;
+
+skip_check:
+	/* Remember this extent. */
+	ve->lastshared = (map->fmr_flags & FMR_OF_SHARED);
+	ve->laststart = map->fmr_physical;
+	ve->lastlength = map->fmr_length;
+
+	/* "Unknown" extents should be verified; they could be data. */
+	if ((map->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+			map->fmr_owner == FMR_OWN_UNKNOWN)
+		map->fmr_flags &= ~FMR_OF_SPECIAL_OWNER;
+
+	/*
+	 * We only care about read-verifying data extents that have been
+	 * written to disk.  This means we can skip "special" owners
+	 * (metadata), xattr blocks, unwritten extents, and extent maps.
+	 * These should all get checked elsewhere in the scrubber.
+	 */
+	if (map->fmr_flags & (FMR_OF_PREALLOC | FMR_OF_ATTR_FORK |
+			       FMR_OF_EXTENT_MAP | FMR_OF_SPECIAL_OWNER))
+		goto out;
+
+	/* XXX: Filter out directory data blocks. */
+
+	/* Schedule the read verify command for (eventual) running. */
+	disk = xfs_dev_to_disk(xctx, map->fmr_device);
+
+	read_verify_schedule(&xctx->rvp, &ve->rv, disk, map->fmr_physical,
+			map->fmr_length, ve);
+
+out:
+	/* Is this the last extent?  Fire off the read. */
+	if (map->fmr_flags & FMR_OF_LAST)
+		read_verify_force(&xctx->rvp, &ve->rv);
+
+	return true;
+}
+
+/* Verify all the blocks in a filesystem. */
+static bool
+xfs_scan_rmaps(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	struct bitmap			d_bad;
+	struct bitmap			r_bad;
+	struct xfs_verify_extent	*ve;
+	struct xfs_verify_extent	*v;
+	int				i;
+	unsigned int			groups;
+	bool				moveon;
+
+	/*
+	 * Initialize our per-thread context.  By convention,
+	 * the log device comes first, then the rt device, and then
+	 * the AGs.
+	 */
+	groups = xfs_scan_all_blocks_array_size(xctx);
+	ve = calloc(groups, sizeof(struct xfs_verify_extent));
+	if (!ve) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	moveon = bitmap_init(&d_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_ve;
+	}
+
+	moveon = bitmap_init(&r_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_dbad;
+	}
+
+	for (i = 0, v = ve; i < groups; i++, v++) {
+		v->d_bad = &d_bad;
+		v->r_bad = &r_bad;
+	}
+
+	moveon = xfs_read_verify_pool_init(ctx, xfs_check_rmap_ioerr);
+	if (!moveon)
+		goto out_rbad;
+	moveon = xfs_scan_all_blocks_array_arg(ctx, xfs_check_rmap,
+			ve, sizeof(*ve));
+	if (!moveon)
+		goto out_pool;
+
+	for (i = 0, v = ve; i < groups; i++, v++)
+		read_verify_force(&xctx->rvp, &v->rv);
+	read_verify_pool_destroy(&xctx->rvp);
+
+	/* Scan the whole dir tree to see what matches the bad extents. */
+	if (!bitmap_empty(&d_bad) || !bitmap_empty(&r_bad))
+		moveon = xfs_report_verify_errors(ctx, &d_bad, &r_bad);
+
+	bitmap_free(&r_bad);
+	bitmap_free(&d_bad);
+	free(ve);
+	return moveon;
+
+out_pool:
+	read_verify_pool_destroy(&xctx->rvp);
+out_rbad:
+	bitmap_free(&r_bad);
+out_dbad:
+	bitmap_free(&d_bad);
+out_ve:
+	free(ve);
+	return moveon;
+}
+
+/* Read-verify with BULKSTAT + GETBMAPX */
+struct xfs_verify_inode {
+	struct bitmap			d_good;		/* bytes */
+	struct bitmap			r_good;		/* bytes */
+	struct bitmap			*d_bad;		/* bytes */
+	struct bitmap			*r_bad;		/* bytes */
+};
+
+struct xfs_verify_submit {
+	struct read_verify_pool		*rvp;
+	struct bitmap			*bad;
+	struct disk			*disk;
+	struct read_verify		rv;
+};
+
+/* Finish an inode block scan. */
+void
+xfs_verify_inode_bmap_ioerr(
+	struct read_verify_pool		*rvp,
+	struct disk			*disk,
+	uint64_t			start,
+	uint64_t			length,
+	int				error,
+	void				*arg)
+{
+	struct bitmap			*tree = arg;
+
+	bitmap_add(tree, start, length);
+}
+
+/* Scrub an inode extent and read-verify it. */
+bool
+xfs_verify_inode_bmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	struct bitmap			*tree = arg;
+
+	/*
+	 * Only do data scrubbing if the extent is neither unwritten nor
+	 * delalloc.
+	 */
+	if (bmap->bm_flags & (BMV_OF_PREALLOC | BMV_OF_DELALLOC))
+		return true;
+
+	return bitmap_add(tree, bmap->bm_physical, bmap->bm_length);
+}
+
+/* Read-verify the data blocks of a file via BMAP. */
+static int
+xfs_verify_inode(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	struct stat			fd_sb;
+	struct xfs_bmap			key = {0};
+	struct xfs_verify_inode		*vi = arg;
+	struct bitmap			*tree;
+	char				descr[DESCR_BUFSZ];
+	bool				moveon = true;
+	int				fd = -1;
+	int				error;
+
+	if (!S_ISREG(bstat->bs_mode))
+		return 0;
+
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu/%u"), bstat->bs_ino,
+			bstat->bs_gen);
+
+	/* Try to open the inode to pin it. */
+	fd = open_by_fshandle(handle, sizeof(*handle),
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0) {
+		error = errno;
+		if (error == ESTALE)
+			return error;
+
+		str_errno(ctx, descr);
+		return 0;
+	}
+
+	if (vi) {
+		/* Use BMAPX */
+		if (bstat->bs_xflags & FS_XFLAG_REALTIME)
+			tree = &vi->r_good;
+		else
+			tree = &vi->d_good;
+
+		/* data fork */
+		moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+				xfs_verify_inode_bmap, tree);
+	} else {
+		error = fstat(fd, &fd_sb);
+		if (error) {
+			str_errno(ctx, descr);
+			goto out;
+		}
+
+		/* Use generic_file_read */
+		moveon = read_verify_file(ctx, descr, fd, &fd_sb);
+	}
+
+out:
+	if (fd >= 0)
+		close(fd);
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Schedule a read verification from an extent tree record. */
+static bool
+xfs_schedule_read_verify(
+	uint64_t			start,
+	uint64_t			length,
+	void				*arg)
+{
+	struct xfs_verify_submit	*rvs = arg;
+
+	read_verify_schedule(rvs->rvp, &rvs->rv, rvs->disk, start, length,
+			rvs->bad);
+	return true;
+}
+
+/* Verify all the file data in a filesystem. */
+static bool
+xfs_verify_inodes(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+	struct bitmap		d_good;
+	struct bitmap		d_bad;
+	struct bitmap		r_good;
+	struct bitmap		r_bad;
+	struct xfs_verify_inode	*vi;
+	struct xfs_verify_inode	*v;
+	struct xfs_verify_submit	vs;
+	int			i;
+	unsigned int		groups;
+	bool			moveon;
+
+	groups = xfs_scan_all_inodes_array_size(xctx);
+	vi = calloc(groups, sizeof(struct xfs_verify_inode));
+	if (!vi) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	moveon = bitmap_init(&d_good);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_vi;
+	}
+
+	moveon = bitmap_init(&d_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_dgood;
+	}
+
+	moveon = bitmap_init(&r_good);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_dbad;
+	}
+
+	moveon = bitmap_init(&r_bad);
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out_rgood;
+	}
+
+	for (i = 0, v = vi; i < groups; i++, v++) {
+		v->d_bad = &d_bad;
+		v->r_bad = &r_bad;
+
+		moveon = bitmap_init(&v->d_good);
+		if (!moveon) {
+			str_errno(ctx, ctx->mntpoint);
+			goto out_varray;
+		}
+
+		moveon = bitmap_init(&v->r_good);
+		if (!moveon) {
+			str_errno(ctx, ctx->mntpoint);
+			goto out_varray;
+		}
+	}
+
+	/* Scan all the inodes for extent information. */
+	moveon = xfs_scan_all_inodes_array_arg(ctx, xfs_verify_inode,
+			vi, sizeof(*vi));
+	if (!moveon)
+		goto out_varray;
+
+	/* Merge all the IOs. */
+	for (i = 0, v = vi; i < groups; i++, v++) {
+		bitmap_merge(&d_good, &v->d_good);
+		bitmap_free(&v->d_good);
+		bitmap_merge(&r_good, &v->r_good);
+		bitmap_free(&v->r_good);
+	}
+
+	/* Run all the IO in batches. */
+	memset(&vs, 0, sizeof(struct xfs_verify_submit));
+	vs.rvp = &xctx->rvp;
+	moveon = xfs_read_verify_pool_init(ctx, xfs_verify_inode_bmap_ioerr);
+	if (!moveon)
+		goto out_varray;
+	vs.disk = &xctx->datadev;
+	vs.bad = &d_bad;
+	moveon = bitmap_iterate(&d_good, xfs_schedule_read_verify, &vs);
+	if (!moveon)
+		goto out_pool;
+	vs.disk = &xctx->rtdev;
+	vs.bad = &r_bad;
+	moveon = bitmap_iterate(&r_good, xfs_schedule_read_verify, &vs);
+	if (!moveon)
+		goto out_pool;
+	read_verify_force(&xctx->rvp, &vs.rv);
+	read_verify_pool_destroy(&xctx->rvp);
+
+	/* Re-scan the file bmaps to see if they match the bad. */
+	if (!bitmap_empty(&d_bad) || !bitmap_empty(&r_bad))
+		moveon = xfs_report_verify_errors(ctx, &d_bad, &r_bad);
+
+	goto out_varray;
+
+out_pool:
+	read_verify_pool_destroy(&xctx->rvp);
+out_varray:
+	for (i = 0, v = vi; i < groups; i++, v++) {
+		bitmap_free(&v->d_good);
+		bitmap_free(&v->r_good);
+	}
+	bitmap_free(&r_bad);
+out_rgood:
+	bitmap_free(&r_good);
+out_dbad:
+	bitmap_free(&d_bad);
+out_dgood:
+	bitmap_free(&d_good);
+out_vi:
+	free(vi);
+	return moveon;
+}
+
+/* Verify all the file data in a filesystem with the generic verifier. */
+static bool
+xfs_verify_inodes_generic(
+	struct scrub_ctx	*ctx)
+{
+	return xfs_scan_all_inodes(ctx, xfs_verify_inode);
+}
+
+/* Scan all the blocks in a filesystem. */
+static bool
+xfs_scan_blocks(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+
+	switch (xctx->data_scrubber) {
+	case DS_NOSCRUB:
+		return true;
+	case DS_READ:
+		return generic_scan_blocks(ctx);
+	case DS_BULKSTAT_READ:
+		return xfs_verify_inodes_generic(ctx);
+	case DS_BMAPX:
+		return xfs_verify_inodes(ctx);
+	case DS_FSMAP:
+		return xfs_scan_rmaps(ctx);
+	default:
+		abort();
+	}
+}
+
+/* Read an entire file's data. */
+static bool
+xfs_read_file(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	struct stat		*sb)
+{
+	struct xfs_scrub_ctx	*xctx = ctx->priv;
+
+	if (xctx->data_scrubber != DS_READ)
+		return true;
+
+	return read_verify_file(ctx, descr, fd, sb);
+}
+
+/* Phase 6 */
+
+struct xfs_summary_counts {
+	unsigned long long	inodes;		/* number of inodes */
+	unsigned long long	dbytes;		/* data dev bytes */
+	unsigned long long	rbytes;		/* rt dev bytes */
+	unsigned long long	next_phys;	/* end of last counted extent */
+	unsigned long long	agbytes;	/* freespace bytes */
+	struct bitmap		dext;		/* data block extent bitmap */
+	struct bitmap		rext;		/* rt block extent bitmap */
+};
+
+struct xfs_inode_fork_summary {
+	struct bitmap		*tree;
+	unsigned long long	bytes;
+};
+
+/* Record data block extents in a bitmap. */
+bool
+xfs_record_inode_summary_bmap(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	int				fd,
+	int				whichfork,
+	struct fsxattr			*fsx,
+	struct xfs_bmap			*bmap,
+	void				*arg)
+{
+	struct xfs_inode_fork_summary	*ifs = arg;
+
+	/* Only record real extents. */
+	if (bmap->bm_flags & BMV_OF_DELALLOC)
+		return true;
+
+	bitmap_add(ifs->tree, bmap->bm_physical, bmap->bm_length);
+	ifs->bytes += bmap->bm_length;
+
+	return true;
+}
+
+/* Record inode and block usage. */
+static int
+xfs_record_inode_summary(
+	struct scrub_ctx		*ctx,
+	struct xfs_handle		*handle,
+	struct xfs_bstat		*bstat,
+	void				*arg)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	struct xfs_summary_counts	*counts = arg;
+	struct xfs_inode_fork_summary	ifs = {0};
+	struct xfs_bmap			key = {0};
+	char				descr[DESCR_BUFSZ];
+	int				fd;
+	bool				moveon;
+
+	counts->inodes++;
+	if (xfs_scrub_can_getfsmap(xctx) || bstat->bs_blocks == 0)
+		return 0;
+
+	if (!xfs_scrub_can_bmapx(xctx) || !S_ISREG(bstat->bs_mode)) {
+		counts->dbytes += (bstat->bs_blocks << xctx->blocklog);
+		return 0;
+	}
+
+	/* Potentially a reflinked file, so collect the bitmap... */
+	snprintf(descr, DESCR_BUFSZ, _("inode %llu/%u"), bstat->bs_ino,
+			bstat->bs_gen);
+
+	/* Try to open the inode to pin it. */
+	fd = open_by_fshandle(handle, sizeof(*handle),
+			O_RDONLY | O_NOATIME | O_NOFOLLOW | O_NOCTTY);
+	if (fd < 0) {
+		if (errno == ESTALE)
+			return errno;
+
+		str_errno(ctx, descr);
+		return 0;
+	}
+
+	/* data fork */
+	if (bstat->bs_xflags & FS_XFLAG_REALTIME)
+		ifs.tree = &counts->rext;
+	else
+		ifs.tree = &counts->dext;
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_DATA_FORK, &key,
+			xfs_record_inode_summary_bmap, &ifs);
+	if (!moveon)
+		goto out;
+
+	/* attr fork */
+	ifs.tree = &counts->dext;
+	moveon = xfs_iterate_bmap(ctx, descr, fd, XFS_ATTR_FORK, &key,
+			xfs_record_inode_summary_bmap, &ifs);
+	if (!moveon)
+		goto out;
+
+	/*
+	 * bs_blocks tracks the number of fs blocks assigned to this file
+	 * for data, xattrs, and block mapping metadata.  ifs.bytes tracks
+	 * the data and xattr storage space used, so the diff between the
+	 * two is the space used for block mapping metadata.  Add that to
+	 * the data usage.
+	 */
+	counts->dbytes += (bstat->bs_blocks << xctx->blocklog) - ifs.bytes;
+
+out:
+	if (fd >= 0)
+		close(fd);
+	return moveon ? 0 : XFS_ITERATE_INODES_ABORT;
+}
+
+/* Record block usage. */
+static bool
+xfs_record_block_summary(
+	struct scrub_ctx		*ctx,
+	const char			*descr,
+	struct fsmap			*fsmap,
+	void				*arg)
+{
+	struct xfs_summary_counts	*counts = arg;
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	unsigned long long		len;
+
+	if (fsmap->fmr_device == xctx->fsinfo.fs_logdev)
+		return true;
+	if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	    fsmap->fmr_owner == FMR_OWN_FREE)
+		return true;
+
+	len = fsmap->fmr_length;
+
+	/* freesp btrees live in free space, need to adjust counters later. */
+	if ((fsmap->fmr_flags & FMR_OF_SPECIAL_OWNER) &&
+	    fsmap->fmr_owner == FMR_OWN_AG) {
+		counts->agbytes += fsmap->fmr_length;
+	}
+	if (fsmap->fmr_device == xctx->fsinfo.fs_rtdev) {
+		/* Count realtime extents. */
+		counts->rbytes += len;
+	} else {
+		/* Count data extents. */
+		if (counts->next_phys >= fsmap->fmr_physical + len)
+			return true;
+		else if (counts->next_phys > fsmap->fmr_physical)
+			len = fsmap->fmr_physical + len - counts->next_phys;
+		counts->dbytes += len;
+		counts->next_phys = fsmap->fmr_physical + fsmap->fmr_length;
+	}
+
+	return true;
+}
+
+/* Sum the bytes in each extent. */
+static bool
+xfs_summary_count_helper(
+	uint64_t			start,
+	uint64_t			length,
+	void				*arg)
+{
+	unsigned long long		*count = arg;
+
+	*count += length;
+	return true;
+}
+
+/* Count all inodes and blocks in the filesystem, compare to superblock. */
+static bool
+xfs_check_summary(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	struct xfs_fsop_counts		fc = {0};
+	struct xfs_fsop_resblks		rb = {0};
+	struct xfs_fsop_ag_resblks	arb;
+	struct statvfs			sfs;
+	struct xfs_summary_counts	*summary;
+	unsigned long long		fd;
+	unsigned long long		fr;
+	unsigned long long		fi;
+	unsigned long long		sd;
+	unsigned long long		sr;
+	unsigned long long		si;
+	unsigned long long		absdiff;
+	xfs_agnumber_t			agno;
+	bool				moveon;
+	bool				complain;
+	unsigned int			groups;
+	int				error;
+
+	if (!xfs_scrub_can_bulkstat(xctx))
+		return generic_check_summary(ctx);
+
+	groups = xfs_scan_all_blocks_array_size(xctx);
+	summary = calloc(groups, sizeof(struct xfs_summary_counts));
+	if (!summary) {
+		str_errno(ctx, ctx->mntpoint);
+		return false;
+	}
+
+	/* Flush everything out to disk before we start counting. */
+	moveon = syncfs(ctx->mnt_fd) == 0;
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out;
+	}
+
+	if (xfs_scrub_can_getfsmap(xctx)) {
+		/* Use fsmap to count blocks. */
+		moveon = xfs_scan_all_blocks_array_arg(ctx,
+				xfs_record_block_summary,
+				summary, sizeof(*summary));
+		if (!moveon)
+			goto out;
+	} else {
+		/* Reflink w/o rmap; have to collect extents in a bitmap. */
+		for (agno = 0; agno < groups; agno++) {
+			moveon = bitmap_init(&summary[agno].dext);
+			if (!moveon) {
+				str_errno(ctx, ctx->mntpoint);
+				goto out;
+			}
+			moveon = bitmap_init(&summary[agno].rext);
+			if (!moveon) {
+				str_errno(ctx, ctx->mntpoint);
+				goto out;
+			}
+		}
+	}
+
+	/* Scan the whole fs. */
+	moveon = xfs_scan_all_inodes_array_arg(ctx, xfs_record_inode_summary,
+			summary, sizeof(*summary));
+	if (!moveon)
+		goto out;
+
+	if (!xfs_scrub_can_getfsmap(xctx)) {
+		/* Reflink w/o rmap; merge the bitmaps. */
+		for (agno = 1; agno < groups; agno++) {
+			bitmap_merge(&summary[0].dext, &summary[agno].dext);
+			bitmap_free(&summary[agno].dext);
+			bitmap_merge(&summary[0].rext, &summary[agno].rext);
+			bitmap_free(&summary[agno].rext);
+		}
+		moveon = bitmap_iterate(&summary[0].dext,
+				xfs_summary_count_helper, &summary[0].dbytes);
+		moveon &= bitmap_iterate(&summary[0].rext,
+				xfs_summary_count_helper, &summary[0].rbytes);
+		bitmap_free(&summary[0].dext);
+		bitmap_free(&summary[0].rext);
+		if (!moveon)
+			goto out;
+	}
+
+	/* Sum the counts. */
+	for (agno = 1; agno < groups; agno++) {
+		summary[0].inodes += summary[agno].inodes;
+		summary[0].dbytes += summary[agno].dbytes;
+		summary[0].rbytes += summary[agno].rbytes;
+		summary[0].agbytes += summary[agno].agbytes;
+	}
+
+	/* Account for an internal log, if present. */
+	if (!xfs_scrub_can_getfsmap(xctx) && xctx->fsinfo.fs_log == NULL)
+		summary[0].dbytes += (unsigned long long)xctx->geo.logblocks <<
+				xctx->blocklog;
+
+	/* Account for hidden rt metadata inodes. */
+	summary[0].inodes += 2;
+	if ((xctx->geo.flags & XFS_FSOP_GEOM_FLAGS_RMAPBT) &&
+			xctx->geo.rtblocks > 0)
+		summary[0].inodes++;
+
+	/* Fetch the filesystem counters. */
+	error = xfsctl(NULL, ctx->mnt_fd, XFS_IOC_FSCOUNTS, &fc);
+	if (error)
+		str_errno(ctx, ctx->mntpoint);
+
+	/* Grab the fstatvfs counters, since it has to report accurately. */
+	moveon = fstatvfs(ctx->mnt_fd, &sfs) == 0;
+	if (!moveon) {
+		str_errno(ctx, ctx->mntpoint);
+		goto out;
+	}
+
+	/*
+	 * XFS reserves some blocks to prevent hard ENOSPC, so add those
+	 * blocks back to the free data counts.
+	 */
+	error = xfsctl(NULL, ctx->mnt_fd, XFS_IOC_GET_RESBLKS, &rb);
+	if (error)
+		str_errno(ctx, ctx->mntpoint);
+	sfs.f_bfree += rb.resblks_avail;
+
+	/*
+	 * XFS with rmap or reflink reserves blocks in each AG to
+	 * prevent the AG from running out of space for metadata blocks.
+	 * Add those back to the free data counts.
+	 */
+	memset(&arb, 0, sizeof(arb));
+	error = xfsctl(NULL, ctx->mnt_fd, XFS_IOC_GET_AG_RESBLKS, &arb);
+	if (error && errno != ENOTTY)
+		str_errno(ctx, ctx->mntpoint);
+	sfs.f_bfree += arb.resblks;
+
+	/*
+	 * If we counted blocks with fsmap, then dbytes includes
+	 * blocks for the AGFL and the freespace/rmap btrees.  The
+	 * filesystem treats them as "free", but since we scanned
+	 * them, we'll consider them used.
+	 */
+	sfs.f_bfree -= summary[0].agbytes >> xctx->blocklog;
+
+	/* Report on what we found. */
+	fd = (xctx->geo.datablocks - sfs.f_bfree) << xctx->blocklog;
+	fr = (xctx->geo.rtblocks - fc.freertx) << xctx->blocklog;
+	fi = sfs.f_files - sfs.f_ffree;
+	sd = summary[0].dbytes;
+	sr = summary[0].rbytes;
+	si = summary[0].inodes;
+
+	/*
+	 * Complain if the counts are off by more than 10% unless
+	 * the inaccuracy is less than 32MB worth of blocks or 100 inodes.
+	 */
+	absdiff = 1ULL << 25;
+	complain = !within_range(ctx, sd, fd, absdiff, 1, 10, _("data blocks"));
+	complain |= !within_range(ctx, sr, fr, absdiff, 1, 10, _("realtime blocks"));
+	complain |= !within_range(ctx, si, fi, 100, 1, 10, _("inodes"));
+
+	if (complain || verbose) {
+		double		d, r, i;
+		char		*du, *ru, *iu;
+
+		if (fr || sr) {
+			d = auto_space_units(fd, &du);
+			r = auto_space_units(fr, &ru);
+			i = auto_units(fi, &iu);
+			fprintf(stdout,
+_("%.1f%s data used;  %.1f%s realtime data used;  %.2f%s inodes used.\n"),
+					d, du, r, ru, i, iu);
+			d = auto_space_units(sd, &du);
+			r = auto_space_units(sr, &ru);
+			i = auto_units(si, &iu);
+			fprintf(stdout,
+_("%.1f%s data found; %.1f%s realtime data found; %.2f%s inodes found.\n"),
+					d, du, r, ru, i, iu);
+		} else {
+			d = auto_space_units(fd, &du);
+			i = auto_units(fi, &iu);
+			fprintf(stdout,
+_("%.1f%s data used;  %.1f%s inodes used.\n"),
+					d, du, i, iu);
+			d = auto_space_units(sd, &du);
+			i = auto_units(si, &iu);
+			fprintf(stdout,
+_("%.1f%s data found; %.1f%s inodes found.\n"),
+					d, du, i, iu);
+		}
+		fflush(stdout);
+	}
+	moveon = true;
+
+out:
+	for (agno = 0; agno < groups; agno++) {
+		bitmap_free(&summary[agno].dext);
+		bitmap_free(&summary[agno].rext);
+	}
+	free(summary);
+	return moveon;
+}
+
+/* Phase 7: Preen filesystem. */
+
+static int
+list_length(
+	struct list_head		*head)
+{
+	struct list_head		*pos;
+	int				nr = 0;
+
+	list_for_each(pos, head) {
+		nr++;
+	}
+
+	return nr;
+}
+
+/* Fix the per-AG and per-FS metadata. */
+static bool
+xfs_repair_fs(
+	struct scrub_ctx		*ctx)
+{
+	struct xfs_scrub_ctx		*xctx = ctx->priv;
+	int				len;
+	int				old_len;
+	bool				moveon;
+
+	/* Repair anything broken until we fail to make progress. */
+	len = list_length(&xctx->repair_list);
+	do {
+		old_len = len;
+		moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd,
+				&xctx->repair_list, 0);
+		if (!moveon)
+			return false;
+		len = list_length(&xctx->repair_list);
+	} while (old_len > len);
+
+	/* Try once more, but this time complain if we can't fix things. */
+	moveon = xfs_repair_metadata_list(ctx, ctx->mnt_fd,
+			&xctx->repair_list, XRML_NOFIX_COMPLAIN);
+	if (!moveon)
+		return false;
+
+	fstrim(ctx);
+	return true;
+}
+
+/* Shut down the filesystem. */
+static void
+xfs_shutdown_fs(
+	struct scrub_ctx		*ctx)
+{
+	int				flag;
+
+	flag = XFS_FSOP_GOING_FLAGS_LOGFLUSH;
+	if (xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_GOINGDOWN, &flag))
+		str_errno(ctx, ctx->mntpoint);
+}
+
+struct scrub_ops xfs_scrub_ops = {
+	.name			= "xfs",
+	.repair_tool		= "xfs_repair",
+	.cleanup		= xfs_cleanup,
+	.scan_fs		= xfs_scan_fs,
+	.scan_inodes		= xfs_scan_inodes,
+	.check_dir		= generic_check_dir,
+	.check_inode		= generic_check_inode,
+	.scan_extents		= xfs_scan_extents,
+	.scan_xattrs		= xfs_scan_xattrs,
+	.scan_special_xattrs	= xfs_scan_special_xattrs,
+	.scan_metadata		= xfs_scan_metadata,
+	.check_summary		= xfs_check_summary,
+	.scan_blocks		= xfs_scan_blocks,
+	.read_file		= xfs_read_file,
+	.scan_fs_tree		= xfs_scan_fs_tree,
+	.shutdown_fs		= xfs_shutdown_fs,
+	.preen_fs		= xfs_repair_fs,
+	.repair_fs		= xfs_repair_fs,
+};
diff --git a/scrub/xfs_ioctl.c b/scrub/xfs_ioctl.c
new file mode 100644
index 0000000..d69a4eb
--- /dev/null
+++ b/scrub/xfs_ioctl.c
@@ -0,0 +1,942 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#include "libxfs.h"
+#include <sys/statvfs.h>
+#include <sys/types.h>
+#include <dirent.h>
+#include "disk.h"
+#include "scrub.h"
+#include "../repair/threads.h"
+#include "handle.h"
+#include "path.h"
+
+#include "xfs_ioctl.h"
+
+#define FSMAP_NR		65536
+#define BMAP_NR			2048
+
+/* Call the handler function. */
+static int
+xfs_iterate_inode_func(
+	struct scrub_ctx	*ctx,
+	xfs_inode_iter_fn	fn,
+	struct xfs_bstat	*bs,
+	struct xfs_handle	*handle,
+	void			*arg)
+{
+	int			error;
+
+	handle->ha_fid.fid_ino = bs->bs_ino;
+	handle->ha_fid.fid_gen = bs->bs_gen;
+	error = fn(ctx, handle, bs, arg);
+	if (error)
+		return error;
+	if (xfs_scrub_excessive_errors(ctx))
+		return XFS_ITERATE_INODES_ABORT;
+	return 0;
+}
+
+/* Iterate a range of inodes. */
+bool
+xfs_iterate_inodes(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	void			*fshandle,
+	uint64_t		first_ino,
+	uint64_t		last_ino,
+	xfs_inode_iter_fn	fn,
+	void			*arg)
+{
+	struct xfs_fsop_bulkreq	igrpreq = {0};
+	struct xfs_fsop_bulkreq	bulkreq = {0};
+	struct xfs_fsop_bulkreq	onereq = {0};
+	struct xfs_handle	handle;
+	struct xfs_inogrp	inogrp;
+	struct xfs_bstat	bstat[XFS_INODES_PER_CHUNK] = {0};
+	char			idescr[DESCR_BUFSZ];
+	char			buf[DESCR_BUFSZ];
+	struct xfs_bstat	*bs;
+	__u64			last_stale = first_ino - 1;
+	__u64			igrp_ino;
+	__u64			oneino;
+	__u64			ino;
+	__s32			bulklen = 0;
+	__s32			onelen = 0;
+	__s32			igrplen = 0;
+	bool			moveon = true;
+	int			i;
+	int			error;
+	int			stale_count = 0;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"));
+
+	onereq.lastip  = &oneino;
+	onereq.icount  = 1;
+	onereq.ocount  = &onelen;
+
+	bulkreq.lastip  = &ino;
+	bulkreq.icount  = XFS_INODES_PER_CHUNK;
+	bulkreq.ubuffer = &bstat;
+	bulkreq.ocount  = &bulklen;
+
+	igrpreq.lastip  = &igrp_ino;
+	igrpreq.icount  = 1;
+	igrpreq.ubuffer = &inogrp;
+	igrpreq.ocount  = &igrplen;
+
+	memcpy(&handle.ha_fsid, fshandle, sizeof(handle.ha_fsid));
+	handle.ha_fid.fid_len = sizeof(xfs_fid_t) -
+			sizeof(handle.ha_fid.fid_len);
+	handle.ha_fid.fid_pad = 0;
+
+	/* Find the inode chunk & alloc mask */
+	igrp_ino = first_ino;
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_FSINUMBERS,
+			&igrpreq);
+	while (!error && igrplen) {
+		/* Load the inodes. */
+		ino = inogrp.xi_startino - 1;
+		bulkreq.icount = inogrp.xi_alloccount;
+		error = xfsctl(ctx->mntpoint, ctx->mnt_fd,
+				XFS_IOC_FSBULKSTAT, &bulkreq);
+
+		/* Did we get exactly the inodes we expected? */
+		for (i = 0, bs = bstat; i < XFS_INODES_PER_CHUNK; i++) {
+			if (!(inogrp.xi_allocmask & (1ULL << i)))
+				continue;
+			if (bs->bs_ino == inogrp.xi_startino + i) {
+				bs++;
+				continue;
+			}
+
+			/* Load the one inode. */
+			oneino = inogrp.xi_startino + i;
+			onereq.ubuffer = bs;
+			error = xfsctl(ctx->mntpoint, ctx->mnt_fd,
+					XFS_IOC_FSBULKSTAT_SINGLE,
+					&onereq);
+			if (error || bs->bs_ino != inogrp.xi_startino + i) {
+				memset(bs, 0, sizeof(struct xfs_bstat));
+				bs->bs_ino = inogrp.xi_startino + i;
+				bs->bs_blksize = ctx->mnt_sv.f_frsize;
+			}
+			bs++;
+		}
+
+		/* Iterate all the inodes. */
+		for (i = 0, bs = bstat; i < inogrp.xi_alloccount; i++, bs++) {
+			if (bs->bs_ino > last_ino)
+				goto out;
+
+			error = xfs_iterate_inode_func(ctx, fn, bs, &handle,
+					arg);
+			switch (error) {
+			case 0:
+				break;
+			case ESTALE:
+				if (last_stale == inogrp.xi_startino)
+					stale_count++;
+				else {
+					last_stale = inogrp.xi_startino;
+					stale_count = 0;
+				}
+				if (stale_count < 30) {
+					igrp_ino = inogrp.xi_startino;
+					goto igrp_retry;
+				}
+				snprintf(idescr, DESCR_BUFSZ, "inode %llu",
+						bs->bs_ino);
+				str_warn(ctx, idescr, "%s", strerror_r(error,
+						buf, DESCR_BUFSZ));
+				break;
+			case XFS_ITERATE_INODES_ABORT:
+				error = 0;
+				/* fall thru */
+			default:
+				moveon = false;
+				errno = error;
+				goto err;
+			}
+		}
+
+igrp_retry:
+		error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_FSINUMBERS,
+				&igrpreq);
+	}
+
+err:
+	if (error) {
+		str_errno(ctx, descr);
+		moveon = false;
+	}
+out:
+	return moveon;
+}
+
+/* Does the kernel support bulkstat? */
+bool
+xfs_can_iterate_inodes(
+	struct scrub_ctx	*ctx)
+{
+	struct xfs_fsop_bulkreq	bulkreq;
+	__u64			lastino;
+	__s32			bulklen = 0;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_BULKSTAT"))
+		return false;
+
+	lastino = 0;
+	memset(&bulkreq, 0, sizeof(bulkreq));
+	bulkreq.lastip = (__u64 *)&lastino;
+	bulkreq.icount  = 0;
+	bulkreq.ubuffer = NULL;
+	bulkreq.ocount  = &bulklen;
+
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_FSBULKSTAT,
+			&bulkreq);
+	return error == -1 && errno == EINVAL;
+}
+
+/* Iterate all the extent block mappings between the two keys. */
+bool
+xfs_iterate_bmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	int			fd,
+	int			whichfork,
+	struct xfs_bmap		*key,
+	xfs_bmap_iter_fn	fn,
+	void			*arg)
+{
+	struct fsxattr		fsx;
+	struct getbmapx		*map;
+	struct getbmapx		*p;
+	struct xfs_bmap		bmap;
+	char			bmap_descr[DESCR_BUFSZ];
+	bool			moveon = true;
+	xfs_off_t		new_off;
+	int			getxattr_type;
+	int			i;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_BMAP"));
+
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s attr"), descr);
+		break;
+	case XFS_COW_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s CoW"), descr);
+		break;
+	case XFS_DATA_FORK:
+		snprintf(bmap_descr, DESCR_BUFSZ, _("%s data"), descr);
+		break;
+	default:
+		abort();
+	}
+
+	map = calloc(BMAP_NR, sizeof(struct getbmapx));
+	if (!map) {
+		str_errno(ctx, bmap_descr);
+		return false;
+	}
+
+	map->bmv_offset = BTOBB(key->bm_offset);
+	map->bmv_block = BTOBB(key->bm_physical);
+	if (key->bm_length == 0)
+		map->bmv_length = ULLONG_MAX;
+	else
+		map->bmv_length = BTOBB(key->bm_length);
+	map->bmv_count = BMAP_NR;
+	map->bmv_iflags = BMV_IF_NO_DMAPI_READ | BMV_IF_PREALLOC |
+			  BMV_IF_DELALLOC | BMV_IF_NO_HOLES;
+	switch (whichfork) {
+	case XFS_ATTR_FORK:
+		getxattr_type = XFS_IOC_FSGETXATTRA;
+		map->bmv_iflags |= BMV_IF_ATTRFORK;
+		break;
+	case XFS_COW_FORK:
+		map->bmv_iflags |= BMV_IF_COWFORK;
+		getxattr_type = XFS_IOC_FSGETXATTR;
+		break;
+	case XFS_DATA_FORK:
+		getxattr_type = XFS_IOC_FSGETXATTR;
+		break;
+	default:
+		abort();
+	}
+
+	error = xfsctl("", fd, getxattr_type, &fsx);
+	if (error < 0) {
+		str_errno(ctx, bmap_descr);
+		moveon = false;
+		goto out;
+	}
+
+	while ((error = xfsctl(bmap_descr, fd, XFS_IOC_GETBMAPX, map)) == 0) {
+
+		for (i = 0, p = &map[i + 1]; i < map->bmv_entries; i++, p++) {
+			bmap.bm_offset = BBTOB(p->bmv_offset);
+			bmap.bm_physical = BBTOB(p->bmv_block);
+			bmap.bm_length = BBTOB(p->bmv_length);
+			bmap.bm_flags = p->bmv_oflags;
+			moveon = fn(ctx, bmap_descr, fd, whichfork, &fsx,
+					&bmap, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+
+		if (map->bmv_entries == 0)
+			break;
+		p = map + map->bmv_entries;
+		if (p->bmv_oflags & BMV_OF_LAST)
+			break;
+
+		new_off = p->bmv_offset + p->bmv_length;
+		map->bmv_length -= new_off - map->bmv_offset;
+		map->bmv_offset = new_off;
+	}
+
+	/* Pre-reflink filesystems don't know about CoW forks. */
+	if (whichfork == XFS_COW_FORK && error && errno == EINVAL)
+		error = 0;
+
+	if (error)
+		str_errno(ctx, bmap_descr);
+out:
+	memcpy(key, map, sizeof(struct getbmapx));
+	free(map);
+	return moveon;
+}
+
+/* Does the kernel support getbmapx? */
+bool
+xfs_can_iterate_bmap(
+	struct scrub_ctx	*ctx)
+{
+	struct getbmapx		bsm[2];
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_BMAP"))
+		return false;
+
+	memset(bsm, 0, sizeof(bsm));
+	bsm->bmv_length = ULLONG_MAX;
+	bsm->bmv_count = 2;
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_GETBMAPX, bsm);
+	return error == 0;
+}
+
+/* Iterate all the fs block mappings between the two keys. */
+bool
+xfs_iterate_fsmap(
+	struct scrub_ctx	*ctx,
+	const char		*descr,
+	struct fsmap		*keys,
+	xfs_fsmap_iter_fn	fn,
+	void			*arg)
+{
+	struct fsmap_head	*head;
+	struct fsmap		*p;
+	bool			moveon = true;
+	int			i;
+	int			error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_FSMAP"));
+
+	head = malloc(fsmap_sizeof(FSMAP_NR));
+	if (!head) {
+		str_errno(ctx, descr);
+		return false;
+	}
+
+	memset(head, 0, sizeof(*head));
+	memcpy(head->fmh_keys, keys, sizeof(struct fsmap) * 2);
+	head->fmh_count = FSMAP_NR;
+
+	while ((error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_GETFSMAP,
+				head)) == 0) {
+
+		for (i = 0, p = head->fmh_recs; i < head->fmh_entries; i++, p++) {
+			moveon = fn(ctx, descr, p, arg);
+			if (!moveon)
+				goto out;
+			if (xfs_scrub_excessive_errors(ctx)) {
+				moveon = false;
+				goto out;
+			}
+		}
+
+		if (head->fmh_entries == 0)
+			break;
+		p = &head->fmh_recs[head->fmh_entries - 1];
+		if (p->fmr_flags & FMR_OF_LAST)
+			break;
+
+		head->fmh_keys[0] = *p;
+	}
+
+	if (error) {
+		str_errno(ctx, descr);
+		moveon = false;
+	}
+out:
+	free(head);
+	return moveon;
+}
+
+/* Does the kernel support getfsmap? */
+bool
+xfs_can_iterate_fsmap(
+	struct scrub_ctx	*ctx)
+{
+	struct fsmap_head	head;
+	int			error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_FSMAP"))
+		return false;
+
+	memset(&head, 0, sizeof(struct fsmap_head));
+	head.fmh_keys[1].fmr_device = UINT_MAX;
+	head.fmh_keys[1].fmr_physical = ULLONG_MAX;
+	head.fmh_keys[1].fmr_owner = ULLONG_MAX;
+	head.fmh_keys[1].fmr_offset = ULLONG_MAX;
+	error = xfsctl(ctx->mntpoint, ctx->mnt_fd, XFS_IOC_GETFSMAP, &head);
+	return error == 0 && (head.fmh_oflags & FMH_OF_DEV_T);
+}
+
+/* Online scrub and repair. */
+
+/* Type info and names for the scrub types. */
+enum scrub_type {
+	ST_NONE,	/* disabled */
+	ST_AGHEADER,	/* per-AG header */
+	ST_PERAG,	/* per-AG metadata */
+	ST_FS,		/* per-FS metadata */
+	ST_INODE,	/* per-inode metadata */
+};
+struct scrub_descr {
+	const char	*name;
+	enum scrub_type	type;
+};
+
+/* Indexed by sm_type; entries must stay in XFS_SCRUB_TYPE_* order. */
+static const struct scrub_descr scrubbers[] = {
+	{"dummy",				ST_NONE},
+	{"superblock",				ST_AGHEADER},
+	{"free space header",			ST_AGHEADER},
+	{"free list",				ST_AGHEADER},
+	{"inode header",			ST_AGHEADER},
+	{"freesp by block btree",		ST_PERAG},
+	{"freesp by length btree",		ST_PERAG},
+	{"inode btree",				ST_PERAG},
+	{"free inode btree",			ST_PERAG},
+	{"reverse mapping btree",		ST_PERAG},
+	{"reference count btree",		ST_PERAG},
+	{"record",				ST_INODE},
+	{"data block map",			ST_INODE},
+	{"attr block map",			ST_INODE},
+	{"CoW block map",			ST_INODE},
+	{"directory entries",			ST_INODE},
+	{"extended attributes",			ST_INODE},
+	{"symbolic link",			ST_INODE},
+	{"realtime bitmap",			ST_FS},
+	{"realtime summary",			ST_FS},
+};
+
+/* Format a scrub description. */
+static void
+format_scrub_descr(
+	char				*buf,
+	size_t				buflen,
+	struct xfs_scrub_metadata	*meta,
+	const struct scrub_descr	*sc)
+{
+	switch (sc->type) {
+	case ST_AGHEADER:
+	case ST_PERAG:
+		snprintf(buf, buflen, _("AG %u %s"), meta->sm_agno,
+				_(sc->name));
+		break;
+	case ST_INODE:
+		snprintf(buf, buflen, _("Inode %llu %s"), meta->sm_ino,
+				_(sc->name));
+		break;
+	case ST_FS:
+		snprintf(buf, buflen, _("%s"), _(sc->name));
+		break;
+	case ST_NONE:
+		assert(0);
+		break;
+	}
+}
+
+static inline bool
+IS_CORRUPT(
+	__u32				flags)
+{
+	return flags & (XFS_SCRUB_FLAG_CORRUPT | XFS_SCRUB_FLAG_XCORRUPT);
+}
+
+/* Do we need to repair something? */
+static inline bool
+xfs_scrub_needs_repair(
+	struct xfs_scrub_metadata	*sm)
+{
+	return IS_CORRUPT(sm->sm_flags);
+}
+
+/* Can we optimize something? */
+static inline bool
+xfs_scrub_needs_preen(
+	struct xfs_scrub_metadata	*sm)
+{
+	return sm->sm_flags & XFS_SCRUB_FLAG_PREEN;
+}
+
+/* Do a read-only check of some metadata. */
+static enum check_outcome
+xfs_check_metadata(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct xfs_scrub_metadata	*meta)
+{
+	char				buf[DESCR_BUFSZ];
+	int				error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+	assert(meta->sm_type <= XFS_SCRUB_TYPE_MAX);
+	format_scrub_descr(buf, DESCR_BUFSZ, meta, &scrubbers[meta->sm_type]);
+
+	dbg_printf("check %s flags %xh\n", buf, meta->sm_flags);
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, meta);
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !error)
+		meta->sm_flags |= XFS_SCRUB_FLAG_PREEN;
+	if (error) {
+		/* Metadata not present, just skip it. */
+		if (errno == ENOENT)
+			return CHECK_DONE;
+		else if (errno == ESHUTDOWN) {
+			/* FS already crashed, give up. */
+			str_error(ctx, buf,
+_("Filesystem is shut down, aborting."));
+			return CHECK_ABORT;
+		}
+
+		/* Operational error. */
+		str_errno(ctx, buf);
+		return CHECK_DONE;
+	} else if (!xfs_scrub_needs_repair(meta) &&
+		   !xfs_scrub_needs_preen(meta)) {
+		/* Clean operation, no corruption or preening detected. */
+		return CHECK_DONE;
+	} else if (xfs_scrub_needs_repair(meta) &&
+		   ctx->mode < SCRUB_MODE_REPAIR) {
+		/* Corrupt, but we're not in repair mode. */
+		str_error(ctx, buf, _("Repairs are required."));
+		return CHECK_DONE;
+	} else if (xfs_scrub_needs_preen(meta) &&
+		   ctx->mode < SCRUB_MODE_PREEN) {
+		/* Preenable, but we're not in preen mode. */
+		str_info(ctx, buf, _("Optimization is possible."));
+		return CHECK_DONE;
+	}
+
+	return CHECK_REPAIR;
+}
+
+/* Scrub metadata, saving corruption reports for later. */
+static bool
+xfs_scrub_metadata(
+	struct scrub_ctx		*ctx,
+	enum scrub_type			scrub_type,
+	xfs_agnumber_t			agno,
+	struct list_head		*repair_list)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	const struct scrub_descr	*sc;
+	struct repair_item		*ri;
+	enum check_outcome		fix;
+	int				type;
+
+	sc = scrubbers;
+	for (type = 0; type <= XFS_SCRUB_TYPE_MAX; type++, sc++) {
+		if (sc->type != scrub_type)
+			continue;
+
+		meta.sm_type = type;
+		meta.sm_flags = 0;
+		meta.sm_agno = agno;
+
+		/* Check the item. */
+		fix = xfs_check_metadata(ctx, ctx->mnt_fd, &meta);
+		if (fix == CHECK_ABORT)
+			return false;
+		if (fix == CHECK_DONE)
+			continue;
+
+		/* Schedule this item for later repairs. */
+		ri = malloc(sizeof(struct repair_item));
+		if (!ri) {
+			str_errno(ctx, _("repair list"));
+			return false;
+		}
+		ri->op = meta;
+		list_add_tail(&ri->list, repair_list);
+	}
+
+	return true;
+}
+
+/* Scrub each AG's header blocks. */
+bool
+xfs_scrub_ag_headers(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno,
+	struct list_head		*repair_list)
+{
+	return xfs_scrub_metadata(ctx, ST_AGHEADER, agno, repair_list);
+}
+
+/* Scrub each AG's metadata btrees. */
+bool
+xfs_scrub_ag_metadata(
+	struct scrub_ctx		*ctx,
+	xfs_agnumber_t			agno,
+	struct list_head		*repair_list)
+{
+	return xfs_scrub_metadata(ctx, ST_PERAG, agno, repair_list);
+}
+
+/* Scrub whole-FS metadata btrees. */
+bool
+xfs_scrub_fs_metadata(
+	struct scrub_ctx		*ctx,
+	struct list_head		*repair_list)
+{
+	return xfs_scrub_metadata(ctx, ST_FS, 0, repair_list);
+}
+
+/* Scrub inode metadata. */
+static bool
+__xfs_scrub_file(
+	struct scrub_ctx		*ctx,
+	uint64_t			ino,
+	uint32_t			gen,
+	int				fd,
+	unsigned int			type,
+	struct list_head		*repair_list)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	struct repair_item		*ri;
+	enum check_outcome		fix;
+
+	assert(type <= XFS_SCRUB_TYPE_MAX);
+	assert(scrubbers[type].type == ST_INODE);
+
+	meta.sm_type = type;
+	meta.sm_ino = ino;
+	meta.sm_gen = gen;
+
+	/* Scrub the piece of metadata. */
+	fix = xfs_check_metadata(ctx, fd, &meta);
+	if (fix == CHECK_ABORT)
+		return false;
+	if (fix == CHECK_DONE)
+		return true;
+
+	/* Schedule this item for later repairs. */
+	ri = malloc(sizeof(struct repair_item));
+	if (!ri) {
+		str_errno(ctx, _("repair list"));
+		return false;
+	}
+	ri->op = meta;
+	list_add_tail(&ri->list, repair_list);
+	return true;
+}
+
+#define XFS_SCRUB_FILE_PART(name, flagname) \
+bool \
+xfs_scrub_##name( \
+	struct scrub_ctx		*ctx, \
+	uint64_t			ino, \
+	uint32_t			gen, \
+	int				fd, \
+	struct list_head		*repair_list) \
+{ \
+	return __xfs_scrub_file(ctx, ino, gen, fd, XFS_SCRUB_TYPE_##flagname, \
+			repair_list); \
+}
+XFS_SCRUB_FILE_PART(inode_fields,	INODE)
+XFS_SCRUB_FILE_PART(data_fork,		BMBTD)
+XFS_SCRUB_FILE_PART(attr_fork,		BMBTA)
+XFS_SCRUB_FILE_PART(cow_fork,		BMBTC)
+XFS_SCRUB_FILE_PART(dir,		DIR)
+XFS_SCRUB_FILE_PART(attr,		XATTR)
+XFS_SCRUB_FILE_PART(symlink,		SYMLINK)
+
+/*
+ * Prioritize repair items in order of how long we can wait.
+ * 0 = do it now, 10000 = do it later.
+ *
+ * To minimize the amount of repair work, we want to prioritize metadata
+ * objects by perceived corruptness.  If CORRUPT is set, the fields are
+ * just plain bad; try fixing that first.  Otherwise if XCORRUPT is set,
+ * the fields could be bad, but the xref data could also be bad; we'll
+ * try fixing that next.  Finally, if XFAIL is set, some other metadata
+ * structure failed validation during xref, so we'll recheck this
+ * metadata last since it was probably fine.
+ *
+ * For metadata that lie in the critical path of checking other metadata
+ * (superblock, AG{F,I,FL}, inobt) we scrub and fix those things before
+ * we even get to handling their dependencies, so things should progress
+ * in order.
+ */
+static int
+PRIO(
+	struct xfs_scrub_metadata	*op,
+	int				order)
+{
+	if (op->sm_flags & XFS_SCRUB_FLAG_CORRUPT)
+		return order;
+	else if (op->sm_flags & XFS_SCRUB_FLAG_XCORRUPT)
+		return 100 + order;
+	else if (op->sm_flags & XFS_SCRUB_FLAG_XFAIL)
+		return 200 + order;
+	else if (op->sm_flags & XFS_SCRUB_FLAG_PREEN)
+		return 300 + order;
+	abort();
+}
+
+static int
+xfs_repair_item_priority(
+	struct repair_item		*ri)
+{
+	switch (ri->op.sm_type) {
+	case XFS_SCRUB_TYPE_SB:
+		return PRIO(&ri->op, 0);
+	case XFS_SCRUB_TYPE_AGF:
+		return PRIO(&ri->op, 1);
+	case XFS_SCRUB_TYPE_AGFL:
+		return PRIO(&ri->op, 2);
+	case XFS_SCRUB_TYPE_AGI:
+		return PRIO(&ri->op, 3);
+	case XFS_SCRUB_TYPE_BNOBT:
+	case XFS_SCRUB_TYPE_CNTBT:
+	case XFS_SCRUB_TYPE_INOBT:
+	case XFS_SCRUB_TYPE_FINOBT:
+	case XFS_SCRUB_TYPE_REFCNTBT:
+		return PRIO(&ri->op, 4);
+	case XFS_SCRUB_TYPE_RMAPBT:
+		return PRIO(&ri->op, 5);
+	case XFS_SCRUB_TYPE_INODE:
+		return PRIO(&ri->op, 6);
+	case XFS_SCRUB_TYPE_BMBTD:
+	case XFS_SCRUB_TYPE_BMBTA:
+	case XFS_SCRUB_TYPE_BMBTC:
+		return PRIO(&ri->op, 7);
+	case XFS_SCRUB_TYPE_DIR:
+	case XFS_SCRUB_TYPE_XATTR:
+	case XFS_SCRUB_TYPE_SYMLINK:
+		return PRIO(&ri->op, 8);
+	case XFS_SCRUB_TYPE_RTBITMAP:
+	case XFS_SCRUB_TYPE_RTSUM:
+		return PRIO(&ri->op, 9);
+	}
+	abort();
+}
+
+/* Sort repair items by ascending priority; lower values get fixed first. */
+static int
+xfs_repair_item_compare(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct repair_item		*ra;
+	struct repair_item		*rb;
+
+	ra = container_of(a, struct repair_item, list);
+	rb = container_of(b, struct repair_item, list);
+
+	return xfs_repair_item_priority(ra) - xfs_repair_item_priority(rb);
+}
+
+/* Repair some metadata. */
+static enum check_outcome
+xfs_repair_metadata(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct xfs_scrub_metadata	*meta,
+	bool				complain_if_still_broken)
+{
+	char				buf[DESCR_BUFSZ];
+	__u32				oldf = meta->sm_flags;
+	int				error;
+
+	assert(!debug_tweak_on("XFS_SCRUB_NO_KERNEL"));
+	meta->sm_flags |= XFS_SCRUB_FLAG_REPAIR;
+	assert(meta->sm_type <= XFS_SCRUB_TYPE_MAX);
+	format_scrub_descr(buf, DESCR_BUFSZ, meta, &scrubbers[meta->sm_type]);
+
+	if (xfs_scrub_needs_repair(meta))
+		str_info(ctx, buf, _("Attempting repair."));
+	else if (debug || verbose)
+		str_info(ctx, buf, _("Attempting optimization."));
+
+	error = ioctl(fd, XFS_IOC_SCRUB_METADATA, meta);
+	if (error) {
+		switch (errno) {
+		case ESHUTDOWN:
+			/* Filesystem is already shut down, abort. */
+			str_error(ctx, buf,
+_("Filesystem is shut down, aborting."));
+			return CHECK_ABORT;
+		case ENOTTY:
+		case EOPNOTSUPP:
+			/*
+			 * If we forced repairs, don't complain if kernel
+			 * doesn't know how to fix.
+			 */
+			if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR"))
+				return CHECK_DONE;
+			/* fall through */
+		case EINVAL:
+			/* Kernel doesn't know how to repair this? */
+			if (complain_if_still_broken)
+				str_error(ctx, buf,
+_("Don't know how to fix; offline repair required."));
+			return CHECK_REPAIR;
+		case EROFS:
+			/* Read-only filesystem, can't fix. */
+			if (verbose || debug || IS_CORRUPT(oldf))
+				str_info(ctx, buf,
+_("Read-only filesystem; cannot make changes."));
+			return CHECK_DONE;
+		case ENOENT:
+			/* Metadata not present, just skip it. */
+			return CHECK_DONE;
+		case ENOMEM:
+		case ENOSPC:
+			/* Don't care if preen fails due to low resources. */
+			if (oldf & XFS_SCRUB_FLAG_PREEN)
+				return CHECK_DONE;
+			/* fall through */
+		default:
+			/* Operational error. */
+			str_errno(ctx, buf);
+			return CHECK_DONE;
+		}
+	} else if (xfs_scrub_needs_repair(meta)) {
+		/* Still broken, try again or fix offline. */
+		if (complain_if_still_broken)
+			str_error(ctx, buf,
+_("Repair unsuccessful; offline repair required."));
+		return CHECK_REPAIR;
+	} else {
+		/* Clean operation, no corruption detected. */
+		if (IS_CORRUPT(oldf))
+			record_repair(ctx, buf, _("Repairs successful."));
+		else
+			record_preen(ctx, buf, _("Optimization successful."));
+		return CHECK_DONE;
+	}
+}
+
+/* Repair everything on this list. */
+bool
+xfs_repair_metadata_list(
+	struct scrub_ctx		*ctx,
+	int				fd,
+	struct list_head		*repair_list,
+	unsigned int			flags)
+{
+	struct repair_item		*ri;
+	struct repair_item		*n;
+	enum check_outcome		fix;
+
+	list_sort(NULL, repair_list, xfs_repair_item_compare);
+
+	list_for_each_entry_safe(ri, n, repair_list, list) {
+		if (!IS_CORRUPT(ri->op.sm_flags) &&
+		    (flags & XRML_REPAIR_ONLY))
+			continue;
+		fix = xfs_repair_metadata(ctx, fd, &ri->op,
+				flags & XRML_NOFIX_COMPLAIN);
+		if (fix == CHECK_ABORT)
+			return false;
+		else if (fix == CHECK_REPAIR)
+			continue;
+
+		list_del(&ri->list);
+		free(ri);
+	}
+
+	return !xfs_scrub_excessive_errors(ctx);
+}
+
+/* Test the availability of a kernel scrub command. */
+static bool
+__xfs_scrub_test(
+	struct scrub_ctx		*ctx,
+	unsigned int			type)
+{
+	struct xfs_scrub_metadata	meta = {0};
+	struct xfs_error_injection	inject;
+	static bool			injected;
+	int				error;
+
+	if (debug_tweak_on("XFS_SCRUB_NO_KERNEL"))
+		return false;
+	if (debug_tweak_on("XFS_SCRUB_FORCE_REPAIR") && !injected) {
+		inject.fd = ctx->mnt_fd;
+#define XFS_ERRTAG_FORCE_REPAIR	28
+		inject.errtag = XFS_ERRTAG_FORCE_REPAIR;
+		error = xfsctl(ctx->mntpoint, ctx->mnt_fd,
+				XFS_IOC_ERROR_INJECTION, &inject);
+		if (error == 0)
+			injected = true;
+	}
+
+	meta.sm_type = type;
+	error = ioctl(ctx->mnt_fd, XFS_IOC_SCRUB_METADATA, &meta);
+	return error == 0 || (errno != EOPNOTSUPP && errno != ENOTTY);
+}
+
+#define XFS_CAN_SCRUB_TEST(name, flagname) \
+bool \
+xfs_can_scrub_##name( \
+	struct scrub_ctx		*ctx) \
+{ \
+	return __xfs_scrub_test(ctx, XFS_SCRUB_TYPE_##flagname); \
+}
+XFS_CAN_SCRUB_TEST(fs_metadata,		SB)
+XFS_CAN_SCRUB_TEST(inode,		INODE)
+XFS_CAN_SCRUB_TEST(bmap,		BMBTD)
+XFS_CAN_SCRUB_TEST(dir,			DIR)
+XFS_CAN_SCRUB_TEST(attr,		XATTR)
+XFS_CAN_SCRUB_TEST(symlink,		SYMLINK)
diff --git a/scrub/xfs_ioctl.h b/scrub/xfs_ioctl.h
new file mode 100644
index 0000000..539804d
--- /dev/null
+++ b/scrub/xfs_ioctl.h
@@ -0,0 +1,102 @@
+/*
+ * Copyright (C) 2017 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+#ifndef XFS_IOCTL_H_
+#define XFS_IOCTL_H_
+
+/* inode iteration */
+#define XFS_ITERATE_INODES_ABORT	(-1)
+typedef int (*xfs_inode_iter_fn)(struct scrub_ctx *ctx,
+		struct xfs_handle *handle, struct xfs_bstat *bs, void *arg);
+bool xfs_iterate_inodes(struct scrub_ctx *ctx, const char *descr,
+		void *fshandle, uint64_t first_ino, uint64_t last_ino,
+		xfs_inode_iter_fn fn, void *arg);
+bool xfs_can_iterate_inodes(struct scrub_ctx *ctx);
+
+/* inode fork block mapping */
+struct xfs_bmap {
+	uint64_t	bm_offset;	/* file offset of segment in bytes */
+	uint64_t	bm_physical;	/* physical starting byte */
+	uint64_t	bm_length;	/* length of segment, bytes */
+	uint32_t	bm_flags;	/* output flags */
+};
+
+typedef bool (*xfs_bmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+		int fd, int whichfork, struct fsxattr *fsx,
+		struct xfs_bmap *bmap, void *arg);
+
+bool xfs_iterate_bmap(struct scrub_ctx *ctx, const char *descr, int fd,
+		int whichfork, struct xfs_bmap *key, xfs_bmap_iter_fn fn,
+		void *arg);
+bool xfs_can_iterate_bmap(struct scrub_ctx *ctx);
+
+/* filesystem reverse mapping */
+typedef bool (*xfs_fsmap_iter_fn)(struct scrub_ctx *ctx, const char *descr,
+		struct fsmap *fsr, void *arg);
+bool xfs_iterate_fsmap(struct scrub_ctx *ctx, const char *descr,
+		struct fsmap *keys, xfs_fsmap_iter_fn fn, void *arg);
+bool xfs_can_iterate_fsmap(struct scrub_ctx *ctx);
+
+/* Online scrub and repair. */
+enum check_outcome {
+	CHECK_DONE,
+	CHECK_REPAIR,
+	CHECK_ABORT,
+};
+
+struct repair_item {
+	struct list_head		list;
+	struct xfs_scrub_metadata	op;
+};
+
+bool xfs_scrub_ag_headers(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct list_head *repair_list);
+bool xfs_scrub_ag_metadata(struct scrub_ctx *ctx, xfs_agnumber_t agno,
+		struct list_head *repair_list);
+bool xfs_scrub_fs_metadata(struct scrub_ctx *ctx,
+		struct list_head *repair_list);
+
+#define XRML_REPAIR_ONLY	1 /* no optimizations */
+#define XRML_NOFIX_COMPLAIN	2 /* complain if still corrupt */
+bool xfs_repair_metadata_list(struct scrub_ctx *ctx, int fd,
+		struct list_head *repair_list, unsigned int flags);
+
+bool xfs_can_scrub_fs_metadata(struct scrub_ctx *ctx);
+bool xfs_can_scrub_inode(struct scrub_ctx *ctx);
+bool xfs_can_scrub_bmap(struct scrub_ctx *ctx);
+bool xfs_can_scrub_dir(struct scrub_ctx *ctx);
+bool xfs_can_scrub_attr(struct scrub_ctx *ctx);
+bool xfs_can_scrub_symlink(struct scrub_ctx *ctx);
+
+bool xfs_scrub_inode_fields(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_data_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_attr_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_cow_fork(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_dir(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_attr(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+bool xfs_scrub_symlink(struct scrub_ctx *ctx, uint64_t ino, uint32_t gen,
+		int fd, struct list_head *repair_list);
+
+#endif /* XFS_IOCTL_H_ */