From patchwork Wed Sep  5 08:19:29 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 10588411
Return-Path: <linux-xfs-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DF75F5A4
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:40 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C870F295CA
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:40 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id BC86129789; Wed,  5 Sep 2018 08:19:40 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4DCCA295CA
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:40 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727659AbeIEMsm (ORCPT
        <rfc822;patchwork-linux-xfs@patchwork.kernel.org>);
        Wed, 5 Sep 2018 08:48:42 -0400
Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:24725 "EHLO
        ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1726342AbeIEMsm (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Wed, 5 Sep 2018 08:48:42 -0400
Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard)
 ([59.167.129.252])
  by ipmail06.adl6.internode.on.net with ESMTP; 05 Sep 2018 17:49:38 +0930
Received: from discord.disaster.area ([192.168.1.111])
        by dastard with esmtp (Exim 4.80)
        (envelope-from <david@fromorbit.com>)
        id 1fxT2G-0003NB-1y
        for linux-xfs@vger.kernel.org; Wed, 05 Sep 2018 18:19:36 +1000
Received: from dave by discord.disaster.area with local (Exim 4.91)
        (envelope-from <david@fromorbit.com>)
        id 1fxT2G-0007Ga-10
        for linux-xfs@vger.kernel.org; Wed, 05 Sep 2018 18:19:36 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 1/4] mkfs: stop zeroing old superblocks excessively
Date: Wed,  5 Sep 2018 18:19:29 +1000
Message-Id: <20180905081932.27478-2-david@fromorbit.com>
X-Mailer: git-send-email 2.17.0
In-Reply-To: <20180905081932.27478-1-david@fromorbit.com>
References: <20180905081932.27478-1-david@fromorbit.com>
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

From: Dave Chinner <dchinner@redhat.com>

When making a new filesystem, don't zero superblocks way beyond the
end of the new filesystem. If the old filesystem was an EB scale
filesytsem, then this zeroing requires millions of IOs to complete.
We don't want to do this if the new filesystem on the device is only
going to be 100TB. Sure, zeroing old superblocks a good distance
beyond the new size is a good idea, as is zeroing the ones in the
middle and end, but the other 7,999,000 superblocks? Not so much.

Make a sane cut-off decision - zero out to 10x the size of the new
filesystem, then zero the middle AGs in the old filesystem, then
zero the last ones.

The initial zeroing out to 10x the new fs size means that this code
will only ever trigger in rare corner cases outside a testing
environment - there are very few production workloads where a huge
block device is reused immediately and permanently for a tiny much
smaller filesystem. Those that do this (e.g. on thing provisioned
devices) discard the in use blocks anyway and so the zeroing won't
actually do anything useful.

Time to mkfs a 1TB filsystem on a big device after it held another
larger filesystem:

previous FS size	10PB	100PB	 1EB
old mkfs time		1.95s	8.9s	81.3s
patched			0.95s	1.2s	 1.2s


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mkfs/xfs_mkfs.c | 64 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 5 deletions(-)

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 2e53c1e83b6a..c153592c705e 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -1155,14 +1155,15 @@ validate_ag_geometry(
 
 static void
 zero_old_xfs_structures(
-	libxfs_init_t		*xi,
-	xfs_sb_t		*new_sb)
+	struct libxfs_xinit	*xi,
+	struct xfs_sb		*new_sb)
 {
-	void 			*buf;
-	xfs_sb_t 		sb;
+	void			*buf;
+	struct xfs_sb		sb;
 	uint32_t		bsize;
 	int			i;
 	xfs_off_t		off;
+	xfs_off_t		end;
 
 	/*
 	 * We open regular files with O_TRUNC|O_CREAT. Nothing to do here...
@@ -1220,15 +1221,68 @@ zero_old_xfs_structures(
 
 	/*
 	 * block size and basic geometry seems alright, zero the secondaries.
+	 *
+	 * Don't be insane when it comes to overwriting really large filesystems
+	 * as it could take millions of IOs to zero every secondary
+	 * superblock. If we are remaking a huge filesystem, then do the
+	 * zeroing, but if we are replacing it with a small one (typically done
+	 * in test environments, limit the zeroing to:
+	 *
+	 *	- around the range of the new filesystem
+	 *	- the middle of the old filesystem
+	 *	- the end of the old filesystem
+	 *
+	 * Killing the middle and end of the old filesystem will prevent repair
+	 * from finding it with it's fast secondary sb scan algorithm. The slow
+	 * scan algorithm will then confirm the small filesystem geometry by
+	 * brute force scans.
 	 */
 	memset(buf, 0, new_sb->sb_sectsize);
+
+	/* this carefully avoids integer overflows */
+	end = sb.sb_dblocks;
+	if (sb.sb_agcount > 10000 &&
+	    new_sb->sb_dblocks < end / 10)
+		end = new_sb->sb_dblocks * 10;
 	off = 0;
-	for (i = 1; i < sb.sb_agcount; i++)  {
+	for (i = 1; i < sb.sb_agcount && off < end; i++)  {
+		off += sb.sb_agblocks;
+		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
+					off << sb.sb_blocklog) == -1)
+			break;
+	}
+
+	if (end == sb.sb_dblocks)
+		return;
+
+	/*
+	 * Trash the middle 1000 AGs of the old fs, which we know has at least
+	 * 10000 AGs at this point. Cast to make sure we are doing 64bit
+	 * multiplies, otherwise off gets truncated to 32 bit. I hate C.
+	 */
+	i = (sb.sb_agcount / 2) - 500;
+	off = (xfs_off_t)sb.sb_agblocks * i;
+	off = (xfs_off_t)sb.sb_agblocks * ((sb.sb_agcount / 2) - 500);
+	end = off + 1000 * sb.sb_agblocks;
+	while (off < end) {
+		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
+					off << sb.sb_blocklog) == -1)
+			break;
 		off += sb.sb_agblocks;
+	}
+
+	/*
+	 * Trash the last 1000 AGs of the old fs
+	 */
+	off = (xfs_off_t)sb.sb_agblocks * (sb.sb_agcount - 1000);
+	end = sb.sb_dblocks;
+	while (off < end) {
 		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
 					off << sb.sb_blocklog) == -1)
 			break;
+		off += sb.sb_agblocks;
 	}
+
 done:
 	free(buf);
 }

From patchwork Wed Sep  5 08:19:30 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 10588415
Return-Path: <linux-xfs-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6D914180E
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:43 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 55BA62973B
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:43 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 4A3EB29789; Wed,  5 Sep 2018 08:19:43 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CFCC7297C5
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:42 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727740AbeIEMsp (ORCPT
        <rfc822;patchwork-linux-xfs@patchwork.kernel.org>);
        Wed, 5 Sep 2018 08:48:45 -0400
Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:25505 "EHLO
        ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1727742AbeIEMsp (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Wed, 5 Sep 2018 08:48:45 -0400
Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard)
 ([59.167.129.252])
  by ipmail06.adl6.internode.on.net with ESMTP; 05 Sep 2018 17:49:38 +0930
Received: from discord.disaster.area ([192.168.1.111])
        by dastard with esmtp (Exim 4.80)
        (envelope-from <david@fromorbit.com>)
        id 1fxT2G-0003NC-31
        for linux-xfs@vger.kernel.org; Wed, 05 Sep 2018 18:19:36 +1000
Received: from dave by discord.disaster.area with local (Exim 4.91)
        (envelope-from <david@fromorbit.com>)
        id 1fxT2G-0007Gd-1o
        for linux-xfs@vger.kernel.org; Wed, 05 Sep 2018 18:19:36 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 2/4] mkfs: rework AG header initialisation ordering
Date: Wed,  5 Sep 2018 18:19:30 +1000
Message-Id: <20180905081932.27478-3-david@fromorbit.com>
X-Mailer: git-send-email 2.17.0
In-Reply-To: <20180905081932.27478-1-david@fromorbit.com>
References: <20180905081932.27478-1-david@fromorbit.com>
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

From: Dave Chinner <dchinner@redhat.com>

When observing the behaviour of an 8EB mkfs execution, I noticed
that a phase where there are a massive number of read/modify/write
cycles occurring. I didn't wait for it to complete - it was obvious
that it was after all the AG headers had been written. That left the
AGFL initialisation as the likely cause.

When all the AG headers don't fit in the libxfs buffer cache, the
AGFL init requires re-reading the AGF, the AGFL, the free space tree
root blocks and the rmap tree root block. They all then get
modified and written back out. 10 IOs per AG. When you have 8
million AGs, that's a lot of extra IO.

Change the initialisation algorithm to initialise the AGFL
immediately after initialising the rest of the headers and
calculating the minimum AGFL size for that AG. This means the
modifications will all hit the buffer cache and this will remove the
IO penalty.

The "worst_freelist" size calculation doesn't change from AG to AG -
it's based on the physical configuration of the AG, and all AGs have
the same configuration. hence we only need to calculate this once,
not for every AG. That allows us to initialise the AGFL immediately
after the rest of the AG has been initialised rather than in a
separate pass.

TIme to make a filesystem from scratch, using a zeroed device so the
force overwrite algorithms are not triggered and -K to avoid
discards:

FS size		10PB	100PB	 1EB
current mkfs	26.9s	214.8s	2484s
patched		11.3s	 70.3s	 709s

In both cases, the IO profile looks identical for the initial AG
header writeout loop. The difference is that the old code then
does the RMW loop to init the AGFL, and that runs at about half the
speed. Hence runtime of the new code is reduce by around 65-70%
simply by avoiding all that IO.


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mkfs/xfs_mkfs.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index c153592c705e..d70fbdb6b15a 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3374,7 +3374,7 @@ initialise_ag_headers(
 	struct xfs_mount	*mp,
 	struct xfs_sb		*sbp,
 	xfs_agnumber_t		agno,
-	int			*worst_freelist)
+	int			*freelist_size)
 {
 	struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
 	struct xfs_agfl		*agfl;
@@ -3453,8 +3453,22 @@ initialise_ag_headers(
 		agf->agf_longest = cpu_to_be32(agsize -
 			XFS_FSB_TO_AGBNO(mp, cfg->logstart) - cfg->logblocks);
 	}
-	if (libxfs_alloc_min_freelist(mp, pag) > *worst_freelist)
-		*worst_freelist = libxfs_alloc_min_freelist(mp, pag);
+
+	/*
+	 * The AGFL size is the same for all AGs because all AGs have the same
+	 * layout. If this AG sameness ever changes in the future, we'll need to
+	 * revisit how we initialise the AGFLs.
+	 */
+	if (*freelist_size == 0)
+		*freelist_size = libxfs_alloc_min_freelist(mp, pag);
+	else if (*freelist_size < libxfs_alloc_min_freelist(mp, pag)) {
+		fprintf(stderr,
+_("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
+			progname, libxfs_alloc_min_freelist(mp, pag),
+			agno, *freelist_size);
+		exit(1);
+	}
+
 	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
 
 	/*
@@ -3724,14 +3738,14 @@ static void
 initialise_ag_freespace(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
-	int			worst_freelist)
+	int			freelist_size)
 {
 	struct xfs_alloc_arg	args;
 	struct xfs_trans	*tp;
 	struct xfs_trans_res tres = {0};
 	int			c;
 
-	c = libxfs_trans_alloc(mp, &tres, worst_freelist, 0, 0, &tp);
+	c = libxfs_trans_alloc(mp, &tres, freelist_size, 0, 0, &tp);
 	if (c)
 		res_failed(c);
 
@@ -3797,7 +3811,7 @@ main(
 	int			quiet = 0;
 	char			*protofile = NULL;
 	char			*protostring = NULL;
-	int			worst_freelist = 0;
+	int			freelist_size = 0;
 
 	struct libxfs_xinit	xi = {
 		.isdirect = LIBXFS_DIRECT,
@@ -4025,16 +4039,12 @@ main(
 	}
 
 	/*
-	 * Initialise all the static on disk metadata.
+	 * Initialise all the AG headers on disk.
 	 */
-	for (agno = 0; agno < cfg.agcount; agno++)
-		initialise_ag_headers(&cfg, mp, sbp, agno, &worst_freelist);
-
-	/*
-	 * Initialise the freespace freelists (i.e. AGFLs) in each AG.
-	 */
-	for (agno = 0; agno < cfg.agcount; agno++)
-		initialise_ag_freespace(mp, agno, worst_freelist);
+	for (agno = 0; agno < cfg.agcount; agno++) {
+		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size);
+		initialise_ag_freespace(mp, agno, freelist_size);
+	}
 
 	/*
 	 * Allocate the root inode and anything else in the proto file.

From patchwork Wed Sep  5 08:19:31 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 10588419
Return-Path: <linux-xfs-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 368E115E9
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:47 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1E85928D04
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:47 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 124BA29789; Wed,  5 Sep 2018 08:19:47 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3374828D04
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:46 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727775AbeIEMss (ORCPT
        <rfc822;patchwork-linux-xfs@patchwork.kernel.org>);
        Wed, 5 Sep 2018 08:48:48 -0400
Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:25505 "EHLO
        ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1727742AbeIEMss (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Wed, 5 Sep 2018 08:48:48 -0400
Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard)
 ([59.167.129.252])
  by ipmail06.adl6.internode.on.net with ESMTP; 05 Sep 2018 17:49:38 +0930
Received: from discord.disaster.area ([192.168.1.111])
        by dastard with esmtp (Exim 4.80)
        (envelope-from <david@fromorbit.com>)
        id 1fxT2G-0003ND-49
        for linux-xfs@vger.kernel.org; Wed, 05 Sep 2018 18:19:36 +1000
Received: from dave by discord.disaster.area with local (Exim 4.91)
        (envelope-from <david@fromorbit.com>)
        id 1fxT2G-0007Gg-2r
        for linux-xfs@vger.kernel.org; Wed, 05 Sep 2018 18:19:36 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 3/4] mkfs: introduce new delayed write buffer list
Date: Wed,  5 Sep 2018 18:19:31 +1000
Message-Id: <20180905081932.27478-4-david@fromorbit.com>
X-Mailer: git-send-email 2.17.0
In-Reply-To: <20180905081932.27478-1-david@fromorbit.com>
References: <20180905081932.27478-1-david@fromorbit.com>
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

From: Dave Chinner <dchinner@redhat.com>

Similar to the kernel concept of delayed write buffers, modify the
xfs_buf to have an internal list head we can use to park dirty
buffers we need to write back for later processing. This enables us
to control writeback directly, rather than have it occur as a side
effect of buffer cache LRU pressure.

Because the whole transaction subsystem is different in userspace,
we need to pass the delwri list to the commit code so that it can
add the buffers dirtied in the transaction to the delwri list rather
than writing them back immediately. This is really special case code
for mkfs because we don't have a proper metadata writeback setup
like we do in the kernel. It's a crutch to enable mkfs to do
async writeback, nothing more.

By itself, this change does not improve performance - IO
dispatch from mkfs is still synchronous, so it can't drive a queue
depth of more than 1. But we now have batched writeback....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/xfs_trans.h |  2 ++
 libxfs/libxfs_io.h  |  6 +++++
 libxfs/rdwr.c       | 65 +++++++++++++++++++++++++++++++++++++++++++++
 libxfs/trans.c      | 47 ++++++++++++++++++++++++--------
 mkfs/xfs_mkfs.c     | 37 +++++++++++++++-----------
 5 files changed, 131 insertions(+), 26 deletions(-)

diff --git a/include/xfs_trans.h b/include/xfs_trans.h
index 63972e4fff0f..25de8b7c757c 100644
--- a/include/xfs_trans.h
+++ b/include/xfs_trans.h
@@ -84,6 +84,8 @@ int	libxfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
 			   struct xfs_trans **tpp);
 int	libxfs_trans_alloc_empty(struct xfs_mount *mp, struct xfs_trans **tpp);
 int	libxfs_trans_commit(struct xfs_trans *);
+int	libxfs_trans_commit_delwri(struct xfs_trans *tp,
+				   struct list_head *delwri_list);
 void	libxfs_trans_cancel(struct xfs_trans *);
 struct xfs_buf *libxfs_trans_getsb(struct xfs_trans *, struct xfs_mount *, int);
 
diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index 12064d798a2d..c69cc7cd7ec5 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -66,6 +66,7 @@ typedef struct xfs_buf {
 	struct xfs_buf_map	*b_maps;
 	struct xfs_buf_map	__b_map;
 	int			b_nmaps;
+	struct list_head	b_list;
 #ifdef XFS_BUF_TRACING
 	struct list_head	b_lock_list;
 	const char		*b_func;
@@ -81,6 +82,7 @@ enum xfs_buf_flags_t {	/* b_flags bits */
 	LIBXFS_B_UPTODATE	= 0x0008,	/* buffer is sync'd to disk */
 	LIBXFS_B_DISCONTIG	= 0x0010,	/* discontiguous buffer */
 	LIBXFS_B_UNCHECKED	= 0x0020,	/* needs verification */
+	LIBXFS_B_DELWRI_Q	= 0x0040,	/* buffer is on a delwri list */
 };
 
 #define XFS_BUF_DADDR_NULL		((xfs_daddr_t) (-1LL))
@@ -168,6 +170,10 @@ extern void	libxfs_putbuf (xfs_buf_t *);
 
 #endif
 
+extern void	libxfs_buf_delwri_add(struct xfs_buf *bp, int flags,
+			struct list_head *delwri_list);
+extern int	libxfs_buf_delwri_flush(struct list_head *delwri_list);
+
 extern void	libxfs_readbuf_verify(struct xfs_buf *bp,
 			const struct xfs_buf_ops *ops);
 extern xfs_buf_t *libxfs_getsb(struct xfs_mount *, int);
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 14a4633e9fa6..7fbaae571abe 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -579,6 +579,7 @@ __initbuf(xfs_buf_t *bp, struct xfs_buftarg *btp, xfs_daddr_t bno,
 	bp->b_holder = 0;
 	bp->b_recur = 0;
 	bp->b_ops = NULL;
+	list_head_init(&bp->b_list);
 
 	if (!bp->b_maps) {
 		bp->b_nmaps = 1;
@@ -1196,6 +1197,70 @@ libxfs_writebuf(xfs_buf_t *bp, int flags)
 	return 0;
 }
 
+void
+libxfs_buf_delwri_add(
+	struct xfs_buf		*bp,
+	int			flags,
+	struct list_head	*delwri_list)
+{
+	if (bp->b_flags & LIBXFS_B_DELWRI_Q) {
+		libxfs_putbuf(bp);
+		return;
+	}
+
+	libxfs_writebuf_int(bp, flags);
+	bp->b_flags |= LIBXFS_B_DELWRI_Q;
+	list_add(&bp->b_list, delwri_list);
+}
+
+/*
+ * Compare function is more complex than it needs to be because
+ * the return value is only 32 bits and we are doing comparisons
+ * on 64 bit values
+ */
+static int
+xfs_buf_cmp(
+	void		*priv,
+	struct list_head *a,
+	struct list_head *b)
+{
+	struct xfs_buf	*ap = container_of(a, struct xfs_buf, b_list);
+	struct xfs_buf	*bp = container_of(b, struct xfs_buf, b_list);
+	xfs_daddr_t	diff;
+
+	diff = ap->b_maps[0].bm_bn - bp->b_maps[0].bm_bn;
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
+/* Processes entire list, but only returns the first error found */
+int
+libxfs_buf_delwri_flush(
+	struct list_head	*delwri_list)
+{
+	struct xfs_buf		*bp;
+	int			 error = 0;
+
+	list_sort(NULL, delwri_list, xfs_buf_cmp);
+	while (!list_empty(delwri_list)) {
+		bp = list_first_entry(delwri_list, struct xfs_buf, b_list);
+		list_del_init(&bp->b_list);
+		bp->b_flags &= ~LIBXFS_B_DELWRI_Q;
+		if (!bp->b_error && (bp->b_flags & LIBXFS_B_DIRTY)) {
+			int ret;
+			ret = libxfs_writebufr(bp);
+			if (ret && !error)
+				error = ret;
+		}
+		libxfs_putbuf(bp);
+	}
+	return error;
+}
+
+
 void
 libxfs_iomove(xfs_buf_t *bp, uint boff, int len, void *data, int flags)
 {
diff --git a/libxfs/trans.c b/libxfs/trans.c
index 2bb0d3b8e2d1..c3da46479efa 100644
--- a/libxfs/trans.c
+++ b/libxfs/trans.c
@@ -728,10 +728,11 @@ inode_item_done(
 
 static void
 buf_item_done(
-	xfs_buf_log_item_t	*bip)
+	struct xfs_buf_log_item	*bip,
+	struct list_head	*delwri_list)
 {
-	xfs_buf_t		*bp;
-	int			hold;
+	struct xfs_buf		*bp;
+	bool			hold;
 	extern kmem_zone_t	*xfs_buf_item_zone;
 
 	bp = bip->bli_buf;
@@ -745,7 +746,13 @@ buf_item_done(
 		fprintf(stderr, "flushing/staling buffer %p (hold=%d)\n",
 			bp, hold);
 #endif
-		libxfs_writebuf_int(bp, 0);
+		if (delwri_list) {
+			/* delwri list needs to hold on to the buffer here */
+			libxfs_buf_delwri_add(bp, 0, delwri_list);
+			hold = true;
+		} else {
+			libxfs_writebuf_int(bp, 0);
+		}
 	}
 	if (hold)
 		bip->bli_flags &= ~XFS_BLI_HOLD;
@@ -757,7 +764,8 @@ buf_item_done(
 
 static void
 trans_committed(
-	xfs_trans_t		*tp)
+	struct xfs_trans	*tp,
+	struct list_head	*delwri_list)
 {
 	struct xfs_log_item	*lip, *next;
 
@@ -765,7 +773,7 @@ trans_committed(
 		xfs_trans_del_item(lip);
 
 		if (lip->li_type == XFS_LI_BUF)
-			buf_item_done((xfs_buf_log_item_t *)lip);
+			buf_item_done((xfs_buf_log_item_t *)lip, delwri_list);
 		else if (lip->li_type == XFS_LI_INODE)
 			inode_item_done((xfs_inode_log_item_t *)lip);
 		else {
@@ -828,11 +836,12 @@ xfs_trans_free_items(
 /*
  * Commit the changes represented by this transaction
  */
-int
-libxfs_trans_commit(
-	xfs_trans_t	*tp)
+static int
+trans_commit(
+	struct xfs_trans	*tp,
+	struct list_head	*delwri_list)
 {
-	xfs_sb_t	*sbp;
+	struct xfs_sb		*sbp;
 
 	if (tp == NULL)
 		return 0;
@@ -862,9 +871,25 @@ libxfs_trans_commit(
 #ifdef XACT_DEBUG
 	fprintf(stderr, "committing dirty transaction %p\n", tp);
 #endif
-	trans_committed(tp);
+	trans_committed(tp, delwri_list);
 
 	/* That's it for the transaction structure.  Free it. */
 	xfs_trans_free(tp);
 	return 0;
 }
+
+int
+libxfs_trans_commit(
+	struct xfs_trans	*tp)
+{
+	return trans_commit(tp, NULL);
+}
+
+int
+libxfs_trans_commit_delwri(
+	struct xfs_trans	*tp,
+	struct list_head	*delwri_list)
+{
+	return trans_commit(tp, delwri_list);
+}
+
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index d70fbdb6b15a..b751b1fcb4a3 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3374,7 +3374,8 @@ initialise_ag_headers(
 	struct xfs_mount	*mp,
 	struct xfs_sb		*sbp,
 	xfs_agnumber_t		agno,
-	int			*freelist_size)
+	int			*freelist_size,
+	struct list_head	*delwri_list)
 {
 	struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
 	struct xfs_agfl		*agfl;
@@ -3402,7 +3403,7 @@ initialise_ag_headers(
 	buf->b_ops = &xfs_sb_buf_ops;
 	memset(buf->b_addr, 0, cfg->sectorsize);
 	libxfs_sb_to_disk(buf->b_addr, sbp);
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * AG header block: freespace
@@ -3469,7 +3470,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 		exit(1);
 	}
 
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * AG freelist header block
@@ -3489,7 +3490,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 			agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
 	}
 
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * AG header block: inodes
@@ -3518,7 +3519,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 		platform_uuid_copy(&agi->agi_uuid, &sbp->sb_uuid);
 	for (c = 0; c < XFS_AGI_UNLINKED_BUCKETS; c++)
 		agi->agi_unlinked[c] = cpu_to_be32(NULLAGINO);
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * BNO btree root block
@@ -3570,7 +3571,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 	if (!arec->ar_blockcount)
 		block->bb_numrecs = 0;
 
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * CNT btree root block
@@ -3612,7 +3613,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 	if (!arec->ar_blockcount)
 		block->bb_numrecs = 0;
 
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * refcount btree root block
@@ -3626,7 +3627,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 		block = XFS_BUF_TO_BLOCK(buf);
 		memset(block, 0, cfg->blocksize);
 		libxfs_btree_init_block(mp, buf, XFS_BTNUM_REFC, 0, 0, agno, 0);
-		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 	}
 
 	/*
@@ -3639,7 +3640,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 	block = XFS_BUF_TO_BLOCK(buf);
 	memset(block, 0, cfg->blocksize);
 	libxfs_btree_init_block(mp, buf, XFS_BTNUM_INO, 0, 0, agno, 0);
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * Free INO btree root block
@@ -3652,7 +3653,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 		block = XFS_BUF_TO_BLOCK(buf);
 		memset(block, 0, cfg->blocksize);
 		libxfs_btree_init_block(mp, buf, XFS_BTNUM_FINO, 0, 0, agno, 0);
-		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 	}
 
 	/* RMAP btree root block */
@@ -3728,7 +3729,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 			be16_add_cpu(&block->bb_numrecs, 1);
 		}
 
-		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 	}
 
 	libxfs_perag_put(pag);
@@ -3738,7 +3739,8 @@ static void
 initialise_ag_freespace(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
-	int			freelist_size)
+	int			freelist_size,
+	struct list_head	*delwri_list)
 {
 	struct xfs_alloc_arg	args;
 	struct xfs_trans	*tp;
@@ -3758,7 +3760,7 @@ initialise_ag_freespace(
 
 	libxfs_alloc_fix_freelist(&args, 0);
 	libxfs_perag_put(args.pag);
-	libxfs_trans_commit(tp);
+	libxfs_trans_commit_delwri(tp, delwri_list);
 }
 
 /*
@@ -3812,6 +3814,7 @@ main(
 	char			*protofile = NULL;
 	char			*protostring = NULL;
 	int			freelist_size = 0;
+	LIST_HEAD		(delwri_list);
 
 	struct libxfs_xinit	xi = {
 		.isdirect = LIBXFS_DIRECT,
@@ -4042,9 +4045,13 @@ main(
 	 * Initialise all the AG headers on disk.
 	 */
 	for (agno = 0; agno < cfg.agcount; agno++) {
-		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size);
-		initialise_ag_freespace(mp, agno, freelist_size);
+		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size,
+				      &delwri_list);
+		initialise_ag_freespace(mp, agno, freelist_size, &delwri_list);
+		if (agno && !(agno % 100))
+			libxfs_buf_delwri_flush(&delwri_list);
 	}
+	libxfs_buf_delwri_flush(&delwri_list);
 
 	/*
 	 * Allocate the root inode and anything else in the proto file.

From patchwork Wed Sep  5 08:19:32 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dave Chinner <david@fromorbit.com>
X-Patchwork-Id: 10588417
Return-Path: <linux-xfs-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
 [172.30.200.125])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 76F555A4
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:45 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5C5C228D04
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:45 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 50DC629814; Wed,  5 Sep 2018 08:19:45 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6F6272973B
	for <patchwork-linux-xfs@patchwork.kernel.org>;
 Wed,  5 Sep 2018 08:19:43 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1727803AbeIEMsp (ORCPT
        <rfc822;patchwork-linux-xfs@patchwork.kernel.org>);
        Wed, 5 Sep 2018 08:48:45 -0400
Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:24725 "EHLO
        ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1726342AbeIEMsp (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Wed, 5 Sep 2018 08:48:45 -0400
Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard)
 ([59.167.129.252])
  by ipmail06.adl6.internode.on.net with ESMTP; 05 Sep 2018 17:49:38 +0930
Received: from discord.disaster.area ([192.168.1.111])
        by dastard with esmtp (Exim 4.80)
        (envelope-from <david@fromorbit.com>)
        id 1fxT2G-0003NI-5Q
        for linux-xfs@vger.kernel.org; Wed, 05 Sep 2018 18:19:36 +1000
Received: from dave by discord.disaster.area with local (Exim 4.91)
        (envelope-from <david@fromorbit.com>)
        id 1fxT2G-0007Gj-3y
        for linux-xfs@vger.kernel.org; Wed, 05 Sep 2018 18:19:36 +1000
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [PATCH 4/4] mkfs: Use AIO for batched writeback
Date: Wed,  5 Sep 2018 18:19:32 +1000
Message-Id: <20180905081932.27478-5-david@fromorbit.com>
X-Mailer: git-send-email 2.17.0
In-Reply-To: <20180905081932.27478-1-david@fromorbit.com>
References: <20180905081932.27478-1-david@fromorbit.com>
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-xfs.vger.kernel.org>
X-Mailing-List: linux-xfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

From: Dave Chinner <dchinner@redhat.com>

What it says. 8EB filesystem can now be made in 15 minutes on
storage that can do 100,000 IOPS. Before this patch mkfs topped out
at about 13,000 IOPS.

Proof of concept proven.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 libxfs/libxfs_io.h |   1 +
 libxfs/rdwr.c      | 170 +++++++++++++++++++++++++++++++++++++++++++--
 mkfs/Makefile      |   2 +-
 mkfs/xfs_mkfs.c    |   2 +-
 4 files changed, 169 insertions(+), 6 deletions(-)

diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index c69cc7cd7ec5..2b11d008998d 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -67,6 +67,7 @@ typedef struct xfs_buf {
 	struct xfs_buf_map	__b_map;
 	int			b_nmaps;
 	struct list_head	b_list;
+	int			b_io_count;
 #ifdef XFS_BUF_TRACING
 	struct list_head	b_lock_list;
 	const char		*b_func;
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 7fbaae571abe..5336f6a1e1de 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -20,6 +20,9 @@
 
 #include "libxfs.h"		/* for LIBXFS_EXIT_ON_FAILURE */
 
+/* XXX: all the aio stuff is hacky and proof of concept only */
+#include <libaio.h>
+
 /*
  * Important design/architecture note:
  *
@@ -1236,6 +1239,151 @@ xfs_buf_cmp(
 	return 0;
 }
 
+/* hacky AIO stuff that'll handle a single delwri queue flush at a time */
+#define MAX_AIO_EVENTS 1024
+io_context_t ctxp;
+struct iocb *iocbs[MAX_AIO_EVENTS];
+struct io_event ioevents[MAX_AIO_EVENTS];
+int aio_next;
+int aio_flight;
+
+static void
+init_aio_delwri(void)
+{
+	int i, r;
+
+	memset(&ctxp, 0, sizeof(ctxp));
+	r = io_setup(MAX_AIO_EVENTS, &ctxp);
+	if (r) {
+		printf("FAIL! io_setup returned %d\n", r);
+		exit(1);
+	}
+	for (i = 0; i < MAX_AIO_EVENTS; ++i) {
+		iocbs[i] = calloc(1, sizeof(struct iocb));
+		if (iocbs[i] == NULL) {
+			printf("failed to allocate an iocb\n");
+			exit(1);
+		}
+	}
+}
+
+int
+get_write_completions(
+	int	threshold)
+{
+	int	error = 0;
+	int	i, r;
+
+	while (aio_flight > threshold) {
+		/* gather up some completions */
+		r = io_getevents(ctxp, 1, MAX_AIO_EVENTS, ioevents, NULL);
+		if (r < 0)  {
+			printf("FAIL! io_getevents returned %d\n", r);
+			exit(1);
+		}
+
+		aio_flight -= r;
+		for (i = 0; i < r; ++i) {
+			struct xfs_buf *bp = ioevents[i].data;
+			if (ioevents[i].res < 0)
+				bp->b_error = ioevents[i].res;
+			else if (ioevents[i].res != bp->b_bcount)
+				bp->b_error = -EIO;
+
+			if (--bp->b_io_count == 0 && !bp->b_error) {
+				bp->b_flags |= LIBXFS_B_UPTODATE;
+				bp->b_flags &= ~(LIBXFS_B_DIRTY | LIBXFS_B_EXIT |
+						 LIBXFS_B_UNCHECKED);
+				libxfs_putbuf(bp);
+			} else if (!error) {
+				error = bp->b_error;
+			}
+		}
+		/* wait for a short while */
+		usleep(1000);
+	}
+	return error;
+}
+
+static int
+__aio_write(
+	struct xfs_buf  *bp,
+	void *buf,
+	int len,
+	off64_t offset)
+{
+	int	r, i;
+	int	fd = libxfs_device_to_fd(bp->b_target->dev);
+
+	i = aio_next++ % MAX_AIO_EVENTS;
+	aio_flight++;
+	bp->b_io_count++;
+
+	io_prep_pwrite(iocbs[i], fd, buf, len, offset);
+	iocbs[i]->data = bp;
+	r = io_submit(ctxp, 1, &iocbs[i]);
+	if (r != 1) {
+		aio_flight--;
+		if (bp->b_flags & LIBXFS_B_EXIT)
+			exit(1);
+		return -EIO;
+	}
+	return 0;
+
+}
+
+static int
+do_write(
+	struct xfs_buf	*bp)
+{
+	/*
+	 * we never write buffers that are marked stale. This indicates they
+	 * contain data that has been invalidated, and even if the buffer is
+	 * dirty it must *never* be written. Verifiers are wonderful for finding
+	 * bugs like this. Make sure the error is obvious as to the cause.
+	 */
+	if (bp->b_flags & LIBXFS_B_STALE) {
+		bp->b_error = -ESTALE;
+		return bp->b_error;
+	}
+
+	/*
+	 * clear any pre-existing error status on the buffer. This can occur if
+	 * the buffer is corrupt on disk and the repair process doesn't clear
+	 * the error before fixing and writing it back.
+	 */
+	bp->b_error = 0;
+	if (bp->b_ops) {
+		bp->b_ops->verify_write(bp);
+		if (bp->b_error) {
+			fprintf(stderr,
+	_("%s: write verifer failed on %s bno 0x%llx/0x%x\n"),
+				__func__, bp->b_ops->name,
+				(long long)bp->b_bn, bp->b_bcount);
+			return bp->b_error;
+		}
+	}
+
+	if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
+		bp->b_error = __aio_write(bp, bp->b_addr, bp->b_bcount,
+				    LIBXFS_BBTOOFF64(bp->b_bn));
+	} else {
+		int	i;
+		void	*buf = bp->b_addr;
+
+		for (i = 0; i < bp->b_nmaps; i++) {
+			off64_t	offset = LIBXFS_BBTOOFF64(bp->b_maps[i].bm_bn);
+			int len = BBTOB(bp->b_maps[i].bm_len);
+
+			bp->b_error = __aio_write(bp, buf, len, offset);
+			if (bp->b_error)
+				break;	/* XXX: completion wait required! */
+			buf += len;
+		}
+	}
+	return bp->b_error;
+}
+
 /* Processes entire list, but only returns the first error found */
 int
 libxfs_buf_delwri_flush(
@@ -1243,21 +1391,35 @@ libxfs_buf_delwri_flush(
 {
 	struct xfs_buf		*bp;
 	int			 error = 0;
+	static bool		aio_init = false;
+	int			ret;
+
+	if (!aio_init) {
+		init_aio_delwri();
+		aio_init = true;
+	}
 
 	list_sort(NULL, delwri_list, xfs_buf_cmp);
 	while (!list_empty(delwri_list)) {
 		bp = list_first_entry(delwri_list, struct xfs_buf, b_list);
 		list_del_init(&bp->b_list);
 		bp->b_flags &= ~LIBXFS_B_DELWRI_Q;
+		bp->b_io_count = 1;
 		if (!bp->b_error && (bp->b_flags & LIBXFS_B_DIRTY)) {
-			int ret;
-			ret = libxfs_writebufr(bp);
+			ret = do_write(bp);
 			if (ret && !error)
 				error = ret;
 		}
-		libxfs_putbuf(bp);
+		if (--bp->b_io_count == 0)
+			libxfs_putbuf(bp);
+
+		/* wait for some completions if we've dispatched lots */
+		ret = get_write_completions(MAX_AIO_EVENTS / 5 * 4);
+		if (ret && !error)
+			error = ret;
 	}
-	return error;
+	ret = get_write_completions(0);
+	return error ? error : ret;
 }
 
 
diff --git a/mkfs/Makefile b/mkfs/Makefile
index 31482b08d559..e27f10f26b14 100644
--- a/mkfs/Makefile
+++ b/mkfs/Makefile
@@ -11,7 +11,7 @@ HFILES =
 CFILES = proto.c xfs_mkfs.c
 
 LLDLIBS += $(LIBXFS) $(LIBXCMD) $(LIBFROG) $(LIBRT) $(LIBPTHREAD) $(LIBBLKID) \
-	$(LIBUUID)
+	$(LIBUUID) -laio
 LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBFROG)
 LLDFLAGS = -static-libtool-libs
 
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index b751b1fcb4a3..82e041dd1b48 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -4048,7 +4048,7 @@ main(
 		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size,
 				      &delwri_list);
 		initialise_ag_freespace(mp, agno, freelist_size, &delwri_list);
-		if (agno && !(agno % 100))
+		if (agno && !(agno % 1000))
 			libxfs_buf_delwri_flush(&delwri_list);
 	}
 	libxfs_buf_delwri_flush(&delwri_list);