From patchwork Mon Sep 28 10:12:14 2015
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 7276221
Date: Mon, 28 Sep 2015 20:12:14 +1000
From: Dave Chinner
To: Ross Zwisler, linux-kernel@vger.kernel.org, Alexander Viro, Matthew Wilcox,
	linux-fsdevel@vger.kernel.org, Andrew Morton, Dan Williams,
	"Kirill A. Shutemov", linux-nvdimm@lists.01.org, Jan Kara
Subject: Re: [PATCH] dax: fix deadlock in __dax_fault
Message-ID: <20150928101214.GA19114@dastard>
In-Reply-To: <20150928005904.GY19114@dastard>

On Mon, Sep 28, 2015 at 10:59:04AM +1000, Dave Chinner wrote:
> > Does this sound like a reasonable path forward for v4.3? Dave and Jan,
> > can you guys provide guidance and code reviews for the XFS and ext4
> > bits?
>
> IMO, it's way too much to get into 4.3. I'd much prefer we revert
> the bad changes in 4.3, and then work towards fixing this for the
> 4.4 merge window. If someone needs this for 4.3, then they can
> backport the 4.4 code to 4.3-stable.

FWIW, here's the first bit of making XFS clear blocks during allocation.
There are a couple of things I still need to fix (e.g. moving the zeroing
out of the transaction context, but still under the inode allocation lock
and hence atomic with the allocation), it needs to be split into at least
two patches (to separate out the xfs_imap_to_sector() helper), and there's
another patch to remove the complete_unwritten callbacks.

I also need to do more auditing and validation of the handling of
unwritten extents during get_blocks() for write faults. The get_blocks()
call in this case effectively becomes an unwritten extent conversion call
rather than an allocation call, so I need to validate both that it does
what get_block expects it to do and that I haven't broken any of the
other get_blocks() callers. So still lots to do.

Cheers,

Dave.

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 50ab287..f645587 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -586,27 +586,35 @@ xfs_add_to_ioend(
 	ioend->io_size += bh->b_size;
 }
 
-STATIC void
-xfs_map_buffer(
+sector_t
+xfs_imap_to_sector(
 	struct inode		*inode,
-	struct buffer_head	*bh,
 	struct xfs_bmbt_irec	*imap,
 	xfs_off_t		offset)
 {
-	sector_t		bn;
-	struct xfs_mount	*m = XFS_I(inode)->i_mount;
-	xfs_off_t		iomap_offset = XFS_FSB_TO_B(m, imap->br_startoff);
-	xfs_daddr_t		iomap_bn = xfs_fsb_to_db(XFS_I(inode), imap->br_startblock);
+	struct xfs_mount	*mp = XFS_I(inode)->i_mount;
+	xfs_off_t		iomap_offset;
+	xfs_daddr_t		iomap_bn;
 
 	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
 	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
 
-	bn = (iomap_bn >> (inode->i_blkbits - BBSHIFT)) +
-	      ((offset - iomap_offset) >> inode->i_blkbits);
+	iomap_bn = xfs_fsb_to_db(XFS_I(inode), imap->br_startblock);
+	iomap_offset = XFS_FSB_TO_B(mp, imap->br_startoff);
 
-	ASSERT(bn || XFS_IS_REALTIME_INODE(XFS_I(inode)));
+	return iomap_bn + BTOBB(offset - iomap_offset);
+}
 
-	bh->b_blocknr = bn;
+STATIC void
+xfs_map_buffer(
+	struct inode		*inode,
+	struct buffer_head	*bh,
+	struct xfs_bmbt_irec	*imap,
+	xfs_off_t		offset)
+{
+	bh->b_blocknr = xfs_imap_to_sector(inode, imap, offset) >>
+				(inode->i_blkbits - BBSHIFT);
+	ASSERT(bh->b_blocknr || XFS_IS_REALTIME_INODE(XFS_I(inode)));
 	set_buffer_mapped(bh);
 }
 
@@ -617,11 +625,7 @@ xfs_map_at_offset(
 	struct xfs_bmbt_irec	*imap,
 	xfs_off_t		offset)
 {
-	ASSERT(imap->br_startblock != HOLESTARTBLOCK);
-	ASSERT(imap->br_startblock != DELAYSTARTBLOCK);
-
 	xfs_map_buffer(inode, bh, imap, offset);
-	set_buffer_mapped(bh);
 	clear_buffer_delay(bh);
 	clear_buffer_unwritten(bh);
 }
@@ -1396,7 +1400,8 @@ __xfs_get_blocks(
 	if (create &&
 	    (!nimaps ||
 	     (imap.br_startblock == HOLESTARTBLOCK ||
-	      imap.br_startblock == DELAYSTARTBLOCK))) {
+	      imap.br_startblock == DELAYSTARTBLOCK) ||
+	     (IS_DAX(inode) && ISUNWRITTEN(&imap)))) {
 		if (direct || xfs_get_extsz_hint(ip)) {
 			/*
 			 * Drop the ilock in preparation for starting the block
@@ -1441,6 +1446,9 @@
 			goto out_unlock;
 	}
 
+	if (IS_DAX(inode))
+		ASSERT(!ISUNWRITTEN(&imap));
+
 	/* trim mapping down to size requested */
 	if (direct || size > (1 << inode->i_blkbits))
 		xfs_map_trim_size(inode, iblock, bh_result,
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 86afd1a..ede1025 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -18,6 +18,8 @@
 #ifndef __XFS_AOPS_H__
 #define __XFS_AOPS_H__
 
+struct xfs_bmbt_irec;
+
 extern mempool_t *xfs_ioend_pool;
 
 /*
@@ -62,4 +64,7 @@ void xfs_end_io_dax_write(struct buffer_head *bh, int uptodate);
 
 extern void xfs_count_page_state(struct page *, int *, int *);
 
+sector_t xfs_imap_to_sector(struct inode *inode, struct xfs_bmbt_irec *imap,
+			xfs_off_t offset);
+
 #endif /* __XFS_AOPS_H__ */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1f86033..277cd82 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -131,6 +131,7 @@ xfs_iomap_write_direct(
 	uint		qblocks, resblks, resrtextents;
 	int		committed;
 	int		error;
+	int		bmapi_flags = XFS_BMAPI_PREALLOC;
 
 	error = xfs_qm_dqattach(ip, 0);
 	if (error)
@@ -196,13 +197,26 @@ xfs_iomap_write_direct(
 	xfs_trans_ijoin(tp, ip, 0);
 
 	/*
+	 * For DAX, we do not allocate unwritten extents, but instead we zero
+	 * the block before we commit the transaction.  Hence if we are mapping
+	 * unwritten extents here, we need to convert them to written so that we
+	 * don't need an unwritten extent callback here.
+	 *
+	 * Block zeroing for DAX is effectively a memset operation and so should
+	 * not block on anything when we call it after the block allocation or
+	 * conversion before we commit the transaction.
+	 */
+	if (IS_DAX(VFS_I(ip)))
+		bmapi_flags = XFS_BMAPI_CONVERT;
+
+	/*
 	 * From this point onwards we overwrite the imap pointer that the
 	 * caller gave to us.
 	 */
 	xfs_bmap_init(&free_list, &firstfsb);
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
-				XFS_BMAPI_PREALLOC, &firstfsb, 0,
+				bmapi_flags, &firstfsb, 0,
 				imap, &nimaps, &free_list);
 	if (error)
 		goto out_bmap_cancel;
@@ -213,6 +227,28 @@ xfs_iomap_write_direct(
 	error = xfs_bmap_finish(&tp, &free_list, &committed);
 	if (error)
 		goto out_bmap_cancel;
+
+	/* DAX needs to zero the entire allocated extent here */
+	if (IS_DAX(VFS_I(ip)) && nimaps) {
+		sector_t sector = xfs_imap_to_sector(VFS_I(ip), imap, offset);
+
+		ASSERT(!ISUNWRITTEN(imap));
+		ASSERT(nimaps == 1);
+		error = dax_clear_blocks(VFS_I(ip),
+				sector >> (VFS_I(ip)->i_blkbits - BBSHIFT),
+				XFS_FSB_TO_B(mp, imap->br_blockcount));
+		if (error) {
+			xfs_warn(mp,
+	"err %d, off/cnt %lld/%ld, sector %ld, bytes %lld, im.stblk %lld, im.stoff %lld, im.blkcnt %lld",
+				error, offset, count,
+				xfs_imap_to_sector(VFS_I(ip), imap, offset),
+				XFS_FSB_TO_B(mp, imap->br_blockcount),
+				imap->br_startblock, imap->br_startoff,
+				imap->br_blockcount);
+			goto out_trans_cancel;
+		}
+	}
+
 	error = xfs_trans_commit(tp);
 	if (error)
 		goto out_unlock;