From patchwork Wed May 18 14:17:26 2016
X-Patchwork-Submitter: Dave Chinner
X-Patchwork-Id: 9119751
Date: Thu, 19 May 2016 00:17:26 +1000
From: Dave Chinner
To: Xiong Zhou
Cc: linux-next@vger.kernel.org, viro@zeniv.linux.org.uk,
 linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: Linux-next parallel cp workload hang
Message-ID: <20160518141726.GY26977@dastard>
References: <20160518014615.GA5302@ZX.nay.redhat.com>
 <20160518055634.GW26977@dastard>
 <20160518083150.GB6551@dhcp12-144.nay.redhat.com>
 <20160518095409.GX26977@dastard>
 <20160518114617.GC6551@dhcp12-144.nay.redhat.com>
In-Reply-To: <20160518114617.GC6551@dhcp12-144.nay.redhat.com>
List-ID: <linux-fsdevel.vger.kernel.org>

On Wed, May 18, 2016 at 07:46:17PM +0800, Xiong Zhou wrote:
> On Wed, May 18, 2016 at 07:54:09PM +1000, Dave Chinner wrote:
> > On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote:
> > > Hi,
> > >
> > > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> > > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > > > > Hi,
> > > > >
> > > > > Parallel cp workload (xfstests generic/273) hangs like below.
> > > > > It's reproducible with a small chance, less than 1/100 I think.
> > > > >
> > > > > Have hit this in linux-next 20160504, 0506 and 0510 trees, testing
> > > > > on xfs with loop or block device. Ext4 survived several rounds
> > > > > of testing.
> > > > >
> > > > > The linux-next 20160510 tree hangs within 500 rounds of testing
> > > > > several times. The same tree with the vfs parallel lookup patchset
> > > > > reverted survived 900 rounds of testing. Reverted commits are
> > > > > attached.

Ok, this is trivial to reproduce. Al - I've hit this nine times out of
ten running it on a 4p VM with a pair of 4GB ram disks using all the
current upstream default mkfs and mount configurations.
On the tenth attempt I got the tracing to capture what I needed to see -
process 7340 was the last xfs_buf_lock_done trace without an unlock
trace, and that process had this trace:

  schedule
  rwsem_down_read_failed
  call_rwsem_down_read_failed
  down_read
  xfs_ilock
  xfs_ilock_data_map_shared
  xfs_dir2_leaf_getdents
  xfs_readdir
  xfs_file_readdir
  iterate_dir
  SyS_getdents
  entry_SYSCALL_64_fastpath

Which means it's holding a buffer lock while trying to get the
ilock(shared). That's never going to end well - I'm now wondering why
lockdep hasn't been all over this lock order inversion....

Essentially, it's a three-process deadlock involving shared/exclusive
barriers and inverted lock orders. It's a pre-existing problem with
buffer mapping lock orders, nothing to do with the VFS parallelisation
code.

process 1               process 2               process 3
---------               ---------               ---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
      ilock(shared)
      map, lock and read buffer
      iunlock(shared)
      process entries in buffer
      .....
                        readdir
                        iolock(shared)
                          get_leaf_dents
                            iterate entries
                              ilock(shared)
                              map, lock buffer
                                                finish ->iterate_shared
                                                file_accessed()
                                                  ->update_time
                                                    start transaction
                                                    ilock(excl)
                                                    .....
      finishes processing buffer
      get next buffer
        ilock(shared)

And that's the deadlock. Now that I know what the problem is, I can say
that process 2 - the transactional timestamp update - is the reason the
readdir operations are blocking like this.

And I know why CXFS never hit this - it doesn't use the VFS paths, so
the VFS calls to update timestamps don't exist during concurrent
readdir operations on the CXFS metadata server. Hence process 2 doesn't
exist and no exclusive barriers are put in amongst the shared
locking....

Patch below should fix the deadlock.

Cheers,

Dave.
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 93b3ab0..21501dc 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -273,10 +273,11 @@ xfs_dir2_leaf_readbuf(
 	size_t			bufsize,
 	struct xfs_dir2_leaf_map_info *mip,
 	xfs_dir2_off_t		*curoff,
-	struct xfs_buf		**bpp)
+	struct xfs_buf		**bpp,
+	bool			trim_map)
 {
 	struct xfs_inode	*dp = args->dp;
-	struct xfs_buf		*bp = *bpp;
+	struct xfs_buf		*bp = NULL;
 	struct xfs_bmbt_irec	*map = mip->map;
 	struct blk_plug		plug;
 	int			error = 0;
@@ -286,13 +287,10 @@ xfs_dir2_leaf_readbuf(
 	struct xfs_da_geometry	*geo = args->geo;
 
 	/*
-	 * If we have a buffer, we need to release it and
-	 * take it out of the mapping.
+	 * If the caller just finished processing a buffer, it will tell us
+	 * we need to trim that block out of the mapping now it is done.
 	 */
-
-	if (bp) {
-		xfs_trans_brelse(NULL, bp);
-		bp = NULL;
+	if (trim_map) {
 		mip->map_blocks -= geo->fsbcount;
 		/*
 		 * Loop to get rid of the extents for the
@@ -533,10 +531,17 @@ xfs_dir2_leaf_getdents(
 	 */
 	if (!bp || ptr >= (char *)bp->b_addr + geo->blksize) {
 		int	lock_mode;
+		bool	trim_map = false;
+
+		if (bp) {
+			xfs_trans_brelse(NULL, bp);
+			bp = NULL;
+			trim_map = true;
+		}
 
 		lock_mode = xfs_ilock_data_map_shared(dp);
 		error = xfs_dir2_leaf_readbuf(args, bufsize, map_info,
-					      &curoff, &bp);
+					      &curoff, &bp, trim_map);
 		xfs_iunlock(dp, lock_mode);
 		if (error || !map_info->map_valid)
 			break;