diff mbox

Linux-next parallel cp workload hang

Message ID 20160518141726.GY26977@dastard (mailing list archive)
State New, archived
Headers show

Commit Message

Dave Chinner May 18, 2016, 2:17 p.m. UTC
On Wed, May 18, 2016 at 07:46:17PM +0800, Xiong Zhou wrote:
> 
> On Wed, May 18, 2016 at 07:54:09PM +1000, Dave Chinner wrote:
> > On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote:
> > > Hi,
> > > 
> > > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> > > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > > > > Hi,
> > > > > 
> > > > > Parallel cp workload (xfstests generic/273) hangs like blow.
> > > > > It's reproducible with a small chance, less the 1/100 i think.
> > > > > 
> > > > > Have hit this in linux-next 20160504 0506 0510 trees, testing on
> > > > > xfs with loop or block device. Ext4 survived several rounds
> > > > > of testing.
> > > > > 
> > > > > Linux next 20160510 tree hangs within 500 rounds testing several
> > > > > times. The same tree with vfs parallel lookup patchset reverted
> > > > > survived 900 rounds testing. Reverted commits are attached.  > 

Ok, this is trivial to reproduce. Al - I've hit this 9 times out of
ten running it on a 4p VM with a pair of 4GB ram disks using all
the current upstream default mkfs and mount configurations. On the
tenth attempt I got the tracing to capture what I needed to see -
process 7340 was the last xfs_buf_lock_done trace without an unlock
trace, and that process had this trace:

schedule
rwsem_down_read_failed
call_rwsem_down_read_failed
down_read
xfs_ilock
xfs_ilock_data_map_shared
xfs_dir2_leaf_getdents
xfs_readdir
xfs_file_readdir
iterate_dir
SyS_getdents
entry_SYSCALL_64_fastpath

Which means it's holding a buffer lock while trying to get the
ilock(shared). That's never going to end well - I'm now wondering
why lockdep hasn't been all over this lock order inversion....

Essentially, it's a three-process deadlock involving
shared/exclusive barriers and inverted lock orders. It's a
pre-existing problem with buffer mapping lock orders, nothing to do
with with the VFS parallelisation code.

process 1		process 2		process 3
---------		---------		---------
readdir
iolock(shared)
  get_leaf_dents
    iterate entries
       ilock(shared)
       map, lock and read buffer
       iunlock(shared)
       process entries in buffer
       .....
       						readdir
						iolock(shared)
						  get_leaf_dents
						    iterate entries
						      ilock(shared)
						      map, lock buffer
						      <blocks>
       			finish ->iterate_shared
			file_accessed()
			  ->update_time
			    start transaction
			    ilock(excl)
			    <blocks>
	.....
	finishes processing buffer
	get next buffer
	  ilock(shared)
	  <blocks>

And that's the deadlock.

Now I know what the problem is I can say that process 2 - the
transactional timestamp update - is the reason the readdir
operations are blocking like this. And I know why CXFS never hit
this - it doesn't use the VFS paths, so the VFS calls to update
timestamps don't exist during concurrent readdir operations on the
CXFS metadata server. Hence process 2 doesn't exist and no exclusive
barriers are put in amongst the shared locking....

Patch below should fix the deadlock.

Cheers,

Dave.

Comments

Dave Chinner May 18, 2016, 11:02 p.m. UTC | #1
On Thu, May 19, 2016 at 12:17:26AM +1000, Dave Chinner wrote:
> Patch below should fix the deadlock.

The test has been running for several hours without failure using
this patch, so I'd say this fixes the problem...

Cheers,

Dave.
Murphy Zhou May 19, 2016, 6:22 a.m. UTC | #2
On Thu, May 19, 2016 at 09:02:31AM +1000, Dave Chinner wrote:
> On Thu, May 19, 2016 at 12:17:26AM +1000, Dave Chinner wrote:
> > Patch below should fix the deadlock.
> 
> The test has been running for several hours without failure using
> this patch, so I'd say this fixes the problem...

Yes, the same for me.

Thanks.

Xiong

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 93b3ab0..21501dc 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -273,10 +273,11 @@  xfs_dir2_leaf_readbuf(
 	size_t			bufsize,
 	struct xfs_dir2_leaf_map_info *mip,
 	xfs_dir2_off_t		*curoff,
-	struct xfs_buf		**bpp)
+	struct xfs_buf		**bpp,
+	bool			trim_map)
 {
 	struct xfs_inode	*dp = args->dp;
-	struct xfs_buf		*bp = *bpp;
+	struct xfs_buf		*bp = NULL;
 	struct xfs_bmbt_irec	*map = mip->map;
 	struct blk_plug		plug;
 	int			error = 0;
@@ -286,13 +287,10 @@  xfs_dir2_leaf_readbuf(
 	struct xfs_da_geometry	*geo = args->geo;
 
 	/*
-	 * If we have a buffer, we need to release it and
-	 * take it out of the mapping.
+	 * If the caller just finished processing a buffer, it will tell us
+	 * we need to trim that block out of the mapping now it is done.
 	 */
-
-	if (bp) {
-		xfs_trans_brelse(NULL, bp);
-		bp = NULL;
+	if (trim_map) {
 		mip->map_blocks -= geo->fsbcount;
 		/*
 		 * Loop to get rid of the extents for the
@@ -533,10 +531,17 @@  xfs_dir2_leaf_getdents(
 		 */
 		if (!bp || ptr >= (char *)bp->b_addr + geo->blksize) {
 			int	lock_mode;
+			bool	trim_map = false;
+
+			if (bp) {
+				xfs_trans_brelse(NULL, bp);
+				bp = NULL;
+				trim_map = true;
+			}
 
 			lock_mode = xfs_ilock_data_map_shared(dp);
 			error = xfs_dir2_leaf_readbuf(args, bufsize, map_info,
-						      &curoff, &bp);
+						      &curoff, &bp, trim_map);
 			xfs_iunlock(dp, lock_mode);
 			if (error || !map_info->map_valid)
 				break;