From patchwork Sun Nov 4 13:50:20 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sage Weil
X-Patchwork-Id: 1694301
Return-Path:
X-Original-To: patchwork-ceph-devel@patchwork.kernel.org
Delivered-To: patchwork-process-083081@patchwork2.kernel.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by patchwork2.kernel.org (Postfix) with ESMTP id 2A669DF230
	for ; Sun, 4 Nov 2012 13:50:23 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754164Ab2KDNuW (ORCPT );
	Sun, 4 Nov 2012 08:50:22 -0500
Received: from cobra.newdream.net ([66.33.216.30]:42566 "EHLO
	cobra.newdream.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753779Ab2KDNuV (ORCPT );
	Sun, 4 Nov 2012 08:50:21 -0500
Received: from cobra.newdream.net (localhost [127.0.0.1])
	by cobra.newdream.net (Postfix) with ESMTP id 12BDF8004F;
	Sun, 4 Nov 2012 05:50:21 -0800 (PST)
Received: by cobra.newdream.net (Postfix, from userid 1031)
	id 03574800BB; Sun, 4 Nov 2012 05:50:20 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
	by cobra.newdream.net (Postfix) with ESMTP id E92998004F;
	Sun, 4 Nov 2012 05:50:20 -0800 (PST)
Date: Sun, 4 Nov 2012 05:50:20 -0800 (PST)
From: Sage Weil
X-X-Sender: sage@cobra.newdream.net
To: Nick Bartos
cc: ceph-devel@vger.kernel.org
Subject: Re: Ignoresync hack no longer applies on 3.6.5
In-Reply-To:
Message-ID:
References:
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
MIME-Version: 1.0
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: ceph-devel@vger.kernel.org

On Fri, 2 Nov 2012, Nick Bartos wrote:
> Sage,
>
> A while back you gave us a small kernel hack which allowed us to mount
> the underlying OSD xfs filesystems in a way that they would ignore
> system wide syncs (kernel hack + mounting with the reused "mand"
> option), to workaround a deadlock problem when mounting an rbd on the
> same node that holds osds and
> monitors. Somewhere between 3.5.6 and
> 3.6.5, things changed enough that the patch no longer applies.
>
> Looking into it a bit more, sync_one_sb and sync_supers no longer
> exist. In commit f0cd2dbb6cf387c11f87265462e370bb5469299e which
> removes sync_supers:
>
>     vfs: kill write_super and sync_supers
>
>     Finally we can kill the 'sync_supers' kernel thread along with the
>     '->write_super()' superblock operation because all the users are
>     gone. Now every file-system is supposed to self-manage own
>     superblock and its dirty state.
>
>     The nice thing about killing this thread is that it improves power
>     management. Indeed, 'sync_supers' is a source of monotonic system
>     wake-ups - it woke up every 5 seconds no matter what - even if
>     there were no dirty superblocks and even if there were no
>     file-systems using this service (e.g., btrfs and journalled ext4
>     do not need it). So it was wasting power most of the time. And
>     because the thread was in the core of the kernel, all systems had
>     to have it. So I am quite happy to make it go away.
>
>     Interestingly, this thread is a left-over from the pdflush kernel
>     thread which was a self-forking kernel thread responsible for all
>     the write-back in old Linux kernels. It was turned into per-block
>     device BDI threads, and 'sync_supers' was a left-over. Thus,
>     R.I.P, pdflush as well.
>
> Also commit b3de653105180b57af90ef2f5b8441f085f4ff56 renames
> sync_one_sb to sync_inodes_one_sb along with some other
> changes.
>
> Assuming that the deadlock problem is still present in 3.6.5, could we
> trouble you for an updated patch? Here's the original patch you gave
> us for reference:

Below. Compile-tested only!

However, looking over the code, I'm not sure that the deadlock potential
still exists. Looking over the stack traces you sent way back when, I'm
not sure exactly which lock it was blocked on.
If this was easily reproducible before, you might try running without the
patch to see if this is still a problem for your configuration. And if it
does happen, capture a fresh dump (echo t > /proc/sysrq-trigger).

Thanks!
sage


From 6cbfe169ece1943fee1159dd78c202e613098715 Mon Sep 17 00:00:00 2001
From: Sage Weil
Date: Sun, 4 Nov 2012 05:34:40 -0800
Subject: [PATCH] vfs hack: make sync skip supers with MS_MANDLOCK

This is an ugly hack to skip certain mounts when there is a sync(2) system
call.

A less ugly version would create a new mount flag for this, but it would
require modifying mount(8) too, and that's too much work.

A curious person would ask WTF this is for. It is a kludge to avoid a
deadlock induced when an RBD or Ceph mount is backed by a local ceph-osd
on a local fs. An ill-timed sync(2) call by whoever can leave a
ceph-dependent mount waiting on writeback, while something would prevent
the ceph-osd from doing its own sync(2) on its backing fs.
---
 fs/sync.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index eb8722d..ab474a0 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -75,8 +75,12 @@ static void sync_inodes_one_sb(struct super_block *sb, void *arg)
 
 static void sync_fs_one_sb(struct super_block *sb, void *arg)
 {
-	if (!(sb->s_flags & MS_RDONLY) && sb->s_op->sync_fs)
-		sb->s_op->sync_fs(sb, *(int *)arg);
+	if (!(sb->s_flags & MS_RDONLY) && sb->s_op->sync_fs) {
+		if (sb->s_flags & MS_MANDLOCK)
+			pr_debug("sync_fs_one_sb skipping %p\n", sb);
+		else
+			sb->s_op->sync_fs(sb, *(int *)arg);
+	}
 }
 
 static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
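
For anyone who wants to reason about the patched branch without building a
kernel, the decision made in sync_fs_one_sb() can be modelled in userspace.
This is only a rough sketch and not part of the patch: struct sketch_sb, the
SKETCH_* constants, the counting callback and the 0/1 return value are all
made up for illustration. The constants are assumed to carry the same values
as the kernel's MS_RDONLY and MS_MANDLOCK flags.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel's MS_RDONLY and MS_MANDLOCK mount flags
 * (assumed to match the values in the Linux headers). */
#define SKETCH_MS_RDONLY   1UL
#define SKETCH_MS_MANDLOCK 64UL

/* Hypothetical, heavily reduced model of struct super_block. */
struct sketch_sb {
	unsigned long s_flags;
	int (*sync_fs)(struct sketch_sb *sb, int wait);	/* may be NULL */
	int sync_calls;		/* how many times sync_fs actually ran */
};

/* Example sync_fs callback that only records that it was invoked. */
static int sketch_count_sync(struct sketch_sb *sb, int wait)
{
	(void)wait;
	sb->sync_calls++;
	return 0;
}

/*
 * Mirrors the control flow of the patched sync_fs_one_sb().
 * Returns 1 if sync_fs was called, 0 if the superblock was skipped
 * (the real function returns void; the return value is for testing).
 */
static int sketch_sync_fs_one_sb(struct sketch_sb *sb, int wait)
{
	assert(sb != NULL);

	/* Unchanged behaviour: read-only or no sync_fs op means no work. */
	if ((sb->s_flags & SKETCH_MS_RDONLY) || !sb->sync_fs)
		return 0;

	/* New behaviour: a "mand" mount is skipped by a global sync. */
	if (sb->s_flags & SKETCH_MS_MANDLOCK)
		return 0;

	sb->sync_fs(sb, wait);
	return 1;
}
```

Calling sketch_sync_fs_one_sb() on a superblock whose s_flags contains
SKETCH_MS_MANDLOCK leaves sync_calls untouched, which is the userspace
analogue of the pr_debug("... skipping ...") branch in the diff above. Note
that the hack only touches the sync_fs pass; whether that is sufficient on a
given kernel is exactly what the fresh sysrq dump should show.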