From patchwork Mon Nov 3 22:04:36 2014
X-Patchwork-Submitter: NeilBrown
X-Patchwork-Id: 5220291
X-Patchwork-Delegate: snitzer@redhat.com
Date: Tue, 4 Nov 2014 09:04:36 +1100
From: NeilBrown
To: James Simmons
Cc: dm-devel@redhat.com, andreas.dilger@intel.com, bruno.faccini@intel.com
Subject: Re: [dm-devel] [PATCH] md: submit MMP reads REQ_SYNC to bypass RAID5 cache
Message-ID: <20141104090436.355a8fb1@notabene.brown>

On Mon, 3 Nov 2014 14:01:10 -0700 James Simmons wrote:

> Hello.
>
> This is a patch against the latest kernel source which is based on
> a patch used by Lustre. The below describes what we are trying to
> achieve. I like to get a feedback if this is the right approach.
>
> ----------------------------------------------------------------------
>
> The ext4 MMP block reads always need to get fresh data from the
> underlying disk. Otherwise, if a remote node is updating the MMP
> block and the reads are fetched from the MD RAID5 stripe cache,
> it is possible that the local node will incorrectly decide the
> remote node has died and allow the filesystem to be mounted on
> two nodes at the same time.

It is preferred for patches to be inline, rather than as attachments, as it
makes it easier to comment on them....

This doesn't provide a useful guarantee.
If the device that stores that block has failed, md/raid5 will read all
other devices to recover the block. If that recently happened and you just
clear the UPTODATE bit on the block, md/raid5 will recover the data from all
the other blocks, without reading them.

But considering this at a higher level: if two different nodes try to
assemble the same RAID5 array then you already potentially have a problem.
You really want some sensible cluster co-ordinator and let it make these
decisions. Hoping that a block device can be a reliable semaphore seems ...
misguided.

NeilBrown

---
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9c66e59..11b749c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2678,6 +2678,9 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 		}
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
+	} else if (bi->bi_rw & REQ_NOCACHE) {
+		/* force to read from underlying disk if requested */
+		clear_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
 	}
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",