
md: submit MMP reads REQ_SYNC to bypass RAID5 cache

Message ID 20141104090436.355a8fb1@notabene.brown (mailing list archive)
State Not Applicable, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

NeilBrown Nov. 3, 2014, 10:04 p.m. UTC
On Mon, 3 Nov 2014 14:01:10 -0700 James Simmons <uja.ornl@gmail.com> wrote:

> Hello.
> 
>    This is a patch against the latest kernel source, based on a
> patch used by Lustre. The description below explains what we are
> trying to achieve. I would like feedback on whether this is the
> right approach.
> 
> ----------------------------------------------------------------------
> 
> The ext4 MMP block reads always need to get fresh data from the
> underlying disk.  Otherwise, if a remote node is updating the MMP
> block and the reads are fetched from the MD RAID5 stripe cache,
> it is possible that the local node will incorrectly decide the
> remote node has died and allow the filesystem to be mounted on
> two nodes at the same time.

It is preferred for patches to be inline, rather than as attachments, as it
makes it easier to comment on them....



This doesn't provide a useful guarantee.  If the device that stores that
block has failed, md/raid5 will read all the other devices to recover the
block.
If that happened recently and you then clear the UPTODATE bit on that block,
md/raid5 will recover the data from the other cached blocks without
re-reading them.

But considering this at a higher level: if two different nodes try to
assemble the same RAID5 array then you already potentially have a problem.
You really want some sensible cluster co-ordinator to make these
decisions.  Hoping that a block device can be a reliable semaphore seems ...
misguided.

NeilBrown
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Faccini, Bruno Nov. 4, 2014, 8:21 a.m. UTC | #1
Hello Neil,
Thanks for your quick answer and for pointing us to the reconstruct case we need to handle!
If I understand you correctly, we also need to clear the UPTODATE flag for all the other blocks sharing the same parity information, right?
Concerning your remark that using a block device as a semaphore is not 100% safe, my understanding is that an appropriate MMP feature algorithm can make it safe enough…
Bruno.

On Nov 3, 2014, at 23:04, NeilBrown <neilb@suse.de> wrote:

> On Mon, 3 Nov 2014 14:01:10 -0700 James Simmons <uja.ornl@gmail.com> wrote:
> 
>> Hello.
>> 
>>   This is a patch against the latest kernel source, based on a
>> patch used by Lustre. The description below explains what we are
>> trying to achieve. I would like feedback on whether this is the
>> right approach.
>> 
>> ----------------------------------------------------------------------
>> 
>> The ext4 MMP block reads always need to get fresh data from the
>> underlying disk.  Otherwise, if a remote node is updating the MMP
>> block and the reads are fetched from the MD RAID5 stripe cache,
>> it is possible that the local node will incorrectly decide the
>> remote node has died and allow the filesystem to be mounted on
>> two nodes at the same time.
> 
> It is preferred for patches to be inline, rather than as attachments, as it
> makes it easier to comment on them....
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 9c66e59..11b749c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2678,6 +2678,9 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
> 		}
> 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
> 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
> +	} else if (bi->bi_rw & REQ_NOCACHE) {
> +		/* force to read from underlying disk if requested */
> +		clear_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
> 	}
> 
> 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> 
> 
> This doesn't provide a useful guarantee.  If the device that stores that
> block has failed, md/raid5 will read all the other devices to recover the
> block.
> If that happened recently and you then clear the UPTODATE bit on that block,
> md/raid5 will recover the data from the other cached blocks without
> re-reading them.
> 
> But considering this at a higher level: if two different nodes try to
> assemble the same RAID5 array then you already potentially have a problem.
> You really want some sensible cluster co-ordinator to make these
> decisions.  Hoping that a block device can be a reliable semaphore seems ...
> misguided.
> 
> NeilBrown




Patch

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9c66e59..11b749c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2678,6 +2678,9 @@  static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, in
 		}
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
+	} else if (bi->bi_rw & REQ_NOCACHE) {
+		/* force to read from underlying disk if requested */
+		clear_bit(R5_UPTODATE, &sh->dev[dd_idx].flags);
 	}
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",