
md: fix raid5 livelock

Message ID 20150128133754.25835582@notabene.brown (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer
Headers show

Commit Message

NeilBrown Jan. 28, 2015, 2:37 a.m. UTC
On Sun, 25 Jan 2015 21:06:20 +0100 Heinz Mauelshagen <heinzm@redhat.com>
wrote:

> From: Heinz Mauelshagen <heinzm@redhat.com>
> 
> Hi Neil,
> 
> the reconstruct-write optimization in raid5's fetch_block() causes
> livelocks in LVM raid4/5 tests.
> 
> Test scenarios:
> the tests wait for full initial array resynchronization before making a
> filesystem on the raid4/5 logical volume, mounting it, writing to the
> filesystem, and failing one physical volume holding a raiddev.
> 
> In short, we're seeing livelocks on fully synchronized raid4/5 arrays 
> with a failed device.
> 
> This patch fixes the issue, but likely in a suboptimal way.
> 
> Do you think there is a better solution to avoid livelocks on 
> reconstruct writes?
> 
> Regards,
> Heinz
> 
> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
> Tested-by: Jon Brassow <jbrassow@redhat.com>
> Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
> 
> ---
>   drivers/md/raid5.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c1b0d52..0fc8737 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2915,7 +2915,7 @@ static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
>               (s->failed >= 1 && fdev[0]->toread) ||
>               (s->failed >= 2 && fdev[1]->toread) ||
>               (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
> -             (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) &&
> +             (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state) || s->non_overwrite) &&
>                !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
>               ((sh->raid_conf->level == 6 ||
>                 sh->sector >= sh->raid_conf->mddev->recovery_cp)


That is a bit heavy-handed, but knowing that it fixes the problem helps a lot.

I think the problem happens when processing a non-overwrite write to a failed
device.

fetch_block() should, in that case, pre-read all of the working devices, but
since

	      (!test_bit(R5_Insync, &dev->flags) || test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) &&

was added, it sometimes doesn't.  The root problem is that
handle_stripe_dirtying is getting confused because neither rmw nor rcw seems to
work, so it doesn't start the chain of events to set STRIPE_PREREAD_ACTIVE.
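
For illustration only, here is a toy model of that decision -- an editorial
sketch, not the kernel code: the R5_Insync, STRIPE_PREREAD_ACTIVE and
R5_OVERWRITE flags are reduced to plain booleans, and will_preread() is an
invented helper name.

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the fetch_block() pre-read decision for a write to a
 * degraded raid4/5 stripe (simplified sketch, not the kernel code). */
static bool will_preread(bool insync, bool preread_active, bool overwrite)
{
	/* mirrors: (!R5_Insync || STRIPE_PREREAD_ACTIVE) && !R5_OVERWRITE */
	return (!insync || preread_active) && !overwrite;
}

int main(void)
{
	/* The problem case: the working device is in-sync, the write does
	 * not fully overwrite the failed device's block, and nothing has
	 * set STRIPE_PREREAD_ACTIVE yet. */
	bool reads = will_preread(true, false, false);

	printf("pre-read issued: %s\n", reads ? "yes" : "no");	/* prints "no" */
	return 0;
}

With the flags in that state no pre-read is issued, so the stripe never
gathers the data that would let the write proceed, and it is re-handled
over and over.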

The following (which is against mainline) might fix it.  Can you test?

Thanks,
NeilBrown

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c1b0d52bfcb0..793cf2861e97 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3195,6 +3195,10 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 					  (unsigned long long)sh->sector,
 					  rcw, qread, test_bit(STRIPE_DELAYED, &sh->state));
 	}
+	if (rcw > disks && rmw > disks &&
+	    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+		set_bit(STRIPE_DELAYED, &sh->state);
+
 	/* now if nothing is locked, and if we have enough data,
 	 * we can start a write request
 	 */

This code really really needs to be tidied up and commented better!!!

Thanks,
NeilBrown

Comments

Heinz Mauelshagen Jan. 28, 2015, 12:03 p.m. UTC | #1
Neil,

thanks for providing the patch.

Testing with it will take some hours before we can tell whether it succeeds.

Regards,
Heinz

On 01/28/2015 03:37 AM, NeilBrown wrote:
> [ ... full quote of the original report and Neil's patch snipped ... ]

Heinz Mauelshagen Jan. 29, 2015, 11:24 a.m. UTC | #2
Neil,

the patch worked fine in overnight test runs; the previous livelock did not
occur and no regressions were triggered.

Yes, tidying up that optimization logic (e.g. in fetch_block()) would be
very much appreciated :-)

Thanks,
Heinz

On 01/28/2015 01:03 PM, Heinz Mauelshagen wrote:
> [ ... full quote of the earlier messages snipped ... ]

Jes Sorensen Jan. 29, 2015, 5:17 p.m. UTC | #3
NeilBrown <neilb@suse.de> writes:
> [ ... full quote of the original report and Neil's patch snipped ... ]
>
> This code really really needs to be tidied up and commented better!!!

Neil,

Since this one seems to do the trick, will you be pushing it into your
tree anytime soon?

Cheers,
Jes


Patch

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c1b0d52bfcb0..793cf2861e97 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3195,6 +3195,10 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 					  (unsigned long long)sh->sector,
 					  rcw, qread, test_bit(STRIPE_DELAYED, &sh->state));
 	}
+	if (rcw > disks && rmw > disks &&
+	    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+		set_bit(STRIPE_DELAYED, &sh->state);
+
 	/* now if nothing is locked, and if we have enough data,
 	 * we can start a write request
 	 */
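
As a rough illustration of why the extra STRIPE_DELAYED marking should break
the cycle -- an editorial sketch under the assumption that delayed stripes are
later re-activated with STRIPE_PREREAD_ACTIVE set; the structure and function
names below are invented, not the kernel's:

#include <stdbool.h>
#include <stdio.h>

/* Invented toy state for one stripe: only the two relevant bits. */
struct toy_stripe {
	bool delayed;
	bool preread_active;
};

/* Models the new hunk: neither rmw nor rcw looks viable, so delay. */
static void handle_dirtying(struct toy_stripe *sh, int rmw, int rcw, int disks)
{
	if (rcw > disks && rmw > disks && !sh->preread_active)
		sh->delayed = true;
}

/* Models re-activation of a delayed stripe. */
static void activate_delayed(struct toy_stripe *sh)
{
	if (sh->delayed) {
		sh->delayed = false;
		sh->preread_active = true;	/* fetch_block() can now pre-read */
	}
}

int main(void)
{
	struct toy_stripe sh = { false, false };

	handle_dirtying(&sh, 8, 8, 5);	/* no write method currently possible */
	activate_delayed(&sh);
	printf("preread_active: %s\n", sh.preread_active ? "yes" : "no");
	return 0;
}

Once STRIPE_PREREAD_ACTIVE is set, the fetch_block() condition discussed above
finally allows the working devices to be pre-read, and the write can complete
instead of livelocking.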