diff mbox

[v2,1/1] quorum: Change vote rules for 64 bits hash

Message ID 1455588944-29799-2-git-send-email-xiecl.fnst@cn.fujitsu.com (mailing list archive)
State New, archived
Headers show

Commit Message

Changlong Xie Feb. 16, 2016, 2:15 a.m. UTC
If quorum has two children(A, B). A do flush sucessfully, but B flush failed.
We MUST choice A as winner rather than just pick anyone of them. Otherwise
the filesystem of guest will become read-only with following errors:

end_request: I/O error, dev vda, sector 11159960
Aborting journal on device vda3-8
EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
EXT4-fs (vda3): Remounting filesystem read-only

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 block/quorum.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

Comments

Dr. David Alan Gilbert Feb. 18, 2016, 10 a.m. UTC | #1
* Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote:
> If quorum has two children(A, B). A do flush sucessfully, but B flush failed.
> We MUST choice A as winner rather than just pick anyone of them. Otherwise
> the filesystem of guest will become read-only with following errors:
> 
> end_request: I/O error, dev vda, sector 11159960
> Aborting journal on device vda3-8
> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
> EXT4-fs (vda3): Remounting filesystem read-only
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>

Hi,
  That seems to fix the problem I was seeing; thanks.
(I don't know enough about the quorum layer to comment much more).

Dave

> ---
>  block/quorum.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/block/quorum.c b/block/quorum.c
> index 11cc60b..f094208 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -447,7 +447,8 @@ static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash)
>      return 0;
>  }
>  
> -static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
> +static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes,
> +                                                 bool vote_error)
>  {
>      int max = 0;
>      QuorumVoteVersion *candidate, *winner = NULL;
> @@ -456,6 +457,12 @@ static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
>          if (candidate->vote_count > max) {
>              max = candidate->vote_count;
>              winner = candidate;
> +            continue;
> +        }
> +        /* For 64 bit hash */
> +        if (vote_error && candidate->vote_count == max
> +            && candidate->value.l == 0) {
> +            winner = candidate;
>          }
>      }
>  
> @@ -545,7 +552,7 @@ static int quorum_vote_error(QuorumAIOCB *acb)
>      }
>  
>      if (error) {
> -        winner = quorum_get_vote_winner(&error_votes);
> +        winner = quorum_get_vote_winner(&error_votes, false);
>          ret = winner->value.l;
>      }
>  
> @@ -610,7 +617,7 @@ static bool quorum_vote(QuorumAIOCB *acb)
>      }
>  
>      /* vote to select the most represented version */
> -    winner = quorum_get_vote_winner(&acb->votes);
> +    winner = quorum_get_vote_winner(&acb->votes, false);
>  
>      /* if the winner count is smaller than threshold the read fails */
>      if (winner->vote_count < s->threshold) {
> @@ -770,7 +777,7 @@ static coroutine_fn int quorum_co_flush(BlockDriverState *bs)
>          quorum_count_vote(&error_votes, &result_value, i);
>      }
>  
> -    winner = quorum_get_vote_winner(&error_votes);
> +    winner = quorum_get_vote_winner(&error_votes, true);
>      result = winner->value.l;
>  
>      quorum_free_vote_list(&error_votes);
> -- 
> 1.9.3
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Alberto Garcia Feb. 18, 2016, 3:16 p.m. UTC | #2
On Tue 16 Feb 2016 03:15:44 AM CET, Changlong Xie <xiecl.fnst@cn.fujitsu.com> wrote:
> If quorum has two children(A, B). A do flush sucessfully, but B flush
> failed.  We MUST choice A as winner rather than just pick anyone of
> them. Otherwise the filesystem of guest will become read-only with
> following errors:
>
> end_request: I/O error, dev vda, sector 11159960
> Aborting journal on device vda3-8
> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
> EXT4-fs (vda3): Remounting filesystem read-only

Hi Xie,

Let's see if I'm getting this right:

- When Quorum flushes to disk, there's a vote among the return values of
  the flush operations of its members, and the one that wins is the one
  that Quorum returns.

- If there's a tie then Quorum choses the first result from the list of
  winners.

- With your patch you want to give priority to the vote with result == 0
  if there's any, so Quorum would return 0 (and succeed).

This seems to me like an ad-hoc fix for a particular use case. What if
you have 3 members and two of them fail with the same error code? Would
you still return 0 or the error code from the other two?

Also, is this only supposed to be used in FIFO mode? Your patch doesn't
seem to make any distinction.

Thanks!

Berto
Wen Congyang Feb. 19, 2016, 8:26 a.m. UTC | #3
On 02/18/2016 11:16 PM, Alberto Garcia wrote:
> On Tue 16 Feb 2016 03:15:44 AM CET, Changlong Xie <xiecl.fnst@cn.fujitsu.com> wrote:
>> If quorum has two children(A, B). A do flush sucessfully, but B flush
>> failed.  We MUST choice A as winner rather than just pick anyone of
>> them. Otherwise the filesystem of guest will become read-only with
>> following errors:
>>
>> end_request: I/O error, dev vda, sector 11159960
>> Aborting journal on device vda3-8
>> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
>> EXT4-fs (vda3): Remounting filesystem read-only
> 
> Hi Xie,
> 
> Let's see if I'm getting this right:
> 
> - When Quorum flushes to disk, there's a vote among the return values of
>   the flush operations of its members, and the one that wins is the one
>   that Quorum returns.
> 
> - If there's a tie then Quorum choses the first result from the list of
>   winners.
> 
> - With your patch you want to give priority to the vote with result == 0
>   if there's any, so Quorum would return 0 (and succeed).
> 
> This seems to me like an ad-hoc fix for a particular use case. What if
> you have 3 members and two of them fail with the same error code? Would
> you still return 0 or the error code from the other two?

For example:
children.0 returns 0
children.1 returns -EIO
children.2 returns -EPIPE

In this case, quorum returns -EPIPE now(without this patch).

For example:
children.0 returns -EPIPE
children.1 returns -EIO
children.2 returns 0
In this case, quorum returns 0 now.

If two children returns the same error, and only one returns 0, this patch doesn't
change the behavior.

Back to your question, before this patch, sometimes quorum returns error, and
sometimes quorum returns 0. In such case, which is better? Always return 0 or
always return error? In my opinion, we can always return 0 if we allow quorum
returns 0 in case 2.

> 
> Also, is this only supposed to be used in FIFO mode? Your patch doesn't
> seem to make any distinction.

IIRC, FIFO mode is only for read operation. This patch is not for FIFO mode.

Thanks
Wen Congyang

> 
> Thanks!
> 
> Berto
> 
> 
> .
>
Alberto Garcia Feb. 19, 2016, 11:24 a.m. UTC | #4
On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:

>>> If quorum has two children(A, B). A do flush sucessfully, but B
>>> flush failed.  We MUST choice A as winner rather than just pick
>>> anyone of them. Otherwise the filesystem of guest will become
>>> read-only with following errors:
>>>
>>> end_request: I/O error, dev vda, sector 11159960
>>> Aborting journal on device vda3-8
>>> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
>>> EXT4-fs (vda3): Remounting filesystem read-only
>> 
>> Hi Xie,
>> 
>> Let's see if I'm getting this right:
>> 
>> - When Quorum flushes to disk, there's a vote among the return values of
>>   the flush operations of its members, and the one that wins is the one
>>   that Quorum returns.
>> 
>> - If there's a tie then Quorum choses the first result from the list of
>>   winners.
>> 
>> - With your patch you want to give priority to the vote with result == 0
>>   if there's any, so Quorum would return 0 (and succeed).
>> 
>> This seems to me like an ad-hoc fix for a particular use case. What
>> if you have 3 members and two of them fail with the same error code?
>> Would you still return 0 or the error code from the other two?
>
> For example:
> children.0 returns 0
> children.1 returns -EIO
> children.2 returns -EPIPE
>
> In this case, quorum returns -EPIPE now(without this patch).
>
> For example:
> children.0 returns -EPIPE
> children.1 returns -EIO
> children.2 returns 0
> In this case, quorum returns 0 now.

My question is: what's the rationale for returning 0 in case a) but not
in case b)?

  a)
    children.0 returns -EPIPE
    children.1 returns -EIO
    children.2 returns 0

  b)
    children.0 returns -EIO
    children.1 returns -EIO
    children.2 returns 0

In both cases you have one successful flush and two errors. You want to
return always 0 in case a) and always -EIO in case b). But the only
difference is that in case b) the errors happen to be the same, so why
does that matter?

That said, I'm not very convinced of the current logics of the Quorum
flush code either, so it's not even a problem with your patch... it
seems to me that the code should follow the same logics as in the
read/write case: if the number of correct flushes >= threshold then
return 0, else select the most common error code.

Berto
Max Reitz Feb. 20, 2016, 2:28 p.m. UTC | #5
On 19.02.2016 12:24, Alberto Garcia wrote:
> On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> 
>>>> If quorum has two children(A, B). A do flush sucessfully, but B
>>>> flush failed.  We MUST choice A as winner rather than just pick
>>>> anyone of them. Otherwise the filesystem of guest will become
>>>> read-only with following errors:
>>>>
>>>> end_request: I/O error, dev vda, sector 11159960
>>>> Aborting journal on device vda3-8
>>>> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
>>>> EXT4-fs (vda3): Remounting filesystem read-only
>>>
>>> Hi Xie,
>>>
>>> Let's see if I'm getting this right:
>>>
>>> - When Quorum flushes to disk, there's a vote among the return values of
>>>   the flush operations of its members, and the one that wins is the one
>>>   that Quorum returns.
>>>
>>> - If there's a tie then Quorum choses the first result from the list of
>>>   winners.
>>>
>>> - With your patch you want to give priority to the vote with result == 0
>>>   if there's any, so Quorum would return 0 (and succeed).
>>>
>>> This seems to me like an ad-hoc fix for a particular use case. What
>>> if you have 3 members and two of them fail with the same error code?
>>> Would you still return 0 or the error code from the other two?
>>
>> For example:
>> children.0 returns 0
>> children.1 returns -EIO
>> children.2 returns -EPIPE
>>
>> In this case, quorum returns -EPIPE now(without this patch).
>>
>> For example:
>> children.0 returns -EPIPE
>> children.1 returns -EIO
>> children.2 returns 0
>> In this case, quorum returns 0 now.
> 
> My question is: what's the rationale for returning 0 in case a) but not
> in case b)?
> 
>   a)
>     children.0 returns -EPIPE
>     children.1 returns -EIO
>     children.2 returns 0
> 
>   b)
>     children.0 returns -EIO
>     children.1 returns -EIO
>     children.2 returns 0
> 
> In both cases you have one successful flush and two errors. You want to
> return always 0 in case a) and always -EIO in case b). But the only
> difference is that in case b) the errors happen to be the same, so why
> does that matter?
> 
> That said, I'm not very convinced of the current logics of the Quorum
> flush code either, so it's not even a problem with your patch... it
> seems to me that the code should follow the same logics as in the
> read/write case: if the number of correct flushes >= threshold then
> return 0, else select the most common error code.

I'm not convinced of the logic either, which is why I waited for you to
respond to this patch. :-)

Intuitively, I'd expect Quorum to return an error if flushing failed for
any of the children, because, well, flushing failed. I somehow feel like
flushing is different from a read or write operation and therefore
ignoring the threshold would be fine here. However, maybe my intuition
is just off.

Anyway, regardless of that, if we do take the threshold into account, we
should not use the exact error value for voting but just whether an
error occurred or not. If all but one children fail to flush (all for
different reasons), I find it totally wrong to return success. We should
then just return -EIO or something.

Max
Changlong Xie Feb. 22, 2016, 3:17 a.m. UTC | #6
On 02/20/2016 10:28 PM, Max Reitz wrote:
> On 19.02.2016 12:24, Alberto Garcia wrote:
>> On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>
>>>>> If quorum has two children(A, B). A do flush sucessfully, but B
>>>>> flush failed.  We MUST choice A as winner rather than just pick
>>>>> anyone of them. Otherwise the filesystem of guest will become
>>>>> read-only with following errors:
>>>>>
>>>>> end_request: I/O error, dev vda, sector 11159960
>>>>> Aborting journal on device vda3-8
>>>>> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
>>>>> EXT4-fs (vda3): Remounting filesystem read-only
>>>>
>>>> Hi Xie,
>>>>
>>>> Let's see if I'm getting this right:
>>>>
>>>> - When Quorum flushes to disk, there's a vote among the return values of
>>>>    the flush operations of its members, and the one that wins is the one
>>>>    that Quorum returns.
>>>>
>>>> - If there's a tie then Quorum choses the first result from the list of
>>>>    winners.
>>>>
>>>> - With your patch you want to give priority to the vote with result == 0
>>>>    if there's any, so Quorum would return 0 (and succeed).
>>>>
>>>> This seems to me like an ad-hoc fix for a particular use case. What
>>>> if you have 3 members and two of them fail with the same error code?
>>>> Would you still return 0 or the error code from the other two?
>>>
>>> For example:
>>> children.0 returns 0
>>> children.1 returns -EIO
>>> children.2 returns -EPIPE
>>>
>>> In this case, quorum returns -EPIPE now(without this patch).
>>>
>>> For example:
>>> children.0 returns -EPIPE
>>> children.1 returns -EIO
>>> children.2 returns 0
>>> In this case, quorum returns 0 now.
>>
>> My question is: what's the rationale for returning 0 in case a) but not
>> in case b)?
>>
>>    a)
>>      children.0 returns -EPIPE
>>      children.1 returns -EIO
>>      children.2 returns 0
>>
>>    b)
>>      children.0 returns -EIO
>>      children.1 returns -EIO
>>      children.2 returns 0
>>
>> In both cases you have one successful flush and two errors. You want to
>> return always 0 in case a) and always -EIO in case b). But the only
>> difference is that in case b) the errors happen to be the same, so why
>> does that matter?
>>
>> That said, I'm not very convinced of the current logics of the Quorum
>> flush code either, so it's not even a problem with your patch... it
>> seems to me that the code should follow the same logics as in the
>> read/write case: if the number of correct flushes >= threshold then
>> return 0, else select the most common error code.
>
> I'm not convinced of the logic either, which is why I waited for you to
> respond to this patch. :-)
>
> Intuitively, I'd expect Quorum to return an error if flushing failed for
> any of the children, because, well, flushing failed. I somehow feel like
> flushing is different from a read or write operation and therefore
> ignoring the threshold would be fine here. However, maybe my intuition
> is just off.
>
> Anyway, regardless of that, if we do take the threshold into account, we
> should not use the exact error value for voting but just whether an
> error occurred or not. If all but one children fail to flush (all for
> different reasons), I find it totally wrong to return success. We should
> then just return -EIO or something.
>
Hi Berto & Max

Thanks for your comments, i'd like to have a summary here. For flush cases:

1) if flush successfully(result >= 0), result = 0; else if result < 0, 
result = -EIO. then invoke quorum_count_vote
2) if correct flushes >= threshold, mark correct flushes as winner directly.

Will fix in next version.

Thanks
	-Xie
> Max
>
Dr. David Alan Gilbert Feb. 22, 2016, 9:02 a.m. UTC | #7
* Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote:
> On 02/20/2016 10:28 PM, Max Reitz wrote:
> >On 19.02.2016 12:24, Alberto Garcia wrote:
> >>On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> >>
> >>>>>If quorum has two children(A, B). A do flush sucessfully, but B
> >>>>>flush failed.  We MUST choice A as winner rather than just pick
> >>>>>anyone of them. Otherwise the filesystem of guest will become
> >>>>>read-only with following errors:
> >>>>>
> >>>>>end_request: I/O error, dev vda, sector 11159960
> >>>>>Aborting journal on device vda3-8
> >>>>>EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
> >>>>>EXT4-fs (vda3): Remounting filesystem read-only
> >>>>
> >>>>Hi Xie,
> >>>>
> >>>>Let's see if I'm getting this right:
> >>>>
> >>>>- When Quorum flushes to disk, there's a vote among the return values of
> >>>>   the flush operations of its members, and the one that wins is the one
> >>>>   that Quorum returns.
> >>>>
> >>>>- If there's a tie then Quorum choses the first result from the list of
> >>>>   winners.
> >>>>
> >>>>- With your patch you want to give priority to the vote with result == 0
> >>>>   if there's any, so Quorum would return 0 (and succeed).
> >>>>
> >>>>This seems to me like an ad-hoc fix for a particular use case. What
> >>>>if you have 3 members and two of them fail with the same error code?
> >>>>Would you still return 0 or the error code from the other two?
> >>>
> >>>For example:
> >>>children.0 returns 0
> >>>children.1 returns -EIO
> >>>children.2 returns -EPIPE
> >>>
> >>>In this case, quorum returns -EPIPE now(without this patch).
> >>>
> >>>For example:
> >>>children.0 returns -EPIPE
> >>>children.1 returns -EIO
> >>>children.2 returns 0
> >>>In this case, quorum returns 0 now.
> >>
> >>My question is: what's the rationale for returning 0 in case a) but not
> >>in case b)?
> >>
> >>   a)
> >>     children.0 returns -EPIPE
> >>     children.1 returns -EIO
> >>     children.2 returns 0
> >>
> >>   b)
> >>     children.0 returns -EIO
> >>     children.1 returns -EIO
> >>     children.2 returns 0
> >>
> >>In both cases you have one successful flush and two errors. You want to
> >>return always 0 in case a) and always -EIO in case b). But the only
> >>difference is that in case b) the errors happen to be the same, so why
> >>does that matter?
> >>
> >>That said, I'm not very convinced of the current logics of the Quorum
> >>flush code either, so it's not even a problem with your patch... it
> >>seems to me that the code should follow the same logics as in the
> >>read/write case: if the number of correct flushes >= threshold then
> >>return 0, else select the most common error code.
> >
> >I'm not convinced of the logic either, which is why I waited for you to
> >respond to this patch. :-)
> >
> >Intuitively, I'd expect Quorum to return an error if flushing failed for
> >any of the children, because, well, flushing failed. I somehow feel like
> >flushing is different from a read or write operation and therefore
> >ignoring the threshold would be fine here. However, maybe my intuition
> >is just off.
> >
> >Anyway, regardless of that, if we do take the threshold into account, we
> >should not use the exact error value for voting but just whether an
> >error occurred or not. If all but one children fail to flush (all for
> >different reasons), I find it totally wrong to return success. We should
> >then just return -EIO or something.
> >
> Hi Berto & Max
> 
> Thanks for your comments, i'd like to have a summary here. For flush cases:
> 
> 1) if flush successfully(result >= 0), result = 0; else if result < 0,
> result = -EIO. then invoke quorum_count_vote
> 2) if correct flushes >= threshold, mark correct flushes as winner directly.

I find it difficult to understand how this corresponds to the behaviour needed
in COLO, where we have the NBD and the real storage on the primary; in that
case the failure of the real storage should give an error to the guest, but the
failure of the NBD shouldn't produce a guest visible failure.

Dave

> Will fix in next version.
> 
> Thanks
> 	-Xie
> >Max
> >
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Changlong Xie Feb. 22, 2016, 9:52 a.m. UTC | #8
On 02/22/2016 05:02 PM, Dr. David Alan Gilbert wrote:
> * Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote:
>> On 02/20/2016 10:28 PM, Max Reitz wrote:
>>> On 19.02.2016 12:24, Alberto Garcia wrote:
>>>> On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>>
>>>>>>> If quorum has two children(A, B). A do flush sucessfully, but B
>>>>>>> flush failed.  We MUST choice A as winner rather than just pick
>>>>>>> anyone of them. Otherwise the filesystem of guest will become
>>>>>>> read-only with following errors:
>>>>>>>
>>>>>>> end_request: I/O error, dev vda, sector 11159960
>>>>>>> Aborting journal on device vda3-8
>>>>>>> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
>>>>>>> EXT4-fs (vda3): Remounting filesystem read-only
>>>>>>
>>>>>> Hi Xie,
>>>>>>
>>>>>> Let's see if I'm getting this right:
>>>>>>
>>>>>> - When Quorum flushes to disk, there's a vote among the return values of
>>>>>>    the flush operations of its members, and the one that wins is the one
>>>>>>    that Quorum returns.
>>>>>>
>>>>>> - If there's a tie then Quorum choses the first result from the list of
>>>>>>    winners.
>>>>>>
>>>>>> - With your patch you want to give priority to the vote with result == 0
>>>>>>    if there's any, so Quorum would return 0 (and succeed).
>>>>>>
>>>>>> This seems to me like an ad-hoc fix for a particular use case. What
>>>>>> if you have 3 members and two of them fail with the same error code?
>>>>>> Would you still return 0 or the error code from the other two?
>>>>>
>>>>> For example:
>>>>> children.0 returns 0
>>>>> children.1 returns -EIO
>>>>> children.2 returns -EPIPE
>>>>>
>>>>> In this case, quorum returns -EPIPE now(without this patch).
>>>>>
>>>>> For example:
>>>>> children.0 returns -EPIPE
>>>>> children.1 returns -EIO
>>>>> children.2 returns 0
>>>>> In this case, quorum returns 0 now.
>>>>
>>>> My question is: what's the rationale for returning 0 in case a) but not
>>>> in case b)?
>>>>
>>>>    a)
>>>>      children.0 returns -EPIPE
>>>>      children.1 returns -EIO
>>>>      children.2 returns 0
>>>>
>>>>    b)
>>>>      children.0 returns -EIO
>>>>      children.1 returns -EIO
>>>>      children.2 returns 0
>>>>
>>>> In both cases you have one successful flush and two errors. You want to
>>>> return always 0 in case a) and always -EIO in case b). But the only
>>>> difference is that in case b) the errors happen to be the same, so why
>>>> does that matter?
>>>>
>>>> That said, I'm not very convinced of the current logics of the Quorum
>>>> flush code either, so it's not even a problem with your patch... it
>>>> seems to me that the code should follow the same logics as in the
>>>> read/write case: if the number of correct flushes >= threshold then
>>>> return 0, else select the most common error code.
>>>
>>> I'm not convinced of the logic either, which is why I waited for you to
>>> respond to this patch. :-)
>>>
>>> Intuitively, I'd expect Quorum to return an error if flushing failed for
>>> any of the children, because, well, flushing failed. I somehow feel like
>>> flushing is different from a read or write operation and therefore
>>> ignoring the threshold would be fine here. However, maybe my intuition
>>> is just off.
>>>
>>> Anyway, regardless of that, if we do take the threshold into account, we
>>> should not use the exact error value for voting but just whether an
>>> error occurred or not. If all but one children fail to flush (all for
>>> different reasons), I find it totally wrong to return success. We should
>>> then just return -EIO or something.
>>>
>> Hi Berto & Max
>>
>> Thanks for your comments, i'd like to have a summary here. For flush cases:
>>
>> 1) if flush successfully(result >= 0), result = 0; else if result < 0,
>> result = -EIO. then invoke quorum_count_vote
>> 2) if correct flushes >= threshold, mark correct flushes as winner directly.
>
> I find it difficult to understand how this corresponds to the behaviour needed
> in COLO, where we have the NBD and the real storage on the primary; in that
> case the failure of the real storage should give an error to the guest, but the
> failure of the NBD shouldn't produce a guest visible failure.
>
Hi Dave

"in that case the failure of the real storage should give an error to 
the guest, but the failure of the NBD shouldn't produce a guest visible 
failure."
This is just what i think :), but there is a restricted condition
.
1) If the guest *Must* fetch the return code for flush operation? This 
is prerequisite.
2) If no. Since Colo and Quorum are independent of each other, so quorum 
don't know if we are in colo mode. I think the only way to implement 
your idea is:"pass some parameters such as flush= on/off to each quorum 
child".

BTW, i've just sent out V3, and thought it make scense.

Thanks
	-Xie
> Dave
>
>> Will fix in next version.
>>
>> Thanks
>> 	-Xie
>>> Max
>>>
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
> .
>
Dr. David Alan Gilbert Feb. 22, 2016, 9:59 a.m. UTC | #9
* Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote:
> On 02/22/2016 05:02 PM, Dr. David Alan Gilbert wrote:
> >* Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote:
> >>On 02/20/2016 10:28 PM, Max Reitz wrote:
> >>>On 19.02.2016 12:24, Alberto Garcia wrote:
> >>>>On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> >>>>
> >>>>>>>If quorum has two children(A, B). A do flush sucessfully, but B
> >>>>>>>flush failed.  We MUST choice A as winner rather than just pick
> >>>>>>>anyone of them. Otherwise the filesystem of guest will become
> >>>>>>>read-only with following errors:
> >>>>>>>
> >>>>>>>end_request: I/O error, dev vda, sector 11159960
> >>>>>>>Aborting journal on device vda3-8
> >>>>>>>EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
> >>>>>>>EXT4-fs (vda3): Remounting filesystem read-only
> >>>>>>
> >>>>>>Hi Xie,
> >>>>>>
> >>>>>>Let's see if I'm getting this right:
> >>>>>>
> >>>>>>- When Quorum flushes to disk, there's a vote among the return values of
> >>>>>>   the flush operations of its members, and the one that wins is the one
> >>>>>>   that Quorum returns.
> >>>>>>
> >>>>>>- If there's a tie then Quorum choses the first result from the list of
> >>>>>>   winners.
> >>>>>>
> >>>>>>- With your patch you want to give priority to the vote with result == 0
> >>>>>>   if there's any, so Quorum would return 0 (and succeed).
> >>>>>>
> >>>>>>This seems to me like an ad-hoc fix for a particular use case. What
> >>>>>>if you have 3 members and two of them fail with the same error code?
> >>>>>>Would you still return 0 or the error code from the other two?
> >>>>>
> >>>>>For example:
> >>>>>children.0 returns 0
> >>>>>children.1 returns -EIO
> >>>>>children.2 returns -EPIPE
> >>>>>
> >>>>>In this case, quorum returns -EPIPE now(without this patch).
> >>>>>
> >>>>>For example:
> >>>>>children.0 returns -EPIPE
> >>>>>children.1 returns -EIO
> >>>>>children.2 returns 0
> >>>>>In this case, quorum returns 0 now.
> >>>>
> >>>>My question is: what's the rationale for returning 0 in case a) but not
> >>>>in case b)?
> >>>>
> >>>>   a)
> >>>>     children.0 returns -EPIPE
> >>>>     children.1 returns -EIO
> >>>>     children.2 returns 0
> >>>>
> >>>>   b)
> >>>>     children.0 returns -EIO
> >>>>     children.1 returns -EIO
> >>>>     children.2 returns 0
> >>>>
> >>>>In both cases you have one successful flush and two errors. You want to
> >>>>return always 0 in case a) and always -EIO in case b). But the only
> >>>>difference is that in case b) the errors happen to be the same, so why
> >>>>does that matter?
> >>>>
> >>>>That said, I'm not very convinced of the current logics of the Quorum
> >>>>flush code either, so it's not even a problem with your patch... it
> >>>>seems to me that the code should follow the same logics as in the
> >>>>read/write case: if the number of correct flushes >= threshold then
> >>>>return 0, else select the most common error code.
> >>>
> >>>I'm not convinced of the logic either, which is why I waited for you to
> >>>respond to this patch. :-)
> >>>
> >>>Intuitively, I'd expect Quorum to return an error if flushing failed for
> >>>any of the children, because, well, flushing failed. I somehow feel like
> >>>flushing is different from a read or write operation and therefore
> >>>ignoring the threshold would be fine here. However, maybe my intuition
> >>>is just off.
> >>>
> >>>Anyway, regardless of that, if we do take the threshold into account, we
> >>>should not use the exact error value for voting but just whether an
> >>>error occurred or not. If all but one children fail to flush (all for
> >>>different reasons), I find it totally wrong to return success. We should
> >>>then just return -EIO or something.
> >>>
> >>Hi Berto & Max
> >>
> >>Thanks for your comments, i'd like to have a summary here. For flush cases:
> >>
> >>1) if flush successfully(result >= 0), result = 0; else if result < 0,
> >>result = -EIO. then invoke quorum_count_vote
> >>2) if correct flushes >= threshold, mark correct flushes as winner directly.
> >
> >I find it difficult to understand how this corresponds to the behaviour needed
> >in COLO, where we have the NBD and the real storage on the primary; in that
> >case the failure of the real storage should give an error to the guest, but the
> >failure of the NBD shouldn't produce a guest visible failure.
> >
> Hi Dave
> 
> "in that case the failure of the real storage should give an error to the
> guest, but the failure of the NBD shouldn't produce a guest visible
> failure."
> This is just what i think :), but there is a restricted condition
> .
> 1) If the guest *Must* fetch the return code for flush operation? This is
> prerequisite.
> 2) If no. Since Colo and Quorum are independent of each other, so quorum
> don't know if we are in colo mode. I think the only way to implement your
> idea is:"pass some parameters such as flush= on/off to each quorum child".

I'm not sure why flush is special; but either way I think for Quorum the
answer is to have a flag per-child to say whether a failure on that
child should be visible to the guest.

Dave

> BTW, i've just sent out V3, and thought it make scense.
> 
> Thanks
> 	-Xie
> >Dave
> >
> >>Will fix in next version.
> >>
> >>Thanks
> >>	-Xie
> >>>Max
> >>>
> >>
> >>
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >
> >.
> >
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Kevin Wolf Feb. 22, 2016, 10:34 a.m. UTC | #10
Am 22.02.2016 um 10:02 hat Dr. David Alan Gilbert geschrieben:
> * Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote:
> > On 02/20/2016 10:28 PM, Max Reitz wrote:
> > >On 19.02.2016 12:24, Alberto Garcia wrote:
> > >>On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> > >>
> > >>>>>If quorum has two children(A, B). A do flush sucessfully, but B
> > >>>>>flush failed.  We MUST choice A as winner rather than just pick
> > >>>>>anyone of them. Otherwise the filesystem of guest will become
> > >>>>>read-only with following errors:
> > >>>>>
> > >>>>>end_request: I/O error, dev vda, sector 11159960
> > >>>>>Aborting journal on device vda3-8
> > >>>>>EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
> > >>>>>EXT4-fs (vda3): Remounting filesystem read-only
> > >>>>
> > >>>>Hi Xie,
> > >>>>
> > >>>>Let's see if I'm getting this right:
> > >>>>
> > >>>>- When Quorum flushes to disk, there's a vote among the return values of
> > >>>>   the flush operations of its members, and the one that wins is the one
> > >>>>   that Quorum returns.
> > >>>>
> > >>>>- If there's a tie then Quorum choses the first result from the list of
> > >>>>   winners.
> > >>>>
> > >>>>- With your patch you want to give priority to the vote with result == 0
> > >>>>   if there's any, so Quorum would return 0 (and succeed).
> > >>>>
> > >>>>This seems to me like an ad-hoc fix for a particular use case. What
> > >>>>if you have 3 members and two of them fail with the same error code?
> > >>>>Would you still return 0 or the error code from the other two?
> > >>>
> > >>>For example:
> > >>>children.0 returns 0
> > >>>children.1 returns -EIO
> > >>>children.2 returns -EPIPE
> > >>>
> > >>>In this case, quorum returns -EPIPE now(without this patch).
> > >>>
> > >>>For example:
> > >>>children.0 returns -EPIPE
> > >>>children.1 returns -EIO
> > >>>children.2 returns 0
> > >>>In this case, quorum returns 0 now.
> > >>
> > >>My question is: what's the rationale for returning 0 in case a) but not
> > >>in case b)?
> > >>
> > >>   a)
> > >>     children.0 returns -EPIPE
> > >>     children.1 returns -EIO
> > >>     children.2 returns 0
> > >>
> > >>   b)
> > >>     children.0 returns -EIO
> > >>     children.1 returns -EIO
> > >>     children.2 returns 0
> > >>
> > >>In both cases you have one successful flush and two errors. You want to
> > >>return always 0 in case a) and always -EIO in case b). But the only
> > >>difference is that in case b) the errors happen to be the same, so why
> > >>does that matter?
> > >>
> > >>That said, I'm not very convinced of the current logics of the Quorum
> > >>flush code either, so it's not even a problem with your patch... it
> > >>seems to me that the code should follow the same logics as in the
> > >>read/write case: if the number of correct flushes >= threshold then
> > >>return 0, else select the most common error code.
> > >
> > >I'm not convinced of the logic either, which is why I waited for you to
> > >respond to this patch. :-)
> > >
> > >Intuitively, I'd expect Quorum to return an error if flushing failed for
> > >any of the children, because, well, flushing failed. I somehow feel like
> > >flushing is different from a read or write operation and therefore
> > >ignoring the threshold would be fine here. However, maybe my intuition
> > >is just off.
> > >
> > >Anyway, regardless of that, if we do take the threshold into account, we
> > >should not use the exact error value for voting but just whether an
> > >error occurred or not. If all but one children fail to flush (all for
> > >different reasons), I find it totally wrong to return success. We should
> > >then just return -EIO or something.
> > >
> > Hi Berto & Max
> > 
> > Thanks for your comments, i'd like to have a summary here. For flush cases:
> > 
> > 1) if flush successfully(result >= 0), result = 0; else if result < 0,
> > result = -EIO. then invoke quorum_count_vote
> > 2) if correct flushes >= threshold, mark correct flushes as winner directly.

Please try to return one of the real error codes instead of -EIO.
Essentially we should use the same logic as for writes (like Berto
suggested above).

> I find it difficult to understand how this corresponds to the behaviour needed
> in COLO, where we have the NBD and the real storage on the primary; in that
> case the failure of the real storage should give an error to the guest, but the
> failure of the NBD shouldn't produce a guest visible failure.

That's probably because you're abusing quorum as an active mirroring
filter, which it really isn't. This is okay for now so you have
something to work with, but I expect that eventually you'll need a
different driver. (Well, or maybe I'm mistaken and you actually do need
to read back data from NBD to compare it to the real storage - do you?)

Anyway, I'm curious how you would handle a failed write/flush request to
the NBD target. Simply ignoring it doesn't feel right; in case of a
failover, wouldn't you switch to a potentially corrupted image then?

Kevin
Dr. David Alan Gilbert Feb. 22, 2016, 10:39 a.m. UTC | #11
* Kevin Wolf (kwolf@redhat.com) wrote:
> Am 22.02.2016 um 10:02 hat Dr. David Alan Gilbert geschrieben:
> > * Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote:
> > > On 02/20/2016 10:28 PM, Max Reitz wrote:
> > > >On 19.02.2016 12:24, Alberto Garcia wrote:
> > > >>On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> > > >>
> > > >>>>>If quorum has two children(A, B). A do flush sucessfully, but B
> > > >>>>>flush failed.  We MUST choice A as winner rather than just pick
> > > >>>>>anyone of them. Otherwise the filesystem of guest will become
> > > >>>>>read-only with following errors:
> > > >>>>>
> > > >>>>>end_request: I/O error, dev vda, sector 11159960
> > > >>>>>Aborting journal on device vda3-8
> > > >>>>>EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
> > > >>>>>EXT4-fs (vda3): Remounting filesystem read-only
> > > >>>>
> > > >>>>Hi Xie,
> > > >>>>
> > > >>>>Let's see if I'm getting this right:
> > > >>>>
> > > >>>>- When Quorum flushes to disk, there's a vote among the return values of
> > > >>>>   the flush operations of its members, and the one that wins is the one
> > > >>>>   that Quorum returns.
> > > >>>>
> > > >>>>- If there's a tie then Quorum choses the first result from the list of
> > > >>>>   winners.
> > > >>>>
> > > >>>>- With your patch you want to give priority to the vote with result == 0
> > > >>>>   if there's any, so Quorum would return 0 (and succeed).
> > > >>>>
> > > >>>>This seems to me like an ad-hoc fix for a particular use case. What
> > > >>>>if you have 3 members and two of them fail with the same error code?
> > > >>>>Would you still return 0 or the error code from the other two?
> > > >>>
> > > >>>For example:
> > > >>>children.0 returns 0
> > > >>>children.1 returns -EIO
> > > >>>children.2 returns -EPIPE
> > > >>>
> > > >>>In this case, quorum returns -EPIPE now(without this patch).
> > > >>>
> > > >>>For example:
> > > >>>children.0 returns -EPIPE
> > > >>>children.1 returns -EIO
> > > >>>children.2 returns 0
> > > >>>In this case, quorum returns 0 now.
> > > >>
> > > >>My question is: what's the rationale for returning 0 in case a) but not
> > > >>in case b)?
> > > >>
> > > >>   a)
> > > >>     children.0 returns -EPIPE
> > > >>     children.1 returns -EIO
> > > >>     children.2 returns 0
> > > >>
> > > >>   b)
> > > >>     children.0 returns -EIO
> > > >>     children.1 returns -EIO
> > > >>     children.2 returns 0
> > > >>
> > > >>In both cases you have one successful flush and two errors. You want to
> > > >>return always 0 in case a) and always -EIO in case b). But the only
> > > >>difference is that in case b) the errors happen to be the same, so why
> > > >>does that matter?
> > > >>
> > > >>That said, I'm not very convinced of the current logics of the Quorum
> > > >>flush code either, so it's not even a problem with your patch... it
> > > >>seems to me that the code should follow the same logics as in the
> > > >>read/write case: if the number of correct flushes >= threshold then
> > > >>return 0, else select the most common error code.
> > > >
> > > >I'm not convinced of the logic either, which is why I waited for you to
> > > >respond to this patch. :-)
> > > >
> > > >Intuitively, I'd expect Quorum to return an error if flushing failed for
> > > >any of the children, because, well, flushing failed. I somehow feel like
> > > >flushing is different from a read or write operation and therefore
> > > >ignoring the threshold would be fine here. However, maybe my intuition
> > > >is just off.
> > > >
> > > >Anyway, regardless of that, if we do take the threshold into account, we
> > > >should not use the exact error value for voting but just whether an
> > > >error occurred or not. If all but one children fail to flush (all for
> > > >different reasons), I find it totally wrong to return success. We should
> > > >then just return -EIO or something.
> > > >
> > > Hi Berto & Max
> > > 
> > > Thanks for your comments, i'd like to have a summary here. For flush cases:
> > > 
> > > 1) if flush successfully(result >= 0), result = 0; else if result < 0,
> > > result = -EIO. then invoke quorum_count_vote
> > > 2) if correct flushes >= threshold, mark correct flushes as winner directly.
> 
> Please try to return one of the real error codes instead of -EIO.
> Essentially we should use the same logic as for writes (like Berto
> suggested above).
> 
> > I find it difficult to understand how this corresponds to the behaviour needed
> > in COLO, where we have the NBD and the real storage on the primary; in that
> > case the failure of the real storage should give an error to the guest, but the
> > failure of the NBD shouldn't produce a guest visible failure.
> 
> That's probably because you're abusing quorum as an active mirroring
> filter, which it really isn't. This is okay for now so you have
> something to work with, but I expect that eventually you'll need a
> different driver. (Well, or maybe I'm mistaken and you actually do need
> to read back data from NBD to compare it to the real storage - do you?)
> 
> Anyway, I'm curious how you would handle a failed write/flush request to
> the NBD target. Simply ignoring it doesn't feel right; in case of a
> failover, wouldn't you switch to a potentially corrupted image then?

That depends.   If the NBD target write has failed then you need to declare
that the two hosts are no longer synchronised, and you need to trigger
a failover (probably just discard the secondary at this point); however
the point about this change is that the guest that survives this process
shouldn't see a write error, something else that manages the fault-tolerance
should, but not the guest.

Dave


> Kevin
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Alberto Garcia Feb. 22, 2016, 1:31 p.m. UTC | #12
On Sat 20 Feb 2016 03:28:03 PM CET, Max Reitz <mreitz@redhat.com> wrote:

>> That said, I'm not very convinced of the current logics of the Quorum
>> flush code either, so it's not even a problem with your patch... it
>> seems to me that the code should follow the same logics as in the
>> read/write case: if the number of correct flushes >= threshold then
>> return 0, else select the most common error code.
>
> I'm not convinced of the logic either, which is why I waited for you
> to respond to this patch. :-)
>
> Intuitively, I'd expect Quorum to return an error if flushing failed
> for any of the children, because, well, flushing failed. I somehow
> feel like flushing is different from a read or write operation and
> therefore ignoring the threshold would be fine here. However, maybe my
> intuition is just off.

The way I see it is that if we have, say, 5 drives with a threshold of 3
and flushing fails in one of them Quorum should report the error (with
QUORUM_REPORT_BAD probably, or maybe a new event) but succeed, because
we have at least 3 images that are (in principle) fine. I don't see why
the guest should see an error in that case.

This is what I found from the original discussion when the patch was
submitted:

https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg00119.html
https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg00377.html

It seems that there was no dicussion about why the threshold was not
used. The current code has the problem that I mentioned earlier: in the
example with 5 drives, if 2 succeed and 3 fail with different error
codes then Quorum will return 0, which feels wrong.

I think the correct solution would be something like the code in V10 but
counting the number of correct flushes and using them to decide whether
to report an error or not. Something like this:

    for (i = 0; i < s->total; i++) {
        result = bdrv_co_flush(s->bs[i]);
        if (result) {
            result_value.l = result;
            quorum_count_vote(&error_votes, &result_value, i);
        } else {
            correct++;
        }
    }

    if (correct < s->threshold) {
        winner = quorum_get_vote_winner(&error_votes);
        result = winner->value.l;
    }

> Anyway, regardless of that, if we do take the threshold into account,
> we should not use the exact error value for voting but just whether an
> error occurred or not. If all but one children fail to flush (all for
> different reasons), I find it totally wrong to return success. We
> should then just return -EIO or something.

Exactly. I think -EIO should be fine, I'm not sure why it was proposed
to vote the return code in this case?

It doesn't seem that there was any discussion other than this question
from Eric:

https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg04034.html

Berto
Alberto Garcia Feb. 22, 2016, 1:43 p.m. UTC | #13
On Mon 22 Feb 2016 02:31:30 PM CET, Alberto Garcia wrote:
>> Anyway, regardless of that, if we do take the threshold into account,
>> we should not use the exact error value for voting but just whether
>> an error occurred or not. If all but one children fail to flush (all
>> for different reasons), I find it totally wrong to return success. We
>> should then just return -EIO or something.
>
> Exactly. I think -EIO should be fine, I'm not sure why it was proposed
> to vote the return code in this case?

I correct myself: when reads or writes fail we also vote the return
code. We only return -EIO when there are few or no I/O errors but the
results differ and we cannot get a majority that reaches the threshold.

Berto
Eric Blake Feb. 22, 2016, 4:37 p.m. UTC | #14
On 02/22/2016 06:31 AM, Alberto Garcia wrote:

> The way I see it is that if we have, say, 5 drives with a threshold of 3
> and flushing fails in one of them Quorum should report the error (with
> QUORUM_REPORT_BAD probably, or maybe a new event) but succeed, because
> we have at least 3 images that are (in principle) fine. I don't see why
> the guest should see an error in that case.

Yes, I can live with that.  Basically, treat flushing like everything
else quorum - follow the same voting rules (either you are in threshold
mode, and must meet the threshold to succeed while issuing an event for
the failing outliers; or you are in FIFO mode and the first success is
sufficient).


> I think the correct solution would be something like the code in V10 but
> counting the number of correct flushes and using them to decide whether
> to report an error or not. Something like this:
> 
>     for (i = 0; i < s->total; i++) {
>         result = bdrv_co_flush(s->bs[i]);
>         if (result) {
>             result_value.l = result;
>             quorum_count_vote(&error_votes, &result_value, i);
>         } else {
>             correct++;
>         }
>     }
> 
>     if (correct < s->threshold) {
>         winner = quorum_get_vote_winner(&error_votes);
>         result = winner->value.l;
>     }
> 
>> Anyway, regardless of that, if we do take the threshold into account,
>> we should not use the exact error value for voting but just whether an
>> error occurred or not. If all but one children fail to flush (all for
>> different reasons), I find it totally wrong to return success. We
>> should then just return -EIO or something.
> 
> Exactly. I think -EIO should be fine, I'm not sure why it was proposed
> to vote the return code in this case?
> 
> It doesn't seem that there was any discussion other than this question
> from Eric:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg04034.html

My concern there was the complete disregard for errors.  With your
proposal here, errors are voted on the same as they are for reads and
writes, which feels like the correct approach.
Changlong Xie Feb. 23, 2016, 2:55 a.m. UTC | #15
On 02/22/2016 06:34 PM, Kevin Wolf wrote:
> Am 22.02.2016 um 10:02 hat Dr. David Alan Gilbert geschrieben:
>> * Changlong Xie (xiecl.fnst@cn.fujitsu.com) wrote:
>>> On 02/20/2016 10:28 PM, Max Reitz wrote:
>>>> On 19.02.2016 12:24, Alberto Garcia wrote:
>>>>> On Fri 19 Feb 2016 09:26:53 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
>>>>>
>>>>>>>> If quorum has two children(A, B). A do flush sucessfully, but B
>>>>>>>> flush failed.  We MUST choice A as winner rather than just pick
>>>>>>>> anyone of them. Otherwise the filesystem of guest will become
>>>>>>>> read-only with following errors:
>>>>>>>>
>>>>>>>> end_request: I/O error, dev vda, sector 11159960
>>>>>>>> Aborting journal on device vda3-8
>>>>>>>> EXT4-fs error (device vda3): ext4_journal_start_sb:327: Detected abort journal
>>>>>>>> EXT4-fs (vda3): Remounting filesystem read-only
>>>>>>>
>>>>>>> Hi Xie,
>>>>>>>
>>>>>>> Let's see if I'm getting this right:
>>>>>>>
>>>>>>> - When Quorum flushes to disk, there's a vote among the return values of
>>>>>>>    the flush operations of its members, and the one that wins is the one
>>>>>>>    that Quorum returns.
>>>>>>>
>>>>>>> - If there's a tie then Quorum choses the first result from the list of
>>>>>>>    winners.
>>>>>>>
>>>>>>> - With your patch you want to give priority to the vote with result == 0
>>>>>>>    if there's any, so Quorum would return 0 (and succeed).
>>>>>>>
>>>>>>> This seems to me like an ad-hoc fix for a particular use case. What
>>>>>>> if you have 3 members and two of them fail with the same error code?
>>>>>>> Would you still return 0 or the error code from the other two?
>>>>>>
>>>>>> For example:
>>>>>> children.0 returns 0
>>>>>> children.1 returns -EIO
>>>>>> children.2 returns -EPIPE
>>>>>>
>>>>>> In this case, quorum returns -EPIPE now(without this patch).
>>>>>>
>>>>>> For example:
>>>>>> children.0 returns -EPIPE
>>>>>> children.1 returns -EIO
>>>>>> children.2 returns 0
>>>>>> In this case, quorum returns 0 now.
>>>>>
>>>>> My question is: what's the rationale for returning 0 in case a) but not
>>>>> in case b)?
>>>>>
>>>>>    a)
>>>>>      children.0 returns -EPIPE
>>>>>      children.1 returns -EIO
>>>>>      children.2 returns 0
>>>>>
>>>>>    b)
>>>>>      children.0 returns -EIO
>>>>>      children.1 returns -EIO
>>>>>      children.2 returns 0
>>>>>
>>>>> In both cases you have one successful flush and two errors. You want to
>>>>> return always 0 in case a) and always -EIO in case b). But the only
>>>>> difference is that in case b) the errors happen to be the same, so why
>>>>> does that matter?
>>>>>
>>>>> That said, I'm not very convinced of the current logics of the Quorum
>>>>> flush code either, so it's not even a problem with your patch... it
>>>>> seems to me that the code should follow the same logics as in the
>>>>> read/write case: if the number of correct flushes >= threshold then
>>>>> return 0, else select the most common error code.
>>>>
>>>> I'm not convinced of the logic either, which is why I waited for you to
>>>> respond to this patch. :-)
>>>>
>>>> Intuitively, I'd expect Quorum to return an error if flushing failed for
>>>> any of the children, because, well, flushing failed. I somehow feel like
>>>> flushing is different from a read or write operation and therefore
>>>> ignoring the threshold would be fine here. However, maybe my intuition
>>>> is just off.
>>>>
>>>> Anyway, regardless of that, if we do take the threshold into account, we
>>>> should not use the exact error value for voting but just whether an
>>>> error occurred or not. If all but one children fail to flush (all for
>>>> different reasons), I find it totally wrong to return success. We should
>>>> then just return -EIO or something.
>>>>
>>> Hi Berto & Max
>>>
>>> Thanks for your comments, i'd like to have a summary here. For flush cases:
>>>
>>> 1) if flush successfully(result >= 0), result = 0; else if result < 0,
>>> result = -EIO. then invoke quorum_count_vote
>>> 2) if correct flushes >= threshold, mark correct flushes as winner directly.
>
> Please try to return one of the real error codes instead of -EIO.
> Essentially we should use the same logic as for writes (like Berto
> suggested above).

Yes, i correct myself. It seems everyone reaches an agreement in this 
thread, and will use the same logic as writes in next version.

Thanks
	-Xie
>
>> I find it difficult to understand how this corresponds to the behaviour needed
>> in COLO, where we have the NBD and the real storage on the primary; in that
>> case the failure of the real storage should give an error to the guest, but the
>> failure of the NBD shouldn't produce a guest visible failure.
>
> That's probably because you're abusing quorum as an active mirroring
> filter, which it really isn't. This is okay for now so you have
> something to work with, but I expect that eventually you'll need a
> different driver. (Well, or maybe I'm mistaken and you actually do need
> to read back data from NBD to compare it to the real storage - do you?)
>
> Anyway, I'm curious how you would handle a failed write/flush request to
> the NBD target. Simply ignoring it doesn't feel right; in case of a
> failover, wouldn't you switch to a potentially corrupted image then?
>
> Kevin
>
>
> .
>
diff mbox

Patch

diff --git a/block/quorum.c b/block/quorum.c
index 11cc60b..f094208 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -447,7 +447,8 @@  static int quorum_compute_hash(QuorumAIOCB *acb, int i, QuorumVoteValue *hash)
     return 0;
 }
 
-static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
+static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes,
+                                                 bool vote_error)
 {
     int max = 0;
     QuorumVoteVersion *candidate, *winner = NULL;
@@ -456,6 +457,12 @@  static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
         if (candidate->vote_count > max) {
             max = candidate->vote_count;
             winner = candidate;
+            continue;
+        }
+        /* For 64 bit hash */
+        if (vote_error && candidate->vote_count == max
+            && candidate->value.l == 0) {
+            winner = candidate;
         }
     }
 
@@ -545,7 +552,7 @@  static int quorum_vote_error(QuorumAIOCB *acb)
     }
 
     if (error) {
-        winner = quorum_get_vote_winner(&error_votes);
+        winner = quorum_get_vote_winner(&error_votes, false);
         ret = winner->value.l;
     }
 
@@ -610,7 +617,7 @@  static bool quorum_vote(QuorumAIOCB *acb)
     }
 
     /* vote to select the most represented version */
-    winner = quorum_get_vote_winner(&acb->votes);
+    winner = quorum_get_vote_winner(&acb->votes, false);
 
     /* if the winner count is smaller than threshold the read fails */
     if (winner->vote_count < s->threshold) {
@@ -770,7 +777,7 @@  static coroutine_fn int quorum_co_flush(BlockDriverState *bs)
         quorum_count_vote(&error_votes, &result_value, i);
     }
 
-    winner = quorum_get_vote_winner(&error_votes);
+    winner = quorum_get_vote_winner(&error_votes, true);
     result = winner->value.l;
 
     quorum_free_vote_list(&error_votes);