
default min_size for erasure pools

Message ID CABZ+qqnLbXY97REQOMRuNc2n+-f_g21bwU2E39XN6tCVfM_nJw@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Dan van der Ster March 9, 2016, 2:25 p.m. UTC
Hi,

For replicated pools we default to min_size=2 when size=3
(size-size/2) in order to avoid the split brain scenario, for example
as described here:
http://www.spinics.net/lists/ceph-devel/msg27008.html

But for erasure pools we default to min_size=k which I think is a
recipe for similar problems.

Shouldn't we default to at least min_size=k+1??

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 77e26de..5d51686 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -4427,7 +4427,7 @@ int OSDMonitor::prepare_pool_size(const unsigned pool_type,
       err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
       if (err == 0) {
        *size = erasure_code->get_chunk_count();
-       *min_size = erasure_code->get_data_chunk_count();
+       *min_size = erasure_code->get_data_chunk_count() + 1;
       }
     }
     break;


Cheers, Dan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Gregory Farnum March 9, 2016, 8:05 p.m. UTC | #1
On Wed, Mar 9, 2016 at 6:25 AM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi,
>
> For replicated pools we default to min_size=2 when size=3
> (size-size/2) in order to avoid the split brain scenario, for example
> as described here:
> http://www.spinics.net/lists/ceph-devel/msg27008.html
>
> But for erasure pools we default to min_size=k which I think is a
> recipe for similar problems.
>
> Shouldn't we default to at least min_size=k+1??
>
> diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
> index 77e26de..5d51686 100644
> --- a/src/mon/OSDMonitor.cc
> +++ b/src/mon/OSDMonitor.cc
> @@ -4427,7 +4427,7 @@ int OSDMonitor::prepare_pool_size(const unsigned pool_type,
>        err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
>        if (err == 0) {
>         *size = erasure_code->get_chunk_count();
> -       *min_size = erasure_code->get_data_chunk_count();
> +       *min_size = erasure_code->get_data_chunk_count() + 1;
>        }
>      }
>      break;

Well, losing any OSDs at that point would be bad since it would become
inaccessible until you get that whole set back, but there's not really
any chance of serving up bad reads like Sam is worried about in the
ReplicatedPG case. (...at least, assuming you have more data chunks
than parity chunks.) Send in a PR on github?
-Greg
Dan van der Ster March 9, 2016, 8:32 p.m. UTC | #2
On 3/9/16, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Wed, Mar 9, 2016 at 6:25 AM, Dan van der Ster <dan@vanderster.com>
> wrote:
>> Hi,
>>
>> For replicated pools we default to min_size=2 when size=3
>> (size-size/2) in order to avoid the split brain scenario, for example
>> as described here:
>> http://www.spinics.net/lists/ceph-devel/msg27008.html
>>
>> But for erasure pools we default to min_size=k which I think is a
>> recipe for similar problems.
>>
>> Shouldn't we default to at least min_size=k+1??
>>
>> diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
>> index 77e26de..5d51686 100644
>> --- a/src/mon/OSDMonitor.cc
>> +++ b/src/mon/OSDMonitor.cc
>> @@ -4427,7 +4427,7 @@ int OSDMonitor::prepare_pool_size(const unsigned pool_type,
>>        err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
>>        if (err == 0) {
>>         *size = erasure_code->get_chunk_count();
>> -       *min_size = erasure_code->get_data_chunk_count();
>> +       *min_size = erasure_code->get_data_chunk_count() + 1;
>>        }
>>      }
>>      break;
>
> Well, losing any OSDs at that point would be bad since it would become
> inaccessible until you get that whole set back, but there's not really
> any chance of serving up bad reads like Sam is worried about in the
> ReplicatedPG case. (...at least, assuming you have more data chunks
> than parity chunks.) Send in a PR on github?
> -Greg
>

Oops, that link discussed reads, but I'm more worried about writes.
I.e. if we allow writes when only k osds are up, then one of the m
down osds comes back and starts backfilling or recovery, but then one
of the k osds that took writes goes down before recovery completes --
at that point the newest writes exist on fewer than k osds and can't
be reconstructed.

PR incoming.

.. Dan

Patch

diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 77e26de..5d51686 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -4427,7 +4427,7 @@ int OSDMonitor::prepare_pool_size(const unsigned pool_type,
       err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
       if (err == 0) {
        *size = erasure_code->get_chunk_count();
-       *min_size = erasure_code->get_data_chunk_count();
+       *min_size = erasure_code->get_data_chunk_count() + 1;
       }
     }