| Message ID | CABZ+qqnLbXY97REQOMRuNc2n+-f_g21bwU2E39XN6tCVfM_nJw@mail.gmail.com (mailing list archive) |
|---|---|
| State | New, archived |
On Wed, Mar 9, 2016 at 6:25 AM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi,
>
> For replicated pools we default to min_size=2 when size=3
> (size-size/2) in order to avoid the split brain scenario, for example
> as described here:
> http://www.spinics.net/lists/ceph-devel/msg27008.html
>
> But for erasure pools we default to min_size=k which I think is a
> recipe for similar problems.
>
> Shouldn't we default to at least min_size=k+1??
>
> diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
> index 77e26de..5d51686 100644
> --- a/src/mon/OSDMonitor.cc
> +++ b/src/mon/OSDMonitor.cc
> @@ -4427,7 +4427,7 @@ int OSDMonitor::prepare_pool_size(const unsigned pool_type,
>      err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
>      if (err == 0) {
>        *size = erasure_code->get_chunk_count();
> -      *min_size = erasure_code->get_data_chunk_count();
> +      *min_size = erasure_code->get_data_chunk_count() + 1;
>      }
>    }
>    break;

Well, losing any OSDs at that point would be bad since it would become
inaccessible until you get that whole set back, but there's not really
any chance of serving up bad reads like Sam is worried about in the
ReplicatedPG case. (...at least, assuming you have more data chunks
than parity chunks.) Send in a PR on github?
-Greg
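For concreteness, here is a minimal standalone C++ sketch of the two default min_size rules being discussed: size - size/2 for replicated pools, versus k today and k+1 as proposed for erasure-coded pools. This is not Ceph code, and the 4+2 profile used below is only an illustrative assumption.

```cpp
// Minimal sketch of the min_size defaults discussed above.
// Not Ceph code; the k=4, m=2 profile below is a hypothetical example.
#include <iostream>

// Replicated pools: min_size defaults to size - size/2 (integer division),
// e.g. 2 when size=3.
unsigned replicated_min_size(unsigned size) {
  return size - size / 2;
}

// Erasure-coded pools: the current default is k (the data chunk count);
// the proposal above is k + 1.
unsigned erasure_min_size(unsigned k, bool proposed) {
  return proposed ? k + 1 : k;
}

int main() {
  std::cout << "replicated, size=3: min_size=" << replicated_min_size(3) << "\n";                // 2
  std::cout << "erasure 4+2, current default:  min_size=" << erasure_min_size(4, false) << "\n"; // 4
  std::cout << "erasure 4+2, proposed default: min_size=" << erasure_min_size(4, true) << "\n";  // 5
}
```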
On 3/9/16, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Wed, Mar 9, 2016 at 6:25 AM, Dan van der Ster <dan@vanderster.com> wrote:
>> Hi,
>>
>> For replicated pools we default to min_size=2 when size=3
>> (size-size/2) in order to avoid the split brain scenario, for example
>> as described here:
>> http://www.spinics.net/lists/ceph-devel/msg27008.html
>>
>> But for erasure pools we default to min_size=k which I think is a
>> recipe for similar problems.
>>
>> Shouldn't we default to at least min_size=k+1??
>>
>> diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
>> index 77e26de..5d51686 100644
>> --- a/src/mon/OSDMonitor.cc
>> +++ b/src/mon/OSDMonitor.cc
>> @@ -4427,7 +4427,7 @@ int OSDMonitor::prepare_pool_size(const unsigned pool_type,
>>      err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
>>      if (err == 0) {
>>        *size = erasure_code->get_chunk_count();
>> -      *min_size = erasure_code->get_data_chunk_count();
>> +      *min_size = erasure_code->get_data_chunk_count() + 1;
>>      }
>>    }
>>    break;
>
> Well, losing any OSDs at that point would be bad since it would become
> inaccessible until you get that whole set back, but there's not really
> any chance of serving up bad reads like Sam is worried about in the
> ReplicatedPG case. (...at least, assuming you have more data chunks
> than parity chunks.) Send in a PR on github?
> -Greg
>

Oops, that link discussed reads, but I'm more worried about writes.
I.e. if we allow writes when only k osds are up, then one of the m
down osds comes back and starts backfilling or recovery, but then one
of the k osds that took writes goes down before recovery completes.

PR incoming.

.. Dan
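A toy C++ sketch of the write-then-failure window Dan describes (again not Ceph's recovery logic, and the 4+2 profile is only an assumption): with min_size=k a write can be acknowledged while exactly k shards are up, so losing any one of those shards before recovery completes leaves fewer than k shards holding the newest version, which the erasure code can no longer reconstruct.

```cpp
// Toy model of the failure window described above; not Ceph's recovery logic.
// Assumes a hypothetical k=4, m=2 erasure-coded pool whose m parity OSDs are down.
#include <iostream>

// An EC object version can be reconstructed only if at least k of the
// shards holding that version are still available.
bool latest_version_recoverable(unsigned shards_with_latest, unsigned k) {
  return shards_with_latest >= k;
}

int main() {
  const unsigned k = 4;             // data chunks; the 2 parity OSDs are already down
  unsigned shards_with_latest = k;  // with min_size=k, a write is acked by exactly k shards

  std::cout << std::boolalpha;
  std::cout << "right after the write:  recoverable="
            << latest_version_recoverable(shards_with_latest, k) << "\n";  // true, but zero margin

  // One of the k OSDs that took the write dies before backfill/recovery
  // finishes pushing the new version to the returning OSDs.
  shards_with_latest -= 1;
  std::cout << "after one more failure: recoverable="
            << latest_version_recoverable(shards_with_latest, k) << "\n";  // false: newest writes lost

  // With min_size=k+1 the write would not have been accepted in that
  // degraded state, so at least k shards of it would still survive here.
}
```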
```diff
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 77e26de..5d51686 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -4427,7 +4427,7 @@ int OSDMonitor::prepare_pool_size(const unsigned pool_type,
       err = get_erasure_code(erasure_code_profile, &erasure_code, ss);
       if (err == 0) {
 	*size = erasure_code->get_chunk_count();
-	*min_size = erasure_code->get_data_chunk_count();
+	*min_size = erasure_code->get_data_chunk_count() + 1;
       }
     }
     break;
```
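One follow-up note on scope: the patch above only changes the default chosen at pool-creation time in prepare_pool_size. For erasure-coded pools that already exist, the same protection would presumably have to be applied by hand, e.g. with something like `ceph osd pool set <pool> min_size <k+1>`, taking k from the pool's erasure-code profile.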