diff mbox

[04/39] mds: make sure table request id unique

Message ID 1363531902-24909-5-git-send-email-zheng.z.yan@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Yan, Zheng March 17, 2013, 2:51 p.m. UTC
From: "Yan, Zheng" <zheng.z.yan@intel.com>

When a MDS becomes active, the table server re-sends 'agree' messages
for old prepared request. If the recoverd MDS starts a new table request
at the same time, The new request's ID can happen to be the same as old
prepared request's ID, because current table client assigns request ID
from zero after MDS restarts.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
---
 src/mds/MDS.cc            | 3 +++
 src/mds/MDSTableClient.cc | 5 +++++
 src/mds/MDSTableClient.h  | 2 ++
 3 files changed, 10 insertions(+)

Comments

Gregory Farnum March 19, 2013, 11:09 p.m. UTC | #1
Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests…). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so…
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:

> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>  
> When a MDS becomes active, the table server re-sends 'agree' messages
> for old prepared request. If the recoverd MDS starts a new table request
> at the same time, The new request's ID can happen to be the same as old
> prepared request's ID, because current table client assigns request ID
> from zero after MDS restarts.
>  
> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
> ---
> src/mds/MDS.cc (http://MDS.cc) | 3 +++
> src/mds/MDSTableClient.cc (http://MDSTableClient.cc) | 5 +++++
> src/mds/MDSTableClient.h | 2 ++
> 3 files changed, 10 insertions(+)
>  
> diff --git a/src/mds/MDS.cc (http://MDS.cc) b/src/mds/MDS.cc (http://MDS.cc)
> index bb1c833..859782a 100644
> --- a/src/mds/MDS.cc (http://MDS.cc)
> +++ b/src/mds/MDS.cc (http://MDS.cc)
> @@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
> dout(2) << "boot_start " << step << ": opening snap table" << dendl;  
> snapserver->load(gather.new_sub());
> }
> +
> + anchorclient->init();
> + snapclient->init();
>  
> dout(2) << "boot_start " << step << ": opening mds log" << dendl;
> mdlog->open(gather.new_sub());
> diff --git a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc) b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
> index ea021f5..beba0a3 100644
> --- a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
> +++ b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
> @@ -34,6 +34,11 @@
> #undef dout_prefix
> #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
>  
> +void MDSTableClient::init()
> +{
> + // make reqid unique between MDS restarts
> + last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
> +}
>  
> void MDSTableClient::handle_request(class MMDSTableRequest *m)
> {
> diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
> index e15837f..78035db 100644
> --- a/src/mds/MDSTableClient.h
> +++ b/src/mds/MDSTableClient.h
> @@ -63,6 +63,8 @@ public:
> MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
> virtual ~MDSTableClient() {}
>  
> + void init();
> +
> void handle_request(MMDSTableRequest *m);
>  
> void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);
> --  
> 1.7.11.7



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yan, Zheng March 20, 2013, 5:53 a.m. UTC | #2
On 03/20/2013 07:09 AM, Greg Farnum wrote:
> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests…). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
> I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so…

Not just 4 billion requests, MDS restart has several stage, mdsmap epoch increases for each stage. I don't think there are any more colliding states in the table. The table client/server use two phase commit. it's similar to client request that involves multiple MDS. the reqid is analogy to client request id. The difference is client request ID is unique because new client always get an unique session id.

Thanks
Yan, Zheng

> -Greg
> 
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
> 
>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>  
>> When a MDS becomes active, the table server re-sends 'agree' messages
>> for old prepared request. If the recoverd MDS starts a new table request
>> at the same time, The new request's ID can happen to be the same as old
>> prepared request's ID, because current table client assigns request ID
>> from zero after MDS restarts.
>>  
>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
>> ---
>> src/mds/MDS.cc (http://MDS.cc) | 3 +++
>> src/mds/MDSTableClient.cc (http://MDSTableClient.cc) | 5 +++++
>> src/mds/MDSTableClient.h | 2 ++
>> 3 files changed, 10 insertions(+)
>>  
>> diff --git a/src/mds/MDS.cc (http://MDS.cc) b/src/mds/MDS.cc (http://MDS.cc)
>> index bb1c833..859782a 100644
>> --- a/src/mds/MDS.cc (http://MDS.cc)
>> +++ b/src/mds/MDS.cc (http://MDS.cc)
>> @@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
>> dout(2) << "boot_start " << step << ": opening snap table" << dendl;  
>> snapserver->load(gather.new_sub());
>> }
>> +
>> + anchorclient->init();
>> + snapclient->init();
>>  
>> dout(2) << "boot_start " << step << ": opening mds log" << dendl;
>> mdlog->open(gather.new_sub());
>> diff --git a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc) b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>> index ea021f5..beba0a3 100644
>> --- a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>> +++ b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>> @@ -34,6 +34,11 @@
>> #undef dout_prefix
>> #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
>>  
>> +void MDSTableClient::init()
>> +{
>> + // make reqid unique between MDS restarts
>> + last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
>> +}
>>  
>> void MDSTableClient::handle_request(class MMDSTableRequest *m)
>> {
>> diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
>> index e15837f..78035db 100644
>> --- a/src/mds/MDSTableClient.h
>> +++ b/src/mds/MDSTableClient.h
>> @@ -63,6 +63,8 @@ public:
>> MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
>> virtual ~MDSTableClient() {}
>>  
>> + void init();
>> +
>> void handle_request(MMDSTableRequest *m);
>>  
>> void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);
>> --  
>> 1.7.11.7
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil March 20, 2013, 6:15 a.m. UTC | #3
On Wed, 20 Mar 2013, Yan, Zheng wrote:
> On 03/20/2013 07:09 AM, Greg Farnum wrote:
> > Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests?). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
> > I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so?
> 
> Not just 4 billion requests, MDS restart has several stage, mdsmap epoch 
> increases for each stage. I don't think there are any more colliding 
> states in the table. The table client/server use two phase commit. it's 
> similar to client request that involves multiple MDS. the reqid is 
> analogy to client request id. The difference is client request ID is 
> unique because new client always get an unique session id.

Each time a tid is consumed (at least for an update) it is journaled in 
the EMetaBlob::table_tids list, right?  So we could actually take a max 
from journal replay and pick up where we left off?  That seems like the 
cleanest.

I'm not too worried about 2^32 tids, I guess, but it would be nicer to 
avoid that possibility.

sage

> 
> Thanks
> Yan, Zheng
> 
> > -Greg
> > 
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
> > 
> > 
> > On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
> > 
> >> From: "Yan, Zheng" <zheng.z.yan@intel.com>
> >>  
> >> When a MDS becomes active, the table server re-sends 'agree' messages
> >> for old prepared request. If the recoverd MDS starts a new table request
> >> at the same time, The new request's ID can happen to be the same as old
> >> prepared request's ID, because current table client assigns request ID
> >> from zero after MDS restarts.
> >>  
> >> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
> >> ---
> >> src/mds/MDS.cc (http://MDS.cc) | 3 +++
> >> src/mds/MDSTableClient.cc (http://MDSTableClient.cc) | 5 +++++
> >> src/mds/MDSTableClient.h | 2 ++
> >> 3 files changed, 10 insertions(+)
> >>  
> >> diff --git a/src/mds/MDS.cc (http://MDS.cc) b/src/mds/MDS.cc (http://MDS.cc)
> >> index bb1c833..859782a 100644
> >> --- a/src/mds/MDS.cc (http://MDS.cc)
> >> +++ b/src/mds/MDS.cc (http://MDS.cc)
> >> @@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
> >> dout(2) << "boot_start " << step << ": opening snap table" << dendl;  
> >> snapserver->load(gather.new_sub());
> >> }
> >> +
> >> + anchorclient->init();
> >> + snapclient->init();
> >>  
> >> dout(2) << "boot_start " << step << ": opening mds log" << dendl;
> >> mdlog->open(gather.new_sub());
> >> diff --git a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc) b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
> >> index ea021f5..beba0a3 100644
> >> --- a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
> >> +++ b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
> >> @@ -34,6 +34,11 @@
> >> #undef dout_prefix
> >> #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
> >>  
> >> +void MDSTableClient::init()
> >> +{
> >> + // make reqid unique between MDS restarts
> >> + last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
> >> +}
> >>  
> >> void MDSTableClient::handle_request(class MMDSTableRequest *m)
> >> {
> >> diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
> >> index e15837f..78035db 100644
> >> --- a/src/mds/MDSTableClient.h
> >> +++ b/src/mds/MDSTableClient.h
> >> @@ -63,6 +63,8 @@ public:
> >> MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
> >> virtual ~MDSTableClient() {}
> >>  
> >> + void init();
> >> +
> >> void handle_request(MMDSTableRequest *m);
> >>  
> >> void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);
> >> --  
> >> 1.7.11.7
> > 
> > 
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yan, Zheng March 20, 2013, 6:24 a.m. UTC | #4
On 03/20/2013 02:15 PM, Sage Weil wrote:
> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>>> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests?). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
>>> I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so?
>>
>> Not just 4 billion requests, MDS restart has several stage, mdsmap epoch 
>> increases for each stage. I don't think there are any more colliding 
>> states in the table. The table client/server use two phase commit. it's 
>> similar to client request that involves multiple MDS. the reqid is 
>> analogy to client request id. The difference is client request ID is 
>> unique because new client always get an unique session id.
> 
> Each time a tid is consumed (at least for an update) it is journaled in 
> the EMetaBlob::table_tids list, right?  So we could actually take a max 
> from journal replay and pick up where we left off?  That seems like the 
> cleanest.

This approach works only if client journal the reqid before it sending the request
to the server. but current implementation is client journal the reqid when it receives
server's agree message.

Regards
Yan, Zheng 
> 
> I'm not too worried about 2^32 tids, I guess, but it would be nicer to 
> avoid that possibility.
> 
> sage
> 
>>
>> Thanks
>> Yan, Zheng
>>
>>> -Greg
>>>
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
>>>
>>>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>>>  
>>>> When a MDS becomes active, the table server re-sends 'agree' messages
>>>> for old prepared request. If the recoverd MDS starts a new table request
>>>> at the same time, The new request's ID can happen to be the same as old
>>>> prepared request's ID, because current table client assigns request ID
>>>> from zero after MDS restarts.
>>>>  
>>>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
>>>> ---
>>>> src/mds/MDS.cc (http://MDS.cc) | 3 +++
>>>> src/mds/MDSTableClient.cc (http://MDSTableClient.cc) | 5 +++++
>>>> src/mds/MDSTableClient.h | 2 ++
>>>> 3 files changed, 10 insertions(+)
>>>>  
>>>> diff --git a/src/mds/MDS.cc (http://MDS.cc) b/src/mds/MDS.cc (http://MDS.cc)
>>>> index bb1c833..859782a 100644
>>>> --- a/src/mds/MDS.cc (http://MDS.cc)
>>>> +++ b/src/mds/MDS.cc (http://MDS.cc)
>>>> @@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
>>>> dout(2) << "boot_start " << step << ": opening snap table" << dendl;  
>>>> snapserver->load(gather.new_sub());
>>>> }
>>>> +
>>>> + anchorclient->init();
>>>> + snapclient->init();
>>>>  
>>>> dout(2) << "boot_start " << step << ": opening mds log" << dendl;
>>>> mdlog->open(gather.new_sub());
>>>> diff --git a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc) b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>>>> index ea021f5..beba0a3 100644
>>>> --- a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>>>> +++ b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>>>> @@ -34,6 +34,11 @@
>>>> #undef dout_prefix
>>>> #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
>>>>  
>>>> +void MDSTableClient::init()
>>>> +{
>>>> + // make reqid unique between MDS restarts
>>>> + last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
>>>> +}
>>>>  
>>>> void MDSTableClient::handle_request(class MMDSTableRequest *m)
>>>> {
>>>> diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
>>>> index e15837f..78035db 100644
>>>> --- a/src/mds/MDSTableClient.h
>>>> +++ b/src/mds/MDSTableClient.h
>>>> @@ -63,6 +63,8 @@ public:
>>>> MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
>>>> virtual ~MDSTableClient() {}
>>>>  
>>>> + void init();
>>>> +
>>>> void handle_request(MMDSTableRequest *m);
>>>>  
>>>> void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);
>>>> --  
>>>> 1.7.11.7
>>>
>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yan, Zheng March 20, 2013, 6:49 a.m. UTC | #5
On 03/20/2013 02:15 PM, Sage Weil wrote:
> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>>> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests?). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
>>> I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so?
>>
>> Not just 4 billion requests, MDS restart has several stage, mdsmap epoch 
>> increases for each stage. I don't think there are any more colliding 
>> states in the table. The table client/server use two phase commit. it's 
>> similar to client request that involves multiple MDS. the reqid is 
>> analogy to client request id. The difference is client request ID is 
>> unique because new client always get an unique session id.
> 
> Each time a tid is consumed (at least for an update) it is journaled in 
> the EMetaBlob::table_tids list, right?  So we could actually take a max 
> from journal replay and pick up where we left off?  That seems like the 
> cleanest.
> 
> I'm not too worried about 2^32 tids, I guess, but it would be nicer to 
> avoid that possibility.
> 

Can we re-use the client request ID as table client request ID ?

Regards
Yan, Zheng

> sage
> 
>>
>> Thanks
>> Yan, Zheng
>>
>>> -Greg
>>>
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Sunday, March 17, 2013 at 7:51 AM, Yan, Zheng wrote:
>>>
>>>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>>>  
>>>> When a MDS becomes active, the table server re-sends 'agree' messages
>>>> for old prepared request. If the recoverd MDS starts a new table request
>>>> at the same time, The new request's ID can happen to be the same as old
>>>> prepared request's ID, because current table client assigns request ID
>>>> from zero after MDS restarts.
>>>>  
>>>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com (mailto:zheng.z.yan@intel.com)>
>>>> ---
>>>> src/mds/MDS.cc (http://MDS.cc) | 3 +++
>>>> src/mds/MDSTableClient.cc (http://MDSTableClient.cc) | 5 +++++
>>>> src/mds/MDSTableClient.h | 2 ++
>>>> 3 files changed, 10 insertions(+)
>>>>  
>>>> diff --git a/src/mds/MDS.cc (http://MDS.cc) b/src/mds/MDS.cc (http://MDS.cc)
>>>> index bb1c833..859782a 100644
>>>> --- a/src/mds/MDS.cc (http://MDS.cc)
>>>> +++ b/src/mds/MDS.cc (http://MDS.cc)
>>>> @@ -1212,6 +1212,9 @@ void MDS::boot_start(int step, int r)
>>>> dout(2) << "boot_start " << step << ": opening snap table" << dendl;  
>>>> snapserver->load(gather.new_sub());
>>>> }
>>>> +
>>>> + anchorclient->init();
>>>> + snapclient->init();
>>>>  
>>>> dout(2) << "boot_start " << step << ": opening mds log" << dendl;
>>>> mdlog->open(gather.new_sub());
>>>> diff --git a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc) b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>>>> index ea021f5..beba0a3 100644
>>>> --- a/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>>>> +++ b/src/mds/MDSTableClient.cc (http://MDSTableClient.cc)
>>>> @@ -34,6 +34,11 @@
>>>> #undef dout_prefix
>>>> #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
>>>>  
>>>> +void MDSTableClient::init()
>>>> +{
>>>> + // make reqid unique between MDS restarts
>>>> + last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
>>>> +}
>>>>  
>>>> void MDSTableClient::handle_request(class MMDSTableRequest *m)
>>>> {
>>>> diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
>>>> index e15837f..78035db 100644
>>>> --- a/src/mds/MDSTableClient.h
>>>> +++ b/src/mds/MDSTableClient.h
>>>> @@ -63,6 +63,8 @@ public:
>>>> MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
>>>> virtual ~MDSTableClient() {}
>>>>  
>>>> + void init();
>>>> +
>>>> void handle_request(MMDSTableRequest *m);
>>>>  
>>>> void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);
>>>> --  
>>>> 1.7.11.7
>>>
>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Farnum March 20, 2013, 6:31 p.m. UTC | #6
On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
> On 03/20/2013 02:15 PM, Sage Weil wrote:
> > On Wed, 20 Mar 2013, Yan, Zheng wrote:
> > > On 03/20/2013 07:09 AM, Greg Farnum wrote:
> > > > Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests?). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
> > > > I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so?
> > >  
> > >  
> > >  
> > > Not just 4 billion requests, MDS restart has several stage, mdsmap epoch  
> > > increases for each stage. I don't think there are any more colliding  
> > > states in the table. The table client/server use two phase commit. it's  
> > > similar to client request that involves multiple MDS. the reqid is  
> > > analogy to client request id. The difference is client request ID is  
> > > unique because new client always get an unique session id.
> >  
> >  
> >  
> > Each time a tid is consumed (at least for an update) it is journaled in  
> > the EMetaBlob::table_tids list, right? So we could actually take a max  
> > from journal replay and pick up where we left off? That seems like the  
> > cleanest.
> >  
> > I'm not too worried about 2^32 tids, I guess, but it would be nicer to  
> > avoid that possibility.
>  
>  
>  
> Can we re-use the client request ID as table client request ID ?
>  
> Regards
> Yan, Zheng

Not sure what you're referring to here — do you mean the ID of the filesystem client request which prompted the update? I don't think that would work as client requests actually require two parts to be unique (the client GUID and the request seq number), and I'm pretty sure a single client request can spawn multiple Table updates.

As I look over this more, it sure looks to me as if the effect of the code we have (when non-broken) is to rollback every non-committed request by an MDS which restarted — the only time it can handle the TableServer's "agree" with a different response is if the MDS was incorrectly marked out by the map. Am I parsing this correctly, Sage? Given that, and without having looked at the code more broadly, I think we want to add some sort of implicit or explicit handshake letting each of them know if the MDS actually disappeared. We use the process/address nonce to accomplish this in other places…
-Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yan, Zheng March 21, 2013, 8:07 a.m. UTC | #7
On 03/21/2013 02:31 AM, Greg Farnum wrote:
> On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
>> On 03/20/2013 02:15 PM, Sage Weil wrote:
>>> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>>>> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>>>>> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests?). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
>>>>> I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so?
>>>>  
>>>>  
>>>>  
>>>> Not just 4 billion requests, MDS restart has several stage, mdsmap epoch  
>>>> increases for each stage. I don't think there are any more colliding  
>>>> states in the table. The table client/server use two phase commit. it's  
>>>> similar to client request that involves multiple MDS. the reqid is  
>>>> analogy to client request id. The difference is client request ID is  
>>>> unique because new client always get an unique session id.
>>>  
>>>  
>>>  
>>> Each time a tid is consumed (at least for an update) it is journaled in  
>>> the EMetaBlob::table_tids list, right? So we could actually take a max  
>>> from journal replay and pick up where we left off? That seems like the  
>>> cleanest.
>>>  
>>> I'm not too worried about 2^32 tids, I guess, but it would be nicer to  
>>> avoid that possibility.
>>  
>>  
>>  
>> Can we re-use the client request ID as table client request ID ?
>>  
>> Regards
>> Yan, Zheng
> 
> Not sure what you're referring to here — do you mean the ID of the filesystem client request which prompted the update? I don't think that would work as client requests actually require two parts to be unique (the client GUID and the request seq number), and I'm pretty sure a single client request can spawn multiple Table updates.
> 

You are right, client request ID does not work.

> As I look over this more, it sure looks to me as if the effect of the code we have (when non-broken) is to rollback every non-committed request by an MDS which restarted — the only time it can handle the TableServer's "agree" with a different response is if the MDS was incorrectly marked out by the map. Am I parsing this correctly, Sage? Given that, and without having looked at the code more broadly, I think we want to add some sort of implicit or explicit handshake letting each of them know if the MDS actually disappeared. We use the process/address nonce to accomplish this in other places…
> -Greg
> 

The table server sends 'agree' message to table client after a 'prepare entry' is safely logged. The table server re-sends 'agree' message in two cases, one is the table client restarts, another is the table server itself restarts.
The purpose of re-sending 'agree' message is to check if the table client still wants to keep the update preparation. (The table client might crash before submitting the update). The purpose of reqid is associate table update
preparation request with the server's 'agree' reply message. The problem here is that the table client does not make sure reqid unique between restarts. If you feel 2^32 reqids are still enough, set the reqid to a randomized 64bit
value should be safe enough.

Thanks
Yan, Zheng
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Farnum March 21, 2013, 10:03 p.m. UTC | #8
On Thu, Mar 21, 2013 at 1:07 AM, Yan, Zheng <zheng.z.yan@intel.com> wrote:
> On 03/21/2013 02:31 AM, Greg Farnum wrote:
>> On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote:
>>> On 03/20/2013 02:15 PM, Sage Weil wrote:
>>>> On Wed, 20 Mar 2013, Yan, Zheng wrote:
>>>>> On 03/20/2013 07:09 AM, Greg Farnum wrote:
>>>>>> Hmm, this is definitely narrowing the race (probably enough to never hit it), but it's not actually eliminating it (if the restart happens after 4 billion requests?). More importantly this kind of symptom makes me worry that we might be papering over more serious issues with colliding states in the Table on restart.
>>>>>> I don't have the MDSTable semantics in my head so I'll need to look into this later unless somebody else volunteers to do so?
>>>>>
>>>>>
>>>>>
>>>>> Not just 4 billion requests, MDS restart has several stage, mdsmap epoch
>>>>> increases for each stage. I don't think there are any more colliding
>>>>> states in the table. The table client/server use two phase commit. it's
>>>>> similar to client request that involves multiple MDS. the reqid is
>>>>> analogy to client request id. The difference is client request ID is
>>>>> unique because new client always get an unique session id.
>>>>
>>>>
>>>>
>>>> Each time a tid is consumed (at least for an update) it is journaled in
>>>> the EMetaBlob::table_tids list, right? So we could actually take a max
>>>> from journal replay and pick up where we left off? That seems like the
>>>> cleanest.
>>>>
>>>> I'm not too worried about 2^32 tids, I guess, but it would be nicer to
>>>> avoid that possibility.
>>>
>>>
>>>
>>> Can we re-use the client request ID as table client request ID ?
>>>
>>> Regards
>>> Yan, Zheng
>>
>> Not sure what you're referring to here — do you mean the ID of the filesystem client request which prompted the update? I don't think that would work as client requests actually require two parts to be unique (the client GUID and the request seq number), and I'm pretty sure a single client request can spawn multiple Table updates.
>>
>
> You are right, client request ID does not work.
>
>> As I look over this more, it sure looks to me as if the effect of the code we have (when non-broken) is to rollback every non-committed request by an MDS which restarted — the only time it can handle the TableServer's "agree" with a different response is if the MDS was incorrectly marked out by the map. Am I parsing this correctly, Sage? Given that, and without having looked at the code more broadly, I think we want to add some sort of implicit or explicit handshake letting each of them know if the MDS actually disappeared. We use the process/address nonce to accomplish this in other places…
>> -Greg
>>
>
> The table server sends 'agree' message to table client after a 'prepare entry' is safely logged. The table server re-sends 'agree' message in two cases, one is the table client restarts, another is the table server itself restarts.
> The purpose of re-sending 'agree' message is to check if the table client still wants to keep the update preparation. (The table client might crash before submitting the update). The purpose of reqid is associate table update
> preparation request with the server's 'agree' reply message. The problem here is that the table client does not make sure reqid unique between restarts. If you feel 2^32 reqids are still enough, set the reqid to a randomized 64bit
> value should be safe enough.

Right. I'd like to somehow mark those reqid's so that we can tell when
they come from a different incarnation of the MDS TableClient daemon.
One way is via some piece of random data that will probably
distinguish them, although if we have something which we can know is
different that would be preferable. I think we can work something out
of the startup session data each MDS does with the monitors, but I'm
not sure I can check any time soon; I have a number of other things to
get to now that I've gotten through (the first round on) this series.

Thanks for all the patches, by the way. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/src/mds/MDS.cc b/src/mds/MDS.cc
index bb1c833..859782a 100644
--- a/src/mds/MDS.cc
+++ b/src/mds/MDS.cc
@@ -1212,6 +1212,9 @@  void MDS::boot_start(int step, int r)
 	dout(2) << "boot_start " << step << ": opening snap table" << dendl;	
 	snapserver->load(gather.new_sub());
       }
+
+      anchorclient->init();
+      snapclient->init();
       
       dout(2) << "boot_start " << step << ": opening mds log" << dendl;
       mdlog->open(gather.new_sub());
diff --git a/src/mds/MDSTableClient.cc b/src/mds/MDSTableClient.cc
index ea021f5..beba0a3 100644
--- a/src/mds/MDSTableClient.cc
+++ b/src/mds/MDSTableClient.cc
@@ -34,6 +34,11 @@ 
 #undef dout_prefix
 #define dout_prefix *_dout << "mds." << mds->get_nodeid() << ".tableclient(" << get_mdstable_name(table) << ") "
 
+void MDSTableClient::init()
+{
+  // make reqid unique between MDS restarts
+  last_reqid = (uint64_t)mds->mdsmap->get_epoch() << 32;
+}
 
 void MDSTableClient::handle_request(class MMDSTableRequest *m)
 {
diff --git a/src/mds/MDSTableClient.h b/src/mds/MDSTableClient.h
index e15837f..78035db 100644
--- a/src/mds/MDSTableClient.h
+++ b/src/mds/MDSTableClient.h
@@ -63,6 +63,8 @@  public:
   MDSTableClient(MDS *m, int tab) : mds(m), table(tab), last_reqid(0) {}
   virtual ~MDSTableClient() {}
 
+  void init();
+
   void handle_request(MMDSTableRequest *m);
 
   void _prepare(bufferlist& mutation, version_t *ptid, bufferlist *pbl, Context *onfinish);