ceph: use the OSD time as the object mtime instead of the client time

Message ID 06E7D85B3BA36C4DB207FEDE871C534891D8CB@SHSMSX101.ccr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Wang, Zhiqiang July 29, 2014, 5:07 a.m. UTC
This fixes a bug that occurs when the clocks of the OSDs and clients are not
synchronized (especially when a client is ahead of an OSD): once the cache
tier dirty ratio reaches the threshold, the agent skips the flush work
because it thinks the object is too young.

The code that skips the flush is as follows:

  if (obc->obs.oi.mtime + utime_t(pool.info.cache_min_flush_age, 0) > now) {
    dout(20) << __func__ << " skip (too young) " << obc->obs.oi << dendl;
    osd->logger->inc(l_osd_agent_skip);
    return false;
  }

Signed-off-by: Zhiqiang Wang <wonzhq@hotmail.com>
---
 src/osd/ReplicatedPG.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Sage Weil July 29, 2014, 3:46 p.m. UTC | #1
On Tue, 29 Jul 2014, Wang, Zhiqiang wrote:
> This fixes a bug when the time of the OSDs and clients are not synchronized
> (especially when client is ahead of OSD), and the cache tier dirty ratio
> reaches the threshold, the agent skips the flush work because it thinks
> the object is too young.
> 
> The skipping flush code is as following:
> 
>   if (obc->obs.oi.mtime + utime_t(pool.info.cache_min_flush_age, 0) > now) {
>     dout(20) << __func__ << " skip (too young) " << obc->obs.oi << dendl;
>     osd->logger->inc(l_osd_agent_skip);
>     return false;
>   }

Hmm, I think the use of the client mtime is an important property we want 
to maintain for the benefit of the rest of the system.  Two other 
possibilities:

1) If cache_min_flush_age is 0, skip the check, so we avoid mtimes in the 
future in that case.

2) Add an additional field to object_info_t that is an mtime based on the 
OSD's clock, and use that for these checks.

What do you think?
sage


> 
> Signed-off-by: Zhiqiang Wang <wonzhq@hotmail.com>
> ---
>  src/osd/ReplicatedPG.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
> index 5d822c1..1c7c527 100644
> --- a/src/osd/ReplicatedPG.cc
> +++ b/src/osd/ReplicatedPG.cc
> @@ -1805,7 +1805,7 @@ void ReplicatedPG::execute_ctx(OpContext *ctx)
> 
>      // version
>      ctx->at_version = get_next_version();
> -    ctx->mtime = m->get_mtime();
> +    ctx->mtime = ceph_clock_now(cct);
> 
>      dout(10) << "do_op " << soid << " " << ctx->ops
>              << " ov " << obc->obs.oi.version << " av " << ctx->at_version
> --
> 1.9.1
Gregory Farnum July 29, 2014, 4:02 p.m. UTC | #2
On Tue, Jul 29, 2014 at 11:46 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 29 Jul 2014, Wang, Zhiqiang wrote:
>> This fixes a bug when the time of the OSDs and clients are not synchronized
>> (especially when client is ahead of OSD), and the cache tier dirty ratio
>> reaches the threshold, the agent skips the flush work because it thinks
>> the object is too young.
>>
>> The skipping flush code is as following:
>>
>>   if (obc->obs.oi.mtime + utime_t(pool.info.cache_min_flush_age, 0) > now) {
>>     dout(20) << __func__ << " skip (too young) " << obc->obs.oi << dendl;
>>     osd->logger->inc(l_osd_agent_skip);
>>     return false;
>>   }
>
> Hmm, I think the use of the client mtime is an important property we want
> to maintain for the benefit of the rest of the system.  Two other
> possibilities:

This was my initial thought as well — but I don't think we're actually
relying on that any more (e.g., we used to use the client times for
file recovery in the MDS, but the rest of CephFS is now using MDS
times anyway). Are there scenarios other than CephFS recovery?

>
> 1) If cache_min_flush_age is 0, skip the check, so we avoid mtimes in the
> future in that case.
>
> 2) Add an additional field to object_info_t that is an mtime based on the
> OSD's clock, and use that for these checks.

But option 2 is probably still a good idea.
-Greg
Milosz Tanski July 29, 2014, 6:54 p.m. UTC | #3
Off the top of my head... the fscache code (read only) in the CephFS
kclient uses the mtime to see if the item is valid (there was no object
change version counter at the time). I don't know if or how that would be
affected.

- M

On Tue, Jul 29, 2014 at 12:02 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Tue, Jul 29, 2014 at 11:46 AM, Sage Weil <sweil@redhat.com> wrote:
>> On Tue, 29 Jul 2014, Wang, Zhiqiang wrote:
>>> This fixes a bug when the time of the OSDs and clients are not synchronized
>>> (especially when client is ahead of OSD), and the cache tier dirty ratio
>>> reaches the threshold, the agent skips the flush work because it thinks
>>> the object is too young.
>>>
>>> The skipping flush code is as following:
>>>
>>>   if (obc->obs.oi.mtime + utime_t(pool.info.cache_min_flush_age, 0) > now) {
>>>     dout(20) << __func__ << " skip (too young) " << obc->obs.oi << dendl;
>>>     osd->logger->inc(l_osd_agent_skip);
>>>     return false;
>>>   }
>>
>> Hmm, I think the use of the client mtime is an important property we want
>> to maintain for the benefit of the rest of the system.  Two other
>> possibilities:
>
> This was my initial thought as well — but I don't think we're actually
> relying on that any more (e.g., we used to use the client times for
> file recovery in the MDS, but the rest of CephFS is now using MDS
> times anyway). Are there scenarios other than CephFS recovery?
>
>>
>> 1) If cache_min_flush_age is 0, skip the check, so we avoid mtimes in the
>> future in that case.
>>
>> 2) Add an additional field to object_info_t that is an mtime based on the
>> OSD's clock, and use that for these checks.
>
> But option 2 is probably still a good idea.
> -Greg
Wang, Zhiqiang July 30, 2014, 1:02 a.m. UTC | #4
Looks like we do need to keep the client time as the mtime. Since option 1
doesn't really fix the problem, I would vote for option 2.

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Milosz Tanski
Sent: Wednesday, July 30, 2014 2:55 AM
To: Gregory Farnum
Cc: Sage Weil; Wang, Zhiqiang; ceph-devel@vger.kernel.org
Subject: Re: [PATCH] ceph: use the OSD time as the object mtime instead of the client time

Off the top of my head... the fscache code (read only) in the CephFS
kclient uses the mtime to see if the item is valid (there was not an
object change version counter at the time). I don't know if / how that
would be affected off the top of my head.

- M

Patch

diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc
index 5d822c1..1c7c527 100644
--- a/src/osd/ReplicatedPG.cc
+++ b/src/osd/ReplicatedPG.cc
@@ -1805,7 +1805,7 @@  void ReplicatedPG::execute_ctx(OpContext *ctx)

     // version
     ctx->at_version = get_next_version();
-    ctx->mtime = m->get_mtime();
+    ctx->mtime = ceph_clock_now(cct);

     dout(10) << "do_op " << soid << " " << ctx->ops
             << " ov " << obc->obs.oi.version << " av " << ctx->at_version