| Message ID | 20150226152807.5b71a93c@doppio (mailing list archive) |
| --- | --- |
| State | New, archived |
Hmm, we have already observed these duplicate omap key sets coming from
pglog operations.

I think we need to resolve it in the upper layer; of course, coalescing omap
operations in the FileStore is also useful.

@Somnath Do you already do this dedup work in KeyValueStore?

On Thu, Feb 26, 2015 at 10:28 PM, Andreas Bluemle
<andreas.bluemle@itxperts.de> wrote:
> Hi,
>
> during the performance weekly meeting, I had mentioned
> my experiences concerning the transaction structure
> for write requests at the level of the FileStore.
> Such a transaction not only contains the OP_WRITE
> operation to the object in the file system, but also
> a series of OP_OMAP_SETKEYS and OP_SETATTR operations.
>
> Find attached a README and source code patch, which
> describe a prototype for coalescing the OP_OMAP_SETKEYS
> operations and the performance impact of this change.
>
> Regards
>
> Andreas Bluemle
>
> --
> Andreas Bluemle                      mailto:Andreas.Bluemle@itxperts.de
> ITXperts GmbH                        http://www.itxperts.de
> Balanstrasse 73, Geb. 08             Phone: (+49) 89 89044917
> D-81541 Muenchen (Germany)           Fax:   (+49) 89 89044910
>
> Company details: http://www.itxperts.de/imprint.htm
On 02/26/2015 09:02 AM, Haomai Wang wrote:
> Hmm, we have already observed these duplicate omap key sets coming from
> pglog operations.
>
> I think we need to resolve it in the upper layer; of course,

Can we resolve this higher up easily? It strikes me that it might be easier
to simply do the coalescing here. I think this was the approach Sam was
advocating in the performance meeting?

> coalescing omap operations in the FileStore is also useful.
>
> @Somnath Do you already do this dedup work in KeyValueStore?
>
> On Thu, Feb 26, 2015 at 10:28 PM, Andreas Bluemle
> <andreas.bluemle@itxperts.de> wrote:
>> Hi,
>>
>> during the performance weekly meeting, I had mentioned
>> my experiences concerning the transaction structure
>> for write requests at the level of the FileStore.
>> Such a transaction not only contains the OP_WRITE
>> operation to the object in the file system, but also
>> a series of OP_OMAP_SETKEYS and OP_SETATTR operations.
>>
>> Find attached a README and source code patch, which
>> describe a prototype for coalescing the OP_OMAP_SETKEYS
>> operations and the performance impact of this change.
>>
>> Regards
>>
>> Andreas Bluemle
>>
>> --
>> Andreas Bluemle                      mailto:Andreas.Bluemle@itxperts.de
>> ITXperts GmbH                        http://www.itxperts.de
>> Balanstrasse 73, Geb. 08             Phone: (+49) 89 89044917
>> D-81541 Muenchen (Germany)           Fax:   (+49) 89 89044910
>>
>> Company details: http://www.itxperts.de/imprint.htm
On Thu, 26 Feb 2015, Andreas Bluemle wrote:
> Hi,
>
> during the performance weekly meeting, I had mentioned
> my experiences concerning the transaction structure
> for write requests at the level of the FileStore.
> Such a transaction not only contains the OP_WRITE
> operation to the object in the file system, but also
> a series of OP_OMAP_SETKEYS and OP_SETATTR operations.
>
> Find attached a README and source code patch, which
> describe a prototype for coalescing the OP_OMAP_SETKEYS
> operations and the performance impact of this change.

I think we should try to avoid the dups in the first place in the upper
layers before resorting to deduping in the FileStore.  Here's a sample
transaction:

    {
        "ops": [
            {
                "op_num": 0,
                "op_name": "omap_setkeys",
                "collection": "0.3_head",
                "oid": "3\/\/head\/\/0",
                "attr_lens": {
                    "0000000005.00000000000000000057": 180
                }
            },

^ PG::append_log

            {
                "op_num": 1,
                "op_name": "omap_setkeys",
                "collection": "0.3_head",
                "oid": "3\/\/head\/\/0",
                "attr_lens": {
                    "_epoch": 4,
                    "_info": 729
                }
            },

^^ these two come from PG::_write_info()

            {
                "op_num": 2,
                "op_name": "omap_setkeys",
                "collection": "0.3_head",
                "oid": "3\/\/head\/\/0",
                "attr_lens": {
                    "0000000005.00000000000000000057": 180,
                    "can_rollback_to": 12,
                    "rollback_info_trimmed_to": 12
                }
            },

^ PG::_write_log()

I think the log ones are easy to combine... in fact, I think we saw a pull
request from someone recently that does this?  The _write_info() one is
probably a slightly bigger challenge, but only slightly, given that both of
these are called from

    void PG::write_if_dirty(ObjectStore::Transaction& t)
    {
      if (dirty_big_info || dirty_info)
        write_info(t);
      pg_log.write_log(t, coll, pgmeta_oid);
    }

            {
                "op_num": 3,
                "op_name": "write",
                "collection": "0.3_head",
                "oid": "b2082803\/benchmark_data_maetl_20149_object474\/head\/\/0",
                "length": 123,
                "offset": 0,
                "bufferlist length": 123
            },

necessary

            {
                "op_num": 4,
                "op_name": "setattr",
                "collection": "0.3_head",
                "oid": "b2082803\/benchmark_data_maetl_20149_object474\/head\/\/0",
                "name": "_",
                "length": 271
            },

this too

            {
                "op_num": 5,
                "op_name": "setattr",
                "collection": "0.3_head",
                "oid": "b2082803\/benchmark_data_maetl_20149_object474\/head\/\/0",
                "name": "snapset",
                "length": 31
            }

We have a PR that avoids this second setattr when the value isn't changing,
but at the moment the consensus is that it's complex and probably not worth
merging.  It's also a much cheaper operation, so I suggest we fix the omap
calls first and then compare performance w/ and w/o the snapset xattr
changes to see how significant it is.

        ]
    }

Thanks!
sage
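To make the upper-layer coalescing described above concrete, here is a
minimal, self-contained C++ sketch of the idea. It is illustrative only, not
actual Ceph code: the writer functions only loosely mirror PG::append_log(),
PG::_write_info() and PG::_write_log(), omap_setkeys() is a stand-in for the
transaction op, and the key names are copied from the sample transaction
above. Each writer fills one shared key map for the pgmeta object, and a
single omap_setkeys-style call is issued at the end, so the three separate
omap_setkeys ops collapse into one.

    // Illustrative sketch only; stand-in names, not the real Ceph API.
    #include <iostream>
    #include <map>
    #include <string>

    using KeyMap = std::map<std::string, std::string>;

    // Stand-ins for PG::append_log(), PG::_write_info(), PG::_write_log().
    void append_log(KeyMap &km) { km["0000000005.00000000000000000057"] = "<log entry>"; }
    void write_info(KeyMap &km) { km["_epoch"] = "<epoch>"; km["_info"] = "<pg info>"; }
    void write_log(KeyMap &km) {
      km["can_rollback_to"] = "<version>";
      km["rollback_info_trimmed_to"] = "<version>";
    }

    // Stand-in for one coalesced omap_setkeys op on the pgmeta object.
    void omap_setkeys(const KeyMap &km) {
      std::cout << "omap_setkeys with " << km.size() << " keys" << std::endl;
      for (const auto &kv : km)
        std::cout << "  " << kv.first << std::endl;
    }

    int main() {
      KeyMap pending;           // accumulated across the writers
      append_log(pending);
      write_info(pending);
      write_log(pending);       // duplicate keys simply overwrite earlier values
      omap_setkeys(pending);    // one op instead of three
      return 0;
    }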
On Tue, 3 Mar 2015, Sage Weil wrote:
> I think we should try to avoid the dups in the first place in the upper
> layers before resorting to deduping in the FileStore.

I took a stab at this:

    https://github.com/ceph/ceph/pull/3878

Here's what the txns look like now:

    {
        "ops": [
            {
                "op_num": 0,
                "op_name": "omap_setkeys",
                "collection": "0.2_head",
                "oid": "2\/\/head\/\/0",
                "attr_lens": {
                    "0000000005.00000000000000000002": 178,
                    "_epoch": 4,
                    "_info": 729,
                    "can_rollback_to": 12,
                    "rollback_info_trimmed_to": 12
                }
            },
            {
                "op_num": 1,
                "op_name": "write",
                "collection": "0.2_head",
                "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
                "length": 123,
                "offset": 0,
                "bufferlist length": 123
            },
            {
                "op_num": 2,
                "op_name": "setattr",
                "collection": "0.2_head",
                "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
                "name": "_",
                "length": 269
            },
            {
                "op_num": 3,
                "op_name": "setattr",
                "collection": "0.2_head",
                "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
                "name": "snapset",
                "length": 31
            }
        ]
    }

Care to give that a spin and see if it gives you a similar (or better)
improvement?

sage
For an idea of the performance improvement, you can refer to
https://github.com/ceph/ceph/pull/2972, for which I did a simple benchmark
comparison.

This coalescing should give a smaller improvement, but I think it should
clearly still be better.

On Thu, Mar 5, 2015 at 8:10 AM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 3 Mar 2015, Sage Weil wrote:
>> I think we should try to avoid the dups in the first place in the upper
>> layers before resorting to deduping in the FileStore.
>
> I took a stab at this:
>
>     https://github.com/ceph/ceph/pull/3878
>
> Here's what the txns look like now:
>
> [...]
>
> Care to give that a spin and see if it gives you a similar (or better)
> improvement?
>
> sage
Can we also consider coalescing the two OP_SETATTR operations into a single
OP_SETATTRS operation?

Regards
Ning Yao


2015-03-05 15:04 GMT+08:00 Haomai Wang <haomaiwang@gmail.com>:
> For an idea of the performance improvement, you can refer to
> https://github.com/ceph/ceph/pull/2972, for which I did a simple benchmark
> comparison.
>
> This coalescing should give a smaller improvement, but I think it should
> clearly still be better.
>
> On Thu, Mar 5, 2015 at 8:10 AM, Sage Weil <sage@newdream.net> wrote:
>> On Tue, 3 Mar 2015, Sage Weil wrote:
>>> I think we should try to avoid the dups in the first place in the upper
>>> layers before resorting to deduping in the FileStore.
>>
>> I took a stab at this:
>>
>>     https://github.com/ceph/ceph/pull/3878
>>
>> Here's what the txns look like now:
>>
>> [...]
>>
>> Care to give that a spin and see if it gives you a similar (or better)
>> improvement?
>>
>> sage
>
> --
> Best Regards,
>
> Wheat
On Wed, 11 Mar 2015, Ning Yao wrote:
> Can we also consider coalescing the two OP_SETATTR operations into a single
> OP_SETATTRS operation?

Yeah, definitely!

sage

> Regards
> Ning Yao
>
> [...]
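As an illustration of Ning Yao's suggestion, here is a minimal,
self-contained C++ sketch. It is illustrative only; the Txn type and its
setattrs() signature are stand-ins, not the real ObjectStore::Transaction
API. The idea is to batch the "_" and "snapset" xattrs from the sample
transaction into one map and submit a single OP_SETATTRS-style operation
instead of two OP_SETATTR operations.

    // Illustrative sketch only; stand-in names, not the real Ceph API.
    #include <iostream>
    #include <map>
    #include <string>

    using AttrMap = std::map<std::string, std::string>;

    // Stand-in for a transaction that records one batched setattrs op.
    struct Txn {
      void setattrs(const std::string &oid, const AttrMap &attrs) {
        std::cout << "setattrs on " << oid << " with "
                  << attrs.size() << " xattrs" << std::endl;
      }
    };

    int main() {
      Txn t;
      AttrMap attrs;
      attrs["_"] = "<object_info_t>";   // 269 bytes in the sample transaction
      attrs["snapset"] = "<SnapSet>";   // 31 bytes in the sample transaction
      t.setattrs("benchmark_data_object", attrs);  // one op instead of two
      return 0;
    }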
Hi Sage,

as discussed yesterday during the performance meeting, I have integrated
pull request 3878 into the v0.93 source code and re-tested. I can confirm
that the changes show a significant benefit in the CPU consumption of the
ceph-osd daemon for a given workload: the reduction in CPU consumption for
the FileStore workers is 22%, and 10% for the overall CPU consumption of
the ceph-osd daemon.

Regards

Andreas Bluemle


On Thu, 5 Mar 2015 15:04:41 +0800
Haomai Wang <haomaiwang@gmail.com> wrote:

> For an idea of the performance improvement, you can refer to
> https://github.com/ceph/ceph/pull/2972, for which I did a simple benchmark
> comparison.
>
> This coalescing should give a smaller improvement, but I think it should
> clearly still be better.
>
> [...]
diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
index f6c3bb8..29382b2 100644
--- a/src/os/FileStore.cc
+++ b/src/os/FileStore.cc
@@ -2260,10 +2260,24 @@ int FileStore::_check_replay_guard(int fd, const SequencerPosition& spos)
   }
 }
 
+void FileStore::_coalesce(map<string, bufferlist> &target, map<string, bufferlist> &source)
+{
+  for (map<string, bufferlist>::iterator p = source.begin();
+       p != source.end();
+       p++) {
+    target[p->first] = p->second;
+  }
+  return;
+}
+
 unsigned FileStore::_do_transaction(
   Transaction& t, uint64_t op_seq, int trans_num,
   ThreadPool::TPHandle *handle)
 {
+  map<string, bufferlist> collected_aset;
+  coll_t collected_cid;
+  ghobject_t collected_oid;
+
   dout(10) << "_do_transaction on " << &t << dendl;
 
 #ifdef WITH_LTTNG
@@ -2282,6 +2296,22 @@ unsigned FileStore::_do_transaction(
 
     _inject_failure();
 
+    if (op->op == Transaction::OP_OMAP_SETKEYS) {
+      collected_cid = i.get_cid(op->cid);
+      collected_oid = i.get_oid(op->oid);
+      map<string, bufferlist> aset;
+      i.decode_attrset(aset);
+      _coalesce(collected_aset, aset);
+      continue;
+    } else {
+      if (collected_aset.empty() == false) {
+        tracepoint(objectstore, omap_setkeys_enter, osr_name);
+        r = _omap_setkeys(collected_cid, collected_oid, collected_aset, spos);
+        tracepoint(objectstore, omap_setkeys_exit, r);
+        collected_aset.clear();
+      }
+    }
+
     switch (op->op) {
     case Transaction::OP_NOP:
       break;
diff --git a/src/os/FileStore.h b/src/os/FileStore.h
index af1fb8d..a039731 100644
--- a/src/os/FileStore.h
+++ b/src/os/FileStore.h
@@ -449,6 +449,8 @@ public:
 
   int statfs(struct statfs *buf);
 
+  void _coalesce( map<string, bufferlist> &target, map<string, bufferlist> &source);
+
   int _do_transactions(
     list<Transaction*> &tls, uint64_t op_seq,
     ThreadPool::TPHandle *handle);
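For reference, here is a minimal, self-contained C++ sketch of the merge
semantics that the _coalesce() helper in the patch implements. It is
illustrative only; std::string values stand in for bufferlists. Later
key/value pairs overwrite earlier ones, so repeated OP_OMAP_SETKEYS on the
same object collapse into one combined key set.

    // Illustrative sketch only; mirrors the merge rule of FileStore::_coalesce().
    #include <cassert>
    #include <map>
    #include <string>

    using KeyMap = std::map<std::string, std::string>;

    void coalesce(KeyMap &target, const KeyMap &source) {
      for (const auto &kv : source)
        target[kv.first] = kv.second;   // the newer value wins on duplicate keys
    }

    int main() {
      KeyMap collected;
      coalesce(collected, {{"0000000005.00000000000000000057", "old"},
                           {"_epoch", "4"}});
      coalesce(collected, {{"0000000005.00000000000000000057", "new"},
                           {"can_rollback_to", "12"}});
      assert(collected.size() == 3);                                  // deduplicated
      assert(collected["0000000005.00000000000000000057"] == "new");  // last write wins
      return 0;
    }

Note that the hunk shown above only flushes the collected keys when the next
non-omap operation is reached; presumably the complete patch also flushes
anything still collected after the final op of the transaction, which is not
visible in this excerpt.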