diff mbox

FileStore performance: coalescing operations

Message ID 20150226152807.5b71a93c@doppio (mailing list archive)
State New, archived
Headers show

Commit Message

Andreas Bluemle Feb. 26, 2015, 2:28 p.m. UTC
Hi,

during the performance weely meeting, I had mentioned
my experiences concerning the transaction structure
for write requests at the level of the FileStore.
Such a transaction not only contains the OP_WRITE
operation to the object in the file system, but also
a series of OP_OMAP_SETKEYS and OP_SETATTR operations.

Find attached a README and source code patch, which
describe a prototype for coalescing the OP_OMAP_SETKEYS
operations and the performance impact f this change.

Regards

Andreas Bluemle

Comments

Haomai Wang Feb. 26, 2015, 3:02 p.m. UTC | #1
Hmm, we already obverse this duplicate omap keys set from pglog operations.

And I think we need to resolve it in upper layer, of course,
coalescing omap operations in FileStore is also useful.

@Somnath Do you do this dedup work in KeyValueStore already?

On Thu, Feb 26, 2015 at 10:28 PM, Andreas Bluemle
<andreas.bluemle@itxperts.de> wrote:
> Hi,
>
> during the performance weely meeting, I had mentioned
> my experiences concerning the transaction structure
> for write requests at the level of the FileStore.
> Such a transaction not only contains the OP_WRITE
> operation to the object in the file system, but also
> a series of OP_OMAP_SETKEYS and OP_SETATTR operations.
>
> Find attached a README and source code patch, which
> describe a prototype for coalescing the OP_OMAP_SETKEYS
> operations and the performance impact f this change.
>
> Regards
>
> Andreas Bluemle
>
> --
> Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
> ITXperts GmbH                       http://www.itxperts.de
> Balanstrasse 73, Geb. 08            Phone: (+49) 89 89044917
> D-81541 Muenchen (Germany)          Fax:   (+49) 89 89044910
>
> Company details: http://www.itxperts.de/imprint.htm
Mark Nelson Feb. 26, 2015, 3:06 p.m. UTC | #2
On 02/26/2015 09:02 AM, Haomai Wang wrote:
> Hmm, we already obverse this duplicate omap keys set from pglog operations.
>
> And I think we need to resolve it in upper layer, of course,

Can we resolve this higher up easily?  I am struck that it might be 
easier to simply do the coalescing here.  I think this was the approach 
Sam was advocating in the performance meeting?

> coalescing omap operations in FileStore is also useful.
>
> @Somnath Do you do this dedup work in KeyValueStore already?
>
> On Thu, Feb 26, 2015 at 10:28 PM, Andreas Bluemle
> <andreas.bluemle@itxperts.de> wrote:
>> Hi,
>>
>> during the performance weely meeting, I had mentioned
>> my experiences concerning the transaction structure
>> for write requests at the level of the FileStore.
>> Such a transaction not only contains the OP_WRITE
>> operation to the object in the file system, but also
>> a series of OP_OMAP_SETKEYS and OP_SETATTR operations.
>>
>> Find attached a README and source code patch, which
>> describe a prototype for coalescing the OP_OMAP_SETKEYS
>> operations and the performance impact f this change.
>>
>> Regards
>>
>> Andreas Bluemle
>>
>> --
>> Andreas Bluemle                     mailto:Andreas.Bluemle@itxperts.de
>> ITXperts GmbH                       http://www.itxperts.de
>> Balanstrasse 73, Geb. 08            Phone: (+49) 89 89044917
>> D-81541 Muenchen (Germany)          Fax:   (+49) 89 89044910
>>
>> Company details: http://www.itxperts.de/imprint.htm
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil March 4, 2015, 1:05 a.m. UTC | #3
On Thu, 26 Feb 2015, Andreas Bluemle wrote:
> Hi,
> 
> during the performance weely meeting, I had mentioned
> my experiences concerning the transaction structure
> for write requests at the level of the FileStore.
> Such a transaction not only contains the OP_WRITE
> operation to the object in the file system, but also
> a series of OP_OMAP_SETKEYS and OP_SETATTR operations.
> 
> Find attached a README and source code patch, which
> describe a prototype for coalescing the OP_OMAP_SETKEYS
> operations and the performance impact f this change.

I think we should try to avoid the dups in the first place in the upper 
layers before resorting to deduping in the FileStore. 

Here's a sample transaction:

{
    "ops": [
        {
            "op_num": 0,
            "op_name": "omap_setkeys",
            "collection": "0.3_head",
            "oid": "3\/\/head\/\/0",
            "attr_lens": {
                "0000000005.00000000000000000057": 180
            }
        },

^ PG::append_log

        {
            "op_num": 1,
            "op_name": "omap_setkeys",
            "collection": "0.3_head",
            "oid": "3\/\/head\/\/0",
            "attr_lens": {
                "_epoch": 4,
                "_info": 729
            }
        },

^^ these two come from PG::_write_info()

        {
            "op_num": 2,
            "op_name": "omap_setkeys",
            "collection": "0.3_head",
            "oid": "3\/\/head\/\/0",
            "attr_lens": {
                "0000000005.00000000000000000057": 180,
                "can_rollback_to": 12,
                "rollback_info_trimmed_to": 12
            }
        },

^ PG::_write_log()

I think the log ones are easy to combine.. in fact I think we saw a pull 
request from somone reently that does this?

The _write_info() one is probably a slightly bigger challenge, but only 
slightly, given that both of these are called from

void PG::write_if_dirty(ObjectStore::Transaction& t)
{
  if (dirty_big_info || dirty_info)
    write_info(t);
  pg_log.write_log(t, coll, pgmeta_oid);
}


        {
            "op_num": 3,
            "op_name": "write",
            "collection": "0.3_head",
            "oid": "b2082803\/benchmark_data_maetl_20149_object474\/head\/\/0",
            "length": 123,
            "offset": 0,
            "bufferlist length": 123
        },

necessary

        {
            "op_num": 4,
            "op_name": "setattr",
            "collection": "0.3_head",
            "oid": "b2082803\/benchmark_data_maetl_20149_object474\/head\/\/0",
            "name": "_",
            "length": 271
        },

this too

        {
            "op_num": 5,
            "op_name": "setattr",
            "collection": "0.3_head",
            "oid": "b2082803\/benchmark_data_maetl_20149_object474\/head\/\/0",
            "name": "snapset",
            "length": 31
        }

We have a PR that avoids this second setattr when the value isn't 
changing, but at the moment the consensus is that it's complex and 
probably not worth merging.  It's also a much cheaper operation, so I 
suggest we fix the omap calls first and then compare performance w/ and 
w/o the snapset xattr changes to see how significant it is.

    ]
}

Thanks!
sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil March 5, 2015, 12:10 a.m. UTC | #4
On Tue, 3 Mar 2015, Sage Weil wrote:
> I think we should try to avoid the dups in the first place in the upper 
> layers before resorting to deduping in the FileStore. 

I took a stab at this:

	https://github.com/ceph/ceph/pull/3878

Here's what the txns look like now:

{
    "ops": [
        {
            "op_num": 0,
            "op_name": "omap_setkeys",
            "collection": "0.2_head",
            "oid": "2\/\/head\/\/0",
            "attr_lens": {
                "0000000005.00000000000000000002": 178,
                "_epoch": 4,
                "_info": 729,
                "can_rollback_to": 12,
                "rollback_info_trimmed_to": 12
            }
        },
        {
            "op_num": 1,
            "op_name": "write",
            "collection": "0.2_head",
            "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
            "length": 123,
            "offset": 0,
            "bufferlist length": 123
        },
        {
            "op_num": 2,
            "op_name": "setattr",
            "collection": "0.2_head",
            "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
            "name": "_",
            "length": 269
        },
        {
            "op_num": 3,
            "op_name": "setattr",
            "collection": "0.2_head",
            "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
            "name": "snapset",
            "length": 31
        }
    ]
}

Care to give that a spin and see if it give syou a similar (or better) 
improvement?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Haomai Wang March 5, 2015, 7:04 a.m. UTC | #5
I think the performance improvement can be refer to
https://github.com/ceph/ceph/pull/2972 which I did a simple benchmark
comparison.

This coalescing should get a lighter improvement but I think obviously
it should be better.

On Thu, Mar 5, 2015 at 8:10 AM, Sage Weil <sage@newdream.net> wrote:
> On Tue, 3 Mar 2015, Sage Weil wrote:
>> I think we should try to avoid the dups in the first place in the upper
>> layers before resorting to deduping in the FileStore.
>
> I took a stab at this:
>
>         https://github.com/ceph/ceph/pull/3878
>
> Here's what the txns look like now:
>
> {
>     "ops": [
>         {
>             "op_num": 0,
>             "op_name": "omap_setkeys",
>             "collection": "0.2_head",
>             "oid": "2\/\/head\/\/0",
>             "attr_lens": {
>                 "0000000005.00000000000000000002": 178,
>                 "_epoch": 4,
>                 "_info": 729,
>                 "can_rollback_to": 12,
>                 "rollback_info_trimmed_to": 12
>             }
>         },
>         {
>             "op_num": 1,
>             "op_name": "write",
>             "collection": "0.2_head",
>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
>             "length": 123,
>             "offset": 0,
>             "bufferlist length": 123
>         },
>         {
>             "op_num": 2,
>             "op_name": "setattr",
>             "collection": "0.2_head",
>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
>             "name": "_",
>             "length": 269
>         },
>         {
>             "op_num": 3,
>             "op_name": "setattr",
>             "collection": "0.2_head",
>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
>             "name": "snapset",
>             "length": 31
>         }
>     ]
> }
>
> Care to give that a spin and see if it give syou a similar (or better)
> improvement?
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ning Yao March 11, 2015, 3:44 a.m. UTC | #6
Can we also consider to coalesce two OP_SETATTR transaction to a
single OP_SETATTRS transaction?
Regards
Ning Yao


2015-03-05 15:04 GMT+08:00 Haomai Wang <haomaiwang@gmail.com>:
> I think the performance improvement can be refer to
> https://github.com/ceph/ceph/pull/2972 which I did a simple benchmark
> comparison.
>
> This coalescing should get a lighter improvement but I think obviously
> it should be better.
>
> On Thu, Mar 5, 2015 at 8:10 AM, Sage Weil <sage@newdream.net> wrote:
>> On Tue, 3 Mar 2015, Sage Weil wrote:
>>> I think we should try to avoid the dups in the first place in the upper
>>> layers before resorting to deduping in the FileStore.
>>
>> I took a stab at this:
>>
>>         https://github.com/ceph/ceph/pull/3878
>>
>> Here's what the txns look like now:
>>
>> {
>>     "ops": [
>>         {
>>             "op_num": 0,
>>             "op_name": "omap_setkeys",
>>             "collection": "0.2_head",
>>             "oid": "2\/\/head\/\/0",
>>             "attr_lens": {
>>                 "0000000005.00000000000000000002": 178,
>>                 "_epoch": 4,
>>                 "_info": 729,
>>                 "can_rollback_to": 12,
>>                 "rollback_info_trimmed_to": 12
>>             }
>>         },
>>         {
>>             "op_num": 1,
>>             "op_name": "write",
>>             "collection": "0.2_head",
>>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
>>             "length": 123,
>>             "offset": 0,
>>             "bufferlist length": 123
>>         },
>>         {
>>             "op_num": 2,
>>             "op_name": "setattr",
>>             "collection": "0.2_head",
>>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
>>             "name": "_",
>>             "length": 269
>>         },
>>         {
>>             "op_num": 3,
>>             "op_name": "setattr",
>>             "collection": "0.2_head",
>>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
>>             "name": "snapset",
>>             "length": 31
>>         }
>>     ]
>> }
>>
>> Care to give that a spin and see if it give syou a similar (or better)
>> improvement?
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sage Weil March 11, 2015, 12:34 p.m. UTC | #7
On Wed, 11 Mar 2015, Ning Yao wrote:
> Can we also consider to coalesce two OP_SETATTR transaction to a
> single OP_SETATTRS transaction?

Yeah, definitely!

sage


> Regards
> Ning Yao
> 
> 
> 2015-03-05 15:04 GMT+08:00 Haomai Wang <haomaiwang@gmail.com>:
> > I think the performance improvement can be refer to
> > https://github.com/ceph/ceph/pull/2972 which I did a simple benchmark
> > comparison.
> >
> > This coalescing should get a lighter improvement but I think obviously
> > it should be better.
> >
> > On Thu, Mar 5, 2015 at 8:10 AM, Sage Weil <sage@newdream.net> wrote:
> >> On Tue, 3 Mar 2015, Sage Weil wrote:
> >>> I think we should try to avoid the dups in the first place in the upper
> >>> layers before resorting to deduping in the FileStore.
> >>
> >> I took a stab at this:
> >>
> >>         https://github.com/ceph/ceph/pull/3878
> >>
> >> Here's what the txns look like now:
> >>
> >> {
> >>     "ops": [
> >>         {
> >>             "op_num": 0,
> >>             "op_name": "omap_setkeys",
> >>             "collection": "0.2_head",
> >>             "oid": "2\/\/head\/\/0",
> >>             "attr_lens": {
> >>                 "0000000005.00000000000000000002": 178,
> >>                 "_epoch": 4,
> >>                 "_info": 729,
> >>                 "can_rollback_to": 12,
> >>                 "rollback_info_trimmed_to": 12
> >>             }
> >>         },
> >>         {
> >>             "op_num": 1,
> >>             "op_name": "write",
> >>             "collection": "0.2_head",
> >>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
> >>             "length": 123,
> >>             "offset": 0,
> >>             "bufferlist length": 123
> >>         },
> >>         {
> >>             "op_num": 2,
> >>             "op_name": "setattr",
> >>             "collection": "0.2_head",
> >>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
> >>             "name": "_",
> >>             "length": 269
> >>         },
> >>         {
> >>             "op_num": 3,
> >>             "op_name": "setattr",
> >>             "collection": "0.2_head",
> >>             "oid": "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
> >>             "name": "snapset",
> >>             "length": 31
> >>         }
> >>     ]
> >> }
> >>
> >> Care to give that a spin and see if it give syou a similar (or better)
> >> improvement?
> >>
> >> sage
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Bluemle March 19, 2015, 2:59 p.m. UTC | #8
Hi Sage,

as discussed yesterday during the performance meeting,
I have integratged pull reuest 3878 into v0.93 source code
and re-tested: I can confirm that the changes show a significant
benefit in the CPU consumption of the ceph-osd daemon for a
given workload: the reduction in CPU cosumption for the FileStore
workers is 22% and 10% for the overall CPU consumption of the
ceph-osd daemon.


Regards

Andreas Bluemle

On Thu, 5 Mar 2015 15:04:41 +0800
Haomai Wang <haomaiwang@gmail.com> wrote:

> I think the performance improvement can be refer to
> https://github.com/ceph/ceph/pull/2972 which I did a simple benchmark
> comparison.
> 
> This coalescing should get a lighter improvement but I think obviously
> it should be better.
> 
> On Thu, Mar 5, 2015 at 8:10 AM, Sage Weil <sage@newdream.net> wrote:
> > On Tue, 3 Mar 2015, Sage Weil wrote:
> >> I think we should try to avoid the dups in the first place in the
> >> upper layers before resorting to deduping in the FileStore.
> >
> > I took a stab at this:
> >
> >         https://github.com/ceph/ceph/pull/3878
> >
> > Here's what the txns look like now:
> >
> > {
> >     "ops": [
> >         {
> >             "op_num": 0,
> >             "op_name": "omap_setkeys",
> >             "collection": "0.2_head",
> >             "oid": "2\/\/head\/\/0",
> >             "attr_lens": {
> >                 "0000000005.00000000000000000002": 178,
> >                 "_epoch": 4,
> >                 "_info": 729,
> >                 "can_rollback_to": 12,
> >                 "rollback_info_trimmed_to": 12
> >             }
> >         },
> >         {
> >             "op_num": 1,
> >             "op_name": "write",
> >             "collection": "0.2_head",
> >             "oid":
> > "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0",
> > "length": 123, "offset": 0,
> >             "bufferlist length": 123
> >         },
> >         {
> >             "op_num": 2,
> >             "op_name": "setattr",
> >             "collection": "0.2_head",
> >             "oid":
> > "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0", "name":
> > "_", "length": 269
> >         },
> >         {
> >             "op_num": 3,
> >             "op_name": "setattr",
> >             "collection": "0.2_head",
> >             "oid":
> > "8b08fa52\/benchmark_data_maetl_15609_object5\/head\/\/0", "name":
> > "snapset", "length": 31
> >         }
> >     ]
> > }
> >
> > Care to give that a spin and see if it give syou a similar (or
> > better) improvement?
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
>
diff mbox

Patch

diff --git a/src/os/FileStore.cc b/src/os/FileStore.cc
index f6c3bb8..29382b2 100644
--- a/src/os/FileStore.cc
+++ b/src/os/FileStore.cc
@@ -2260,10 +2260,24 @@  int FileStore::_check_replay_guard(int fd, const SequencerPosition& spos)
   }
 }
 
+void FileStore::_coalesce(map<string, bufferlist> &target, map<string, bufferlist> &source)
+{
+  for (map<string, bufferlist>::iterator p = source.begin();
+       p != source.end();
+       p++) {
+    target[p->first] = p->second;
+  }
+  return;
+}
+
 unsigned FileStore::_do_transaction(
   Transaction& t, uint64_t op_seq, int trans_num,
   ThreadPool::TPHandle *handle)
 {
+  map<string, bufferlist> collected_aset;
+  coll_t collected_cid;
+  ghobject_t collected_oid;
+
   dout(10) << "_do_transaction on " << &t << dendl;
 
 #ifdef WITH_LTTNG
@@ -2282,6 +2296,22 @@  unsigned FileStore::_do_transaction(
 
     _inject_failure();
 
+    if (op->op == Transaction::OP_OMAP_SETKEYS) {
+	collected_cid = i.get_cid(op->cid);
+	collected_oid = i.get_oid(op->oid);
+	map<string, bufferlist> aset;
+	i.decode_attrset(aset);
+	_coalesce(collected_aset, aset);
+	continue;
+    } else {
+	if (collected_aset.empty() == false) {
+	  tracepoint(objectstore, omap_setkeys_enter, osr_name);
+	  r = _omap_setkeys(collected_cid, collected_oid, collected_aset, spos);
+	  tracepoint(objectstore, omap_setkeys_exit, r);
+	  collected_aset.clear();
+	}
+    }
+
     switch (op->op) {
     case Transaction::OP_NOP:
       break;
diff --git a/src/os/FileStore.h b/src/os/FileStore.h
index af1fb8d..a039731 100644
--- a/src/os/FileStore.h
+++ b/src/os/FileStore.h
@@ -449,6 +449,8 @@  public:
 
   int statfs(struct statfs *buf);
 
+  void _coalesce( map<string, bufferlist> &target, map<string, bufferlist> &source);
+
   int _do_transactions(
     list<Transaction*> &tls, uint64_t op_seq,
     ThreadPool::TPHandle *handle);