Message ID | 20200421131850.443228-1-rpenyaev@suse.de (mailing list archive) |
---|---|
Headers | show |
Series | libceph: messenger: send/recv data at one go | expand |
On Tue, Apr 21, 2020 at 3:18 PM Roman Penyaev <rpenyaev@suse.de> wrote: > > Hi folks, > > While experimenting with messenger code in userspace [1] I noticed > that send and receive socket calls always operate with 4k, even bvec > length is larger (for example when bvec is contructed from bio, where > multi-page is used for big IOs). This is an attempt to speed up send > and receive for large IO. > > First 3 patches are cleanups. I remove unused code and get rid of > ceph_osd_data structure. I found that ceph_osd_data duplicates > ceph_msg_data and it seems unified API looks better for similar > things. > > In the following patches ceph_msg_data_cursor is switched to iov_iter, > which seems is more suitable for such kind of things (when we > basically do socket IO). This gives us the possibility to use the > whole iov_iter for sendmsg() and recvmsg() calls instead of iterating > page by page. sendpage() call also benefits from this, because now if > bvec is constructed from multi-page, then we can 0-copy the whole > bvec in one go. Hi Roman, I'm in the process of rewriting the kernel messenger to support msgr2 (i.e. encryption) and noticed the same things. The switch to iov_iter was the first thing I implemented ;) Among other things is support for multipage bvecs and explicit socket corking. I haven't benchmarked any of it though -- it just seemed like a sensible thing to do, especially since the sendmsg/sendpage infrastructure needed changes for encryption anyway. Support for kvecs isn't implemented yet, but will be in order to get rid of all those "allocate a page just to process 16 bytes" sites. Unfortunately I got distracted by some higher priority issues with the userspace messenger, so the kernel messenger is in a bit of a state of disarray at the moment. Here is the excerpt from the send path: #define CEPH_MSG_FLAGS (MSG_DONTWAIT | MSG_NOSIGNAL) static int do_sendmsg(struct ceph_connection *con, struct iov_iter *it) { struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS }; int ret; msg.msg_iter = *it; while (iov_iter_count(it)) { ret = do_one_sendmsg(con, &msg); if (ret <= 0) { if (ret == -EAGAIN) ret = 0; return ret; } iov_iter_advance(it, ret); } BUG_ON(msg_data_left(&msg)); return 1; } static int do_sendpage(struct ceph_connection *con, struct iov_iter *it) { ssize_t ret; BUG_ON(!iov_iter_is_bvec(it)); while (iov_iter_count(it)) { struct page *page = it->bvec->bv_page; int offset = it->bvec->bv_offset + it->iov_offset; size_t size = min(it->count, it->bvec->bv_len - it->iov_offset); /* * sendpage cannot properly handle pages with * page_count == 0, we need to fall back to sendmsg if * that's the case. * * Same goes for slab pages: skb_can_coalesce() allows * coalescing neighboring slab objects into a single frag * which triggers one of hardened usercopy checks. */ if (page_count(page) >= 1 && !PageSlab(page)) { ret = do_one_sendpage(con, page, offset, size, CEPH_MSG_FLAGS); } else { struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS }; struct bio_vec bv = { .bv_page = page, .bv_offset = offset, .bv_len = size, }; iov_iter_bvec(&msg.msg_iter, WRITE, &bv, 1, size); ret = do_one_sendmsg(con, &msg); } if (ret <= 0) { if (ret == -EAGAIN) ret = 0; return ret; } iov_iter_advance(it, ret); } return 1; } /* * Write as much as possible. The socket is expected to be corked, * so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here. * * Return: * 1 - done, nothing else to write * 0 - socket is full, need to wait * <0 - error */ int ceph_tcp_send(struct ceph_connection *con) { bool is_kvec = iov_iter_is_kvec(&con->out_iter); int ret; dout("%s con %p have %zu is_kvec %d\n", __func__, con, iov_iter_count(&con->out_iter), is_kvec); if (is_kvec) ret = do_sendmsg(con, &con->out_iter); else ret = do_sendpage(con, &con->out_iter); dout("%s con %p ret %d left %zu\n", __func__, con, ret, iov_iter_count(&con->out_iter)); return ret; } I'll make sure to CC you on my patches, should be in a few weeks. Getting rid of ceph_osd_data is probably a good idea. FWIW I never liked it, but not strong enough to bother with removing it. > > I also allowed myself to get rid of ->last_piece and ->need_crc > members and ceph_msg_data_next() call. Now CRC is calculated not on > page basis, but according to the size of processed chunk. I found > ceph_msg_data_next() is a bit redundant, since we always can set the > next cursor chunk on cursor init or on advance. > > How I tested the performance? I used rbd.fio load on 1 OSD in memory > with the following fio configuration: > > direct=1 > time_based=1 > runtime=10 > ioengine=io_uring > size=256m > > rw=rand{read|write} > numjobs=32 > iodepth=32 > > [job1] > filename=/dev/rbd0 > > RBD device is mapped with 'nocrc' option set. For writes OSD completes > requests immediately, without touching the memory simulating null block > device, that's why write throughput in my results is much higher than > for reads. > > I tested on loopback interface only, in Vm, have not yet setup the > cluster on real machines, so sendpage() on a big multi-page shows > indeed good results, as expected. But I found an interesting comment > in drivers/infiniband/sw/siw/siw_qp_tc.c:siw_tcp_sendpages(), which > says: > > "Using sendpage to push page by page appears to be less efficient > than using sendmsg, even if data are copied. > > A general performance limitation might be the extra four bytes > trailer checksum segment to be pushed after user data." > > I could not prove or disprove since have tested on loopback interface > only. So it might be that sendmsg() in on go is faster than > sendpage() for bvecs with many segments. Please share any further findings. We have been using sendpage for the data section of the message since forever and I remember hearing about a performance regression when someone inadvertently disabled the sendpage path (can't recall the subsystem -- iSCSI?). If you discover that sendpage is actually slower, that would be very interesting. > > Here is the output of the rbd fio load for various block sizes: > > ==== WRITE === > > current master, rw=randwrite, numjobs=32 iodepth=32 > > 4k IOPS=92.7k, BW=362MiB/s, Lat=11033.30usec > 8k IOPS=85.6k, BW=669MiB/s, Lat=11956.74usec > 16k IOPS=76.8k, BW=1200MiB/s, Lat=13318.24usec > 32k IOPS=56.7k, BW=1770MiB/s, Lat=18056.92usec > 64k IOPS=34.0k, BW=2186MiB/s, Lat=29.23msec > 128k IOPS=21.8k, BW=2720MiB/s, Lat=46.96msec > 256k IOPS=14.4k, BW=3596MiB/s, Lat=71.03msec > 512k IOPS=8726, BW=4363MiB/s, Lat=116.34msec > 1m IOPS=4799, BW=4799MiB/s, Lat=211.15msec > > this patchset, rw=randwrite, numjobs=32 iodepth=32 > > 4k IOPS=94.7k, BW=370MiB/s, Lat=10802.43usec > 8k IOPS=91.2k, BW=712MiB/s, Lat=11221.00usec > 16k IOPS=80.4k, BW=1257MiB/s, Lat=12715.56usec > 32k IOPS=61.2k, BW=1912MiB/s, Lat=16721.33usec > 64k IOPS=40.9k, BW=2554MiB/s, Lat=24993.31usec > 128k IOPS=25.7k, BW=3216MiB/s, Lat=39.72msec > 256k IOPS=17.3k, BW=4318MiB/s, Lat=59.15msec > 512k IOPS=11.1k, BW=5559MiB/s, Lat=91.39msec > 1m IOPS=6696, BW=6696MiB/s, Lat=151.25msec > > > === READ === > > current master, rw=randread, numjobs=32 iodepth=32 > > 4k IOPS=62.5k, BW=244MiB/s, Lat=16.38msec > 8k IOPS=55.5k, BW=433MiB/s, Lat=18.44msec > 16k IOPS=40.6k, BW=635MiB/s, Lat=25.18msec > 32k IOPS=24.6k, BW=768MiB/s, Lat=41.61msec > 64k IOPS=14.8k, BW=925MiB/s, Lat=69.06msec > 128k IOPS=8687, BW=1086MiB/s, Lat=117.59msec > 256k IOPS=4733, BW=1183MiB/s, Lat=214.76msec > 512k IOPS=3156, BW=1578MiB/s, Lat=320.54msec > 1m IOPS=1901, BW=1901MiB/s, Lat=528.22msec > > this patchset, rw=randread, numjobs=32 iodepth=32 > > 4k IOPS=62.6k, BW=244MiB/s, Lat=16342.89usec > 8k IOPS=55.5k, BW=434MiB/s, Lat=18.42msec > 16k IOPS=43.2k, BW=675MiB/s, Lat=23.68msec > 32k IOPS=28.4k, BW=887MiB/s, Lat=36.04msec > 64k IOPS=20.2k, BW=1263MiB/s, Lat=50.54msec > 128k IOPS=11.7k, BW=1465MiB/s, Lat=87.01msec > 256k IOPS=6813, BW=1703MiB/s, Lat=149.30msec > 512k IOPS=5363, BW=2682MiB/s, Lat=189.37msec > 1m IOPS=2220, BW=2221MiB/s, Lat=453.92msec > > > Results for small blocks are not interesting, since there should not > be any difference. But starting from 32k block benefits of doing IO > for the whole message at once starts to prevail. It's not really the whole message, just the header, front and middle sections, right? The data section is still per-bvec, it's just that bvec is no longer limited to a single page but may encompass several physically contiguous pages. These are not that easy to come by on a heavily loaded system, but they do result in nice numbers. Thanks, Ilya
Hi Ilya, On 2020-04-21 17:51, Ilya Dryomov wrote: > On Tue, Apr 21, 2020 at 3:18 PM Roman Penyaev <rpenyaev@suse.de> wrote: >> >> Hi folks, >> >> While experimenting with messenger code in userspace [1] I noticed >> that send and receive socket calls always operate with 4k, even bvec >> length is larger (for example when bvec is contructed from bio, where >> multi-page is used for big IOs). This is an attempt to speed up send >> and receive for large IO. >> >> First 3 patches are cleanups. I remove unused code and get rid of >> ceph_osd_data structure. I found that ceph_osd_data duplicates >> ceph_msg_data and it seems unified API looks better for similar >> things. >> >> In the following patches ceph_msg_data_cursor is switched to iov_iter, >> which seems is more suitable for such kind of things (when we >> basically do socket IO). This gives us the possibility to use the >> whole iov_iter for sendmsg() and recvmsg() calls instead of iterating >> page by page. sendpage() call also benefits from this, because now if >> bvec is constructed from multi-page, then we can 0-copy the whole >> bvec in one go. > > Hi Roman, > > I'm in the process of rewriting the kernel messenger to support msgr2 > (i.e. encryption) and noticed the same things. The switch to iov_iter > was the first thing I implemented ;) Among other things is support for > multipage bvecs and explicit socket corking. Ah, ok, good to know. This patchset came from the userspace variant of the kernel messenger. These changes also show nice numbers on userspace side. (of course without sendpage() variant and plus some caching on receive, which I also implemented for kernel side, but left aside since this did not show any interesting results for rbd load) > I haven't benchmarked any > of it though -- it just seemed like a sensible thing to do, especially > since the sendmsg/sendpage infrastructure needed changes for encryption > anyway. I can benchmark on my localhost setup easily. Just add me to CC when you are done. > > Support for kvecs isn't implemented yet, but will be in order to get > rid of all those "allocate a page just to process 16 bytes" sites. > > Unfortunately I got distracted by some higher priority issues with the > userspace messenger, so the kernel messenger is in a bit of a state of > disarray at the moment. Here is the excerpt from the send path: > > #define CEPH_MSG_FLAGS (MSG_DONTWAIT | MSG_NOSIGNAL) > > static int do_sendmsg(struct ceph_connection *con, struct iov_iter *it) > { > struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS }; > int ret; > > msg.msg_iter = *it; > while (iov_iter_count(it)) { > ret = do_one_sendmsg(con, &msg); > if (ret <= 0) { > if (ret == -EAGAIN) > ret = 0; > return ret; > } > > iov_iter_advance(it, ret); > } > > BUG_ON(msg_data_left(&msg)); > return 1; > } > > static int do_sendpage(struct ceph_connection *con, struct iov_iter > *it) > { > ssize_t ret; > > BUG_ON(!iov_iter_is_bvec(it)); > while (iov_iter_count(it)) { > struct page *page = it->bvec->bv_page; > int offset = it->bvec->bv_offset + it->iov_offset; > size_t size = min(it->count, > it->bvec->bv_len - it->iov_offset); > > /* > * sendpage cannot properly handle pages with > * page_count == 0, we need to fall back to sendmsg if > * that's the case. > * > * Same goes for slab pages: skb_can_coalesce() allows > * coalescing neighboring slab objects into a single > frag > * which triggers one of hardened usercopy checks. > */ > if (page_count(page) >= 1 && !PageSlab(page)) { > ret = do_one_sendpage(con, page, offset, size, > CEPH_MSG_FLAGS); > } else { > struct msghdr msg = { .msg_flags = > CEPH_MSG_FLAGS }; > struct bio_vec bv = { > .bv_page = page, > .bv_offset = offset, > .bv_len = size, > }; > > iov_iter_bvec(&msg.msg_iter, WRITE, &bv, 1, > size); > ret = do_one_sendmsg(con, &msg); > } > if (ret <= 0) { > if (ret == -EAGAIN) > ret = 0; > return ret; > } > > iov_iter_advance(it, ret); > } > > return 1; > } > > /* > * Write as much as possible. The socket is expected to be corked, > * so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here. > * > * Return: > * 1 - done, nothing else to write > * 0 - socket is full, need to wait > * <0 - error > */ > int ceph_tcp_send(struct ceph_connection *con) > { > bool is_kvec = iov_iter_is_kvec(&con->out_iter); > int ret; > > dout("%s con %p have %zu is_kvec %d\n", __func__, con, > iov_iter_count(&con->out_iter), is_kvec); > if (is_kvec) > ret = do_sendmsg(con, &con->out_iter); > else > ret = do_sendpage(con, &con->out_iter); > > dout("%s con %p ret %d left %zu\n", __func__, con, ret, > iov_iter_count(&con->out_iter)); > return ret; > } Ha! Nice! That almost exactly what I do in current patchset, except corking. I still bother with MSG_MORE/MSG_SENDPAGE_NOTLAST :) BTW kvecs also can be sendpaged. Since we do not have userspace iovec, thus page can be easily taken. But that is a minor. > I'll make sure to CC you on my patches, should be in a few weeks. Yes, please. I can help with benchmarking and reviewing. > > Getting rid of ceph_osd_data is probably a good idea. FWIW I never > liked it, but not strong enough to bother with removing it. > >> >> I also allowed myself to get rid of ->last_piece and ->need_crc >> members and ceph_msg_data_next() call. Now CRC is calculated not on >> page basis, but according to the size of processed chunk. I found >> ceph_msg_data_next() is a bit redundant, since we always can set the >> next cursor chunk on cursor init or on advance. >> >> How I tested the performance? I used rbd.fio load on 1 OSD in memory >> with the following fio configuration: >> >> direct=1 >> time_based=1 >> runtime=10 >> ioengine=io_uring >> size=256m >> >> rw=rand{read|write} >> numjobs=32 >> iodepth=32 >> >> [job1] >> filename=/dev/rbd0 >> >> RBD device is mapped with 'nocrc' option set. For writes OSD >> completes >> requests immediately, without touching the memory simulating null >> block >> device, that's why write throughput in my results is much higher than >> for reads. >> >> I tested on loopback interface only, in Vm, have not yet setup the >> cluster on real machines, so sendpage() on a big multi-page shows >> indeed good results, as expected. But I found an interesting comment >> in drivers/infiniband/sw/siw/siw_qp_tc.c:siw_tcp_sendpages(), which >> says: >> >> "Using sendpage to push page by page appears to be less efficient >> than using sendmsg, even if data are copied. >> >> A general performance limitation might be the extra four bytes >> trailer checksum segment to be pushed after user data." >> >> I could not prove or disprove since have tested on loopback interface >> only. So it might be that sendmsg() in on go is faster than >> sendpage() for bvecs with many segments. > > Please share any further findings. We have been using sendpage for > the data section of the message since forever and I remember hearing > about a performance regression when someone inadvertently disabled the > sendpage path (can't recall the subsystem -- iSCSI?). If you discover > that sendpage is actually slower, that would be very interesting. I will try to find a good machine for that with fat network. Should be easy to benchmark. >> Here is the output of the rbd fio load for various block sizes: >> >> ==== WRITE === >> >> current master, rw=randwrite, numjobs=32 iodepth=32 >> >> 4k IOPS=92.7k, BW=362MiB/s, Lat=11033.30usec >> 8k IOPS=85.6k, BW=669MiB/s, Lat=11956.74usec >> 16k IOPS=76.8k, BW=1200MiB/s, Lat=13318.24usec >> 32k IOPS=56.7k, BW=1770MiB/s, Lat=18056.92usec >> 64k IOPS=34.0k, BW=2186MiB/s, Lat=29.23msec >> 128k IOPS=21.8k, BW=2720MiB/s, Lat=46.96msec >> 256k IOPS=14.4k, BW=3596MiB/s, Lat=71.03msec >> 512k IOPS=8726, BW=4363MiB/s, Lat=116.34msec >> 1m IOPS=4799, BW=4799MiB/s, Lat=211.15msec >> >> this patchset, rw=randwrite, numjobs=32 iodepth=32 >> >> 4k IOPS=94.7k, BW=370MiB/s, Lat=10802.43usec >> 8k IOPS=91.2k, BW=712MiB/s, Lat=11221.00usec >> 16k IOPS=80.4k, BW=1257MiB/s, Lat=12715.56usec >> 32k IOPS=61.2k, BW=1912MiB/s, Lat=16721.33usec >> 64k IOPS=40.9k, BW=2554MiB/s, Lat=24993.31usec >> 128k IOPS=25.7k, BW=3216MiB/s, Lat=39.72msec >> 256k IOPS=17.3k, BW=4318MiB/s, Lat=59.15msec >> 512k IOPS=11.1k, BW=5559MiB/s, Lat=91.39msec >> 1m IOPS=6696, BW=6696MiB/s, Lat=151.25msec >> >> >> === READ === >> >> current master, rw=randread, numjobs=32 iodepth=32 >> >> 4k IOPS=62.5k, BW=244MiB/s, Lat=16.38msec >> 8k IOPS=55.5k, BW=433MiB/s, Lat=18.44msec >> 16k IOPS=40.6k, BW=635MiB/s, Lat=25.18msec >> 32k IOPS=24.6k, BW=768MiB/s, Lat=41.61msec >> 64k IOPS=14.8k, BW=925MiB/s, Lat=69.06msec >> 128k IOPS=8687, BW=1086MiB/s, Lat=117.59msec >> 256k IOPS=4733, BW=1183MiB/s, Lat=214.76msec >> 512k IOPS=3156, BW=1578MiB/s, Lat=320.54msec >> 1m IOPS=1901, BW=1901MiB/s, Lat=528.22msec >> >> this patchset, rw=randread, numjobs=32 iodepth=32 >> >> 4k IOPS=62.6k, BW=244MiB/s, Lat=16342.89usec >> 8k IOPS=55.5k, BW=434MiB/s, Lat=18.42msec >> 16k IOPS=43.2k, BW=675MiB/s, Lat=23.68msec >> 32k IOPS=28.4k, BW=887MiB/s, Lat=36.04msec >> 64k IOPS=20.2k, BW=1263MiB/s, Lat=50.54msec >> 128k IOPS=11.7k, BW=1465MiB/s, Lat=87.01msec >> 256k IOPS=6813, BW=1703MiB/s, Lat=149.30msec >> 512k IOPS=5363, BW=2682MiB/s, Lat=189.37msec >> 1m IOPS=2220, BW=2221MiB/s, Lat=453.92msec >> >> >> Results for small blocks are not interesting, since there should not >> be any difference. But starting from 32k block benefits of doing IO >> for the whole message at once starts to prevail. > > It's not really the whole message, just the header, front and middle > sections, right? No, that is the whole message. Output of fio rbd load. Or I did not get your question. > The data section is still per-bvec, it's just that > bvec is no longer limited to a single page but may encompass several > physically contiguous pages. True. Data section is bvec, which is taken from a bio, which on its turn has one big physically contiguous multipage. Of course when there is such a slice of physical contiguous memory. > These are not that easy to come by on > a heavily loaded system, but they do result in nice numbers. Yeah, sometimes there is quite a big number of segments for a big IO. And for that case it seems makes sense to do buffering, i.e. calling sendmsg(). So probably sendpage() makes sense to call when nr_segs < N && bvec->bv_len > M * 4k where N and M are some magic numbers which help to resuce costs of calling do_tcp_sendpages() in a loooong loop. But this is pure speculation. -- Roman