Message ID | 1458130611-17304-3-git-send-email-thuth@redhat.com (mailing list archive) |
---|---|
State | New, archived |
On Wed, Mar 16, 2016 at 01:16:50PM +0100, Thomas Huth wrote:
> tl;dr:
> This patch introduces an alternate way of handling the receive
> buffers of the spapr-vlan device, resulting in much better
> receive performance for the guest.
>
> Full story:
> One of our testers recently discovered that the performance of the
> spapr-vlan device is very poor compared to other NICs, and that
> a simple "ping -i 0.2 -s 65507 someip" in the guest can result
> in more than 50% lost ping packets (especially with older guest
> kernels < 3.17).
>
> After doing some analysis, it was clear that there is a problem
> with the way we handle the receive buffers in spapr_llan.c: The
> ibmveth driver of the guest Linux kernel tries to add a lot of
> buffers into several buffer pools (with 512, 2048 and 65536 byte
> sizes by default, but it can be changed via the entries in the
> /sys/devices/vio/1000/pool* directories of the guest). However,
> the spapr-vlan device of QEMU only tries to squeeze all receive
> buffer descriptors into one single page which has been supplied
> by the guest during the H_REGISTER_LOGICAL_LAN call, without
> taking care of different buffer sizes. This has two bad effects:
> First, only a very limited number of buffer descriptors is accepted
> at all. Second, we also hand 64k buffers to the guest even if
> the 2k buffers would fit better - and this results in dropped packets
> in the IP layer of the guest since too much skbuf memory is used.
>
> Though it seems at a first glance like PAPR says that we should store
> the receive buffer descriptors in the page that is supplied during
> the H_REGISTER_LOGICAL_LAN call, chapter 16.4.1.2 in the LoPAPR spec
> declares that "the contents of these descriptors are architecturally
> opaque, none of these descriptors are manipulated by code above
> the architected interfaces".

Aaaahhh!
I remember back when I first implemented this, that exposing the pool
of buffer descriptors via DMA seemed a silly and pointless thing to
do, but I did so because I thought that's what the spec said.

16.4.1.2 seems to make it clearer that the page doesn't list actual Rx
buffers, but rather opaque handles on internal buffer pools.

I don't know if I just misread this back in 2011 (or whenever it was)
or if the PAPR wording at the time was less clear at the time.

I note that you don't actually put the buffer pool pointers into that
page in your patch below.  I don't think that matters now, but I
wonder if we'd ever want to implement H_MIGRATE_DMA and if we'd need
it in that case.

> That means we don't have to store
> the RX buffer descriptors in this page, but can also manage the
> receive buffers at the hypervisor level only. This is now what we
> are doing here: Introducing proper RX buffer pools which are also
> sorted by size of the buffers, so we can hand out a buffer with
> the best fitting size when a packet has been received.
>
> To avoid problems with migration from/to older version of QEMU,
> the old behavior is also retained and enabled by default. The new
> buffer management has to be enabled via a new "use-rx-buffer-pools"
> property.
>
> Now with the new buffer pool management enabled, the problem with
> "ping -s 65507" is fixed for me, and the throughput of a simple
> test with wget increases from creeping 3MB/s up to 20MB/s!

Excellent.  Still not exactly super fast, but much more respectable
than it was before.

> Signed-off-by: Thomas Huth <thuth@redhat.com>
> ---
>  hw/net/spapr_llan.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 207 insertions(+), 2 deletions(-)
>
> diff --git a/hw/net/spapr_llan.c b/hw/net/spapr_llan.c
> index 39a1dd1..74c8b1e 100644
> --- a/hw/net/spapr_llan.c
> +++ b/hw/net/spapr_llan.c
> @@ -45,6 +45,10 @@
>  #define DPRINTF(fmt...)
>  #endif
>
> +/* Compatibility flags for migration */
> +#define SPAPRVLAN_FLAG_RX_BUF_POOLS_BIT 0
> +#define SPAPRVLAN_FLAG_RX_BUF_POOLS (1 << SPAPRVLAN_FLAG_RX_BUF_POOLS_BIT)
> +
>  /*
>   * Virtual LAN device
>   */
> @@ -86,6 +90,15 @@ typedef uint64_t vlan_bd_t;
>  #define VIO_SPAPR_VLAN_DEVICE(obj) \
>      OBJECT_CHECK(VIOsPAPRVLANDevice, (obj), TYPE_VIO_SPAPR_VLAN_DEVICE)
>
> +#define RX_POOL_MAX_BDS 4096
> +#define RX_MAX_POOLS 5
> +
> +typedef struct {
> +    int32_t bufsize;
> +    int32_t count;
> +    vlan_bd_t bds[RX_POOL_MAX_BDS];
> +} RxBufPool;
> +
>  typedef struct VIOsPAPRVLANDevice {
>      VIOsPAPRDevice sdev;
>      NICConf nicconf;
> @@ -94,6 +107,8 @@ typedef struct VIOsPAPRVLANDevice {
>      target_ulong buf_list;
>      uint32_t add_buf_ptr, use_buf_ptr, rx_bufs;
>      target_ulong rxq_ptr;
> +    uint32_t compat_flags;            /* Compatability flags for migration */
> +    RxBufPool *rx_pool[RX_MAX_POOLS]; /* Receive buffer descriptor pools */
>  } VIOsPAPRVLANDevice;
>
>  static int spapr_vlan_can_receive(NetClientState *nc)
> @@ -104,6 +119,37 @@ static int spapr_vlan_can_receive(NetClientState *nc)
>  }
>
>  /**
> + * Get buffer descriptor from one of our receive buffer pools
> + */
> +static vlan_bd_t spapr_vlan_get_rx_bd_from_pool(VIOsPAPRVLANDevice *dev,
> +                                                size_t size)
> +{
> +    vlan_bd_t bd;
> +    int pool;
> +
> +    for (pool = 0; pool < RX_MAX_POOLS; pool++) {
> +        if (dev->rx_pool[pool]->count > 0 &&
> +            dev->rx_pool[pool]->bufsize >= size + 8) {
> +            break;
> +        }
> +    }
> +    if (pool == RX_MAX_POOLS) {
> +        /* Failed to find a suitable buffer */
> +        return 0;
> +    }
> +
> +    DPRINTF("Found buffer: pool=%d count=%d rxbufs=%d\n", pool,
> +            dev->rx_pool[pool]->count, dev->rx_bufs);
> +
> +    /* Remove the buffer from the pool */
> +    dev->rx_pool[pool]->count--;
> +    bd = dev->rx_pool[pool]->bds[dev->rx_pool[pool]->count];
> +    dev->rx_pool[pool]->bds[dev->rx_pool[pool]->count] = 0;
> +
> +    return bd;
> +}
> +
> +/**
>   * Get buffer descriptor from the receive buffer list page that has been
>   * supplied by the guest with the H_REGISTER_LOGICAL_LAN call
>   */
> @@ -160,7 +206,11 @@ static ssize_t spapr_vlan_receive(NetClientState *nc, const uint8_t *buf,
>          return -1;
>      }
>
> -    bd = spapr_vlan_get_rx_bd_from_page(dev, size);
> +    if (dev->compat_flags & SPAPRVLAN_FLAG_RX_BUF_POOLS) {
> +        bd = spapr_vlan_get_rx_bd_from_pool(dev, size);
> +    } else {
> +        bd = spapr_vlan_get_rx_bd_from_page(dev, size);
> +    }
>      if (!bd) {
>          return -1;
>      }
> @@ -216,10 +266,18 @@ static NetClientInfo net_spapr_vlan_info = {
>  static void spapr_vlan_reset(VIOsPAPRDevice *sdev)
>  {
>      VIOsPAPRVLANDevice *dev = VIO_SPAPR_VLAN_DEVICE(sdev);
> +    int i;
>
>      dev->buf_list = 0;
>      dev->rx_bufs = 0;
>      dev->isopen = 0;
> +
> +    if (dev->compat_flags & SPAPRVLAN_FLAG_RX_BUF_POOLS) {
> +        for (i = 0; i < RX_MAX_POOLS; i++) {
> +            memset(dev->rx_pool[i]->bds, 0, sizeof(dev->rx_pool[i]->bds));
> +            dev->rx_pool[i]->count = 0;
> +        }
> +    }
>  }
>
>  static void spapr_vlan_realize(VIOsPAPRDevice *sdev, Error **errp)
> @@ -236,10 +294,30 @@ static void spapr_vlan_realize(VIOsPAPRDevice *sdev, Error **errp)
>  static void spapr_vlan_instance_init(Object *obj)
>  {
>      VIOsPAPRVLANDevice *dev = VIO_SPAPR_VLAN_DEVICE(obj);
> +    int i;
>
>      device_add_bootindex_property(obj, &dev->nicconf.bootindex,
>                                    "bootindex", "",
>                                    DEVICE(dev), NULL);
> +
> +    if (dev->compat_flags & SPAPRVLAN_FLAG_RX_BUF_POOLS) {
> +        for (i = 0; i < RX_MAX_POOLS; i++) {
> +            dev->rx_pool[i] = g_new0(RxBufPool, 1);
> +        }
> +    }
> +}
> +
> +static void spapr_vlan_instance_finalize(Object *obj)
> +{
> +    VIOsPAPRVLANDevice *dev = VIO_SPAPR_VLAN_DEVICE(obj);
> +    int i;
> +
> +    if (dev->compat_flags & SPAPRVLAN_FLAG_RX_BUF_POOLS) {
> +        for (i = 0; i < RX_MAX_POOLS; i++) {
> +            g_free(dev->rx_pool[i]);
> +            dev->rx_pool[i] = NULL;
> +        }
> +    }
>  }
>
>  void spapr_vlan_create(VIOsPAPRBus *bus, NICInfo *nd)
> @@ -390,6 +468,89 @@ static target_ulong h_free_logical_lan(PowerPCCPU *cpu,
>      return H_SUCCESS;
>  }
>
> +/**
> + * Used for qsort, this function compares two RxBufPools by size.
> + * If a pool is unused (i.e. count == 0), treat its size a INT_MAX
> + * so that it gets sorted to the end of the rx_pool array.
> + */
> +static int rx_pool_size_compare(const void *p1, const void *p2)
> +{
> +    const RxBufPool *pool1 = *(RxBufPool **)p1;
> +    const RxBufPool *pool2 = *(RxBufPool **)p2;
> +    int size1, size2;
> +
> +    size1 = pool1->count ? pool1->bufsize : INT_MAX;
> +    size2 = pool2->count ? pool2->bufsize : INT_MAX;
> +
> +    if (size1 < size2) {
> +        return -1;
> +    }
> +    return size1 > size2;
> +}
> +
> +/**
> + * Search for a matching buffer pool with exact matching size,
> + * or return -1 if no matching pool has been found.
> + */
> +static int spapr_vlan_get_rx_pool_id(VIOsPAPRVLANDevice *dev, int size)
> +{
> +    int pool;
> +
> +    for (pool = 0; pool < RX_MAX_POOLS; pool++) {
> +        if (dev->rx_pool[pool]->bufsize == size) {
> +            return pool;
> +        }
> +    }
> +
> +    return -1;
> +}
> +
> +/**
> + * Enqueuing receive buffer by adding it to one of our receive buffer pools
> + */
> +static target_long spapr_vlan_add_rxbuf_to_pool(VIOsPAPRVLANDevice *dev,
> +                                                target_ulong buf)
> +{
> +    int size = VLAN_BD_LEN(buf);
> +    int pool;
> +
> +    pool = spapr_vlan_get_rx_pool_id(dev, size);
> +
> +    /* No matching pool found? Try to create a new one */
> +    if (pool < 0) {
> +        for (pool = RX_MAX_POOLS - 1; pool >= 0 ; pool--) {

I don't think this loop actually accomplishes anything.  Either the
last slot is free, in which case you use it, then sort into place, or
it's not, in which case you've hit the maximum number of buffer pools.

> +            if (dev->rx_pool[pool]->count == 0) {
> +                dev->rx_pool[pool]->bufsize = size;
> +                /*
> +                 * Sort pools by size so that spapr_vlan_receive()
> +                 * can later find the smallest buffer pool easily.
> +                 */
> +                qsort(dev->rx_pool, RX_MAX_POOLS, sizeof(dev->rx_pool[0]),
> +                      rx_pool_size_compare);
> +                pool = spapr_vlan_get_rx_pool_id(dev, size);
> +                DPRINTF("created RX pool %d for size %lld\n", pool,
> +                        VLAN_BD_LEN(buf));
> +                break;
> +            }
> +        }
> +    }
> +    /* Still no usable pool? Give up */
> +    if (pool < 0 || dev->rx_pool[pool]->count >= RX_POOL_MAX_BDS) {
> +        return H_RESOURCE;
> +    }
> +
> +    DPRINTF("h_add_llan_buf(): Add buf using pool %i (size %lli, count=%i)\n",
> +            pool, VLAN_BD_LEN(buf), dev->rx_pool[pool]->count);
> +
> +    dev->rx_pool[pool]->bds[dev->rx_pool[pool]->count++] = buf;
> +
> +    return 0;
> +}
> +
> +/**
> + * This is the old way of enqueuing receive buffers: Add it to the rx queue
> + * page that has been supplied by the guest (which is quite limited in size).
> + */
>  static target_long spapr_vlan_add_rxbuf_to_page(VIOsPAPRVLANDevice *dev,
>                                                  target_ulong buf)
>  {
> @@ -445,7 +606,11 @@ static target_ulong h_add_logical_lan_buffer(PowerPCCPU *cpu,
>          return H_RESOURCE;
>      }
>
> -    ret = spapr_vlan_add_rxbuf_to_page(dev, buf);
> +    if (dev->compat_flags & SPAPRVLAN_FLAG_RX_BUF_POOLS) {
> +        ret = spapr_vlan_add_rxbuf_to_pool(dev, buf);
> +    } else {
> +        ret = spapr_vlan_add_rxbuf_to_page(dev, buf);
> +    }
>      if (ret) {
>          return ret;
>      }
> @@ -543,9 +708,44 @@ static target_ulong h_multicast_ctrl(PowerPCCPU *cpu, sPAPRMachineState *spapr,
>  static Property spapr_vlan_properties[] = {
>      DEFINE_SPAPR_PROPERTIES(VIOsPAPRVLANDevice, sdev),
>      DEFINE_NIC_PROPERTIES(VIOsPAPRVLANDevice, nicconf),
> +    DEFINE_PROP_BIT("use-rx-buffer-pools", VIOsPAPRVLANDevice,
> +                    compat_flags, SPAPRVLAN_FLAG_RX_BUF_POOLS_BIT, false),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>
> +static bool spapr_vlan_rx_buffer_pools_needed(void *opaque)
> +{
> +    VIOsPAPRVLANDevice *dev = opaque;
> +
> +    return (dev->compat_flags & SPAPRVLAN_FLAG_RX_BUF_POOLS) != 0;
> +}
> +
> +static const VMStateDescription vmstate_rx_buffer_pool = {
> +    .name = "spapr_llan/rx_buffer_pool",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_vlan_rx_buffer_pools_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_INT32(bufsize, RxBufPool),
> +        VMSTATE_INT32(count, RxBufPool),
> +        VMSTATE_UINT64_ARRAY(bds, RxBufPool, RX_POOL_MAX_BDS),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static const VMStateDescription vmstate_rx_pools = {
> +    .name = "spapr_llan/rx_pools",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = spapr_vlan_rx_buffer_pools_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_ARRAY_OF_POINTER_TO_STRUCT(rx_pool, VIOsPAPRVLANDevice,
> +                                           RX_MAX_POOLS, 1,
> +                                           vmstate_rx_buffer_pool, RxBufPool),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static const VMStateDescription vmstate_spapr_llan = {
>      .name = "spapr_llan",
>      .version_id = 1,
> @@ -562,6 +762,10 @@ static const VMStateDescription vmstate_spapr_llan = {
>
>          VMSTATE_END_OF_LIST()
>      },
> +    .subsections = (const VMStateDescription * []) {
> +        &vmstate_rx_pools,
> +        NULL
> +    }
>  };
>
>  static void spapr_vlan_class_init(ObjectClass *klass, void *data)
> @@ -588,6 +792,7 @@ static const TypeInfo spapr_vlan_info = {
>      .instance_size = sizeof(VIOsPAPRVLANDevice),
>      .class_init = spapr_vlan_class_init,
>      .instance_init = spapr_vlan_instance_init,
> +    .instance_finalize = spapr_vlan_instance_finalize,
>  };
>
>  static void spapr_vlan_register_types(void)
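For readers skimming the archive, the receive path being reviewed above can be sketched standalone (simplified, hypothetical types, not the actual QEMU structures): because the pools are kept sorted by buffer size, a single linear scan returns the smallest non-empty pool whose buffers can hold the packet.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_POOLS 5   /* mirrors RX_MAX_POOLS in the patch */

/* Simplified stand-in for RxBufPool (illustrative only) */
typedef struct {
    int32_t bufsize;  /* size of every buffer in this pool */
    int32_t count;    /* buffers currently queued in this pool */
} Pool;

/*
 * Return the index of the best-fitting non-empty pool for a packet of
 * `size` bytes, or -1 if none fits.  The `size + 8` mirrors the patch:
 * the buffer must also hold the 8-byte header that the device prepends.
 * Since pools[] is sorted by bufsize, the first hit is the best fit.
 */
static int best_fit_pool(const Pool pools[], int size)
{
    for (int i = 0; i < MAX_POOLS; i++) {
        if (pools[i].count > 0 && pools[i].bufsize >= size + 8) {
            return i;
        }
    }
    return -1;
}
```

With the default ibmveth pool sizes (512/2048/65536), a 1000-byte packet would fall through an empty 2048 pool to the 64k pool, which is exactly the over-allocation the commit message describes for the old single-page scheme.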
On 17.03.2016 07:23, David Gibson wrote:
> On Wed, Mar 16, 2016 at 01:16:50PM +0100, Thomas Huth wrote:
>>
>> This patch introduces an alternate way of handling the receive
>> buffers of the spapr-vlan device, resulting in much better
>> receive performance for the guest.
[...]
>> Though it seems at a first glance like PAPR says that we should store
>> the receive buffer descriptors in the page that is supplied during
>> the H_REGISTER_LOGICAL_LAN call, chapter 16.4.1.2 in the LoPAPR spec
>> declares that "the contents of these descriptors are architecturally
>> opaque, none of these descriptors are manipulated by code above
>> the architected interfaces".
>
> Aaaahhh! I remember back when I first implemented this, that exposing
> the pool of buffer descriptors via DMA seemed a silly and pointless
> thing to do, but I did so because I thought that's what the spec said.
>
> 16.4.1.2 seems to make it clearer that the page doesn't list actual Rx
> buffers, but rather opaque handles on internal buffer pools.
>
> I don't know if I just misread this back in 2011 (or whenever it was)
> or if the PAPR wording at the time was less clear at the time.
>
> I note that you don't actually put the buffer pool pointers into that
> page in your patch below. I don't think that matters now, but I
> wonder if we'd ever want to implement H_MIGRATE_DMA and if we'd need
> it in that case.

I also thought about putting the pointers to the pools into that page.
But: If we put buffer pool pointers into that page, where should the
buffer pools be located? Still in the memory of the hypervisor? Then
this sounds like a very baaad design, the guest then could tinker with
pointers to the host memory, causing very bad side effects or crashes.
Or should the buffer pools be located in guest memory? That might be OK,
but how do the pools get allocated in that case?

So unless you've got a good idea here, I think it's better to keep the
pointer list and the buffer pools both in host memory for now.
[...]
>> +/**
>> + * Enqueuing receive buffer by adding it to one of our receive buffer pools
>> + */
>> +static target_long spapr_vlan_add_rxbuf_to_pool(VIOsPAPRVLANDevice *dev,
>> +                                                target_ulong buf)
>> +{
>> +    int size = VLAN_BD_LEN(buf);
>> +    int pool;
>> +
>> +    pool = spapr_vlan_get_rx_pool_id(dev, size);
>> +
>> +    /* No matching pool found? Try to create a new one */
>> +    if (pool < 0) {
>> +        for (pool = RX_MAX_POOLS - 1; pool >= 0 ; pool--) {
>
> I don't think this loop actually accomplishes anything.  Either the
> last slot is free, in which case you use it, then sort into place, or
> it's not, in which case you've hit the maximum number of buffer pools.

Oh, you're right. Well spotted! I'll rework my patch to do it without
that loop.

 Thomas
On Thu, Mar 17, 2016 at 08:30:15AM +0100, Thomas Huth wrote:
> On 17.03.2016 07:23, David Gibson wrote:
> > On Wed, Mar 16, 2016 at 01:16:50PM +0100, Thomas Huth wrote:
> >>
> >> This patch introduces an alternate way of handling the receive
> >> buffers of the spapr-vlan device, resulting in much better
> >> receive performance for the guest.
> [...]
> >> Though it seems at a first glance like PAPR says that we should store
> >> the receive buffer descriptors in the page that is supplied during
> >> the H_REGISTER_LOGICAL_LAN call, chapter 16.4.1.2 in the LoPAPR spec
> >> declares that "the contents of these descriptors are architecturally
> >> opaque, none of these descriptors are manipulated by code above
> >> the architected interfaces".
> >
> > Aaaahhh! I remember back when I first implemented this, that exposing
> > the pool of buffer descriptors via DMA seemed a silly and pointless
> > thing to do, but I did so because I thought that's what the spec said.
> >
> > 16.4.1.2 seems to make it clearer that the page doesn't list actual Rx
> > buffers, but rather opaque handles on internal buffer pools.
> >
> > I don't know if I just misread this back in 2011 (or whenever it was)
> > or if the PAPR wording at the time was less clear at the time.
> >
> > I note that you don't actually put the buffer pool pointers into that
> > page in your patch below. I don't think that matters now, but I
> > wonder if we'd ever want to implement H_MIGRATE_DMA and if we'd need
> > it in that case.
>
> I also thought about putting the pointers to the pools into that page.
> But: If we put buffer pool pointers into that page, where should the
> buffer pools be located? Still in the memory of the hypervisor? Then
> this sounds like a very baaad design, the guest then could tinker with
> pointers to the host memory, causing very bad side effects or crashes.
> Or should the buffer pools be located in guest memory? That might be OK,
> but how do the pools get allocated in that case?
>
> So unless you've got a good idea here, I think it's better to keep the
> pointer list and the buffer pools both in host memory for now.

Yes, I think you're right.

> [...]
> >> +/**
> >> + * Enqueuing receive buffer by adding it to one of our receive buffer pools
> >> + */
> >> +static target_long spapr_vlan_add_rxbuf_to_pool(VIOsPAPRVLANDevice *dev,
> >> +                                                target_ulong buf)
> >> +{
> >> +    int size = VLAN_BD_LEN(buf);
> >> +    int pool;
> >> +
> >> +    pool = spapr_vlan_get_rx_pool_id(dev, size);
> >> +
> >> +    /* No matching pool found? Try to create a new one */
> >> +    if (pool < 0) {
> >> +        for (pool = RX_MAX_POOLS - 1; pool >= 0 ; pool--) {
> >
> > I don't think this loop actually accomplishes anything.  Either the
> > last slot is free, in which case you use it, then sort into place, or
> > it's not, in which case you've hit the maximum number of buffer pools.
>
> Oh, you're right. Well spotted! I'll rework my patch to do it without
> that loop.
>
> Thomas
On 17.03.2016 08:30, Thomas Huth wrote:
> On 17.03.2016 07:23, David Gibson wrote:
>> On Wed, Mar 16, 2016 at 01:16:50PM +0100, Thomas Huth wrote:
>>>
>>> This patch introduces an alternate way of handling the receive
>>> buffers of the spapr-vlan device, resulting in much better
>>> receive performance for the guest.
[...]
>>> +/**
>>> + * Enqueuing receive buffer by adding it to one of our receive buffer pools
>>> + */
>>> +static target_long spapr_vlan_add_rxbuf_to_pool(VIOsPAPRVLANDevice *dev,
>>> +                                                target_ulong buf)
>>> +{
>>> +    int size = VLAN_BD_LEN(buf);
>>> +    int pool;
>>> +
>>> +    pool = spapr_vlan_get_rx_pool_id(dev, size);
>>> +
>>> +    /* No matching pool found? Try to create a new one */
>>> +    if (pool < 0) {
>>> +        for (pool = RX_MAX_POOLS - 1; pool >= 0 ; pool--) {
>>
>> I don't think this loop actually accomplishes anything.  Either the
>> last slot is free, in which case you use it, then sort into place, or
>> it's not, in which case you've hit the maximum number of buffer pools.
>
> Oh, you're right. Well spotted! I'll rework my patch to do it without
> that loop.

Wait, no, there was a case where this loop is actually really required:

1) All pools are in use and filled with at least one BD
2) User in the guest suddenly decides to change the buffer size of
   one of the pools in the /sys fs of the guest.
3) Guest driver tries to add buffers with a new size that do not
   match any size of one of the pools in the host
4) After the pool on the host runs empty which contained the BDs with
   the size that is not in use anymore, we should recycle that pool
   for the buffers with the new size instead. Since that buffer pool
   might not be at the end of the list, we've got to scan all buffers
   here to make sure we find it.

So I think the for-loop should stay as it is.

 Thomas
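The recycling scenario in steps 1-4 above can be illustrated with a minimal sketch (simplified, hypothetical types, not the QEMU code): when no pool matches the new buffer size, only a scan over *all* slots finds an emptied pool that happens to sit in the middle of the array, which is why the loop cannot just check the last slot.

```c
#include <assert.h>

#define MAX_POOLS 5   /* mirrors RX_MAX_POOLS */

/* Simplified stand-in for RxBufPool (illustrative only) */
typedef struct {
    int bufsize;
    int count;
} Pool;

/*
 * Find the pool with exactly this buffer size; if none exists, repurpose
 * any empty pool for the new size.  The reverse scan over every slot is
 * the point under discussion: the emptied pool may be anywhere, not just
 * in the last slot.
 */
static int find_or_recycle_pool(Pool pools[], int size)
{
    /* Exact size match first */
    for (int i = 0; i < MAX_POOLS; i++) {
        if (pools[i].bufsize == size) {
            return i;
        }
    }
    /* Otherwise recycle any empty pool, scanning all slots */
    for (int i = MAX_POOLS - 1; i >= 0; i--) {
        if (pools[i].count == 0) {
            pools[i].bufsize = size;
            return i;
        }
    }
    return -1;   /* every pool still holds buffers of another size */
}
```

In the test below the guest has abandoned the 2048-byte size, so its now-empty host pool (slot 1, mid-array) gets recycled for a new 4096-byte size; checking only the last slot would have returned H_RESOURCE instead.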
On Thu, Mar 17, 2016 at 04:15:38PM +0100, Thomas Huth wrote:
> On 17.03.2016 08:30, Thomas Huth wrote:
> > On 17.03.2016 07:23, David Gibson wrote:
> >> On Wed, Mar 16, 2016 at 01:16:50PM +0100, Thomas Huth wrote:
> >>>
> >>> This patch introduces an alternate way of handling the receive
> >>> buffers of the spapr-vlan device, resulting in much better
> >>> receive performance for the guest.
> [...]
> >>> +/**
> >>> + * Enqueuing receive buffer by adding it to one of our receive buffer pools
> >>> + */
> >>> +static target_long spapr_vlan_add_rxbuf_to_pool(VIOsPAPRVLANDevice *dev,
> >>> +                                                target_ulong buf)
> >>> +{
> >>> +    int size = VLAN_BD_LEN(buf);
> >>> +    int pool;
> >>> +
> >>> +    pool = spapr_vlan_get_rx_pool_id(dev, size);
> >>> +
> >>> +    /* No matching pool found? Try to create a new one */
> >>> +    if (pool < 0) {
> >>> +        for (pool = RX_MAX_POOLS - 1; pool >= 0 ; pool--) {
> >>
> >> I don't think this loop actually accomplishes anything.  Either the
> >> last slot is free, in which case you use it, then sort into place, or
> >> it's not, in which case you've hit the maximum number of buffer pools.
> >
> > Oh, you're right. Well spotted! I'll rework my patch to do it without
> > that loop.
>
> Wait, no, there was a case where this loop is actually really required:
>
> 1) All pools are in use and filled with at least one BD
> 2) User in the guest suddenly decides to change the buffer size of
>    one of the pools in the /sys fs of the guest.
> 3) Guest driver tries to add buffers with a new size that do not
>    match any size of one of the pools in the host
> 4) After the pool on the host runs empty which contained the BDs with
>    the size that is not in use anymore, we should recycle that pool
>    for the buffers with the new size instead. Since that buffer pool
>    might not be at the end of the list, we've got to scan all buffers
>    here to make sure we find it.
>
> So I think the for-loop should stay as it is.

Ah, good point.  I think I was assuming that the pools got sorted when
one was emptied as well, but they're not and I suspect it's not a good
idea to do so.

Hmm.. I wonder if there's a brief way of explaining the above to put
in the comment.
On 17.03.2016 23:33, David Gibson wrote:
> On Thu, Mar 17, 2016 at 04:15:38PM +0100, Thomas Huth wrote:
>> On 17.03.2016 08:30, Thomas Huth wrote:
>>> On 17.03.2016 07:23, David Gibson wrote:
>>>> On Wed, Mar 16, 2016 at 01:16:50PM +0100, Thomas Huth wrote:
>>>>>
>>>>> This patch introduces an alternate way of handling the receive
>>>>> buffers of the spapr-vlan device, resulting in much better
>>>>> receive performance for the guest.
>> [...]
>>>>> +/**
>>>>> + * Enqueuing receive buffer by adding it to one of our receive buffer pools
>>>>> + */
>>>>> +static target_long spapr_vlan_add_rxbuf_to_pool(VIOsPAPRVLANDevice *dev,
>>>>> +                                                target_ulong buf)
>>>>> +{
>>>>> +    int size = VLAN_BD_LEN(buf);
>>>>> +    int pool;
>>>>> +
>>>>> +    pool = spapr_vlan_get_rx_pool_id(dev, size);
>>>>> +
>>>>> +    /* No matching pool found? Try to create a new one */
>>>>> +    if (pool < 0) {
>>>>> +        for (pool = RX_MAX_POOLS - 1; pool >= 0 ; pool--) {
>>>>
>>>> I don't think this loop actually accomplishes anything. Either the
>>>> last slot is free, in which case you use it, then sort into place, or
>>>> it's not, in which case you've hit the maximum number of buffer pools.
>>>
>>> Oh, you're right. Well spotted! I'll rework my patch to do it without
>>> that loop.
>>
>> Wait, no, there was a case where this loop is actually really required:
>>
>> 1) All pools are in use and filled with at least one BD
>> 2) User in the guest suddenly decides to change the buffer size of
>>    one of the pools in the /sys fs of the guest.
>> 3) Guest driver tries to add buffers with a new size that do not
>>    match any size of one of the pools in the host
>> 4) After the pool on the host runs empty which contained the BDs with
>>    the size that is not in use anymore, we should recycle that pool
>>    for the buffers with the new size instead. Since that buffer pool
>>    might not be at the end of the list, we've got to scan all buffers
>>    here to make sure we find it.
>>
>> So I think the for-loop should stay as it is.
>
> Ah, good point. I think I was assuming that the pools got sorted when
> one was emptied as well, but they're not and I suspect it's not a good
> idea to do so.
>
> Hmm.. I wonder if there's a brief way of explaining the above to put
> in the comment.

Something like:

    /*
     * If the guest used all pools, but changed the size of one pool
     * inbetween, we might need to recycle that pool here (if it has
     * already been emptied). Thus we need to scan all buffer pools
     * here, not only the last one (which has the highest probability
     * of being empty)
     */

?

Or is that too verbose already?

 Thomas
On Fri, Mar 18, 2016 at 08:56:56AM +0100, Thomas Huth wrote:
> On 17.03.2016 23:33, David Gibson wrote:
> > On Thu, Mar 17, 2016 at 04:15:38PM +0100, Thomas Huth wrote:
> >> On 17.03.2016 08:30, Thomas Huth wrote:
> >>> On 17.03.2016 07:23, David Gibson wrote:
> >>>> On Wed, Mar 16, 2016 at 01:16:50PM +0100, Thomas Huth wrote:
> >>>>>
> >>>>> This patch introduces an alternate way of handling the receive
> >>>>> buffers of the spapr-vlan device, resulting in much better
> >>>>> receive performance for the guest.
> >> [...]
> >>>>> +/**
> >>>>> + * Enqueuing receive buffer by adding it to one of our receive buffer pools
> >>>>> + */
> >>>>> +static target_long spapr_vlan_add_rxbuf_to_pool(VIOsPAPRVLANDevice *dev,
> >>>>> +                                                target_ulong buf)
> >>>>> +{
> >>>>> +    int size = VLAN_BD_LEN(buf);
> >>>>> +    int pool;
> >>>>> +
> >>>>> +    pool = spapr_vlan_get_rx_pool_id(dev, size);
> >>>>> +
> >>>>> +    /* No matching pool found? Try to create a new one */
> >>>>> +    if (pool < 0) {
> >>>>> +        for (pool = RX_MAX_POOLS - 1; pool >= 0 ; pool--) {
> >>>>
> >>>> I don't think this loop actually accomplishes anything. Either the
> >>>> last slot is free, in which case you use it, then sort into place, or
> >>>> it's not, in which case you've hit the maximum number of buffer pools.
> >>>
> >>> Oh, you're right. Well spotted! I'll rework my patch to do it without
> >>> that loop.
> >>
> >> Wait, no, there was a case where this loop is actually really required:
> >>
> >> 1) All pools are in use and filled with at least one BD
> >> 2) User in the guest suddenly decides to change the buffer size of
> >>    one of the pools in the /sys fs of the guest.
> >> 3) Guest driver tries to add buffers with a new size that do not
> >>    match any size of one of the pools in the host
> >> 4) After the pool on the host runs empty which contained the BDs with
> >>    the size that is not in use anymore, we should recycle that pool
> >>    for the buffers with the new size instead. Since that buffer pool
> >>    might not be at the end of the list, we've got to scan all buffers
> >>    here to make sure we find it.
> >>
> >> So I think the for-loop should stay as it is.
> >
> > Ah, good point.  I think I was assuming that the pools got sorted when
> > one was emptied as well, but they're not and I suspect it's not a good
> > idea to do so.
> >
> > Hmm.. I wonder if there's a brief way of explaining the above to put
> > in the comment.
>
> Something like:
>
>     /*
>      * If the guest used all pools, but changed the size of one pool
>      * inbetween, we might need to recycle that pool here (if it has
>      * already been emptied). Thus we need to scan all buffer pools
>      * here, not only the last one (which has the highest probability
>      * of being empty)
>      */
>
> ?
>
> Or is that too verbose already?

Eh, it's written, might as well throw it in.
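As a closing illustration, the sorting behaviour the thread relies on (an empty pool is treated as INT_MAX-sized so it sorts behind every pool in use, keeping the smallest usable pool first) can be exercised in isolation. This sketch uses a simplified pool type, not the actual RxBufPool, but the comparator mirrors the patch's rx_pool_size_compare:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Simplified stand-in for RxBufPool (illustrative only) */
typedef struct {
    int bufsize;
    int count;
} Pool;

/*
 * qsort comparator over an array of Pool pointers: an unused pool
 * (count == 0) is treated as INT_MAX-sized so it ends up at the back
 * of the array, and used pools are ordered by ascending bufsize.
 */
static int pool_cmp(const void *a, const void *b)
{
    const Pool *p1 = *(Pool * const *)a;
    const Pool *p2 = *(Pool * const *)b;
    int s1 = p1->count ? p1->bufsize : INT_MAX;
    int s2 = p2->count ? p2->bufsize : INT_MAX;

    if (s1 < s2) {
        return -1;
    }
    return s1 > s2;   /* 1 if greater, 0 if equal */
}
```

After sorting, a best-fit scan from index 0 naturally hits the smallest non-empty pool first, which is what makes the receive-side lookup in the patch a simple linear walk.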
.minimum_version_id = 1, + .needed = spapr_vlan_rx_buffer_pools_needed, + .fields = (VMStateField[]) { + VMSTATE_ARRAY_OF_POINTER_TO_STRUCT(rx_pool, VIOsPAPRVLANDevice, + RX_MAX_POOLS, 1, + vmstate_rx_buffer_pool, RxBufPool), + VMSTATE_END_OF_LIST() + } +}; + static const VMStateDescription vmstate_spapr_llan = { .name = "spapr_llan", .version_id = 1, @@ -562,6 +762,10 @@ static const VMStateDescription vmstate_spapr_llan = { VMSTATE_END_OF_LIST() }, + .subsections = (const VMStateDescription * []) { + &vmstate_rx_pools, + NULL + } }; static void spapr_vlan_class_init(ObjectClass *klass, void *data) @@ -588,6 +792,7 @@ static const TypeInfo spapr_vlan_info = { .instance_size = sizeof(VIOsPAPRVLANDevice), .class_init = spapr_vlan_class_init, .instance_init = spapr_vlan_instance_init, + .instance_finalize = spapr_vlan_instance_finalize, }; static void spapr_vlan_register_types(void)