| Message ID | 4c88e00f5211787a98fa980a4d42c5c6374ab868.1380056994.git.dledford@redhat.com (mailing list archive) |
|---|---|
| State | Rejected |
On Tue, 24 Sep 2013 17:16:29 -0400 Doug Ledford <dledford@redhat.com> wrote:

> @@ -85,13 +91,26 @@ int ib_get_cached_gid(struct ib_device *device,
>
>         cache = device->cache.gid_cache[port_num - start_port(device)];
>
> -       if (index < 0 || index >= cache->table_len)
> +       if (index < 0 || index >= cache->table_len) {
>                 ret = -EINVAL;
> -       else
> -               *gid = cache->table[index];
> +               goto out_unlock;
> +       }
>
> -       read_unlock_irqrestore(&device->cache.lock, flags);
> +       for (i = 0; i < cache->table_len; ++i)
> +               if (cache->entry[i].index == index)
> +                       break;
> +
> +

Hi Doug,

I am a bit concerned about this patch, because where before ib_get_cached_gid just returned the GID at the given index, with your suggested change ib_get_cached_gid() requires a search of the new gid table (to find the entry with the requested index value).

ib_get_cached_gid is called by cm_req_handler for the gid at index 0. There is no guarantee that this will be the first entry in the new scheme.

Furthermore, ib_get_cached_gid is also called in MAD packet handling, with the specific gid index that is required.

Thus, the savings for ib_find_cached_gid might well be offset by a performance loss in ib_get_cached_gid.

A simpler optimization would be to simply keep a count of the number of valid GIDs in the gid table -- and break off the search when the last valid GID has been seen. This would optimize cases where, for example, you are searching for a GID that is not in the table and only the first 3 GIDs in the table are valid (so you would not needlessly access 125 invalid GIDs). Clearly, such an optimization is only useful when there are a lot of invalid GIDs bunched together at the end of the table. Still, something to think about.

-Jack
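[Editor's note: for concreteness, Jack's count-based early exit might look something like the sketch below. Nothing here is from a posted patch: the gid_cache_sketch layout, the valid_count field (assumed to be maintained by the cache update path), and both helper functions are hypothetical names invented for illustration.]

#include <linux/string.h>
#include <rdma/ib_verbs.h>

/* hypothetical cache layout: the existing full-length table plus a
 * count of how many entries in it are valid */
struct gid_cache_sketch {
        int             table_len;
        int             valid_count;    /* assumed: maintained by the update path */
        union ib_gid    table[0];
};

/* hypothetical validity test mirroring the check discussed in this
 * thread: a GID whose lower 8 bytes (the interface ID) are all zero
 * is treated as invalid */
static int gid_entry_valid(const union ib_gid *gid)
{
        static const u8 zeros[8];

        return memcmp(gid->raw + 8, zeros, sizeof zeros) != 0;
}

/* linear search that breaks off once the last valid GID has been seen,
 * so trailing invalid entries are never touched */
static int find_gid_early_exit(struct gid_cache_sketch *cache,
                               const union ib_gid *gid)
{
        int i, seen = 0;

        for (i = 0; i < cache->table_len; ++i) {
                if (!gid_entry_valid(&cache->table[i]))
                        continue;
                if (!memcmp(gid, &cache->table[i], sizeof *gid))
                        return i;
                if (++seen == cache->valid_count)
                        break;  /* every valid entry has been examined */
        }
        return -1;
}

As Jack notes, this only pays off when the valid entries cluster at the front of the table; interleaved invalid entries still get skipped one at a time.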
On 10/20/2013 02:51 AM, Jack Morgenstein wrote:
> On Tue, 24 Sep 2013 17:16:29 -0400
> Doug Ledford <dledford@redhat.com> wrote:
>
>> @@ -85,13 +91,26 @@ int ib_get_cached_gid(struct ib_device *device,
>>
>>         cache = device->cache.gid_cache[port_num - start_port(device)];
>>
>> -       if (index < 0 || index >= cache->table_len)
>> +       if (index < 0 || index >= cache->table_len) {
>>                 ret = -EINVAL;
>> -       else
>> -               *gid = cache->table[index];
>> +               goto out_unlock;
>> +       }
>>
>> -       read_unlock_irqrestore(&device->cache.lock, flags);
>> +       for (i = 0; i < cache->table_len; ++i)
>> +               if (cache->entry[i].index == index)
>> +                       break;
>> +
>> +
>
> Hi Doug,
>
> I am a bit concerned about this patch, because where before
> ib_get_cached_gid just returned the GID at the given index, with your
> suggested change, ib_get_cached_gid() requires a search of the new gid
> table (to find the entry with the requested index value).

Yes, I agree. That was a consideration here.

> ib_get_cached_gid is called by cm_req_handler, for the gid at index 0.
> There is no guarantee that this will be the first entry in the new
> scheme.

If it's valid, then yes it is. Only if GID index 0 is invalid will it not be the first entry (and then it won't be an entry at all). This is due to how the update of the gid table works: it scans each entry, starting at 0 and going to the table length, and puts the valid ones into the cache in ascending order. Which points out a valid optimization not present in this patch: break out of the loop as soon as the current entry's index is greater than the requested index.

I had also thought about using a bitmap for the valid entries. Realistically, most machines only have 1 or 2 ports of IB devices. For common devices, the maximum pkey table length is 128 entries. So 2 ports at 128 entries per port is a pretty miserly 256 bytes of memory. We could just allocate a full table and use a bitmap to represent the valid entries, then in the find_cached* cases use the bitmap to limit our search. That would allow us to go back to O(1) for get_cached*.

> Furthermore, ib_get_cached_gid is also called in MAD packet handling,
> with the specific gid index that is required.
>
> Thus, the savings for ib_find_cached_gid might well be offset by a
> performance loss in ib_get_cached_gid.

Doubtful. The common case trades a single O(1) lookup for a search of a 3-to-5-entry chain, while cutting a search through 128 entries (of which 123 to 125 are invalid) down to a search of that same 3-to-5-entry chain; the obvious big gain is getting rid of the 123 to 125 wasted memcmp's. And ib_find_cached_gid is used in cma_acquire_dev(), which in turn is called by cma_req_handler, iw_conn_req_handler, addr_handler, and rdma_bind_addr. So it sees plenty of use as well.

Now, one thing I haven't tested yet and wish to is the situation where we have lots of SRIOV devices and a GID table on the PF that is highly populated. In that case, further refinement will likely be necessary. If so, that will be my next patch, but I'll leave these patches to stand as they are now.

> A simpler optimization would be to simply keep a count of the number of
> valid GIDs in the gid table -- and break off the search when the last
> valid GID has been seen.

My understanding, according to the PKey change flow patches that Or posted, is that the GID table can have invalid entries prior to valid entries, so that optimization would break.
> This would optimize cases where, for example,
> you are searching for a GID that is not in the table and only the
> first 3 GIDs in the table are valid (so you would not needlessly access
> 125 invalid GIDs). Clearly, such an optimization is only useful when
> there are a lot of invalid GIDs bunched together at the end of the
> table. Still, something to think about.

As you point out, it only works if all the invalid entries are at the end, and we can't guarantee that. I think I like my suggestion better: go back to having a full table, but use a bitmap to indicate valid entries, use the bitmap to limit our comparisons in the find_cached* functions, and put the get_* functions back to being O(1). But I would still do that incrementally from here, I think.

But I'm not totally convinced of that either. The exact situation I listed above, lots of GIDs on an SRIOV PF, makes me concerned that we can get back to a horrible situation in the find_cached* functions once we actually have lots of valid entries. It makes me think we need something better than just a linear search of all valid entries once you take SRIOV into account. Whether it's hash chains, ranges, or something else to make the lots-of-valid-GIDs case faster, I suspect something needs to be done; but because such setups simply aren't in common use yet, we don't know it.
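[Editor's note: Doug's full-table-plus-bitmap idea, under the same caveat as before, might look roughly like the sketch below. The struct layout, the fixed 128-entry sizing, and the function names are assumptions for illustration, not code from any posted patch.]

#include <linux/bitmap.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

#define GID_TBL_LEN 128         /* assumed table size for the sketch */

/* hypothetical layout: the full-length table comes back, with one
 * valid bit per entry alongside it */
struct gid_cache_bitmap_sketch {
        int             table_len;
        DECLARE_BITMAP(valid, GID_TBL_LEN);
        union ib_gid    table[GID_TBL_LEN];
};

/* get_* is O(1) again: the requested index maps straight to its slot */
static int get_gid(struct gid_cache_bitmap_sketch *cache, int index,
                   union ib_gid *gid)
{
        if (index < 0 || index >= cache->table_len ||
            !test_bit(index, cache->valid))
                return -EINVAL;
        *gid = cache->table[index];
        return 0;
}

/* find_* walks only the set bits, so the 123 to 125 invalid entries in
 * the common case are never memcmp'd */
static int find_gid(struct gid_cache_bitmap_sketch *cache,
                    const union ib_gid *gid)
{
        int i;

        for_each_set_bit(i, cache->valid, cache->table_len)
                if (!memcmp(gid, &cache->table[i], sizeof *gid))
                        return i;
        return -1;
}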
On Mon, 21 Oct 2013 00:12:54 -0400 Doug Ledford <dledford@redhat.com> wrote:

> I think I like my suggestion better: go back to having a full table,
> but use a bitmap to indicate valid entries, use the bitmap to limit
> our comparisons in the find_cached* functions, and put the get_*
> functions back to being O(1). But I would still do that incrementally
> from here, I think.
>
> But I'm not totally convinced of that either. The exact situation I
> listed above, lots of GIDs on an SRIOV PF, makes me concerned that we
> can get back to a horrible situation in the find_cached* functions
> once we actually have lots of valid entries. It makes me think we
> need something better than just a linear search of all valid entries
> once you take SRIOV into account. Whether it's hash chains, ranges,
> or something else to make the lots-of-valid-GIDs case faster, I
> suspect something needs to be done; but because such setups simply
> aren't in common use yet, we don't know it.

Doug,

I like your suggestion regarding bitmaps. I would rather hold off on patch #3, though, because as you say, patches 1 and 2 do most of the work, and the patch #3 optimization won't do much if the GID table is heavily populated (which will be the case under SRIOV). I think what you say is correct: a linear search through a populated table will be expensive, and we need to come up with a better strategy here.

ACK for the first 2 patches; please hold off on the third.

-Jack
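[Editor's note: as one possible shape for that better strategy, the "hash chains" Doug mentions could hash each valid GID into a small bucket array, so lookup cost stays flat even with a fully populated SRIOV PF table. This is purely illustrative; the bucket sizing, entry struct, and function names are all invented for the sketch.]

#include <linux/jhash.h>
#include <linux/list.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

#define GID_HASH_BITS   6
#define GID_HASH_SIZE   (1 << GID_HASH_BITS)

/* each valid slot in the device's table gets one of these, chained
 * into the bucket its GID hashes to */
struct gid_hash_entry {
        struct hlist_node       node;
        union ib_gid            gid;
        unsigned char           index;  /* position in the device's real table */
};

static struct hlist_head gid_buckets[GID_HASH_SIZE];

static u32 gid_bucket(const union ib_gid *gid)
{
        return jhash(gid->raw, sizeof(gid->raw), 0) & (GID_HASH_SIZE - 1);
}

/* lookup touches only the handful of entries sharing one bucket,
 * rather than scanning every valid entry linearly */
static struct gid_hash_entry *gid_hash_find(const union ib_gid *gid)
{
        struct gid_hash_entry *e;

        hlist_for_each_entry(e, &gid_buckets[gid_bucket(gid)], node)
                if (!memcmp(&e->gid, gid, sizeof(e->gid)))
                        return e;
        return NULL;
}

The update path would rebuild the buckets whenever the cache is refreshed, and locking would follow whatever scheme the cache already uses.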
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 80f6cf2..d31972d 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -44,12 +44,18 @@
 struct ib_pkey_cache {
        int             table_len;
-       u16             table[0];
+       struct pkey_cache_table_entry {
+               u16             pkey;
+               unsigned char   index;
+       } entry[0];
 };
 
 struct ib_gid_cache {
        int             table_len;
-       union ib_gid    table[0];
+       struct gid_cache_table_entry {
+               union ib_gid    gid;
+               unsigned char   index;
+       } entry[0];
 };
 
 struct ib_update_work {
@@ -76,7 +82,7 @@ int ib_get_cached_gid(struct ib_device *device,
 {
        struct ib_gid_cache *cache;
        unsigned long flags;
-       int ret = 0;
+       int i, ret = 0;
 
        if (port_num < start_port(device) || port_num > end_port(device))
                return -EINVAL;
@@ -85,13 +91,26 @@ int ib_get_cached_gid(struct ib_device *device,
 
        cache = device->cache.gid_cache[port_num - start_port(device)];
 
-       if (index < 0 || index >= cache->table_len)
+       if (index < 0 || index >= cache->table_len) {
                ret = -EINVAL;
-       else
-               *gid = cache->table[index];
+               goto out_unlock;
+       }
 
-       read_unlock_irqrestore(&device->cache.lock, flags);
+       for (i = 0; i < cache->table_len; ++i)
+               if (cache->entry[i].index == index)
+                       break;
+
+       if (i < cache->table_len)
+               *gid = cache->entry[i].gid;
+       else {
+               ret = ib_query_gid(device, port_num, index, gid);
+               if (ret)
+                       printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
+                              ret, device->name, index);
+       }
 
+out_unlock:
+       read_unlock_irqrestore(&device->cache.lock, flags);
        return ret;
 }
 EXPORT_SYMBOL(ib_get_cached_gid);
@@ -115,18 +134,18 @@ int ib_find_cached_gid(struct ib_device *device,
        for (p = 0; p <= end_port(device) - start_port(device); ++p) {
                cache = device->cache.gid_cache[p];
                for (i = 0; i < cache->table_len; ++i) {
-                       if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
+                       if (!memcmp(gid, &cache->entry[i].gid, sizeof *gid)) {
                                *port_num = p + start_port(device);
                                if (index)
-                                       *index = i;
+                                       *index = cache->entry[i].index;
                                ret = 0;
                                goto found;
                        }
                }
        }
+
 found:
        read_unlock_irqrestore(&device->cache.lock, flags);
-
        return ret;
 }
 EXPORT_SYMBOL(ib_find_cached_gid);
@@ -138,7 +157,7 @@ int ib_get_cached_pkey(struct ib_device *device,
 {
        struct ib_pkey_cache *cache;
        unsigned long flags;
-       int ret = 0;
+       int i, ret = 0;
 
        if (port_num < start_port(device) || port_num > end_port(device))
                return -EINVAL;
@@ -147,13 +166,22 @@ int ib_get_cached_pkey(struct ib_device *device,
 
        cache = device->cache.pkey_cache[port_num - start_port(device)];
 
-       if (index < 0 || index >= cache->table_len)
+       if (index < 0 || index >= cache->table_len) {
                ret = -EINVAL;
+               goto out_unlock;
+       }
+
+       for (i = 0; i < cache->table_len; ++i)
+               if (cache->entry[i].index == index)
+                       break;
+
+       if (i < cache->table_len)
+               *pkey = cache->entry[i].pkey;
        else
-               *pkey = cache->table[index];
+               *pkey = 0x0000;
 
+out_unlock:
        read_unlock_irqrestore(&device->cache.lock, flags);
-
        return ret;
 }
 EXPORT_SYMBOL(ib_get_cached_pkey);
@@ -179,13 +207,13 @@ int ib_find_cached_pkey(struct ib_device *device,
        *index = -1;
 
        for (i = 0; i < cache->table_len; ++i)
-               if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) {
-                       if (cache->table[i] & 0x8000) {
-                               *index = i;
+               if ((cache->entry[i].pkey & 0x7fff) == (pkey & 0x7fff)) {
+                       if (cache->entry[i].pkey & 0x8000) {
+                               *index = cache->entry[i].index;
                                ret = 0;
                                break;
                        } else
-                               partial_ix = i;
+                               partial_ix = cache->entry[i].index;
                }
 
        if (ret && partial_ix >= 0) {
@@ -219,8 +247,8 @@ int ib_find_exact_cached_pkey(struct ib_device *device,
        *index = -1;
 
        for (i = 0; i < cache->table_len; ++i)
-               if (cache->table[i] == pkey) {
-                       *index = i;
+               if (cache->entry[i].pkey == pkey) {
+                       *index = cache->entry[i].index;
                        ret = 0;
                        break;
                }
@@ -255,8 +283,10 @@ static void ib_cache_update(struct ib_device *device,
        struct ib_port_attr       *tprops = NULL;
        struct ib_pkey_cache      *pkey_cache = NULL, *old_pkey_cache;
        struct ib_gid_cache       *gid_cache = NULL, *old_gid_cache;
-       int                        i;
+       int                        i, j;
        int                        ret;
+       union ib_gid               gid, empty_gid;
+       u16                        pkey;
 
        tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
        if (!tprops)
@@ -270,35 +300,79 @@ static void ib_cache_update(struct ib_device *device,
        }
 
        pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
-                            sizeof *pkey_cache->table, GFP_KERNEL);
+                            sizeof *pkey_cache->entry, GFP_KERNEL);
        if (!pkey_cache)
                goto err;
 
-       pkey_cache->table_len = tprops->pkey_tbl_len;
+       pkey_cache->table_len = 0;
 
        gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len *
-                           sizeof *gid_cache->table, GFP_KERNEL);
+                           sizeof *gid_cache->entry, GFP_KERNEL);
        if (!gid_cache)
                goto err;
 
-       gid_cache->table_len = tprops->gid_tbl_len;
+       gid_cache->table_len = 0;
 
-       for (i = 0; i < pkey_cache->table_len; ++i) {
-               ret = ib_query_pkey(device, port, i, pkey_cache->table + i);
+       for (i = 0, j = 0; i < tprops->pkey_tbl_len; ++i) {
+               ret = ib_query_pkey(device, port, i, &pkey);
                if (ret) {
                        printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n",
                               ret, device->name, i);
                        goto err;
                }
+               /* pkey 0xffff must be the default pkey and 0x0000 must be
+                * the invalid pkey per the IBTA spec */
+               if (pkey) {
+                       pkey_cache->entry[j].index = i;
+                       pkey_cache->entry[j++].pkey = pkey;
+               }
        }
+       pkey_cache->table_len = j;
 
-       for (i = 0; i < gid_cache->table_len; ++i) {
-               ret = ib_query_gid(device, port, i, gid_cache->table + i);
+       memset(&empty_gid, 0, sizeof empty_gid);
+       for (i = 0, j = 0; i < tprops->gid_tbl_len; ++i) {
+               ret = ib_query_gid(device, port, i, &gid);
                if (ret) {
                        printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
                               ret, device->name, i);
                        goto err;
                }
+               /* if the lower 8 bytes of the device GID entry are all 0,
+                * our entry is a blank, invalid entry...
+                * depending on device, the upper 8 bytes might or might
+                * not be prefilled with a valid subnet prefix, so
+                * don't rely on them for determining a valid gid
+                * entry
+                */
+               if (memcmp(gid.raw + 8, empty_gid.raw + 8, sizeof gid - 8)) {
+                       gid_cache->entry[j].index = i;
+                       gid_cache->entry[j++].gid = gid;
+               }
+       }
+       gid_cache->table_len = j;
+
+       old_pkey_cache = pkey_cache;
+       pkey_cache = kmalloc(sizeof *pkey_cache + old_pkey_cache->table_len *
+                            sizeof *pkey_cache->entry, GFP_KERNEL);
+       if (!pkey_cache)
+               pkey_cache = old_pkey_cache;
+       else {
+               pkey_cache->table_len = old_pkey_cache->table_len;
+               memcpy(&pkey_cache->entry[0], &old_pkey_cache->entry[0],
+                      pkey_cache->table_len * sizeof *pkey_cache->entry);
+               kfree(old_pkey_cache);
+       }
+
+       old_gid_cache = gid_cache;
+       gid_cache = kmalloc(sizeof *gid_cache + old_gid_cache->table_len *
+                           sizeof *gid_cache->entry, GFP_KERNEL);
+       if (!gid_cache)
+               gid_cache = old_gid_cache;
+       else {
+               gid_cache->table_len = old_gid_cache->table_len;
+               memcpy(&gid_cache->entry[0], &old_gid_cache->entry[0],
+                      gid_cache->table_len * sizeof *gid_cache->entry);
+               kfree(old_gid_cache);
        }
 
        write_lock_irq(&device->cache.lock);
We keep a cache of the GIDs and PKeys on an IB device, but when we get the props for the card, the table length returned is the overall table size, not the number of valid entries in the table. As a result, we end up with things like a 128-entry GID table that holds 3 valid GIDs and 125 empty ones. Then when we call ib_find_cached_gid, we search through all 128 GID cache entries using memcmp. This is needlessly expensive.

So when we update the cache, check each item we get from the card to see if it's valid, and only save it into the cache if it is. We make sure to preserve the index from the card's table with each cache item, so the correct index in the card's table is returned on any lookup that requests the index.

I have performance numbers on this change, but they aren't really conclusive. I made this change after the previous two patches, and while it is conceptually obvious that searching 3 or 4 GIDs is better than searching through 128 mostly-empty GIDs using memcmp, in practice we normally find our result quickly, so the benefit of this change is not seen, at least not with the cmtime application. In addition, the immediately preceding patch optimized the connect routine's selection of which device to search first, again hiding the benefit of this change. Demonstrating the benefit clearly would require a very complex setup, with lots of valid GIDs and connect attempts that access GIDs late in the table, and the cmtime utility from librdmacm does not do that.

Signed-off-by: Doug Ledford <dledford@redhat.com>
---
 drivers/infiniband/core/cache.c | 132 +++++++++++++++++++++++++++++++---------
 1 file changed, 103 insertions(+), 29 deletions(-)