Message ID | 71af2770c737b936f7b10f457f0ef303ffcf7ad7.1632644527.git.leonro@nvidia.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Jason Gunthorpe |
Series | [rdma-next] RDMA/mlx5: Avoid taking MRs from larger MR cache pools when a pool is empty |
On Sun, Sep 26, 2021 at 11:31:43AM +0300, Leon Romanovsky wrote:
> From: Aharon Landau <aharonl@nvidia.com>
>
> Currently, if a cache entry is empty, the driver will try to take MRs
> from larger cache entries. This behavior consumes a lot of memory.
> In addition, when searching for an mkey in an entry, the entry is locked.
> When using a multithreaded application with the old behavior, the threads
> will block each other more often, which can hurt performance as can be
> seen in the table below.
>
> Therefore, avoid it by creating a new mkey when the requested cache entry
> is empty.
>
> The test was performed on a machine with
> Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 44 cores.
>
> Here are the time measures for allocating MRs of 2^6 pages. The search in
> the cache started from entry 6.
>
> +------------+---------------------+---------------------+
> |            |     Old behavior    |     New behavior    |
> |            +----------+----------+----------+----------+
> |            | 1 thread | 5 thread | 1 thread | 5 thread |
> +============+==========+==========+==========+==========+
> |  1,000 MRs |   14 ms  |   30 ms  |   14 ms  |   80 ms  |
> +------------+----------+----------+----------+----------+
> | 10,000 MRs |  135 ms  |   6 sec  |  173 ms  |  880 ms  |
> +------------+----------+----------+----------+----------+
> |100,000 MRs | 11.2 sec |  57 sec  | 1.74 sec |  8.8 sec |
> +------------+----------+----------+----------+----------+
>
> Signed-off-by: Aharon Landau <aharonl@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/infiniband/hw/mlx5/mr.c | 26 +++++++++-----------------
>  1 file changed, 9 insertions(+), 17 deletions(-)

I'm surprised the cost is so high; I assume this has a lot to do with
repeated calls to queue_adjust_cache_locked()? Maybe this should be
further investigated?

Anyhow, applied to for-next, thanks

Jason
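For a rough sense of the memory cost the commit message refers to, the sketch below compares the MR size a 2^6-page request actually needs with the size held when the request is served from a larger pool. It assumes, purely for illustration, that cache entry k holds MRs sized for 2^k pages of 4 KiB each; the exact sizing in the driver may differ.

```c
/*
 * Illustration only: assumes entry k <-> MRs covering 2^k pages of 4 KiB.
 * Shows how much larger the held MR is when a 2^6-page request is
 * satisfied from a bigger cache entry instead of its own.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;   /* assumed 4 KiB pages */
	const unsigned int requested_order = 6; /* "entry 6" in the test */

	for (unsigned int entry = requested_order; entry <= 10; entry++) {
		unsigned long needed = (1UL << requested_order) * page_size;
		unsigned long held = (1UL << entry) * page_size;

		printf("entry %2u: holds %8lu KiB for a %lu KiB request (%lux)\n",
		       entry, held / 1024, needed / 1024, held / needed);
	}
	return 0;
}
```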
On Mon, Oct 04, 2021 at 08:00:03PM -0300, Jason Gunthorpe wrote:
> On Sun, Sep 26, 2021 at 11:31:43AM +0300, Leon Romanovsky wrote:
> > From: Aharon Landau <aharonl@nvidia.com>
> >
> > Currently, if a cache entry is empty, the driver will try to take MRs
> > from larger cache entries. This behavior consumes a lot of memory.
> > In addition, when searching for an mkey in an entry, the entry is locked.
> > When using a multithreaded application with the old behavior, the threads
> > will block each other more often, which can hurt performance as can be
> > seen in the table below.
> >
> > Therefore, avoid it by creating a new mkey when the requested cache entry
> > is empty.
> >
> > The test was performed on a machine with
> > Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 44 cores.
> >
> > Here are the time measures for allocating MRs of 2^6 pages. The search in
> > the cache started from entry 6.
> >
> > +------------+---------------------+---------------------+
> > |            |     Old behavior    |     New behavior    |
> > |            +----------+----------+----------+----------+
> > |            | 1 thread | 5 thread | 1 thread | 5 thread |
> > +============+==========+==========+==========+==========+
> > |  1,000 MRs |   14 ms  |   30 ms  |   14 ms  |   80 ms  |
> > +------------+----------+----------+----------+----------+
> > | 10,000 MRs |  135 ms  |   6 sec  |  173 ms  |  880 ms  |
> > +------------+----------+----------+----------+----------+
> > |100,000 MRs | 11.2 sec |  57 sec  | 1.74 sec |  8.8 sec |
> > +------------+----------+----------+----------+----------+
> >
> > Signed-off-by: Aharon Landau <aharonl@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/infiniband/hw/mlx5/mr.c | 26 +++++++++-----------------
> >  1 file changed, 9 insertions(+), 17 deletions(-)
>
> I'm surprised the cost is so high; I assume this has a lot to do with
> repeated calls to queue_adjust_cache_locked()? Maybe this should be
> further investigated?

I don't think so; most of the overhead comes from the entry lock, which
effectively blocks any change to that shared entry.

>
> Anyhow, applied to for-next, thanks

Thanks

>
> Jason
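The contention argument can be modelled outside the kernel. The sketch below is not driver code: it uses pthread mutexes in place of the per-entry spinlock, and the entry, thread, and iteration counts are arbitrary assumptions. It only illustrates how walking several locked entries per allocation ("old" behavior) serializes threads more than touching a single entry ("new" behavior).

```c
/* Userspace contention model: gcc -O2 -pthread model.c
 * Run with "./a.out 1" for the old walk-all-larger-pools behavior,
 * "./a.out 0" for the new single-entry behavior. All entries are kept
 * empty so every allocation only exercises the locks.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_ENTRIES 16
#define NUM_THREADS 5
#define ALLOCS_PER_THREAD 100000

struct cache_ent {
	pthread_mutex_t lock;
	int available;          /* stays 0: every lookup is a miss */
};

static struct cache_ent cache[NUM_ENTRIES];
static int walk_larger_pools;   /* 1 = old behavior, 0 = new behavior */

static int alloc_one(int req)
{
	int last = walk_larger_pools ? NUM_ENTRIES : req + 1;

	for (int i = req; i < last; i++) {
		pthread_mutex_lock(&cache[i].lock);
		if (cache[i].available) {
			cache[i].available--;
			pthread_mutex_unlock(&cache[i].lock);
			return 1;       /* cache hit */
		}
		pthread_mutex_unlock(&cache[i].lock);
	}
	return 0;                       /* miss: caller creates a new mkey */
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ALLOCS_PER_THREAD; i++)
		alloc_one(6);           /* "search started from entry 6" */
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t threads[NUM_THREADS];
	struct timespec t0, t1;

	walk_larger_pools = (argc > 1) && atoi(argv[1]);
	for (int i = 0; i < NUM_ENTRIES; i++)
		pthread_mutex_init(&cache[i].lock, NULL);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NUM_THREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < NUM_THREADS; i++)
		pthread_join(threads[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%s behavior: %.3f s\n", walk_larger_pools ? "old" : "new",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}
```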
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 3be36ebbf67a..b4d2322e9ca5 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -600,29 +600,21 @@ struct mlx5_ib_mr *mlx5_mr_cache_alloc(struct mlx5_ib_dev *dev,
 /* Return a MR already available in the cache */
 static struct mlx5_ib_mr *get_cache_mr(struct mlx5_cache_ent *req_ent)
 {
-        struct mlx5_ib_dev *dev = req_ent->dev;
         struct mlx5_ib_mr *mr = NULL;
         struct mlx5_cache_ent *ent = req_ent;
 
-        /* Try larger MR pools from the cache to satisfy the allocation */
-        for (; ent != &dev->cache.ent[MR_CACHE_LAST_STD_ENTRY + 1]; ent++) {
-                mlx5_ib_dbg(dev, "order %u, cache index %zu\n", ent->order,
-                            ent - dev->cache.ent);
-
-                spin_lock_irq(&ent->lock);
-                if (!list_empty(&ent->head)) {
-                        mr = list_first_entry(&ent->head, struct mlx5_ib_mr,
-                                              list);
-                        list_del(&mr->list);
-                        ent->available_mrs--;
-                        queue_adjust_cache_locked(ent);
-                        spin_unlock_irq(&ent->lock);
-                        mlx5_clear_mr(mr);
-                        return mr;
-                }
+        spin_lock_irq(&ent->lock);
+        if (!list_empty(&ent->head)) {
+                mr = list_first_entry(&ent->head, struct mlx5_ib_mr, list);
+                list_del(&mr->list);
+                ent->available_mrs--;
                 queue_adjust_cache_locked(ent);
                 spin_unlock_irq(&ent->lock);
+                mlx5_clear_mr(mr);
+                return mr;
         }
+        queue_adjust_cache_locked(ent);
+        spin_unlock_irq(&ent->lock);
         req_ent->miss++;
         return NULL;
 }
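The diff only changes the lookup side; the fallback to creating a fresh mkey happens in the caller once get_cache_mr() returns NULL and the entry's miss counter has been bumped. The sketch below is a simplified userspace model of that flow, not mlx5 code: names such as create_new_mkey() are hypothetical stand-ins for the driver's real mkey-creation path, and a pthread mutex stands in for the entry spinlock.

```c
/* Simplified model of the new flow: check only the requested entry,
 * count a miss when it is empty, and let the caller allocate fresh.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct mr {
	struct mr *next;
};

struct cache_ent {
	pthread_mutex_t lock;
	struct mr *head;        /* free list of cached MRs */
	unsigned long miss;     /* lookups that found the entry empty */
};

/* Take an MR from the requested entry only; NULL means "cache miss". */
static struct mr *get_cache_mr(struct cache_ent *ent)
{
	struct mr *mr = NULL;

	pthread_mutex_lock(&ent->lock);
	if (ent->head) {
		mr = ent->head;
		ent->head = mr->next;
	} else {
		ent->miss++;
	}
	pthread_mutex_unlock(&ent->lock);
	return mr;
}

/* Hypothetical fallback: allocate a brand new MR instead of borrowing
 * one from a larger cache entry. */
static struct mr *create_new_mkey(void)
{
	return calloc(1, sizeof(struct mr));
}

int main(void)
{
	struct cache_ent ent = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct mr *mr = get_cache_mr(&ent);

	if (!mr)
		mr = create_new_mkey();

	printf("misses: %lu, mr=%p\n", ent.miss, (void *)mr);
	free(mr);
	return 0;
}
```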