Message ID | 20180130033436.GA17053@ziepe.ca (mailing list archive)
---|---
State | Not Applicable
On Mon, Jan 29, 2018 at 08:34:36PM -0700, Jason Gunthorpe wrote:
> On Mon, Jan 29, 2018 at 03:11:53PM -0500, Doug Ledford wrote:
> > On Sun, 2018-01-28 at 14:05 -0700, Jason Gunthorpe wrote:
> > > On Sun, Jan 28, 2018 at 11:17:17AM +0200, Leon Romanovsky wrote:
> > > >
> > > > The original goal of this series was to allow viewing connection
> > > > (QP) information about running processes; however, I used this
> > > > opportunity and created common infrastructure to track and report
> > > > various resources. The report part is implemented in netlink
> > > > (nldev), but smart ULPs can now create advanced usage models based
> > > > on device utilization.
> > > >
> > > > The current implementation relies on one lock per-object per-device,
> > > > so creation/destruction of various objects (CQ, PD, etc.) on various
> > > > or the same devices doesn't interfere with one another.
> > > >
> > > > The data protection is performed with SRCU, and its reader-writer
> > > > model ensures that a resource won't be destroyed until readers
> > > > finish their work.
> > >
> > > Well, this cover letter isn't quite right anymore.. but no matter.
> > >
> > > My small comments aside it looks OK to me.
> >
> > Likewise. I'm happy with it at this point.
>
> Okay, I fixed up the small things and applied the patches to for-next.
>
> Leon: Please validate I didn't screw it up. Here is the diff against
> what you sent:
>
> - Success path on the main execution flow, not under an if
> - Constify static structure
> - Remove confusing comment about locking; ib_enum_all_devs obtains
>   locks to iterate its list and rdma_restrack_count holds res->rwsem,
>   so everything is accounted for directly without trickiness
> - Spelling
> - Remove extra lock in rdma_restrack_del
> - Restore pd = NULL in ib_create_xrc_qp. This scared me a bit; xrc is
>   wonky. But ib_create_xrc_qp is only called in cases where
>   rdma_restrack_add is not called, so keeping things as-they-are should
>   not impact restrack. If restrack needs the pd for an XRC someday it
>   should get it from qp->real_qp
> - Remove SET/NEW/DEL cargo cult; please send a patch for the rest?
>
> Branch is here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
>
> Still unhappy with the kref-as-not-a-kref.

Thanks Doug and Jason for accepting it.

The "qp->pd = NULL" assignment is not needed. PD is NULL
for the XRCD, and you are setting "qp->pd = pd" before returning
from the _ib_create_qp() call, so it is actually the same.

Everything works as expected; it passed my tests with ud/rc/xsrq
pingpongs and shutdowns during traffic.

Steve, I created the stable tag for you: rdma-next-2018-01-30

Thanks
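The SRCU scheme the cover letter describes follows the usual kernel
read-mostly pattern; a minimal sketch with hypothetical names, not the
restrack code itself: readers enter an SRCU read-side section, and the
destroy path waits in synchronize_srcu() until every reader that might
still see the resource has finished.

#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/srcu.h>

DEFINE_STATIC_SRCU(res_srcu);			/* hypothetical SRCU domain */

struct tracked_res {
	int id;					/* placeholder payload */
};

static struct tracked_res __rcu *res;		/* published resource pointer */

static void reader(void)
{
	int idx = srcu_read_lock(&res_srcu);
	struct tracked_res *r = srcu_dereference(res, &res_srcu);

	if (r)					/* cannot be freed while the srcu lock is held */
		pr_info("res id %d\n", r->id);
	srcu_read_unlock(&res_srcu, idx);
}

static void destroyer(void)
{
	struct tracked_res *old = rcu_dereference_protected(res, 1);

	rcu_assign_pointer(res, NULL);		/* unpublish */
	synchronize_srcu(&res_srcu);		/* wait for in-flight readers */
	kfree(old);				/* no reader can still see it */
}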
> > Branch is here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> >
> > Still unhappy with the kref-as-not-a-kref.
>
> Thanks Doug and Jason for accepting it.
>
> The "qp->pd = NULL" assignment is not needed. PD is NULL
> for the XRCD, and you are setting "qp->pd = pd" before returning
> from the _ib_create_qp() call, so it is actually the same.
>
> Everything works as expected; it passed my tests with ud/rc/xsrq
> pingpongs and shutdowns during traffic.
>
> Steve, I created the stable tag for you: rdma-next-2018-01-30
>
> Thanks

Thanks guys. Q: 4.15-rc2 kills my machine, so I've been rebasing to a later rc
to continue my development work, but the rebase fails horribly on mlx5. I
skip these failures and don't compile mlx5 because I don't need it.

Anyway, will rdma-next be rebased on 4.15.0 soon? Or does that wait until
-rc2...
On Tue, Jan 30, 2018 at 09:21:30AM -0600, Steve Wise wrote:
> > > Branch is here:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/jgg-for-next
> > >
> > > Still unhappy with the kref-as-not-a-kref.
> >
> > Thanks Doug and Jason for accepting it.
> >
> > The "qp->pd = NULL" assignment is not needed. PD is NULL
> > for the XRCD, and you are setting "qp->pd = pd" before returning
> > from the _ib_create_qp() call, so it is actually the same.
> >
> > Everything works as expected; it passed my tests with ud/rc/xsrq
> > pingpongs and shutdowns during traffic.
> >
> > Steve, I created the stable tag for you: rdma-next-2018-01-30
> >
> > Thanks
>
> Thanks guys. Q: 4.15-rc2 kills my machine, so I've been rebasing to a later rc
> to continue my development work, but the rebase fails horribly on mlx5. I
> skip these failures and don't compile mlx5 because I don't need it.
>
> Anyway, will rdma-next be rebased on 4.15.0 soon? Or does that wait until
> -rc2...

It is annoying for all concerned, but Linus has asked us not to have
needless merges. Hopefully we can build enough trust to relax that
down the road, as other trees do.

I have made a merge to v4.15 here:

https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/for-linus-merged

Jason
> > Thanks guys. Q: 4.15-rc2 kills my machine, so I've been rebasing to a later rc
> > to continue my development work, but the rebase fails horribly on mlx5. I
> > skip these failures and don't compile mlx5 because I don't need it.
> >
> > Anyway, will rdma-next be rebased on 4.15.0 soon? Or does that wait until
> > -rc2...
>
> It is annoying for all concerned, but Linus has asked us not to have
> needless merges. Hopefully we can build enough trust to relax that
> down the road, as other trees do.
>
> I have made a merge to v4.15 here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/for-linus-merged

What is this a merge of exactly? I don't see the restrack stuff, for
instance.

Thanks,

Steve.
On Tue, Jan 30, 2018 at 10:16:01AM -0600, Steve Wise wrote:
> What is this a merge of exactly? I don't see the restrack stuff, for
> instance.

Yesterday's for-next. You could merge it with the latest for-next..

I updated it.

I think we are done now, so for-next is what will be sent as the
pull-request and for-next-merged is the conflict resolution.

Jason
On Tue, 2018-01-30 at 09:33 -0700, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2018 at 10:16:01AM -0600, Steve Wise wrote:
>
> > What is this a merge of exactly? I don't see the restrack stuff, for
> > instance.
>
> Yesterday's for-next. You could merge it with the latest for-next..
>
> I updated it.
>
> I think we are done now, so for-next is what will be sent as the
> pull-request and for-next-merged is the conflict resolution.

Hello Jason,

Although I have not yet tried to root-cause this, I want to let you know
that with your for-linus-merged branch the following error message is
reported if I try to run the srp-test software against the rdma_rxe driver:

id_ext=0x505400fffe4a0b7b,ioc_guid=0x505400fffe4a0b7b,dest=192.168.122.76:5555,target_can_queue=1,queue_size=32,max_cmd_per_lun=32,max_sect=131072 >/sys/class/infiniband_srp/srp-rxe0-1/add_target failed: Cannot allocate memory

In the kernel log I found the following:

Jan 30 10:55:50 ubuntu-vm kernel: scsi host4: ib_srp: FR pool allocation failed (-12)

With your for-next branch from a few days ago the same test ran fine.

Thanks,

Bart.
On Tue, Jan 30, 2018 at 07:07:35PM +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 09:33 -0700, Jason Gunthorpe wrote:
> > On Tue, Jan 30, 2018 at 10:16:01AM -0600, Steve Wise wrote:
> >
> > > What is this a merge of exactly? I don't see the restrack stuff, for
> > > instance.
> >
> > Yesterday's for-next. You could merge it with the latest for-next..
> >
> > I updated it.
> >
> > I think we are done now, so for-next is what will be sent as the
> > pull-request and for-next-merged is the conflict resolution.
>
> Hello Jason,
>
> Although I have not yet tried to root-cause this, I want to let you know
> that with your for-linus-merged branch the following error message is
> reported if I try to run the srp-test software against the rdma_rxe driver:
>
> id_ext=0x505400fffe4a0b7b,ioc_guid=0x505400fffe4a0b7b,dest=192.168.122.76:5555,target_can_queue=1,queue_size=32,max_cmd_per_lun=32,max_sect=131072 >/sys/class/infiniband_srp/srp-rxe0-1/add_target failed: Cannot allocate memory
>
> In the kernel log I found the following:
>
> Jan 30 10:55:50 ubuntu-vm kernel: scsi host4: ib_srp: FR pool allocation failed (-12)
>
> With your for-next branch from a few days ago the same test ran fine.

I don't have a guess for you..

The difference between for-next and merged is only the inclusion of
v4.15? Could some v4.15 non-rdma code be causing an issue here?

Jason
On Tue, 2018-01-30 at 12:46 -0700, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2018 at 07:07:35PM +0000, Bart Van Assche wrote:
> > On Tue, 2018-01-30 at 09:33 -0700, Jason Gunthorpe wrote:
> > > On Tue, Jan 30, 2018 at 10:16:01AM -0600, Steve Wise wrote:
> > >
> > > > What is this a merge of exactly? I don't see the restrack stuff, for
> > > > instance.
> > >
> > > Yesterday's for-next. You could merge it with the latest for-next..
> > >
> > > I updated it.
> > >
> > > I think we are done now, so for-next is what will be sent as the
> > > pull-request and for-next-merged is the conflict resolution.
> >
> > Hello Jason,
> >
> > Although I have not yet tried to root-cause this, I want to let you know
> > that with your for-linus-merged branch the following error message is
> > reported if I try to run the srp-test software against the rdma_rxe driver:
> >
> > id_ext=0x505400fffe4a0b7b,ioc_guid=0x505400fffe4a0b7b,dest=192.168.122.76:5555,target_can_queue=1,queue_size=32,max_cmd_per_lun=32,max_sect=131072 >/sys/class/infiniband_srp/srp-rxe0-1/add_target failed: Cannot allocate memory
> >
> > In the kernel log I found the following:
> >
> > Jan 30 10:55:50 ubuntu-vm kernel: scsi host4: ib_srp: FR pool allocation failed (-12)
> >
> > With your for-next branch from a few days ago the same test ran fine.
>
> I don't have a guess for you..
>
> The difference between for-next and merged is only the inclusion of
> v4.15? Could some v4.15 non-rdma code be causing an issue here?

Hello Jason,

I should have mentioned that in the previous tests I ran, I merged kernel
v4.15-rc9 myself into the RDMA for-next branch. So this behavior was probably
introduced by a patch that was queued recently on the RDMA for-next branch,
e.g. RDMA resource tracking.

Bart.
On Tue, Jan 30, 2018 at 08:42:44PM +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 12:46 -0700, Jason Gunthorpe wrote:
> > On Tue, Jan 30, 2018 at 07:07:35PM +0000, Bart Van Assche wrote:
> > > On Tue, 2018-01-30 at 09:33 -0700, Jason Gunthorpe wrote:
> > > > On Tue, Jan 30, 2018 at 10:16:01AM -0600, Steve Wise wrote:
> > > >
> > > > > What is this a merge of exactly? I don't see the restrack stuff, for
> > > > > instance.
> > > >
> > > > Yesterday's for-next. You could merge it with the latest for-next..
> > > >
> > > > I updated it.
> > > >
> > > > I think we are done now, so for-next is what will be sent as the
> > > > pull-request and for-next-merged is the conflict resolution.
> > >
> > > Hello Jason,
> > >
> > > Although I have not yet tried to root-cause this, I want to let you know
> > > that with your for-linus-merged branch the following error message is
> > > reported if I try to run the srp-test software against the rdma_rxe driver:
> > >
> > > id_ext=0x505400fffe4a0b7b,ioc_guid=0x505400fffe4a0b7b,dest=192.168.122.76:5555,target_can_queue=1,queue_size=32,max_cmd_per_lun=32,max_sect=131072 >/sys/class/infiniband_srp/srp-rxe0-1/add_target failed: Cannot allocate memory
> > >
> > > In the kernel log I found the following:
> > >
> > > Jan 30 10:55:50 ubuntu-vm kernel: scsi host4: ib_srp: FR pool allocation failed (-12)
> > >
> > > With your for-next branch from a few days ago the same test ran fine.
> >
> > I don't have a guess for you..
> >
> > The difference between for-next and merged is only the inclusion of
> > v4.15? Could some v4.15 non-rdma code be causing an issue here?
>
> Hello Jason,
>
> I should have mentioned that in the previous tests I ran, I merged kernel
> v4.15-rc9 myself into the RDMA for-next branch. So this behavior was probably
> introduced by a patch that was queued recently on the RDMA for-next branch,
> e.g. RDMA resource tracking.

Ok, I think that is the only likely thing recently..

But your print above must be caused by this line, right:

static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
					      struct ib_pd *pd, int pool_size,
					      int max_page_list_len)
{
	ret = -ENOMEM;
	pool = kzalloc(sizeof(struct srp_fr_pool) +
		       pool_size * sizeof(struct srp_fr_desc), GFP_KERNEL);
	if (!pool)
		goto err;

Since you didn't report the ib_alloc_mr() print, it can't be the other
ENOMEM case?

Hard to see how that intersects with resource tracking.. Are you
thinking memory corruption?

Jason
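For context, the "other ENOMEM case" is the per-descriptor MR allocation
loop later in the same function; a simplified sketch, reconstructed for
illustration (field names such as pool->size and the destroy_pool label
are assumptions, not the exact ib_srp source), that matches the message
Bart later finds in his log:

	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
		mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, max_page_list_len);
		if (IS_ERR(mr)) {
			ret = PTR_ERR(mr);
			if (ret == -ENOMEM)
				/* the print Bart reports further down-thread */
				pr_info("%s: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count\n",
					dev_name(&device->dev));
			goto destroy_pool;
		}
		d->mr = mr;
		list_add_tail(&d->entry, &pool->free_list);
	}

Every fast-registration descriptor needs its own MR from the device, so
this loop fails as soon as the provider refuses to allocate another MR.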
On Tue, 2018-01-30 at 13:48 -0700, Jason Gunthorpe wrote:
> Ok, I think that is the only likely thing recently..
>
> But your print above must be caused by this line, right:
>
> static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
> 					      struct ib_pd *pd, int pool_size,
> 					      int max_page_list_len)
> {
> 	ret = -ENOMEM;
> 	pool = kzalloc(sizeof(struct srp_fr_pool) +
> 		       pool_size * sizeof(struct srp_fr_desc), GFP_KERNEL);
> 	if (!pool)
> 		goto err;
>
> Since you didn't report the ib_alloc_mr() print, it can't be the other
> ENOMEM case?
>
> Hard to see how that intersects with resource tracking.. Are you
> thinking memory corruption?

Hello Jason,

I don't see any reason to suspect memory corruption. kmemleak isn't reporting
any memory leaks. Maybe memory fragmentation has increased?

Thanks,

Bart.
On Tue, 2018-01-30 at 21:22 +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 13:48 -0700, Jason Gunthorpe wrote:
> > Ok, I think that is the only likely thing recently..
> >
> > But your print above must be caused by this line, right:
> >
> > static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
> > 					      struct ib_pd *pd, int pool_size,
> > 					      int max_page_list_len)
> > {
> > 	ret = -ENOMEM;
> > 	pool = kzalloc(sizeof(struct srp_fr_pool) +
> > 		       pool_size * sizeof(struct srp_fr_desc), GFP_KERNEL);
> > 	if (!pool)
> > 		goto err;
> >
> > Since you didn't report the ib_alloc_mr() print, it can't be the other
> > ENOMEM case?
> >
> > Hard to see how that intersects with resource tracking.. Are you
> > thinking memory corruption?
>
> Hello Jason,
>
> I don't see any reason to suspect memory corruption. kmemleak isn't reporting
> any memory leaks. Maybe memory fragmentation has increased?
>
> Thanks,
>
> Bart.

Hi Bart,

Can I take your tree and see if this fails for me too? Your last tree
was fine, so it did not have this latest stuff. Can I just pull to what
I have?

Thanks,

Laurence
On Tue, 2018-01-30 at 16:33 -0500, Laurence Oberman wrote:
> Can I take your tree and see if this fails for me too?
> Your last tree was fine, so it did not have this latest stuff.
> Can I just pull to what I have?

Hello Laurence,

So far I have seen this behavior only inside a VM but not yet on a system
with more memory than the VM. This issue may be specific to the memory size
of the VM. I think we should try to isolate further what caused this before
trying to reproduce it on more setups.

Thanks,

Bart.
On Tue, Jan 30, 2018 at 04:33:19PM -0500, Laurence Oberman wrote:
> > > Since you didn't report the ib_alloc_mr() print, it can't be the other
> > > ENOMEM case?
> > >
> > > Hard to see how that intersects with resource tracking.. Are you
> > > thinking memory corruption?
> >
> > Hello Jason,
> >
> > I don't see any reason to suspect memory corruption. kmemleak isn't reporting
> > any memory leaks. Maybe memory fragmentation has increased?
> >
> > Thanks,
> >
> > Bart.
>
> Hi Bart,
>
> Can I take your tree and see if this fails for me too?
> Your last tree was fine, so it did not have this latest stuff.
> Can I just pull to what I have?

Try what we are about to send as a PR:

git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git branch wip/for-linus-merged

https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=wip/for-linus-merged

No guess why a kzalloc would fail. How much is it asking for anyhow? Did
the size get out of control for some reason?

Jason
On Tue, Jan 30, 2018 at 09:40:14PM +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 16:33 -0500, Laurence Oberman wrote:
> > Can I take your tree and see if this fails for me too?
> > Your last tree was fine, so it did not have this latest stuff.
> > Can I just pull to what I have?
>
> Hello Laurence,
>
> So far I have seen this behavior only inside a VM but not yet on a system
> with more memory than the VM. This issue may be specific to the memory size
> of the VM. I think we should try to isolate further what caused this before
> trying to reproduce it on more setups.

Did you get an oops print related to a kzalloc failure?

Or am I wrong and the ENOMEM is coming from someplace else?

Jason
On Tue, 2018-01-30 at 14:42 -0700, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2018 at 09:40:14PM +0000, Bart Van Assche wrote:
> > On Tue, 2018-01-30 at 16:33 -0500, Laurence Oberman wrote:
> > > Can I take your tree and see if this fails for me too?
> > > Your last tree was fine, so it did not have this latest stuff.
> > > Can I just pull to what I have?
> >
> > Hello Laurence,
> >
> > So far I have seen this behavior only inside a VM but not yet on a system
> > with more memory than the VM. This issue may be specific to the memory size
> > of the VM. I think we should try to isolate further what caused this before
> > trying to reproduce it on more setups.
>
> Did you get an oops print related to a kzalloc failure?
>
> Or am I wrong and the ENOMEM is coming from someplace else?

Hello Jason,

I just noticed the following in the system log:

Jan 30 12:53:15 ubuntu-vm kernel: ib_srp: rxe0: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count

So apparently the ib_alloc_mr() call fails sometimes (but not the first
few times it is called).

Thanks,

Bart.
On Tue, Jan 30, 2018 at 09:47:48PM +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 14:42 -0700, Jason Gunthorpe wrote:
> > On Tue, Jan 30, 2018 at 09:40:14PM +0000, Bart Van Assche wrote:
> > > On Tue, 2018-01-30 at 16:33 -0500, Laurence Oberman wrote:
> > > > Can I take your tree and see if this fails for me too?
> > > > Your last tree was fine, so it did not have this latest stuff.
> > > > Can I just pull to what I have?
> > >
> > > Hello Laurence,
> > >
> > > So far I have seen this behavior only inside a VM but not yet on a system
> > > with more memory than the VM. This issue may be specific to the memory size
> > > of the VM. I think we should try to isolate further what caused this before
> > > trying to reproduce it on more setups.
> >
> > Did you get an oops print related to a kzalloc failure?
> >
> > Or am I wrong and the ENOMEM is coming from someplace else?
>
> Hello Jason,
>
> I just noticed the following in the system log:
>
> Jan 30 12:53:15 ubuntu-vm kernel: ib_srp: rxe0: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count
>
> So apparently the ib_alloc_mr() call fails sometimes (but not the first
> few times it is called).

Looks like the only way you can get that without hitting a kzalloc oops
print is if rxe_alloc() fails, and probably here:

	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
		goto out_put_pool;

Suggesting srp hit the max # of MRs in rxe:

	RXE_MAX_MR = 2 * 1024,

Or maybe we are now leaking MRs someplace?

There is nothing accepted recently that mucks with this; I'm still not
seeing even a tenuous connection to any patches in the last few days.

What was accepted in the past week(s) was a bunch of srp stuff though:

$ git diff --stat 052eac6eeb5655c52a490a49f09c55500f868558
 MAINTAINERS                                  |   3 +-
 drivers/infiniband/core/Makefile             |   2 +-
 drivers/infiniband/core/cm.c                 |   6 +-
 drivers/infiniband/core/cma.c                |   2 +-
 drivers/infiniband/core/core_priv.h          |  28 ++
 drivers/infiniband/core/cq.c                 |  16 +-
 drivers/infiniband/core/device.c             |   4 +
 drivers/infiniband/core/nldev.c              | 374 ++++++++++++++
 drivers/infiniband/core/restrack.c           | 164 +++++++
 drivers/infiniband/core/user_mad.c           |   2 +-
 drivers/infiniband/core/uverbs_cmd.c         |   7 +-
 drivers/infiniband/core/uverbs_ioctl.c       |  19 +-
 drivers/infiniband/core/uverbs_std_types.c   |   3 +
 drivers/infiniband/core/verbs.c              |  17 +-
 drivers/infiniband/hw/mlx4/cq.c              |   4 +-
 drivers/infiniband/hw/mlx5/cq.c              |   2 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h         |   4 +-
 drivers/infiniband/hw/mlx5/qp.c              |   5 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c  |   2 +-
 drivers/infiniband/hw/mthca/mthca_user.h     | 112 -----
 drivers/infiniband/hw/qedr/verbs.c           |   6 +-
 drivers/infiniband/hw/qib/qib_keys.c         | 235 ---------
 drivers/infiniband/sw/rxe/Kconfig            |   4 +-
 drivers/infiniband/ulp/iser/iser_initiator.c |  16 +-
 drivers/infiniband/ulp/srp/ib_srp.c          | 723 ++++++++++++++++-----------
 drivers/infiniband/ulp/srp/ib_srp.h          |  43 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c        |   2 -
 include/rdma/ib_verbs.h                      |  39 +-
 include/rdma/restrack.h                      | 157 ++++++
 include/scsi/srp.h                           |  17 +
 include/uapi/rdma/ib_user_verbs.h            |   7 +-
 include/uapi/rdma/rdma_netlink.h             |  49 ++
 lib/kobject.c                                |   2 +
 33 files changed, 1511 insertions(+), 565 deletions(-)

Any chance one of the SRP patches got mishandled somehow??
Jason
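If the pool-cap theory is right, a hypothetical one-off debug patch
against the rxe check quoted above (a sketch against the fields shown
there, not upstream code) would confirm it by distinguishing "hit
RXE_MAX_MR" from a plain kernel out-of-memory condition:

	if (atomic_inc_return(&pool->num_elem) > pool->max_elem) {
		/* log how full the pool is when an allocation is refused */
		pr_warn("rxe: pool exhausted: num_elem=%d max_elem=%d\n",
			atomic_read(&pool->num_elem), pool->max_elem);
		goto out_put_pool;
	}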
On Tue, 2018-01-30 at 15:02 -0700, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2018 at 09:47:48PM +0000, Bart Van Assche wrote:
> > So apparently the ib_alloc_mr() call fails sometimes (but not the first
> > few times it is called).
>
> Looks like the only way you can get that without hitting a kzalloc oops
> print is if rxe_alloc() fails, and probably here:
>
> 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
> 		goto out_put_pool;
>
> Suggesting srp hit the max # of MRs in rxe:
>
> 	RXE_MAX_MR = 2 * 1024,
>
> Or maybe we are now leaking MRs someplace?
>
> There is nothing accepted recently that mucks with this; I'm still not
> seeing even a tenuous connection to any patches in the last few days.

Hello Jason,

Since the number of memory regions allocated by the ib_srp driver depends
on the max_sect parameter in the login string, I will start by checking
whether there have been any recent changes to that parameter in the test
script.

Thanks,

Bart.
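To see why max_sect matters here, some back-of-the-envelope arithmetic --
an assumed model with a hypothetical pages-per-MR value, not ib_srp's
exact sizing logic:

#include <stdio.h>

int main(void)
{
	const unsigned max_sect     = 131072;   /* from the login string: 64 MiB per command */
	const unsigned queue_size   = 32;       /* from the login string */
	const unsigned sect_size    = 512;      /* SCSI sector size in bytes */
	const unsigned page_size    = 4096;
	const unsigned pages_per_mr = 256;      /* hypothetical max_page_list_len */
	const unsigned rxe_max_mr   = 2 * 1024; /* RXE_MAX_MR quoted above */

	/* pages needed per command, then MRs needed to cover those pages */
	unsigned pages_per_cmd = max_sect * sect_size / page_size;
	unsigned mr_per_cmd = (pages_per_cmd + pages_per_mr - 1) / pages_per_mr;
	unsigned pool_size = queue_size * mr_per_cmd;   /* FR pool, per channel */

	printf("mr_per_cmd=%u pool_size=%u (rxe cap %u)\n",
	       mr_per_cmd, pool_size, rxe_max_mr);
	/* prints: mr_per_cmd=64 pool_size=2048 (rxe cap 2048) */
	return 0;
}

Under these assumed numbers a single channel's FR pool already sits at the
rxe cap, so any increase in max_sect (or a second channel) would push
ib_alloc_mr() over the limit.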
diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 3dcacf220e5e7e..c4560d84dfaebd 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -310,21 +310,21 @@ static inline struct ib_qp *_ib_create_qp(struct ib_device *dev,
 	struct ib_qp *qp;
 
 	qp = dev->create_qp(pd, attr, udata);
-	if (!IS_ERR(qp)) {
-		qp->device = dev;
-		qp->pd = pd;
-		/*
-		 * We don't track XRC QPs for now, because they don't have PD
-		 * and more importantly they are created internaly by driver,
-		 * see mlx5 create_dev_resources() as an example.
-		 */
-		if (attr->qp_type < IB_QPT_XRC_INI) {
-			qp->res.type = RDMA_RESTRACK_QP;
-			rdma_restrack_add(&qp->res);
-		} else {
-			qp->res.valid = false;
-		}
-	}
+	if (IS_ERR(qp))
+		return qp;
+
+	qp->device = dev;
+	qp->pd = pd;
+	/*
+	 * We don't track XRC QPs for now, because they don't have PD
+	 * and more importantly they are created internaly by driver,
+	 * see mlx5 create_dev_resources() as an example.
+	 */
+	if (attr->qp_type < IB_QPT_XRC_INI) {
+		qp->res.type = RDMA_RESTRACK_QP;
+		rdma_restrack_add(&qp->res);
+	} else
+		qp->res.valid = false;
 
 	return qp;
 }
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 421d5fa6a81731..fa8655e3b3edfe 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -178,7 +178,7 @@ static int fill_res_info_entry(struct sk_buff *msg,
 
 static int fill_res_info(struct sk_buff *msg, struct ib_device *device)
 {
-	static const char *names[RDMA_RESTRACK_MAX] = {
+	static const char * const names[RDMA_RESTRACK_MAX] = {
 		[RDMA_RESTRACK_PD] = "pd",
 		[RDMA_RESTRACK_CQ] = "cq",
 		[RDMA_RESTRACK_QP] = "qp",
@@ -553,10 +553,6 @@ static int _nldev_res_get_dumpit(struct ib_device *device,
 static int nldev_res_get_dumpit(struct sk_buff *skb,
 				struct netlink_callback *cb)
 {
-	/*
-	 * There is no need to take lock, because
-	 * we are relying on ib_core's lists_rwsem
-	 */
 	return ib_enum_all_devs(_nldev_res_get_dumpit, skb, cb);
 }
 
@@ -627,8 +623,8 @@ static int nldev_res_get_qp_dumpit(struct sk_buff *skb,
 		    (!rdma_is_kernel_res(res) &&
 		     task_active_pid_ns(current) != task_active_pid_ns(res->task)))
 			/*
-			 * 1. Kernel QPs should be visible in init namsapce only
-			 * 2. Preent only QPs visible in the current namespace
+			 * 1. Kernel QPs should be visible in init namspace only
+			 * 2. Present only QPs visible in the current namespace
 			 */
 			goto next;
diff --git a/drivers/infiniband/core/restrack.c b/drivers/infiniband/core/restrack.c
index 351b6940f6dc17..857637bf46da27 100644
--- a/drivers/infiniband/core/restrack.c
+++ b/drivers/infiniband/core/restrack.c
@@ -150,9 +150,7 @@ void rdma_restrack_del(struct rdma_restrack_entry *res)
 	if (!dev)
 		return;
 
-	down_read(&dev->res.rwsem);
 	rdma_restrack_put(res);
-	up_read(&dev->res.rwsem);
 
 	wait_for_completion(&res->comp);
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a98a3e8412f810..16ebc6372c31ab 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -849,6 +849,7 @@ static struct ib_qp *ib_create_xrc_qp(struct ib_qp *qp,
 
 	qp->event_handler = __ib_shared_qp_event_handler;
 	qp->qp_context = qp;
+	qp->pd = NULL;
 	qp->send_cq = qp->recv_cq = NULL;
 	qp->srq = NULL;
 	qp->xrcd = qp_init_attr->xrcd;
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 23bef401598208..17e59bec169ec0 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -237,14 +237,8 @@ enum rdma_nldev_command {
 	RDMA_NLDEV_CMD_PORT_DEL,
 
 	RDMA_NLDEV_CMD_RES_GET, /* can dump */
-	RDMA_NLDEV_CMD_RES_SET,
-	RDMA_NLDEV_CMD_RES_NEW,
-	RDMA_NLDEV_CMD_RES_DEL,
 
 	RDMA_NLDEV_CMD_RES_QP_GET, /* can dump */
-	RDMA_NLDEV_CMD_RES_QP_SET,
-	RDMA_NLDEV_CMD_RES_QP_NEW,
-	RDMA_NLDEV_CMD_RES_QP_DEL,
 
 	RDMA_NLDEV_NUM_OPS
 };