Message ID | 31678a43-f76c-a921-e40c-470b0de1a86c@grimberg.me (mailing list archive) |
---|---|
State | Not Applicable |
On Thu, Mar 16, 2017 at 06:51:16PM +0200, Sagi Grimberg wrote:
>
> > > > > > Sagi,
> > > > > > The release function is placed in global workqueue. I'm not familiar
> > > > > > with NVMe design and I don't know all the details, but maybe the
> > > > > > proper way will
> > > > > > be to create special workqueue with MEM_RECLAIM flag to ensure the
> > > > > > progress?
>
> Leon, the release work makes progress, but it is inherently slower
> than the establishment work and when we are bombarded with
> establishments we have no backpressure...

Sagi,
How do you see that release is slower than alloc? In this specific
test, all queues are empty and QP drains should finish immediately.

If we rely on the prints that Yi posted at the beginning of this thread,
the release function doesn't have enough priority for execution and is
constantly delayed.

>
> > I tried with 4.11.0-rc2, and can still reproduce it in fewer than 2000
> > iterations.
>
> Yi,
>
> Can you try the below (untested) patch:
>
> I'm not at all convinced this is the way to go because it will
> slow down all the connect requests, but I'm curious to know
> if it'll make the issue go away.
>
> --
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index ecc4fe862561..f15fa6e6b640 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1199,6 +1199,9 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id
> *cm_id,
>  	}
>  	queue->port = cm_id->context;
>
> +	/* Let inflight queue teardown complete */
> +	flush_scheduled_work();
> +
>  	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>  	if (ret)
>  		goto release_queue;
> --
>
> Any other good ideas are welcome...

Maybe create a separate workqueue and flush only that one, instead of the
global system queue. It will stress the system a little bit less.

Thanks

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
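The suggestion above (flush a dedicated workqueue rather than the global system queue) can be sketched with a small userspace model. This is a toy illustration only, not nvmet-rdma code: `struct toy_wq`, `toy_queue_work()`, `toy_flush()` and the two `flush_cost_*()` helpers are hypothetical names, and "cost" is simply the number of work items a flush has to wait for. In the kernel, the corresponding calls would presumably be `alloc_workqueue()` (optionally with `WQ_MEM_RECLAIM`, as raised earlier in the thread) plus `flush_workqueue()` on that queue, in place of `flush_scheduled_work()`.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a workqueue: it only counts what is pending on it. */
struct toy_wq {
	size_t pending;
};

/* Queue n work items on wq. */
static void toy_queue_work(struct toy_wq *wq, size_t n)
{
	wq->pending += n;
}

/* "Flush": wait for everything pending on wq; returns items waited for. */
static size_t toy_flush(struct toy_wq *wq)
{
	size_t waited = wq->pending;

	wq->pending = 0;
	assert(wq->pending == 0);
	return waited;
}

/* Release work shares one global queue with unrelated work. */
static size_t flush_cost_shared(size_t release_items, size_t unrelated_items)
{
	struct toy_wq system_wq = { 0 };

	toy_queue_work(&system_wq, release_items);
	toy_queue_work(&system_wq, unrelated_items);
	return toy_flush(&system_wq);
}

/* Release work has its own queue; unrelated work is never waited for. */
static size_t flush_cost_dedicated(size_t release_items, size_t unrelated_items)
{
	struct toy_wq release_wq = { 0 };
	struct toy_wq system_wq = { 0 };

	toy_queue_work(&release_wq, release_items);
	toy_queue_work(&system_wq, unrelated_items);
	return toy_flush(&release_wq);
}
```

With 16 pending release items and 100 unrelated items, the shared variant waits for all 116 items while the dedicated variant waits only for the 16 release items, which is the sense in which a separate queue would "stress the system a little bit less".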
I retested this issue on 4.11.0. The OOM issue can no longer be reproduced in
the same environment [1] with the test script [2]; I'm not sure which patch
fixed it. Eventually, reset_controller failed [3].

[1]
memory: 32GB
CPU: Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
Card: 07:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

[2]
#!/bin/bash

num=0
while [ 1 ]
do
	echo "-------------------------------$num"
	echo 1 >/sys/block/nvme0n1/device/reset_controller || exit 1
	((num++))
	sleep 0.1
done

[3]
-------------------------------897
reset_controller.sh: line 7: /sys/block/nvme0n1/device/reset_controller: No such file or directory

Log from client:
[ 2373.319860] nvme nvme0: creating 16 I/O queues.
[ 2374.214380] nvme nvme0: creating 16 I/O queues.
[ 2375.092755] nvme nvme0: creating 16 I/O queues.
[ 2375.988591] nvme nvme0: creating 16 I/O queues.
[ 2376.874315] nvme nvme0: creating 16 I/O queues.
[ 2384.604400] nvme nvme0: rdma_resolve_addr wait failed (-110).
[ 2384.636329] nvme nvme0: Removing after reset failure

Best Regards,
  Yi Zhang

----- Original Message -----
From: "Leon Romanovsky" <leon@kernel.org>
To: "Sagi Grimberg" <sagi@grimberg.me>
Cc: linux-rdma@vger.kernel.org, "Max Gurtovoy" <maxg@mellanox.com>, "Christoph Hellwig" <hch@lst.de>, linux-nvme@lists.infradead.org, "Yi Zhang" <yizhan@redhat.com>
Sent: Sunday, March 19, 2017 3:01:15 PM
Subject: Re: mlx4_core 0000:07:00.0: swiotlb buffer is full and OOM observed during stress test on reset_controller
Finally found the below patch [1] that fixed this issue.
With [1], I can see the speed of the reset_controller operation [2] is obviously slower than before.

[1]
commit b7363e67b23e04c23c2a99437feefac7292a88bc
Author: Sagi Grimberg <sagi@grimberg.me>
Date:   Wed Mar 8 22:03:17 2017 +0200

    IB/device: Convert ib-comp-wq to be CPU-bound

[2]
echo 1 >/sys/block/nvme0n1/device/reset_controller

Best Regards,
  Yi Zhang

----- Original Message -----
From: "Yi Zhang" <yizhan@redhat.com>
To: "Leon Romanovsky" <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org, "Max Gurtovoy" <maxg@mellanox.com>, "Sagi Grimberg" <sagi@grimberg.me>, linux-nvme@lists.infradead.org, "Christoph Hellwig" <hch@lst.de>
Sent: Friday, May 19, 2017 1:01:59 AM
Subject: Re: mlx4_core 0000:07:00.0: swiotlb buffer is full and OOM observed during stress test on reset_controller
Hi Yi,

> Finally found the below patch [1] that fixed this issue.
> With [1], I can see the speed of the reset_controller operation [2] is obviously slower than before.
>
>
> [1]
> commit b7363e67b23e04c23c2a99437feefac7292a88bc
> Author: Sagi Grimberg <sagi@grimberg.me>
> Date:   Wed Mar 8 22:03:17 2017 +0200
>
>     IB/device: Convert ib-comp-wq to be CPU-bound

This is very unlikely.

I think that what made this go away is:

commit 777dc82395de6e04b3a5fedcf153eb99bf5f1241
Author: Sagi Grimberg <sagi@grimberg.me>
Date:   Tue Mar 21 16:29:49 2017 +0200

    nvmet-rdma: occasionally flush ongoing controller teardown

    If we are attacked with establishments/teardowns we need to
    make sure we do not consume too much system memory. Thus
    let ongoing controller teardowns complete before accepting
    new controller establishments.

Cheers,
Sagi.
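The reasoning in the commit message above (teardown falling behind establishment unless the connect path waits for it) can be illustrated with a toy userspace model. This is only a sketch under stated assumptions, not the nvmet-rdma implementation: `peak_pending_releases()` and its parameters are hypothetical, time is modeled in discrete ticks, each new connect hands one old queue to a background release worker, and a flush is modeled as waiting until every pending release has completed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical, for illustration only).
 *
 * Each tick, connects_per_tick queues are established; every establishment
 * leaves one old queue waiting for its release work to run. A background
 * worker completes at most releases_per_tick release items per tick.
 * If flush_before_accept is nonzero, each connect first waits for all
 * pending release work, which is the backpressure the flush provides.
 *
 * Returns the peak number of queues whose resources were still held.
 */
static size_t peak_pending_releases(size_t ticks, size_t connects_per_tick,
				    size_t releases_per_tick,
				    int flush_before_accept)
{
	size_t pending = 0, peak = 0;

	for (size_t t = 0; t < ticks; t++) {
		for (size_t c = 0; c < connects_per_tick; c++) {
			if (flush_before_accept)
				pending = 0;	/* connect blocks until teardown is done */
			pending++;		/* old queue handed to the release worker */
			if (pending > peak)
				peak = pending;
		}
		/* The background worker drains what it can during this tick. */
		pending -= (pending < releases_per_tick) ? pending
							 : releases_per_tick;
	}
	return peak;
}
```

In this model, with 4 connects and 1 completed release per tick, the backlog grows by 3 queues every tick when nothing waits for teardown, and stays at 1 when each connect flushes first. The cost is the one noted earlier in the thread: every connect request is slowed down by the wait.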
On 06/04/2017 11:49 PM, Sagi Grimberg wrote:
> Hi Yi,
>
>> Finally found the below patch [1] that fixed this issue.
>> With [1], I can see the speed of the reset_controller operation [2] is
>> obviously slower than before.
>>
>>
>> [1]
>> commit b7363e67b23e04c23c2a99437feefac7292a88bc
>> Author: Sagi Grimberg <sagi@grimberg.me>
>> Date:   Wed Mar 8 22:03:17 2017 +0200
>>
>>     IB/device: Convert ib-comp-wq to be CPU-bound
>
> This is very unlikely.
>
> I think that what made this go away is:
>
> commit 777dc82395de6e04b3a5fedcf153eb99bf5f1241
> Author: Sagi Grimberg <sagi@grimberg.me>
> Date:   Tue Mar 21 16:29:49 2017 +0200
>
>     nvmet-rdma: occasionally flush ongoing controller teardown
>
>     If we are attacked with establishments/teardowns we need to
>     make sure we do not consume too much system memory. Thus
>     let ongoing controller teardowns complete before accepting
>     new controller establishments.
>
Hi Sagi,
This patch fixed the issue, thanks again.

Yi
>
> Cheers,
> Sagi.
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index ecc4fe862561..f15fa6e6b640 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1199,6 +1199,9 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 	}
 	queue->port = cm_id->context;
 
+	/* Let inflight queue teardown complete */
+	flush_scheduled_work();
+
 	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
 	if (ret)