
[v4,1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.

Message ID 20230728170749.1888588-1-weh@microsoft.com (mailing list archive)
State Changes Requested
Series [v4,1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib driver.

Commit Message

Wei Hu July 28, 2023, 5:07 p.m. UTC
Add EQ interrupt support to the mana ib driver. Allocate EQs per ucontext
to receive interrupts. Attach an EQ when a CQ is created, and call the CQ
interrupt handler when a completion interrupt happens. EQs are destroyed
when the ucontext is deallocated.

The change calls some public APIs in the mana ethernet driver to
allocate EQs and other resources. The EQ processing routine is also
shared by the mana ethernet and mana ib drivers.

Co-developed-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Wei Hu <weh@microsoft.com>
---

v2: Use ibdev_dbg to print error messages and return -ENOMEM
    when kzalloc fails.

v3: Check return value on mana_ib_gd_destroy_dma_region(). Remove most
    debug prints.

v4: Fix a couple of nits and perform thorough testing in a production
    environment.

 drivers/infiniband/hw/mana/cq.c               |  35 ++++-
 drivers/infiniband/hw/mana/main.c             |  84 +++++++++++
 drivers/infiniband/hw/mana/mana_ib.h          |   4 +
 drivers/infiniband/hw/mana/qp.c               |  79 ++++++++++-
 .../net/ethernet/microsoft/mana/gdma_main.c   | 131 ++++++++++--------
 drivers/net/ethernet/microsoft/mana/mana_en.c |   1 +
 include/net/mana/gdma.h                       |   9 +-
 7 files changed, 278 insertions(+), 65 deletions(-)

Comments

Jason Gunthorpe July 28, 2023, 5:43 p.m. UTC | #1
On Fri, Jul 28, 2023 at 05:07:49PM +0000, Wei Hu wrote:
> Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
> to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
> handler when completion interrupt happens. EQs are destroyed when
> ucontext is deallocated.

It seems strange that interrupts would somehow be linked to a
ucontext? Interrupts are highly limited; you can DOS the entire system
if someone abuses this.

Generally I expect a properly functioning driver to use one interrupt
per CPU core.

You should tie the CQ to a shared EQ belonging to the core that the CQ
wants to have affinity to.

Jason
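
For reference, the per-core shared-EQ model described above is what the
verbs comp_vector already expresses from userspace: the application picks
a completion vector (typically one per core) and the provider maps it to a
shared, kernel-owned EQ. A minimal sketch against the stock libibverbs
API; the vector choice below is illustrative and not mana_ib-specific:

#define _GNU_SOURCE
#include <infiniband/verbs.h>
#include <sched.h>

/* Illustrative only: ask for a CQ serviced by the completion vector of
 * the current core. The provider is expected to back each vector with a
 * shared, kernel-owned EQ/interrupt.
 */
static struct ibv_cq *create_cq_on_this_core(struct ibv_context *ctx,
					     struct ibv_comp_channel *ch)
{
	int cpu = sched_getcpu();
	int vector = (cpu < 0) ? 0 : cpu % ctx->num_comp_vectors;

	return ibv_create_cq(ctx, 256 /* cqe */, NULL, ch, vector);
}
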
Long Li July 28, 2023, 5:51 p.m. UTC | #2
> Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib
> driver.
> 
> On Fri, Jul 28, 2023 at 05:07:49PM +0000, Wei Hu wrote:
> > Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
> > to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
> > handler when completion interrupt happens. EQs are destroyed when
> > ucontext is deallocated.
> 
> It seems strange that interrupts would be somehow linked to a ucontext?
> interrupts are highly limited, you can DOS the entire system if someone abuses
> this.
> 
> Generally I expect a properly functioning driver to use one interrupt per CPU core.

Yes, MANA uses one interrupt per CPU. One interrupt is shared among multiple
EQs.

> 
> You should tie the CQ to a shared EQ belong to the core that the CQ wants to have
> affinity to.

The reason for using a separate EQ for a ucontext is to prevent DOS. If we use
a shared EQ, a single ucontext can storm this shared EQ, affecting other users.

If one ucontext decides to abuse its own EQ, the hardware won't be able to
generate enough IRQs to storm the whole system.

Long
Jason Gunthorpe July 28, 2023, 6:02 p.m. UTC | #3
On Fri, Jul 28, 2023 at 05:51:46PM +0000, Long Li wrote:
> > Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib
> > driver.
> > 
> > On Fri, Jul 28, 2023 at 05:07:49PM +0000, Wei Hu wrote:
> > > Add EQ interrupt support for mana ib driver. Allocate EQs per ucontext
> > > to receive interrupt. Attach EQ when CQ is created. Call CQ interrupt
> > > handler when completion interrupt happens. EQs are destroyed when
> > > ucontext is deallocated.
> > 
> > It seems strange that interrupts would be somehow linked to a ucontext?
> > interrupts are highly limited, you can DOS the entire system if someone abuses
> > this.
> > 
> > Generally I expect a properly functioning driver to use one interrupt per CPU core.
> 
> Yes, MANA uses one interrupt per CPU. One interrupt is shared among multiple
> EQs.

So you have another multiplexing layer between the interrupt and the
EQ? That is a lot of multiplexing layers.

> > You should tie the CQ to a shared EQ belong to the core that the CQ wants to have
> > affinity to.
> 
> The reason for using a separate EQ for a ucontext, is for preventing DOS. If we use
> a shared EQ, a single ucontext can storm this shared EQ affecting
> other users.

With a proper design it should not be possible. The CQ adds an entry
to the EQ and that should be rate limited by the ability of userspace
to schedule to re-arm the CQ.

Jason
Long Li July 28, 2023, 6:22 p.m. UTC | #4
> Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib
> driver.
> 
> On Fri, Jul 28, 2023 at 05:51:46PM +0000, Long Li wrote:
> > > Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support
> > > to mana ib driver.
> > >
> > > On Fri, Jul 28, 2023 at 05:07:49PM +0000, Wei Hu wrote:
> > > > Add EQ interrupt support for mana ib driver. Allocate EQs per
> > > > ucontext to receive interrupt. Attach EQ when CQ is created. Call
> > > > CQ interrupt handler when completion interrupt happens. EQs are
> > > > destroyed when ucontext is deallocated.
> > >
> > > It seems strange that interrupts would be somehow linked to a ucontext?
> > > interrupts are highly limited, you can DOS the entire system if
> > > someone abuses this.
> > >
> > > Generally I expect a properly functioning driver to use one interrupt per CPU
> core.
> >
> > Yes, MANA uses one interrupt per CPU. One interrupt is shared among
> > multiple EQs.
> 
> So you have another multiplexing layer between the interrupt and the EQ? That is
> alot of multiplexing layers..
> 
> > > You should tie the CQ to a shared EQ belong to the core that the CQ
> > > wants to have affinity to.
> >
> > The reason for using a separate EQ for a ucontext, is for preventing
> > DOS. If we use a shared EQ, a single ucontext can storm this shared EQ
> > affecting other users.
> 
> With a proper design it should not be possible. The CQ adds an entry to the EQ
> and that should be rate limited by the ability of userspace to schedule to re-arm
> the CQ.

I think DPDK user space can sometimes storm the EQ by arming the CQ from user-mode.

Please see the following code on arming the CQ from MLX4:
https://github.com/DPDK/dpdk/blob/12fcafcd62286933e6b167b14856d21f642efa5f/drivers/net/mlx4/mlx4_intr.c#L229

With a malicious DPDK user, this code can be abused to arm the CQ at an extremely high rate.

Long
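
The abuse pattern being referred to is, roughly, a loop that keeps
requesting CQ notifications without ever draining the CQ; with traffic
flowing, each completion then raises another event on whichever EQ the CQ
is attached to. A sketch of that pattern, not the DPDK code at the link
above:

#include <infiniband/verbs.h>

/* Blind re-arm loop: the CQ is re-armed as fast as userspace can run,
 * and ibv_poll_cq() is never called to consume completions.
 */
static void storm_the_eq(struct ibv_cq *cq)
{
	for (;;)
		ibv_req_notify_cq(cq, 0); /* 0 = notify on any completion */
}
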
Jason Gunthorpe July 28, 2023, 6:39 p.m. UTC | #5
On Fri, Jul 28, 2023 at 06:22:53PM +0000, Long Li wrote:
> > Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib
> > driver.
> > 
> > On Fri, Jul 28, 2023 at 05:51:46PM +0000, Long Li wrote:
> > > > Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support
> > > > to mana ib driver.
> > > >
> > > > On Fri, Jul 28, 2023 at 05:07:49PM +0000, Wei Hu wrote:
> > > > > Add EQ interrupt support for mana ib driver. Allocate EQs per
> > > > > ucontext to receive interrupt. Attach EQ when CQ is created. Call
> > > > > CQ interrupt handler when completion interrupt happens. EQs are
> > > > > destroyed when ucontext is deallocated.
> > > >
> > > > It seems strange that interrupts would be somehow linked to a ucontext?
> > > > interrupts are highly limited, you can DOS the entire system if
> > > > someone abuses this.
> > > >
> > > > Generally I expect a properly functioning driver to use one interrupt per CPU
> > core.
> > >
> > > Yes, MANA uses one interrupt per CPU. One interrupt is shared among
> > > multiple EQs.
> > 
> > So you have another multiplexing layer between the interrupt and the EQ? That is
> > alot of multiplexing layers..
> > 
> > > > You should tie the CQ to a shared EQ belong to the core that the CQ
> > > > wants to have affinity to.
> > >
> > > The reason for using a separate EQ for a ucontext, is for preventing
> > > DOS. If we use a shared EQ, a single ucontext can storm this shared EQ
> > > affecting other users.
> > 
> > With a proper design it should not be possible. The CQ adds an entry to the EQ
> > and that should be rate limited by the ability of userspace to schedule to re-arm
> > the CQ.
> 
> I think DPDK user space can sometimes storm the EQ by arming the CQ
> from user-mode.

Maybe maliciously you can do a blind re-arm, but nothing sane should
do that.

> With a malicious DPDK user, this code can be abused to arm the CQ at
> extremely high rate.

Again, the rate of CQ re-arm is limited by the ability of userspace to
schedule, I'm reluctant to consider that a DOS vector. Doesn't your HW
have EQ overflow recovery?

Frankly, stacking more layers of IRQ multiplexing doesn't seem like it
should solve any problems; you are just shifting where the DOS can
occur. Allowing userspace to create EQs is its own DOS direction:
either you exhaust and DOS the number of EQs, or you DOS the
multiplexing layer between the interrupt and the EQ.

Jason
Long Li Aug. 1, 2023, 7:06 p.m. UTC | #6
> Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support to mana ib
> driver.
> 
> On Fri, Jul 28, 2023 at 06:22:53PM +0000, Long Li wrote:
> > > Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt support
> > > to mana ib driver.
> > >
> > > On Fri, Jul 28, 2023 at 05:51:46PM +0000, Long Li wrote:
> > > > > Subject: Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ interrupt
> > > > > support to mana ib driver.
> > > > >
> > > > > On Fri, Jul 28, 2023 at 05:07:49PM +0000, Wei Hu wrote:
> > > > > > Add EQ interrupt support for mana ib driver. Allocate EQs per
> > > > > > ucontext to receive interrupt. Attach EQ when CQ is created.
> > > > > > Call CQ interrupt handler when completion interrupt happens.
> > > > > > EQs are destroyed when ucontext is deallocated.
> > > > >
> > > > > It seems strange that interrupts would be somehow linked to a ucontext?
> > > > > interrupts are highly limited, you can DOS the entire system if
> > > > > someone abuses this.
> > > > >
> > > > > Generally I expect a properly functioning driver to use one
> > > > > interrupt per CPU
> > > core.
> > > >
> > > > Yes, MANA uses one interrupt per CPU. One interrupt is shared
> > > > among multiple EQs.
> > >
> > > So you have another multiplexing layer between the interrupt and the
> > > EQ? That is alot of multiplexing layers..
> > >
> > > > > You should tie the CQ to a shared EQ belong to the core that the
> > > > > CQ wants to have affinity to.
> > > >
> > > > The reason for using a separate EQ for a ucontext, is for
> > > > preventing DOS. If we use a shared EQ, a single ucontext can storm
> > > > this shared EQ affecting other users.
> > >
> > > With a proper design it should not be possible. The CQ adds an entry
> > > to the EQ and that should be rate limited by the ability of
> > > userspace to schedule to re-arm the CQ.
> >
> > I think DPDK user space can sometimes storm the EQ by arming the CQ
> > from user-mode.
> 
> Maybe maliciously you can do a blind re-arm, but nothing sane should do that.

Yes, we don't expect a sane user to do that. But in a containerized cloud VM, we can't trust any user. The hardware/driver is designed to isolate the damage from those bad behaviors to the user's own environment.

> 
> > With a malicious DPDK user, this code can be abused to arm the CQ at
> > extremely high rate.
> 
> Again, the rate of CQ re-arm is limited by the ability of userspace to schedule, I'm
> reluctant to consider that a DOS vector. Doesn't your HW have EQ overflow
> recovery?

The HW supports detection and recovery of EQ overflow, but it is on the slow path. A bad user can still affect other users if they share the same EQ and it gets into recovery mode from time to time.

> 
> Frankly, stacking more layers of IRQ multiplexing doesn't seem like it should solve
> any problems, you are just shifting where the DOS can occure. Allowing userspace
> to create EQs is its own DOS direction, either you exhaust and DOS the number of
> EQs or you DOS the multiplexing layer between the interrupt and the EQ.

The hardware is designed to support a very large number of EQs. In practice, this hardware limit is unlikely to be reached before other resources run out.

The driver interrupt code limits the CPU processing time of each EQ by reading a small batch of EQEs in this interrupt. It guarantees all the EQs are checked on this CPU, and limits the interrupt processing time for any given EQ. In this way, a bad EQ (which is stormed by a bad user doing unreasonable re-arming on the CQ) can't storm other EQs on this CPU.

Thanks,

Long
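
The batching described above is implemented by the reworked
mana_gd_process_eq_events() in the patch below. Reduced to a sketch,
where eq_has_pending_eqe(), process_one_eqe() and rearm_eq() are
hypothetical stand-ins for the owner-bit check, EQE dispatch and doorbell
ring done by the real code:

#include <linux/list.h>
#include <linux/rculist.h>

#define EQE_BATCH 5	/* cap the work done per EQ per interrupt */

static void irq_service_eqs(struct list_head *eq_list)
{
	struct gdma_queue *eq;
	int i;

	/* Walk only the EQs attached to this IRQ. */
	list_for_each_entry_rcu(eq, eq_list, entry) {
		for (i = 0; i < EQE_BATCH; i++) {
			if (!eq_has_pending_eqe(eq))
				break;
			process_one_eqe(eq);
		}
		rearm_eq(eq);	/* update the HW head and re-arm the EQ */
	}
}
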
Jason Gunthorpe Aug. 1, 2023, 11:46 p.m. UTC | #7
On Tue, Aug 01, 2023 at 07:06:57PM +0000, Long Li wrote:

> The driver interrupt code limits the CPU processing time of each EQ
> by reading a small batch of EQEs in this interrupt. It guarantees
> all the EQs are checked on this CPU, and limits the interrupt
> processing time for any given EQ. In this way, a bad EQ (which is
> stormed by a bad user doing unreasonable re-arming on the CQ) can't
> storm other EQs on this CPU.

Of course it can; the bad user just creates a million EQs and pushes a
bit of work through them constantly. How is that really any different
from pushing more EQEs into a single EQ?

And how does your EQ multiplexing work anyhow? Do you poll every EQ on
every interrupt? That itself is a DOS vector.

Jason
Ajay Sharma Aug. 2, 2023, 4:11 a.m. UTC | #8
> On Aug 1, 2023, at 6:46 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Tue, Aug 01, 2023 at 07:06:57PM +0000, Long Li wrote:
> 
>> The driver interrupt code limits the CPU processing time of each EQ
>> by reading a small batch of EQEs in this interrupt. It guarantees
>> all the EQs are checked on this CPU, and limits the interrupt
>> processing time for any given EQ. In this way, a bad EQ (which is
>> stormed by a bad user doing unreasonable re-arming on the CQ) can't
>> storm other EQs on this CPU.
> 
> Of course it can, the bad use just creates a million EQs and pushes a
> bit of work through them constantly. How is that really any different
> from pushing more EQEs into a single EQ?
> 
> And how does your EQ multiplexing work anyhow? Do you poll every EQ on
> every interrupt? That itself is a DOS vector.
> 
> Jason

Users do not create EQs directly. EQ creation is a by-product of opening the device, i.e. allocating a context. I am not sure if the same process is allowed to open the device multiple times - there must be some kind of lock implemented. So a million EQs is probably far-fetched.
As for how the EQ servicing is done - only those EQs for which the interrupt is raised are checked. And each EQ is tied only once, and only to a single interrupt.

Ajay
Jason Gunthorpe Aug. 2, 2023, 1:08 p.m. UTC | #9
On Wed, Aug 02, 2023 at 04:11:18AM +0000, Ajay Sharma wrote:
> 
> 
> > On Aug 1, 2023, at 6:46 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Tue, Aug 01, 2023 at 07:06:57PM +0000, Long Li wrote:
> > 
> >> The driver interrupt code limits the CPU processing time of each EQ
> >> by reading a small batch of EQEs in this interrupt. It guarantees
> >> all the EQs are checked on this CPU, and limits the interrupt
> >> processing time for any given EQ. In this way, a bad EQ (which is
> >> stormed by a bad user doing unreasonable re-arming on the CQ) can't
> >> storm other EQs on this CPU.
> > 
> > Of course it can, the bad use just creates a million EQs and pushes a
> > bit of work through them constantly. How is that really any different
> > from pushing more EQEs into a single EQ?
> > 
> > And how does your EQ multiplexing work anyhow? Do you poll every EQ on
> > every interrupt? That itself is a DOS vector.
> 
> User does not create eqs directly . EQ creation is by product of
> opening device ie allocating context. 

Which is done directly by the user.

> I am not sure if the same
> process is allowed to open device multiple times 

Of course it can.

> of lock implemented. So million eqs are probably far fetched .

Uh, how do you conclude that?

>  As for how the eq servicing is done - only those eq’s for which the
> interrupt is raised are checked. And each eq is tied only once and
> only to a single interrupt.

So you iterate over a list of EQs in every interrupt?

Allowing userspace to increase the number of EQs on an interrupt is a
direct DOS vector, no special fussing required.

If you want this to work properly you need to have your HW arrange
things so there is only ever one EQE in the EQ for a given CQ at any
time. Another EQE cannot be stuffed by the HW until the kernel reads
the first EQE and acks it back.

You have almost got this right; the mistake is that userspace is the
thing that allows the HW to generate a new EQE. If you care about DOS
then this is the wrong design: the kernel, and only the kernel, must be
able to trigger a new EQE for the CQ.

In effect you need two CQ doorbells, a userspace one that re-arms the
CQ, and a kernel one that allows a CQ that triggered on ARM to
generate an EQE.

Thus the kernel can strictly limit the flow of EQEs through the EQs
such that an EQ can never overflow and a CQ can never consume more
than one EQE.

You cannot really fix this hardware problem with a software
solution. You will always have a DOS at some point.

Jason
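
One way to read the two-doorbell proposal above, as a sketch; none of this
is existing mana/gdma API, and all the types and helpers are hypothetical.
Userspace may ring its arm doorbell as often as it likes, but the hardware
only emits a new EQE for a CQ after the kernel acknowledges the previous
one:

/* Hypothetical kernel-side EQ service loop under the proposed scheme. */
static void eq_isr(struct hw_eq *eq)
{
	struct hw_eqe eqe;

	while (eq_pop(eq, &eqe)) {	/* at most one pending EQE per CQ */
		struct hw_cq *cq = cq_lookup(eqe.cq_id);

		cq->comp_handler(cq);	/* dispatch the completion event */

		/* Kernel-only doorbell: only now may the HW generate the
		 * next EQE for this CQ, so the EQ can never overflow.
		 */
		ring_kernel_cq_doorbell(cq);
	}
	ack_and_rearm_eq(eq);		/* ack consumed EQEs, re-arm the EQ */
}
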
Long Li Aug. 15, 2023, 10:51 p.m. UTC | #10
> Subject: Re: [EXTERNAL] Re: [PATCH v4 1/1] RDMA/mana_ib: Add EQ
> interrupt support to mana ib driver.
> 
> On Wed, Aug 02, 2023 at 04:11:18AM +0000, Ajay Sharma wrote:
> >
> >
> > > On Aug 1, 2023, at 6:46 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Aug 01, 2023 at 07:06:57PM +0000, Long Li wrote:
> > >
> > >> The driver interrupt code limits the CPU processing time of each EQ
> > >> by reading a small batch of EQEs in this interrupt. It guarantees
> > >> all the EQs are checked on this CPU, and limits the interrupt
> > >> processing time for any given EQ. In this way, a bad EQ (which is
> > >> stormed by a bad user doing unreasonable re-arming on the CQ) can't
> > >> storm other EQs on this CPU.
> > >
> > > Of course it can, the bad use just creates a million EQs and pushes
> > > a bit of work through them constantly. How is that really any
> > > different from pushing more EQEs into a single EQ?
> > >
> > > And how does your EQ multiplexing work anyhow? Do you poll every EQ
> > > on every interrupt? That itself is a DOS vector.
> >
> > User does not create eqs directly . EQ creation is by product of
> > opening device ie allocating context.
> 
> Which is done directly by the user.
> 
> > I am not sure if the same
> > process is allowed to open device multiple times
> 
> Of course it can.
> 
> > of lock implemented. So million eqs are probably far fetched .
> 
> Uh, how do you conclude that?
> 
> >  As for how the eq servicing is done - only those eq’s for which the
> > interrupt is raised are checked. And each eq is tied only once and
> > only to a single interrupt.
> 
> So you iterate over a list of EQs in every interrupt?
> 
> Allowing userspace to increase the number of EQs on an interrupt is a direct
> DOS vector, no special fussing required.
> 
> If you want this to work properly you need to have your HW arrange things so
> there is only ever one EQE in the EQ for a given CQ at any time. Another EQE
> cannot be stuffed by the HW until the kernel reads the first EQE and acks it
> back.
> 
> You have almost got this right, the mistake is that userspace is the thing that
> allows the HW to generate a new EQE. If you care about DOS then this is the
> wrong design, the kernel and only the kernel must be able to trigger a new EQE
> for the CQ.
> 
> In effect you need two CQ doorbells, a userspace one that re-arms the CQ, and
> a kernel one that allows a CQ that triggered on ARM to generate an EQE.
> 
> Thus the kernel can strictly limit the flow of EQEs through the EQs such that an
> EQ can never overflow and a CQ can never consume more than one EQE.
> 
> You cannot really fix this hardware problem with a software solution. You will
> always have a DOS at some point.

We'll address the comments and send another patch.

Thanks,

Long

Patch

diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
index d141cab8a1e6..6865dab66d48 100644
--- a/drivers/infiniband/hw/mana/cq.c
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -12,13 +12,20 @@  int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	struct ib_device *ibdev = ibcq->device;
 	struct mana_ib_create_cq ucmd = {};
 	struct mana_ib_dev *mdev;
+	struct gdma_context *gc;
+	struct gdma_dev *gd;
 	int err;
 
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+	gd = mdev->gdma_dev;
+	gc = gd->gdma_context;
 
 	if (udata->inlen < sizeof(ucmd))
 		return -EINVAL;
 
+	cq->comp_vector = attr->comp_vector > gc->max_num_queues ?
+				0 : attr->comp_vector;
+
 	err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
 	if (err) {
 		ibdev_dbg(ibdev,
@@ -69,11 +76,35 @@  int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
 	struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
 	struct ib_device *ibdev = ibcq->device;
 	struct mana_ib_dev *mdev;
+	struct gdma_context *gc;
+	struct gdma_dev *gd;
+	int err;
+
 
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+	gd = mdev->gdma_dev;
+	gc = gd->gdma_context;
+
 
-	mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
-	ib_umem_release(cq->umem);
+
+	if (atomic_read(&ibcq->usecnt) == 0) {
+		err = mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
+		if (err) {
+			ibdev_dbg(ibdev,
+				  "Failed to destroy dma region, %d\n", err);
+			return err;
+		}
+		kfree(gc->cq_table[cq->id]);
+		gc->cq_table[cq->id] = NULL;
+		ib_umem_release(cq->umem);
+	}
 
 	return 0;
 }
+
+void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq)
+{
+	struct mana_ib_cq *cq = ctx;
+
+	cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+}
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 7be4c3adb4e2..b20a6c6c1de1 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -143,6 +143,78 @@  int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 	return err;
 }
 
+static void mana_ib_destroy_eq(struct mana_ib_ucontext *ucontext,
+			       struct mana_ib_dev *mdev)
+{
+	struct gdma_context *gc = mdev->gdma_dev->gdma_context;
+	struct gdma_queue *eq;
+	int i;
+
+	if (!ucontext->eqs)
+		return;
+
+	for (i = 0; i < gc->max_num_queues; i++) {
+		eq = ucontext->eqs[i].eq;
+		if (!eq)
+			continue;
+
+		mana_gd_destroy_queue(gc, eq);
+	}
+
+	kfree(ucontext->eqs);
+	ucontext->eqs = NULL;
+}
+
+static int mana_ib_create_eq(struct mana_ib_ucontext *ucontext,
+			     struct mana_ib_dev *mdev)
+{
+	struct gdma_queue_spec spec = {};
+	struct gdma_queue *queue;
+	struct gdma_context *gc;
+	struct ib_device *ibdev;
+	struct gdma_dev *gd;
+	int err;
+	int i;
+
+	if (!ucontext || !mdev)
+		return -EINVAL;
+
+	ibdev = ucontext->ibucontext.device;
+	gd = mdev->gdma_dev;
+
+	gc = gd->gdma_context;
+
+	ucontext->eqs = kcalloc(gc->max_num_queues, sizeof(struct mana_eq),
+				GFP_KERNEL);
+	if (!ucontext->eqs)
+		return -ENOMEM;
+
+	spec.type = GDMA_EQ;
+	spec.monitor_avl_buf = false;
+	spec.queue_size = EQ_SIZE;
+	spec.eq.callback = NULL;
+	spec.eq.context = ucontext->eqs;
+	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
+	spec.eq.msix_allocated = true;
+
+	for (i = 0; i < gc->max_num_queues; i++) {
+		spec.eq.msix_index = i;
+		err = mana_gd_create_mana_eq(gd, &spec, &queue);
+		if (err)
+			goto out;
+
+		queue->eq.disable_needed = true;
+		ucontext->eqs[i].eq = queue;
+	}
+
+	return 0;
+
+out:
+	ibdev_dbg(ibdev, "Failed to allocated eq err %d\n", err);
+	mana_ib_destroy_eq(ucontext, mdev);
+	return err;
+}
+
 static int mana_gd_destroy_doorbell_page(struct gdma_context *gc,
 					 int doorbell_page)
 {
@@ -225,7 +297,17 @@  int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
 
 	ucontext->doorbell = doorbell_page;
 
+	ret = mana_ib_create_eq(ucontext, mdev);
+	if (ret) {
+		ibdev_dbg(ibdev, "Failed to create eq's , ret %d\n", ret);
+		goto err;
+	}
+
 	return 0;
+
+err:
+	mana_gd_destroy_doorbell_page(gc, doorbell_page);
+	return ret;
 }
 
 void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
@@ -240,6 +322,8 @@  void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
 	mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
 	gc = mdev->gdma_dev->gdma_context;
 
+	mana_ib_destroy_eq(mana_ucontext, mdev);
+
 	ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
 	if (ret)
 		ibdev_dbg(ibdev, "Failed to destroy doorbell page %d\n", ret);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
index 502cc8672eef..9672fa1670a5 100644
--- a/drivers/infiniband/hw/mana/mana_ib.h
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -67,6 +67,7 @@  struct mana_ib_cq {
 	int cqe;
 	u64 gdma_region;
 	u64 id;
+	u32 comp_vector;
 };
 
 struct mana_ib_qp {
@@ -86,6 +87,7 @@  struct mana_ib_qp {
 struct mana_ib_ucontext {
 	struct ib_ucontext ibucontext;
 	u32 doorbell;
+	struct mana_eq *eqs;
 };
 
 struct mana_ib_rwq_ind_table {
@@ -159,4 +161,6 @@  int mana_ib_query_gid(struct ib_device *ibdev, u32 port, int index,
 
 void mana_ib_disassociate_ucontext(struct ib_ucontext *ibcontext);
 
+void mana_ib_cq_handler(void *ctx, struct gdma_queue *gdma_cq);
+
 #endif
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 54b61930a7fd..b8fcb7a8eae0 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -96,16 +96,20 @@  static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
 	struct mana_ib_dev *mdev =
 		container_of(pd->device, struct mana_ib_dev, ib_dev);
+	struct ib_ucontext *ib_ucontext = pd->uobject->context;
 	struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
 	struct mana_ib_create_qp_rss_resp resp = {};
 	struct mana_ib_create_qp_rss ucmd = {};
+	struct mana_ib_ucontext *mana_ucontext;
 	struct gdma_dev *gd = mdev->gdma_dev;
 	mana_handle_t *mana_ind_table;
 	struct mana_port_context *mpc;
+	struct gdma_queue *gdma_cq;
 	struct mana_context *mc;
 	struct net_device *ndev;
 	struct mana_ib_cq *cq;
 	struct mana_ib_wq *wq;
+	struct mana_eq *eq;
 	unsigned int ind_tbl_size;
 	struct ib_cq *ibcq;
 	struct ib_wq *ibwq;
@@ -114,6 +118,8 @@  static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	int ret;
 
 	mc = gd->driver_data;
+	mana_ucontext =
+		container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
 
 	if (!udata || udata->inlen < sizeof(ucmd))
 		return -EINVAL;
@@ -180,6 +186,7 @@  static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 	for (i = 0; i < ind_tbl_size; i++) {
 		struct mana_obj_spec wq_spec = {};
 		struct mana_obj_spec cq_spec = {};
+		unsigned int max_num_queues = gd->gdma_context->max_num_queues;
 
 		ibwq = ind_tbl->ind_tbl[i];
 		wq = container_of(ibwq, struct mana_ib_wq, ibwq);
@@ -193,7 +200,8 @@  static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		cq_spec.gdma_region = cq->gdma_region;
 		cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
 		cq_spec.modr_ctx_id = 0;
-		cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+		eq = &mana_ucontext->eqs[cq->comp_vector % max_num_queues];
+		cq_spec.attached_eq = eq->eq->id;
 
 		ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
 					 &wq_spec, &cq_spec, &wq->rx_object);
@@ -215,6 +223,22 @@  static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 		resp.entries[i].wqid = wq->id;
 
 		mana_ind_table[i] = wq->rx_object;
+
+		if (gd->gdma_context->cq_table[cq->id] == NULL) {
+
+			gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
+			if (!gdma_cq) {
+				ret = -ENOMEM;
+				goto free_cq;
+			}
+
+			gdma_cq->cq.context = cq;
+			gdma_cq->type = GDMA_CQ;
+			gdma_cq->cq.callback = mana_ib_cq_handler;
+			gdma_cq->id = cq->id;
+			gd->gdma_context->cq_table[cq->id] = gdma_cq;
+		}
+
 	}
 	resp.num_entries = i;
 
@@ -224,7 +248,7 @@  static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 					 ucmd.rx_hash_key_len,
 					 ucmd.rx_hash_key);
 	if (ret)
-		goto fail;
+		goto free_cq;
 
 	ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
 	if (ret) {
@@ -238,6 +262,23 @@  static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
 
 	return 0;
 
+free_cq:
+	{
+		int j = i;
+		u64 cqid;
+
+		while (j-- > 0) {
+			cqid = resp.entries[j].cqid;
+			gdma_cq = gd->gdma_context->cq_table[cqid];
+			cq = gdma_cq->cq.context;
+			if (atomic_read(&cq->ibcq.usecnt) == 0) {
+				kfree(gd->gdma_context->cq_table[cqid]);
+				gd->gdma_context->cq_table[cqid] = NULL;
+			}
+		}
+
+	}
+
 fail:
 	while (i-- > 0) {
 		ibwq = ind_tbl->ind_tbl[i];
@@ -269,10 +310,12 @@  static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	struct mana_obj_spec wq_spec = {};
 	struct mana_obj_spec cq_spec = {};
 	struct mana_port_context *mpc;
+	struct gdma_queue *gdma_cq;
 	struct mana_context *mc;
 	struct net_device *ndev;
 	struct ib_umem *umem;
-	int err;
+	struct mana_eq *eq;
+	int err, eq_vec;
 	u32 port;
 
 	mc = gd->driver_data;
@@ -350,7 +393,9 @@  static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	cq_spec.gdma_region = send_cq->gdma_region;
 	cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
 	cq_spec.modr_ctx_id = 0;
-	cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+	eq_vec = send_cq->comp_vector % gd->gdma_context->max_num_queues;
+	eq = &mana_ucontext->eqs[eq_vec];
+	cq_spec.attached_eq = eq->eq->id;
 
 	err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
 				 &cq_spec, &qp->tx_object);
@@ -368,6 +413,23 @@  static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 	qp->sq_id = wq_spec.queue_index;
 	send_cq->id = cq_spec.queue_index;
 
+	if (gd->gdma_context->cq_table[send_cq->id] == NULL) {
+
+		gdma_cq = kzalloc(sizeof(*gdma_cq), GFP_KERNEL);
+		if (!gdma_cq) {
+			err = -ENOMEM;
+			goto err_destroy_wqobj_and_cq;
+		}
+
+		gdma_cq->cq.context = send_cq;
+		gdma_cq->type = GDMA_CQ;
+		gdma_cq->cq.callback = mana_ib_cq_handler;
+		gdma_cq->id = send_cq->id;
+		gd->gdma_context->cq_table[send_cq->id] = gdma_cq;
+	} else {
+		gdma_cq = gd->gdma_context->cq_table[send_cq->id];
+	}
+
 	ibdev_dbg(&mdev->ib_dev,
 		  "ret %d qp->tx_object 0x%llx sq id %llu cq id %llu\n", err,
 		  qp->tx_object, qp->sq_id, send_cq->id);
@@ -381,12 +443,17 @@  static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
 		ibdev_dbg(&mdev->ib_dev,
 			  "Failed copy udata for create qp-raw, %d\n",
 			  err);
-		goto err_destroy_wq_obj;
+		goto err_destroy_wqobj_and_cq;
 	}
 
 	return 0;
 
-err_destroy_wq_obj:
+err_destroy_wqobj_and_cq:
+	if (atomic_read(&send_cq->ibcq.usecnt) == 0) {
+		kfree(gdma_cq);
+		gd->gdma_context->cq_table[send_cq->id] = NULL;
+	}
+
 	mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
 
 err_destroy_dma_region:
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 8f3f78b68592..16e4b049a6c8 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -368,53 +368,57 @@  static void mana_gd_process_eqe(struct gdma_queue *eq)
 	}
 }
 
-static void mana_gd_process_eq_events(void *arg)
+static void mana_gd_process_eq_events(struct list_head *eq_list)
 {
 	u32 owner_bits, new_bits, old_bits;
 	union gdma_eqe_info eqe_info;
 	struct gdma_eqe *eq_eqe_ptr;
-	struct gdma_queue *eq = arg;
 	struct gdma_context *gc;
+	struct gdma_queue *eq;
 	struct gdma_eqe *eqe;
 	u32 head, num_eqe;
 	int i;
 
-	gc = eq->gdma_dev->gdma_context;
+	list_for_each_entry_rcu(eq, eq_list, entry) {
+		gc = eq->gdma_dev->gdma_context;
 
-	num_eqe = eq->queue_size / GDMA_EQE_SIZE;
-	eq_eqe_ptr = eq->queue_mem_ptr;
+		num_eqe = eq->queue_size / GDMA_EQE_SIZE;
+		eq_eqe_ptr = eq->queue_mem_ptr;
 
-	/* Process up to 5 EQEs at a time, and update the HW head. */
-	for (i = 0; i < 5; i++) {
-		eqe = &eq_eqe_ptr[eq->head % num_eqe];
-		eqe_info.as_uint32 = eqe->eqe_info;
-		owner_bits = eqe_info.owner_bits;
+		/* Process up to 5 EQEs at a time, and update the HW head. */
+		for (i = 0; i < 5; i++) {
+			eqe = &eq_eqe_ptr[eq->head % num_eqe];
+			eqe_info.as_uint32 = eqe->eqe_info;
+			owner_bits = eqe_info.owner_bits;
 
-		old_bits = (eq->head / num_eqe - 1) & GDMA_EQE_OWNER_MASK;
-		/* No more entries */
-		if (owner_bits == old_bits)
-			break;
+			old_bits =
+				(eq->head / num_eqe - 1) & GDMA_EQE_OWNER_MASK;
+			/* No more entries */
+			if (owner_bits == old_bits)
+				break;
 
-		new_bits = (eq->head / num_eqe) & GDMA_EQE_OWNER_MASK;
-		if (owner_bits != new_bits) {
-			dev_err(gc->dev, "EQ %d: overflow detected\n", eq->id);
-			break;
-		}
+			new_bits = (eq->head / num_eqe) & GDMA_EQE_OWNER_MASK;
+			if (owner_bits != new_bits) {
+				dev_err(gc->dev, "EQ %d: overflow detected\n",
+					eq->id);
+				break;
+			}
 
-		/* Per GDMA spec, rmb is necessary after checking owner_bits, before
-		 * reading eqe.
-		 */
-		rmb();
+			/* Per GDMA spec, rmb is necessary after checking
+			 * owner_bits, before reading eqe.
+			 */
+			rmb();
 
-		mana_gd_process_eqe(eq);
+			mana_gd_process_eqe(eq);
 
-		eq->head++;
-	}
+			eq->head++;
+		}
 
-	head = eq->head % (num_eqe << GDMA_EQE_OWNER_BITS);
+		head = eq->head % (num_eqe << GDMA_EQE_OWNER_BITS);
 
-	mana_gd_ring_doorbell(gc, eq->gdma_dev->doorbell, eq->type, eq->id,
-			      head, SET_ARM_BIT);
+		mana_gd_ring_doorbell(gc, eq->gdma_dev->doorbell, eq->type,
+				      eq->id, head, SET_ARM_BIT);
+	}
 }
 
 static int mana_gd_register_irq(struct gdma_queue *queue,
@@ -432,44 +436,47 @@  static int mana_gd_register_irq(struct gdma_queue *queue,
 	gc = gd->gdma_context;
 	r = &gc->msix_resource;
 	dev = gc->dev;
+	msi_index = spec->eq.msix_index;
 
 	spin_lock_irqsave(&r->lock, flags);
 
-	msi_index = find_first_zero_bit(r->map, r->size);
-	if (msi_index >= r->size || msi_index >= gc->num_msix_usable) {
-		err = -ENOSPC;
-	} else {
-		bitmap_set(r->map, msi_index, 1);
-		queue->eq.msix_index = msi_index;
-	}
-
-	spin_unlock_irqrestore(&r->lock, flags);
+	if (!spec->eq.msix_allocated) {
+		msi_index = find_first_zero_bit(r->map, r->size);
+		if (msi_index >= r->size || msi_index >= gc->num_msix_usable)
+			err = -ENOSPC;
+		else
+			bitmap_set(r->map, msi_index, 1);
 
-	if (err) {
-		dev_err(dev, "Register IRQ err:%d, msi:%u rsize:%u, nMSI:%u",
-			err, msi_index, r->size, gc->num_msix_usable);
+		if (err) {
+			dev_err(dev, "Register IRQ err:%d, msi:%u rsize:%u, nMSI:%u",
+				err, msi_index, r->size, gc->num_msix_usable);
 
-		return err;
+			goto out;
+		}
 	}
 
+	queue->eq.msix_index = msi_index;
 	gic = &gc->irq_contexts[msi_index];
 
-	WARN_ON(gic->handler || gic->arg);
-
-	gic->arg = queue;
+	list_add_rcu(&queue->entry, &gic->eq_list);
 
 	gic->handler = mana_gd_process_eq_events;
 
-	return 0;
+out:
+	spin_unlock_irqrestore(&r->lock, flags);
+
+	return err;
 }
 
-static void mana_gd_deregiser_irq(struct gdma_queue *queue)
+static void mana_gd_deregister_irq(struct gdma_queue *queue)
 {
 	struct gdma_dev *gd = queue->gdma_dev;
 	struct gdma_irq_context *gic;
 	struct gdma_context *gc;
 	struct gdma_resource *r;
 	unsigned int msix_index;
+	struct list_head *p, *n;
+	struct gdma_queue *eq;
 	unsigned long flags;
 
 	gc = gd->gdma_context;
@@ -480,13 +487,25 @@  static void mana_gd_deregiser_irq(struct gdma_queue *queue)
 	if (WARN_ON(msix_index >= gc->num_msix_usable))
 		return;
 
+	spin_lock_irqsave(&r->lock, flags);
+
 	gic = &gc->irq_contexts[msix_index];
-	gic->handler = NULL;
-	gic->arg = NULL;
 
-	spin_lock_irqsave(&r->lock, flags);
-	bitmap_clear(r->map, msix_index, 1);
+	list_for_each_safe(p, n, &gic->eq_list) {
+		eq = list_entry(p, struct gdma_queue, entry);
+		if (queue == eq) {
+			list_del_rcu(&eq->entry);
+			break;
+		}
+	}
+
+	if (list_empty(&gic->eq_list)) {
+		gic->handler = NULL;
+		bitmap_clear(r->map, msix_index, 1);
+	}
+
 	spin_unlock_irqrestore(&r->lock, flags);
+	synchronize_rcu();
 
 	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
 }
@@ -550,7 +569,7 @@  static void mana_gd_destroy_eq(struct gdma_context *gc, bool flush_evenets,
 			dev_warn(gc->dev, "Failed to flush EQ: %d\n", err);
 	}
 
-	mana_gd_deregiser_irq(queue);
+	mana_gd_deregister_irq(queue);
 
 	if (queue->eq.disable_needed)
 		mana_gd_disable_queue(queue);
@@ -565,7 +584,7 @@  static int mana_gd_create_eq(struct gdma_dev *gd,
 	u32 log2_num_entries;
 	int err;
 
-	queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
+	queue->eq.msix_index = spec->eq.msix_index;
 
 	log2_num_entries = ilog2(queue->queue_size / GDMA_EQE_SIZE);
 
@@ -602,6 +621,7 @@  static int mana_gd_create_eq(struct gdma_dev *gd,
 	mana_gd_destroy_eq(gc, false, queue);
 	return err;
 }
+EXPORT_SYMBOL(mana_gd_create_mana_eq);
 
 static void mana_gd_create_cq(const struct gdma_queue_spec *spec,
 			      struct gdma_queue *queue)
@@ -873,6 +893,7 @@  void mana_gd_destroy_queue(struct gdma_context *gc, struct gdma_queue *queue)
 	mana_gd_free_memory(gmi);
 	kfree(queue);
 }
+EXPORT_SYMBOL(mana_gd_destroy_queue);
 
 int mana_gd_verify_vf_version(struct pci_dev *pdev)
 {
@@ -1188,7 +1209,7 @@  static irqreturn_t mana_gd_intr(int irq, void *arg)
 	struct gdma_irq_context *gic = arg;
 
 	if (gic->handler)
-		gic->handler(gic->arg);
+		gic->handler(&gic->eq_list);
 
 	return IRQ_HANDLED;
 }
@@ -1241,7 +1262,7 @@  static int mana_gd_setup_irqs(struct pci_dev *pdev)
 	for (i = 0; i < nvec; i++) {
 		gic = &gc->irq_contexts[i];
 		gic->handler = NULL;
-		gic->arg = NULL;
+		INIT_LIST_HEAD(&gic->eq_list);
 
 		if (!i)
 			snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 06d6292e09b3..85345225813f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1156,6 +1156,7 @@  static int mana_create_eq(struct mana_context *ac)
 	spec.eq.callback = NULL;
 	spec.eq.context = ac->eqs;
 	spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
+	spec.eq.msix_allocated = false;
 
 	for (i = 0; i < gc->max_num_queues; i++) {
 		err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 96c120160f15..cc728fc42043 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -6,6 +6,7 @@ 
 
 #include <linux/dma-mapping.h>
 #include <linux/netdevice.h>
+#include <linux/list.h>
 
 #include "shm_channel.h"
 
@@ -291,6 +292,8 @@  struct gdma_queue {
 	u32 head;
 	u32 tail;
 
+	struct list_head entry;
+
 	/* Extra fields specific to EQ/CQ. */
 	union {
 		struct {
@@ -325,6 +328,8 @@  struct gdma_queue_spec {
 			void *context;
 
 			unsigned long log2_throttle_limit;
+			bool msix_allocated;
+			unsigned int msix_index;
 		} eq;
 
 		struct {
@@ -340,8 +345,8 @@  struct gdma_queue_spec {
 #define MANA_IRQ_NAME_SZ 32
 
 struct gdma_irq_context {
-	void (*handler)(void *arg);
-	void *arg;
+	void (*handler)(struct list_head *arg);
+	struct list_head eq_list;
 	char name[MANA_IRQ_NAME_SZ];
 };