cnic: change __GFP_COMP allocation method

Message ID: 20231219055514.12324-1-njavali@marvell.com (mailing list archive)
State: Changes Requested
Series: cnic: change __GFP_COMP allocation method

Commit Message

Nilesh Javali Dec. 19, 2023, 5:55 a.m. UTC
A system crash was observed, resulting from an iscsiuio page reference counting issue:

 kernel: BUG: Bad page map in process iscsiuio  pte:8000000877bed027 pmd:8579c3067
 kernel: page:00000000d2c88757 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x877bed
 kernel: flags: 0x57ffffc0000004(referenced|node=1|zone=2|lastcpupid=0x1fffff)
 kernel: page_type: 0xffffff7f(buddy)
 kernel: raw: 0057ffffc0000004 ffff88881ffb8618 ffffea0021ba6f48 0000000000000000
 kernel: raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000
 kernel: page dumped because: bad pte
 kernel: addr:00007f14f4b35000 vm_flags:0c0400fb anon_vma:0000000000000000 mapping:ffff888885004888 index:4
 kernel: file:uio0 fault:uio_vma_fault [uio] mmap:uio_mmap [uio] read_folio:0x0
 kernel: CPU: 23 PID: 3227 Comm: iscsiuio Kdump: loaded Not tainted 6.6.0-rc1+ #6
 kernel: Hardware name: HPE Synergy 480 Gen10/Synergy 480 Gen10 Compute Module, BIOS I42 09/21/2023
 kernel: Call Trace:
 kernel: <TASK>
 kernel: dump_stack_lvl+0x33/0x50
 kernel: print_bad_pte+0x1b6/0x280
 kernel: ? page_remove_rmap+0xd1/0x220
 kernel: zap_pte_range+0x35e/0x8c0
 kernel: zap_pmd_range.isra.0+0xf9/0x230
 kernel: unmap_page_range+0x2d4/0x4a0
 kernel: unmap_vmas+0xac/0x140
 kernel: exit_mmap+0xdf/0x350
 kernel: __mmput+0x43/0x120
 kernel: exit_mm+0xb3/0x120
 kernel: do_exit+0x276/0x4f0
 kernel: do_group_exit+0x2d/0x80
 kernel: __x64_sys_exit_group+0x14/0x20
 kernel: do_syscall_64+0x59/0x90
 kernel: ? syscall_exit_work+0x103/0x130
 kernel: ? syscall_exit_to_user_mode+0x22/0x40
 kernel: ? do_syscall_64+0x69/0x90
 kernel: ? exc_page_fault+0x65/0x150
 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 kernel: RIP: 0033:0x7f14f4918a7d
 kernel: Code: Unable to access opcode bytes at 0x7f14f4918a53.
 kernel: RSP: 002b:00007ffe91cfb698 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
 kernel: RAX: ffffffffffffffda RBX: 00007f14f49f69e0 RCX: 00007f14f4918a7d
 kernel: RDX: 00000000000000e7 RSI: fffffffffffffee8 RDI: 0000000000000000
 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
 kernel: R10: 00007ffe91cfb4c0 R11: 0000000000000246 R12: 00007f14f49f69e0
 kernel: R13: 00007f14f49fbf00 R14: 0000000000000002 R15: 00007f14f49fbee8
 kernel: </TASK>
 kernel: Disabling lock debugging due to kernel taint

dma_alloc_coherent() no longer accepts __GFP_COMP, as per commit
"dma-mapping: reject __GFP_COMP in dma_alloc_attrs". Hence, use
__get_free_pages() for the __GFP_COMP allocation together with
dma_map_single() to obtain the DMA address, which fixes the page
reference counting issue triggered by the iscsiuio mmap.

Fixes: bb73955c0b1d ("cnic: don't pass bogus GFP_ flags to dma_alloc_coherent")
Signed-off-by: Nilesh Javali <njavali@marvell.com>
---
 drivers/net/ethernet/broadcom/cnic.c | 42 ++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 8 deletions(-)
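
For quick reference, the pattern the patch switches to can be condensed as
below. This is an illustrative sketch only, mirroring the diff at the bottom
of this page; the helper names (alloc_comp_buf/free_comp_buf) and their
parameters are placeholders, not functions that exist in cnic.c.

	/* Allocate page-order memory with __GFP_COMP, then create a
	 * streaming DMA mapping for it.
	 */
	static void *alloc_comp_buf(struct device *dev, size_t size,
				    dma_addr_t *map)
	{
		void *buf;

		buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_COMP |
					       __GFP_ZERO, get_order(size));
		if (!buf)
			return NULL;

		*map = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, *map)) {
			free_pages((unsigned long)buf, get_order(size));
			return NULL;
		}
		return buf;
	}

	static void free_comp_buf(struct device *dev, void *buf, size_t size,
				  dma_addr_t map)
	{
		/* Unmap before returning the pages to the allocator. */
		dma_unmap_single(dev, map, size, DMA_BIDIRECTIONAL);
		free_pages((unsigned long)buf, get_order(size));
	}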

Comments

Christoph Hellwig Dec. 19, 2023, 5:58 a.m. UTC | #1
On Tue, Dec 19, 2023 at 11:25:14AM +0530, Nilesh Javali wrote:
> The dma_alloc_coherent no more provide __GFP_COMP allocation as per commit,
> dma-mapping: reject __GFP_COMP in dma_alloc_attrs, and hence
> instead use __get_free_pages for __GFP_COMP allocation along with
> dma_map_single to get dma address in order to fix page reference counting
> issue caused in iscsiuio mmap.

You can't just do a single map for things mapped to userspace, as that
breaks setups that are not DMA coherent.  There was a patch floating
around to explicitly support dma coherent allocations in uio, which is
the right thing to do.
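
To make the objection concrete: a streaming mapping created with
dma_map_single() only guarantees coherency around explicit ownership
transfers, so on non-coherent hardware the kernel must bracket CPU accesses
with dma_sync_single_for_cpu()/dma_sync_single_for_device(). A rough sketch
of that discipline (illustrative names only, not cnic code):

	/* Device owns the buffer after mapping; the CPU must not read it. */
	dma_addr_t map = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);

	/* ... device DMAs into the buffer ... */

	/* Hand ownership to the CPU before reading. */
	dma_sync_single_for_cpu(dev, map, size, DMA_FROM_DEVICE);
	/* ... CPU reads the data ... */

	/* Hand ownership back before the device writes again. */
	dma_sync_single_for_device(dev, map, size, DMA_FROM_DEVICE);

Once the pages are mmapped into iscsiuio there is no hook to issue those
syncs on the process's behalf, which is why a coherent allocation exposed
through the uio layer avoids the problem.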
Nilesh Javali Dec. 19, 2023, 6:16 a.m. UTC | #2
Christoph,

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Tuesday, December 19, 2023 11:29 AM
> To: Nilesh Javali <njavali@marvell.com>
> Cc: martin.petersen@oracle.com; lduncan@suse.com; cleech@redhat.com;
> linux-scsi@vger.kernel.org; GR-QLogic-Storage-Upstream <GR-QLogic-
> Storage-Upstream@marvell.com>; jmeneghi@redhat.com
> Subject: [EXT] Re: [PATCH] cnic: change __GFP_COMP allocation method
> 
> On Tue, Dec 19, 2023 at 11:25:14AM +0530, Nilesh Javali wrote:
> > The dma_alloc_coherent no more provide __GFP_COMP allocation as per
> commit,
> > dma-mapping: reject __GFP_COMP in dma_alloc_attrs, and hence
> > instead use __get_free_pages for __GFP_COMP allocation along with
> > dma_map_single to get dma address in order to fix page reference
> counting
> > issue caused in iscsiuio mmap.
> 
> You can't just do a single map for things mapped to userspace, as that
> breaks setups that are not DMA coherent.  There was a patch floating
> around to explicitly support dma coherent allocations in uio, which is
> the right thing to do.

If you are referring to the series proposed by Chris Leech, that series drew
objections, which was the reason to look for an alternative method for
coherent DMA mapping.

[PATCH 0/3] UIO_MEM_DMA_COHERENT for cnic/bnx2/bnx2x

During bnx2i iSCSI testing we ran into page refcounting issues in the
uio mmaps exported from cnic to the iscsiuio process, and bisected back
to the removal of the __GFP_COMP flag from dma_alloc_coherent calls.

In order to fix these drivers to be able to mmap dma coherent memory via
a uio device, without resorting to hacks and working with an iommu
enabled, introduce a new uio mmap type backed by dma_mmap_coherent.

While converting the uio interface, I also noticed that not all of these
allocations were PAGE_SIZE aligned. Particularly the bnx2/bnx2x status
block mapping was much smaller than any architecture page size, and I
was concerned that it could be unintentionally exposing kernel memory.

Chris Leech (3):
  uio: introduce UIO_DMA_COHERENT type
  cnic, bnx2, bnx2x: page align uio mmap allocations
  cnic, bnx2, bnx2x: use UIO_MEM_DMA_COHERENT

 drivers/net/ethernet/broadcom/bnx2.c          |  2 ++
 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  | 10 +++---
 drivers/net/ethernet/broadcom/cnic.c          | 34 ++++++++++++-------
 drivers/net/ethernet/broadcom/cnic.h          |  1 +
 drivers/net/ethernet/broadcom/cnic_if.h       |  1 +
 drivers/uio/uio.c                             | 34 +++++++++++++++++++
 include/linux/uio_driver.h                    | 12 +++++--
 7 files changed, 75 insertions(+), 19 deletions(-)

Thanks,
Nilesh
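
For context, the core of the referenced series is to keep allocating the
rings with dma_alloc_coherent() and let the uio layer map that memory into
userspace with dma_mmap_coherent(), rather than open-coding page tricks in
the driver. Below is a minimal sketch of such an mmap path; the helper name
and parameters are assumptions for illustration, not the actual uio patch.

	/* Expose a dma_alloc_coherent() buffer to userspace.  cpu_addr,
	 * dma_addr and size come from the driver's dma_alloc_coherent()
	 * call; dma_dev is the device used for that allocation.
	 */
	static int uio_mmap_dma_coherent(struct device *dma_dev,
					 struct vm_area_struct *vma,
					 void *cpu_addr, dma_addr_t dma_addr,
					 size_t size)
	{
		/* Reject requests larger than the page-aligned region,
		 * addressing the alignment concern in the cover letter.
		 */
		if (vma->vm_end - vma->vm_start > PAGE_ALIGN(size))
			return -EINVAL;

		/* dma_mmap_coherent() installs PTEs with the attributes the
		 * platform requires (cached, uncached or write-combined), so
		 * it also works on systems that are not DMA coherent.
		 */
		return dma_mmap_coherent(dma_dev, vma, cpu_addr, dma_addr,
					 size);
	}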
Christoph Hellwig Dec. 21, 2023, 6:48 a.m. UTC | #3
On Tue, Dec 19, 2023 at 06:16:38AM +0000, Nilesh Javali wrote:
> If you are referring to the series proposed by Chris Leech, then this had
> objections. And that was the reason to look for an alternative method for
> coherent DMA mapping. 
> 
> [PATCH 0/3] UIO_MEM_DMA_COHERENT for cnic/bnx2/bnx2x

Yes.  Well, Greg (rightly) dislikes what the iscsi drivers have been
doing.  But we're stuck supporting them, so I see no way around that.
John Meneghini Dec. 21, 2023, 2:33 p.m. UTC | #4
Including Greg.

On 12/21/23 01:48, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 06:16:38AM +0000, Nilesh Javali wrote:
>> If you are referring to the series proposed by Chris Leech, then this had
>> objections. And that was the reason to look for an alternative method for
>> coherent DMA mapping.
>>
>> [PATCH 0/3] UIO_MEM_DMA_COHERENT for cnic/bnx2/bnx2x
> 
> Yes.  Well, Greg (rightly) dislikes what the iscsi drivers have been
> doing.  But we're stuck supporting them, so I see no way around that.

If this is true, can we reconsider Chris's patches?

Red Hat has multiple enterprise customers who rely on this driver and we need
to keep it running, at least until the end of RHEL 9.  We can try to drop
support for bnx2/cnic in RHEL 10, but RHEL 9 is in the middle of its life
cycle, and a failure to address this issue is causing many problems as we
attempt to keep RHEL 9 current with what's upstream.

Greg, can we please take Chris's patches upstream?

https://lore.kernel.org/netdev/20230929170023.1020032-3-cleech@redhat.com/

/John
Greg KH Dec. 21, 2023, 5:02 p.m. UTC | #5
On Thu, Dec 21, 2023 at 09:33:44AM -0500, John Meneghini wrote:
> Including Greg.
> 
> On 12/21/23 01:48, Christoph Hellwig wrote:
> > On Tue, Dec 19, 2023 at 06:16:38AM +0000, Nilesh Javali wrote:
> > > If you are referring to the series proposed by Chris Leech, then this had
> > > objections. And that was the reason to look for an alternative method for
> > > coherent DMA mapping.
> > > 
> > > [PATCH 0/3] UIO_MEM_DMA_COHERENT for cnic/bnx2/bnx2x
> > 
> > Yes.  Well, Greg (rightly) dislikes what the iscsi drivers have been
> > doing.  But we're stuck supporting them, so I see no way around that.
> 
> If this is true then can we reconsider Chris's patches.
> 
> Red Hat has multiple enterprise customers who are relying on this driver and
> we need to keep it running - at least till the end of RHEL 9.  We can try
> and drop support for bnx2/cnic in RHEL 10 but RHEL 9 is in the middle of its
> life cycle and a failure to address this issue is causing many problems as
> we attempt to keep RHEL 9 current with what's upstream.

So you are trying to tell me to accept kernel patches today to keep
RHEL9 obsolete systems alive?  And then sometime in the future we can
drop those changes because why?  That feels very wrong and confusing.
What does what gets merged in 2024 have to do with RHEL 9 systems?  You
are free to do whatever you want in your enterprise kernels, don't rely
on making me take broken-by-design code and be forced to maintain it for
the next 10+ years please, that's just not nice.

> Greg, can we please take Chris's patches upstream?

It's a total abuse of the UIO api, and I thought I actually had comments
about it doing it incorrectly as well.  Resend them in the new year
after they have been cleaned up and we can reconsider them then, it's
too late now for anything new for 6.8-rc1 with the holidays apon us now
anyway.

And get the "we want you to take this crud and maintain it for forever
because we have to support an obsolete and out-of-date kernel for paying
customers" story a bit more straight so it doesn't sound so bad :)

thanks,

greg k-h

Patch

diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 7926aaef8f0c..28bfb39a3c0f 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -840,14 +840,20 @@  static void cnic_free_context(struct cnic_dev *dev)
 static void __cnic_free_uio_rings(struct cnic_uio_dev *udev)
 {
 	if (udev->l2_buf) {
-		dma_free_coherent(&udev->pdev->dev, udev->l2_buf_size,
-				  udev->l2_buf, udev->l2_buf_map);
+		if (udev->l2_buf_map)
+			dma_unmap_single(&udev->pdev->dev, udev->l2_buf_map,
+					 udev->l2_buf_size, DMA_BIDIRECTIONAL);
+		free_pages((unsigned long)udev->l2_buf,
+			   get_order(udev->l2_buf_size));
 		udev->l2_buf = NULL;
 	}
 
 	if (udev->l2_ring) {
-		dma_free_coherent(&udev->pdev->dev, udev->l2_ring_size,
-				  udev->l2_ring, udev->l2_ring_map);
+		if (udev->l2_ring_map)
+			dma_unmap_single(&udev->pdev->dev, udev->l2_ring_map,
+					 udev->l2_ring_size, DMA_BIDIRECTIONAL);
+		free_pages((unsigned long)udev->l2_ring,
+			   get_order(udev->l2_ring_size));
 		udev->l2_ring = NULL;
 	}
 
@@ -1026,20 +1032,40 @@  static int __cnic_alloc_uio_rings(struct cnic_uio_dev *udev, int pages)
 		return 0;
 
 	udev->l2_ring_size = pages * CNIC_PAGE_SIZE;
-	udev->l2_ring = dma_alloc_coherent(&udev->pdev->dev, udev->l2_ring_size,
-					   &udev->l2_ring_map, GFP_KERNEL);
+	udev->l2_ring = (void *)__get_free_pages(GFP_KERNEL | __GFP_COMP |
+						 __GFP_ZERO,
+						 get_order(udev->l2_ring_size));
 	if (!udev->l2_ring)
 		return -ENOMEM;
 
+	udev->l2_ring_map = dma_map_single(&udev->pdev->dev, udev->l2_ring,
+					   udev->l2_ring_size,
+					   DMA_BIDIRECTIONAL);
+	if (unlikely(dma_mapping_error(&udev->pdev->dev, udev->l2_ring_map))) {
+		pr_err("unable to map L2 ring memory %d\n", udev->l2_ring_size);
+		__cnic_free_uio_rings(udev);
+		return -ENOMEM;
+	}
+
 	udev->l2_buf_size = (cp->l2_rx_ring_size + 1) * cp->l2_single_buf_size;
 	udev->l2_buf_size = CNIC_PAGE_ALIGN(udev->l2_buf_size);
-	udev->l2_buf = dma_alloc_coherent(&udev->pdev->dev, udev->l2_buf_size,
-					  &udev->l2_buf_map, GFP_KERNEL);
+	udev->l2_buf = (void *)__get_free_pages(GFP_KERNEL | __GFP_COMP |
+						__GFP_ZERO,
+						get_order(udev->l2_buf_size));
 	if (!udev->l2_buf) {
 		__cnic_free_uio_rings(udev);
 		return -ENOMEM;
 	}
 
+	udev->l2_buf_map = dma_map_single(&udev->pdev->dev, udev->l2_buf,
+					  udev->l2_buf_size,
+					  DMA_BIDIRECTIONAL);
+	if (unlikely(dma_mapping_error(&udev->pdev->dev, udev->l2_buf_map))) {
+		pr_err("unable to map L2 buf memory %d\n", udev->l2_buf_size);
+		__cnic_free_uio_rings(udev);
+		return -ENOMEM;
+	}
+
 	return 0;
 
 }