Message ID | 20190620161240.22738-1-logang@deltatee.com (mailing list archive)
---|---
Series | Removing struct page from P2PDMA
On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> For eons there has been a debate over whether or not to use struct pages for peer-to-peer DMA transactions. Pro-pagers have argued that struct pages are necessary for interacting with existing code like scatterlists or the bio_vecs. Anti-pagers assert that the tracking of the memory is unnecessary and allocating the pages is a waste of memory. Both viewpoints are valid; however, developers working on GPUs and RDMA tend to be able to do away with struct pages relatively easily

Presumably because they have historically never tried to be interoperable with the block layer or drivers outside graphics and RDMA.

> compared to those wanting to work with NVMe devices through the block layer. So it would be of great value to be able to universally do P2PDMA transactions without the use of struct pages.

Please spell out the value, it is not immediately obvious to me outside of some memory capacity savings.

> Previously, there have been multiple attempts[1][2] to replace struct page usage with pfn_t but this has been unpopular seeing as it creates dangerous edge cases where unsuspecting code might run across pfn_t's they are not ready for.

That's not the conclusion I arrived at, because pfn_t is specifically an opaque type precisely to force "unsuspecting" code to throw compiler assertions. Instead pfn_t was dealt its death blow here:

https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/

...and I think that feedback also reads on this proposal.

> Currently, we have P2PDMA using struct pages through the block layer and the dangerous cases are avoided by using a queue flag that indicates support for the special pages.
>
> This RFC proposes a new solution: allow the block layer to take DMA addresses directly for queues that indicate support. This will provide a more general path for doing P2PDMA-like requests and will allow us to remove the struct pages that back P2PDMA memory, thus paving the way to build a more uniform P2PDMA ecosystem.

My primary concern with this is that it ascribes a level of generality that just isn't there for peer-to-peer DMA operations. "Peer" addresses are not "DMA" addresses, and the rules about what can and can't do peer-DMA are not generically known to the block layer. At least with a side object there's a chance to describe / recall those restrictions as these things get passed around the I/O stack, but an undecorated "DMA" address passed through the block layer with no other benefit to any subsystem besides RDMA does not feel like it advances the state of the art.

Again, what are the benefits of plumbing this RDMA special case?
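[Editorial illustration, not part of the original thread: a minimal sketch of the kind of interface the cover letter describes, i.e. handing a queue that opts in a vector of pre-mapped addresses instead of struct page + offset segments. The name struct dma_vec is illustrative, not necessarily what the RFC patches use.]

	#include <linux/types.h>

	/*
	 * Hypothetical segment descriptor: the address is already DMA mapped
	 * for the device behind the target queue, so the block layer never
	 * needs a struct page for it.  A queue would advertise support with a
	 * flag (much like the existing QUEUE_FLAG_PCI_P2PDMA does for the
	 * special pages) before such segments may be submitted to it.
	 */
	struct dma_vec {
		dma_addr_t	dv_addr;
		unsigned int	dv_len;
	};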
On Thu, Jun 20, 2019 at 11:45:38AM -0700, Dan Williams wrote:

> > Previously, there have been multiple attempts[1][2] to replace struct page usage with pfn_t but this has been unpopular seeing as it creates dangerous edge cases where unsuspecting code might run across pfn_t's they are not ready for.
>
> That's not the conclusion I arrived at because pfn_t is specifically an opaque type precisely to force "unsuspecting" code to throw compiler assertions. Instead pfn_t was dealt its death blow here:
>
> https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/
>
> ...and I think that feedback also reads on this proposal.

I read through Linus's remarks and he seems completely right that anything that touches a filesystem needs a struct page, because FS's rely heavily on that.

It is much less clear to me why a GPU BAR or an NVMe CMB that never touches a filesystem needs a struct page. The best reason I've seen is that it must have struct page because the block layer heavily depends on struct page.

Since that thread was so DAX/pmem centric (and Linus did say he liked the __pfn_t), maybe it is worth checking again, but not for DAX/pmem users?

This P2P is quite distinct from DAX as the struct page* would point to non-cacheable weird memory that few struct page users would even be able to work with, while I understand DAX use cases focused on CPU cache coherent memory, and filesystem involvement.

> My primary concern with this is that it ascribes a level of generality that just isn't there for peer-to-peer dma operations. "Peer" addresses are not "DMA" addresses, and the rules about what can and can't do peer-DMA are not generically known to the block layer.

?? The P2P infrastructure produces a DMA bus address for the initiating device that is absolutely a DMA address. There is some intermediate CPU centric representation, but after mapping it is the same as any other DMA bus address.

The map function can tell if the device pair combination can do p2p or not.

> Again, what are the benefits of plumbing this RDMA special case?

It is not just RDMA, this is interesting for GPU and vfio use cases too. RDMA is just the most complete in-tree user we have today.

ie GPU people would really like to do read() and have P2P transparently happen to on-GPU pages. With GPUs having huge amounts of memory, loading file data into them is really a performance-critical thing.

Jason
On 2019-06-20 12:45 p.m., Dan Williams wrote:

> On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>> For eons there has been a debate over whether or not to use struct pages for peer-to-peer DMA transactions. Pro-pagers have argued that struct pages are necessary for interacting with existing code like scatterlists or the bio_vecs. Anti-pagers assert that the tracking of the memory is unnecessary and allocating the pages is a waste of memory. Both viewpoints are valid; however, developers working on GPUs and RDMA tend to be able to do away with struct pages relatively easily
>
> Presumably because they have historically never tried to be interoperable with the block layer or drivers outside graphics and RDMA.

Yes, but really there are three main sets of users for P2P right now: graphics, RDMA and NVMe. And every time a patch set comes from GPU/RDMA people they don't bother with struct page. I seem to be the only one trying to push P2P with NVMe and it seems to be a losing battle.

> Please spell out the value, it is not immediately obvious to me outside of some memory capacity savings.

There are a few things:

* Have consistency with P2P efforts as most other efforts have been avoiding struct page. Nobody else seems to want pci_p2pdma_add_resource() or any devm_memremap_pages() call.

* Avoid all arch-specific dependencies for P2P. With struct page the IO memory must fit in the linear mapping. This requires some work with RISC-V and I remember some complaints from the powerpc people regarding this. Certainly not all arches will be able to fit the IO region into the linear mapping space.

* Remove a bunch of PCI P2PDMA special case mapping stuff from the block layer and RDMA interface (which I've been hearing complaints over).

* Save the struct page memory that is largely unused (as you note).

>> Previously, there have been multiple attempts[1][2] to replace struct page usage with pfn_t but this has been unpopular seeing as it creates dangerous edge cases where unsuspecting code might run across pfn_t's they are not ready for.
>
> That's not the conclusion I arrived at because pfn_t is specifically an opaque type precisely to force "unsuspecting" code to throw compiler assertions. Instead pfn_t was dealt its death blow here:
>
> https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/

Ok, well yes the special pages are what we've done for P2PDMA today. But I don't think Linus's criticism really applies to what's in this RFC. For starters, P2PDMA doesn't, and never has, used struct page to look up the reference count. PCI BARs have no relation to the cache so there's no need to serialize their access, but this can be done before/after the DMA addresses are submitted to the block/rdma layer if required. In fact, the only thing the struct page is used for in the current P2PDMA implementation is a single flag indicating it's special and needs to be mapped in a special way.

> My primary concern with this is that it ascribes a level of generality that just isn't there for peer-to-peer dma operations. "Peer" addresses are not "DMA" addresses, and the rules about what can and can't do peer-DMA are not generically known to the block layer.

Correct, but I don't think we should teach the block layer about these rules. In the current code, the rules are enforced outside the block layer before the bios are submitted and this patch set doesn't change that.
The driver orchestrating P2P will always have to check the rules and derive addresses from them (as appropriate). With the RFC the block layer then doesn't have to care and can just handle the DMA addresses directly. > At least with a side object there's a chance to describe / recall those > restrictions as these things get passed around the I/O stack, but an > undecorated "DMA" address passed through the block layer with no other > benefit to any subsystem besides RDMA does not feel like it advances > the state of the art. > > Again, what are the benefits of plumbing this RDMA special case? Because I don't think it is an RDMA special case. Logan
On Thu, Jun 20, 2019 at 12:34 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Jun 20, 2019 at 11:45:38AM -0700, Dan Williams wrote: > > > > Previously, there have been multiple attempts[1][2] to replace > > > struct page usage with pfn_t but this has been unpopular seeing > > > it creates dangerous edge cases where unsuspecting code might > > > run accross pfn_t's they are not ready for. > > > > That's not the conclusion I arrived at because pfn_t is specifically > > an opaque type precisely to force "unsuspecting" code to throw > > compiler assertions. Instead pfn_t was dealt its death blow here: > > > > https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/ > > > > ...and I think that feedback also reads on this proposal. > > I read through Linus's remarks and it he seems completely right that > anything that touches a filesystem needs a struct page, because FS's > rely heavily on that. > > It is much less clear to me why a GPU BAR or a NVME CMB that never > touches a filesystem needs a struct page.. The best reason I've seen > is that it must have struct page because the block layer heavily > depends on struct page. > > Since that thread was so DAX/pmem centric (and Linus did say he liked > the __pfn_t), maybe it is worth checking again, but not for DAX/pmem > users? > > This P2P is quite distinct from DAX as the struct page* would point to > non-cacheable weird memory that few struct page users would even be > able to work with, while I understand DAX use cases focused on CPU > cache coherent memory, and filesystem involvement. What I'm poking at is whether this block layer capability can pick up users outside of RDMA, more on this below... > > > My primary concern with this is that ascribes a level of generality > > that just isn't there for peer-to-peer dma operations. "Peer" > > addresses are not "DMA" addresses, and the rules about what can and > > can't do peer-DMA are not generically known to the block layer. > > ?? The P2P infrastructure produces a DMA bus address for the > initiating device that is is absolutely a DMA address. There is some > intermediate CPU centric representation, but after mapping it is the > same as any other DMA bus address. Right, this goes back to the confusion caused by the hardware / bus / address that a dma-engine would consume directly, and Linux "DMA" address as a device-specific translation of host memory. Is the block layer representation of this address going to go through a peer / "bus" address translation when it reaches the RDMA driver? In other words if we tried to use this facility with other drivers how would the driver know it was passed a traditional Linux DMA address, vs a peer bus address that the device may not be able to handle? > The map function can tell if the device pair combination can do p2p or > not. Ok, if this map step is still there then reduce a significant portion of my concern and it becomes a quibble about the naming and how a non-RDMA device driver might figure out if it was handled an address it can't handle. > > > Again, what are the benefits of plumbing this RDMA special case? > > It is not just RDMA, this is interesting for GPU and vfio use cases > too. RDMA is just the most complete in-tree user we have today. > > ie GPU people wouuld really like to do read() and have P2P > transparently happen to on-GPU pages. With GPUs having huge amounts of > memory loading file data into them is really a performance critical > thing. 
A direct-i/o read(2) into a page-less GPU mapping? Through a regular file or a device special file?
On 2019-06-20 2:18 p.m., Dan Williams wrote: >> Since that thread was so DAX/pmem centric (and Linus did say he liked >> the __pfn_t), maybe it is worth checking again, but not for DAX/pmem >> users? >> >> This P2P is quite distinct from DAX as the struct page* would point to >> non-cacheable weird memory that few struct page users would even be >> able to work with, while I understand DAX use cases focused on CPU >> cache coherent memory, and filesystem involvement. > > What I'm poking at is whether this block layer capability can pick up > users outside of RDMA, more on this below... I assume you mean outside of P2PDMA.... This new block layer capability is more likely to pick up additional users compared to the existing block layer changes that are *very* specific to PCI P2PDMA. I also have (probably significantly controversial) plans to use this to allow P2P through user space with O_DIRECT using an idea Jerome had in a previous patch set that was discussed a bit informally at LSF/MM this year. But that's a whole other RFC and requires a bunch of work I haven't done yet. >> >>> My primary concern with this is that ascribes a level of generality >>> that just isn't there for peer-to-peer dma operations. "Peer" >>> addresses are not "DMA" addresses, and the rules about what can and >>> can't do peer-DMA are not generically known to the block layer. >> >> ?? The P2P infrastructure produces a DMA bus address for the >> initiating device that is is absolutely a DMA address. There is some >> intermediate CPU centric representation, but after mapping it is the >> same as any other DMA bus address. > > Right, this goes back to the confusion caused by the hardware / bus / > address that a dma-engine would consume directly, and Linux "DMA" > address as a device-specific translation of host memory. > > Is the block layer representation of this address going to go through > a peer / "bus" address translation when it reaches the RDMA driver? In > other words if we tried to use this facility with other drivers how > would the driver know it was passed a traditional Linux DMA address, > vs a peer bus address that the device may not be able to handle? The idea is that the driver doesn't need to know. There's no distinction between a Linux DMA address and a peer bus address. They are both used for the same purpose: to program into a DMA engine. If the device cannot handle such a DMA address then it shouldn't indicate support for this feature or the P2PDMA layer needs a way to detect this. Really, this property depends more on the bus than the device and that's what all the P2PDMA code in the PCI tree handles. >> The map function can tell if the device pair combination can do p2p or >> not. > > Ok, if this map step is still there then reduce a significant portion > of my concern and it becomes a quibble about the naming and how a > non-RDMA device driver might figure out if it was handled an address > it can't handle. Yes, there will always be a map step, but it should be done by the orchestrator because it requires both devices (the client and the provider) and the block layer really should not know about both devices. In this RFC, the map step is kind of hidden but would probably come back in the future. It's currently a call to pci_p2pmem_virt_to_bus() but would eventually need to be a pci_p2pmem_map_resource() or similar which takes a pointer to the pci_dev provider and the struct device client doing the mapping. Logan
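[Editorial illustration, not part of the original thread: a sketch of the shape the mapping step Logan describes above could take. pci_p2pmem_map_resource() does not exist in this form; the point is only that both the providing pci_dev and the client device doing the DMA are known at the time the dma_addr_t is produced.]

	#include <linux/pci.h>
	#include <linux/dma-mapping.h>

	/*
	 * Hypothetical mapping step: given the pci_dev exporting the p2p
	 * memory and the device that will DMA to/from it, produce an address
	 * usable by that client (a PCI bus address when both sit behind a
	 * common switch, or an IOMMU-mapped address when the transfer has to
	 * cross the root complex).
	 */
	dma_addr_t pci_p2pmem_map_resource(struct pci_dev *provider,
					   struct device *client,
					   void *addr, size_t len,
					   enum dma_data_direction dir);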
On Thu, Jun 20, 2019 at 12:35 PM Logan Gunthorpe <logang@deltatee.com> wrote: > > > > On 2019-06-20 12:45 p.m., Dan Williams wrote: > > On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote: > >> > >> For eons there has been a debate over whether or not to use > >> struct pages for peer-to-peer DMA transactions. Pro-pagers have > >> argued that struct pages are necessary for interacting with > >> existing code like scatterlists or the bio_vecs. Anti-pagers > >> assert that the tracking of the memory is unecessary and > >> allocating the pages is a waste of memory. Both viewpoints are > >> valid, however developers working on GPUs and RDMA tend to be > >> able to do away with struct pages relatively easily > > > > Presumably because they have historically never tried to be > > inter-operable with the block layer or drivers outside graphics and > > RDMA. > > Yes, but really there are three main sets of users for P2P right now: > graphics, RDMA and NVMe. And every time a patch set comes from GPU/RDMA > people they don't bother with struct page. I seem to be the only one > trying to push P2P with NVMe and it seems to be a losing battle. > > > Please spell out the value, it is not immediately obvious to me > > outside of some memory capacity savings. > > There are a few things: > > * Have consistency with P2P efforts as most other efforts have been > avoiding struct page. Nobody else seems to want > pci_p2pdma_add_resource() or any devm_memremap_pages() call. > > * Avoid all arch-specific dependencies for P2P. With struct page the IO > memory must fit in the linear mapping. This requires some work with > RISC-V and I remember some complaints from the powerpc people regarding > this. Certainly not all arches will be able to fit the IO region into > the linear mapping space. > > * Remove a bunch of PCI P2PDMA special case mapping stuff from the block > layer and RDMA interface (which I've been hearing complaints over). This seems to be the most salient point. I was missing the fact that this replaces custom hacks and "special" pages with an explicit "just pass this pre-mapped address down the stack". It's functionality that might plausibly be used outside of p2p, as long as the driver can assert that it never needs to touch the data with the cpu before handing it off to a dma-engine.
On 2019-06-20 5:40 p.m., Dan Williams wrote: > This seems to be the most salient point. I was missing the fact that > this replaces custom hacks and "special" pages with an explicit "just > pass this pre-mapped address down the stack". It's functionality that > might plausibly be used outside of p2p, as long as the driver can > assert that it never needs to touch the data with the cpu before > handing it off to a dma-engine. Yup, that's a good way to put it. If I resend this patchset, I'll include wording like yours in the cover letter. Logan
On Thu, Jun 20, 2019 at 01:18:13PM -0700, Dan Williams wrote:

> > This P2P is quite distinct from DAX as the struct page* would point to non-cacheable weird memory that few struct page users would even be able to work with, while I understand DAX use cases focused on CPU cache coherent memory, and filesystem involvement.
>
> What I'm poking at is whether this block layer capability can pick up users outside of RDMA, more on this below...

The generic capability is to do a transfer through the block layer and scatter/gather the resulting data to some PCIe BAR memory. Currently the block layer can only scatter/gather data into CPU cache coherent memory.

We know of several useful places to put PCIe BAR memory already:
- On a GPU (or FPGA, accelerator, etc), ie the GB's of GPU private memory that is standard these days.
- On an NVMe CMB. This lets the NVMe drive avoid DMA entirely.
- On an RDMA NIC. Mellanox NICs have a small amount of BAR memory that can be used like a CMB and avoids a DMA.

RDMA doesn't really get so involved here, except that RDMA is often the preferred way to source/sink the data buffers after the block layer has scatter/gathered to them. (And of course RDMA is often for a block driver, ie NVMe over fabrics.)

> > > My primary concern with this is that it ascribes a level of generality that just isn't there for peer-to-peer dma operations. "Peer" addresses are not "DMA" addresses, and the rules about what can and can't do peer-DMA are not generically known to the block layer.
> >
> > ?? The P2P infrastructure produces a DMA bus address for the initiating device that is absolutely a DMA address. There is some intermediate CPU centric representation, but after mapping it is the same as any other DMA bus address.
>
> Right, this goes back to the confusion caused by the hardware / bus / address that a dma-engine would consume directly, and Linux "DMA" address as a device-specific translation of host memory.

I don't think there is a confusion :) Logan explained it, the dma_addr_t is always the thing you program into the DMA engine of the device it was created for, and this changes nothing about that.

Think of the dma vec as the same as a dma mapped SGL, just with no available struct page.

> Is the block layer representation of this address going to go through a peer / "bus" address translation when it reaches the RDMA driver?

No, it is just like any other dma mapped SGL, it is ready to go for the device it was mapped for, and can be used for nothing other than programming DMA on that device.

> > ie GPU people would really like to do read() and have P2P transparently happen to on-GPU pages. With GPUs having huge amounts of memory, loading file data into them is really a performance critical thing.
>
> A direct-i/o read(2) into a page-less GPU mapping?

The interesting case is probably an O_DIRECT read into a DEVICE_PRIVATE page owned by the GPU driver and mmapped into the process calling read(). The GPU driver can dynamically arrange for that DEVICE_PRIVATE page to be linked to P2P-targetable BAR memory so the HW is capable of a direct CPU bypass transfer from the underlying block device (ie NVMe or RDMA) to the GPU.

One way to approach this problem is to use this new dma_addr path in the block layer. Another way is to feed the DEVICE_PRIVATE pages into the block layer and have it DMA map them to a P2P address.
In either case we have a situation where the block layer cannot touch the target struct page buffers with the CPU because there is no cache coherent CPU mapping for them, and we have to create a CPU clean path in the block layer. At best you could do memcpy to/from on these things, but if a GPU is involved even that is incredibly inefficient. The GPU can do the memcpy with DMA much faster than a memcpy_to/from_io. Jason
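[Editorial illustration, not part of the original thread: the userspace view of the GPU case Jason sketches, assuming gpu_buf comes from a driver-specific mmap() of GPU/BAR-backed memory and that buffer and length meet O_DIRECT's alignment requirements. The file path and sizes are made up.]

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <unistd.h>

	/* gpu_buf: pointer previously obtained by mmap()ing GPU memory via
	 * the GPU driver; with P2P in place the data can move from the block
	 * device to the GPU without bouncing through host RAM. */
	static ssize_t load_into_gpu(void *gpu_buf, size_t len)
	{
		int fd = open("/data/model.bin", O_RDONLY | O_DIRECT);
		ssize_t n;

		if (fd < 0)
			return -1;
		n = pread(fd, gpu_buf, len, 0);
		close(fd);
		return n;
	}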
On Fri, Jun 21, 2019 at 10:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Thu, Jun 20, 2019 at 01:18:13PM -0700, Dan Williams wrote: > > > > This P2P is quite distinct from DAX as the struct page* would point to > > > non-cacheable weird memory that few struct page users would even be > > > able to work with, while I understand DAX use cases focused on CPU > > > cache coherent memory, and filesystem involvement. > > > > What I'm poking at is whether this block layer capability can pick up > > users outside of RDMA, more on this below... > > The generic capability is to do a transfer through the block layer and > scatter/gather the resulting data to some PCIe BAR memory. Currently > the block layer can only scatter/gather data into CPU cache coherent > memory. > > We know of several useful places to put PCIe BAR memory already: > - On a GPU (or FPGA, acclerator, etc), ie the GB's of GPU private > memory that is standard these days. > - On a NVMe CMB. This lets the NVMe drive avoid DMA entirely > - On a RDMA NIC. Mellanox NICs have a small amount of BAR memory that > can be used like a CMB and avoids a DMA > > RDMA doesn't really get so involved here, except that RDMA is often > the prefered way to source/sink the data buffers after the block layer has > scatter/gathered to them. (and of course RDMA is often for a block > driver, ie NMVe over fabrics) > > > > > My primary concern with this is that ascribes a level of generality > > > > that just isn't there for peer-to-peer dma operations. "Peer" > > > > addresses are not "DMA" addresses, and the rules about what can and > > > > can't do peer-DMA are not generically known to the block layer. > > > > > > ?? The P2P infrastructure produces a DMA bus address for the > > > initiating device that is is absolutely a DMA address. There is some > > > intermediate CPU centric representation, but after mapping it is the > > > same as any other DMA bus address. > > > > Right, this goes back to the confusion caused by the hardware / bus / > > address that a dma-engine would consume directly, and Linux "DMA" > > address as a device-specific translation of host memory. > > I don't think there is a confusion :) Logan explained it, the > dma_addr_t is always the thing you program into the DMA engine of the > device it was created for, and this changes nothing about that. Yup, Logan and I already settled that point on our last exchange and offered to make that clearer in the changelog.
This is not going to fly.

For one, passing a dma_addr_t through the block layer is a layering violation, and one that I think will also bite us in practice. The host physical to PCIe bus address mapping can have offsets, and those offsets absolutely can be different for different root ports. So with your caller-generated dma_addr_t everything works fine with a switched setup as the one you are probably testing on, but on a sufficiently complicated setup with multiple root ports it can break.

Also duplicating the whole block I/O stack, including hooks all over the fast path, is pretty much a no-go.

I've been pondering for a while if we wouldn't be better off just passing a phys_addr_t + len instead of the page, offset, len tuple in the bio_vec, though. If you look at the normal I/O path, here is what we normally do:

- we get a page as input, either because we have it at hand (e.g. from the page cache) or from get_user_pages (which actually calculates it from a pfn in the page tables)
- once in the bio all the merging decisions are based on the physical address, so we have to convert it to the physical address there, potentially multiple times
- then dma mapping all works off the physical address, which it gets from the page at the start
- then only the dma address is used for the I/O
- on I/O completion we often but not always need the page again. In the direct I/O case for reference counting and dirty status, in the file system also for things like marking the page uptodate

So if we move to a phys_addr_t we'd need to go back to the page at least once. But because of how the merging works we really only need to do it once per segment, as we can just do pointer arithmetic to get the following pages. As we generally go at least once from a physical address to a page in the merging code, even a relatively expensive vmem_map lookup shouldn't be too bad. Even more so given that the super hot path (small blkdev direct I/O) can actually trivially cache the affected pages as well.

Linus kinda hates the pfn approach, but part of that was really that it was proposed for file system data, which we all found out really can't work as-is without pages the hard way. Another part probably was potential performance issues, but between the few page lookups, and the fact that using a single phys_addr_t instead of pfn/page + offset should avoid quite a few calculations, performance should not actually be affected, although we'll have to be careful to actually verify that.
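[Editorial illustration, not part of the original thread: a minimal sketch of the phys_addr_t-based bio_vec Christoph outlines above, and of the "go back to the page once per segment" step. struct phys_vec is illustrative, not a real kernel structure.]

	#include <linux/mm.h>
	#include <linux/pfn.h>

	/* Hypothetical segment: physical address + length, no struct page. */
	struct phys_vec {
		phys_addr_t	paddr;
		unsigned long	len;
	};

	/*
	 * Recover the first page of a segment when completion handling needs
	 * it (reference counting, dirtying, marking uptodate); the following
	 * pages are just pointer arithmetic from here.
	 */
	static inline struct page *phys_vec_page(const struct phys_vec *pv)
	{
		return pfn_to_page(PHYS_PFN(pv->paddr));
	}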
On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:

> > My primary concern with this is that it ascribes a level of generality that just isn't there for peer-to-peer dma operations. "Peer" addresses are not "DMA" addresses, and the rules about what can and can't do peer-DMA are not generically known to the block layer.
>
> ?? The P2P infrastructure produces a DMA bus address for the initiating device that is absolutely a DMA address. There is some intermediate CPU centric representation, but after mapping it is the same as any other DMA bus address.
>
> The map function can tell if the device pair combination can do p2p or not.

At the PCIe level there is no such thing as a DMA address, it all is bus address with MMIO and DMA in the same address space (without that P2P would have no chance of actually working, obviously). But that bus address space is different per "bus" (which would be a root port in PCIe), and we need to be careful about that.
On Mon, Jun 24, 2019 at 09:31:26AM +0200, Christoph Hellwig wrote:

> On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
> > > My primary concern with this is that it ascribes a level of generality that just isn't there for peer-to-peer dma operations. "Peer" addresses are not "DMA" addresses, and the rules about what can and can't do peer-DMA are not generically known to the block layer.
> >
> > ?? The P2P infrastructure produces a DMA bus address for the initiating device that is absolutely a DMA address. There is some intermediate CPU centric representation, but after mapping it is the same as any other DMA bus address.
> >
> > The map function can tell if the device pair combination can do p2p or not.
>
> At the PCIe level there is no such thing as a DMA address, it all is bus address with MMIO and DMA in the same address space (without that P2P would have no chance of actually working, obviously). But that bus address space is different per "bus" (which would be a root port in PCIe), and we need to be careful about that.

Sure, that is how dma_addr_t is supposed to work - it is always a device specific value that can be used only by the device that it was created for, and different devices could have different dma_addr_t values for the same memory.

So when Logan goes and puts dma_addr_t into the block stack he must also invert things so that the DMA map happens at the start of the process to create the right dma_addr_t early.

I'm not totally clear if this series did that inversion; if it didn't, then it should not be using the dma_addr_t label at all, or referring to anything as a 'dma address', as it is just confusing.

BTW, it is not just an offset, right? It is possible that the IOMMU can generate unique dma_addr_t values for each device?? Simple offset is just something we saw in certain embedded cases, IIRC.

Jason
On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:

> BTW, it is not just an offset, right? It is possible that the IOMMU can generate unique dma_addr_t values for each device?? Simple offset is just something we saw in certain embedded cases, IIRC.

Yes, it could. If we are trying to do P2P between two devices on different root ports and with the IOMMU enabled, we'll generate a new bus address for the BAR on the other side dynamically every time we map.
On Mon, Jun 24, 2019 at 03:50:24PM +0200, Christoph Hellwig wrote:

> On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
> > BTW, it is not just an offset, right? It is possible that the IOMMU can generate unique dma_addr_t values for each device?? Simple offset is just something we saw in certain embedded cases, IIRC.
>
> Yes, it could. If we are trying to do P2P between two devices on different root ports and with the IOMMU enabled, we'll generate a new bus address for the BAR on the other side dynamically every time we map.

Even with the same root port, if ACS is turned on it could behave like this.

It is only a very narrow case where you can take shortcuts with dma_addr_t, and I don't think shortcuts like that are appropriate for the mainline kernel..

Jason
On 2019-06-24 1:27 a.m., Christoph Hellwig wrote: > This is not going to fly. > > For one passing a dma_addr_t through the block layer is a layering > violation, and one that I think will also bite us in practice. > The host physical to PCIe bus address mapping can have offsets, and > those offsets absolutely can be different for differnet root ports. > So with your caller generated dma_addr_t everything works fine with > a switched setup as the one you are probably testing on, but on a > sufficiently complicated setup with multiple root ports it can break. I don't follow this argument. Yes, I understand PCI Bus offsets and yes I understand that they only apply beyond the bus they're working with. But this isn't *that* complicated and it should be the responsibility of the P2PDMA code to sort out and provide a dma_addr_t for. The dma_addr_t that's passed through the block layer could be a bus address or it could be the result of a dma_map_* request (if the transaction is found to go through an RC) depending on the requirements of the devices being used. > Also duplicating the whole block I/O stack, including hooks all over > the fast path is pretty much a no-go. There was very little duplicate code in the patch set. (Really just the mapping code). There are a few hooks, but in practice not that many if we ignore the WARN_ONs. We might be able to work to reduce this further. The main hooks are: when we skip bouncing, when we skip integrity prep, when we split, and when we map. And the patchset drops the PCI_P2PDMA hook when we map. So we're talking about maybe three or four extra ifs that would likely normally be fast due to the branch predictor. > I've been pondering for a while if we wouldn't be better off just > passing a phys_addr_t + len instead of the page, offset, len tuple > in the bio_vec, though. If you look at the normal I/O path here > is what we normally do: > > - we get a page as input, either because we have it at hand (e.g. > from the page cache) or from get_user_pages (which actually caculates > it from a pfn in the page tables) > - once in the bio all the merging decisions are based on the physical > address, so we have to convert it to the physical address there, > potentially multiple times > - then dma mapping all works off the physical address, which it gets > from the page at the start > - then only the dma address is used for the I/O > - on I/O completion we often but not always need the page again. In > the direct I/O case for reference counting and dirty status, in the > file system also for things like marking the page uptodate > > So if we move to a phys_addr_t we'd need to go back to the page at least > once. But because of how the merging works we really only need to do > it once per segment, as we can just do pointer arithmerics do get the > following pages. As we generally go at least once from a physical > address to a page in the merging code even a relatively expensive vmem_map > looks shouldn't be too bad. Even more so given that the super hot path > (small blkdev direct I/O) can actually trivially cache the affected pages > as well. I've always wondered why it wasn't done this way. Passing around a page pointer *and* an offset always seemed less efficient than just a physical address. If we did do this, the proposed dma_addr_t and phys_addr_t paths through the block layer could be a lot more similar as things like the split calculation could work on either address type. 
We'd just have to prevent bouncing and integrity and have a hook to change how it's mapped.

> Linus kinda hates the pfn approach, but part of that was really that it was proposed for file system data, which we all found out really can't work as-is without pages the hard way. Another part probably was potential performance issues, but between the few page lookups, and the fact that using a single phys_addr_t instead of pfn/page + offset should avoid quite a few calculations, performance should not actually be affected, although we'll have to be careful to actually verify that.

Yes, I'd agree that removing the offset should make things simpler. But that requires changing a lot of stuff and doesn't really help what I'm trying to do.

Logan
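[Editorial illustration, not part of the original thread: the kind of single-flag fast-path test being counted in the exchange above. REQ_DMA_ADDR is a hypothetical flag name, used only to make the shape of such a hook concrete.]

	#include <linux/blk_types.h>

	#define REQ_DMA_ADDR	(1ULL << 30)	/* hypothetical bio flag */

	/*
	 * Bios carrying pre-mapped addresses would skip bouncing, integrity
	 * prep and the normal page-based DMA mapping; everything else takes
	 * the usual path.
	 */
	static inline bool bio_has_dma_addrs(const struct bio *bio)
	{
		return bio->bi_opf & REQ_DMA_ADDR;
	}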
On 2019-06-24 7:46 a.m., Jason Gunthorpe wrote: > On Mon, Jun 24, 2019 at 09:31:26AM +0200, Christoph Hellwig wrote: >> On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote: >>>> My primary concern with this is that ascribes a level of generality >>>> that just isn't there for peer-to-peer dma operations. "Peer" >>>> addresses are not "DMA" addresses, and the rules about what can and >>>> can't do peer-DMA are not generically known to the block layer. >>> >>> ?? The P2P infrastructure produces a DMA bus address for the >>> initiating device that is is absolutely a DMA address. There is some >>> intermediate CPU centric representation, but after mapping it is the >>> same as any other DMA bus address. >>> >>> The map function can tell if the device pair combination can do p2p or >>> not. >> >> At the PCIe level there is no such thing as a DMA address, it all >> is bus address with MMIO and DMA in the same address space (without >> that P2P would have not chance of actually working obviously). But >> that bus address space is different per "bus" (which would be an >> root port in PCIe), and we need to be careful about that. > > Sure, that is how dma_addr_t is supposed to work - it is always a > device specific value that can be used only by the device that it was > created for, and different devices could have different dma_addr_t > values for the same memory. > > So when Logan goes and puts dma_addr_t into the block stack he must > also invert things so that the DMA map happens at the start of the > process to create the right dma_addr_t early. Yes, that's correct. The intent was to invert it so the dma_map could happen at the start of the process so that P2PDMA code could be called with all the information it needs to make it's decision on how to map; without having to hook into the mapping process of every driver that wants to participate. Logan
On 2019-06-24 7:55 a.m., Jason Gunthorpe wrote:

> On Mon, Jun 24, 2019 at 03:50:24PM +0200, Christoph Hellwig wrote:
>> On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
>>> BTW, it is not just an offset, right? It is possible that the IOMMU can generate unique dma_addr_t values for each device?? Simple offset is just something we saw in certain embedded cases, IIRC.
>>
>> Yes, it could. If we are trying to do P2P between two devices on different root ports and with the IOMMU enabled, we'll generate a new bus address for the BAR on the other side dynamically every time we map.
>
> Even with the same root port, if ACS is turned on it could behave like this.

Yup.

> It is only a very narrow case where you can take shortcuts with dma_addr_t, and I don't think shortcuts like that are appropriate for the mainline kernel..

I don't think it's that narrow and it opens up a lot of avenues for system design that people are wanting to go down. If your high speed data path can avoid the root complex and CPU, you can design a system with a much smaller CPU and fewer lanes directed at the CPU.

Logan
On Mon, Jun 24, 2019 at 10:53:38AM -0600, Logan Gunthorpe wrote: > > It is only a very narrow case where you can take shortcuts with > > dma_addr_t, and I don't think shortcuts like are are appropriate for > > the mainline kernel.. > > I don't think it's that narrow and it opens up a lot of avenues for > system design that people are wanting to go. If your high speed data > path can avoid the root complex and CPU, you can design a system which a > much smaller CPU and fewer lanes directed at the CPU. I mean the shortcut that something generates dma_addr_t for Device A and then passes it to Device B - that is too hacky for mainline. Sounded like this series does generate the dma_addr for the correct device.. Jason
On 2019-06-24 12:16 p.m., Jason Gunthorpe wrote: > On Mon, Jun 24, 2019 at 10:53:38AM -0600, Logan Gunthorpe wrote: >>> It is only a very narrow case where you can take shortcuts with >>> dma_addr_t, and I don't think shortcuts like are are appropriate for >>> the mainline kernel.. >> >> I don't think it's that narrow and it opens up a lot of avenues for >> system design that people are wanting to go. If your high speed data >> path can avoid the root complex and CPU, you can design a system which a >> much smaller CPU and fewer lanes directed at the CPU. > > I mean the shortcut that something generates dma_addr_t for Device A > and then passes it to Device B - that is too hacky for mainline. Oh, that's not a shortcut. It's completely invalid and not likely to work in any case. If you're mapping something you have to pass the device that the dma_addr_t is being programmed into. > Sounded like this series does generate the dma_addr for the correct > device.. This series doesn't generate any DMA addresses with dma_map(). The current p2pdma code ensures everything is behind the same root port and only uses the pci bus address. This is valid and correct, but yes it's something to expand upon. I'll be doing some work shortly to add transactions that go through the IOMMU and calls dma_map_* when appropriate. Logan
On Mon, Jun 24, 2019 at 12:28:33PM -0600, Logan Gunthorpe wrote: > > Sounded like this series does generate the dma_addr for the correct > > device.. > > This series doesn't generate any DMA addresses with dma_map(). The > current p2pdma code ensures everything is behind the same root port and > only uses the pci bus address. This is valid and correct, but yes it's > something to expand upon. I think if you do this it still has to be presented as the same API like dma_map that takes in the target device * and produces the device specific dma_addr_t Otherwise this whole thing is confusing and looks like *all* of it can only work under the switch assumption Jason
On 2019-06-24 12:54 p.m., Jason Gunthorpe wrote: > On Mon, Jun 24, 2019 at 12:28:33PM -0600, Logan Gunthorpe wrote: > >>> Sounded like this series does generate the dma_addr for the correct >>> device.. >> >> This series doesn't generate any DMA addresses with dma_map(). The >> current p2pdma code ensures everything is behind the same root port and >> only uses the pci bus address. This is valid and correct, but yes it's >> something to expand upon. > > I think if you do this it still has to be presented as the same API > like dma_map that takes in the target device * and produces the device > specific dma_addr_t Yes, once we consider the case where it can go through the root complex, we will need an API similar to dma_map(). We got rid of that API because it wasn't yet required or used by anything and, per our best practices, we don't add features that aren't used as that is more confusing for people reading/reworking the code. > Otherwise this whole thing is confusing and looks like *all* of it can > only work under the switch assumption Hopefully it'll be clearer once we do the work to map for going through the root complex. It's not that confusing to me. But it's all orthogonal to the dma_addr_t through the block layer concept. Logan
On Mon, Jun 24, 2019 at 10:10:16AM -0600, Logan Gunthorpe wrote: > Yes, that's correct. The intent was to invert it so the dma_map could > happen at the start of the process so that P2PDMA code could be called > with all the information it needs to make it's decision on how to map; > without having to hook into the mapping process of every driver that > wants to participate. And that just isn't how things work in layering. We need to keep generating the dma addresses in the driver in the receiving end, as there are all kinds of interesting ideas how we do that. E.g. for the Mellanox NICs addressing their own bars is not done by PCIe bus addresses but by relative offsets. And while NVMe has refused to go down that route in the current band aid fix for CMB addressing I suspect it will sooner or later have to do the same to deal with the addressing problems in a multiple PASID world.
On Mon, Jun 24, 2019 at 10:07:56AM -0600, Logan Gunthorpe wrote: > > For one passing a dma_addr_t through the block layer is a layering > > violation, and one that I think will also bite us in practice. > > The host physical to PCIe bus address mapping can have offsets, and > > those offsets absolutely can be different for differnet root ports. > > So with your caller generated dma_addr_t everything works fine with > > a switched setup as the one you are probably testing on, but on a > > sufficiently complicated setup with multiple root ports it can break. > > I don't follow this argument. Yes, I understand PCI Bus offsets and yes > I understand that they only apply beyond the bus they're working with. > But this isn't *that* complicated and it should be the responsibility of > the P2PDMA code to sort out and provide a dma_addr_t for. The dma_addr_t > that's passed through the block layer could be a bus address or it could > be the result of a dma_map_* request (if the transaction is found to go > through an RC) depending on the requirements of the devices being used. You assume all addressing is done by the PCI bus address. If a device is addressing its own BAR there is no reason to use the PCI bus address, as it might have much more intelligent schemes (usually bar + offset). > > > Also duplicating the whole block I/O stack, including hooks all over > > the fast path is pretty much a no-go. > > There was very little duplicate code in the patch set. (Really just the > mapping code). There are a few hooks, but in practice not that many if > we ignore the WARN_ONs. We might be able to work to reduce this further. > The main hooks are: when we skip bouncing, when we skip integrity prep, > when we split, and when we map. And the patchset drops the PCI_P2PDMA > hook when we map. So we're talking about maybe three or four extra ifs > that would likely normally be fast due to the branch predictor. And all of those add code to the block layer fast path.
On 2019-06-25 1:20 a.m., Christoph Hellwig wrote: > On Mon, Jun 24, 2019 at 10:07:56AM -0600, Logan Gunthorpe wrote: >>> For one passing a dma_addr_t through the block layer is a layering >>> violation, and one that I think will also bite us in practice. >>> The host physical to PCIe bus address mapping can have offsets, and >>> those offsets absolutely can be different for differnet root ports. >>> So with your caller generated dma_addr_t everything works fine with >>> a switched setup as the one you are probably testing on, but on a >>> sufficiently complicated setup with multiple root ports it can break. >> >> I don't follow this argument. Yes, I understand PCI Bus offsets and yes >> I understand that they only apply beyond the bus they're working with. >> But this isn't *that* complicated and it should be the responsibility of >> the P2PDMA code to sort out and provide a dma_addr_t for. The dma_addr_t >> that's passed through the block layer could be a bus address or it could >> be the result of a dma_map_* request (if the transaction is found to go >> through an RC) depending on the requirements of the devices being used. > > You assume all addressing is done by the PCI bus address. If a device > is addressing its own BAR there is no reason to use the PCI bus address, > as it might have much more intelligent schemes (usually bar + offset). Yes, that will be a bit tricky regardless of what we do. >>> Also duplicating the whole block I/O stack, including hooks all over >>> the fast path is pretty much a no-go. >> >> There was very little duplicate code in the patch set. (Really just the >> mapping code). There are a few hooks, but in practice not that many if >> we ignore the WARN_ONs. We might be able to work to reduce this further. >> The main hooks are: when we skip bouncing, when we skip integrity prep, >> when we split, and when we map. And the patchset drops the PCI_P2PDMA >> hook when we map. So we're talking about maybe three or four extra ifs >> that would likely normally be fast due to the branch predictor. > > And all of those add code to the block layer fast path. If we can't add any ifs to the block layer, there's really nothing we can do. So then we're committed to using struct page for P2P? Logan
On Tue, Jun 25, 2019 at 09:57:52AM -0600, Logan Gunthorpe wrote:

> > You assume all addressing is done by the PCI bus address. If a device is addressing its own BAR there is no reason to use the PCI bus address, as it might have much more intelligent schemes (usually bar + offset).
>
> Yes, that will be a bit tricky regardless of what we do.

At least right now it isn't at all. I've implemented support for a draft NVMe proposal for that, and it basically boils down to this in the p2p path:

	addr = sg_phys(sg);

	if (page->pgmap->dev == ctrl->dev && HAS_RELATIVE_ADDRESSING) {
		addr -= ctrl->cmb_start_addr;

		// set magic flag in the SGL
	} else {
		addr -= pgmap->pci_p2pdma_bus_offset;
	}

Without the pagemap it would require a range compare instead, which isn't all that hard either.

> >>> Also duplicating the whole block I/O stack, including hooks all over the fast path is pretty much a no-go.
> >>
> >> There was very little duplicate code in the patch set. (Really just the mapping code). There are a few hooks, but in practice not that many if we ignore the WARN_ONs. We might be able to work to reduce this further. The main hooks are: when we skip bouncing, when we skip integrity prep, when we split, and when we map. And the patchset drops the PCI_P2PDMA hook when we map. So we're talking about maybe three or four extra ifs that would likely normally be fast due to the branch predictor.
> >
> > And all of those add code to the block layer fast path.
>
> If we can't add any ifs to the block layer, there's really nothing we can do.

That is not what I said. Of course we can. But we'd better have a really good reason, and adding a parallel I/O path violating the high-level model is not one.

> So then we're committed to using struct page for P2P?

Only until we have a significantly better solution. And I think using physical addresses in some form instead of pages is that; adding a parallel path with dma_addr_t is not, it actually is worse than the current code in many respects.
On 2019-06-25 11:01 a.m., Christoph Hellwig wrote: > On Tue, Jun 25, 2019 at 09:57:52AM -0600, Logan Gunthorpe wrote: >>> You assume all addressing is done by the PCI bus address. If a device >>> is addressing its own BAR there is no reason to use the PCI bus address, >>> as it might have much more intelligent schemes (usually bar + offset). >> >> Yes, that will be a bit tricky regardless of what we do. > > At least right now it isn't at all. I've implemented support for > a draft NVMe proposal for that, and it basically boils down to this > in the p2p path: > > addr = sg_phys(sg); > > if (page->pgmap->dev == ctrl->dev && HAS_RELATIVE_ADDRESSING) > addr -= ctrl->cmb_start_addr; > > // set magic flag in the SGL > } else { > addr -= pgmap->pci_p2pdma_bus_offset; > } > > without the pagemap it would require a range compare instead, which > isn't all that hard either. > >>>>> Also duplicating the whole block I/O stack, including hooks all over >>>>> the fast path is pretty much a no-go. >>>> >>>> There was very little duplicate code in the patch set. (Really just the >>>> mapping code). There are a few hooks, but in practice not that many if >>>> we ignore the WARN_ONs. We might be able to work to reduce this further. >>>> The main hooks are: when we skip bouncing, when we skip integrity prep, >>>> when we split, and when we map. And the patchset drops the PCI_P2PDMA >>>> hook when we map. So we're talking about maybe three or four extra ifs >>>> that would likely normally be fast due to the branch predictor. >>> >>> And all of those add code to the block layer fast path. >> >> If we can't add any ifs to the block layer, there's really nothing we >> can do. > > That is not what I said. Of course we can. But we rather have a > really good reason. And adding a parallel I/O path violating the > highlevel model is not one. > >> So then we're committed to using struct page for P2P? > > Only until we have a significantly better soltution. And I think > using physical address in some form instead of pages is that, > adding a parallel path with dma_addr_t is not, it actually is worse > than the current code in many respects. Well whether it's dma_addr_t, phys_addr_t, pfn_t the result isn't all that different. You still need roughly the same 'if' hooks for any backed memory that isn't in the linear mapping and you can't get a kernel mapping for directly. It wouldn't be too hard to do a similar patch set that uses something like phys_addr_t instead and have a request and queue flag for support of non-mappable memory. But you'll end up with very similar 'if' hooks and we'd have to clean up all bio-using drivers that access the struct pages directly. Though, we'd also still have the problem of how to recognize when the address points to P2PDMA and needs to be translated to the bus offset. The map-first inversion was what helped here because the driver submitting the requests had all the information. Though it could be another request flag and indicating non-mappable memory could be a flag group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS. If you think any of the above ideas sound workable I'd be happy to try to code up another prototype. Logan
On Tue, Jun 25, 2019 at 01:54:21PM -0600, Logan Gunthorpe wrote:

> Well whether it's dma_addr_t, phys_addr_t, pfn_t the result isn't all that different. You still need roughly the same 'if' hooks for any backed memory that isn't in the linear mapping and you can't get a kernel mapping for directly.
>
> It wouldn't be too hard to do a similar patch set that uses something like phys_addr_t instead and have a request and queue flag for support of non-mappable memory. But you'll end up with very similar 'if' hooks and we'd have to clean up all bio-using drivers that access the struct pages directly.

We'll need to clean that mess up anyway, and I've been chugging along doing some of that. A lot still assume no highmem, so we need to convert them over to something that kmaps anyway. If we get the abstraction right that will actually help converting over to a better representation.

> Though, we'd also still have the problem of how to recognize when the address points to P2PDMA and needs to be translated to the bus offset. The map-first inversion was what helped here because the driver submitting the requests had all the information. Though it could be another request flag and indicating non-mappable memory could be a flag group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS.

That assumes the request all has the same memory, which is a simplifying assumption. My idea was that we'd have our new bio_vec look like this:

	struct bio_vec {
		phys_addr_t	paddr;	// 64-bit on 64-bit systems
		unsigned long	len;
	};

We have a hole behind len where we could store a flag, preferably optionally based on a P2P or other magic memory types config option, so that 32-bit systems with 32-bit phys_addr_t actually benefit from the smaller and better packing structure.

> If you think any of the above ideas sound workable I'd be happy to try to code up another prototype.

It sounds workable. Some of the first steps are cleanups independent of how the bio_vec is eventually going to look like. That is making the DMA-API internals work on the phys_addr_t, which also unifies the map_resource implementation with map_page. I plan to do that relatively soon. The next is sorting out access to bio data by virtual address. All these need nice kmapping helpers that avoid too much open coding. I was going to look into that next, mostly to kill the block layer bounce buffering code. Similar things will also be needed at the scatterlist level I think. After that we need to do more audits of how bv_page is still used. Something like a bv_phys() helper that does "page_to_phys(bv->bv_page) + bv->bv_offset" might come in handy, for example.
On 2019-06-26 12:57 a.m., Christoph Hellwig wrote: > On Tue, Jun 25, 2019 at 01:54:21PM -0600, Logan Gunthorpe wrote: >> Well whether it's dma_addr_t, phys_addr_t, pfn_t the result isn't all >> that different. You still need roughly the same 'if' hooks for any >> backed memory that isn't in the linear mapping and you can't get a >> kernel mapping for directly. >> >> It wouldn't be too hard to do a similar patch set that uses something >> like phys_addr_t instead and have a request and queue flag for support >> of non-mappable memory. But you'll end up with very similar 'if' hooks >> and we'd have to clean up all bio-using drivers that access the struct >> pages directly. > > We'll need to clean that mess up anyway, and I've been chugging > along doing some of that. A lot still assume no highmem, so we need > to convert them over to something that kmaps anyway. If we get > the abstraction right that will actually help converting over to > a better reprsentation. > >> Though, we'd also still have the problem of how to recognize when the >> address points to P2PDMA and needs to be translated to the bus offset. >> The map-first inversion was what helped here because the driver >> submitting the requests had all the information. Though it could be >> another request flag and indicating non-mappable memory could be a flag >> group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS. > > The assumes the request all has the same memory, which is a simplifing > assuption. My idea was that if had our new bio_vec like this: > > struct bio_vec { > phys_addr_t paddr; // 64-bit on 64-bit systems > unsigned long len; > }; > > we have a hole behind len where we could store flag. Preferably > optionally based on a P2P or other magic memory types config > option so that 32-bit systems with 32-bit phys_addr_t actually > benefit from the smaller and better packing structure. That seems sensible. The one thing that's unclear though is how to get the PCI Bus address when appropriate. Can we pass that in instead of the phys_addr with an appropriate flag? Or will we need to pass the actual physical address and then, at the map step, the driver has to some how lookup the PCI device to figure out the bus offset? >> If you think any of the above ideas sound workable I'd be happy to try >> to code up another prototype. > > Іt sounds workable. To some of the first steps are cleanups independent > of how the bio_vec is eventually going to look like. That is making > the DMA-API internals work on the phys_addr_t, which also unifies the > map_resource implementation with map_page. I plan to do that relatively > soon. The next is sorting out access to bios data by virtual address. > All these need nice kmapping helper that avoid too much open coding. > I was going to look into that next, mostly to kill the block layer > bounce buffering code. Similar things will also be needed at the > scatterlist level I think. After that we need to more audits of > how bv_page is still used. something like a bv_phys() helper that > does "page_to_phys(bv->bv_page) + bv->bv_offset" might come in handy > for example. Ok, I should be able to help with that. When I have a chance I'll try to look at the bv_phys() helper. Logan
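[Editorial illustration, not part of the original thread: the bv_phys() helper in exactly the form Christoph describes above; the final name and location are of course open.]

	#include <linux/bvec.h>
	#include <linux/io.h>

	/* Physical address of the data a bio_vec points at. */
	static inline phys_addr_t bv_phys(const struct bio_vec *bv)
	{
		return page_to_phys(bv->bv_page) + bv->bv_offset;
	}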
On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote: > > we have a hole behind len where we could store flag. Preferably > > optionally based on a P2P or other magic memory types config > > option so that 32-bit systems with 32-bit phys_addr_t actually > > benefit from the smaller and better packing structure. > > That seems sensible. The one thing that's unclear though is how to get > the PCI Bus address when appropriate. Can we pass that in instead of the > phys_addr with an appropriate flag? Or will we need to pass the actual > physical address and then, at the map step, the driver has to some how > lookup the PCI device to figure out the bus offset? I agree with CH, if we go down this path it is a layering violation for the thing injecting bio's into the block stack to know what struct device they egress&dma map on just to be able to do the dma_map up front. So we must be able to go from this new phys_addr_t&flags to some BAR information during dma_map. For instance we could use a small hash table of the upper phys addr bits, or an interval tree, to do the lookup. The bar info would give the exporting struct device and any other info we need to make the iommu mapping. This phys_addr_t seems like a good approach to me as it avoids the struct page overheads and will lets us provide copy from/to bio primitives that could work on BAR memory. I think we can surely use this approach in RDMA as well. Jason
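One way to picture the lookup Jason describes, using the kernel's generic interval tree. The registry and struct p2pdma_bar_info are hypothetical; only the interval_tree_* calls are existing API:

	#include <linux/interval_tree.h>
	#include <linux/pci.h>

	/* Hypothetical registry entry describing one exported BAR range. */
	struct p2pdma_bar_info {
		struct interval_tree_node range;   /* [start, last]; note these
						    * are unsigned long, so a
						    * real version would need
						    * care on 32-bit/PAE */
		struct pci_dev *provider;          /* device exporting the BAR */
		u64 bus_offset;                    /* CPU phys -> PCI bus offset */
	};

	static struct rb_root_cached p2pdma_ranges = RB_ROOT_CACHED;

	/* Map a physical address back to the BAR info that contains it, if any. */
	static struct p2pdma_bar_info *p2pdma_bar_lookup(phys_addr_t addr)
	{
		struct interval_tree_node *node;

		node = interval_tree_iter_first(&p2pdma_ranges, addr, addr);
		return node ? container_of(node, struct p2pdma_bar_info, range)
			    : NULL;
	}

Entries would be registered with interval_tree_insert() when a driver publishes a BAR for P2P use, so the tree only ever holds the handful of exported ranges in the system.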
On Wed, Jun 26, 2019 at 1:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote: > > > we have a hole behind len where we could store flag. Preferably > > > optionally based on a P2P or other magic memory types config > > > option so that 32-bit systems with 32-bit phys_addr_t actually > > > benefit from the smaller and better packing structure. > > > > That seems sensible. The one thing that's unclear though is how to get > > the PCI Bus address when appropriate. Can we pass that in instead of the > > phys_addr with an appropriate flag? Or will we need to pass the actual > > physical address and then, at the map step, the driver has to some how > > lookup the PCI device to figure out the bus offset? > > I agree with CH, if we go down this path it is a layering violation > for the thing injecting bio's into the block stack to know what struct > device they egress&dma map on just to be able to do the dma_map up > front. > > So we must be able to go from this new phys_addr_t&flags to some BAR > information during dma_map. > > For instance we could use a small hash table of the upper phys addr > bits, or an interval tree, to do the lookup. Hmm, that sounds like dev_pagemap without the pages. There's already no requirement that dev_pagemap point to real / present pages (DEVICE_PRIVATE) seems a straightforward extension to use it for helping coordinate phys_addr_t in 'struct bio'. Then Logan's future plans to let userspace coordinate p2p operations could build on PTE_DEVMAP.
On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote: > On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote: >>> we have a hole behind len where we could store flag. Preferably >>> optionally based on a P2P or other magic memory types config >>> option so that 32-bit systems with 32-bit phys_addr_t actually >>> benefit from the smaller and better packing structure. >> >> That seems sensible. The one thing that's unclear though is how to get >> the PCI Bus address when appropriate. Can we pass that in instead of the >> phys_addr with an appropriate flag? Or will we need to pass the actual >> physical address and then, at the map step, the driver has to some how >> lookup the PCI device to figure out the bus offset? > > I agree with CH, if we go down this path it is a layering violation > for the thing injecting bio's into the block stack to know what struct > device they egress&dma map on just to be able to do the dma_map up > front. Not sure I agree with this statement. The p2pdma code already *must* know and access the pci_dev of the dma device ahead of when it submits the IO to know if it's valid to allocate and use P2P memory at all. This is why the submitting driver has a lot of the information needed to map this memory that the mapping driver does not. > So we must be able to go from this new phys_addr_t&flags to some BAR > information during dma_map. > For instance we could use a small hash table of the upper phys addr > bits, or an interval tree, to do the lookup. Yes, if we're going to take a hard stance on this. But using an interval tree (or similar) is a lot more work for the CPU to figure out these mappings that may not be strictly necessary if we could just pass better information down from the submitting driver to the mapping driver. > The bar info would give the exporting struct device and any other info > we need to make the iommu mapping. Well, the IOMMU mapping is the normal thing the mapping driver will always do. We'd really just need the submitting driver to, when appropriate, inform the mapping driver that this is a pci bus address and not to call dma_map_xxx(). Then, for special mappings for the CMB like Christoph is talking about, it's simply a matter of doing a range compare on the PCI Bus address and converting the bus address to a BAR and offset. Logan
On Wed, Jun 26, 2019 at 01:39:01PM -0700, Dan Williams wrote:
> Hmm, that sounds like dev_pagemap without the pages.
Yes, and other page-related overhead. Maybe both ideas can exist in
the pagemap code?
All that is needed here is to map a bar phys_addr_t to some 'bar info'
that helps the mapping.
Jason
On 2019-06-26 2:39 p.m., Dan Williams wrote: > On Wed, Jun 26, 2019 at 1:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: >> >> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote: >>>> we have a hole behind len where we could store flag. Preferably >>>> optionally based on a P2P or other magic memory types config >>>> option so that 32-bit systems with 32-bit phys_addr_t actually >>>> benefit from the smaller and better packing structure. >>> >>> That seems sensible. The one thing that's unclear though is how to get >>> the PCI Bus address when appropriate. Can we pass that in instead of the >>> phys_addr with an appropriate flag? Or will we need to pass the actual >>> physical address and then, at the map step, the driver has to some how >>> lookup the PCI device to figure out the bus offset? >> >> I agree with CH, if we go down this path it is a layering violation >> for the thing injecting bio's into the block stack to know what struct >> device they egress&dma map on just to be able to do the dma_map up >> front. >> >> So we must be able to go from this new phys_addr_t&flags to some BAR >> information during dma_map. >> >> For instance we could use a small hash table of the upper phys addr >> bits, or an interval tree, to do the lookup. > > Hmm, that sounds like dev_pagemap without the pages. Yup, that's why I'd like to avoid it, but IMO it would still be an improvement to use a interval tree over struct pages because without struct page we just have a range and a length and it's relatively easy to check that the whole range belongs to a specific pci_dev. To be correct with the struct page approach we really have to loop through all pages to ensure they all belong to the same pci_dev which is a big pain. > There's already no requirement that dev_pagemap point to real / > present pages (DEVICE_PRIVATE) seems a straightforward extension to > use it for helping coordinate phys_addr_t in 'struct bio'. Then > Logan's future plans to let userspace coordinate p2p operations could > build on PTE_DEVMAP. Well I think the biggest difficulty with struct page for user space is dealing with cases when the struct pages of different types get mixed together (or even struct pages that are all P2P pages but from different PCI devices). We'd have to go through each page and ensure that each type gets it's own bio_vec with appropriate flags. Though really, the whole mixed IO from userspace poses a bunch of problems. I'd prefer to just be able to say that a single IO can be all or nothing P2P memory from a single device. Logan
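As a sketch of the point above: with a range-based registry (reusing the hypothetical p2pdma_bar_lookup() from earlier in the thread), checking that a whole range belongs to a single provider is one lookup rather than a per-page loop:

	/* Hypothetical: does [start, start + len) sit entirely in one BAR? */
	static bool p2pdma_range_single_provider(phys_addr_t start, size_t len,
						 struct pci_dev **provider)
	{
		struct p2pdma_bar_info *info = p2pdma_bar_lookup(start);

		if (!info || start + len - 1 > info->range.last)
			return false;

		*provider = info->provider;
		return true;
	}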
On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote: > > > On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote: > > On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote: > >>> we have a hole behind len where we could store flag. Preferably > >>> optionally based on a P2P or other magic memory types config > >>> option so that 32-bit systems with 32-bit phys_addr_t actually > >>> benefit from the smaller and better packing structure. > >> > >> That seems sensible. The one thing that's unclear though is how to get > >> the PCI Bus address when appropriate. Can we pass that in instead of the > >> phys_addr with an appropriate flag? Or will we need to pass the actual > >> physical address and then, at the map step, the driver has to some how > >> lookup the PCI device to figure out the bus offset? > > > > I agree with CH, if we go down this path it is a layering violation > > for the thing injecting bio's into the block stack to know what struct > > device they egress&dma map on just to be able to do the dma_map up > > front. > > Not sure I agree with this statement. The p2pdma code already *must* > know and access the pci_dev of the dma device ahead of when it submits > the IO to know if it's valid to allocate and use P2P memory at all. I don't think we should make drives do that. What if it got CMB memory on some other device? > > For instance we could use a small hash table of the upper phys addr > > bits, or an interval tree, to do the lookup. > > Yes, if we're going to take a hard stance on this. But using an interval > tree (or similar) is a lot more work for the CPU to figure out these > mappings that may not be strictly necessary if we could just pass better > information down from the submitting driver to the mapping driver. Right, this is coming down to an optimization argument. I think there are very few cases (Basically yours) where the caller will know this info, so we need to support the other cases anyhow. I think with some simple caching this will become negligible for cases you care about Jason
On 2019-06-26 3:00 p.m., Jason Gunthorpe wrote: > On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote: >> >> >> On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote: >>> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote: >>>>> we have a hole behind len where we could store flag. Preferably >>>>> optionally based on a P2P or other magic memory types config >>>>> option so that 32-bit systems with 32-bit phys_addr_t actually >>>>> benefit from the smaller and better packing structure. >>>> >>>> That seems sensible. The one thing that's unclear though is how to get >>>> the PCI Bus address when appropriate. Can we pass that in instead of the >>>> phys_addr with an appropriate flag? Or will we need to pass the actual >>>> physical address and then, at the map step, the driver has to some how >>>> lookup the PCI device to figure out the bus offset? >>> >>> I agree with CH, if we go down this path it is a layering violation >>> for the thing injecting bio's into the block stack to know what struct >>> device they egress&dma map on just to be able to do the dma_map up >>> front. >> >> Not sure I agree with this statement. The p2pdma code already *must* >> know and access the pci_dev of the dma device ahead of when it submits >> the IO to know if it's valid to allocate and use P2P memory at all. > > I don't think we should make drives do that. What if it got CMB memory > on some other device? Huh? A driver submitting P2P requests finds appropriate memory to use based on the DMA device that will be doing the mapping. It *has* to. It doesn't necessarily have control over which P2P provider it might find (ie. it may get CMB memory from a random NVMe device), but it easily knows the NVMe device it got the CMB memory for. Look at the existing code in the nvme target. >>> For instance we could use a small hash table of the upper phys addr >>> bits, or an interval tree, to do the lookup. >> >> Yes, if we're going to take a hard stance on this. But using an interval >> tree (or similar) is a lot more work for the CPU to figure out these >> mappings that may not be strictly necessary if we could just pass better >> information down from the submitting driver to the mapping driver. > > Right, this is coming down to an optimization argument. I think there > are very few cases (Basically yours) where the caller will know this > info, so we need to support the other cases anyhow. I disagree. I think it has to be a common pattern. A driver doing a P2P transaction *must* find some device to obtain memory from (or it may be itself) and check if it is compatible with the device that's going to be mapping the memory or vice versa. So no matter what we do, a driver submitting P2P requests must have access to both the PCI device that's going to be mapping the memory and the device that's providing the memory. > I think with some simple caching this will become negligible for cases > you care about Well *maybe* it will be negligible performance wise, but it's also a lot more complicated, code wise. Tree lookups will always be a lot more expensive than just checking a flag. Logan
On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote: > > I don't think we should make drives do that. What if it got CMB memory > > on some other device? > > Huh? A driver submitting P2P requests finds appropriate memory to use > based on the DMA device that will be doing the mapping. It *has* to. It > doesn't necessarily have control over which P2P provider it might find > (ie. it may get CMB memory from a random NVMe device), but it easily > knows the NVMe device it got the CMB memory for. Look at the existing > code in the nvme target. No, this all thinking about things from the CMB perspective. With CMB you don't care about the BAR location because it is just a temporary buffer. That is a unique use model. Every other case has data residing in BAR memory that can really only reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO against that is run it should succeed, even if that means bounce buffering the IO - as the user has really asked for this transfer to happen. We certainly don't get to generally pick where the data resides before starting the IO, that luxury is only for CMB. > > I think with some simple caching this will become negligible for cases > > you care about > > Well *maybe* it will be negligible performance wise, but it's also a lot > more complicated, code wise. Tree lookups will always be a lot more > expensive than just checking a flag. Interval trees are pretty simple API wise, and if we only populate them with P2P providers you probably find the tree depth is negligible in current systems with one or two P2P providers. Jason
On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote: > > we have a hole behind len where we could store flag. Preferably > > optionally based on a P2P or other magic memory types config > > option so that 32-bit systems with 32-bit phys_addr_t actually > > benefit from the smaller and better packing structure. > > That seems sensible. The one thing that's unclear though is how to get > the PCI Bus address when appropriate. Can we pass that in instead of the > phys_addr with an appropriate flag? Or will we need to pass the actual > physical address and then, at the map step, the driver has to some how > lookup the PCI device to figure out the bus offset? Yes, I think we'll need a lookup mechanism of some kind.
On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote: > > The bar info would give the exporting struct device and any other info > > we need to make the iommu mapping. > > Well, the IOMMU mapping is the normal thing the mapping driver will > always do. We'd really just need the submitting driver to, when > appropriate, inform the mapping driver that this is a pci bus address > and not to call dma_map_xxx(). Then, for special mappings for the CMB > like Christoph is talking about, it's simply a matter of doing a range > compare on the PCI Bus address and converting the bus address to a BAR > and offset. Well, range compare on the physical address. We have a few different options here: (a) a range is normal RAM, DMA mapping works as usual (b) a range is another devices BAR, in which case we need to do a map_resource equivalent (which really just means don't bother with cache flush on non-coherent architectures) and apply any needed offset, fixed or iommu based (c) a range points to a BAR on the acting device. In which case we don't need to DMA map at all, because no dma is happening but just an internal transfer. And depending on the device that might also require a different addressing mode I guess it might make sense to just have a block layer flag that (b) or (c) might be contained in a bio. Then we always look up the data structure, but can still fall back to (a) if nothing was found. That even allows free mixing and matching of memory types, at least as long as they are contained to separate bio_vec segments.
On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote: > On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote: >>> I don't think we should make drives do that. What if it got CMB memory >>> on some other device? >> >> Huh? A driver submitting P2P requests finds appropriate memory to use >> based on the DMA device that will be doing the mapping. It *has* to. It >> doesn't necessarily have control over which P2P provider it might find >> (ie. it may get CMB memory from a random NVMe device), but it easily >> knows the NVMe device it got the CMB memory for. Look at the existing >> code in the nvme target. > > No, this all thinking about things from the CMB perspective. With CMB > you don't care about the BAR location because it is just a temporary > buffer. That is a unique use model. > > Every other case has data residing in BAR memory that can really only > reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO > against that is run it should succeed, even if that means bounce > buffering the IO - as the user has really asked for this transfer to > happen. > > We certainly don't get to generally pick where the data resides before > starting the IO, that luxury is only for CMB. I disagree. If we we're going to implement a "bounce" we'd probably want to do it in two DMA requests. So the GPU/FPGA driver would first decide whether it can do it P2P directly and, if it can't, would want to submit a DMA request copy the data to host memory and then submit an IO normally to the data's final destination. I think it would be a larger layering violation to have the NVMe driver (for example) memcpy data off a GPU's bar during a dma_map step to support this bouncing. And it's even crazier to expect a DMA transfer to be setup in the map step. Logan
On 2019-06-27 3:08 a.m., Christoph Hellwig wrote: > On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote: >>> The bar info would give the exporting struct device and any other info >>> we need to make the iommu mapping. >> >> Well, the IOMMU mapping is the normal thing the mapping driver will >> always do. We'd really just need the submitting driver to, when >> appropriate, inform the mapping driver that this is a pci bus address >> and not to call dma_map_xxx(). Then, for special mappings for the CMB >> like Christoph is talking about, it's simply a matter of doing a range >> compare on the PCI Bus address and converting the bus address to a BAR >> and offset. > > Well, range compare on the physical address. We have a few different > options here: > > (a) a range is normal RAM, DMA mapping works as usual > (b) a range is another devices BAR, in which case we need to do a > map_resource equivalent (which really just means don't bother with > cache flush on non-coherent architectures) and apply any needed > offset, fixed or iommu based Well I would split this into two cases: (b1) ranges in another device's BAR that will pass through the root complex and require a map_resource equivalent and (b2) ranges in another device's bar that don't pass through the root complex and require applying an offset to the bus address. Both require rather different handling and the submitting driver should already know ahead of time what type we have. > (c) a range points to a BAR on the acting device. In which case we > don't need to DMA map at all, because no dma is happening but just an > internal transfer. And depending on the device that might also require > a different addressing mode I think (c) is actually just a special case of (b2). Any device that has a special protocol for addressing the local BAR can just do a range compare on the address to determine if it's local or not. Devices that don't have a special protocol for this would handle both (c) and (b2) the same. > I guess it might make sense to just have a block layer flag that (b) or > (c) might be contained in a bio. Then we always look up the data > structure, but can still fall back to (a) if nothing was found. That > even allows free mixing and matching of memory types, at least as long > as they are contained to separate bio_vec segments. IMO these three cases should be reflected in flags in the bio_vec. We'd probably still need a queue flag to indicate support for mapping these, but a flag on the bio that indicates special cases *might* exist in the bio_vec and the driver has to do extra work to somehow distinguish the three types doesn't seem useful. bio_vec flags also make it easy to support mixing segments from different memory types. Logan
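If the cases were encoded as per-bio_vec flags as suggested, they might look roughly like the following; the names mirror the ones used later in the thread and are illustrative, not existing kernel API:

	/* Hypothetical per-bio_vec mapping hints for the cases above. */
	#define BVEC_MAP_NORMAL		0	 /* (a)  system RAM, dma_map_page() */
	#define BVEC_MAP_RESOURCE	(1 << 0) /* (b1) peer BAR through the host
						  *      bridge, dma_map_resource() */
	#define BVEC_MAP_BUS_ADDR	(1 << 1) /* (b2)/(c) already a PCI bus
						  *      address, no dma_map_*() */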
On Thu, Jun 27, 2019 at 10:09:41AM -0600, Logan Gunthorpe wrote: > > > On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote: > > On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote: > >>> I don't think we should make drives do that. What if it got CMB memory > >>> on some other device? > >> > >> Huh? A driver submitting P2P requests finds appropriate memory to use > >> based on the DMA device that will be doing the mapping. It *has* to. It > >> doesn't necessarily have control over which P2P provider it might find > >> (ie. it may get CMB memory from a random NVMe device), but it easily > >> knows the NVMe device it got the CMB memory for. Look at the existing > >> code in the nvme target. > > > > No, this all thinking about things from the CMB perspective. With CMB > > you don't care about the BAR location because it is just a temporary > > buffer. That is a unique use model. > > > > Every other case has data residing in BAR memory that can really only > > reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO > > against that is run it should succeed, even if that means bounce > > buffering the IO - as the user has really asked for this transfer to > > happen. > > > > We certainly don't get to generally pick where the data resides before > > starting the IO, that luxury is only for CMB. > > I disagree. If we we're going to implement a "bounce" we'd probably want > to do it in two DMA requests. How do you mean? > So the GPU/FPGA driver would first decide whether it can do it P2P > directly and, if it can't, would want to submit a DMA request copy > the data to host memory and then submit an IO normally to the data's > final destination. I don't think a GPU/FPGA driver will be involved, this would enter the block layer through the O_DIRECT path or something generic.. This the general flow I was suggesting to Dan earlier > I think it would be a larger layering violation to have the NVMe driver > (for example) memcpy data off a GPU's bar during a dma_map step to > support this bouncing. And it's even crazier to expect a DMA transfer to > be setup in the map step. Why? Don't we already expect the DMA mapper to handle bouncing for lots of cases, how is this case different? This is the best place to place it to make it shared. Jason
On 2019-06-27 10:35 a.m., Jason Gunthorpe wrote: > On Thu, Jun 27, 2019 at 10:09:41AM -0600, Logan Gunthorpe wrote: >> >> >> On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote: >>> On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote: >>>>> I don't think we should make drives do that. What if it got CMB memory >>>>> on some other device? >>>> >>>> Huh? A driver submitting P2P requests finds appropriate memory to use >>>> based on the DMA device that will be doing the mapping. It *has* to. It >>>> doesn't necessarily have control over which P2P provider it might find >>>> (ie. it may get CMB memory from a random NVMe device), but it easily >>>> knows the NVMe device it got the CMB memory for. Look at the existing >>>> code in the nvme target. >>> >>> No, this all thinking about things from the CMB perspective. With CMB >>> you don't care about the BAR location because it is just a temporary >>> buffer. That is a unique use model. >>> >>> Every other case has data residing in BAR memory that can really only >>> reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO >>> against that is run it should succeed, even if that means bounce >>> buffering the IO - as the user has really asked for this transfer to >>> happen. >>> >>> We certainly don't get to generally pick where the data resides before >>> starting the IO, that luxury is only for CMB. >> >> I disagree. If we we're going to implement a "bounce" we'd probably want >> to do it in two DMA requests. > > How do you mean? > >> So the GPU/FPGA driver would first decide whether it can do it P2P >> directly and, if it can't, would want to submit a DMA request copy >> the data to host memory and then submit an IO normally to the data's >> final destination. > > I don't think a GPU/FPGA driver will be involved, this would enter the > block layer through the O_DIRECT path or something generic.. This the > general flow I was suggesting to Dan earlier I would say the O_DIRECT path has to somehow call into the driver backing the VMA to get an address to appropriate memory (in some way vaguely similar to how we were discussing at LSF/MM). If P2P can't be done at that point, then the provider driver would do the copy to system memory, in the most appropriate way, and return regular pages for O_DIRECT to submit to the block device. >> I think it would be a larger layering violation to have the NVMe driver >> (for example) memcpy data off a GPU's bar during a dma_map step to >> support this bouncing. And it's even crazier to expect a DMA transfer to >> be setup in the map step. > > Why? Don't we already expect the DMA mapper to handle bouncing for > lots of cases, how is this case different? This is the best place to > place it to make it shared. This is different because it's special memory where the DMA mapper can't possibly know the best way to transfer the data. The best way to transfer the data is almost certainly going to be a DMA request handled by the GPU/FPGA. So, one way or another, the GPU/FPGA driver has to be involved. One could argue that the hook to the GPU/FPGA driver could be in the mapping step but then we'd have to do lookups based on an address -- where as the VMA could more easily have a hook back to whatever driver exported it. Logan
On Thu, Jun 27, 2019 at 10:30:42AM -0600, Logan Gunthorpe wrote:
> > (a) a range is normal RAM, DMA mapping works as usual
> > (b) a range is another devices BAR, in which case we need to do a
> > map_resource equivalent (which really just means don't bother with
> > cache flush on non-coherent architectures) and apply any needed
> > offset, fixed or iommu based
>
> Well I would split this into two cases: (b1) ranges in another device's
> BAR that will pass through the root complex and require a map_resource
> equivalent and (b2) ranges in another device's bar that don't pass
> through the root complex and require applying an offset to the bus
> address. Both require rather different handling and the submitting
> driver should already know ahead of time what type we have.

True.

>
> > (c) a range points to a BAR on the acting device. In which case we
> > don't need to DMA map at all, because no dma is happening but just an
> > internal transfer. And depending on the device that might also require
> > a different addressing mode
>
> I think (c) is actually just a special case of (b2). Any device that has
> a special protocol for addressing the local BAR can just do a range
> compare on the address to determine if it's local or not. Devices that
> don't have a special protocol for this would handle both (c) and (b2)
> the same.

It is not. (c) is fundamentally very different as it is not actually an operation that ever goes out to the wire at all, which is why the actual physical address on the wire does not matter at all. Some interfaces like NVMe have designed it in a way that the commands used to do this internal transfer look like (b2), but that is just their (IMHO very questionable) interface design choice, and it produces a whole chain of problems.

> > I guess it might make sense to just have a block layer flag that (b) or
> > (c) might be contained in a bio. Then we always look up the data
> > structure, but can still fall back to (a) if nothing was found. That
> > even allows free mixing and matching of memory types, at least as long
> > as they are contained to separate bio_vec segments.
>
> IMO these three cases should be reflected in flags in the bio_vec. We'd
> probably still need a queue flag to indicate support for mapping these,
> but a flag on the bio that indicates special cases *might* exist in the
> bio_vec and the driver has to do extra work to somehow distinguish the
> three types doesn't seem useful. bio_vec flags also make it easy to
> support mixing segments from different memory types.

So I initially suggested these flags. But without a pgmap we absolutely need a lookup operation to find which phys address ranges map to which device. And once we have that data structure, the only thing we need is a flag saying that we need that information, and everything else can be in the data structure returned from that lookup.
On 2019-06-27 11:00 a.m., Christoph Hellwig wrote:
> It is not. (c) is fundamentally very different as it is not actually
> an operation that ever goes out to the wire at all, which is why the
> actual physical address on the wire does not matter at all.
> Some interfaces like NVMe have designed it in a way that the commands
> used to do this internal transfer look like (b2), but that is just their
> (IMHO very questionable) interface design choice, and it produces a whole
> chain of problems.

From the mapping device's driver's perspective yes, but from the perspective of a submitting driver they would be the same.

>>> I guess it might make sense to just have a block layer flag that (b) or
>>> (c) might be contained in a bio. Then we always look up the data
>>> structure, but can still fall back to (a) if nothing was found. That
>>> even allows free mixing and matching of memory types, at least as long
>>> as they are contained to separate bio_vec segments.
>>
>> IMO these three cases should be reflected in flags in the bio_vec. We'd
>> probably still need a queue flag to indicate support for mapping these,
>> but a flag on the bio that indicates special cases *might* exist in the
>> bio_vec and the driver has to do extra work to somehow distinguish the
>> three types doesn't seem useful. bio_vec flags also make it easy to
>> support mixing segments from different memory types.
>
> So I initially suggested these flags. But without a pgmap we absolutely
> need a lookup operation to find which phys address ranges map to which
> device. And once we have that data structure, the only thing we need
> is a flag saying that we need that information, and everything else
> can be in the data structure returned from that lookup.

Yes, you did suggest them. But what I'm trying to suggest is we don't *necessarily* need the lookup. For demonstration purposes only, a submitting driver could very roughly potentially do:

	struct bio_vec vec;

	dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
	if (dist < 0) {
		/* use regular memory */
		vec.bv_addr = virt_to_phys(kmalloc(...));
		vec.bv_flags = 0;
	} else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
		vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
		vec.bv_flags = BVEC_MAP_RESOURCE;
	} else {
		vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
		vec.bv_flags = BVEC_MAP_BUS_ADDR;
	}

-- And a mapping driver would roughly just do:

	dma_addr_t dma_addr;

	if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
		if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off)) {
			/* case (c) */
			/* program the DMA engine with bar and off */
		} else {
			/* case (b2) */
			dma_addr = vec.bv_addr;
		}
	} else if (vec.bv_flags & BVEC_MAP_RESOURCE) {
		/* case (b1) */
		dma_addr = dma_map_resource(mapping_dev, vec.bv_addr, ...);
	} else {
		/* case (a) */
		dma_addr = dma_map_page(..., phys_to_page(vec.bv_addr), ...);
	}

The real difficulty here is that you'd really want all the above handled by a dma_map_bvec() so it can combine every vector hitting the IOMMU into a single continuous IOVA -- but it's hard to fit case (c) into that equation. So it might be that a dma_map_bvec() handles cases (a), (b1) and (b2) and the mapping driver has to then check each resulting DMA vector for pci_bus_addr_in_bar() while it is programming the DMA engine to deal with case (c).

Logan
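The pci_bus_addr_in_bar() helper in the example above does not exist today. A naive sketch, assuming pci_bus_address() gives the bus-side base of each standard BAR, might be:

	#include <linux/pci.h>

	/* Hypothetical: does @busaddr fall inside one of @pdev's standard BARs? */
	static bool pci_bus_addr_in_bar(struct pci_dev *pdev,
					pci_bus_addr_t busaddr,
					int *bar, u64 *offset)
	{
		int i;

		for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
			pci_bus_addr_t start = pci_bus_address(pdev, i);
			resource_size_t len = pci_resource_len(pdev, i);

			/* 64-bit BARs leave the odd resource empty, skip it. */
			if (!len || busaddr < start || busaddr >= start + len)
				continue;

			*bar = i;
			*offset = busaddr - start;
			return true;
		}
		return false;
	}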
On Thu, Jun 27, 2019 at 10:49:43AM -0600, Logan Gunthorpe wrote: > > I don't think a GPU/FPGA driver will be involved, this would enter the > > block layer through the O_DIRECT path or something generic.. This the > > general flow I was suggesting to Dan earlier > > I would say the O_DIRECT path has to somehow call into the driver > backing the VMA to get an address to appropriate memory (in some way > vaguely similar to how we were discussing at LSF/MM) Maybe, maybe no. For something like VFIO the PTE already has the correct phys_addr_t and we don't need to do anything.. For DEVICE_PRIVATE we need to get the phys_addr_t out - presumably through a new pagemap op? > If P2P can't be done at that point, then the provider driver would > do the copy to system memory, in the most appropriate way, and > return regular pages for O_DIRECT to submit to the block device. That only makes sense for the migratable DEVICE_PRIVATE case, it doesn't help the VFIO-like case, there you'd need to bounce buffer. > >> I think it would be a larger layering violation to have the NVMe driver > >> (for example) memcpy data off a GPU's bar during a dma_map step to > >> support this bouncing. And it's even crazier to expect a DMA transfer to > >> be setup in the map step. > > > > Why? Don't we already expect the DMA mapper to handle bouncing for > > lots of cases, how is this case different? This is the best place to > > place it to make it shared. > > This is different because it's special memory where the DMA mapper > can't possibly know the best way to transfer the data. Why not? If we have a 'bar info' structure that could have data transfer op callbacks, infact, I think we might already have similar callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.. > One could argue that the hook to the GPU/FPGA driver could be in the > mapping step but then we'd have to do lookups based on an address -- > where as the VMA could more easily have a hook back to whatever driver > exported it. The trouble with a VMA hook is that it is only really avaiable when working with the VA, and it is not actually available during GUP, you have to have a GUP-like thing such as hmm_range_snapshot that is specifically VMA based. And it is certainly not available during dma_map. When working with VMA's/etc it seems there are some good reasons to drive things off of the PTE content (either via struct page & pgmap or via phys_addr_t & barmap) I think the best reason to prefer a uniform phys_addr_t is that it does give us the option to copy the data to/from CPU memory. That option goes away as soon as the bio sometimes provides a dma_addr_t. At least for RDMA, we do have some cases (like siw/rxe, hfi) where they sometimes need to do that copy. I suspect the block stack is similar, in the general case. Jason
On Thu, Jun 27, 2019 at 12:00:35PM -0600, Logan Gunthorpe wrote: > > It is not. (c) is fundamentally very different as it is not actually > > an operation that ever goes out to the wire at all, and which is why the > > actual physical address on the wire does not matter at all. > > Some interfaces like NVMe have designed it in a way that it the commands > > used to do this internal transfer look like (b2), but that is just their > > (IMHO very questionable) interface design choice, that produces a whole > > chain of problems. > > >From the mapping device's driver's perspective yes, but from the > perspective of a submitting driver they would be the same. With your dma_addr_t scheme it won't be the same, as you'd need a magic way to generate the internal addressing and stuff it into the dma_addr_t. With a phys_addr_t based scheme they should basically be all the same. > Yes, you did suggest them. But what I'm trying to suggest is we don't > *necessarily* need the lookup. For demonstration purposes only, a > submitting driver could very roughly potentially do: > > struct bio_vec vec; > dist = pci_p2pdma_dist(provider_pdev, mapping_pdev); > if (dist < 0) { > /* use regular memory */ > vec.bv_addr = virt_to_phys(kmalloc(...)); > vec.bv_flags = 0; > } else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) { > vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...); > vec.bv_flags = BVEC_MAP_RESOURCE; > } else { > vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...); > vec.bv_flags = BVEC_MAP_BUS_ADDR; > } That doesn't look too bad, except.. > -- And a mapping driver would roughly just do: > > dma_addr_t dma_addr; > if (vec.bv_flags & BVEC_MAP_BUS_ADDR) { > if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off)) { > /* case (c) */ > /* program the DMA engine with bar and off */ Why bother with that here if we could also let the caller handle that? pci_p2pdma_dist() should be able to trivially find that out based on provider_dev == mapping_dev. > The real difficulty here is that you'd really want all the above handled > by a dma_map_bvec() so it can combine every vector hitting the IOMMU > into a single continuous IOVA -- but it's hard to fit case (c) into that > equation. So it might be that a dma_map_bvec() handles cases (a), (b1) > and (b2) and the mapping driver has to then check each resulting DMA > vector for pci_bus_addr_in_bar() while it is programming the DMA engine > to deal with case (c). I'd do it the other way around. pci_p2pdma_dist is used to find the p2p type. The p2p type is stuff into the bio_vec, and we then: (1) manually check for case (c) in driver for drivers that want to treat it different from (b) (2) we then have a dma mapping wrapper that checks the p2p type and does the right thing for the rest.
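A sketch of the wrapper in (2), using the flag names from Logan's example. bv_addr, bv_flags and bv_len are the hypothetical bio_vec fields from that example, and the wrapper deliberately leaves case (c) to callers per point (1):

	#include <linux/dma-mapping.h>
	#include <linux/pfn.h>
	#include <linux/mm.h>

	/* Map one segment according to its p2p type (illustrative only). */
	static dma_addr_t dma_map_bvec_hint(struct device *dev, struct bio_vec *bv,
					    enum dma_data_direction dir)
	{
		/* (b2): already a PCI bus address, nothing to map. */
		if (bv->bv_flags & BVEC_MAP_BUS_ADDR)
			return bv->bv_addr;

		/* (b1): peer BAR reached through the host bridge. */
		if (bv->bv_flags & BVEC_MAP_RESOURCE)
			return dma_map_resource(dev, bv->bv_addr, bv->bv_len,
						dir, 0);

		/* (a): plain system RAM. */
		return dma_map_page(dev, pfn_to_page(PHYS_PFN(bv->bv_addr)),
				    offset_in_page(bv->bv_addr), bv->bv_len, dir);
	}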
On 2019-06-28 7:38 a.m., Christoph Hellwig wrote: > On Thu, Jun 27, 2019 at 12:00:35PM -0600, Logan Gunthorpe wrote: >>> It is not. (c) is fundamentally very different as it is not actually >>> an operation that ever goes out to the wire at all, and which is why the >>> actual physical address on the wire does not matter at all. >>> Some interfaces like NVMe have designed it in a way that it the commands >>> used to do this internal transfer look like (b2), but that is just their >>> (IMHO very questionable) interface design choice, that produces a whole >>> chain of problems. >> >> >From the mapping device's driver's perspective yes, but from the >> perspective of a submitting driver they would be the same. > > With your dma_addr_t scheme it won't be the same, as you'd need > a magic way to generate the internal addressing and stuff it into > the dma_addr_t. With a phys_addr_t based scheme they should basically > be all the same. Yes, I see the folly in the dma_addr_t scheme now. I like the phys_addr_t ideas we have been discussing. >> Yes, you did suggest them. But what I'm trying to suggest is we don't >> *necessarily* need the lookup. For demonstration purposes only, a >> submitting driver could very roughly potentially do: >> >> struct bio_vec vec; >> dist = pci_p2pdma_dist(provider_pdev, mapping_pdev); >> if (dist < 0) { >> /* use regular memory */ >> vec.bv_addr = virt_to_phys(kmalloc(...)); >> vec.bv_flags = 0; >> } else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) { >> vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...); >> vec.bv_flags = BVEC_MAP_RESOURCE; >> } else { >> vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...); >> vec.bv_flags = BVEC_MAP_BUS_ADDR; >> } > > That doesn't look too bad, except.. > >> -- And a mapping driver would roughly just do: >> >> dma_addr_t dma_addr; >> if (vec.bv_flags & BVEC_MAP_BUS_ADDR) { >> if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off)) { >> /* case (c) */ >> /* program the DMA engine with bar and off */ > > Why bother with that here if we could also let the caller handle > that? pci_p2pdma_dist() should be able to trivially find that out > based on provider_dev == mapping_dev. True, in fact pci_p2pdma_dist() should return 0 in that case. Though the driver will still have to do a range compare to figure out which BAR the address belongs to and find the offset. >> The real difficulty here is that you'd really want all the above handled >> by a dma_map_bvec() so it can combine every vector hitting the IOMMU >> into a single continuous IOVA -- but it's hard to fit case (c) into that >> equation. So it might be that a dma_map_bvec() handles cases (a), (b1) >> and (b2) and the mapping driver has to then check each resulting DMA >> vector for pci_bus_addr_in_bar() while it is programming the DMA engine >> to deal with case (c). > > I'd do it the other way around. pci_p2pdma_dist is used to find > the p2p type. The p2p type is stuff into the bio_vec, and we then: > > (1) manually check for case (c) in driver for drivers that want to > treat it different from (b) > (2) we then have a dma mapping wrapper that checks the p2p type > and does the right thing for the rest. Sure, that could make sense. I imagine there's a lot of details that are wrong or could be done better in my example. The purpose of it was just to demonstrate that we can do it without a lookup in an interval tree on the physical address. Logan
On 2019-06-27 10:57 p.m., Jason Gunthorpe wrote: > On Thu, Jun 27, 2019 at 10:49:43AM -0600, Logan Gunthorpe wrote: > >>> I don't think a GPU/FPGA driver will be involved, this would enter the >>> block layer through the O_DIRECT path or something generic.. This the >>> general flow I was suggesting to Dan earlier >> >> I would say the O_DIRECT path has to somehow call into the driver >> backing the VMA to get an address to appropriate memory (in some way >> vaguely similar to how we were discussing at LSF/MM) > > Maybe, maybe no. For something like VFIO the PTE already has the > correct phys_addr_t and we don't need to do anything.. > > For DEVICE_PRIVATE we need to get the phys_addr_t out - presumably > through a new pagemap op? I don't know much about either VFIO or DEVICE_PRIVATE, but I'd still wager there would be a better way to handle it before they submit it to the block layer. >> If P2P can't be done at that point, then the provider driver would >> do the copy to system memory, in the most appropriate way, and >> return regular pages for O_DIRECT to submit to the block device. > > That only makes sense for the migratable DEVICE_PRIVATE case, it > doesn't help the VFIO-like case, there you'd need to bounce buffer. > >>>> I think it would be a larger layering violation to have the NVMe driver >>>> (for example) memcpy data off a GPU's bar during a dma_map step to >>>> support this bouncing. And it's even crazier to expect a DMA transfer to >>>> be setup in the map step. >>> >>> Why? Don't we already expect the DMA mapper to handle bouncing for >>> lots of cases, how is this case different? This is the best place to >>> place it to make it shared. >> >> This is different because it's special memory where the DMA mapper >> can't possibly know the best way to transfer the data. > > Why not? If we have a 'bar info' structure that could have data > transfer op callbacks, infact, I think we might already have similar > callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.. Well it could, in theory be done, but It just seems wrong to setup and wait for more DMA requests while we are in mid-progress setting up another DMA request. Especially when the block layer has historically had issues with stack sizes. It's also possible you might have multiple bio_vec's that have to each do a migration and with a hook here they'd have to be done serially. >> One could argue that the hook to the GPU/FPGA driver could be in the >> mapping step but then we'd have to do lookups based on an address -- >> where as the VMA could more easily have a hook back to whatever driver >> exported it. > > The trouble with a VMA hook is that it is only really avaiable when > working with the VA, and it is not actually available during GUP, you > have to have a GUP-like thing such as hmm_range_snapshot that is > specifically VMA based. And it is certainly not available during dma_map. Yup, I'm hoping some of the GUP cleanups will help with that but it's definitely a problem. I never said the VMA would be available at dma_map time nor would I want it to be. I expect it to be available before we submit the request to the block layer and it really only applies to the O_DIRECT path and maybe a similar thing in the RDMA path. 
> When working with VMA's/etc it seems there are some good reasons to > drive things off of the PTE content (either via struct page & pgmap or > via phys_addr_t & barmap) > > I think the best reason to prefer a uniform phys_addr_t is that it > does give us the option to copy the data to/from CPU memory. That > option goes away as soon as the bio sometimes provides a dma_addr_t. Not really. phys_addr_t alone doesn't give us a way to copy data. You need a lookup table on that address and a couple of hooks. > At least for RDMA, we do have some cases (like siw/rxe, hfi) where > they sometimes need to do that copy. I suspect the block stack is > similar, in the general case. But the whole point of the use cases I'm trying to serve is to avoid the root complex. If the block layer randomly decides to ephemerally copy the data back to the CPU (for integrity or something) then we've accomplished nothing and shouldn't have put the data in the BAR to begin with. Similarly, for DEVICE_PRIVATE, I'd have guessed it wouldn't want to use ephemeral copies but actually migrate the memory semi-permanently to the CPU for more than one transaction and I would argue that it makes the most sense to make these decisions before the data gets to the block layer. Logan
On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote: > > Why not? If we have a 'bar info' structure that could have data > > transfer op callbacks, infact, I think we might already have similar > > callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.. > > Well it could, in theory be done, but It just seems wrong to setup and > wait for more DMA requests while we are in mid-progress setting up > another DMA request. Especially when the block layer has historically > had issues with stack sizes. It's also possible you might have multiple > bio_vec's that have to each do a migration and with a hook here they'd > have to be done serially. *shrug* this is just standard bounce buffering stuff... > > I think the best reason to prefer a uniform phys_addr_t is that it > > does give us the option to copy the data to/from CPU memory. That > > option goes away as soon as the bio sometimes provides a dma_addr_t. > > Not really. phys_addr_t alone doesn't give us a way to copy data. You > need a lookup table on that address and a couple of hooks. Yes, I'm not sure how you envision using phys_addr_t without a lookup.. At the end of the day we must get the src and target 'struct device' in the dma_map area (at the minimum to compute the offset to translate phys_addr_t to dma_addr_t) and the only way to do that from phys_addr_t is via lookup?? > > At least for RDMA, we do have some cases (like siw/rxe, hfi) where > > they sometimes need to do that copy. I suspect the block stack is > > similar, in the general case. > > But the whole point of the use cases I'm trying to serve is to avoid the > root complex. Well, I think this is sort of a seperate issue. Generically I think the dma layer should continue to work largely transparently, and if I feed in BAR memory that can't be P2P'd it should bounce, just like all the other DMA limitations it already supports. That is pretty much its whole purpose in life. The issue of having the caller optimize what it sends is kind of separate - yes you definately still need the egress DMA device to drive CMB buffer selection, and DEVICE_PRIVATE also needs it to decide if it should migrate or not. What I see as the question is how to layout the BIO. If we agree the bio should only have phys_addr_t then we need some 'bar info' (ie at least the offset) in the dma map and some 'bar info' (ie the DMA device) during the bio construciton. What you are trying to do is optimize the passing of that 'bar info' with a limited number of bits in the BIO. A single flag means an interval tree, 4-8 bits could build a probably O(1) hash lookup, 64 bits could store a pointer, etc. If we can spare 4-8 bits in the bio then I suggest a 'perfect hash table'. Assign each registered P2P 'bar info' a small 4 bit id and hash on that. It should be fast enough to not worry about the double lookup. Jason
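One way to read the small-id suggestion: the bio_vec carries a few bits of provider id, and the 'bar info' comes from a fixed-size table instead of a tree walk. All names here are hypothetical:

	#define P2P_PROVIDER_BITS	4
	#define P2P_MAX_PROVIDERS	(1 << P2P_PROVIDER_BITS)

	/* Id 0 is reserved for "plain system RAM", so a zeroed bio_vec is safe. */
	static struct p2pdma_bar_info *p2p_providers[P2P_MAX_PROVIDERS];

	static inline struct p2pdma_bar_info *p2p_provider_get(unsigned int id)
	{
		return id && id < P2P_MAX_PROVIDERS ? p2p_providers[id] : NULL;
	}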
On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote: > On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote: > >>> Why not? If we have a 'bar info' structure that could have data >>> transfer op callbacks, infact, I think we might already have similar >>> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.. >> >> Well it could, in theory be done, but It just seems wrong to setup and >> wait for more DMA requests while we are in mid-progress setting up >> another DMA request. Especially when the block layer has historically >> had issues with stack sizes. It's also possible you might have multiple >> bio_vec's that have to each do a migration and with a hook here they'd >> have to be done serially. > > *shrug* this is just standard bounce buffering stuff... I don't know of any "standard" bounce buffering stuff that uses random other device's DMA engines where appropriate. >>> I think the best reason to prefer a uniform phys_addr_t is that it >>> does give us the option to copy the data to/from CPU memory. That >>> option goes away as soon as the bio sometimes provides a dma_addr_t. >> >> Not really. phys_addr_t alone doesn't give us a way to copy data. You >> need a lookup table on that address and a couple of hooks. > > Yes, I'm not sure how you envision using phys_addr_t without a > lookup.. At the end of the day we must get the src and target 'struct > device' in the dma_map area (at the minimum to compute the offset to > translate phys_addr_t to dma_addr_t) and the only way to do that from > phys_addr_t is via lookup?? I thought my other email to Christoph laid it out pretty cleanly... >>> At least for RDMA, we do have some cases (like siw/rxe, hfi) where >>> they sometimes need to do that copy. I suspect the block stack is >>> similar, in the general case. >> >> But the whole point of the use cases I'm trying to serve is to avoid the >> root complex. > > Well, I think this is sort of a seperate issue. Generically I think > the dma layer should continue to work largely transparently, and if I > feed in BAR memory that can't be P2P'd it should bounce, just like > all the other DMA limitations it already supports. That is pretty much > its whole purpose in life. I disagree. It's one thing for the DMA layer to work around architecture limitations like HighMem/LowMem and just do a memcpy when it can't handle it -- it's whole different thing for the DMA layer to know about the varieties of memory on different peripheral device's and the nuances of how and when to transfer between them. I think the submitting driver has the best information of when to do these transfers. IMO the bouncing in the DMA layer isn't a desirable thing, it was a necessary addition to work around various legacy platform issues and have existing code still work correctly. It's always better for a driver to allocate memory appropriate for the DMA than to just use random memory and rely on it being bounced by the lower layer. For P2P, we don't have existing code to worry about so I don't think a magic automatic bouncing design is appropriate. > The issue of having the caller optimize what it sends is kind of > separate - yes you definately still need the egress DMA device to > drive CMB buffer selection, and DEVICE_PRIVATE also needs it to decide > if it should migrate or not. Yes, but my contention is that I don't want to have to make these decisions twice: once before I submit it to the block layer, then again at mapping time. > What I see as the question is how to layout the BIO. 
> > If we agree the bio should only have phys_addr_t then we need some > 'bar info' (ie at least the offset) in the dma map and some 'bar info' > (ie the DMA device) during the bio construciton. Per my other email, it was phys_addr_t plus hints on how to map the memory (bus address, dma_map_resource, or regular). This requires exactly two flag bits in the bio_vec and no interval tree or hash table. I don't want to have to pass bar info, other hooks, or anything like that to the block layer. > What you are trying to do is optimize the passing of that 'bar info' > with a limited number of bits in the BIO. > > A single flag means an interval tree, 4-8 bits could build a probably > O(1) hash lookup, 64 bits could store a pointer, etc. Well, an interval tree can get you the backing device for a given phys_addr_t; however, for P2PDMA, we need a second lookup based on the mapping device. This is because exactly how you map the data depends on both devices. Currently I'm working on this for the existing implementation and struct page gets me the backing device but I need another xarray cache based on the mapping device to figure out exactly how to map the memory. But none of this mess is required if we can just pass the mapping hints through the block layer as flags (per my other email) because the submitting driver should already know ahead of time what it's trying to do. > If we can spare 4-8 bits in the bio then I suggest a 'perfect hash > table'. Assign each registered P2P 'bar info' a small 4 bit id and > hash on that. It should be fast enough to not worry about the double > lookup. This feels like it's just setting us up to run into nasty limits based on the number of bits we actually have. The number of bits in a bio_vec will always be a precious resource. If I have a server chassis that exist today with 24 NVMe devices, and each device has a CMB, I'm already using up 6 of those bits. Then we might have DEVICE_PRIVATE and other uses on top of that. Logan
On Fri, Jun 28, 2019 at 12:29:32PM -0600, Logan Gunthorpe wrote: > > > On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote: > > On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote: > > > >>> Why not? If we have a 'bar info' structure that could have data > >>> transfer op callbacks, infact, I think we might already have similar > >>> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.. > >> > >> Well it could, in theory be done, but It just seems wrong to setup and > >> wait for more DMA requests while we are in mid-progress setting up > >> another DMA request. Especially when the block layer has historically > >> had issues with stack sizes. It's also possible you might have multiple > >> bio_vec's that have to each do a migration and with a hook here they'd > >> have to be done serially. > > > > *shrug* this is just standard bounce buffering stuff... > > I don't know of any "standard" bounce buffering stuff that uses random > other device's DMA engines where appropriate. IMHO, it is conceptually the same as memcpy.. And probably we will not ever need such optimization in dma map. Other copy places might be different at least we have the option. > IMO the bouncing in the DMA layer isn't a desirable thing, it was a > necessary addition to work around various legacy platform issues and > have existing code still work correctly. Of course it is not desireable! But there are many situations where we do not have the luxury to work around the HW limits in the caller, so those callers either have to not do DMA or they have to open code bounce buffering - both are wrong. > > What I see as the question is how to layout the BIO. > > > > If we agree the bio should only have phys_addr_t then we need some > > 'bar info' (ie at least the offset) in the dma map and some 'bar info' > > (ie the DMA device) during the bio construciton. > > Per my other email, it was phys_addr_t plus hints on how to map the > memory (bus address, dma_map_resource, or regular). This requires > exactly two flag bits in the bio_vec and no interval tree or hash table. > I don't want to have to pass bar info, other hooks, or anything like > that to the block layer. This scheme makes the assumption that the dma mapping struct device is all you need, and we never need to know the originating struct device during dma map. This is clearly safe if the two devices are on the same PCIe segment However, I'd feel more comfortable about that assumption if we had code to support the IOMMU case, and know for sure it doesn't require more info :( But I suppose it is also reasonable that only the IOMMU case would have the expensive 'bar info' lookup or something. Maybe you can hide these flags as some dma_map helper, then the layering might be nicer: dma_map_set_bio_p2p_flags(bio, phys_addr, source dev, dest_dev) ? ie the choice of flag scheme to use is opaque to the DMA layer. > > If we can spare 4-8 bits in the bio then I suggest a 'perfect hash > > table'. Assign each registered P2P 'bar info' a small 4 bit id and > > hash on that. It should be fast enough to not worry about the double > > lookup. > > This feels like it's just setting us up to run into nasty limits based > on the number of bits we actually have. The number of bits in a bio_vec > will always be a precious resource. If I have a server chassis that > exist today with 24 NVMe devices, and each device has a CMB, I'm already > using up 6 of those bits. Then we might have DEVICE_PRIVATE and other > uses on top of that. 
A hash is an alternative data structure to an interval tree that has better scaling for small numbers of BARs, which I think is our case.

Jason
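For reference, a rough sketch of what the small-ID lookup suggested here could look like, assuming a hypothetical p2pdma_bar_info structure and a 4-bit ID carried in spare bio_vec flag bits; with at most 16 registered BARs the "hash" degenerates into a plain array index:

#include <linux/device.h>
#include <linux/spinlock.h>
#include <linux/errno.h>

#define P2PDMA_MAX_BARS		16	/* 4 bits of ID space */

/* Hypothetical per-BAR registration record. */
struct p2pdma_bar_info {
	struct device	*provider;	/* device exposing the BAR */
	phys_addr_t	bar_start;
	u64		bus_offset;
};

static struct p2pdma_bar_info *p2pdma_bars[P2PDMA_MAX_BARS];
static DEFINE_SPINLOCK(p2pdma_bars_lock);

/* Register a BAR and return its small ID, or -ENOSPC if the 4-bit space is full. */
static int p2pdma_register_bar(struct p2pdma_bar_info *info)
{
	int i, ret = -ENOSPC;

	spin_lock(&p2pdma_bars_lock);
	for (i = 0; i < P2PDMA_MAX_BARS; i++) {
		if (!p2pdma_bars[i]) {
			p2pdma_bars[i] = info;
			ret = i;
			break;
		}
	}
	spin_unlock(&p2pdma_bars_lock);

	return ret;
}

/* O(1) lookup at dma-map time from the ID stored in the bio_vec flag bits. */
static struct p2pdma_bar_info *p2pdma_bar_from_id(unsigned int id)
{
	return id < P2PDMA_MAX_BARS ? p2pdma_bars[id] : NULL;
}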
On 2019-06-28 1:09 p.m., Jason Gunthorpe wrote:
> On Fri, Jun 28, 2019 at 12:29:32PM -0600, Logan Gunthorpe wrote:
>> On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote:
>>> On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:
>>>
>>>>> Why not? If we have a 'bar info' structure that could have data transfer op callbacks, in fact, I think we might already have similar callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.
>>>>
>>>> Well, it could in theory be done, but it just seems wrong to set up and wait for more DMA requests while we are in mid-progress setting up another DMA request. Especially when the block layer has historically had issues with stack sizes. It's also possible you might have multiple bio_vecs that each have to do a migration, and with a hook here they'd have to be done serially.
>>>
>>> *shrug* this is just standard bounce buffering stuff...
>>
>> I don't know of any "standard" bounce buffering stuff that uses random other devices' DMA engines where appropriate.
>
> IMHO, it is conceptually the same as memcpy. And probably we will never need such an optimization in dma map. Other copy places might be different; at least we have the option.
>
>> IMO the bouncing in the DMA layer isn't a desirable thing; it was a necessary addition to work around various legacy platform issues and have existing code still work correctly.
>
> Of course it is not desirable! But there are many situations where we do not have the luxury to work around the HW limits in the caller, so those callers either have to not do DMA or they have to open code bounce buffering - both are wrong.

They don't have to open code it; they can use helpers and good coding practices. But the submitting driver is the one that's in the best position to figure this stuff out, just like it is with the dma_map bouncing -- all it has to do is use dma_alloc_coherent(). If we don't write any submitting drivers that assume the dma_map API bounces, then we should never have to deal with it.

>>> What I see as the question is how to lay out the BIO.
>>>
>>> If we agree the bio should only have phys_addr_t then we need some 'bar info' (ie at least the offset) in the dma map and some 'bar info' (ie the DMA device) during the bio construction.
>>
>> Per my other email, it was phys_addr_t plus hints on how to map the memory (bus address, dma_map_resource, or regular). This requires exactly two flag bits in the bio_vec and no interval tree or hash table. I don't want to have to pass bar info, other hooks, or anything like that to the block layer.
>
> This scheme makes the assumption that the dma mapping struct device is all you need, and we never need to know the originating struct device during dma map. This is clearly safe if the two devices are on the same PCIe segment.
>
> However, I'd feel more comfortable about that assumption if we had code to support the IOMMU case, and know for sure it doesn't require more info :(

The example I posted *does* support the IOMMU case. That was case (b1) in the description. The idea is that pci_p2pdma_dist() returns a distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU mapping is required, and the appropriate flag tells it to call dma_map_resource(). This way it supports both the same-segment and different-segment cases without needing any lookups in the map step.
For the only existing upstream use case (NVMe-of), this is ideal because we can calculate the mapping requirements exactly once, ahead of any transfers. Then populating the bvecs and dma-mapping for each transfer is fast and doesn't require any additional work besides deciding where to get the memory from.

For O_DIRECT and userspace RDMA, this should also be ideal; the real problem is how to get the necessary information out of the VMA, and that isn't helped by having a lookup at the dma map step. But the provider driver is certainly going to be involved in creating the VMA, so it should be able to easily provide the necessary hooks. Though there are still a bunch of challenges here.

Maybe other use cases are not this ideal, but I suspect they should still be able to make use of the same flags. It's hard to say right now, though, because we haven't seen any other use cases.

> Maybe you can hide these flags behind some dma_map helper; then the layering might be nicer:
>
>   dma_map_set_bio_p2p_flags(bio, phys_addr, source_dev, dest_dev)?
>
> ie the choice of flag scheme to use is opaque to the DMA layer.

If there were such a use case, I suppose you could use a couple of flag bits to tell you how to interpret the other flag bits; but, at the moment, I only see a need for 2 bits, so we'll probably have a lot of spares for a long time. You could certainly have a 3rd bit which says do a lookup and try to figure out bouncing, but I don't think it's a good idea.

>>> If we can spare 4-8 bits in the bio then I suggest a 'perfect hash table'. Assign each registered P2P 'bar info' a small 4-bit id and hash on that. It should be fast enough to not worry about the double lookup.
>>
>> This feels like it's just setting us up to run into nasty limits based on the number of bits we actually have. The number of bits in a bio_vec will always be a precious resource. If I have a server chassis that exists today with 24 NVMe devices, each with a CMB, I'm already using up 6 of those bits, and then we might have DEVICE_PRIVATE and other uses on top of that.
>
> A hash is an alternative data structure to an interval tree that has better scaling for small numbers of BARs, which I think is our case.

But then you need a large and not necessarily future-proof number of bits in the bio_vec to store the hash.

Logan
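A sketch of the "calculate once, ahead of any transfers" flow described above. pci_p2pdma_dist() and PCI_P2PDMA_THRU_HOST_BRIDGE are the RFC's proposed interface (the exact signature is assumed here); choose_p2p_map_flag() and the PBVEC_* bits from the earlier sketch are hypothetical:

#include <linux/pci.h>
/* pci_p2pdma_dist() and PCI_P2PDMA_THRU_HOST_BRIDGE assumed from the RFC's pci-p2pdma interface. */

static unsigned int choose_p2p_map_flag(struct pci_dev *provider,
					struct pci_dev *client)
{
	int dist = pci_p2pdma_dist(provider, client);

	if (dist < 0)
		return 0;			/* no usable P2P path; fall back to host memory */

	if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE)
		return PBVEC_MAP_RESOURCE;	/* must go through the IOMMU/host bridge */

	return PBVEC_MAP_BUS_ADDR;		/* same segment; program the bus address directly */
}

The returned flag would then be stored in every bvec the submitting driver builds for that provider/client pair, so the per-transfer path stays purely flag-driven.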
On Fri, Jun 28, 2019 at 01:35:42PM -0600, Logan Gunthorpe wrote:
> > However, I'd feel more comfortable about that assumption if we had code to support the IOMMU case, and know for sure it doesn't require more info :(
>
> The example I posted *does* support the IOMMU case. That was case (b1) in the description. The idea is that pci_p2pdma_dist() returns a distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU mapping is required, and the appropriate flag tells it to call dma_map_resource(). This way it supports both the same-segment and different-segment cases without needing any lookups in the map step.

I mean we actually have some iommu drivers that can set up P2P in real HW. I'm worried that real IOMMUs will need the BDF of the completer to route completions back to the requester - which we can't trivially get through this scheme.

However, maybe that is just a future problem, and certainly we can see that with an interval tree or otherwise such an IOMMU could get the information it needs.

Jason
On 2019-07-02 4:45 p.m., Jason Gunthorpe wrote:
> On Fri, Jun 28, 2019 at 01:35:42PM -0600, Logan Gunthorpe wrote:
>
>>> However, I'd feel more comfortable about that assumption if we had code to support the IOMMU case, and know for sure it doesn't require more info :(
>>
>> The example I posted *does* support the IOMMU case. That was case (b1) in the description. The idea is that pci_p2pdma_dist() returns a distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU mapping is required, and the appropriate flag tells it to call dma_map_resource(). This way it supports both the same-segment and different-segment cases without needing any lookups in the map step.
>
> I mean we actually have some iommu drivers that can set up P2P in real HW. I'm worried that real IOMMUs will need the BDF of the completer to route completions back to the requester - which we can't trivially get through this scheme.

I've never seen such an IOMMU, but I guess, in theory, it could exist. The IOMMUs that set up P2P-like transactions in real hardware make use of dma_map_resource(). There aren't a lot of users of this function (it was actually broken with the Intel IOMMU until I fixed it recently, and I'd expect there are other broken implementations), but, to my knowledge, none of them have needed the BDF of the provider to date.

> However, maybe that is just a future problem, and certainly we can see that with an interval tree or otherwise such an IOMMU could get the information it needs.

Yup, the rule of thumb is to design for the needs we have today, not imagined future problems.

Logan
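For reference, dma_map_resource() and dma_unmap_resource() are existing interfaces; a minimal example of mapping a window of a peer device's BAR through them so that dma_dev can issue DMA to it. The helper names and the bar_phys/len values are illustrative placeholders for whatever the provider driver exposes:

#include <linux/dma-mapping.h>

static int map_peer_bar_window(struct device *dma_dev, phys_addr_t bar_phys,
			       size_t len, dma_addr_t *dma_out)
{
	dma_addr_t dma;

	/* Map the BAR physical range for dma_dev; the IOMMU (if any) sets up the translation. */
	dma = dma_map_resource(dma_dev, bar_phys, len, DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(dma_dev, dma))
		return -ENOMEM;

	*dma_out = dma;
	return 0;
}

static void unmap_peer_bar_window(struct device *dma_dev, dma_addr_t dma,
				  size_t len)
{
	dma_unmap_resource(dma_dev, dma, len, DMA_BIDIRECTIONAL, 0);
}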