
[RFC,00/28] Removing struct page from P2PDMA

Message ID 20190620161240.22738-1-logang@deltatee.com (mailing list archive)

Message

Logan Gunthorpe June 20, 2019, 4:12 p.m. UTC
For eons there has been a debate over whether or not to use
struct pages for peer-to-peer DMA transactions. Pro-pagers have
argued that struct pages are necessary for interacting with
existing code like scatterlists or the bio_vecs. Anti-pagers
assert that the tracking of the memory is unnecessary and
allocating the pages is a waste of memory. Both viewpoints are
valid; however, developers working on GPUs and RDMA tend to be
able to do away with struct pages relatively easily compared to
those wanting to work with NVMe devices through the block layer.
So it would be of great value to be able to universally do P2PDMA
transactions without the use of struct pages.

Previously, there have been multiple attempts[1][2] to replace
struct page usage with pfn_t, but this has been unpopular because
it creates dangerous edge cases where unsuspecting code might
run across pfn_t's it is not ready for.

Currently, we have P2PDMA using struct pages through the block layer
and the dangerous cases are avoided by using a queue flag that
indicates support for the special pages.

This RFC proposes a new solution: allow the block layer to take
DMA addresses directly for queues that indicate support. This will
provide a more general path for doing P2PDMA-like requests and will
allow us to remove the struct pages that back P2PDMA memory, thus paving
the way to build a more uniform P2PDMA ecosystem.

This is a fairly long patch set but most of the patches are quite
small. Patches 1 through 18 introduce the concept of a dma_vec that
is similar to a bio_vec (except it takes dma_addr_t's instead of pages
and offsets) as well as a special dma-direct bio/request. Most of these
patches just prevent the new type of bio from being misused, and add
support for splitting and mapping these bios in the same way that struct
page bios can be operated on. Patches 19 through 22 modify the existing
P2PDMA support in nvme-pci, ib-core and nvmet to use DMA addresses
directly. Patches 23 through 25 remove the P2PDMA specific
code from the block layer and ib-core. Finally, patches 26 through 28
remove the struct pages from the PCI P2PDMA code.
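
To make the dma_vec idea concrete, here is a minimal sketch of the new
structure next to the existing bio_vec. The field names for dma_vec are
illustrative; see patch 2 for the actual definition.

	/* Existing bio_vec (include/linux/bvec.h): page + offset + length */
	struct bio_vec {
		struct page	*bv_page;
		unsigned int	bv_len;
		unsigned int	bv_offset;
	};

	/* Sketch of a dma_vec: a pre-mapped DMA/bus address + length. No
	 * struct page is involved, so the memory does not need a kernel
	 * mapping or an entry in the linear map. */
	struct dma_vec {
		dma_addr_t	dv_addr;
		unsigned int	dv_len;
	};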

This RFC is based on v5.2-rc5 and a git branch is available here:

https://github.com/sbates130272/linux-p2pmem.git dma_direct_rfc1

[1] https://lwn.net/Articles/647404/
[2] https://lore.kernel.org/lkml/1495662147-18277-1-git-send-email-logang@deltatee.com/

--

Logan Gunthorpe (28):
  block: Introduce DMA direct request type
  block: Add dma_vec structure
  block: Warn on mis-use of dma-direct bios
  block: Never bounce dma-direct bios
  block: Skip dma-direct bios in bio_integrity_prep()
  block: Support dma-direct bios in bio_advance_iter()
  block: Use dma_vec length in bio_cur_bytes() for dma-direct bios
  block: Introduce dmavec_phys_mergeable()
  block: Introduce vec_gap_to_prev()
  block: Create generic vec_split_segs() from bvec_split_segs()
  block: Create blk_segment_split_ctx
  block: Create helper for bvec_should_split()
  block: Generalize bvec_should_split()
  block: Support splitting dma-direct bios
  block: Support counting dma-direct bio segments
  block: Implement mapping dma-direct requests to SGs in blk_rq_map_sg()
  block: Introduce queue flag to indicate support for dma-direct bios
  block: Introduce bio_add_dma_addr()
  nvme-pci: Support dma-direct bios
  IB/core: Introduce API for initializing a RW ctx from a DMA address
  nvmet: Split nvmet_bdev_execute_rw() into a helper function
  nvmet: Use DMA addresses instead of struct pages for P2P
  nvme-pci: Remove support for PCI_P2PDMA requests
  block: Remove PCI_P2PDMA queue flag
  IB/core: Remove P2PDMA mapping support in rdma_rw_ctx
  PCI/P2PDMA: Remove SGL helpers
  PCI/P2PDMA: Remove struct pages that back P2PDMA memory
  memremap: Remove PCI P2PDMA page memory type

 Documentation/driver-api/pci/p2pdma.rst |   9 +-
 block/bio-integrity.c                   |   4 +
 block/bio.c                             |  71 +++++++
 block/blk-core.c                        |   3 +
 block/blk-merge.c                       | 256 ++++++++++++++++++------
 block/blk.h                             |  49 ++++-
 block/bounce.c                          |   8 +
 drivers/infiniband/core/rw.c            |  85 ++++++--
 drivers/nvme/host/core.c                |   4 +-
 drivers/nvme/host/nvme.h                |   2 +-
 drivers/nvme/host/pci.c                 |  29 ++-
 drivers/nvme/target/core.c              |  12 +-
 drivers/nvme/target/io-cmd-bdev.c       |  82 +++++---
 drivers/nvme/target/nvmet.h             |   5 +-
 drivers/nvme/target/rdma.c              |  43 +++-
 drivers/pci/p2pdma.c                    | 202 +++----------------
 include/linux/bio.h                     |  32 ++-
 include/linux/blk_types.h               |  14 +-
 include/linux/blkdev.h                  |  16 +-
 include/linux/bvec.h                    |  43 ++++
 include/linux/memremap.h                |   5 -
 include/linux/mm.h                      |  13 --
 include/linux/pci-p2pdma.h              |  19 --
 include/rdma/rw.h                       |   6 +
 24 files changed, 648 insertions(+), 364 deletions(-)

--
2.20.1

Comments

Dan Williams June 20, 2019, 6:45 p.m. UTC | #1
On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> For eons there has been a debate over whether or not to use
> struct pages for peer-to-peer DMA transactions. Pro-pagers have
> argued that struct pages are necessary for interacting with
> existing code like scatterlists or the bio_vecs. Anti-pagers
> assert that the tracking of the memory is unnecessary and
> allocating the pages is a waste of memory. Both viewpoints are
> valid, however developers working on GPUs and RDMA tend to be
> able to do away with struct pages relatively easily

Presumably because they have historically never tried to be
inter-operable with the block layer or drivers outside graphics and
RDMA.

>  compared to
> those wanting to work with NVMe devices through the block layer.
> So it would be of great value to be able to universally do P2PDMA
> transactions without the use of struct pages.

Please spell out the value, it is not immediately obvious to me
outside of some memory capacity savings.

> Previously, there have been multiple attempts[1][2] to replace
> struct page usage with pfn_t but this has been unpopular seeing
> it creates dangerous edge cases where unsuspecting code might
> run across pfn_t's they are not ready for.

That's not the conclusion I arrived at because pfn_t is specifically
an opaque type precisely to force "unsuspecting" code to throw
compiler assertions. Instead pfn_t was dealt its death blow here:

https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/

...and I think that feedback also reads on this proposal.

> Currently, we have P2PDMA using struct pages through the block layer
> and the dangerous cases are avoided by using a queue flag that
> indicates support for the special pages.
>
> This RFC proposes a new solution: allow the block layer to take
> DMA addresses directly for queues that indicate support. This will
> provide a more general path for doing P2PDMA-like requests and will
> allow us to remove the struct pages that back P2PDMA memory thus paving
> the way to build a more uniform P2PDMA ecosystem.

My primary concern with this is that it ascribes a level of generality
that just isn't there for peer-to-peer dma operations. "Peer"
addresses are not "DMA" addresses, and the rules about what can and
can't do peer-DMA are not generically known to the block layer. At
least with a side object there's a chance to describe / recall those
restrictions as these things get passed around the I/O stack, but an
undecorated "DMA" address passed through the block layer with no other
benefit to any subsystem besides RDMA does not feel like it advances
the state of the art.

Again, what are the benefits of plumbing this RDMA special case?
Jason Gunthorpe June 20, 2019, 7:33 p.m. UTC | #2
On Thu, Jun 20, 2019 at 11:45:38AM -0700, Dan Williams wrote:

> > Previously, there have been multiple attempts[1][2] to replace
> > struct page usage with pfn_t but this has been unpopular seeing
> > it creates dangerous edge cases where unsuspecting code might
> > run accross pfn_t's they are not ready for.
> 
> That's not the conclusion I arrived at because pfn_t is specifically
> an opaque type precisely to force "unsuspecting" code to throw
> compiler assertions. Instead pfn_t was dealt its death blow here:
> 
> https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/
> 
> ...and I think that feedback also reads on this proposal.

I read through Linus's remarks and he seems completely right that
anything that touches a filesystem needs a struct page, because FS's
rely heavily on that.

It is much less clear to me why a GPU BAR or an NVMe CMB that never
touches a filesystem needs a struct page. The best reason I've seen
is that it must have struct page because the block layer heavily
depends on struct page.

Since that thread was so DAX/pmem centric (and Linus did say he liked
the __pfn_t), maybe it is worth checking again, but not for DAX/pmem
users?

This P2P is quite distinct from DAX as the struct page* would point to
non-cacheable weird memory that few struct page users would even be
able to work with, while I understand DAX use cases focused on CPU
cache coherent memory, and filesystem involvement.

> My primary concern with this is that ascribes a level of generality
> that just isn't there for peer-to-peer dma operations. "Peer"
> addresses are not "DMA" addresses, and the rules about what can and
> can't do peer-DMA are not generically known to the block layer.

?? The P2P infrastructure produces a DMA bus address for the
initiating device that is absolutely a DMA address. There is some
intermediate CPU centric representation, but after mapping it is the
same as any other DMA bus address.

The map function can tell if the device pair combination can do p2p or
not.

> Again, what are the benefits of plumbing this RDMA special case?

It is not just RDMA, this is interesting for GPU and vfio use cases
too. RDMA is just the most complete in-tree user we have today.

ie GPU people would really like to do read() and have P2P
transparently happen to on-GPU pages. With GPUs having huge amounts of
memory, loading file data into them is really a performance-critical
thing.

Jason
Logan Gunthorpe June 20, 2019, 7:34 p.m. UTC | #3
On 2019-06-20 12:45 p.m., Dan Williams wrote:
> On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>>
>> For eons there has been a debate over whether or not to use
>> struct pages for peer-to-peer DMA transactions. Pro-pagers have
>> argued that struct pages are necessary for interacting with
>> existing code like scatterlists or the bio_vecs. Anti-pagers
>> assert that the tracking of the memory is unnecessary and
>> allocating the pages is a waste of memory. Both viewpoints are
>> valid, however developers working on GPUs and RDMA tend to be
>> able to do away with struct pages relatively easily
> 
> Presumably because they have historically never tried to be
> inter-operable with the block layer or drivers outside graphics and
> RDMA.

Yes, but really there are three main sets of users for P2P right now:
graphics, RDMA and NVMe. And every time a patch set comes from GPU/RDMA
people they don't bother with struct page. I seem to be the only one
trying to push P2P with NVMe and it seems to be a losing battle.

> Please spell out the value, it is not immediately obvious to me
> outside of some memory capacity savings.

There are a few things:

* Have consistency with P2P efforts as most other efforts have been
avoiding struct page. Nobody else seems to want
pci_p2pdma_add_resource() or any devm_memremap_pages() call.

* Avoid all arch-specific dependencies for P2P. With struct page the IO
memory must fit in the linear mapping. This requires some work with
RISC-V and I remember some complaints from the powerpc people regarding
this. Certainly not all arches will be able to fit the IO region into
the linear mapping space.

* Remove a bunch of PCI P2PDMA special case mapping stuff from the block
layer and RDMA interface (which I've been hearing complaints over).

* Save the struct page memory that is largely unused (as you note).

>> Previously, there have been multiple attempts[1][2] to replace
>> struct page usage with pfn_t but this has been unpopular seeing
>> it creates dangerous edge cases where unsuspecting code might
>> run across pfn_t's they are not ready for.
> 
> That's not the conclusion I arrived at because pfn_t is specifically
> an opaque type precisely to force "unsuspecting" code to throw
> compiler assertions. Instead pfn_t was dealt its death blow here:
> 
> https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/

Ok, well yes the special pages are what we've done for P2PDMA today. But
I don't think Linus's criticism really applies to what's in this RFC.
For starters, P2PDMA doesn't, and never has, used struct page to
look up the reference count. PCI BARs have no relation to the cache so
there's no need to serialize their access but this can be done
before/after the DMA addresses are submitted to the block/rdma layer if
it was required.

In fact, the only thing the struct page is used for in the current
P2PDMA implementation is a single flag indicating it's special and needs
to be mapped in a special way.

> My primary concern with this is that ascribes a level of generality
> that just isn't there for peer-to-peer dma operations. "Peer"
> addresses are not "DMA" addresses, and the rules about what can and
> can't do peer-DMA are not generically known to the block layer.

Correct, but I don't think we should teach the block layer about these
rules. In the current code, the rules are enforced outside the block
layer before the bios are submitted and this patch set doesn't change
that. The driver orchestrating P2P will always have to check the rules
and derive addresses from them (as appropriate). With the RFC the block
layer then doesn't have to care and can just handle the DMA addresses
directly.

> At least with a side object there's a chance to describe / recall those
> restrictions as these things get passed around the I/O stack, but an
> undecorated "DMA" address passed through the block layer with no other
> benefit to any subsystem besides RDMA does not feel like it advances
> the state of the art.
> 
> Again, what are the benefits of plumbing this RDMA special case?

Because I don't think it is an RDMA special case.

Logan
Dan Williams June 20, 2019, 8:18 p.m. UTC | #4
On Thu, Jun 20, 2019 at 12:34 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Jun 20, 2019 at 11:45:38AM -0700, Dan Williams wrote:
>
> > > Previously, there have been multiple attempts[1][2] to replace
> > > struct page usage with pfn_t but this has been unpopular seeing
> > > it creates dangerous edge cases where unsuspecting code might
> > > run accross pfn_t's they are not ready for.
> >
> > That's not the conclusion I arrived at because pfn_t is specifically
> > an opaque type precisely to force "unsuspecting" code to throw
> > compiler assertions. Instead pfn_t was dealt its death blow here:
> >
> > https://lore.kernel.org/lkml/CA+55aFzON9617c2_Amep0ngLq91kfrPiSccdZakxir82iekUiA@mail.gmail.com/
> >
> > ...and I think that feedback also reads on this proposal.
>
> I read through Linus's remarks and it he seems completely right that
> anything that touches a filesystem needs a struct page, because FS's
> rely heavily on that.
>
> It is much less clear to me why a GPU BAR or a NVME CMB that never
> touches a filesystem needs a struct page.. The best reason I've seen
> is that it must have struct page because the block layer heavily
> depends on struct page.
>
> Since that thread was so DAX/pmem centric (and Linus did say he liked
> the __pfn_t), maybe it is worth checking again, but not for DAX/pmem
> users?
>
> This P2P is quite distinct from DAX as the struct page* would point to
> non-cacheable weird memory that few struct page users would even be
> able to work with, while I understand DAX use cases focused on CPU
> cache coherent memory, and filesystem involvement.

What I'm poking at is whether this block layer capability can pick up
users outside of RDMA, more on this below...

>
> > My primary concern with this is that ascribes a level of generality
> > that just isn't there for peer-to-peer dma operations. "Peer"
> > addresses are not "DMA" addresses, and the rules about what can and
> > can't do peer-DMA are not generically known to the block layer.
>
> ?? The P2P infrastructure produces a DMA bus address for the
> initiating device that is is absolutely a DMA address. There is some
> intermediate CPU centric representation, but after mapping it is the
> same as any other DMA bus address.

Right, this goes back to the confusion between the hardware / bus
address that a dma-engine would consume directly, and a Linux "DMA"
address as a device-specific translation of host memory.

Is the block layer representation of this address going to go through
a peer / "bus" address translation when it reaches the RDMA driver? In
other words, if we tried to use this facility with other drivers, how
would the driver know whether it was passed a traditional Linux DMA
address or a peer bus address that the device may not be able to handle?

> The map function can tell if the device pair combination can do p2p or
> not.

Ok, if this map step is still there then that reduces a significant
portion of my concern, and it becomes a quibble about the naming and how
a non-RDMA device driver might figure out if it was handed an address
it can't handle.

>
> > Again, what are the benefits of plumbing this RDMA special case?
>
> It is not just RDMA, this is interesting for GPU and vfio use cases
> too. RDMA is just the most complete in-tree user we have today.
>
> ie GPU people wouuld really like to do read() and have P2P
> transparently happen to on-GPU pages. With GPUs having huge amounts of
> memory loading file data into them is really a performance critical
> thing.

A direct-i/o read(2) into a page-less GPU mapping? Through a regular
file or a device special file?
Logan Gunthorpe June 20, 2019, 8:51 p.m. UTC | #5
On 2019-06-20 2:18 p.m., Dan Williams wrote:
>> Since that thread was so DAX/pmem centric (and Linus did say he liked
>> the __pfn_t), maybe it is worth checking again, but not for DAX/pmem
>> users?
>>
>> This P2P is quite distinct from DAX as the struct page* would point to
>> non-cacheable weird memory that few struct page users would even be
>> able to work with, while I understand DAX use cases focused on CPU
>> cache coherent memory, and filesystem involvement.
> 
> What I'm poking at is whether this block layer capability can pick up
> users outside of RDMA, more on this below...

I assume you mean outside of P2PDMA....

This new block layer capability is more likely to pick up additional
users compared to the existing block layer changes that are *very*
specific to PCI P2PDMA.

I also have (probably significantly controversial) plans to use this to
allow P2P through user space with O_DIRECT using an idea Jerome had in a
previous patch set that was discussed a bit informally at LSF/MM this
year. But that's a whole other RFC and requires a bunch of work I
haven't done yet.

>>
>>> My primary concern with this is that ascribes a level of generality
>>> that just isn't there for peer-to-peer dma operations. "Peer"
>>> addresses are not "DMA" addresses, and the rules about what can and
>>> can't do peer-DMA are not generically known to the block layer.
>>
>> ?? The P2P infrastructure produces a DMA bus address for the
>> initiating device that is is absolutely a DMA address. There is some
>> intermediate CPU centric representation, but after mapping it is the
>> same as any other DMA bus address.
> 
> Right, this goes back to the confusion caused by the hardware / bus /
> address that a dma-engine would consume directly, and Linux "DMA"
> address as a device-specific translation of host memory.
> 
> Is the block layer representation of this address going to go through
> a peer / "bus" address translation when it reaches the RDMA driver? In
> other words if we tried to use this facility with other drivers how
> would the driver know it was passed a traditional Linux DMA address,
> vs a peer bus address that the device may not be able to handle?

The idea is that the driver doesn't need to know. There's no distinction
between a Linux DMA address and a peer bus address. They are both used
for the same purpose: to program into a DMA engine. If the device cannot
handle such a DMA address then it shouldn't indicate support for this
feature or the P2PDMA layer needs a way to detect this. Really, this
property depends more on the bus than the device and that's what all the
P2PDMA code in the PCI tree handles.

>> The map function can tell if the device pair combination can do p2p or
>> not.
> 
> Ok, if this map step is still there then reduce a significant portion
> of my concern and it becomes a quibble about the naming and how a
> non-RDMA device driver might figure out if it was handled an address
> it can't handle.

Yes, there will always be a map step, but it should be done by the
orchestrator because it requires both devices (the client and the
provider) and the block layer really should not know about both devices.

In this RFC, the map step is kind of hidden but would probably come back
in the future. It's currently a call to pci_p2pmem_virt_to_bus() but
would eventually need to be a pci_p2pmem_map_resource() or similar which
takes a pointer to the pci_dev provider and the struct device client
doing the mapping.
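
As a rough, purely illustrative sketch (no such function exists yet; the
name and signature are assumptions based on the description above), such
an interface could look something like:

	/* Hypothetical mapping helper: pairs the p2pmem provider with the
	 * client device that will do the DMA, returning an address that is
	 * valid only for that client (a PCI bus address when the path stays
	 * below a common switch, or an IOMMU-mapped address otherwise). */
	dma_addr_t pci_p2pmem_map_resource(struct pci_dev *provider, void *addr,
					   size_t size, struct device *client,
					   enum dma_data_direction dir);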

Logan
Dan Williams June 20, 2019, 11:40 p.m. UTC | #6
On Thu, Jun 20, 2019 at 12:35 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2019-06-20 12:45 p.m., Dan Williams wrote:
> > On Thu, Jun 20, 2019 at 9:13 AM Logan Gunthorpe <logang@deltatee.com> wrote:
> >>
> >> For eons there has been a debate over whether or not to use
> >> struct pages for peer-to-peer DMA transactions. Pro-pagers have
> >> argued that struct pages are necessary for interacting with
> >> existing code like scatterlists or the bio_vecs. Anti-pagers
> >> assert that the tracking of the memory is unecessary and
> >> allocating the pages is a waste of memory. Both viewpoints are
> >> valid, however developers working on GPUs and RDMA tend to be
> >> able to do away with struct pages relatively easily
> >
> > Presumably because they have historically never tried to be
> > inter-operable with the block layer or drivers outside graphics and
> > RDMA.
>
> Yes, but really there are three main sets of users for P2P right now:
> graphics, RDMA and NVMe. And every time a patch set comes from GPU/RDMA
> people they don't bother with struct page. I seem to be the only one
> trying to push P2P with NVMe and it seems to be a losing battle.
>
> > Please spell out the value, it is not immediately obvious to me
> > outside of some memory capacity savings.
>
> There are a few things:
>
> * Have consistency with P2P efforts as most other efforts have been
> avoiding struct page. Nobody else seems to want
> pci_p2pdma_add_resource() or any devm_memremap_pages() call.
>
> * Avoid all arch-specific dependencies for P2P. With struct page the IO
> memory must fit in the linear mapping. This requires some work with
> RISC-V and I remember some complaints from the powerpc people regarding
> this. Certainly not all arches will be able to fit the IO region into
> the linear mapping space.
>
> * Remove a bunch of PCI P2PDMA special case mapping stuff from the block
> layer and RDMA interface (which I've been hearing complaints over).

This seems to be the most salient point. I was missing the fact that
this replaces custom hacks and "special" pages with an explicit "just
pass this pre-mapped address down the stack". It's functionality that
might plausibly be used outside of p2p, as long as the driver can
assert that it never needs to touch the data with the cpu before
handing it off to a dma-engine.
Logan Gunthorpe June 20, 2019, 11:42 p.m. UTC | #7
On 2019-06-20 5:40 p.m., Dan Williams wrote:
> This seems to be the most salient point. I was missing the fact that
> this replaces custom hacks and "special" pages with an explicit "just
> pass this pre-mapped address down the stack". It's functionality that
> might plausibly be used outside of p2p, as long as the driver can
> assert that it never needs to touch the data with the cpu before
> handing it off to a dma-engine.

Yup, that's a good way to put it. If I resend this patchset, I'll
include wording like yours in the cover letter.

Logan
Jason Gunthorpe June 21, 2019, 5:47 p.m. UTC | #8
On Thu, Jun 20, 2019 at 01:18:13PM -0700, Dan Williams wrote:

> > This P2P is quite distinct from DAX as the struct page* would point to
> > non-cacheable weird memory that few struct page users would even be
> > able to work with, while I understand DAX use cases focused on CPU
> > cache coherent memory, and filesystem involvement.
> 
> What I'm poking at is whether this block layer capability can pick up
> users outside of RDMA, more on this below...

The generic capability is to do a transfer through the block layer and
scatter/gather the resulting data to some PCIe BAR memory. Currently
the block layer can only scatter/gather data into CPU cache coherent
memory.

We know of several useful places to put PCIe BAR memory already:
 - On a GPU (or FPGA, accelerator, etc), ie the GB's of GPU private
   memory that is standard these days.
 - On a NVMe CMB. This lets the NVMe drive avoid DMA entirely
 - On a RDMA NIC. Mellanox NICs have a small amount of BAR memory that
   can be used like a CMB and avoids a DMA

RDMA doesn't really get so involved here, except that RDMA is often
the preferred way to source/sink the data buffers after the block layer has
scatter/gathered to them. (and of course RDMA is often for a block
driver, ie NVMe over fabrics)

> > > My primary concern with this is that ascribes a level of generality
> > > that just isn't there for peer-to-peer dma operations. "Peer"
> > > addresses are not "DMA" addresses, and the rules about what can and
> > > can't do peer-DMA are not generically known to the block layer.
> >
> > ?? The P2P infrastructure produces a DMA bus address for the
> > initiating device that is is absolutely a DMA address. There is some
> > intermediate CPU centric representation, but after mapping it is the
> > same as any other DMA bus address.
> 
> Right, this goes back to the confusion caused by the hardware / bus /
> address that a dma-engine would consume directly, and Linux "DMA"
> address as a device-specific translation of host memory.

I don't think there is a confusion :) Logan explained it, the
dma_addr_t is always the thing you program into the DMA engine of the
device it was created for, and this changes nothing about that.

Think of the dma vec as the same as a dma mapped SGL, just with no
available struct page.

> Is the block layer representation of this address going to go through
> a peer / "bus" address translation when it reaches the RDMA driver? 

No, it is just like any other dma mapped SGL, it is ready to go for
the device it was mapped for, and can be used for nothing other than
programming DMA on that device.

> > ie GPU people wouuld really like to do read() and have P2P
> > transparently happen to on-GPU pages. With GPUs having huge amounts of
> > memory loading file data into them is really a performance critical
> > thing.
> 
> A direct-i/o read(2) into a page-less GPU mapping? 

The interesting case is probably an O_DIRECT read into a
DEVICE_PRIVATE page owned by the GPU driver and mmaped into the
process calling read(). The GPU driver can dynamically arrange for
that DEVICE_PRIVATE page to be linked to P2P-targetable BAR memory so
the HW is capable of a direct CPU-bypass transfer from the underlying
block device (ie NVMe or RDMA) to the GPU.

One way to approach this problem is to use this new dma_addr path in
the block layer.

Another way is to feed the DEVICE_PRIVATE pages into the block layer
and have it DMA map them to a P2P address.

In either case we have a situation where the block layer cannot touch
the target struct page buffers with the CPU because there is no cache
coherent CPU mapping for them, and we have to create a CPU clean path
in the block layer.

At best you could do memcpy to/from on these things, but if a GPU is
involved even that is incredibly inefficient. The GPU can do the
memcpy with DMA much faster than a memcpy_to/from_io.

Jason
Dan Williams June 21, 2019, 5:54 p.m. UTC | #9
On Fri, Jun 21, 2019 at 10:47 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Jun 20, 2019 at 01:18:13PM -0700, Dan Williams wrote:
>
> > > This P2P is quite distinct from DAX as the struct page* would point to
> > > non-cacheable weird memory that few struct page users would even be
> > > able to work with, while I understand DAX use cases focused on CPU
> > > cache coherent memory, and filesystem involvement.
> >
> > What I'm poking at is whether this block layer capability can pick up
> > users outside of RDMA, more on this below...
>
> The generic capability is to do a transfer through the block layer and
> scatter/gather the resulting data to some PCIe BAR memory. Currently
> the block layer can only scatter/gather data into CPU cache coherent
> memory.
>
> We know of several useful places to put PCIe BAR memory already:
>  - On a GPU (or FPGA, acclerator, etc), ie the GB's of GPU private
>    memory that is standard these days.
>  - On a NVMe CMB. This lets the NVMe drive avoid DMA entirely
>  - On a RDMA NIC. Mellanox NICs have a small amount of BAR memory that
>    can be used like a CMB and avoids a DMA
>
> RDMA doesn't really get so involved here, except that RDMA is often
> the prefered way to source/sink the data buffers after the block layer has
> scatter/gathered to them. (and of course RDMA is often for a block
> driver, ie NMVe over fabrics)
>
> > > > My primary concern with this is that ascribes a level of generality
> > > > that just isn't there for peer-to-peer dma operations. "Peer"
> > > > addresses are not "DMA" addresses, and the rules about what can and
> > > > can't do peer-DMA are not generically known to the block layer.
> > >
> > > ?? The P2P infrastructure produces a DMA bus address for the
> > > initiating device that is is absolutely a DMA address. There is some
> > > intermediate CPU centric representation, but after mapping it is the
> > > same as any other DMA bus address.
> >
> > Right, this goes back to the confusion caused by the hardware / bus /
> > address that a dma-engine would consume directly, and Linux "DMA"
> > address as a device-specific translation of host memory.
>
> I don't think there is a confusion :) Logan explained it, the
> dma_addr_t is always the thing you program into the DMA engine of the
> device it was created for, and this changes nothing about that.

Yup, Logan and I already settled that point in our last exchange and
he offered to make that clearer in the changelog.
Christoph Hellwig June 24, 2019, 7:27 a.m. UTC | #10
This is not going to fly.

For one, passing a dma_addr_t through the block layer is a layering
violation, and one that I think will also bite us in practice.
The host physical to PCIe bus address mapping can have offsets, and
those offsets absolutely can be different for different root ports.
So with your caller-generated dma_addr_t everything works fine with
a switched setup like the one you are probably testing on, but on a
sufficiently complicated setup with multiple root ports it can break.

Also duplicating the whole block I/O stack, including hooks all over
the fast path is pretty much a no-go.

I've been pondering for a while if we wouldn't be better off just
passing a phys_addr_t + len instead of the page, offset, len tuple
in the bio_vec, though.  If you look at the normal I/O path here
is what we normally do:

 - we get a page as input, either because we have it at hand (e.g.
   from the page cache) or from get_user_pages (which actually calculates
   it from a pfn in the page tables)
 - once in the bio all the merging decisions are based on the physical
   address, so we have to convert it to the physical address there,
   potentially multiple times
 - then dma mapping all works off the physical address, which it gets
   from the page at the start
 - then only the dma address is used for the I/O
 - on I/O completion we often but not always need the page again.  In
   the direct I/O case for reference counting and dirty status, in the
   file system also for things like marking the page uptodate

So if we move to a phys_addr_t we'd need to go back to the page at least
once.  But because of how the merging works we really only need to do
it once per segment, as we can just do pointer arithmetic to get the
following pages.  As we generally go at least once from a physical
address to a page in the merging code, even a relatively expensive vmemmap
lookup shouldn't be too bad.  Even more so given that the super hot path
(small blkdev direct I/O) can actually trivially cache the affected pages
as well.
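
As a small sketch of the lookup pattern described above (variable names
are illustrative), only the first page of a merged segment needs the
potentially expensive lookup:

	/* one vmemmap lookup for the first page of the segment ... */
	struct page *page = pfn_to_page(PHYS_PFN(seg_phys));

	/* ... then plain pointer arithmetic for the rest of the segment,
	 * e.g. when dirtying pages on direct I/O read completion */
	for (i = 0; i < seg_len >> PAGE_SHIFT; i++)
		set_page_dirty_lock(page + i);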

Linus kinda hates the pfn approach, but part of that was really that
it was proposed for file system data, which, as we all found out the
hard way, really can't work as-is without pages.  Another part probably
was a potential performance issue, but between the few page lookups and
the fact that using a single phys_addr_t instead of pfn/page + offset
should avoid quite a few calculations, performance should not actually
be affected, although we'll have to be careful to actually verify that.
Christoph Hellwig June 24, 2019, 7:31 a.m. UTC | #11
On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
> > My primary concern with this is that ascribes a level of generality
> > that just isn't there for peer-to-peer dma operations. "Peer"
> > addresses are not "DMA" addresses, and the rules about what can and
> > can't do peer-DMA are not generically known to the block layer.
> 
> ?? The P2P infrastructure produces a DMA bus address for the
> initiating device that is is absolutely a DMA address. There is some
> intermediate CPU centric representation, but after mapping it is the
> same as any other DMA bus address.
> 
> The map function can tell if the device pair combination can do p2p or
> not.

At the PCIe level there is no such thing as a DMA address; it is all
bus addresses, with MMIO and DMA in the same address space (without
that, P2P would have no chance of actually working, obviously).  But
that bus address space is different per "bus" (which would be a
root port in PCIe), and we need to be careful about that.
Jason Gunthorpe June 24, 2019, 1:46 p.m. UTC | #12
On Mon, Jun 24, 2019 at 09:31:26AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
> > > My primary concern with this is that ascribes a level of generality
> > > that just isn't there for peer-to-peer dma operations. "Peer"
> > > addresses are not "DMA" addresses, and the rules about what can and
> > > can't do peer-DMA are not generically known to the block layer.
> > 
> > ?? The P2P infrastructure produces a DMA bus address for the
> > initiating device that is is absolutely a DMA address. There is some
> > intermediate CPU centric representation, but after mapping it is the
> > same as any other DMA bus address.
> > 
> > The map function can tell if the device pair combination can do p2p or
> > not.
> 
> At the PCIe level there is no such thing as a DMA address, it all
> is bus address with MMIO and DMA in the same address space (without
> that P2P would have not chance of actually working obviously).  But
> that bus address space is different per "bus" (which would be an
> root port in PCIe), and we need to be careful about that.

Sure, that is how dma_addr_t is supposed to work - it is always a
device specific value that can be used only by the device that it was
created for, and different devices could have different dma_addr_t
values for the same memory. 

So when Logan goes and puts dma_addr_t into the block stack he must
also invert things so that the DMA map happens at the start of the
process to create the right dma_addr_t early.

I'm not totally clear if this series did that inversion; if it didn't,
then it should not be using the dma_addr_t label at all, or referring
to anything as a 'dma address', as that is just confusing.

BTW, it is not just an offset, right? It is possible that the IOMMU can
generate unique dma_addr_t values for each device?? Simple offset is
just something we saw in certain embedded cases, IIRC.

Jason
Christoph Hellwig June 24, 2019, 1:50 p.m. UTC | #13
On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
> BTW, it is not just offset right? It is possible that the IOMMU can
> generate unique dma_addr_t values for each device?? Simple offset is
> just something we saw in certain embedded cases, IIRC.

Yes, it could.  If we are trying to do P2P between two devices on
different root ports and with the IOMMU enabled we'll generate
a new bus address for the BAR on the other side dynamically every time
we map.
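
For reference, the DMA API already has a primitive for this kind of
per-device mapping of MMIO regions; a sketch (variable names are
illustrative) of mapping a peer BAR for one specific client device:

	/* Map a physical BAR address for DMA by 'client_dev'.  The returned
	 * dma_addr_t is valid only for that device and may differ from the
	 * value another device would get for the same BAR. */
	dma_addr_t dma = dma_map_resource(client_dev, bar_phys, size,
					  DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(client_dev, dma))
		return -EIO;	/* mapping failed */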
Jason Gunthorpe June 24, 2019, 1:55 p.m. UTC | #14
On Mon, Jun 24, 2019 at 03:50:24PM +0200, Christoph Hellwig wrote:
> On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
> > BTW, it is not just offset right? It is possible that the IOMMU can
> > generate unique dma_addr_t values for each device?? Simple offset is
> > just something we saw in certain embedded cases, IIRC.
> 
> Yes, it could.  If we are trying to do P2P between two devices on
> different root ports and with the IOMMU enabled we'll generate
> a new bus address for the BAR on the other side dynamically everytime
> we map.

Even with the same root port, if ACS is turned on it could behave like this.

It is only a very narrow case where you can take shortcuts with
dma_addr_t, and I don't think shortcuts like that are appropriate for
the mainline kernel.

Jason
Logan Gunthorpe June 24, 2019, 4:07 p.m. UTC | #15
On 2019-06-24 1:27 a.m., Christoph Hellwig wrote:
> This is not going to fly.
> 
> For one passing a dma_addr_t through the block layer is a layering
> violation, and one that I think will also bite us in practice.
> The host physical to PCIe bus address mapping can have offsets, and
> those offsets absolutely can be different for differnet root ports.
> So with your caller generated dma_addr_t everything works fine with
> a switched setup as the one you are probably testing on, but on a
> sufficiently complicated setup with multiple root ports it can break.

I don't follow this argument. Yes, I understand PCI bus offsets and yes
I understand that they only apply beyond the bus they're working with.
But this isn't *that* complicated and it should be the responsibility of
the P2PDMA code to sort this out and provide a dma_addr_t. The dma_addr_t
that's passed through the block layer could be a bus address or it could
be the result of a dma_map_* request (if the transaction is found to go
through an RC) depending on the requirements of the devices being used.

> Also duplicating the whole block I/O stack, including hooks all over
> the fast path is pretty much a no-go.

There was very little duplicate code in the patch set. (Really just the
mapping code). There are a few hooks, but in practice not that many if
we ignore the WARN_ONs. We might be able to work to reduce this further.
The main hooks are: when we skip bouncing, when we skip integrity prep,
when we split, and when we map. And the patchset drops the PCI_P2PDMA
hook when we map. So we're talking about maybe three or four extra ifs
that would likely normally be fast due to the branch predictor.
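
To give a sense of the scale of these hooks, here is a hedged sketch of
the kind of check being discussed; the flag and helper names are
illustrative, not the actual identifiers from the patch set:

	/* Assumed helper: a dma-direct bio carries a flag set at submission */
	static inline bool bio_is_dma_direct(struct bio *bio)
	{
		return bio->bi_opf & REQ_DMA_DIRECT;	/* illustrative flag */
	}

	/* e.g. the bounce path just skips such bios, since BAR memory can
	 * never be bounced through a kernel buffer */
	void blk_queue_bounce(struct request_queue *q, struct bio **bio)
	{
		if (bio_is_dma_direct(*bio))
			return;
		/* ... existing bounce handling ... */
	}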

> I've been pondering for a while if we wouldn't be better off just
> passing a phys_addr_t + len instead of the page, offset, len tuple
> in the bio_vec, though.  If you look at the normal I/O path here
> is what we normally do:
> 
>  - we get a page as input, either because we have it at hand (e.g.
>    from the page cache) or from get_user_pages (which actually caculates
>    it from a pfn in the page tables)
>  - once in the bio all the merging decisions are based on the physical
>    address, so we have to convert it to the physical address there,
>    potentially multiple times
>  - then dma mapping all works off the physical address, which it gets
>    from the page at the start
>  - then only the dma address is used for the I/O
>  - on I/O completion we often but not always need the page again.  In
>    the direct I/O case for reference counting and dirty status, in the
>    file system also for things like marking the page uptodate
> 
> So if we move to a phys_addr_t we'd need to go back to the page at least
> once.  But because of how the merging works we really only need to do
> it once per segment, as we can just do pointer arithmerics do get the
> following pages.  As we generally go at least once from a physical
> address to a page in the merging code even a relatively expensive vmem_map
> looks shouldn't be too bad.  Even more so given that the super hot path
> (small blkdev direct I/O) can actually trivially cache the affected pages
> as well.

I've always wondered why it wasn't done this way. Passing around a page
pointer *and* an offset always seemed less efficient than just a
physical address. If we did do this, the proposed dma_addr_t and
phys_addr_t paths through the block layer could be a lot more similar as
things like the split calculation could work on either address type.
We'd just have to prevent bouncing and integrity prep, and have a hook
for how it's mapped.

> Linus kinda hates the pfn approach, but part of that was really that
> it was proposed for file system data, which we all found out really
> can't work as-is without pages the hard way.  Another part probably
> was potential performance issue, but between the few page lookups, and
> the fact that using a single phys_addr_t instead of pfn/page + offset
> should avoid quite a few calculations performance should not actually
> be affected, although we'll have to be careful to actually verify that.

Yes, I'd agree that removing the offset should make things simpler. But
that requires changing a lot of stuff and doesn't really help what I'm
trying to do.

Logan
Logan Gunthorpe June 24, 2019, 4:10 p.m. UTC | #16
On 2019-06-24 7:46 a.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 09:31:26AM +0200, Christoph Hellwig wrote:
>> On Thu, Jun 20, 2019 at 04:33:53PM -0300, Jason Gunthorpe wrote:
>>>> My primary concern with this is that ascribes a level of generality
>>>> that just isn't there for peer-to-peer dma operations. "Peer"
>>>> addresses are not "DMA" addresses, and the rules about what can and
>>>> can't do peer-DMA are not generically known to the block layer.
>>>
>>> ?? The P2P infrastructure produces a DMA bus address for the
>>> initiating device that is is absolutely a DMA address. There is some
>>> intermediate CPU centric representation, but after mapping it is the
>>> same as any other DMA bus address.
>>>
>>> The map function can tell if the device pair combination can do p2p or
>>> not.
>>
>> At the PCIe level there is no such thing as a DMA address, it all
>> is bus address with MMIO and DMA in the same address space (without
>> that P2P would have not chance of actually working obviously).  But
>> that bus address space is different per "bus" (which would be an
>> root port in PCIe), and we need to be careful about that.
> 
> Sure, that is how dma_addr_t is supposed to work - it is always a
> device specific value that can be used only by the device that it was
> created for, and different devices could have different dma_addr_t
> values for the same memory. 
> 
> So when Logan goes and puts dma_addr_t into the block stack he must
> also invert things so that the DMA map happens at the start of the
> process to create the right dma_addr_t early.

Yes, that's correct. The intent was to invert it so the dma_map could
happen at the start of the process so that P2PDMA code could be called
with all the information it needs to make its decision on how to map;
without having to hook into the mapping process of every driver that
wants to participate.

Logan
Logan Gunthorpe June 24, 2019, 4:53 p.m. UTC | #17
On 2019-06-24 7:55 a.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 03:50:24PM +0200, Christoph Hellwig wrote:
>> On Mon, Jun 24, 2019 at 10:46:41AM -0300, Jason Gunthorpe wrote:
>>> BTW, it is not just offset right? It is possible that the IOMMU can
>>> generate unique dma_addr_t values for each device?? Simple offset is
>>> just something we saw in certain embedded cases, IIRC.
>>
>> Yes, it could.  If we are trying to do P2P between two devices on
>> different root ports and with the IOMMU enabled we'll generate
>> a new bus address for the BAR on the other side dynamically everytime
>> we map.
> 
> Even with the same root port if ACS is turned on could behave like this.

Yup.

> It is only a very narrow case where you can take shortcuts with
> dma_addr_t, and I don't think shortcuts like are are appropriate for
> the mainline kernel..

I don't think it's that narrow and it opens up a lot of avenues for
system design that people want to pursue. If your high-speed data
path can avoid the root complex and CPU, you can design a system with a
much smaller CPU and fewer lanes directed at the CPU.

Logan
Jason Gunthorpe June 24, 2019, 6:16 p.m. UTC | #18
On Mon, Jun 24, 2019 at 10:53:38AM -0600, Logan Gunthorpe wrote:
> > It is only a very narrow case where you can take shortcuts with
> > dma_addr_t, and I don't think shortcuts like are are appropriate for
> > the mainline kernel..
> 
> I don't think it's that narrow and it opens up a lot of avenues for
> system design that people are wanting to go. If your high speed data
> path can avoid the root complex and CPU, you can design a system which a
> much smaller CPU and fewer lanes directed at the CPU.

I mean the shortcut that something generates dma_addr_t for Device A
and then passes it to Device B - that is too hacky for mainline.

Sounded like this series does generate the dma_addr for the correct
device..

Jason
Logan Gunthorpe June 24, 2019, 6:28 p.m. UTC | #19
On 2019-06-24 12:16 p.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 10:53:38AM -0600, Logan Gunthorpe wrote:
>>> It is only a very narrow case where you can take shortcuts with
>>> dma_addr_t, and I don't think shortcuts like are are appropriate for
>>> the mainline kernel..
>>
>> I don't think it's that narrow and it opens up a lot of avenues for
>> system design that people are wanting to go. If your high speed data
>> path can avoid the root complex and CPU, you can design a system which a
>> much smaller CPU and fewer lanes directed at the CPU.
> 
> I mean the shortcut that something generates dma_addr_t for Device A
> and then passes it to Device B - that is too hacky for mainline.

Oh, that's not a shortcut. It's completely invalid and not likely to
work in any case. If you're mapping something you have to pass the
device that the dma_addr_t is being programmed into.

> Sounded like this series does generate the dma_addr for the correct
> device..

This series doesn't generate any DMA addresses with dma_map(). The
current p2pdma code ensures everything is behind the same root port and
only uses the pci bus address. This is valid and correct, but yes it's
something to expand upon.

I'll be doing some work shortly to add transactions that go through the
IOMMU and call dma_map_* when appropriate.

Logan
Jason Gunthorpe June 24, 2019, 6:54 p.m. UTC | #20
On Mon, Jun 24, 2019 at 12:28:33PM -0600, Logan Gunthorpe wrote:

> > Sounded like this series does generate the dma_addr for the correct
> > device..
> 
> This series doesn't generate any DMA addresses with dma_map(). The
> current p2pdma code ensures everything is behind the same root port and
> only uses the pci bus address. This is valid and correct, but yes it's
> something to expand upon.

I think if you do this it still has to be presented as the same kind of
API as dma_map, which takes in the target device * and produces the
device-specific dma_addr_t

Otherwise this whole thing is confusing and looks like *all* of it can
only work under the switch assumption

Jason
Logan Gunthorpe June 24, 2019, 7:37 p.m. UTC | #21
On 2019-06-24 12:54 p.m., Jason Gunthorpe wrote:
> On Mon, Jun 24, 2019 at 12:28:33PM -0600, Logan Gunthorpe wrote:
> 
>>> Sounded like this series does generate the dma_addr for the correct
>>> device..
>>
>> This series doesn't generate any DMA addresses with dma_map(). The
>> current p2pdma code ensures everything is behind the same root port and
>> only uses the pci bus address. This is valid and correct, but yes it's
>> something to expand upon.
> 
> I think if you do this it still has to be presented as the same API
> like dma_map that takes in the target device * and produces the device
> specific dma_addr_t

Yes, once we consider the case where it can go through the root complex,
we will need an API similar to dma_map(). We got rid of that API because
it wasn't yet required or used by anything and, per our best practices,
we don't add features that aren't used as that is more confusing for
people reading/reworking the code.

> Otherwise this whole thing is confusing and looks like *all* of it can
> only work under the switch assumption

Hopefully it'll be clearer once we do the work to map for going through
the root complex. It's not that confusing to me. But it's all orthogonal
to the dma_addr_t through the block layer concept.

Logan
Christoph Hellwig June 25, 2019, 7:18 a.m. UTC | #22
On Mon, Jun 24, 2019 at 10:10:16AM -0600, Logan Gunthorpe wrote:
> Yes, that's correct. The intent was to invert it so the dma_map could
> happen at the start of the process so that P2PDMA code could be called
> with all the information it needs to make it's decision on how to map;
> without having to hook into the mapping process of every driver that
> wants to participate.

And that just isn't how things work in layering.  We need to keep
generating the dma addresses in the driver on the receiving end, as
there are all kinds of interesting ideas about how we do that.  E.g. for
Mellanox NICs, addressing their own BARs is not done by PCIe bus
addresses but by relative offsets.  And while NVMe has refused to go
down that route in the current band-aid fix for CMB addressing, I suspect
it will sooner or later have to do the same to deal with the addressing
problems in a multiple-PASID world.
Christoph Hellwig June 25, 2019, 7:20 a.m. UTC | #23
On Mon, Jun 24, 2019 at 10:07:56AM -0600, Logan Gunthorpe wrote:
> > For one passing a dma_addr_t through the block layer is a layering
> > violation, and one that I think will also bite us in practice.
> > The host physical to PCIe bus address mapping can have offsets, and
> > those offsets absolutely can be different for differnet root ports.
> > So with your caller generated dma_addr_t everything works fine with
> > a switched setup as the one you are probably testing on, but on a
> > sufficiently complicated setup with multiple root ports it can break.
> 
> I don't follow this argument. Yes, I understand PCI Bus offsets and yes
> I understand that they only apply beyond the bus they're working with.
> But this isn't *that* complicated and it should be the responsibility of
> the P2PDMA code to sort out and provide a dma_addr_t for. The dma_addr_t
> that's passed through the block layer could be a bus address or it could
> be the result of a dma_map_* request (if the transaction is found to go
> through an RC) depending on the requirements of the devices being used.

You assume all addressing is done by the PCI bus address.  If a device
is addressing its own BAR there is no reason to use the PCI bus address,
as it might have much more intelligent schemes (usually BAR + offset).
> 
> > Also duplicating the whole block I/O stack, including hooks all over
> > the fast path is pretty much a no-go.
> 
> There was very little duplicate code in the patch set. (Really just the
> mapping code). There are a few hooks, but in practice not that many if
> we ignore the WARN_ONs. We might be able to work to reduce this further.
> The main hooks are: when we skip bouncing, when we skip integrity prep,
> when we split, and when we map. And the patchset drops the PCI_P2PDMA
> hook when we map. So we're talking about maybe three or four extra ifs
> that would likely normally be fast due to the branch predictor.

And all of those add code to the block layer fast path.
Logan Gunthorpe June 25, 2019, 3:57 p.m. UTC | #24
On 2019-06-25 1:20 a.m., Christoph Hellwig wrote:
> On Mon, Jun 24, 2019 at 10:07:56AM -0600, Logan Gunthorpe wrote:
>>> For one passing a dma_addr_t through the block layer is a layering
>>> violation, and one that I think will also bite us in practice.
>>> The host physical to PCIe bus address mapping can have offsets, and
>>> those offsets absolutely can be different for differnet root ports.
>>> So with your caller generated dma_addr_t everything works fine with
>>> a switched setup as the one you are probably testing on, but on a
>>> sufficiently complicated setup with multiple root ports it can break.
>>
>> I don't follow this argument. Yes, I understand PCI Bus offsets and yes
>> I understand that they only apply beyond the bus they're working with.
>> But this isn't *that* complicated and it should be the responsibility of
>> the P2PDMA code to sort out and provide a dma_addr_t for. The dma_addr_t
>> that's passed through the block layer could be a bus address or it could
>> be the result of a dma_map_* request (if the transaction is found to go
>> through an RC) depending on the requirements of the devices being used.
> 
> You assume all addressing is done by the PCI bus address.  If a device
> is addressing its own BAR there is no reason to use the PCI bus address,
> as it might have much more intelligent schemes (usually bar + offset).

Yes, that will be a bit tricky regardless of what we do.

>>> Also duplicating the whole block I/O stack, including hooks all over
>>> the fast path is pretty much a no-go.
>>
>> There was very little duplicate code in the patch set. (Really just the
>> mapping code). There are a few hooks, but in practice not that many if
>> we ignore the WARN_ONs. We might be able to work to reduce this further.
>> The main hooks are: when we skip bouncing, when we skip integrity prep,
>> when we split, and when we map. And the patchset drops the PCI_P2PDMA
>> hook when we map. So we're talking about maybe three or four extra ifs
>> that would likely normally be fast due to the branch predictor.
> 
> And all of those add code to the block layer fast path.

If we can't add any ifs to the block layer, there's really nothing we
can do.

So then we're committed to using struct page for P2P?

Logan
Christoph Hellwig June 25, 2019, 5:01 p.m. UTC | #25
On Tue, Jun 25, 2019 at 09:57:52AM -0600, Logan Gunthorpe wrote:
> > You assume all addressing is done by the PCI bus address.  If a device
> > is addressing its own BAR there is no reason to use the PCI bus address,
> > as it might have much more intelligent schemes (usually bar + offset).
> 
> Yes, that will be a bit tricky regardless of what we do.

At least right now it isn't at all.  I've implemented support for
a draft NVMe proposal for that, and it basically boils down to this
in the p2p path:

	addr = sg_phys(sg);

	if (page->pgmap->dev == ctrl->dev && HAS_RELATIVE_ADDRESSING) {
		addr -= ctrl->cmb_start_addr;

		// set magic flag in the SGL
	} else {
		addr -= pgmap->pci_p2pdma_bus_offset;
	}

without the pagemap it would require a range compare instead, which
isn't all that hard either.
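
A minimal sketch of that range compare alternative, assuming illustrative
field names for the controller's CMB window:

	/* without page->pgmap, decide whether the address falls inside the
	 * controller's own CMB BAR by comparing against its window */
	if (HAS_RELATIVE_ADDRESSING &&
	    addr >= ctrl->cmb_start_addr &&
	    addr <  ctrl->cmb_start_addr + ctrl->cmb_size) {
		addr -= ctrl->cmb_start_addr;
		// set magic flag in the SGL
	} else {
		// apply the provider's bus offset as before
	}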

> >>> Also duplicating the whole block I/O stack, including hooks all over
> >>> the fast path is pretty much a no-go.
> >>
> >> There was very little duplicate code in the patch set. (Really just the
> >> mapping code). There are a few hooks, but in practice not that many if
> >> we ignore the WARN_ONs. We might be able to work to reduce this further.
> >> The main hooks are: when we skip bouncing, when we skip integrity prep,
> >> when we split, and when we map. And the patchset drops the PCI_P2PDMA
> >> hook when we map. So we're talking about maybe three or four extra ifs
> >> that would likely normally be fast due to the branch predictor.
> > 
> > And all of those add code to the block layer fast path.
> 
> If we can't add any ifs to the block layer, there's really nothing we
> can do.

That is not what I said.  Of course we can.  But we rather have a
really good reason.  And adding a parallel I/O path violating the
highlevel model is not one.

> So then we're committed to using struct page for P2P?

Only until we have a significantly better solution.  And I think
using physical addresses in some form instead of pages is that;
adding a parallel path with dma_addr_t is not, and it is actually worse
than the current code in many respects.
Logan Gunthorpe June 25, 2019, 7:54 p.m. UTC | #26
On 2019-06-25 11:01 a.m., Christoph Hellwig wrote:
> On Tue, Jun 25, 2019 at 09:57:52AM -0600, Logan Gunthorpe wrote:
>>> You assume all addressing is done by the PCI bus address.  If a device
>>> is addressing its own BAR there is no reason to use the PCI bus address,
>>> as it might have much more intelligent schemes (usually bar + offset).
>>
>> Yes, that will be a bit tricky regardless of what we do.
> 
> At least right now it isn't at all.  I've implemented support for
> a draft NVMe proposal for that, and it basically boils down to this
> in the p2p path:
> 
> 	addr = sg_phys(sg);
> 
> 	if (page->pgmap->dev == ctrl->dev && HAS_RELATIVE_ADDRESSING) {
> 		addr -= ctrl->cmb_start_addr;
> 
> 		// set magic flag in the SGL
> 	} else {
> 		addr -= pgmap->pci_p2pdma_bus_offset;
> 	}
> 
> without the pagemap it would require a range compare instead, which
> isn't all that hard either.
> 
>>>>> Also duplicating the whole block I/O stack, including hooks all over
>>>>> the fast path is pretty much a no-go.
>>>>
>>>> There was very little duplicate code in the patch set. (Really just the
>>>> mapping code). There are a few hooks, but in practice not that many if
>>>> we ignore the WARN_ONs. We might be able to work to reduce this further.
>>>> The main hooks are: when we skip bouncing, when we skip integrity prep,
>>>> when we split, and when we map. And the patchset drops the PCI_P2PDMA
>>>> hook when we map. So we're talking about maybe three or four extra ifs
>>>> that would likely normally be fast due to the branch predictor.
>>>
>>> And all of those add code to the block layer fast path.
>>
>> If we can't add any ifs to the block layer, there's really nothing we
>> can do.
> 
> That is not what I said.  Of course we can.  But we rather have a
> really good reason.  And adding a parallel I/O path violating the
> highlevel model is not one.
> 
>> So then we're committed to using struct page for P2P?
> 
> Only until we have a significantly better solution.  And I think
> using physical addresses in some form instead of pages is that;
> adding a parallel path with dma_addr_t is not, and it is actually worse
> than the current code in many respects.

Well whether it's dma_addr_t, phys_addr_t, pfn_t the result isn't all
that different. You still need roughly the same 'if' hooks for any
backed memory that isn't in the linear mapping and you can't get a
kernel mapping for directly.

It wouldn't be too hard to do a similar patch set that uses something
like phys_addr_t instead and have a request and queue flag for support
of non-mappable memory. But you'll end up with very similar 'if' hooks
and we'd have to clean up all bio-using drivers that access the struct
pages directly.

Though, we'd also still have the problem of how to recognize when the
address points to P2PDMA and needs to be translated to the bus offset.
The map-first inversion was what helped here because the driver
submitting the requests had all the information. Though it could be
another request flag and indicating non-mappable memory could be a flag
group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS.

If you think any of the above ideas sound workable I'd be happy to try
to code up another prototype.

Logan
Christoph Hellwig June 26, 2019, 6:57 a.m. UTC | #27
On Tue, Jun 25, 2019 at 01:54:21PM -0600, Logan Gunthorpe wrote:
> Well whether it's dma_addr_t, phys_addr_t, pfn_t the result isn't all
> that different. You still need roughly the same 'if' hooks for any
> backed memory that isn't in the linear mapping and you can't get a
> kernel mapping for directly.
> 
> It wouldn't be too hard to do a similar patch set that uses something
> like phys_addr_t instead and have a request and queue flag for support
> of non-mappable memory. But you'll end up with very similar 'if' hooks
> and we'd have to clean up all bio-using drivers that access the struct
> pages directly.

We'll need to clean that mess up anyway, and I've been chugging
along doing some of that.  A lot of them still assume no highmem, so we
need to convert them over to something that kmaps anyway.  If we get
the abstraction right that will actually help with converting over to
a better representation.

> Though, we'd also still have the problem of how to recognize when the
> address points to P2PDMA and needs to be translated to the bus offset.
> The map-first inversion was what helped here because the driver
> submitting the requests had all the information. Though it could be
> another request flag and indicating non-mappable memory could be a flag
> group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS.

That assumes the whole request has the same type of memory, which is a
simplifying assumption.  My idea was that if we had our new bio_vec like this:

struct bio_vec {
	phys_addr_t		paddr; // 64-bit on 64-bit systems
	unsigned long		len;
};

we have a hole behind len where we could store a flag.  Preferably
that would be gated by a P2P or other magic memory types config
option so that 32-bit systems with a 32-bit phys_addr_t actually
benefit from the smaller and better packed structure.

> If you think any of the above ideas sound workable I'd be happy to try
> to code up another prototype.

It sounds workable.  Some of the first steps are cleanups independent
of how the bio_vec is eventually going to look.  That is making
the DMA-API internals work on the phys_addr_t, which also unifies the
map_resource implementation with map_page.  I plan to do that relatively
soon.  The next is sorting out access to bio data by virtual address.
All of these need nice kmapping helpers that avoid too much open coding.
I was going to look into that next, mostly to kill the block layer
bounce buffering code.  Similar things will also be needed at the
scatterlist level I think.  After that we need more audits of
how bv_page is still used.  Something like a bv_phys() helper that
does "page_to_phys(bv->bv_page) + bv->bv_offset" might come in handy
for example.
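A minimal sketch of such a bv_phys() helper, using exactly that expression, is
shown below (only the name bv_phys is assumed, the fields are today's bio_vec):

	/* Sketch: resolve a bio_vec to its physical address */
	static inline phys_addr_t bv_phys(const struct bio_vec *bv)
	{
		return page_to_phys(bv->bv_page) + bv->bv_offset;
	}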
Logan Gunthorpe June 26, 2019, 6:31 p.m. UTC | #28
On 2019-06-26 12:57 a.m., Christoph Hellwig wrote:
> On Tue, Jun 25, 2019 at 01:54:21PM -0600, Logan Gunthorpe wrote:
>> Well whether it's dma_addr_t, phys_addr_t, pfn_t the result isn't all
>> that different. You still need roughly the same 'if' hooks for any
>> backed memory that isn't in the linear mapping and you can't get a
>> kernel mapping for directly.
>>
>> It wouldn't be too hard to do a similar patch set that uses something
>> like phys_addr_t instead and have a request and queue flag for support
>> of non-mappable memory. But you'll end up with very similar 'if' hooks
>> and we'd have to clean up all bio-using drivers that access the struct
>> pages directly.
> 
> We'll need to clean that mess up anyway, and I've been chugging
> along doing some of that.  A lot of them still assume no highmem, so we
> need to convert them over to something that kmaps anyway.  If we get
> the abstraction right that will actually help with converting over to
> a better representation.
> 
>> Though, we'd also still have the problem of how to recognize when the
>> address points to P2PDMA and needs to be translated to the bus offset.
>> The map-first inversion was what helped here because the driver
>> submitting the requests had all the information. Though it could be
>> another request flag and indicating non-mappable memory could be a flag
>> group like REQ_NOMERGE_FLAGS -- REQ_NOMAP_FLAGS.
> 
> That assumes the whole request has the same type of memory, which is a
> simplifying assumption.  My idea was that if we had our new bio_vec like this:
> 
> struct bio_vec {
> 	phys_addr_t		paddr; // 64-bit on 64-bit systems
> 	unsigned long		len;
> };
> 
> we have a hole behind len where we could store a flag.  Preferably
> optionally based on a P2P or other magic memory types config
> option so that 32-bit systems with 32-bit phys_addr_t actually
> benefit from the smaller and better packing structure.

That seems sensible. The one thing that's unclear though is how to get
the PCI Bus address when appropriate. Can we pass that in instead of the
phys_addr with an appropriate flag? Or will we need to pass the actual
physical address and then, at the map step, the driver has to some how
lookup the PCI device to figure out the bus offset?

>> If you think any of the above ideas sound workable I'd be happy to try
>> to code up another prototype.
> 
> It sounds workable.  Some of the first steps are cleanups independent
> of how the bio_vec is eventually going to look.  That is making
> the DMA-API internals work on the phys_addr_t, which also unifies the
> map_resource implementation with map_page.  I plan to do that relatively
> soon.  The next is sorting out access to bio data by virtual address.
> All of these need nice kmapping helpers that avoid too much open coding.
> I was going to look into that next, mostly to kill the block layer
> bounce buffering code.  Similar things will also be needed at the
> scatterlist level I think.  After that we need more audits of
> how bv_page is still used.  Something like a bv_phys() helper that
> does "page_to_phys(bv->bv_page) + bv->bv_offset" might come in handy
> for example.

Ok, I should be able to help with that. When I have a chance I'll try to
look at the bv_phys() helper.

Logan
Jason Gunthorpe June 26, 2019, 8:21 p.m. UTC | #29
On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
> > we have a hole behind len where we could store a flag.  Preferably
> > optionally based on a P2P or other magic memory types config
> > option so that 32-bit systems with 32-bit phys_addr_t actually
> > benefit from the smaller and better packing structure.
> 
> That seems sensible. The one thing that's unclear though is how to get
> the PCI Bus address when appropriate. Can we pass that in instead of the
> phys_addr with an appropriate flag? Or will we need to pass the actual
> physical address and then, at the map step, the driver has to some how
> lookup the PCI device to figure out the bus offset?

I agree with CH, if we go down this path it is a layering violation
for the thing injecting bio's into the block stack to know what struct
device they egress&dma map on just to be able to do the dma_map up
front.

So we must be able to go from this new phys_addr_t&flags to some BAR
information during dma_map.

For instance we could use a small hash table of the upper phys addr
bits, or an interval tree, to do the lookup.

The bar info would give the exporting struct device and any other info
we need to make the iommu mapping.
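As a rough illustration of that kind of lookup (all names below are made up,
registration and locking are omitted), the kernel's generic interval tree
could be used along these lines:

#include <linux/interval_tree.h>
#include <linux/kernel.h>
#include <linux/pci.h>

/* Hypothetical "bar info" registered by each P2P provider */
struct p2p_bar_info {
	struct interval_tree_node node;	/* node.start/node.last: phys range */
	struct pci_dev *provider;	/* exporting device */
	u64 bus_offset;			/* phys -> PCI bus address offset */
};

static struct rb_root_cached p2p_bar_tree = RB_ROOT_CACHED;

/* Find which exported BAR, if any, a physical address falls in.
 * Assumes 64-bit where unsigned long can hold a phys_addr_t. */
static struct p2p_bar_info *p2p_bar_lookup(phys_addr_t addr)
{
	struct interval_tree_node *node;

	node = interval_tree_iter_first(&p2p_bar_tree, addr, addr);
	if (!node)
		return NULL;	/* plain host memory, no special handling */

	return container_of(node, struct p2p_bar_info, node);
}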

This phys_addr_t seems like a good approach to me as it avoids the
struct page overheads and will lets us provide copy from/to bio
primitives that could work on BAR memory. I think we can surely use
this approach in RDMA as well.

Jason
Dan Williams June 26, 2019, 8:39 p.m. UTC | #30
On Wed, Jun 26, 2019 at 1:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
> > > we have a hole behind len where we could store a flag.  Preferably
> > > optionally based on a P2P or other magic memory types config
> > > option so that 32-bit systems with 32-bit phys_addr_t actually
> > > benefit from the smaller and better packing structure.
> >
> > That seems sensible. The one thing that's unclear though is how to get
> > the PCI Bus address when appropriate. Can we pass that in instead of the
> > phys_addr with an appropriate flag? Or will we need to pass the actual
> > physical address and then, at the map step, the driver has to some how
> > lookup the PCI device to figure out the bus offset?
>
> I agree with CH, if we go down this path it is a layering violation
> for the thing injecting bio's into the block stack to know what struct
> device they egress&dma map on just to be able to do the dma_map up
> front.
>
> So we must be able to go from this new phys_addr_t&flags to some BAR
> information during dma_map.
>
> For instance we could use a small hash table of the upper phys addr
> bits, or an interval tree, to do the lookup.

Hmm, that sounds like dev_pagemap without the pages.

There's already no requirement that dev_pagemap point to real /
present pages (DEVICE_PRIVATE), so it seems a straightforward extension to
use it to help coordinate phys_addr_t in 'struct bio'. Then
Logan's future plans to let userspace coordinate p2p operations could
build on PTE_DEVMAP.
Logan Gunthorpe June 26, 2019, 8:45 p.m. UTC | #31
On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote:
> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
>>> we have a hole behind len where we could store a flag.  Preferably
>>> optionally based on a P2P or other magic memory types config
>>> option so that 32-bit systems with 32-bit phys_addr_t actually
>>> benefit from the smaller and better packing structure.
>>
>> That seems sensible. The one thing that's unclear though is how to get
>> the PCI Bus address when appropriate. Can we pass that in instead of the
>> phys_addr with an appropriate flag? Or will we need to pass the actual
>> physical address and then, at the map step, the driver has to some how
>> lookup the PCI device to figure out the bus offset?
> 
> I agree with CH, if we go down this path it is a layering violation
> for the thing injecting bio's into the block stack to know what struct
> device they egress&dma map on just to be able to do the dma_map up
> front.

Not sure I agree with this statement. The p2pdma code already *must*
know and access the pci_dev of the dma device ahead of when it submits
the IO to know if it's valid to allocate and use P2P memory at all. This
is why the submitting driver has a lot of the information needed to map
this memory that the mapping driver does not.

> So we must be able to go from this new phys_addr_t&flags to some BAR
> information during dma_map.

> For instance we could use a small hash table of the upper phys addr
> bits, or an interval tree, to do the lookup.

Yes, if we're going to take a hard stance on this. But using an interval
tree (or similar) is a lot more work for the CPU to figure out these
mappings that may not be strictly necessary if we could just pass better
information down from the submitting driver to the mapping driver.

> The bar info would give the exporting struct device and any other info
> we need to make the iommu mapping.

Well, the IOMMU mapping is the normal thing the mapping driver will
always do. We'd really just need the submitting driver to, when
appropriate, inform the mapping driver that this is a pci bus address
and not to call dma_map_xxx(). Then, for special mappings for the CMB
like Christoph is talking about, it's simply a matter of doing a range
compare on the PCI Bus address and converting the bus address to a BAR
and offset.

Logan
Jason Gunthorpe June 26, 2019, 8:54 p.m. UTC | #32
On Wed, Jun 26, 2019 at 01:39:01PM -0700, Dan Williams wrote:
> Hmm, that sounds like dev_pagemap without the pages.

Yes, and other page related overhead. Maybe both ideas can exist in
the pagemap code?

All that is needed here is to map a bar phys_addr_t to some 'bar info'
that helps the mapping.

Jason
Logan Gunthorpe June 26, 2019, 8:55 p.m. UTC | #33
On 2019-06-26 2:39 p.m., Dan Williams wrote:
> On Wed, Jun 26, 2019 at 1:21 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
>>>> we have a hole behind len where we could store a flag.  Preferably
>>>> optionally based on a P2P or other magic memory types config
>>>> option so that 32-bit systems with 32-bit phys_addr_t actually
>>>> benefit from the smaller and better packing structure.
>>>
>>> That seems sensible. The one thing that's unclear though is how to get
>>> the PCI Bus address when appropriate. Can we pass that in instead of the
>>> phys_addr with an appropriate flag? Or will we need to pass the actual
>>> physical address and then, at the map step, the driver has to some how
>>> lookup the PCI device to figure out the bus offset?
>>
>> I agree with CH, if we go down this path it is a layering violation
>> for the thing injecting bio's into the block stack to know what struct
>> device they egress&dma map on just to be able to do the dma_map up
>> front.
>>
>> So we must be able to go from this new phys_addr_t&flags to some BAR
>> information during dma_map.
>>
>> For instance we could use a small hash table of the upper phys addr
>> bits, or an interval tree, to do the lookup.
> 
> Hmm, that sounds like dev_pagemap without the pages.

Yup, that's why I'd like to avoid it, but IMO it would still be an
improvement to use an interval tree over struct pages because without
struct page we just have a range and a length and it's relatively easy
to check that the whole range belongs to a specific pci_dev. To be
correct with the struct page approach we really have to loop through all
pages to ensure they all belong to the same pci_dev which is a big pain.

> There's already no requirement that dev_pagemap point to real /
> present pages (DEVICE_PRIVATE), so it seems a straightforward extension to
> use it to help coordinate phys_addr_t in 'struct bio'. Then
> Logan's future plans to let userspace coordinate p2p operations could
> build on PTE_DEVMAP.

Well I think the biggest difficulty with struct page for user space is
dealing with cases when the struct pages of different types get mixed
together (or even struct pages that are all P2P pages but from different
PCI devices). We'd have to go through each page and ensure that each
type gets its own bio_vec with appropriate flags.

Though really, the whole mixed IO from userspace poses a bunch of
problems. I'd prefer to just be able to say that a single IO can be all
or nothing P2P memory from a single device.

Logan
Jason Gunthorpe June 26, 2019, 9 p.m. UTC | #34
On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote:
> > On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
> >>> we have a hole behind len where we could store a flag.  Preferably
> >>> optionally based on a P2P or other magic memory types config
> >>> option so that 32-bit systems with 32-bit phys_addr_t actually
> >>> benefit from the smaller and better packing structure.
> >>
> >> That seems sensible. The one thing that's unclear though is how to get
> >> the PCI Bus address when appropriate. Can we pass that in instead of the
> >> phys_addr with an appropriate flag? Or will we need to pass the actual
> >> physical address and then, at the map step, the driver has to some how
> >> lookup the PCI device to figure out the bus offset?
> > 
> > I agree with CH, if we go down this path it is a layering violation
> > for the thing injecting bio's into the block stack to know what struct
> > device they egress&dma map on just to be able to do the dma_map up
> > front.
> 
> Not sure I agree with this statement. The p2pdma code already *must*
> know and access the pci_dev of the dma device ahead of when it submits
> the IO to know if it's valid to allocate and use P2P memory at all.

I don't think we should make drivers do that. What if it got CMB memory
on some other device?

> > For instance we could use a small hash table of the upper phys addr
> > bits, or an interval tree, to do the lookup.
> 
> Yes, if we're going to take a hard stance on this. But using an interval
> tree (or similar) is a lot more work for the CPU to figure out these
> mappings that may not be strictly necessary if we could just pass better
> information down from the submitting driver to the mapping driver.

Right, this is coming down to an optimization argument. I think there
are very few cases (basically yours) where the caller will know this
info, so we need to support the other cases anyhow.

I think with some simple caching this will become negligible for the cases
you care about.

Jason
Logan Gunthorpe June 26, 2019, 9:18 p.m. UTC | #35
On 2019-06-26 3:00 p.m., Jason Gunthorpe wrote:
> On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-06-26 2:21 p.m., Jason Gunthorpe wrote:
>>> On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
>>>>> we have a hole behind len where we could store a flag.  Preferably
>>>>> optionally based on a P2P or other magic memory types config
>>>>> option so that 32-bit systems with 32-bit phys_addr_t actually
>>>>> benefit from the smaller and better packing structure.
>>>>
>>>> That seems sensible. The one thing that's unclear though is how to get
>>>> the PCI Bus address when appropriate. Can we pass that in instead of the
>>>> phys_addr with an appropriate flag? Or will we need to pass the actual
>>>> physical address and then, at the map step, the driver has to some how
>>>> lookup the PCI device to figure out the bus offset?
>>>
>>> I agree with CH, if we go down this path it is a layering violation
>>> for the thing injecting bio's into the block stack to know what struct
>>> device they egress&dma map on just to be able to do the dma_map up
>>> front.
>>
>> Not sure I agree with this statement. The p2pdma code already *must*
>> know and access the pci_dev of the dma device ahead of when it submits
>> the IO to know if it's valid to allocate and use P2P memory at all.
> 
> I don't think we should make drivers do that. What if it got CMB memory
> on some other device?

Huh? A driver submitting P2P requests finds appropriate memory to use
based on the DMA device that will be doing the mapping. It *has* to. It
doesn't necessarily have control over which P2P provider it might find
(ie. it may get CMB memory from a random NVMe device), but it easily
knows the NVMe device it got the CMB memory for. Look at the existing
code in the nvme target.

>>> For instance we could use a small hash table of the upper phys addr
>>> bits, or an interval tree, to do the lookup.
>>
>> Yes, if we're going to take a hard stance on this. But using an interval
>> tree (or similar) is a lot more work for the CPU to figure out these
>> mappings that may not be strictly necessary if we could just pass better
>> information down from the submitting driver to the mapping driver.
> 
> Right, this is coming down to an optimization argument. I think there
> are very few cases (Basically yours) where the caller will know this
> info, so we need to support the other cases anyhow.

I disagree. I think it has to be a common pattern. A driver doing a P2P
transaction *must* find some device to obtain memory from (or it may be
itself)  and check if it is compatible with the device that's going to
be mapping the memory or vice versa. So no matter what we do, a driver
submitting P2P requests must have access to both the PCI device that's
going to be mapping the memory and the device that's providing the memory.

> I think with some simple caching this will become negligible for cases
> you care about

Well *maybe* it will be negligible performance wise, but it's also a lot
more complicated, code wise. Tree lookups will always be a lot more
expensive than just checking a flag.

Logan
Jason Gunthorpe June 27, 2019, 6:32 a.m. UTC | #36
On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote:
> > I don't think we should make drivers do that. What if it got CMB memory
> > on some other device?
> 
> Huh? A driver submitting P2P requests finds appropriate memory to use
> based on the DMA device that will be doing the mapping. It *has* to. It
> doesn't necessarily have control over which P2P provider it might find
> (ie. it may get CMB memory from a random NVMe device), but it easily
> knows the NVMe device it got the CMB memory for. Look at the existing
> code in the nvme target.

No, this is all thinking about things from the CMB perspective. With CMB
you don't care about the BAR location because it is just a temporary
buffer. That is a unique use model.

Every other case has data residing in BAR memory that can really only
reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO
against that is run it should succeed, even if that means bounce
buffering the IO - as the user has really asked for this transfer to
happen.

We certainly don't get to generally pick where the data resides before
starting the IO, that luxury is only for CMB.

> > I think with some simple caching this will become negligible for cases
> > you care about
> 
> Well *maybe* it will be negligible performance wise, but it's also a lot
> more complicated, code wise. Tree lookups will always be a lot more
> expensive than just checking a flag.

Interval trees are pretty simple API-wise, and if we only populate
them with P2P providers you'll probably find the tree depth is negligible
in current systems with one or two P2P providers.

Jason
Christoph Hellwig June 27, 2019, 9:01 a.m. UTC | #37
On Wed, Jun 26, 2019 at 12:31:08PM -0600, Logan Gunthorpe wrote:
> > we have a hole behind len where we could store a flag.  Preferably
> > optionally based on a P2P or other magic memory types config
> > option so that 32-bit systems with 32-bit phys_addr_t actually
> > benefit from the smaller and better packing structure.
> 
> That seems sensible. The one thing that's unclear though is how to get
> the PCI Bus address when appropriate. Can we pass that in instead of the
> phys_addr with an appropriate flag? Or will we need to pass the actual
> physical address and then, at the map step, the driver has to some how
> lookup the PCI device to figure out the bus offset?

Yes, I think we'll need a lookup mechanism of some kind.
Christoph Hellwig June 27, 2019, 9:08 a.m. UTC | #38
On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote:
> > The bar info would give the exporting struct device and any other info
> > we need to make the iommu mapping.
> 
> Well, the IOMMU mapping is the normal thing the mapping driver will
> always do. We'd really just need the submitting driver to, when
> appropriate, inform the mapping driver that this is a pci bus address
> and not to call dma_map_xxx(). Then, for special mappings for the CMB
> like Christoph is talking about, it's simply a matter of doing a range
> compare on the PCI Bus address and converting the bus address to a BAR
> and offset.

Well, range compare on the physical address.  We have a few different
options here:

 (a) a range is normal RAM, DMA mapping works as usual
 (b) a range is another device's BAR, in which case we need to do a
     map_resource equivalent (which really just means don't bother with
     cache flush on non-coherent architectures) and apply any needed
     offset, fixed or iommu based
 (c) a range points to a BAR on the acting device. In which case we
     don't need to DMA map at all, because no dma is happening but just an
     internal transfer.  And depending on the device that might also require
     a different addressing mode

I guess it might make sense to just have a block layer flag that (b) or
(c) might be contained in a bio.  Then we always look up the data
structure, but can still fall back to (a) if nothing was found.  That
even allows free mixing and matching of memory types, at least as long
as they are contained to separate bio_vec segments.
Logan Gunthorpe June 27, 2019, 4:09 p.m. UTC | #39
On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote:
> On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote:
>>> I don't think we should make drivers do that. What if it got CMB memory
>>> on some other device?
>>
>> Huh? A driver submitting P2P requests finds appropriate memory to use
>> based on the DMA device that will be doing the mapping. It *has* to. It
>> doesn't necessarily have control over which P2P provider it might find
>> (ie. it may get CMB memory from a random NVMe device), but it easily
>> knows the NVMe device it got the CMB memory for. Look at the existing
>> code in the nvme target.
> 
> No, this is all thinking about things from the CMB perspective. With CMB
> you don't care about the BAR location because it is just a temporary
> buffer. That is a unique use model.
> 
> Every other case has data residing in BAR memory that can really only
> reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO
> against that is run it should succeed, even if that means bounce
> buffering the IO - as the user has really asked for this transfer to
> happen.
> 
> We certainly don't get to generally pick where the data resides before
> starting the IO, that luxury is only for CMB.

I disagree. If we're going to implement a "bounce" we'd probably want
to do it in two DMA requests. So the GPU/FPGA driver would first decide
whether it can do it P2P directly and, if it can't, would want to submit
a DMA request to copy the data to host memory and then submit an IO
normally to the data's final destination.

I think it would be a larger layering violation to have the NVMe driver
(for example) memcpy data off a GPU's bar during a dma_map step to
support this bouncing. And it's even crazier to expect a DMA transfer to
be setup in the map step.

Logan
Logan Gunthorpe June 27, 2019, 4:30 p.m. UTC | #40
On 2019-06-27 3:08 a.m., Christoph Hellwig wrote:
> On Wed, Jun 26, 2019 at 02:45:38PM -0600, Logan Gunthorpe wrote:
>>> The bar info would give the exporting struct device and any other info
>>> we need to make the iommu mapping.
>>
>> Well, the IOMMU mapping is the normal thing the mapping driver will
>> always do. We'd really just need the submitting driver to, when
>> appropriate, inform the mapping driver that this is a pci bus address
>> and not to call dma_map_xxx(). Then, for special mappings for the CMB
>> like Christoph is talking about, it's simply a matter of doing a range
>> compare on the PCI Bus address and converting the bus address to a BAR
>> and offset.
> 
> Well, range compare on the physical address.  We have a few different
> options here:
> 
>  (a) a range is normal RAM, DMA mapping works as usual
>  (b) a range is another device's BAR, in which case we need to do a
>      map_resource equivalent (which really just means don't bother with
>      cache flush on non-coherent architectures) and apply any needed
>      offset, fixed or iommu based

Well I would split this into two cases: (b1) ranges in another device's
BAR that will pass through the root complex and require a map_resource
equivalent and (b2) ranges in another device's bar that don't pass
through the root complex and require applying an offset to the bus
address. Both require rather different handling and the submitting
driver should already know ahead of time what type we have.

>  (c) a range points to a BAR on the acting device. In which case we
>      don't need to DMA map at all, because no dma is happening but just an
>      internal transfer.  And depending on the device that might also require
>      a different addressing mode

I think (c) is actually just a special case of (b2). Any device that has
a special protocol for addressing the local BAR can just do a range
compare on the address to determine if it's local or not. Devices that
don't have a special protocol for this would handle both (c) and (b2)
the same.

> I guess it might make sense to just have a block layer flag that (b) or
> (c) might be contained in a bio.  Then we always look up the data
> structure, but can still fall back to (a) if nothing was found.  That
> even allows free mixing and matching of memory types, at least as long
> as they are contained to separate bio_vec segments.

IMO these three cases should be reflected in flags in the bio_vec. We'd
probably still need a queue flag to indicate support for mapping these,
but a flag on the bio that merely indicates special cases *might* exist in
the bio_vec, leaving the driver to do extra work to somehow distinguish the
three types, doesn't seem useful. bio_vec flags also make it easy to
support mixing segments from different memory types.
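For illustration only, the per-segment flags could be as simple as the
following (the names are hypothetical and match the rough example used
later in this thread):

/* Sketch: possible per-bio_vec flag bits for the cases above */
#define BVEC_MAP_RESOURCE	(1 << 0)	/* (b1): another device's BAR, through the RC */
#define BVEC_MAP_BUS_ADDR	(1 << 1)	/* (b2)/(c): address is already a PCI bus address */
/* no flag set: case (a), normal RAM, mapped with dma_map_page() as usual */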

Logan
Jason Gunthorpe June 27, 2019, 4:35 p.m. UTC | #41
On Thu, Jun 27, 2019 at 10:09:41AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote:
> > On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote:
> >>> I don't think we should make drivers do that. What if it got CMB memory
> >>> on some other device?
> >>
> >> Huh? A driver submitting P2P requests finds appropriate memory to use
> >> based on the DMA device that will be doing the mapping. It *has* to. It
> >> doesn't necessarily have control over which P2P provider it might find
> >> (ie. it may get CMB memory from a random NVMe device), but it easily
> >> knows the NVMe device it got the CMB memory for. Look at the existing
> >> code in the nvme target.
> > 
> > No, this is all thinking about things from the CMB perspective. With CMB
> > you don't care about the BAR location because it is just a temporary
> > buffer. That is a unique use model.
> > 
> > Every other case has data residing in BAR memory that can really only
> > reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO
> > against that is run it should succeed, even if that means bounce
> > buffering the IO - as the user has really asked for this transfer to
> > happen.
> > 
> > We certainly don't get to generally pick where the data resides before
> > starting the IO, that luxury is only for CMB.
> 
> I disagree. If we're going to implement a "bounce" we'd probably want
> to do it in two DMA requests.

How do you mean?

> So the GPU/FPGA driver would first decide whether it can do it P2P
> directly and, if it can't, would want to submit a DMA request to copy
> the data to host memory and then submit an IO normally to the data's
> final destination.

I don't think a GPU/FPGA driver will be involved, this would enter the
block layer through the O_DIRECT path or something generic.  This is the
general flow I was suggesting to Dan earlier.

> I think it would be a larger layering violation to have the NVMe driver
> (for example) memcpy data off a GPU's bar during a dma_map step to
> support this bouncing. And it's even crazier to expect a DMA transfer to
> be setup in the map step.

Why? Don't we already expect the DMA mapper to handle bouncing for
lots of cases? How is this case different? This is the best place to
put it to make it shared.

Jason
Logan Gunthorpe June 27, 2019, 4:49 p.m. UTC | #42
On 2019-06-27 10:35 a.m., Jason Gunthorpe wrote:
> On Thu, Jun 27, 2019 at 10:09:41AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-06-27 12:32 a.m., Jason Gunthorpe wrote:
>>> On Wed, Jun 26, 2019 at 03:18:07PM -0600, Logan Gunthorpe wrote:
>>>>> I don't think we should make drivers do that. What if it got CMB memory
>>>>> on some other device?
>>>>
>>>> Huh? A driver submitting P2P requests finds appropriate memory to use
>>>> based on the DMA device that will be doing the mapping. It *has* to. It
>>>> doesn't necessarily have control over which P2P provider it might find
>>>> (ie. it may get CMB memory from a random NVMe device), but it easily
>>>> knows the NVMe device it got the CMB memory for. Look at the existing
>>>> code in the nvme target.
>>>
>>> No, this is all thinking about things from the CMB perspective. With CMB
>>> you don't care about the BAR location because it is just a temporary
>>> buffer. That is a unique use model.
>>>
>>> Every other case has data residing in BAR memory that can really only
>>> reside in that one place (ie on a GPU/FPGA DRAM or something). When an IO
>>> against that is run it should succeed, even if that means bounce
>>> buffering the IO - as the user has really asked for this transfer to
>>> happen.
>>>
>>> We certainly don't get to generally pick where the data resides before
>>> starting the IO, that luxury is only for CMB.
>>
>> I disagree. If we're going to implement a "bounce" we'd probably want
>> to do it in two DMA requests.
> 
> How do you mean?
> 
>> So the GPU/FPGA driver would first decide whether it can do it P2P
>> directly and, if it can't, would want to submit a DMA request to copy
>> the data to host memory and then submit an IO normally to the data's
>> final destination.
> 
> I don't think a GPU/FPGA driver will be involved, this would enter the
> block layer through the O_DIRECT path or something generic.  This is the
> general flow I was suggesting to Dan earlier

I would say the O_DIRECT path has to somehow call into the driver
backing the VMA to get an address for appropriate memory (in some way
vaguely similar to what we were discussing at LSF/MM). If P2P can't be
done at that point, then the provider driver would do the copy to system
memory, in the most appropriate way, and return regular pages for
O_DIRECT to submit to the block device.

>> I think it would be a larger layering violation to have the NVMe driver
>> (for example) memcpy data off a GPU's bar during a dma_map step to
>> support this bouncing. And it's even crazier to expect a DMA transfer to
>> be setup in the map step.
> 
> Why? Don't we already expect the DMA mapper to handle bouncing for
> lots of cases, how is this case different? This is the best place to
> place it to make it shared.

This is different because it's special memory where the DMA mapper can't
possibly know the best way to transfer the data. The best way to
transfer the data is almost certainly going to be a DMA request handled
by the GPU/FPGA. So, one way or another, the GPU/FPGA driver has to be
involved.

One could argue that the hook to the GPU/FPGA driver could be in the
mapping step but then we'd have to do lookups based on an address --
whereas the VMA could more easily have a hook back to whatever driver
exported it.

Logan
Christoph Hellwig June 27, 2019, 5 p.m. UTC | #43
On Thu, Jun 27, 2019 at 10:30:42AM -0600, Logan Gunthorpe wrote:
> >  (a) a range is normal RAM, DMA mapping works as usual
> >  (b) a range is another device's BAR, in which case we need to do a
> >      map_resource equivalent (which really just means don't bother with
> >      cache flush on non-coherent architectures) and apply any needed
> >      offset, fixed or iommu based
> 
> Well I would split this into two cases: (b1) ranges in another device's
> BAR that will pass through the root complex and require a map_resource
> equivalent and (b2) ranges in another device's bar that don't pass
> through the root complex and require applying an offset to the bus
> address. Both require rather different handling and the submitting
> driver should already know ahead of time what type we have.

True.

> 
> >  (c) a range points to a BAR on the acting device. In which case we
> >      don't need to DMA map at all, because no dma is happening but just an
> >      internal transfer.  And depending on the device that might also require
> >      a different addressing mode
> 
> I think (c) is actually just a special case of (b2). Any device that has
> a special protocol for addressing the local BAR can just do a range
> compare on the address to determine if it's local or not. Devices that
> don't have a special protocol for this would handle both (c) and (b2)
> the same.

It is not.  (c) is fundamentally very different as it is not actually
an operation that ever goes out to the wire at all, which is why the
actual physical address on the wire does not matter at all.
Some interfaces like NVMe have designed it in a way that the commands
used to do this internal transfer look like (b2), but that is just their
(IMHO very questionable) interface design choice, which produces a whole
chain of problems.

> > I guess it might make sense to just have a block layer flag that (b) or
> > (c) might be contained in a bio.  Then we always look up the data
> > structure, but can still fall back to (a) if nothing was found.  That
> > even allows free mixing and matching of memory types, at least as long
> > as they are contained to separate bio_vec segments.
> 
> IMO these three cases should be reflected in flags in the bio_vec. We'd
> probably still need a queue flag to indicate support for mapping these,
> but a flag on the bio that indicates special cases *might* exist in the
> bio_vec and the driver has to do extra work to somehow distinguish the
> three types doesn't seem useful. bio_vec flags also make it easy to
> support mixing segments from different memory types.

So I initially suggested these flags.  But without a pgmap we absolutely
need a lookup operation to find which phys address ranges map to which
device.  And once we have that data structure, the only thing we need
is a flag saying that we need that information, and everything else
can be in the data structure returned from that lookup.
Logan Gunthorpe June 27, 2019, 6 p.m. UTC | #44
On 2019-06-27 11:00 a.m., Christoph Hellwig wrote:
> It is not.  (c) is fundamentally very different as it is not actually
> an operation that ever goes out to the wire at all, and which is why the
> actual physical address on the wire does not matter at all.
> Some interfaces like NVMe have designed it in a way that the commands
> used to do this internal transfer look like (b2), but that is just their
> (IMHO very questionable) interface design choice, that produces a whole
> chain of problems.

From the mapping device's driver's perspective yes, but from the
perspective of a submitting driver they would be the same.

>>> I guess it might make sense to just have a block layer flag that (b) or
>>> (c) might be contained in a bio.  Then we always look up the data
>>> structure, but can still fall back to (a) if nothing was found.  That
>>> even allows free mixing and matching of memory types, at least as long
>>> as they are contained to separate bio_vec segments.
>>
>> IMO these three cases should be reflected in flags in the bio_vec. We'd
>> probably still need a queue flag to indicate support for mapping these,
>> but a flag on the bio that indicates special cases *might* exist in the
>> bio_vec and the driver has to do extra work to somehow distinguish the
>> three types doesn't seem useful. bio_vec flags also make it easy to
>> support mixing segments from different memory types.
> 
> So I initially suggested these flags.  But without a pgmap we absolutely
> need a lookup operation to find which phys address ranges map to which
> device.  And once we have that data structure, the only thing we need
> is a flag saying that we need that information, and everything else
> can be in the data structure returned from that lookup.

Yes, you did suggest them. But what I'm trying to suggest is we don't
*necessarily* need the lookup. For demonstration purposes only, a
submitting driver could very roughly potentially do:

struct bio_vec vec;
dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
if (dist < 0) {
     /* use regular memory */
     vec.bv_addr = virt_to_phys(kmalloc(...));
     vec.bv_flags = 0;
} else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
     vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
     vec.bv_flags = BVEC_MAP_RESOURCE;
} else {
     vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
     vec.bv_flags = BVEC_MAP_BUS_ADDR;
}

-- And a mapping driver would roughly just do:

dma_addr_t dma_addr;
if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
     if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off))  {
          /* case (c) */
          /* program the DMA engine with bar and off */
     } else {
          /* case (b2) */
          dma_addr = vec.bv_addr;
     }
} else if (vec.bv_flags & BVEC_MAP_RESOURCE) {
     /* case (b1) */
     dma_addr = dma_map_resource(mapping_dev, vec.bv_addr, ...);
} else {
     /* case (a) */
     dma_addr = dma_map_page(..., phys_to_page(vec.bv_addr), ...);
}

The real difficulty here is that you'd really want all the above handled
by a dma_map_bvec() so it can combine every vector hitting the IOMMU
into a single continuous IOVA -- but it's hard to fit case (c) into that
equation. So it might be that a dma_map_bvec() handles cases (a), (b1)
and (b2) and the mapping driver has to then check each resulting DMA
vector for pci_bus_addr_in_bar() while it is programming the DMA engine
to deal with case (c).

Logan
Jason Gunthorpe June 28, 2019, 4:57 a.m. UTC | #45
On Thu, Jun 27, 2019 at 10:49:43AM -0600, Logan Gunthorpe wrote:

> > I don't think a GPU/FPGA driver will be involved, this would enter the
> > block layer through the O_DIRECT path or something generic.  This is the
> > general flow I was suggesting to Dan earlier
> 
> I would say the O_DIRECT path has to somehow call into the driver
> backing the VMA to get an address to appropriate memory (in some way
> vaguely similar to how we were discussing at LSF/MM)

Maybe, maybe not. For something like VFIO the PTE already has the
correct phys_addr_t and we don't need to do anything.

For DEVICE_PRIVATE we need to get the phys_addr_t out - presumably
through a new pagemap op?

> If P2P can't be done at that point, then the provider driver would
> do the copy to system memory, in the most appropriate way, and
> return regular pages for O_DIRECT to submit to the block device.

That only makes sense for the migratable DEVICE_PRIVATE case; it
doesn't help the VFIO-like case, where you'd need to bounce buffer.

> >> I think it would be a larger layering violation to have the NVMe driver
> >> (for example) memcpy data off a GPU's bar during a dma_map step to
> >> support this bouncing. And it's even crazier to expect a DMA transfer to
> >> be setup in the map step.
> > 
> > Why? Don't we already expect the DMA mapper to handle bouncing for
> > lots of cases, how is this case different? This is the best place to
> > place it to make it shared.
> 
> This is different because it's special memory where the DMA mapper
> can't possibly know the best way to transfer the data.

Why not?  If we have a 'bar info' structure, that could have data
transfer op callbacks; in fact, I think we might already have similar
callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.

> One could argue that the hook to the GPU/FPGA driver could be in the
> mapping step but then we'd have to do lookups based on an address --
> where as the VMA could more easily have a hook back to whatever driver
> exported it.

The trouble with a VMA hook is that it is only really available when
working with the VA, and it is not actually available during GUP; you
have to have a GUP-like thing such as hmm_range_snapshot that is
specifically VMA based. And it is certainly not available during dma_map.

When working with VMA's/etc it seems there are some good reasons to
drive things off of the PTE content (either via struct page & pgmap or
via phys_addr_t & barmap)

I think the best reason to prefer a uniform phys_addr_t is that it
does give us the option to copy the data to/from CPU memory. That
option goes away as soon as the bio sometimes provides a dma_addr_t.

At least for RDMA, we do have some cases (like siw/rxe, hfi) where
they sometimes need to do that copy. I suspect the block stack is
similar, in the general case.

Jason
Christoph Hellwig June 28, 2019, 1:38 p.m. UTC | #46
On Thu, Jun 27, 2019 at 12:00:35PM -0600, Logan Gunthorpe wrote:
> > It is not.  (c) is fundamentally very different as it is not actually
> > an operation that ever goes out to the wire at all, and which is why the
> > actual physical address on the wire does not matter at all.
> > Some interfaces like NVMe have designed it in a way that the commands
> > used to do this internal transfer look like (b2), but that is just their
> > (IMHO very questionable) interface design choice, that produces a whole
> > chain of problems.
> 
> From the mapping device's driver's perspective yes, but from the
> perspective of a submitting driver they would be the same.

With your dma_addr_t scheme it won't be the same, as you'd need
a magic way to generate the internal addressing and stuff it into
the dma_addr_t.  With a phys_addr_t based scheme they should basically
be all the same.

> Yes, you did suggest them. But what I'm trying to suggest is we don't
> *necessarily* need the lookup. For demonstration purposes only, a
> submitting driver could very roughly potentially do:
> 
> struct bio_vec vec;
> dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
> if (dist < 0) {
>      /* use regular memory */
>      vec.bv_addr = virt_to_phys(kmalloc(...));
>      vec.bv_flags = 0;
> } else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
>      vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
>      vec.bv_flags = BVEC_MAP_RESOURCE;
> } else {
>      vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
>      vec.bv_flags = BVEC_MAP_BUS_ADDR;
> }

That doesn't look too bad, except..

> -- And a mapping driver would roughly just do:
> 
> dma_addr_t dma_addr;
> if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
>      if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off))  {
>           /* case (c) */
>           /* program the DMA engine with bar and off */

Why bother with that here if we could also let the caller handle
that? pci_p2pdma_dist() should be able to trivially find that out
based on provider_dev == mapping_dev.

> The real difficulty here is that you'd really want all the above handled
> by a dma_map_bvec() so it can combine every vector hitting the IOMMU
> into a single continuous IOVA -- but it's hard to fit case (c) into that
> equation. So it might be that a dma_map_bvec() handles cases (a), (b1)
> and (b2) and the mapping driver has to then check each resulting DMA
> vector for pci_bus_addr_in_bar() while it is programming the DMA engine
> to deal with case (c).

I'd do it the other way around.  pci_p2pdma_dist is used to find
the p2p type.  The p2p type is stuffed into the bio_vec, and we then:

 (1) manually check for case (c) in the driver, for drivers that want to
     treat it differently from (b)
 (2) have a dma mapping wrapper that checks the p2p type
     and does the right thing for the rest (a rough sketch follows below).
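A rough sketch of what (2) could look like, reusing the hypothetical
bv_addr/bv_flags fields and BVEC_MAP_* flag names from the earlier example
(this is illustrative only, not an existing API):

#include <linux/bvec.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Sketch: a mapping wrapper keyed off the per-segment p2p type */
static dma_addr_t dma_map_bvec(struct device *dev, struct bio_vec *bv,
			       enum dma_data_direction dir)
{
	/* (b2): already a PCI bus address, nothing to map.  Drivers that
	 * care about case (c) check it against their own BARs first,
	 * per (1) above. */
	if (bv->bv_flags & BVEC_MAP_BUS_ADDR)
		return (dma_addr_t)bv->bv_addr;

	/* (b1): another device's BAR, mapped through the root complex */
	if (bv->bv_flags & BVEC_MAP_RESOURCE)
		return dma_map_resource(dev, bv->bv_addr, bv->bv_len, dir, 0);

	/* (a): normal RAM */
	return dma_map_page(dev, phys_to_page(bv->bv_addr),
			    offset_in_page(bv->bv_addr), bv->bv_len, dir);
}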
Logan Gunthorpe June 28, 2019, 3:54 p.m. UTC | #47
On 2019-06-28 7:38 a.m., Christoph Hellwig wrote:
> On Thu, Jun 27, 2019 at 12:00:35PM -0600, Logan Gunthorpe wrote:
>>> It is not.  (c) is fundamentally very different as it is not actually
>>> an operation that ever goes out to the wire at all, and which is why the
>>> actual physical address on the wire does not matter at all.
>>> Some interfaces like NVMe have designed it in a way that the commands
>>> used to do this internal transfer look like (b2), but that is just their
>>> (IMHO very questionable) interface design choice, that produces a whole
>>> chain of problems.
>>
>> From the mapping device's driver's perspective yes, but from the
>> perspective of a submitting driver they would be the same.
> 
> With your dma_addr_t scheme it won't be the same, as you'd need
> a magic way to generate the internal addressing and stuff it into
> the dma_addr_t.  With a phys_addr_t based scheme they should basically
> be all the same.

Yes, I see the folly in the dma_addr_t scheme now. I like the
phys_addr_t ideas we have been discussing.

>> Yes, you did suggest them. But what I'm trying to suggest is we don't
>> *necessarily* need the lookup. For demonstration purposes only, a
>> submitting driver could very roughly potentially do:
>>
>> struct bio_vec vec;
>> dist = pci_p2pdma_dist(provider_pdev, mapping_pdev);
>> if (dist < 0) {
>>      /* use regular memory */
>>      vec.bv_addr = virt_to_phys(kmalloc(...));
>>      vec.bv_flags = 0;
>> } else if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) {
>>      vec.bv_addr = pci_p2pmem_alloc_phys(provider_pdev, ...);
>>      vec.bv_flags = BVEC_MAP_RESOURCE;
>> } else {
>>      vec.bv_addr = pci_p2pmem_alloc_bus_addr(provider_pdev, ...);
>>      vec.bv_flags = BVEC_MAP_BUS_ADDR;
>> }
> 
> That doesn't look too bad, except..
> 
>> -- And a mapping driver would roughly just do:
>>
>> dma_addr_t dma_addr;
>> if (vec.bv_flags & BVEC_MAP_BUS_ADDR) {
>>      if (pci_bus_addr_in_bar(mapping_pdev, vec.bv_addr, &bar, &off))  {
>>           /* case (c) */
>>           /* program the DMA engine with bar and off */
> 
> Why bother with that here if we could also let the caller handle
> that? pci_p2pdma_dist() should be able to trivially find that out
> based on provider_dev == mapping_dev.

True, in fact pci_p2pdma_dist() should return 0 in that case.

Though the driver will still have to do a range compare to figure out
which BAR the address belongs to and find the offset.
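A sketch of what that range compare could look like is below;
pci_bus_addr_in_bar() is the hypothetical helper from the earlier example,
not an existing function:

#include <linux/pci.h>

/* Sketch: find which BAR of pdev, if any, a PCI bus address falls in */
static bool pci_bus_addr_in_bar(struct pci_dev *pdev, pci_bus_addr_t addr,
				int *bar, u64 *off)
{
	int i;

	for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
		struct resource *res = &pdev->resource[i];
		struct pci_bus_region region;

		if (!(res->flags & IORESOURCE_MEM))
			continue;

		/* translate the CPU-side resource into bus addresses */
		pcibios_resource_to_bus(pdev->bus, &region, res);

		if (addr < region.start || addr > region.end)
			continue;

		*bar = i;
		*off = addr - region.start;
		return true;
	}
	return false;
}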

>> The real difficulty here is that you'd really want all the above handled
>> by a dma_map_bvec() so it can combine every vector hitting the IOMMU
>> into a single continuous IOVA -- but it's hard to fit case (c) into that
>> equation. So it might be that a dma_map_bvec() handles cases (a), (b1)
>> and (b2) and the mapping driver has to then check each resulting DMA
>> vector for pci_bus_addr_in_bar() while it is programming the DMA engine
>> to deal with case (c).
> 
> I'd do it the other way around.  pci_p2pdma_dist is used to find
> the p2p type.  The p2p type is stuffed into the bio_vec, and we then:
> 
>  (1) manually check for case (c) in driver for drivers that want to
>      treat it different from (b)
>  (2) we then have a dma mapping wrapper that checks the p2p type
>      and does the right thing for the rest.

Sure, that could make sense.

I imagine there's a lot of details that are wrong or could be done
better in my example. The purpose of it was just to demonstrate that we
can do it without a lookup in an interval tree on the physical address.

Logan
Logan Gunthorpe June 28, 2019, 4:22 p.m. UTC | #48
On 2019-06-27 10:57 p.m., Jason Gunthorpe wrote:
> On Thu, Jun 27, 2019 at 10:49:43AM -0600, Logan Gunthorpe wrote:
> 
>>> I don't think a GPU/FPGA driver will be involved, this would enter the
>>> block layer through the O_DIRECT path or something generic.  This is the
>>> general flow I was suggesting to Dan earlier
>>
>> I would say the O_DIRECT path has to somehow call into the driver
>> backing the VMA to get an address to appropriate memory (in some way
>> vaguely similar to how we were discussing at LSF/MM)
> 
> Maybe, maybe not. For something like VFIO the PTE already has the
> correct phys_addr_t and we don't need to do anything.
> 
> For DEVICE_PRIVATE we need to get the phys_addr_t out - presumably
> through a new pagemap op?

I don't know much about either VFIO or DEVICE_PRIVATE, but I'd still
wager there would be a better way to handle it before they submit it to
the block layer.

>> If P2P can't be done at that point, then the provider driver would
>> do the copy to system memory, in the most appropriate way, and
>> return regular pages for O_DIRECT to submit to the block device.
> 
> That only makes sense for the migratable DEVICE_PRIVATE case, it
> doesn't help the VFIO-like case, there you'd need to bounce buffer.
> 
>>>> I think it would be a larger layering violation to have the NVMe driver
>>>> (for example) memcpy data off a GPU's bar during a dma_map step to
>>>> support this bouncing. And it's even crazier to expect a DMA transfer to
>>>> be setup in the map step.
>>>
>>> Why? Don't we already expect the DMA mapper to handle bouncing for
>>> lots of cases, how is this case different? This is the best place to
>>> place it to make it shared.
>>
>> This is different because it's special memory where the DMA mapper
>> can't possibly know the best way to transfer the data.
> 
> Why not?  If we have a 'bar info' structure that could have data
> transfer op callbacks; in fact, I think we might already have similar
> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA.

Well it could, in theory, be done, but it just seems wrong to set up and
wait for more DMA requests while we are in the middle of setting up
another DMA request. Especially when the block layer has historically
had issues with stack sizes. It's also possible you might have multiple
bio_vecs that each have to do a migration, and with a hook here they'd
have to be done serially.

>> One could argue that the hook to the GPU/FPGA driver could be in the
>> mapping step but then we'd have to do lookups based on an address --
>> where as the VMA could more easily have a hook back to whatever driver
>> exported it.
> 
> The trouble with a VMA hook is that it is only really available when
> working with the VA, and it is not actually available during GUP, you
> have to have a GUP-like thing such as hmm_range_snapshot that is
> specifically VMA based. And it is certainly not available during dma_map.

Yup, I'm hoping some of the GUP cleanups will help with that but it's
definitely a problem. I never said the VMA would be available at dma_map
time nor would I want it to be. I expect it to be available before we
submit the request to the block layer and it really only applies to the
O_DIRECT path and maybe a similar thing in the RDMA path.

> When working with VMA's/etc it seems there are some good reasons to
> drive things off of the PTE content (either via struct page & pgmap or
> via phys_addr_t & barmap)
> 
> I think the best reason to prefer a uniform phys_addr_t is that it
> does give us the option to copy the data to/from CPU memory. That
> option goes away as soon as the bio sometimes provides a dma_addr_t.

Not really. phys_addr_t alone doesn't give us a way to copy data. You
need a lookup table on that address and a couple of hooks.
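
To make that concrete, a lookup table plus hooks could look roughly
like the following. Nothing below exists in this series; it is just a
sketch of the kind of machinery that would be needed before a copy is
even possible:

/* Hypothetical registration record for a chunk of P2P BAR memory */
struct bar_info {
	struct device	*provider;
	phys_addr_t	start;
	size_t		size;
	/* hook used if the data ever has to be copied through the CPU */
	int (*copy_from_bar)(struct bar_info *bi, void *dst,
			     phys_addr_t src, size_t len);
};

/* Tiny registry; a real one would be an interval tree or xarray */
static struct bar_info *bar_table[16];

static struct bar_info *bar_info_lookup(phys_addr_t addr)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(bar_table); i++) {
		struct bar_info *bi = bar_table[i];

		if (bi && addr >= bi->start && addr - bi->start < bi->size)
			return bi;
	}
	return NULL;
}

static int copy_from_p2p_mem(void *dst, phys_addr_t src, size_t len)
{
	struct bar_info *bi = bar_info_lookup(src);

	if (!bi)
		return -EINVAL;		/* not P2P memory we know about */
	return bi->copy_from_bar(bi, dst, src, len);
}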

> At least for RDMA, we do have some cases (like siw/rxe, hfi) where
> they sometimes need to do that copy. I suspect the block stack is
> similar, in the general case.

But the whole point of the use cases I'm trying to serve is to avoid the
root complex. If the block layer randomly decides to ephemerally copy
the data back to the CPU (for integrity or something) then we've
accomplished nothing and shouldn't have put the data in the BAR to begin
with. Similarly, for DEVICE_PRIVATE, I'd have guessed it wouldn't want
to use ephemeral copies but actually migrate the memory semi-permanently
to the CPU for more than one transaction, and I would argue that it makes
the most sense to make these decisions before the data gets to the block
layer.

Logan
Jason Gunthorpe June 28, 2019, 5:29 p.m. UTC | #49
On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:

> > Why not?  If we have a 'bar info' structure that could have data
> > transfer op callbacks, in fact, I think we might already have similar
> > callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..
> 
> Well, it could in theory be done, but it just seems wrong to set up and
> wait for more DMA requests while we are in the middle of setting up
> another DMA request, especially when the block layer has historically
> had issues with stack sizes. It's also possible you might have multiple
> bio_vecs that each have to do a migration, and with a hook here they'd
> have to be done serially.

*shrug* this is just standard bounce buffering stuff...
 
> > I think the best reason to prefer a uniform phys_addr_t is that it
> > does give us the option to copy the data to/from CPU memory. That
> > option goes away as soon as the bio sometimes provides a dma_addr_t.
> 
> Not really. phys_addr_t alone doesn't give us a way to copy data. You
> need a lookup table on that address and a couple of hooks.

Yes, I'm not sure how you envision using phys_addr_t without a
lookup.. At the end of the day we must get the src and target 'struct
device' in the dma_map area (at the minimum to compute the offset to
translate phys_addr_t to dma_addr_t) and the only way to do that from
phys_addr_t is via lookup??

> > At least for RDMA, we do have some cases (like siw/rxe, hfi) where
> > they sometimes need to do that copy. I suspect the block stack is
> > similar, in the general case.
> 
> But the whole point of the use cases I'm trying to serve is to avoid the
> root complex. 

Well, I think this is sort of a separate issue. Generically I think
the dma layer should continue to work largely transparently, and if I
feed in BAR memory that can't be P2P'd it should bounce, just like
all the other DMA limitations it already supports. That is pretty much
its whole purpose in life.

The issue of having the caller optimize what it sends is kind of
separate - yes you definitely still need the egress DMA device to
drive CMB buffer selection, and DEVICE_PRIVATE also needs it to decide
if it should migrate or not.

What I see as the question is how to lay out the BIO.

If we agree the bio should only have phys_addr_t then we need some
'bar info' (ie at least the offset) in the dma map and some 'bar info'
(ie the DMA device) during the bio construction.

What you are trying to do is optimize the passing of that 'bar info'
with a limited number of bits in the BIO.

A single flag means an interval tree, 4-8 bits could build a probably
O(1) hash lookup, 64 bits could store a pointer, etc.

If we can spare 4-8 bits in the bio then I suggest a 'perfect hash
table'. Assign each registered P2P 'bar info' a small 4 bit id and
hash on that. It should be fast enough to not worry about the double
lookup.
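
A rough sketch of that scheme (names made up for illustration; no
locking shown): each registered BAR gets a small id at registration
time, the id rides in spare bits of the bio_vec, and the "hash" is just
a direct index into a small table:

#define P2P_ID_BITS	4
#define P2P_MAX_BARS	(1 << P2P_ID_BITS)

struct p2p_bar_info;			/* opaque here, registered elsewhere */

static struct p2p_bar_info *p2p_bar_ids[P2P_MAX_BARS];

/* Hand out a small id at registration; id 0 is reserved for normal memory */
static int p2p_assign_bar_id(struct p2p_bar_info *bi)
{
	int id;

	for (id = 1; id < P2P_MAX_BARS; id++) {
		if (!p2p_bar_ids[id]) {
			p2p_bar_ids[id] = bi;
			return id;	/* carried in the spare bio_vec bits */
		}
	}
	return -ENOSPC;
}

/* O(1) lookup: the id from the bio_vec indexes the table directly */
static struct p2p_bar_info *p2p_bar_from_id(unsigned int id)
{
	return id < P2P_MAX_BARS ? p2p_bar_ids[id] : NULL;
}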

Jason
Logan Gunthorpe June 28, 2019, 6:29 p.m. UTC | #50
On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote:
> On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:
> 
>>> Why not?  If we have a 'bar info' structure that could have data
>>> transfer op callbacks, in fact, I think we might already have similar
>>> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..
>>
>> Well, it could in theory be done, but it just seems wrong to set up and
>> wait for more DMA requests while we are in the middle of setting up
>> another DMA request, especially when the block layer has historically
>> had issues with stack sizes. It's also possible you might have multiple
>> bio_vecs that each have to do a migration, and with a hook here they'd
>> have to be done serially.
> 
> *shrug* this is just standard bounce buffering stuff...

I don't know of any "standard" bounce buffering stuff that uses random
other devices' DMA engines where appropriate.

>>> I think the best reason to prefer a uniform phys_addr_t is that it
>>> does give us the option to copy the data to/from CPU memory. That
>>> option goes away as soon as the bio sometimes provides a dma_addr_t.
>>
>> Not really. phys_addr_t alone doesn't give us a way to copy data. You
>> need a lookup table on that address and a couple of hooks.
> 
> Yes, I'm not sure how you envision using phys_addr_t without a
> lookup.. At the end of the day we must get the src and target 'struct
> device' in the dma_map area (at the minimum to compute the offset to
> translate phys_addr_t to dma_addr_t) and the only way to do that from
> phys_addr_t is via lookup??

I thought my other email to Christoph laid it out pretty cleanly...

>>> At least for RDMA, we do have some cases (like siw/rxe, hfi) where
>>> they sometimes need to do that copy. I suspect the block stack is
>>> similar, in the general case.
>>
>> But the whole point of the use cases I'm trying to serve is to avoid the
>> root complex. 
> 
> Well, I think this is sort of a separate issue. Generically I think
> the dma layer should continue to work largely transparently, and if I
> feed in BAR memory that can't be P2P'd it should bounce, just like
> all the other DMA limitations it already supports. That is pretty much
> its whole purpose in life.

I disagree. It's one thing for the DMA layer to work around architecture
limitations like HighMem/LowMem and just do a memcpy when it can't
handle it -- it's a whole different thing for the DMA layer to know about
the varieties of memory on different peripheral devices and the nuances
of how and when to transfer between them. I think the submitting driver
has the best information of when to do these transfers.

IMO the bouncing in the DMA layer isn't a desirable thing, it was a
necessary addition to work around various legacy platform issues and
have existing code still work correctly. It's always better for a driver
to allocate memory appropriate for the DMA than to just use random
memory and rely on it being bounced by the lower layer. For P2P, we
don't have existing code to worry about so I don't think a magic
automatic bouncing design is appropriate.

> The issue of having the caller optimize what it sends is kind of
> separate - yes you definitely still need the egress DMA device to
> drive CMB buffer selection, and DEVICE_PRIVATE also needs it to decide
> if it should migrate or not.

Yes, but my contention is that I don't want to have to make these
decisions twice: once before I submit it to the block layer, then again
at mapping time.

> What I see as the question is how to lay out the BIO.
> 
> If we agree the bio should only have phys_addr_t then we need some
> 'bar info' (ie at least the offset) in the dma map and some 'bar info'
> (ie the DMA device) during the bio construction.

Per my other email, it was phys_addr_t plus hints on how to map the
memory (bus address, dma_map_resource, or regular). This requires
exactly two flag bits in the bio_vec and no interval tree or hash table.
I don't want to have to pass bar info, other hooks, or anything like
that to the block layer.

> What you are trying to do is optimize the passing of that 'bar info'
> with a limited number of bits in the BIO.
> 
> A single flag means an interval tree, 4-8 bits could build a probably
> O(1) hash lookup, 64 bits could store a pointer, etc.

Well, an interval tree can get you the backing device for a given
phys_addr_t; however, for P2PDMA, we need a second lookup based on the
mapping device. This is because exactly how you map the data depends on
both devices. Currently I'm working on this for the existing
implementation: struct page gets me the backing device, but I need
another xarray cache based on the mapping device to figure out exactly
how to map the memory. But none of this mess is required if we can just
pass the mapping hints through the block layer as flags (per my other
email) because the submitting driver should already know ahead of time
what it's trying to do.
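
For context, the double lookup described above looks roughly like this
(illustrative only; the xarray keying is simplified, and
pci_p2pdma_dist()/PCI_P2PDMA_THRU_HOST_BRIDGE are the names from the
earlier example in this thread, not upstream interfaces). The first
lookup has already produced the provider; the second caches the map
type per mapping device:

enum p2pdma_map_type {
	P2PDMA_MAP_BUS_ADDR,
	P2PDMA_MAP_THRU_HOST_BRIDGE,
};

/* one cache per provider, indexed by the mapping device */
static enum p2pdma_map_type
p2pdma_map_type_for(struct pci_dev *provider, struct xarray *map_type_cache,
		    struct pci_dev *dma_dev)
{
	void *entry = xa_load(map_type_cache, (unsigned long)dma_dev);
	enum p2pdma_map_type type;
	int dist;

	if (entry)
		return xa_to_value(entry);

	/* slow path: classify this provider/mapper pairing once */
	dist = pci_p2pdma_dist(provider, dma_dev);
	type = (dist & PCI_P2PDMA_THRU_HOST_BRIDGE) ?
		P2PDMA_MAP_THRU_HOST_BRIDGE : P2PDMA_MAP_BUS_ADDR;

	xa_store(map_type_cache, (unsigned long)dma_dev, xa_mk_value(type),
		 GFP_KERNEL);

	return type;
}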

> If we can spare 4-8 bits in the bio then I suggest a 'perfect hash
> table'. Assign each registered P2P 'bar info' a small 4 bit id and
> hash on that. It should be fast enough to not worry about the double
> lookup.

This feels like it's just setting us up to run into nasty limits based
on the number of bits we actually have. The number of bits in a bio_vec
will always be a precious resource. If I have a server chassis that
exists today with 24 NVMe devices, and each device has a CMB, I'm already
using up 6 of those bits. Then we might have DEVICE_PRIVATE and other
uses on top of that.

Logan
Jason Gunthorpe June 28, 2019, 7:09 p.m. UTC | #51
On Fri, Jun 28, 2019 at 12:29:32PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote:
> > On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:
> > 
> >>> Why not?  If we have a 'bar info' structure that could have data
> >>> transfer op callbacks, in fact, I think we might already have similar
> >>> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..
> >>
> >> Well, it could in theory be done, but it just seems wrong to set up and
> >> wait for more DMA requests while we are in the middle of setting up
> >> another DMA request, especially when the block layer has historically
> >> had issues with stack sizes. It's also possible you might have multiple
> >> bio_vecs that each have to do a migration, and with a hook here they'd
> >> have to be done serially.
> > 
> > *shrug* this is just standard bounce buffering stuff...
> 
> I don't know of any "standard" bounce buffering stuff that uses random
> other devices' DMA engines where appropriate.

IMHO, it is conceptually the same as memcpy.. And probably we will not
ever need such optimization in dma map. Other copy places might be
different; at least we have the option.
 
> IMO the bouncing in the DMA layer isn't a desirable thing, it was a
> necessary addition to work around various legacy platform issues and
> have existing code still work correctly. 

Of course it is not desirable! But there are many situations where we
do not have the luxury to work around the HW limits in the caller, so
those callers either have to not do DMA or they have to open code
bounce buffering - both are wrong.

> > What I see as the question is how to lay out the BIO.
> > 
> > If we agree the bio should only have phys_addr_t then we need some
> > 'bar info' (ie at least the offset) in the dma map and some 'bar info'
> > (ie the DMA device) during the bio construction.
> 
> Per my other email, it was phys_addr_t plus hints on how to map the
> memory (bus address, dma_map_resource, or regular). This requires
> exactly two flag bits in the bio_vec and no interval tree or hash table.
> I don't want to have to pass bar info, other hooks, or anything like
> that to the block layer.

This scheme makes the assumption that the dma mapping struct device is
all you need, and we never need to know the originating struct device
during dma map. This is clearly safe if the two devices are on the
same PCIe segment

However, I'd feel more comfortable about that assumption if we had
code to support the IOMMU case, and know for sure it doesn't require
more info :(

But I suppose it is also reasonable that only the IOMMU case would
have the expensive 'bar info' lookup or something.

Maybe you can hide these flags behind some dma_map helper; then the
layering might be nicer:

  dma_map_set_bio_p2p_flags(bio, phys_addr, source dev, dest_dev) 

?

ie the choice of flag scheme to use is opaque to the DMA layer.

> > If we can spare 4-8 bits in the bio then I suggest a 'perfect hash
> > table'. Assign each registered P2P 'bar info' a small 4 bit id and
> > hash on that. It should be fast enough to not worry about the double
> > lookup.
> 
> This feels like it's just setting us up to run into nasty limits based
> on the number of bits we actually have. The number of bits in a bio_vec
> will always be a precious resource. If I have a server chassis that
> exists today with 24 NVMe devices, and each device has a CMB, I'm already
> using up 6 of those bits. Then we might have DEVICE_PRIVATE and other
> uses on top of that.

A hash is an alternative data structure to an interval tree that has
better scaling for small numbers of BARs, which I think is our
case.

Jason
Logan Gunthorpe June 28, 2019, 7:35 p.m. UTC | #52
On 2019-06-28 1:09 p.m., Jason Gunthorpe wrote:
> On Fri, Jun 28, 2019 at 12:29:32PM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-06-28 11:29 a.m., Jason Gunthorpe wrote:
>>> On Fri, Jun 28, 2019 at 10:22:06AM -0600, Logan Gunthorpe wrote:
>>>
>>>>> Why not?  If we have a 'bar info' structure that could have data
>>>>> transfer op callbacks, in fact, I think we might already have similar
>>>>> callbacks for migrating to/from DEVICE_PRIVATE memory with DMA..
>>>>
>>>> Well, it could in theory be done, but it just seems wrong to set up and
>>>> wait for more DMA requests while we are in the middle of setting up
>>>> another DMA request, especially when the block layer has historically
>>>> had issues with stack sizes. It's also possible you might have multiple
>>>> bio_vecs that each have to do a migration, and with a hook here they'd
>>>> have to be done serially.
>>>
>>> *shrug* this is just standard bounce buffering stuff...
>>
>> I don't know of any "standard" bounce buffering stuff that uses random
>> other devices' DMA engines where appropriate.
> 
> IMHO, it is conceptually the same as memcpy.. And probably we will not
> ever need such optimization in dma map. Other copy places might be
> different; at least we have the option.
>  
>> IMO the bouncing in the DMA layer isn't a desirable thing, it was a
>> necessary addition to work around various legacy platform issues and
>> have existing code still work correctly. 
> 
> Of course it is not desirable! But there are many situations where we
> do not have the luxury to work around the HW limits in the caller, so
> those callers either have to not do DMA or they have to open code
> bounce buffering - both are wrong.

They don't have to open code it; they can use helpers and good coding
practices. But the submitting driver is the one that's in the best
position to figure this stuff out. Just like it is with the dma_map
bouncing -- all it has to do is use dma_alloc_coherent(). If we don't
write any submitting drivers that assume the dma_map API bounces then we
should never have to deal with it.
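
For the simple case, that really is all it takes -- a minimal sketch
(assuming a generic struct device for the submitting driver):

/* Allocate DMA-able memory up front so the mapping layer never has to
 * bounce it. */
static int example_alloc_dma_buffer(struct device *dev, size_t size)
{
	dma_addr_t dma_handle;
	void *buf;

	buf = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* ... program the device with dma_handle and do the transfer ... */

	dma_free_coherent(dev, size, buf, dma_handle);
	return 0;
}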

>>> What I see as the question is how to lay out the BIO.
>>>
>>> If we agree the bio should only have phys_addr_t then we need some
>>> 'bar info' (ie at least the offset) in the dma map and some 'bar info'
>>> (ie the DMA device) during the bio construction.
>>
>> Per my other email, it was phys_addr_t plus hints on how to map the
>> memory (bus address, dma_map_resource, or regular). This requires
>> exactly two flag bits in the bio_vec and no interval tree or hash table.
>> I don't want to have to pass bar info, other hooks, or anything like
>> that to the block layer.
> 
> This scheme makes the assumption that the dma mapping struct device is
> all you need, and we never need to know the originating struct device
> during dma map. This is clearly safe if the two devices are on the
> same PCIe segment
> 
> However, I'd feel more comfortable about that assumption if we had
> code to support the IOMMU case, and know for sure it doesn't require
> more info :(

The example I posted *does* support the IOMMU case. That was case (b1)
in the description. The idea is that pci_p2pdma_dist() returns a
distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU
mapping is required and the appropriate flag tells it to call
dma_map_resource(). This way, it supports both same-segment and
different-segments without needing any lookups in the map step.
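
In other words, the submitting driver classifies the memory once, along
these lines. This is only a sketch: the flag names are made up, and
pci_p2pdma_dist()/PCI_P2PDMA_THRU_HOST_BRIDGE follow the earlier
example in this thread rather than any upstream interface:

/* Hypothetical bvec flag bits for the sketch */
#define BVEC_MAP_BUS_ADDR	(1U << 0)	/* use the PCI bus address directly */
#define BVEC_MAP_RESOURCE	(1U << 1)	/* map with dma_map_resource() */

static unsigned int p2p_bvec_flags(struct pci_dev *provider,
				   struct pci_dev *dma_dev)
{
	int dist = pci_p2pdma_dist(provider, dma_dev);

	if (dist < 0)
		return 0;	/* no usable P2P path: fall back to host memory */

	if (dist & PCI_P2PDMA_THRU_HOST_BRIDGE)
		return BVEC_MAP_RESOURCE;

	return BVEC_MAP_BUS_ADDR;
}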

For the only existing upstream use case (NVMe-of), this is ideal because
we can calculate the mapping requirements exactly once ahead of any
transfers. Then populating the bvecs and dma-mapping for each transfer
is fast and doesn't require any additional work besides deciding where
to get the memory from.

For O_DIRECT and userspace RDMA, this should also be ideal; the real
problem is how to get the necessary information out of the VMA. This
isn't helped by having a lookup at the dma map step. But the provider
driver is certainly going to be involved in creating the VMA so it
should be able to easily provide the necessary hooks. Though there are
still a bunch of challenges here.

Maybe other use-cases are not this ideal but I suspect they should still
be able to make use of the same flags. It's hard to say right now,
though, because we haven't seen any other use cases.


> Maybe you can hide these flags behind some dma_map helper; then the
> layering might be nicer:
> 
>   dma_map_set_bio_p2p_flags(bio, phys_addr, source dev, dest_dev) 
> 
> ?
> 
> ie the choice of flag scheme to use is opaque to the DMA layer.

If there was such a use case, I suppose you could use a couple of flag
bits to tell you how to interpret the other flag bits but, at the
moment, I only see a need for 2 bits so we'll probably have a lot of
spares for a long time. You could certainly have a 3rd bit which says do
a lookup and try to figure out bouncing, but I don't think it's a good idea.

>>> If we can spare 4-8 bits in the bio then I suggest a 'perfect hash
>>> table'. Assign each registered P2P 'bar info' a small 4 bit id and
>>> hash on that. It should be fast enough to not worry about the double
>>> lookup.
>>
>> This feels like it's just setting us up to run into nasty limits based
>> on the number of bits we actually have. The number of bits in a bio_vec
>> will always be a precious resource. If I have a server chassis that
>> exists today with 24 NVMe devices, and each device has a CMB, I'm already
>> using up 6 of those bits. Then we might have DEVICE_PRIVATE and other
>> uses on top of that.
> 
> A hash is an alternative data structure to an interval tree that has
> better scaling for small numbers of BARs, which I think is our
> case.

But then you need a large and not necessarily future-proof number of
bits in the bio_vec to store the hash.

Logan
Jason Gunthorpe July 2, 2019, 10:45 p.m. UTC | #53
On Fri, Jun 28, 2019 at 01:35:42PM -0600, Logan Gunthorpe wrote:

> > However, I'd feel more comfortable about that assumption if we had
> > code to support the IOMMU case, and know for sure it doesn't require
> > more info :(
> 
> The example I posted *does* support the IOMMU case. That was case (b1)
> in the description. The idea is that pci_p2pdma_dist() returns a
> distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU
> mapping is required and the appropriate flag tells it to call
> dma_map_resource(). This way, it supports both same-segment and
> different-segments without needing any lookups in the map step.

I mean we actually have some iommu drivers that can set up P2P in real
HW. I'm worried that real IOMMUs will need to have the BDF of the
completer to route completions back to the requester - which we can't
trivially get through this scheme.

However, maybe that is just a future problem, and certainly we can see
that with an interval tree or otherwise such an IOMMU could get the
information it needs.

Jason
Logan Gunthorpe July 2, 2019, 10:52 p.m. UTC | #54
On 2019-07-02 4:45 p.m., Jason Gunthorpe wrote:
> On Fri, Jun 28, 2019 at 01:35:42PM -0600, Logan Gunthorpe wrote:
> 
>>> However, I'd feel more comfortable about that assumption if we had
>>> code to support the IOMMU case, and know for sure it doesn't require
>>> more info :(
>>
>> The example I posted *does* support the IOMMU case. That was case (b1)
>> in the description. The idea is that pci_p2pdma_dist() returns a
>> distance with a high bit set (PCI_P2PDMA_THRU_HOST_BRIDGE) when an IOMMU
>> mapping is required and the appropriate flag tells it to call
>> dma_map_resource(). This way, it supports both same-segment and
>> different-segments without needing any lookups in the map step.
> 
> I mean we actually have some iommu drivers that can set up P2P in real
> HW. I'm worried that real IOMMUs will need to have the BDF of the
> completer to route completions back to the requester - which we can't
> trivially get through this scheme.

I've never seen such an IOMMU but I guess, in theory, it could exist.
The IOMMUs that set up P2P-like transactions in real hardware make use of
dma_map_resource(). There aren't a lot of users of this function (it's
actually been broken with the Intel IOMMU until I fixed it recently and
I'd expect there are other broken implementations); but, to my
knowledge, none of them have needed the BDF of the provider to date.

> However, maybe that is just a future problem, and certainly we can see
> that with an interval tree or otherwise such an IOMMU could get the
> information it needs.

Yup, the rule of thumb is to design for the needs we have today, not
imagined future problems.

Logan