Message ID | 20191015204814.30099-3-rcampbell@nvidia.com |
---|---|
State | Superseded |
Series | HMM tests and minor fixes |
On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> Allow hmm_range_fault() to return success (0) when the CPU pagetable
> entry points to the special shared zero page.
> The caller can then handle the zero page by possibly clearing device
> private memory instead of DMAing a zero page.
>
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 5df0dbf77e89..f62b119722a3 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		return -EBUSY;
>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
>  		*pfn = range->values[HMM_PFN_SPECIAL];
> -		return -EFAULT;
> +		if (!is_zero_pfn(pte_pfn(pte)))
> +			return -EFAULT;
> +		return 0;

Does it make sense to return HMM_PFN_SPECIAL in this case? Does the
zero pfn have a struct page? Does it need mandatory special treatment?

ie the base behavior without any driver code should be to DMA from the
zero memory. A fancy driver should be able to detect the zero page and
do something else.

I'm not clear on what the two existing users do with PFN_SPECIAL.
Nouveau looks like it uses the same value as error, and I can't guess
what amdgpu does with its magic constant.

Jason
On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> Allow hmm_range_fault() to return success (0) when the CPU pagetable
> entry points to the special shared zero page.
> The caller can then handle the zero page by possibly clearing device
> private memory instead of DMAing a zero page.

I do not understand why you are talking about DMA. The GPU can work
on main memory, and migrating to GPU memory is optional and should
not involve this function at all.

> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>

NAK, please either keep the semantic or change it fully. See the
alternative below.

> ---
>  mm/hmm.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 5df0dbf77e89..f62b119722a3 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		return -EBUSY;
>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
>  		*pfn = range->values[HMM_PFN_SPECIAL];
> -		return -EFAULT;
> +		if (!is_zero_pfn(pte_pfn(pte)))
> +			return -EFAULT;
> +		return 0;

An acceptable change would be to turn the branch into:

	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
		if (!is_zero_pfn(pte_pfn(pte))) {
			*pfn = range->values[HMM_PFN_SPECIAL];
			return -EFAULT;
		}
		/* Fall-through for the zero pfn (if a write was needed, the
		 * hmm_pte_need_fault() above would have caught it).
		 */
	}
On 10/21/19 11:08 AM, Jason Gunthorpe wrote:
> On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
>> Allow hmm_range_fault() to return success (0) when the CPU pagetable
>> entry points to the special shared zero page.
>> The caller can then handle the zero page by possibly clearing device
>> private memory instead of DMAing a zero page.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Jason Gunthorpe <jgg@mellanox.com>
>> ---
>>  mm/hmm.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index 5df0dbf77e89..f62b119722a3 100644
>> --- a/mm/hmm.c
>> +++ b/mm/hmm.c
>> @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>>  		return -EBUSY;
>>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
>>  		*pfn = range->values[HMM_PFN_SPECIAL];
>> -		return -EFAULT;
>> +		if (!is_zero_pfn(pte_pfn(pte)))
>> +			return -EFAULT;
>> +		return 0;
>
> Does it make sense to return HMM_PFN_SPECIAL in this case? Does the
> zero pfn have a struct page? Does it need mandatory special treatment?

The zero pfn does not have a struct page, so it needs special treatment:
see nouveau_dmem_convert_pfn(), where it calls hmm_device_entry_to_page().
If HMM ever ends up supporting VM_PFNMAP, there would need to be a way
to distinguish pfns with and without a backing struct page too.

> ie the base behavior without any driver code should be to DMA from the
> zero memory. A fancy driver should be able to detect the zero page and
> do something else.

Correct.

> I'm not clear on what the two existing users do with PFN_SPECIAL.
> Nouveau looks like it uses the same value as error, and I can't guess
> what amdgpu does with its magic constant.
>
> Jason

I doubt the zero pfn case is being handled correctly in amd/nouveau.
I made the change above when explicitly testing for it in the patch
that adds the HMM tests.
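For illustration, a minimal sketch of the driver-side conversion Ralph
describes, assuming the hmm.h API of this period (range->values[] and
hmm_device_entry_to_page()); the helper name is hypothetical, not the
actual nouveau code:

	/* Convert one hmm_range_fault() entry to a struct page. The special
	 * values, including HMM_PFN_SPECIAL for the zero pfn, have no
	 * backing struct page and must not be passed to
	 * hmm_device_entry_to_page().
	 */
	static struct page *example_entry_to_page(struct hmm_range *range,
						  uint64_t entry)
	{
		if (entry == range->values[HMM_PFN_NONE] ||
		    entry == range->values[HMM_PFN_ERROR] ||
		    entry == range->values[HMM_PFN_SPECIAL])
			return NULL;

		return hmm_device_entry_to_page(range, entry);
	}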
On 10/21/19 11:49 AM, Jerome Glisse wrote:
> On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
>> Allow hmm_range_fault() to return success (0) when the CPU pagetable
>> entry points to the special shared zero page.
>> The caller can then handle the zero page by possibly clearing device
>> private memory instead of DMAing a zero page.
>
> I do not understand why you are talking about DMA. The GPU can work
> on main memory, and migrating to GPU memory is optional and should
> not involve this function at all.

Good point. This is the device accessing the zero page over PCIe
or another bus, not migrating a zero page to device private memory.
I'll update the wording.

>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Jason Gunthorpe <jgg@mellanox.com>
>
> NAK, please either keep the semantic or change it fully. See the
> alternative below.
>
>> ---
>>  mm/hmm.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index 5df0dbf77e89..f62b119722a3 100644
>> --- a/mm/hmm.c
>> +++ b/mm/hmm.c
>> @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>>  		return -EBUSY;
>>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
>>  		*pfn = range->values[HMM_PFN_SPECIAL];
>> -		return -EFAULT;
>> +		if (!is_zero_pfn(pte_pfn(pte)))
>> +			return -EFAULT;
>> +		return 0;
>
> An acceptable change would be to turn the branch into:
>
> 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> 		if (!is_zero_pfn(pte_pfn(pte))) {
> 			*pfn = range->values[HMM_PFN_SPECIAL];
> 			return -EFAULT;
> 		}
> 		/* Fall-through for the zero pfn (if a write was needed, the
> 		 * hmm_pte_need_fault() above would have caught it).
> 		 */
> 	}

Except this will return the zero pfn with no indication that it is
special (i.e., doesn't have a struct page).
On Mon, Oct 21, 2019 at 01:54:15PM -0700, Ralph Campbell wrote:
>
> On 10/21/19 11:49 AM, Jerome Glisse wrote:
> > On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> > > Allow hmm_range_fault() to return success (0) when the CPU pagetable
> > > entry points to the special shared zero page.
> > > The caller can then handle the zero page by possibly clearing device
> > > private memory instead of DMAing a zero page.
> >
> > I do not understand why you are talking about DMA. The GPU can work
> > on main memory, and migrating to GPU memory is optional and should
> > not involve this function at all.
>
> Good point. This is the device accessing the zero page over PCIe
> or another bus, not migrating a zero page to device private memory.
> I'll update the wording.
>
> > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > > Cc: Jason Gunthorpe <jgg@mellanox.com>
> >
> > NAK, please either keep the semantic or change it fully. See the
> > alternative below.
> >
> > > ---
> > >  mm/hmm.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > index 5df0dbf77e89..f62b119722a3 100644
> > > --- a/mm/hmm.c
> > > +++ b/mm/hmm.c
> > > @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> > >  		return -EBUSY;
> > >  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > >  		*pfn = range->values[HMM_PFN_SPECIAL];
> > > -		return -EFAULT;
> > > +		if (!is_zero_pfn(pte_pfn(pte)))
> > > +			return -EFAULT;
> > > +		return 0;
> >
> > An acceptable change would be to turn the branch into:
> >
> > 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > 		if (!is_zero_pfn(pte_pfn(pte))) {
> > 			*pfn = range->values[HMM_PFN_SPECIAL];
> > 			return -EFAULT;
> > 		}
> > 		/* Fall-through for the zero pfn (if a write was needed, the
> > 		 * hmm_pte_need_fault() above would have caught it).
> > 		 */
> > 	}
>
> Except this will return the zero pfn with no indication that it is
> special (i.e., doesn't have a struct page).

That is fine; the device driver should not do anything with it, i.e.
if the device driver wanted to write, then the write fault test would
return true and it would fault.

Note that the driver should not dereference the struct page.

Cheers,
Jérôme
On Mon, Oct 21, 2019 at 10:45:49PM -0400, Jerome Glisse wrote:
> On Mon, Oct 21, 2019 at 01:54:15PM -0700, Ralph Campbell wrote:
> >
> > On 10/21/19 11:49 AM, Jerome Glisse wrote:
> > > On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> > > > Allow hmm_range_fault() to return success (0) when the CPU pagetable
> > > > entry points to the special shared zero page.
> > > > The caller can then handle the zero page by possibly clearing device
> > > > private memory instead of DMAing a zero page.
> > >
> > > I do not understand why you are talking about DMA. The GPU can work
> > > on main memory, and migrating to GPU memory is optional and should
> > > not involve this function at all.
> >
> > Good point. This is the device accessing the zero page over PCIe
> > or another bus, not migrating a zero page to device private memory.
> > I'll update the wording.
> >
> > > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > > > Cc: Jason Gunthorpe <jgg@mellanox.com>
> > >
> > > NAK, please either keep the semantic or change it fully. See the
> > > alternative below.
> > >
> > > > ---
> > > >  mm/hmm.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > > index 5df0dbf77e89..f62b119722a3 100644
> > > > --- a/mm/hmm.c
> > > > +++ b/mm/hmm.c
> > > > @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> > > >  		return -EBUSY;
> > > >  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > > >  		*pfn = range->values[HMM_PFN_SPECIAL];
> > > > -		return -EFAULT;
> > > > +		if (!is_zero_pfn(pte_pfn(pte)))
> > > > +			return -EFAULT;
> > > > +		return 0;
> > >
> > > An acceptable change would be to turn the branch into:
> > >
> > > 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > > 		if (!is_zero_pfn(pte_pfn(pte))) {
> > > 			*pfn = range->values[HMM_PFN_SPECIAL];
> > > 			return -EFAULT;
> > > 		}
> > > 		/* Fall-through for the zero pfn (if a write was needed, the
> > > 		 * hmm_pte_need_fault() above would have caught it).
> > > 		 */
> > > 	}
> >
> > Except this will return the zero pfn with no indication that it is
> > special (i.e., doesn't have a struct page).
>
> That is fine; the device driver should not do anything with it, i.e.
> if the device driver wanted to write, then the write fault test would
> return true and it would fault.
>
> Note that the driver should not dereference the struct page.

Can this thing be DMA mapped for read?

Jason
On Tue, Oct 22, 2019 at 03:05:18PM +0000, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 10:45:49PM -0400, Jerome Glisse wrote:
> > On Mon, Oct 21, 2019 at 01:54:15PM -0700, Ralph Campbell wrote:
> > >
> > > On 10/21/19 11:49 AM, Jerome Glisse wrote:
> > > > On Tue, Oct 15, 2019 at 01:48:13PM -0700, Ralph Campbell wrote:
> > > > > Allow hmm_range_fault() to return success (0) when the CPU pagetable
> > > > > entry points to the special shared zero page.
> > > > > The caller can then handle the zero page by possibly clearing device
> > > > > private memory instead of DMAing a zero page.
> > > >
> > > > I do not understand why you are talking about DMA. The GPU can work
> > > > on main memory, and migrating to GPU memory is optional and should
> > > > not involve this function at all.
> > >
> > > Good point. This is the device accessing the zero page over PCIe
> > > or another bus, not migrating a zero page to device private memory.
> > > I'll update the wording.
> > >
> > > > > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > > > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > > > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > > > > Cc: Jason Gunthorpe <jgg@mellanox.com>
> > > >
> > > > NAK, please either keep the semantic or change it fully. See the
> > > > alternative below.
> > > >
> > > > > ---
> > > > >  mm/hmm.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > > > index 5df0dbf77e89..f62b119722a3 100644
> > > > > --- a/mm/hmm.c
> > > > > +++ b/mm/hmm.c
> > > > > @@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> > > > >  		return -EBUSY;
> > > > >  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > > > >  		*pfn = range->values[HMM_PFN_SPECIAL];
> > > > > -		return -EFAULT;
> > > > > +		if (!is_zero_pfn(pte_pfn(pte)))
> > > > > +			return -EFAULT;
> > > > > +		return 0;
> > > >
> > > > An acceptable change would be to turn the branch into:
> > > >
> > > > 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> > > > 		if (!is_zero_pfn(pte_pfn(pte))) {
> > > > 			*pfn = range->values[HMM_PFN_SPECIAL];
> > > > 			return -EFAULT;
> > > > 		}
> > > > 		/* Fall-through for the zero pfn (if a write was needed, the
> > > > 		 * hmm_pte_need_fault() above would have caught it).
> > > > 		 */
> > > > 	}
> > >
> > > Except this will return the zero pfn with no indication that it is
> > > special (i.e., doesn't have a struct page).
> >
> > That is fine; the device driver should not do anything with it, i.e.
> > if the device driver wanted to write, then the write fault test would
> > return true and it would fault.
> >
> > Note that the driver should not dereference the struct page.
>
> Can this thing be DMA mapped for read?

Yes it can; the zero page is just a regular page (AFAIK on all
architectures). So a device can DMA map it for read only; there is
no reason to treat it any differently.

HMM_PFN_SPECIAL is only (as documented in the header) for ptes
inserted with insert_pfn() or insert_page(), i.e. ptes inserted in a
vma with the MIXED or PFNMAP flag. While HMM catches those vmas early
on and backs off, it can still race with some driver setting the vma
flag and installing a special pte afterward, hence why special ptes
go through this special path.

The zero page being a special pte is just an exception, i.e. it is
the only special pte allowed in vmas that do not have the MIXED or
PFNMAP flag set.

Cheers,
Jérôme
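For illustration, a minimal sketch of the read-only mapping described
above, using the standard kernel DMA API (dma_map_page(), ZERO_PAGE(),
DMA_TO_DEVICE); the function is a hypothetical example, and a real
caller would also check dma_mapping_error():

	#include <linux/dma-mapping.h>
	#include <linux/mm.h>

	/* Map the shared zero page for device reads; DMA_TO_DEVICE means
	 * the device may read the (all-zero) contents but not write
	 * through this mapping.
	 */
	static dma_addr_t example_map_zero_page(struct device *dev)
	{
		return dma_map_page(dev, ZERO_PAGE(0), 0, PAGE_SIZE,
				    DMA_TO_DEVICE);
	}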
On Tue, Oct 22, 2019 at 01:06:31PM -0400, Jerome Glisse wrote:
> > > That is fine; the device driver should not do anything with it, i.e.
> > > if the device driver wanted to write, then the write fault test
> > > would return true and it would fault.
> > >
> > > Note that the driver should not dereference the struct page.
> >
> > Can this thing be DMA mapped for read?
>
> Yes it can; the zero page is just a regular page (AFAIK on all
> architectures). So a device can DMA map it for read only; there is
> no reason to treat it any differently.
>
> HMM_PFN_SPECIAL is only (as documented in the header) for ptes
> inserted with insert_pfn() or insert_page(), i.e. ptes inserted in a
> vma with the MIXED or PFNMAP flag. While HMM catches those vmas early
> on and backs off, it can still race with some driver setting the vma
> flag and installing a special pte afterward, hence why special ptes
> go through this special path.
>
> The zero page being a special pte is just an exception, i.e. it is
> the only special pte allowed in vmas that do not have the MIXED or
> PFNMAP flag set.

Just to be clear then, the correct behavior is to return the zero page
pfn as HMM_PFN_VALID and the driver should treat it the same as any
memory page and DMA map it?

Smart drivers can test somehow for pfn == zero_page and optimize?

Jason
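The test asked about here does exist: is_zero_pfn() (asm-generic
pgtable.h in this era) compares a raw pfn against the zero pfn. A
hypothetical driver-side use, with made-up my_*() helpers, could look
like the sketch below; whether the branch buys anything is addressed in
the reply that follows:

	if (is_zero_pfn(pfn))
		my_point_at_device_zero_page(mydev, idx); /* hypothetical optimization */
	else
		my_map_system_page(mydev, idx, pfn);      /* normal DMA mapping */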
On Tue, Oct 22, 2019 at 05:09:19PM +0000, Jason Gunthorpe wrote:
> On Tue, Oct 22, 2019 at 01:06:31PM -0400, Jerome Glisse wrote:
> > > > That is fine; the device driver should not do anything with it, i.e.
> > > > if the device driver wanted to write, then the write fault test
> > > > would return true and it would fault.
> > > >
> > > > Note that the driver should not dereference the struct page.
> > >
> > > Can this thing be DMA mapped for read?
> >
> > Yes it can; the zero page is just a regular page (AFAIK on all
> > architectures). So a device can DMA map it for read only; there is
> > no reason to treat it any differently.
> >
> > HMM_PFN_SPECIAL is only (as documented in the header) for ptes
> > inserted with insert_pfn() or insert_page(), i.e. ptes inserted in a
> > vma with the MIXED or PFNMAP flag. While HMM catches those vmas early
> > on and backs off, it can still race with some driver setting the vma
> > flag and installing a special pte afterward, hence why special ptes
> > go through this special path.
> >
> > The zero page being a special pte is just an exception, i.e. it is
> > the only special pte allowed in vmas that do not have the MIXED or
> > PFNMAP flag set.
>
> Just to be clear then, the correct behavior is to return the zero page
> pfn as HMM_PFN_VALID and the driver should treat it the same as any
> memory page and DMA map it?

Yes, correct.

> Smart drivers can test somehow for pfn == zero_page and optimize?

There is nothing to optimize here; I do not know of any hardware that
has a special page table entry that makes all memory accesses return
zero.

What was confusing in Ralph's commit message is that he was conflating
the memory migration, which is a totally different code path, with this
code. When doing memory migration it is easy to program the DMA engine
to zero out destination memory (a common feature found on various
devices), and that optimization is allowed by the migrate code.

So I cannot think of any reason why distinguishing the zero page in
this code would help. Maybe I missed some new feature in the mmu of
some new hardware.

Cheers,
Jérôme
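For illustration, a rough sketch of the migrate-path optimization
mentioned above, assuming the migrate_vma API of this period
(MIGRATE_PFN_MIGRATE and migrate_pfn_to_page() from linux/migrate.h);
struct my_device and the my_dma_*() helpers are hypothetical stand-ins
for device-specific DMA engine code:

	/* Copy step of a hypothetical migrate_vma-based driver: a source
	 * entry marked for migration but with no backing page means the
	 * CPU pte was the zero page, so the destination can simply be
	 * zeroed instead of copied into.
	 */
	static void example_copy_one(struct my_device *mydev,
				     unsigned long src, unsigned long dst)
	{
		struct page *spage = migrate_pfn_to_page(src);
		struct page *dpage = migrate_pfn_to_page(dst);

		if (!(src & MIGRATE_PFN_MIGRATE) || !dpage)
			return;
		if (!spage)
			my_dma_zero(mydev, dpage);	  /* zero-fill destination */
		else
			my_dma_copy(mydev, spage, dpage); /* normal copy */
	}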
On Tue, Oct 22, 2019 at 01:30:26PM -0400, Jerome Glisse wrote:
> > Smart drivers can test somehow for pfn == zero_page and optimize?
>
> There is nothing to optimize here; I do not know of any hardware that
> has a special page table entry that makes all memory accesses return
> zero.

Presumably any GPU could globally dedicate one page of internal memory
as a zero page and remap the CPU zero page to that internal memory
page? This is basically how the CPU zero page works.

I suspect mlx5 could do the same with its internal memory, but the
internal memory is too limited to make this worthwhile.

mlx5 also has a special 'zero MR' that always reads as zero (and
discards writes), but it doesn't quite fit well into the ODP flow.

Jason
On Tue, Oct 22, 2019 at 05:41:11PM +0000, Jason Gunthorpe wrote:
> On Tue, Oct 22, 2019 at 01:30:26PM -0400, Jerome Glisse wrote:
> >
> > > Smart drivers can test somehow for pfn == zero_page and optimize?
> >
> > There is nothing to optimize here; I do not know of any hardware that
> > has a special page table entry that makes all memory accesses return
> > zero.
>
> Presumably any GPU could globally dedicate one page of internal memory
> as a zero page and remap the CPU zero page to that internal memory
> page? This is basically how the CPU zero page works.

Yes, that would work too, but I do not know of any upstream driver
that does that.

> I suspect mlx5 could do the same with its internal memory, but the
> internal memory is too limited to make this worthwhile.
>
> mlx5 also has a special 'zero MR' that always reads as zero (and
> discards writes), but it doesn't quite fit well into the ODP flow.

Well, you can always ask your beloved hardware engineers for new
stuff; they never say no, right? :)

Cheers,
Jérôme
diff --git a/mm/hmm.c b/mm/hmm.c
index 5df0dbf77e89..f62b119722a3 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -530,7 +530,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		return -EBUSY;
 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
 		*pfn = range->values[HMM_PFN_SPECIAL];
-		return -EFAULT;
+		if (!is_zero_pfn(pte_pfn(pte)))
+			return -EFAULT;
+		return 0;
 	}
 
 	*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;
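For context, a rough sketch of what a caller sees under this patch's
semantics, assuming the hmm.h API of this period; the my_*() helpers
and surrounding variables are hypothetical:

	/* After a successful hmm_range_fault(), the zero-page pte is now
	 * reported as success with the entry set to HMM_PFN_SPECIAL, so
	 * the caller must check for that value before converting entries
	 * to pages.
	 */
	for (i = 0; i < npages; i++) {
		uint64_t entry = range->pfns[i];

		if (entry == range->values[HMM_PFN_SPECIAL]) {
			my_map_read_as_zero(mydev, i);	/* zero page */
			continue;
		}
		my_map_page(mydev, i, hmm_device_entry_to_page(range, entry));
	}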