diff mbox series

[v2,07/11] mm/hmm: add default fault flags to avoid the need to pre-fill pfns arrays.

Message ID 20190325144011.10560-8-jglisse@redhat.com (mailing list archive)
State New, archived
Headers show
Series Improve HMM driver API v2 | expand

Commit Message

Jerome Glisse March 25, 2019, 2:40 p.m. UTC
From: Jérôme Glisse <jglisse@redhat.com>

The HMM mirror API can be use in two fashions. The first one where the HMM
user coalesce multiple page faults into one request and set flags per pfns
for of those faults. The second one where the HMM user want to pre-fault a
range with specific flags. For the latter one it is a waste to have the user
pre-fill the pfn arrays with a default flags value.

This patch adds a default flags value allowing user to set them for a range
without having to pre-fill the pfn array.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 include/linux/hmm.h |  7 +++++++
 mm/hmm.c            | 12 ++++++++++++
 2 files changed, 19 insertions(+)

Comments

Ira Weiny March 28, 2019, 4:42 p.m. UTC | #1
On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> [...]
> >> Hi Jerome,
> >>
> >> I think you're talking about flags, but I'm talking about the mask. The 
> >> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> >> that it uses are still in the same lower 3 bits:
> >>
> >> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> >> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
> >> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
> >> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
> >> +};
> >>
> >> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
> >> mask, that is *also* runtime changeable. 
> > 
> > So the pfn array is using a device driver specific format and we have
> > no idea nor do we need to know where the valid, write, ... bit are in
> > that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> > we do not care. They are device with bit at the top and for those you
> > need a mask that allows you to mask out those bits or not depending on
> > what the user want to do.
> > 
> > The mask here is against an _unknown_ (from HMM POV) format. So we can
> > not presume where the bits will be and thus we can not presume what a
> > proper mask is.
> > 
> > So that's why a full unsigned long mask is use here.
> > 
> > Maybe an example will help let say the device flag are:
> >     VALID (1 << 63)
> >     WRITE (1 << 62)
> > 
> > Now let say that device wants to fault with at least read a range
> > it does set:
> >     range->default_flags = (1 << 63)
> >     range->pfn_flags_mask = 0;
> > 
> > This will fill fault all page in the range with at least read
> > permission.
> > 
> > Now let say it wants to do the same except for one page in the range
> > for which its want to have write. Now driver set:
> >     range->default_flags = (1 << 63);
> >     range->pfn_flags_mask = (1 << 62);
> >     range->pfns[index_of_write] = (1 << 62);
> > 
> > With this HMM will fault in all page with at least read (ie valid)
> > and for the address: range->start + index_of_write << PAGE_SHIFT it
> > will fault with write permission ie if the CPU pte does not have
> > write permission set then handle_mm_fault() will be call asking for
> > write permission.
> > 
> > 
> > Note that in the above HMM will populate the pfns array with write
> > permission for any entry that have write permission within the CPU
> > pte ie the default_flags and pfn_flags_mask is only the minimun
> > requirement but HMM always returns all the flag that are set in the
> > CPU pte.
> > 
> > 
> > Now let say you are an "old" driver like nouveau upstream, then it
> > means that you are setting each individual entry within range->pfns
> > with the exact flags you want for each address hence here what you
> > want is:
> >     range->default_flags = 0;
> >     range->pfn_flags_mask = -1UL;
> > 
> > So that what we do is (for each entry):
> >     (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> > and we end up with the flags that were set by the driver for each of
> > the individual range->pfns entries.
> > 
> > 
> > Does this help ?
> > 
> 
> Yes, the key point for me was that this is an entirely device driver specific
> format. OK. But then we have HMM setting it. So a comment to the effect that
> this is device-specific might be nice, but I'll leave that up to you whether
> it is useful.

Indeed I did not realize there is an hmm "pfn" until I saw this function:

/*
 * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
 * @range: range use to encode HMM pfn value
 * @pfn: pfn value for which to create the HMM pfn
 * Returns: valid HMM pfn for the pfn
 */
static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
                                        unsigned long pfn)

So should this patch contain some sort of helper like this... maybe?

I'm assuming the "hmm_pfn" being returned above is the device pfn being
discussed here?

I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
have shortened the discussion here.

Ira

> 
> Either way, you can add:
> 
> 	Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
John Hubbard March 28, 2019, 9:59 p.m. UTC | #2
On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> The HMM mirror API can be use in two fashions. The first one where the HMM
> user coalesce multiple page faults into one request and set flags per pfns
> for of those faults. The second one where the HMM user want to pre-fault a
> range with specific flags. For the latter one it is a waste to have the user
> pre-fill the pfn arrays with a default flags value.
> 
> This patch adds a default flags value allowing user to set them for a range
> without having to pre-fill the pfn array.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  include/linux/hmm.h |  7 +++++++
>  mm/hmm.c            | 12 ++++++++++++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 79671036cb5f..13bc2c72f791 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
>   * @pfns: array of pfns (big enough for the range)
>   * @flags: pfn flags to match device driver page table
>   * @values: pfn value for some special case (none, special, error, ...)
> + * @default_flags: default flags for the range (write, read, ...)
> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
>   * @valid: pfns array did not change since it has been fill by an HMM function
>   */
> @@ -177,6 +179,8 @@ struct hmm_range {
>  	uint64_t		*pfns;
>  	const uint64_t		*flags;
>  	const uint64_t		*values;
> +	uint64_t		default_flags;
> +	uint64_t		pfn_flags_mask;
>  	uint8_t			pfn_shift;
>  	bool			valid;
>  };
> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>  {
>  	long ret;
>  
> +	range->default_flags = 0;
> +	range->pfn_flags_mask = -1UL;

Hi Jerome,

This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
definitely does not need to be a run time value. And we want some assurance that
the mask is 
	a) large enough for the flags, and
	b) small enough to avoid overrunning the pfns field.

Those are less certain with a run-time struct field, and more obviously correct with
something like, approximately:

 	#define PFN_FLAGS_MASK 0xFFFF

or something.

In other words, this is more flexibility than we need--just a touch too much,
IMHO.

> +
>  	ret = hmm_range_register(range, range->vma->vm_mm,
>  				 range->start, range->end);
>  	if (ret)
> diff --git a/mm/hmm.c b/mm/hmm.c
> index fa9498eeb9b6..4fe88a196d17 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>  	if (!hmm_vma_walk->fault)
>  		return;
>  
> +	/*
> +	 * So we not only consider the individual per page request we also
> +	 * consider the default flags requested for the range. The API can
> +	 * be use in 2 fashions. The first one where the HMM user coalesce
> +	 * multiple page fault into one request and set flags per pfns for
> +	 * of those faults. The second one where the HMM user want to pre-
> +	 * fault a range with specific flags. For the latter one it is a
> +	 * waste to have the user pre-fill the pfn arrays with a default
> +	 * flags value.
> +	 */
> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;

Need to verify that the mask isn't too large or too small.

> +
>  	/* We aren't ask to do anything ... */
>  	if (!(pfns & range->flags[HMM_PFN_VALID]))
>  		return;
> 



thanks,
Jerome Glisse March 28, 2019, 10:12 p.m. UTC | #3
On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > The HMM mirror API can be use in two fashions. The first one where the HMM
> > user coalesce multiple page faults into one request and set flags per pfns
> > for of those faults. The second one where the HMM user want to pre-fault a
> > range with specific flags. For the latter one it is a waste to have the user
> > pre-fill the pfn arrays with a default flags value.
> > 
> > This patch adds a default flags value allowing user to set them for a range
> > without having to pre-fill the pfn array.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  include/linux/hmm.h |  7 +++++++
> >  mm/hmm.c            | 12 ++++++++++++
> >  2 files changed, 19 insertions(+)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 79671036cb5f..13bc2c72f791 100644
> > --- a/include/linux/hmm.h
> > +++ b/include/linux/hmm.h
> > @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >   * @pfns: array of pfns (big enough for the range)
> >   * @flags: pfn flags to match device driver page table
> >   * @values: pfn value for some special case (none, special, error, ...)
> > + * @default_flags: default flags for the range (write, read, ...)
> > + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
> >   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >   * @valid: pfns array did not change since it has been fill by an HMM function
> >   */
> > @@ -177,6 +179,8 @@ struct hmm_range {
> >  	uint64_t		*pfns;
> >  	const uint64_t		*flags;
> >  	const uint64_t		*values;
> > +	uint64_t		default_flags;
> > +	uint64_t		pfn_flags_mask;
> >  	uint8_t			pfn_shift;
> >  	bool			valid;
> >  };
> > @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> >  {
> >  	long ret;
> >  
> > +	range->default_flags = 0;
> > +	range->pfn_flags_mask = -1UL;
> 
> Hi Jerome,
> 
> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
> definitely does not need to be a run time value. And we want some assurance that
> the mask is 
> 	a) large enough for the flags, and
> 	b) small enough to avoid overrunning the pfns field.
> 
> Those are less certain with a run-time struct field, and more obviously correct with
> something like, approximately:
> 
>  	#define PFN_FLAGS_MASK 0xFFFF
> 
> or something.
> 
> In other words, this is more flexibility than we need--just a touch too much,
> IMHO.

This mirror the fact that flags are provided as an array and some devices use
the top bits for flags (read, write, ...). So here it is the safe default to
set it to -1. If the caller want to leverage this optimization it can override
the default_flags value.

> 
> > +
> >  	ret = hmm_range_register(range, range->vma->vm_mm,
> >  				 range->start, range->end);
> >  	if (ret)
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index fa9498eeb9b6..4fe88a196d17 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
> >  	if (!hmm_vma_walk->fault)
> >  		return;
> >  
> > +	/*
> > +	 * So we not only consider the individual per page request we also
> > +	 * consider the default flags requested for the range. The API can
> > +	 * be use in 2 fashions. The first one where the HMM user coalesce
> > +	 * multiple page fault into one request and set flags per pfns for
> > +	 * of those faults. The second one where the HMM user want to pre-
> > +	 * fault a range with specific flags. For the latter one it is a
> > +	 * waste to have the user pre-fill the pfn arrays with a default
> > +	 * flags value.
> > +	 */
> > +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
> 
> Need to verify that the mask isn't too large or too small.

I need to check agin but default flag is anded somewhere to limit
the bit to the one we expect.

Cheers,
Jérôme
John Hubbard March 28, 2019, 10:19 p.m. UTC | #4
On 3/28/19 3:12 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
>>> The HMM mirror API can be use in two fashions. The first one where the HMM
>>> user coalesce multiple page faults into one request and set flags per pfns
>>> for of those faults. The second one where the HMM user want to pre-fault a
>>> range with specific flags. For the latter one it is a waste to have the user
>>> pre-fill the pfn arrays with a default flags value.
>>>
>>> This patch adds a default flags value allowing user to set them for a range
>>> without having to pre-fill the pfn array.
>>>
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> ---
>>>  include/linux/hmm.h |  7 +++++++
>>>  mm/hmm.c            | 12 ++++++++++++
>>>  2 files changed, 19 insertions(+)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index 79671036cb5f..13bc2c72f791 100644
>>> --- a/include/linux/hmm.h
>>> +++ b/include/linux/hmm.h
>>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
>>>   * @pfns: array of pfns (big enough for the range)
>>>   * @flags: pfn flags to match device driver page table
>>>   * @values: pfn value for some special case (none, special, error, ...)
>>> + * @default_flags: default flags for the range (write, read, ...)
>>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
>>>   * @valid: pfns array did not change since it has been fill by an HMM function
>>>   */
>>> @@ -177,6 +179,8 @@ struct hmm_range {
>>>  	uint64_t		*pfns;
>>>  	const uint64_t		*flags;
>>>  	const uint64_t		*values;
>>> +	uint64_t		default_flags;
>>> +	uint64_t		pfn_flags_mask;
>>>  	uint8_t			pfn_shift;
>>>  	bool			valid;
>>>  };
>>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>>>  {
>>>  	long ret;
>>>  
>>> +	range->default_flags = 0;
>>> +	range->pfn_flags_mask = -1UL;
>>
>> Hi Jerome,
>>
>> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
>> definitely does not need to be a run time value. And we want some assurance that
>> the mask is 
>> 	a) large enough for the flags, and
>> 	b) small enough to avoid overrunning the pfns field.
>>
>> Those are less certain with a run-time struct field, and more obviously correct with
>> something like, approximately:
>>
>>  	#define PFN_FLAGS_MASK 0xFFFF
>>
>> or something.
>>
>> In other words, this is more flexibility than we need--just a touch too much,
>> IMHO.
> 
> This mirror the fact that flags are provided as an array and some devices use
> the top bits for flags (read, write, ...). So here it is the safe default to
> set it to -1. If the caller want to leverage this optimization it can override
> the default_flags value.
> 

Optimization? OK, now I'm a bit lost. Maybe this is another place where I could
use a peek at the calling code. The only flags I've seen so far use the bottom
3 bits and that's it. 

Maybe comments here?

>>
>>> +
>>>  	ret = hmm_range_register(range, range->vma->vm_mm,
>>>  				 range->start, range->end);
>>>  	if (ret)
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index fa9498eeb9b6..4fe88a196d17 100644
>>> --- a/mm/hmm.c
>>> +++ b/mm/hmm.c
>>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>>>  	if (!hmm_vma_walk->fault)
>>>  		return;
>>>  
>>> +	/*
>>> +	 * So we not only consider the individual per page request we also
>>> +	 * consider the default flags requested for the range. The API can
>>> +	 * be use in 2 fashions. The first one where the HMM user coalesce
>>> +	 * multiple page fault into one request and set flags per pfns for
>>> +	 * of those faults. The second one where the HMM user want to pre-
>>> +	 * fault a range with specific flags. For the latter one it is a
>>> +	 * waste to have the user pre-fill the pfn arrays with a default
>>> +	 * flags value.
>>> +	 */
>>> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
>>
>> Need to verify that the mask isn't too large or too small.
> 
> I need to check agin but default flag is anded somewhere to limit
> the bit to the one we expect.

Right, but in general, the *mask* could be wrong. It would be nice to have
an assert, and/or a comment, or something to verify the mask is proper.

Really, a hardcoded mask is simple and correct--unless it *definitely* must
vary for devices of course.

thanks,
Jerome Glisse March 28, 2019, 10:31 p.m. UTC | #5
On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>
> >>> The HMM mirror API can be use in two fashions. The first one where the HMM
> >>> user coalesce multiple page faults into one request and set flags per pfns
> >>> for of those faults. The second one where the HMM user want to pre-fault a
> >>> range with specific flags. For the latter one it is a waste to have the user
> >>> pre-fill the pfn arrays with a default flags value.
> >>>
> >>> This patch adds a default flags value allowing user to set them for a range
> >>> without having to pre-fill the pfn array.
> >>>
> >>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> >>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> >>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>> Cc: Dan Williams <dan.j.williams@intel.com>
> >>> ---
> >>>  include/linux/hmm.h |  7 +++++++
> >>>  mm/hmm.c            | 12 ++++++++++++
> >>>  2 files changed, 19 insertions(+)
> >>>
> >>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>> index 79671036cb5f..13bc2c72f791 100644
> >>> --- a/include/linux/hmm.h
> >>> +++ b/include/linux/hmm.h
> >>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >>>   * @pfns: array of pfns (big enough for the range)
> >>>   * @flags: pfn flags to match device driver page table
> >>>   * @values: pfn value for some special case (none, special, error, ...)
> >>> + * @default_flags: default flags for the range (write, read, ...)
> >>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
> >>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >>>   * @valid: pfns array did not change since it has been fill by an HMM function
> >>>   */
> >>> @@ -177,6 +179,8 @@ struct hmm_range {
> >>>  	uint64_t		*pfns;
> >>>  	const uint64_t		*flags;
> >>>  	const uint64_t		*values;
> >>> +	uint64_t		default_flags;
> >>> +	uint64_t		pfn_flags_mask;
> >>>  	uint8_t			pfn_shift;
> >>>  	bool			valid;
> >>>  };
> >>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> >>>  {
> >>>  	long ret;
> >>>  
> >>> +	range->default_flags = 0;
> >>> +	range->pfn_flags_mask = -1UL;
> >>
> >> Hi Jerome,
> >>
> >> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
> >> definitely does not need to be a run time value. And we want some assurance that
> >> the mask is 
> >> 	a) large enough for the flags, and
> >> 	b) small enough to avoid overrunning the pfns field.
> >>
> >> Those are less certain with a run-time struct field, and more obviously correct with
> >> something like, approximately:
> >>
> >>  	#define PFN_FLAGS_MASK 0xFFFF
> >>
> >> or something.
> >>
> >> In other words, this is more flexibility than we need--just a touch too much,
> >> IMHO.
> > 
> > This mirror the fact that flags are provided as an array and some devices use
> > the top bits for flags (read, write, ...). So here it is the safe default to
> > set it to -1. If the caller want to leverage this optimization it can override
> > the default_flags value.
> > 
> 
> Optimization? OK, now I'm a bit lost. Maybe this is another place where I could
> use a peek at the calling code. The only flags I've seen so far use the bottom
> 3 bits and that's it. 
> 
> Maybe comments here?
> 
> >>
> >>> +
> >>>  	ret = hmm_range_register(range, range->vma->vm_mm,
> >>>  				 range->start, range->end);
> >>>  	if (ret)
> >>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>> index fa9498eeb9b6..4fe88a196d17 100644
> >>> --- a/mm/hmm.c
> >>> +++ b/mm/hmm.c
> >>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
> >>>  	if (!hmm_vma_walk->fault)
> >>>  		return;
> >>>  
> >>> +	/*
> >>> +	 * So we not only consider the individual per page request we also
> >>> +	 * consider the default flags requested for the range. The API can
> >>> +	 * be use in 2 fashions. The first one where the HMM user coalesce
> >>> +	 * multiple page fault into one request and set flags per pfns for
> >>> +	 * of those faults. The second one where the HMM user want to pre-
> >>> +	 * fault a range with specific flags. For the latter one it is a
> >>> +	 * waste to have the user pre-fill the pfn arrays with a default
> >>> +	 * flags value.
> >>> +	 */
> >>> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
> >>
> >> Need to verify that the mask isn't too large or too small.
> > 
> > I need to check agin but default flag is anded somewhere to limit
> > the bit to the one we expect.
> 
> Right, but in general, the *mask* could be wrong. It would be nice to have
> an assert, and/or a comment, or something to verify the mask is proper.
> 
> Really, a hardcoded mask is simple and correct--unless it *definitely* must
> vary for devices of course.

Ok so re-read the code and it is correct. The helper for compatibility with
old API (so that i do not break nouveau upstream code) initialize those to
the safe default ie:

range->default_flags = 0;
range->pfn_flags_mask = -1;

Which means that in the above comment we are in the case where it is the
individual entry within the pfn array that will determine if we fault or
not.

Driver using the new API can either use this safe default or use the
second case in the above comment and set default_flags to something
else than 0.

Note that those default_flags are not set in the final result they are
use to determine if we need to do a page fault. For instance if you set
the write bit in the default flags then the pfns computed above will
have the write bit set and when we compare with the CPU pte if the CPU
pte do not have the write bit set then we will fault. What matter is
that in this case the value within the pfns array is totaly pointless
ie we do not care what it is, it will not affect the decission ie the
decision is made by looking at the default flags.

Hope this clarify thing. You can look at the ODP patch to see how it
is use:

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec

Cheers,
Jérôme
John Hubbard March 28, 2019, 10:40 p.m. UTC | #6
On 3/28/19 3:31 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>>>
>>>>> The HMM mirror API can be use in two fashions. The first one where the HMM
>>>>> user coalesce multiple page faults into one request and set flags per pfns
>>>>> for of those faults. The second one where the HMM user want to pre-fault a
>>>>> range with specific flags. For the latter one it is a waste to have the user
>>>>> pre-fill the pfn arrays with a default flags value.
>>>>>
>>>>> This patch adds a default flags value allowing user to set them for a range
>>>>> without having to pre-fill the pfn array.
>>>>>
>>>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>>>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
>>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>>> Cc: John Hubbard <jhubbard@nvidia.com>
>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>> ---
>>>>>  include/linux/hmm.h |  7 +++++++
>>>>>  mm/hmm.c            | 12 ++++++++++++
>>>>>  2 files changed, 19 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>>>> index 79671036cb5f..13bc2c72f791 100644
>>>>> --- a/include/linux/hmm.h
>>>>> +++ b/include/linux/hmm.h
>>>>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
>>>>>   * @pfns: array of pfns (big enough for the range)
>>>>>   * @flags: pfn flags to match device driver page table
>>>>>   * @values: pfn value for some special case (none, special, error, ...)
>>>>> + * @default_flags: default flags for the range (write, read, ...)
>>>>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>>>>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
>>>>>   * @valid: pfns array did not change since it has been fill by an HMM function
>>>>>   */
>>>>> @@ -177,6 +179,8 @@ struct hmm_range {
>>>>>  	uint64_t		*pfns;
>>>>>  	const uint64_t		*flags;
>>>>>  	const uint64_t		*values;
>>>>> +	uint64_t		default_flags;
>>>>> +	uint64_t		pfn_flags_mask;
>>>>>  	uint8_t			pfn_shift;
>>>>>  	bool			valid;
>>>>>  };
>>>>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>>>>>  {
>>>>>  	long ret;
>>>>>  
>>>>> +	range->default_flags = 0;
>>>>> +	range->pfn_flags_mask = -1UL;
>>>>
>>>> Hi Jerome,
>>>>
>>>> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
>>>> definitely does not need to be a run time value. And we want some assurance that
>>>> the mask is 
>>>> 	a) large enough for the flags, and
>>>> 	b) small enough to avoid overrunning the pfns field.
>>>>
>>>> Those are less certain with a run-time struct field, and more obviously correct with
>>>> something like, approximately:
>>>>
>>>>  	#define PFN_FLAGS_MASK 0xFFFF
>>>>
>>>> or something.
>>>>
>>>> In other words, this is more flexibility than we need--just a touch too much,
>>>> IMHO.
>>>
>>> This mirror the fact that flags are provided as an array and some devices use
>>> the top bits for flags (read, write, ...). So here it is the safe default to
>>> set it to -1. If the caller want to leverage this optimization it can override
>>> the default_flags value.
>>>
>>
>> Optimization? OK, now I'm a bit lost. Maybe this is another place where I could
>> use a peek at the calling code. The only flags I've seen so far use the bottom
>> 3 bits and that's it. 
>>
>> Maybe comments here?
>>
>>>>
>>>>> +
>>>>>  	ret = hmm_range_register(range, range->vma->vm_mm,
>>>>>  				 range->start, range->end);
>>>>>  	if (ret)
>>>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>>>> index fa9498eeb9b6..4fe88a196d17 100644
>>>>> --- a/mm/hmm.c
>>>>> +++ b/mm/hmm.c
>>>>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>>>>>  	if (!hmm_vma_walk->fault)
>>>>>  		return;
>>>>>  
>>>>> +	/*
>>>>> +	 * So we not only consider the individual per page request we also
>>>>> +	 * consider the default flags requested for the range. The API can
>>>>> +	 * be use in 2 fashions. The first one where the HMM user coalesce
>>>>> +	 * multiple page fault into one request and set flags per pfns for
>>>>> +	 * of those faults. The second one where the HMM user want to pre-
>>>>> +	 * fault a range with specific flags. For the latter one it is a
>>>>> +	 * waste to have the user pre-fill the pfn arrays with a default
>>>>> +	 * flags value.
>>>>> +	 */
>>>>> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
>>>>
>>>> Need to verify that the mask isn't too large or too small.
>>>
>>> I need to check agin but default flag is anded somewhere to limit
>>> the bit to the one we expect.
>>
>> Right, but in general, the *mask* could be wrong. It would be nice to have
>> an assert, and/or a comment, or something to verify the mask is proper.
>>
>> Really, a hardcoded mask is simple and correct--unless it *definitely* must
>> vary for devices of course.
> 
> Ok so re-read the code and it is correct. The helper for compatibility with
> old API (so that i do not break nouveau upstream code) initialize those to
> the safe default ie:
> 
> range->default_flags = 0;
> range->pfn_flags_mask = -1;
> 
> Which means that in the above comment we are in the case where it is the
> individual entry within the pfn array that will determine if we fault or
> not.
> 
> Driver using the new API can either use this safe default or use the
> second case in the above comment and set default_flags to something
> else than 0.
> 
> Note that those default_flags are not set in the final result they are
> use to determine if we need to do a page fault. For instance if you set
> the write bit in the default flags then the pfns computed above will
> have the write bit set and when we compare with the CPU pte if the CPU
> pte do not have the write bit set then we will fault. What matter is
> that in this case the value within the pfns array is totaly pointless
> ie we do not care what it is, it will not affect the decission ie the
> decision is made by looking at the default flags.
> 
> Hope this clarify thing. You can look at the ODP patch to see how it
> is use:
> 
> https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> 

Hi Jerome,

I think you're talking about flags, but I'm talking about the mask. The 
above link doesn't appear to use the pfn_flags_mask, and the default_flags 
that it uses are still in the same lower 3 bits:

+static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
+	ODP_READ_BIT,	/* HMM_PFN_VALID */
+	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
+	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
+};

So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
mask, that is *also* runtime changeable. 

thanks,
Jerome Glisse March 28, 2019, 11:21 p.m. UTC | #7
On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>> From: Jérôme Glisse <jglisse@redhat.com>
> >>>>>
> >>>>> The HMM mirror API can be use in two fashions. The first one where the HMM
> >>>>> user coalesce multiple page faults into one request and set flags per pfns
> >>>>> for of those faults. The second one where the HMM user want to pre-fault a
> >>>>> range with specific flags. For the latter one it is a waste to have the user
> >>>>> pre-fill the pfn arrays with a default flags value.
> >>>>>
> >>>>> This patch adds a default flags value allowing user to set them for a range
> >>>>> without having to pre-fill the pfn array.
> >>>>>
> >>>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> >>>>> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> >>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>>>> Cc: John Hubbard <jhubbard@nvidia.com>
> >>>>> Cc: Dan Williams <dan.j.williams@intel.com>
> >>>>> ---
> >>>>>  include/linux/hmm.h |  7 +++++++
> >>>>>  mm/hmm.c            | 12 ++++++++++++
> >>>>>  2 files changed, 19 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> >>>>> index 79671036cb5f..13bc2c72f791 100644
> >>>>> --- a/include/linux/hmm.h
> >>>>> +++ b/include/linux/hmm.h
> >>>>> @@ -165,6 +165,8 @@ enum hmm_pfn_value_e {
> >>>>>   * @pfns: array of pfns (big enough for the range)
> >>>>>   * @flags: pfn flags to match device driver page table
> >>>>>   * @values: pfn value for some special case (none, special, error, ...)
> >>>>> + * @default_flags: default flags for the range (write, read, ...)
> >>>>> + * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
> >>>>>   * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
> >>>>>   * @valid: pfns array did not change since it has been fill by an HMM function
> >>>>>   */
> >>>>> @@ -177,6 +179,8 @@ struct hmm_range {
> >>>>>  	uint64_t		*pfns;
> >>>>>  	const uint64_t		*flags;
> >>>>>  	const uint64_t		*values;
> >>>>> +	uint64_t		default_flags;
> >>>>> +	uint64_t		pfn_flags_mask;
> >>>>>  	uint8_t			pfn_shift;
> >>>>>  	bool			valid;
> >>>>>  };
> >>>>> @@ -521,6 +525,9 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> >>>>>  {
> >>>>>  	long ret;
> >>>>>  
> >>>>> +	range->default_flags = 0;
> >>>>> +	range->pfn_flags_mask = -1UL;
> >>>>
> >>>> Hi Jerome,
> >>>>
> >>>> This is nice to have. Let's constrain it a little bit more, though: the pfn_flags_mask
> >>>> definitely does not need to be a run time value. And we want some assurance that
> >>>> the mask is 
> >>>> 	a) large enough for the flags, and
> >>>> 	b) small enough to avoid overrunning the pfns field.
> >>>>
> >>>> Those are less certain with a run-time struct field, and more obviously correct with
> >>>> something like, approximately:
> >>>>
> >>>>  	#define PFN_FLAGS_MASK 0xFFFF
> >>>>
> >>>> or something.
> >>>>
> >>>> In other words, this is more flexibility than we need--just a touch too much,
> >>>> IMHO.
> >>>
> >>> This mirror the fact that flags are provided as an array and some devices use
> >>> the top bits for flags (read, write, ...). So here it is the safe default to
> >>> set it to -1. If the caller want to leverage this optimization it can override
> >>> the default_flags value.
> >>>
> >>
> >> Optimization? OK, now I'm a bit lost. Maybe this is another place where I could
> >> use a peek at the calling code. The only flags I've seen so far use the bottom
> >> 3 bits and that's it. 
> >>
> >> Maybe comments here?
> >>
> >>>>
> >>>>> +
> >>>>>  	ret = hmm_range_register(range, range->vma->vm_mm,
> >>>>>  				 range->start, range->end);
> >>>>>  	if (ret)
> >>>>> diff --git a/mm/hmm.c b/mm/hmm.c
> >>>>> index fa9498eeb9b6..4fe88a196d17 100644
> >>>>> --- a/mm/hmm.c
> >>>>> +++ b/mm/hmm.c
> >>>>> @@ -415,6 +415,18 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
> >>>>>  	if (!hmm_vma_walk->fault)
> >>>>>  		return;
> >>>>>  
> >>>>> +	/*
> >>>>> +	 * So we not only consider the individual per page request we also
> >>>>> +	 * consider the default flags requested for the range. The API can
> >>>>> +	 * be use in 2 fashions. The first one where the HMM user coalesce
> >>>>> +	 * multiple page fault into one request and set flags per pfns for
> >>>>> +	 * of those faults. The second one where the HMM user want to pre-
> >>>>> +	 * fault a range with specific flags. For the latter one it is a
> >>>>> +	 * waste to have the user pre-fill the pfn arrays with a default
> >>>>> +	 * flags value.
> >>>>> +	 */
> >>>>> +	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
> >>>>
> >>>> Need to verify that the mask isn't too large or too small.
> >>>
> >>> I need to check agin but default flag is anded somewhere to limit
> >>> the bit to the one we expect.
> >>
> >> Right, but in general, the *mask* could be wrong. It would be nice to have
> >> an assert, and/or a comment, or something to verify the mask is proper.
> >>
> >> Really, a hardcoded mask is simple and correct--unless it *definitely* must
> >> vary for devices of course.
> > 
> > Ok so re-read the code and it is correct. The helper for compatibility with
> > old API (so that i do not break nouveau upstream code) initialize those to
> > the safe default ie:
> > 
> > range->default_flags = 0;
> > range->pfn_flags_mask = -1;
> > 
> > Which means that in the above comment we are in the case where it is the
> > individual entry within the pfn array that will determine if we fault or
> > not.
> > 
> > Driver using the new API can either use this safe default or use the
> > second case in the above comment and set default_flags to something
> > else than 0.
> > 
> > Note that those default_flags are not set in the final result they are
> > use to determine if we need to do a page fault. For instance if you set
> > the write bit in the default flags then the pfns computed above will
> > have the write bit set and when we compare with the CPU pte if the CPU
> > pte do not have the write bit set then we will fault. What matter is
> > that in this case the value within the pfns array is totaly pointless
> > ie we do not care what it is, it will not affect the decission ie the
> > decision is made by looking at the default flags.
> > 
> > Hope this clarify thing. You can look at the ODP patch to see how it
> > is use:
> > 
> > https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-odp-v2&id=eebd4f3095290a16ebc03182e2d3ab5dfa7b05ec
> > 
> 
> Hi Jerome,
> 
> I think you're talking about flags, but I'm talking about the mask. The 
> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> that it uses are still in the same lower 3 bits:
> 
> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
> +};
> 
> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
> mask, that is *also* runtime changeable. 

So the pfn array is using a device driver specific format and we have
no idea nor do we need to know where the valid, write, ... bit are in
that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
we do not care. They are device with bit at the top and for those you
need a mask that allows you to mask out those bits or not depending on
what the user want to do.

The mask here is against an _unknown_ (from HMM POV) format. So we can
not presume where the bits will be and thus we can not presume what a
proper mask is.

So that's why a full unsigned long mask is use here.

Maybe an example will help let say the device flag are:
    VALID (1 << 63)
    WRITE (1 << 62)

Now let say that device wants to fault with at least read a range
it does set:
    range->default_flags = (1 << 63)
    range->pfn_flags_mask = 0;

This will fill fault all page in the range with at least read
permission.

Now let say it wants to do the same except for one page in the range
for which its want to have write. Now driver set:
    range->default_flags = (1 << 63);
    range->pfn_flags_mask = (1 << 62);
    range->pfns[index_of_write] = (1 << 62);

With this HMM will fault in all page with at least read (ie valid)
and for the address: range->start + index_of_write << PAGE_SHIFT it
will fault with write permission ie if the CPU pte does not have
write permission set then handle_mm_fault() will be call asking for
write permission.


Note that in the above HMM will populate the pfns array with write
permission for any entry that have write permission within the CPU
pte ie the default_flags and pfn_flags_mask is only the minimun
requirement but HMM always returns all the flag that are set in the
CPU pte.


Now let say you are an "old" driver like nouveau upstream, then it
means that you are setting each individual entry within range->pfns
with the exact flags you want for each address hence here what you
want is:
    range->default_flags = 0;
    range->pfn_flags_mask = -1UL;

So that what we do is (for each entry):
    (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
and we end up with the flags that were set by the driver for each of
the individual range->pfns entries.


Does this help ?

Cheers,
Jérôme
John Hubbard March 28, 2019, 11:28 p.m. UTC | #8
On 3/28/19 4:21 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
[...]
>> Hi Jerome,
>>
>> I think you're talking about flags, but I'm talking about the mask. The 
>> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
>> that it uses are still in the same lower 3 bits:
>>
>> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
>> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
>> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
>> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
>> +};
>>
>> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
>> mask, that is *also* runtime changeable. 
> 
> So the pfn array is using a device driver specific format and we have
> no idea nor do we need to know where the valid, write, ... bit are in
> that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> we do not care. They are device with bit at the top and for those you
> need a mask that allows you to mask out those bits or not depending on
> what the user want to do.
> 
> The mask here is against an _unknown_ (from HMM POV) format. So we can
> not presume where the bits will be and thus we can not presume what a
> proper mask is.
> 
> So that's why a full unsigned long mask is use here.
> 
> Maybe an example will help let say the device flag are:
>     VALID (1 << 63)
>     WRITE (1 << 62)
> 
> Now let say that device wants to fault with at least read a range
> it does set:
>     range->default_flags = (1 << 63)
>     range->pfn_flags_mask = 0;
> 
> This will fill fault all page in the range with at least read
> permission.
> 
> Now let say it wants to do the same except for one page in the range
> for which its want to have write. Now driver set:
>     range->default_flags = (1 << 63);
>     range->pfn_flags_mask = (1 << 62);
>     range->pfns[index_of_write] = (1 << 62);
> 
> With this HMM will fault in all page with at least read (ie valid)
> and for the address: range->start + index_of_write << PAGE_SHIFT it
> will fault with write permission ie if the CPU pte does not have
> write permission set then handle_mm_fault() will be call asking for
> write permission.
> 
> 
> Note that in the above HMM will populate the pfns array with write
> permission for any entry that have write permission within the CPU
> pte ie the default_flags and pfn_flags_mask is only the minimun
> requirement but HMM always returns all the flag that are set in the
> CPU pte.
> 
> 
> Now let say you are an "old" driver like nouveau upstream, then it
> means that you are setting each individual entry within range->pfns
> with the exact flags you want for each address hence here what you
> want is:
>     range->default_flags = 0;
>     range->pfn_flags_mask = -1UL;
> 
> So that what we do is (for each entry):
>     (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> and we end up with the flags that were set by the driver for each of
> the individual range->pfns entries.
> 
> 
> Does this help ?
> 

Yes, the key point for me was that this is an entirely device driver specific
format. OK. But then we have HMM setting it. So a comment to the effect that
this is device-specific might be nice, but I'll leave that up to you whether
it is useful.

Either way, you can add:

	Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
Jerome Glisse March 28, 2019, 11:43 p.m. UTC | #9
On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> >> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> >>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> [...]
> >> Hi Jerome,
> >>
> >> I think you're talking about flags, but I'm talking about the mask. The 
> >> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> >> that it uses are still in the same lower 3 bits:
> >>
> >> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> >> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
> >> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
> >> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
> >> +};
> >>
> >> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
> >> mask, that is *also* runtime changeable. 
> > 
> > So the pfn array is using a device driver specific format and we have
> > no idea nor do we need to know where the valid, write, ... bit are in
> > that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> > we do not care. They are device with bit at the top and for those you
> > need a mask that allows you to mask out those bits or not depending on
> > what the user want to do.
> > 
> > The mask here is against an _unknown_ (from HMM POV) format. So we can
> > not presume where the bits will be and thus we can not presume what a
> > proper mask is.
> > 
> > So that's why a full unsigned long mask is use here.
> > 
> > Maybe an example will help let say the device flag are:
> >     VALID (1 << 63)
> >     WRITE (1 << 62)
> > 
> > Now let say that device wants to fault with at least read a range
> > it does set:
> >     range->default_flags = (1 << 63)
> >     range->pfn_flags_mask = 0;
> > 
> > This will fill fault all page in the range with at least read
> > permission.
> > 
> > Now let say it wants to do the same except for one page in the range
> > for which its want to have write. Now driver set:
> >     range->default_flags = (1 << 63);
> >     range->pfn_flags_mask = (1 << 62);
> >     range->pfns[index_of_write] = (1 << 62);
> > 
> > With this HMM will fault in all page with at least read (ie valid)
> > and for the address: range->start + index_of_write << PAGE_SHIFT it
> > will fault with write permission ie if the CPU pte does not have
> > write permission set then handle_mm_fault() will be call asking for
> > write permission.
> > 
> > 
> > Note that in the above HMM will populate the pfns array with write
> > permission for any entry that have write permission within the CPU
> > pte ie the default_flags and pfn_flags_mask is only the minimun
> > requirement but HMM always returns all the flag that are set in the
> > CPU pte.
> > 
> > 
> > Now let say you are an "old" driver like nouveau upstream, then it
> > means that you are setting each individual entry within range->pfns
> > with the exact flags you want for each address hence here what you
> > want is:
> >     range->default_flags = 0;
> >     range->pfn_flags_mask = -1UL;
> > 
> > So that what we do is (for each entry):
> >     (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> > and we end up with the flags that were set by the driver for each of
> > the individual range->pfns entries.
> > 
> > 
> > Does this help ?
> > 
> 
> Yes, the key point for me was that this is an entirely device driver specific
> format. OK. But then we have HMM setting it. So a comment to the effect that
> this is device-specific might be nice, but I'll leave that up to you whether
> it is useful.

The code you were pointing at is temporary ie once this get merge that code
will get remove in release N+2 ie merge code in N, update nouveau in N+1 and
remove this temporary code in N+2

When updating HMM API it is easier to stage API update over release like that
so there is no need to synchronize accross multiple tree (mm, drm, rdma, ...)

> Either way, you can add:
> 
> 	Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 
> thanks,
> -- 
> John Hubbard
> NVIDIA
Jerome Glisse March 29, 2019, 1:17 a.m. UTC | #10
On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> > On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> > >> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > >>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> > >>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > >>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> > >>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> > >>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> > [...]
> > >> Hi Jerome,
> > >>
> > >> I think you're talking about flags, but I'm talking about the mask. The 
> > >> above link doesn't appear to use the pfn_flags_mask, and the default_flags 
> > >> that it uses are still in the same lower 3 bits:
> > >>
> > >> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
> > >> +	ODP_READ_BIT,	/* HMM_PFN_VALID */
> > >> +	ODP_WRITE_BIT,	/* HMM_PFN_WRITE */
> > >> +	ODP_DEVICE_BIT,	/* HMM_PFN_DEVICE_PRIVATE */
> > >> +};
> > >>
> > >> So I still don't see why we need the flexibility of a full 0xFFFFFFFFFFFFFFFF
> > >> mask, that is *also* runtime changeable. 
> > > 
> > > So the pfn array is using a device driver specific format and we have
> > > no idea nor do we need to know where the valid, write, ... bit are in
> > > that format. Those bits can be in the top 60 bits like 63, 62, 61, ...
> > > we do not care. They are device with bit at the top and for those you
> > > need a mask that allows you to mask out those bits or not depending on
> > > what the user want to do.
> > > 
> > > The mask here is against an _unknown_ (from HMM POV) format. So we can
> > > not presume where the bits will be and thus we can not presume what a
> > > proper mask is.
> > > 
> > > So that's why a full unsigned long mask is use here.
> > > 
> > > Maybe an example will help let say the device flag are:
> > >     VALID (1 << 63)
> > >     WRITE (1 << 62)
> > > 
> > > Now let say that device wants to fault with at least read a range
> > > it does set:
> > >     range->default_flags = (1 << 63)
> > >     range->pfn_flags_mask = 0;
> > > 
> > > This will fill fault all page in the range with at least read
> > > permission.
> > > 
> > > Now let say it wants to do the same except for one page in the range
> > > for which its want to have write. Now driver set:
> > >     range->default_flags = (1 << 63);
> > >     range->pfn_flags_mask = (1 << 62);
> > >     range->pfns[index_of_write] = (1 << 62);
> > > 
> > > With this HMM will fault in all page with at least read (ie valid)
> > > and for the address: range->start + index_of_write << PAGE_SHIFT it
> > > will fault with write permission ie if the CPU pte does not have
> > > write permission set then handle_mm_fault() will be call asking for
> > > write permission.
> > > 
> > > 
> > > Note that in the above HMM will populate the pfns array with write
> > > permission for any entry that have write permission within the CPU
> > > pte ie the default_flags and pfn_flags_mask is only the minimun
> > > requirement but HMM always returns all the flag that are set in the
> > > CPU pte.
> > > 
> > > 
> > > Now let say you are an "old" driver like nouveau upstream, then it
> > > means that you are setting each individual entry within range->pfns
> > > with the exact flags you want for each address hence here what you
> > > want is:
> > >     range->default_flags = 0;
> > >     range->pfn_flags_mask = -1UL;
> > > 
> > > So that what we do is (for each entry):
> > >     (range->pfns[index] & range->pfn_flags_mask) | range->default_flags
> > > and we end up with the flags that were set by the driver for each of
> > > the individual range->pfns entries.
> > > 
> > > 
> > > Does this help ?
> > > 
> > 
> > Yes, the key point for me was that this is an entirely device driver specific
> > format. OK. But then we have HMM setting it. So a comment to the effect that
> > this is device-specific might be nice, but I'll leave that up to you whether
> > it is useful.
> 
> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> 
> /*
>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
>  * @range: range use to encode HMM pfn value
>  * @pfn: pfn value for which to create the HMM pfn
>  * Returns: valid HMM pfn for the pfn
>  */
> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
>                                         unsigned long pfn)
> 
> So should this patch contain some sort of helper like this... maybe?
> 
> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> discussed here?
> 
> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
> have shortened the discussion here.
> 

That helper is also use today by nouveau so changing that name is not that
easy it does require the multi-release dance. So i am not sure how much
value there is in a name change.

Cheers,
Jérôme
John Hubbard March 29, 2019, 1:30 a.m. UTC | #11
On 3/28/19 6:17 PM, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
>> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
>>> On 3/28/19 4:21 PM, Jerome Glisse wrote:
>>>> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
>>>>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
>>>>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
>>>>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
>>>>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
>>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
>>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
>>> [...]
>> Indeed I did not realize there is an hmm "pfn" until I saw this function:
>>
>> /*
>>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
>>  * @range: range use to encode HMM pfn value
>>  * @pfn: pfn value for which to create the HMM pfn
>>  * Returns: valid HMM pfn for the pfn
>>  */
>> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
>>                                         unsigned long pfn)
>>
>> So should this patch contain some sort of helper like this... maybe?
>>
>> I'm assuming the "hmm_pfn" being returned above is the device pfn being
>> discussed here?
>>
>> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
>> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
>> have shortened the discussion here.
>>
> 
> That helper is also use today by nouveau so changing that name is not that
> easy it does require the multi-release dance. So i am not sure how much
> value there is in a name change.
> 

Once the dust settles, I would expect that a name change for this could go
via Andrew's tree, right? It seems incredible to claim that we've built something
that effectively does not allow any minor changes!

I do think it's worth some *minor* trouble to improve the name, assuming that we
can do it in a simple patch, rather than some huge maintainer-level effort.

This field name not a large thing, but the cumulative effect of having a number of
naming glitches within HMM is significant. The size and complexity of HMM has
always made it hard to attract code reviewers, so let's improve what we can, to
counteract that.

thanks,
Jerome Glisse March 29, 2019, 1:42 a.m. UTC | #12
On Thu, Mar 28, 2019 at 06:30:26PM -0700, John Hubbard wrote:
> On 3/28/19 6:17 PM, Jerome Glisse wrote:
> > On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> >> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> >>> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> >>>> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> >>>>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> >>>>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> >>>>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> >>>>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> >>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> >>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> >>> [...]
> >> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> >>
> >> /*
> >>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> >>  * @range: range use to encode HMM pfn value
> >>  * @pfn: pfn value for which to create the HMM pfn
> >>  * Returns: valid HMM pfn for the pfn
> >>  */
> >> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> >>                                         unsigned long pfn)
> >>
> >> So should this patch contain some sort of helper like this... maybe?
> >>
> >> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> >> discussed here?
> >>
> >> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
> >> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
> >> have shortened the discussion here.
> >>
> > 
> > That helper is also use today by nouveau so changing that name is not that
> > easy it does require the multi-release dance. So i am not sure how much
> > value there is in a name change.
> > 
> 
> Once the dust settles, I would expect that a name change for this could go
> via Andrew's tree, right? It seems incredible to claim that we've built something
> that effectively does not allow any minor changes!
> 
> I do think it's worth some *minor* trouble to improve the name, assuming that we
> can do it in a simple patch, rather than some huge maintainer-level effort.

Change to nouveau have to go through nouveau tree so changing name means:
 -  release N add function with new name, maybe make the old function just
    a wrapper to the new function
 -  release N+1 update user to use the new name
 -  release N+2 remove the old name

So it is do-able but it is painful so i rather do that one latter that now
as i am sure people will then complain again about some little thing and it
will post pone this whole patchset on that new bit. To avoid post-poning
RDMA and bunch of other patchset that build on top of that i rather get
this patchset in and then do more changes in the next cycle.

This is just a capacity thing.

Cheers,
Jérôme
Jerome Glisse March 29, 2019, 1:59 a.m. UTC | #13
On Thu, Mar 28, 2019 at 09:42:59PM -0400, Jerome Glisse wrote:
> On Thu, Mar 28, 2019 at 06:30:26PM -0700, John Hubbard wrote:
> > On 3/28/19 6:17 PM, Jerome Glisse wrote:
> > > On Thu, Mar 28, 2019 at 09:42:31AM -0700, Ira Weiny wrote:
> > >> On Thu, Mar 28, 2019 at 04:28:47PM -0700, John Hubbard wrote:
> > >>> On 3/28/19 4:21 PM, Jerome Glisse wrote:
> > >>>> On Thu, Mar 28, 2019 at 03:40:42PM -0700, John Hubbard wrote:
> > >>>>> On 3/28/19 3:31 PM, Jerome Glisse wrote:
> > >>>>>> On Thu, Mar 28, 2019 at 03:19:06PM -0700, John Hubbard wrote:
> > >>>>>>> On 3/28/19 3:12 PM, Jerome Glisse wrote:
> > >>>>>>>> On Thu, Mar 28, 2019 at 02:59:50PM -0700, John Hubbard wrote:
> > >>>>>>>>> On 3/25/19 7:40 AM, jglisse@redhat.com wrote:
> > >>>>>>>>>> From: Jérôme Glisse <jglisse@redhat.com>
> > >>> [...]
> > >> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> > >>
> > >> /*
> > >>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> > >>  * @range: range use to encode HMM pfn value
> > >>  * @pfn: pfn value for which to create the HMM pfn
> > >>  * Returns: valid HMM pfn for the pfn
> > >>  */
> > >> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> > >>                                         unsigned long pfn)
> > >>
> > >> So should this patch contain some sort of helper like this... maybe?
> > >>
> > >> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> > >> discussed here?
> > >>
> > >> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
> > >> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
> > >> have shortened the discussion here.
> > >>
> > > 
> > > That helper is also use today by nouveau so changing that name is not that
> > > easy it does require the multi-release dance. So i am not sure how much
> > > value there is in a name change.
> > > 
> > 
> > Once the dust settles, I would expect that a name change for this could go
> > via Andrew's tree, right? It seems incredible to claim that we've built something
> > that effectively does not allow any minor changes!
> > 
> > I do think it's worth some *minor* trouble to improve the name, assuming that we
> > can do it in a simple patch, rather than some huge maintainer-level effort.
> 
> Change to nouveau have to go through nouveau tree so changing name means:
>  -  release N add function with new name, maybe make the old function just
>     a wrapper to the new function
>  -  release N+1 update user to use the new name
>  -  release N+2 remove the old name
> 
> So it is do-able but it is painful so i rather do that one latter that now
> as i am sure people will then complain again about some little thing and it
> will post pone this whole patchset on that new bit. To avoid post-poning
> RDMA and bunch of other patchset that build on top of that i rather get
> this patchset in and then do more changes in the next cycle.
> 
> This is just a capacity thing.

Also for clarity changes to API i am doing in this patchset is to make
the ODP convertion easier and thus they bring a real hard value. Renaming
those function is esthetic, i am not saying it is useless, i am saying it
does not have the same value as those other changes and i would rather not
miss another merge window just for esthetic changes.

Cheers,
Jérôme
John Hubbard March 29, 2019, 2:05 a.m. UTC | #14
On 3/28/19 6:59 PM, Jerome Glisse wrote:
>>>>>> [...]
>>>>> Indeed I did not realize there is an hmm "pfn" until I saw this function:
>>>>>
>>>>> /*
>>>>>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
>>>>>  * @range: range use to encode HMM pfn value
>>>>>  * @pfn: pfn value for which to create the HMM pfn
>>>>>  * Returns: valid HMM pfn for the pfn
>>>>>  */
>>>>> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
>>>>>                                         unsigned long pfn)
>>>>>
>>>>> So should this patch contain some sort of helper like this... maybe?
>>>>>
>>>>> I'm assuming the "hmm_pfn" being returned above is the device pfn being
>>>>> discussed here?
>>>>>
>>>>> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
>>>>> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
>>>>> have shortened the discussion here.
>>>>>
>>>>
>>>> That helper is also use today by nouveau so changing that name is not that
>>>> easy it does require the multi-release dance. So i am not sure how much
>>>> value there is in a name change.
>>>>
>>>
>>> Once the dust settles, I would expect that a name change for this could go
>>> via Andrew's tree, right? It seems incredible to claim that we've built something
>>> that effectively does not allow any minor changes!
>>>
>>> I do think it's worth some *minor* trouble to improve the name, assuming that we
>>> can do it in a simple patch, rather than some huge maintainer-level effort.
>>
>> Change to nouveau have to go through nouveau tree so changing name means:

Yes, I understand the guideline, but is that always how it must be done? Ben (+cc)?

>>  -  release N add function with new name, maybe make the old function just
>>     a wrapper to the new function
>>  -  release N+1 update user to use the new name
>>  -  release N+2 remove the old name
>>
>> So it is do-able but it is painful so i rather do that one latter that now
>> as i am sure people will then complain again about some little thing and it
>> will post pone this whole patchset on that new bit. To avoid post-poning
>> RDMA and bunch of other patchset that build on top of that i rather get
>> this patchset in and then do more changes in the next cycle.
>>
>> This is just a capacity thing.
> 
> Also for clarity changes to API i am doing in this patchset is to make
> the ODP convertion easier and thus they bring a real hard value. Renaming
> those function is esthetic, i am not saying it is useless, i am saying it
> does not have the same value as those other changes and i would rather not
> miss another merge window just for esthetic changes.
> 

Agreed, that this minor point should not hold up this patch.

thanks,
Jerome Glisse March 29, 2019, 2:12 a.m. UTC | #15
On Thu, Mar 28, 2019 at 07:05:21PM -0700, John Hubbard wrote:
> On 3/28/19 6:59 PM, Jerome Glisse wrote:
> >>>>>> [...]
> >>>>> Indeed I did not realize there is an hmm "pfn" until I saw this function:
> >>>>>
> >>>>> /*
> >>>>>  * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
> >>>>>  * @range: range use to encode HMM pfn value
> >>>>>  * @pfn: pfn value for which to create the HMM pfn
> >>>>>  * Returns: valid HMM pfn for the pfn
> >>>>>  */
> >>>>> static inline uint64_t hmm_pfn_from_pfn(const struct hmm_range *range,
> >>>>>                                         unsigned long pfn)
> >>>>>
> >>>>> So should this patch contain some sort of helper like this... maybe?
> >>>>>
> >>>>> I'm assuming the "hmm_pfn" being returned above is the device pfn being
> >>>>> discussed here?
> >>>>>
> >>>>> I'm also thinking calling it pfn is confusing.  I'm not advocating a new type
> >>>>> but calling the "device pfn's" "hmm_pfn" or "device_pfn" seems like it would
> >>>>> have shortened the discussion here.
> >>>>>
> >>>>
> >>>> That helper is also use today by nouveau so changing that name is not that
> >>>> easy it does require the multi-release dance. So i am not sure how much
> >>>> value there is in a name change.
> >>>>
> >>>
> >>> Once the dust settles, I would expect that a name change for this could go
> >>> via Andrew's tree, right? It seems incredible to claim that we've built something
> >>> that effectively does not allow any minor changes!
> >>>
> >>> I do think it's worth some *minor* trouble to improve the name, assuming that we
> >>> can do it in a simple patch, rather than some huge maintainer-level effort.
> >>
> >> Change to nouveau have to go through nouveau tree so changing name means:
> 
> Yes, I understand the guideline, but is that always how it must be done? Ben (+cc)?

Yes, it is not only about nouveau, it will be about every single
upstream driver using HMM. It is the easiest solution all other
solution involve coordination and/or risk of people that handle
the conflict to do something that break things.

Cheers,
Jérôme
diff mbox series

Patch

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 79671036cb5f..13bc2c72f791 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -165,6 +165,8 @@  enum hmm_pfn_value_e {
  * @pfns: array of pfns (big enough for the range)
  * @flags: pfn flags to match device driver page table
  * @values: pfn value for some special case (none, special, error, ...)
+ * @default_flags: default flags for the range (write, read, ...)
+ * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
  * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
@@ -177,6 +179,8 @@  struct hmm_range {
 	uint64_t		*pfns;
 	const uint64_t		*flags;
 	const uint64_t		*values;
+	uint64_t		default_flags;
+	uint64_t		pfn_flags_mask;
 	uint8_t			pfn_shift;
 	bool			valid;
 };
@@ -521,6 +525,9 @@  static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 {
 	long ret;
 
+	range->default_flags = 0;
+	range->pfn_flags_mask = -1UL;
+
 	ret = hmm_range_register(range, range->vma->vm_mm,
 				 range->start, range->end);
 	if (ret)
diff --git a/mm/hmm.c b/mm/hmm.c
index fa9498eeb9b6..4fe88a196d17 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -415,6 +415,18 @@  static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 	if (!hmm_vma_walk->fault)
 		return;
 
+	/*
+	 * So we not only consider the individual per page request we also
+	 * consider the default flags requested for the range. The API can
+	 * be use in 2 fashions. The first one where the HMM user coalesce
+	 * multiple page fault into one request and set flags per pfns for
+	 * of those faults. The second one where the HMM user want to pre-
+	 * fault a range with specific flags. For the latter one it is a
+	 * waste to have the user pre-fill the pfn arrays with a default
+	 * flags value.
+	 */
+	pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
+
 	/* We aren't ask to do anything ... */
 	if (!(pfns & range->flags[HMM_PFN_VALID]))
 		return;