diff mbox series

[v4,2/6] vfio: Add a new device feature for the power management

Message ID 20220701110814.7310-3-abhsahu@nvidia.com (mailing list archive)
State New, archived
Headers show
Series vfio/pci: power management changes | expand

Commit Message

Abhishek Sahu July 1, 2022, 11:08 a.m. UTC
This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
for the power management in the header file. The implementation for the
same will be added in the subsequent patches.

With the standard registers, all power states cannot be achieved. The
platform-based power management needs to be involved to go into the
lowest power state. For all the platform-based power management, this
device feature can be used.

This device feature uses flags to specify the different operations. In
the future, if any more power management functionality is needed then
a new flag can be added to it. It supports both GET and SET operations.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

Comments

Alex Williamson July 6, 2022, 3:39 p.m. UTC | #1
On Fri, 1 Jul 2022 16:38:10 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> for the power management in the header file. The implementation for the
> same will be added in the subsequent patches.
> 
> With the standard registers, all power states cannot be achieved. The
> platform-based power management needs to be involved to go into the
> lowest power state. For all the platform-based power management, this
> device feature can be used.
> 
> This device feature uses flags to specify the different operations. In
> the future, if any more power management functionality is needed then
> a new flag can be added to it. It supports both GET and SET operations.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..7e00de5c21ea 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
>  
> +/*
> + * Perform power management-related operations for the VFIO device.
> + *
> + * The low power feature uses platform-based power management to move the
> + * device into the low power state.  This low power state is device-specific.
> + *
> + * This device feature uses flags to specify the different operations.
> + * It supports both the GET and SET operations.
> + *
> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> + *   state with platform-based power management.  This low power state will be
> + *   internal to the VFIO driver and the user will not come to know which power
> + *   state is chosen.  Once the user has moved the VFIO device into the low
> + *   power state, then the user should not do any device access without moving
> + *   the device out of the low power state.

Except we're wrapping device accesses to make this possible.  This
should probably describe how any discrete access will wake the device
but ongoing access through mmaps will generate user faults.

> + *
> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> + *    state.  This flag should only be set if the user has previously put the
> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.

Indenting.

> + *
> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> + *
> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> + *   the host side, then the device will be moved out of the low power state
> + *   without the user's guest driver involvement.  Some devices require the
> + *   user's guest driver involvement for each low-power entry.  If this flag is
> + *   set, then the re-entry to the low power state will be disabled, and the
> + *   host kernel will not move the device again into the low power state.
> + *   The VFIO driver internally maintains a list of devices for which low
> + *   power re-entry is disabled by default and for those devices, the
> + *   re-entry will be disabled even if the user has not set this flag
> + *   explicitly.

Wrong polarity.  The kernel should not maintain the policy.  By default
every wakeup, whether from host kernel accesses or via user accesses
that do a pm-get should signal a wakeup to userspace.  Userspace needs
to opt-out of that wakeup to let the kernel automatically re-enter low
power and userspace needs to maintain the policy for which devices it
wants that to occur.

> + *
> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> + *
> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> + *
> + * - If the device is in a normal power state currently, then
> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> + *   power re-entry is disabled by default.  If the device is in the low power
> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> + *   according to the current transition.

Very confusing semantics.

What if the feature SET ioctl took an eventfd and that eventfd was one
time use.  Calling the ioctl would setup the eventfd to notify the user
on wakeup and call pm-put.  Any access to the device via host, ioctl,
or region would be wrapped in pm-get/put and the pm-resume handler
would perform the matching pm-get to balance the feature SET and signal
the eventfd.  If the user opts-out by not providing a wakeup eventfd,
then the pm-resume handler does not perform a pm-get.  Possibly we
could even allow mmap access if a wake-up eventfd is provided.  The
feature GET ioctl would be used to exit low power behavior and would be
a no-op if the wakeup eventfd had already been signaled.  Thanks,

Alex

> + */
> +struct vfio_device_feature_power_management {
> +	__u32	flags;
> +#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
> +#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
> +#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
> +	__u32	reserved;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
Abhishek Sahu July 8, 2022, 9:39 a.m. UTC | #2
On 7/6/2022 9:09 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:10 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>> for the power management in the header file. The implementation for the
>> same will be added in the subsequent patches.
>>
>> With the standard registers, all power states cannot be achieved. The
>> platform-based power management needs to be involved to go into the
>> lowest power state. For all the platform-based power management, this
>> device feature can be used.
>>
>> This device feature uses flags to specify the different operations. In
>> the future, if any more power management functionality is needed then
>> a new flag can be added to it. It supports both GET and SET operations.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 55 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 733a1cddde30..7e00de5c21ea 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>  };
>>  
>> +/*
>> + * Perform power management-related operations for the VFIO device.
>> + *
>> + * The low power feature uses platform-based power management to move the
>> + * device into the low power state.  This low power state is device-specific.
>> + *
>> + * This device feature uses flags to specify the different operations.
>> + * It supports both the GET and SET operations.
>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>> + *   state with platform-based power management.  This low power state will be
>> + *   internal to the VFIO driver and the user will not come to know which power
>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>> + *   power state, then the user should not do any device access without moving
>> + *   the device out of the low power state.
> 
> Except we're wrapping device accesses to make this possible.  This
> should probably describe how any discrete access will wake the device
> but ongoing access through mmaps will generate user faults.
> 

 Sure. I will add that details also.

>> + *
>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>> + *    state.  This flag should only be set if the user has previously put the
>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.
> 
> Indenting.
> 
 
 I will fix this.

>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>> + *
>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>> + *   the host side, then the device will be moved out of the low power state
>> + *   without the user's guest driver involvement.  Some devices require the
>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>> + *   set, then the re-entry to the low power state will be disabled, and the
>> + *   host kernel will not move the device again into the low power state.
>> + *   The VFIO driver internally maintains a list of devices for which low
>> + *   power re-entry is disabled by default and for those devices, the
>> + *   re-entry will be disabled even if the user has not set this flag
>> + *   explicitly.
> 
> Wrong polarity.  The kernel should not maintain the policy.  By default
> every wakeup, whether from host kernel accesses or via user accesses
> that do a pm-get should signal a wakeup to userspace.  Userspace needs
> to opt-out of that wakeup to let the kernel automatically re-enter low
> power and userspace needs to maintain the policy for which devices it
> wants that to occur.
> 
 
 Okay. So that means, in the kernel side, we don’t have to maintain
 the list which currently contains NVIDIA device ID. Also, in our
 updated approach, this opt-out of that wake-up means that user
 has not provided eventfd in the feature SET ioctl. Correct ?
 
>> + *
>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>> + *
>> + * - If the device is in a normal power state currently, then
>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>> + *   power re-entry is disabled by default.  If the device is in the low power
>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>> + *   according to the current transition.
> 
> Very confusing semantics.
> 
> What if the feature SET ioctl took an eventfd and that eventfd was one
> time use.  Calling the ioctl would setup the eventfd to notify the user
> on wakeup and call pm-put.  Any access to the device via host, ioctl,
> or region would be wrapped in pm-get/put and the pm-resume handler
> would perform the matching pm-get to balance the feature SET and signal
> the eventfd. 

 This seems a better option. It will help in making the ioctl simpler
 and we don’t have to add a separate index for PME which I added in
 patch 6. 

> If the user opts-out by not providing a wakeup eventfd,
> then the pm-resume handler does not perform a pm-get. Possibly we
> could even allow mmap access if a wake-up eventfd is provided.

 Sorry. I am not clear on this mmap part. We currently invalidates
 mapping before going into runtime-suspend. Now, if use tries do
 mmap then do we need some extra handling in the fault handler ?
 Need your help in understanding this part.

> The
> feature GET ioctl would be used to exit low power behavior and would be
> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>
 
 I will use the GET ioctl for low power exit instead of returning the
 current status.
 
 Regards,
 Abhishek

> Alex
> 
>> + */
>> +struct vfio_device_feature_power_management {
>> +	__u32	flags;
>> +#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
>> +#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
>> +#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
>> +	__u32	reserved;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
>
Alex Williamson July 8, 2022, 4:36 p.m. UTC | #3
On Fri, 8 Jul 2022 15:09:22 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/6/2022 9:09 PM, Alex Williamson wrote:
> > On Fri, 1 Jul 2022 16:38:10 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> >> for the power management in the header file. The implementation for the
> >> same will be added in the subsequent patches.
> >>
> >> With the standard registers, all power states cannot be achieved. The
> >> platform-based power management needs to be involved to go into the
> >> lowest power state. For all the platform-based power management, this
> >> device feature can be used.
> >>
> >> This device feature uses flags to specify the different operations. In
> >> the future, if any more power management functionality is needed then
> >> a new flag can be added to it. It supports both GET and SET operations.
> >>
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 55 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 733a1cddde30..7e00de5c21ea 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>  };
> >>  
> >> +/*
> >> + * Perform power management-related operations for the VFIO device.
> >> + *
> >> + * The low power feature uses platform-based power management to move the
> >> + * device into the low power state.  This low power state is device-specific.
> >> + *
> >> + * This device feature uses flags to specify the different operations.
> >> + * It supports both the GET and SET operations.
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> >> + *   state with platform-based power management.  This low power state will be
> >> + *   internal to the VFIO driver and the user will not come to know which power
> >> + *   state is chosen.  Once the user has moved the VFIO device into the low
> >> + *   power state, then the user should not do any device access without moving
> >> + *   the device out of the low power state.  
> > 
> > Except we're wrapping device accesses to make this possible.  This
> > should probably describe how any discrete access will wake the device
> > but ongoing access through mmaps will generate user faults.
> >   
> 
>  Sure. I will add that details also.
> 
> >> + *
> >> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> >> + *    state.  This flag should only be set if the user has previously put the
> >> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.  
> > 
> > Indenting.
> >   
>  
>  I will fix this.
> 
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> >> + *
> >> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> >> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> >> + *   the host side, then the device will be moved out of the low power state
> >> + *   without the user's guest driver involvement.  Some devices require the
> >> + *   user's guest driver involvement for each low-power entry.  If this flag is
> >> + *   set, then the re-entry to the low power state will be disabled, and the
> >> + *   host kernel will not move the device again into the low power state.
> >> + *   The VFIO driver internally maintains a list of devices for which low
> >> + *   power re-entry is disabled by default and for those devices, the
> >> + *   re-entry will be disabled even if the user has not set this flag
> >> + *   explicitly.  
> > 
> > Wrong polarity.  The kernel should not maintain the policy.  By default
> > every wakeup, whether from host kernel accesses or via user accesses
> > that do a pm-get should signal a wakeup to userspace.  Userspace needs
> > to opt-out of that wakeup to let the kernel automatically re-enter low
> > power and userspace needs to maintain the policy for which devices it
> > wants that to occur.
> >   
>  
>  Okay. So that means, in the kernel side, we don’t have to maintain
>  the list which currently contains NVIDIA device ID. Also, in our
>  updated approach, this opt-out of that wake-up means that user
>  has not provided eventfd in the feature SET ioctl. Correct ?

Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
eventfd, that's the opt-out for being notified of device wakes.  For
example, pm-resume would have something like:

	
	if (vdev->pm_wake_eventfd) {
		eventfd_signal(vdev->pm_wake_eventfd, 1);
		vdev->pm_wake_eventfd = NULL;
		pm_runtime_get_noresume(dev);
	}

(eventfd pseudo handling substantially simplified)

So w/o a wake-up eventfd, the user would need to call the pm feature
exit ioctl to elevate the pm reference to prevent it going back to low
power.  The pm feature exit ioctl would be optional if a wake eventfd is
provided, so some piece of the eventfd context would need to remain to
determine whether a pm-get is necessary.

> >> + *
> >> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> >> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> >> + *
> >> + * - If the device is in a normal power state currently, then
> >> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> >> + *   power re-entry is disabled by default.  If the device is in the low power
> >> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> >> + *   according to the current transition.  
> > 
> > Very confusing semantics.
> > 
> > What if the feature SET ioctl took an eventfd and that eventfd was one
> > time use.  Calling the ioctl would setup the eventfd to notify the user
> > on wakeup and call pm-put.  Any access to the device via host, ioctl,
> > or region would be wrapped in pm-get/put and the pm-resume handler
> > would perform the matching pm-get to balance the feature SET and signal
> > the eventfd.   
> 
>  This seems a better option. It will help in making the ioctl simpler
>  and we don’t have to add a separate index for PME which I added in
>  patch 6. 
> 
> > If the user opts-out by not providing a wakeup eventfd,
> > then the pm-resume handler does not perform a pm-get. Possibly we
> > could even allow mmap access if a wake-up eventfd is provided.  
> 
>  Sorry. I am not clear on this mmap part. We currently invalidates
>  mapping before going into runtime-suspend. Now, if use tries do
>  mmap then do we need some extra handling in the fault handler ?
>  Need your help in understanding this part.

The option that I'm thinking about is if the mmap fault handler is
wrapped in a pm-get/put then we could actually populate the mmap.  In
the case where the pm-get triggers the wake-eventfd in pm-resume, the
device doesn't return to low power when the mmap fault handler calls
pm-put.  This possibly allows that we could actually invalidate mmaps on
pm-suspend rather than in the pm feature enter ioctl, essentially the
same as we're doing for intx.  I wonder though if this allows the
possibility that we just bounce between mmap fault and pm-suspend.  So
long as some work can be done, for instance the pm-suspend occurs
asynchronously to the pm-put, this might be ok.

> > The
> > feature GET ioctl would be used to exit low power behavior and would be
> > a no-op if the wakeup eventfd had already been signaled.  Thanks,
> >  
>  
>  I will use the GET ioctl for low power exit instead of returning the
>  current status.

Note that Yishai is proposing a device DMA dirty logging feature where
the stop and start are exposed via SET on separate features, rather
than SET/GET.  We should probably maintain some consistency between
these use cases.  Possibly we might even want two separate pm enter
ioctls, one with the wake eventfd and one without.  I think this is the
sort of thing Jason is describing for future expansion of the dirty
tracking uAPI.  Thanks,

Alex
Abhishek Sahu July 11, 2022, 9:43 a.m. UTC | #4
On 7/8/2022 10:06 PM, Alex Williamson wrote:
> On Fri, 8 Jul 2022 15:09:22 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/6/2022 9:09 PM, Alex Williamson wrote:
>>> On Fri, 1 Jul 2022 16:38:10 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>>>> for the power management in the header file. The implementation for the
>>>> same will be added in the subsequent patches.
>>>>
>>>> With the standard registers, all power states cannot be achieved. The
>>>> platform-based power management needs to be involved to go into the
>>>> lowest power state. For all the platform-based power management, this
>>>> device feature can be used.
>>>>
>>>> This device feature uses flags to specify the different operations. In
>>>> the future, if any more power management functionality is needed then
>>>> a new flag can be added to it. It supports both GET and SET operations.
>>>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 55 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 733a1cddde30..7e00de5c21ea 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>  };
>>>>  
>>>> +/*
>>>> + * Perform power management-related operations for the VFIO device.
>>>> + *
>>>> + * The low power feature uses platform-based power management to move the
>>>> + * device into the low power state.  This low power state is device-specific.
>>>> + *
>>>> + * This device feature uses flags to specify the different operations.
>>>> + * It supports both the GET and SET operations.
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>>>> + *   state with platform-based power management.  This low power state will be
>>>> + *   internal to the VFIO driver and the user will not come to know which power
>>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>>>> + *   power state, then the user should not do any device access without moving
>>>> + *   the device out of the low power state.  
>>>
>>> Except we're wrapping device accesses to make this possible.  This
>>> should probably describe how any discrete access will wake the device
>>> but ongoing access through mmaps will generate user faults.
>>>   
>>
>>  Sure. I will add that details also.
>>
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>>>> + *    state.  This flag should only be set if the user has previously put the
>>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.  
>>>
>>> Indenting.
>>>   
>>  
>>  I will fix this.
>>
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>>>> + *   the host side, then the device will be moved out of the low power state
>>>> + *   without the user's guest driver involvement.  Some devices require the
>>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>>>> + *   set, then the re-entry to the low power state will be disabled, and the
>>>> + *   host kernel will not move the device again into the low power state.
>>>> + *   The VFIO driver internally maintains a list of devices for which low
>>>> + *   power re-entry is disabled by default and for those devices, the
>>>> + *   re-entry will be disabled even if the user has not set this flag
>>>> + *   explicitly.  
>>>
>>> Wrong polarity.  The kernel should not maintain the policy.  By default
>>> every wakeup, whether from host kernel accesses or via user accesses
>>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
>>> to opt-out of that wakeup to let the kernel automatically re-enter low
>>> power and userspace needs to maintain the policy for which devices it
>>> wants that to occur.
>>>   
>>  
>>  Okay. So that means, in the kernel side, we don’t have to maintain
>>  the list which currently contains NVIDIA device ID. Also, in our
>>  updated approach, this opt-out of that wake-up means that user
>>  has not provided eventfd in the feature SET ioctl. Correct ?
> 
> Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
> eventfd, that's the opt-out for being notified of device wakes.  For
> example, pm-resume would have something like:
> 
> 	
> 	if (vdev->pm_wake_eventfd) {
> 		eventfd_signal(vdev->pm_wake_eventfd, 1);
> 		vdev->pm_wake_eventfd = NULL;
> 		pm_runtime_get_noresume(dev);
> 	}
> 
> (eventfd pseudo handling substantially simplified)
> 
> So w/o a wake-up eventfd, the user would need to call the pm feature
> exit ioctl to elevate the pm reference to prevent it going back to low
> power.  The pm feature exit ioctl would be optional if a wake eventfd is
> provided, so some piece of the eventfd context would need to remain to
> determine whether a pm-get is necessary.
> 
>>>> + *
>>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>>>> + *
>>>> + * - If the device is in a normal power state currently, then
>>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>>>> + *   power re-entry is disabled by default.  If the device is in the low power
>>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>>>> + *   according to the current transition.  
>>>
>>> Very confusing semantics.
>>>
>>> What if the feature SET ioctl took an eventfd and that eventfd was one
>>> time use.  Calling the ioctl would setup the eventfd to notify the user
>>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
>>> or region would be wrapped in pm-get/put and the pm-resume handler
>>> would perform the matching pm-get to balance the feature SET and signal
>>> the eventfd.   
>>
>>  This seems a better option. It will help in making the ioctl simpler
>>  and we don’t have to add a separate index for PME which I added in
>>  patch 6. 
>>
>>> If the user opts-out by not providing a wakeup eventfd,
>>> then the pm-resume handler does not perform a pm-get. Possibly we
>>> could even allow mmap access if a wake-up eventfd is provided.  
>>
>>  Sorry. I am not clear on this mmap part. We currently invalidates
>>  mapping before going into runtime-suspend. Now, if use tries do
>>  mmap then do we need some extra handling in the fault handler ?
>>  Need your help in understanding this part.
> 
> The option that I'm thinking about is if the mmap fault handler is
> wrapped in a pm-get/put then we could actually populate the mmap.  In
> the case where the pm-get triggers the wake-eventfd in pm-resume, the
> device doesn't return to low power when the mmap fault handler calls
> pm-put.  This possibly allows that we could actually invalidate mmaps on
> pm-suspend rather than in the pm feature enter ioctl, essentially the
> same as we're doing for intx.  I wonder though if this allows the
> possibility that we just bounce between mmap fault and pm-suspend.  So
> long as some work can be done, for instance the pm-suspend occurs
> asynchronously to the pm-put, this might be ok.
> 

 We can do this. But in the normal use case, the situation should
 never arise where user should access any mmaped region when user has
 already put the device into D3 (D3hot or D3cold). This can only happen
 if there is some bug in the guest driver or user is doing wrong
 sequence. Do we need to add handling to officially support this part ?

 pm-get can take more than a second for resume for some devices and
 will doing this in fault handler be safe ?

 Also, we will add this support only when wake-eventfd is provided so
 still w/o wake-eventfd case, the mmap access will still generate fault.
 So, we will have different behavior. Will that be acceptable ?

>>> The
>>> feature GET ioctl would be used to exit low power behavior and would be
>>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>>>  
>>  
>>  I will use the GET ioctl for low power exit instead of returning the
>>  current status.
> 
> Note that Yishai is proposing a device DMA dirty logging feature where
> the stop and start are exposed via SET on separate features, rather
> than SET/GET.  We should probably maintain some consistency between
> these use cases.  Possibly we might even want two separate pm enter
> ioctls, one with the wake eventfd and one without.  I think this is the
> sort of thing Jason is describing for future expansion of the dirty
> tracking uAPI.  Thanks,
> 
> Alex
> 

 Okay. So, we need to add 3 device features in total.

 VFIO_DEVICE_FEATURE_PM_ENTRY
 VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
 VFIO_DEVICE_FEATURE_PM_EXIT

 And only the second one need structure which will have only one field
 for eventfd and we need to return error if wakeup-eventfd is not
 provided in the second feature ?

 Do we need to support GET operation also for these ?
 We can skip GET operation since that won’t be very useful.

 Regards,
 Abhishek
Alex Williamson July 11, 2022, 1:04 p.m. UTC | #5
On Mon, 11 Jul 2022 15:13:13 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/8/2022 10:06 PM, Alex Williamson wrote:
> > On Fri, 8 Jul 2022 15:09:22 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
> >>> On Fri, 1 Jul 2022 16:38:10 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >>>     
> >>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> >>>> for the power management in the header file. The implementation for the
> >>>> same will be added in the subsequent patches.
> >>>>
> >>>> With the standard registers, all power states cannot be achieved. The
> >>>> platform-based power management needs to be involved to go into the
> >>>> lowest power state. For all the platform-based power management, this
> >>>> device feature can be used.
> >>>>
> >>>> This device feature uses flags to specify the different operations. In
> >>>> the future, if any more power management functionality is needed then
> >>>> a new flag can be added to it. It supports both GET and SET operations.
> >>>>
> >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >>>> ---
> >>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 55 insertions(+)
> >>>>
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 733a1cddde30..7e00de5c21ea 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>>>  };
> >>>>  
> >>>> +/*
> >>>> + * Perform power management-related operations for the VFIO device.
> >>>> + *
> >>>> + * The low power feature uses platform-based power management to move the
> >>>> + * device into the low power state.  This low power state is device-specific.
> >>>> + *
> >>>> + * This device feature uses flags to specify the different operations.
> >>>> + * It supports both the GET and SET operations.
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> >>>> + *   state with platform-based power management.  This low power state will be
> >>>> + *   internal to the VFIO driver and the user will not come to know which power
> >>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
> >>>> + *   power state, then the user should not do any device access without moving
> >>>> + *   the device out of the low power state.    
> >>>
> >>> Except we're wrapping device accesses to make this possible.  This
> >>> should probably describe how any discrete access will wake the device
> >>> but ongoing access through mmaps will generate user faults.
> >>>     
> >>
> >>  Sure. I will add that details also.
> >>  
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> >>>> + *    state.  This flag should only be set if the user has previously put the
> >>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.    
> >>>
> >>> Indenting.
> >>>     
> >>  
> >>  I will fix this.
> >>  
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> >>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> >>>> + *   the host side, then the device will be moved out of the low power state
> >>>> + *   without the user's guest driver involvement.  Some devices require the
> >>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
> >>>> + *   set, then the re-entry to the low power state will be disabled, and the
> >>>> + *   host kernel will not move the device again into the low power state.
> >>>> + *   The VFIO driver internally maintains a list of devices for which low
> >>>> + *   power re-entry is disabled by default and for those devices, the
> >>>> + *   re-entry will be disabled even if the user has not set this flag
> >>>> + *   explicitly.    
> >>>
> >>> Wrong polarity.  The kernel should not maintain the policy.  By default
> >>> every wakeup, whether from host kernel accesses or via user accesses
> >>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
> >>> to opt-out of that wakeup to let the kernel automatically re-enter low
> >>> power and userspace needs to maintain the policy for which devices it
> >>> wants that to occur.
> >>>     
> >>  
> >>  Okay. So that means, in the kernel side, we don’t have to maintain
> >>  the list which currently contains NVIDIA device ID. Also, in our
> >>  updated approach, this opt-out of that wake-up means that user
> >>  has not provided eventfd in the feature SET ioctl. Correct ?  
> > 
> > Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
> > eventfd, that's the opt-out for being notified of device wakes.  For
> > example, pm-resume would have something like:
> > 
> > 	
> > 	if (vdev->pm_wake_eventfd) {
> > 		eventfd_signal(vdev->pm_wake_eventfd, 1);
> > 		vdev->pm_wake_eventfd = NULL;
> > 		pm_runtime_get_noresume(dev);
> > 	}
> > 
> > (eventfd pseudo handling substantially simplified)
> > 
> > So w/o a wake-up eventfd, the user would need to call the pm feature
> > exit ioctl to elevate the pm reference to prevent it going back to low
> > power.  The pm feature exit ioctl would be optional if a wake eventfd is
> > provided, so some piece of the eventfd context would need to remain to
> > determine whether a pm-get is necessary.
> >   
> >>>> + *
> >>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> >>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> >>>> + *
> >>>> + * - If the device is in a normal power state currently, then
> >>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> >>>> + *   power re-entry is disabled by default.  If the device is in the low power
> >>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> >>>> + *   according to the current transition.    
> >>>
> >>> Very confusing semantics.
> >>>
> >>> What if the feature SET ioctl took an eventfd and that eventfd was one
> >>> time use.  Calling the ioctl would setup the eventfd to notify the user
> >>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
> >>> or region would be wrapped in pm-get/put and the pm-resume handler
> >>> would perform the matching pm-get to balance the feature SET and signal
> >>> the eventfd.     
> >>
> >>  This seems a better option. It will help in making the ioctl simpler
> >>  and we don’t have to add a separate index for PME which I added in
> >>  patch 6. 
> >>  
> >>> If the user opts-out by not providing a wakeup eventfd,
> >>> then the pm-resume handler does not perform a pm-get. Possibly we
> >>> could even allow mmap access if a wake-up eventfd is provided.    
> >>
> >>  Sorry. I am not clear on this mmap part. We currently invalidates
> >>  mapping before going into runtime-suspend. Now, if use tries do
> >>  mmap then do we need some extra handling in the fault handler ?
> >>  Need your help in understanding this part.  
> > 
> > The option that I'm thinking about is if the mmap fault handler is
> > wrapped in a pm-get/put then we could actually populate the mmap.  In
> > the case where the pm-get triggers the wake-eventfd in pm-resume, the
> > device doesn't return to low power when the mmap fault handler calls
> > pm-put.  This possibly allows that we could actually invalidate mmaps on
> > pm-suspend rather than in the pm feature enter ioctl, essentially the
> > same as we're doing for intx.  I wonder though if this allows the
> > possibility that we just bounce between mmap fault and pm-suspend.  So
> > long as some work can be done, for instance the pm-suspend occurs
> > asynchronously to the pm-put, this might be ok.
> >   
> 
>  We can do this. But in the normal use case, the situation should
>  never arise where user should access any mmaped region when user has
>  already put the device into D3 (D3hot or D3cold). This can only happen
>  if there is some bug in the guest driver or user is doing wrong
>  sequence. Do we need to add handling to officially support this part ?

We cannot rely on userspace drivers to be bug free or non-malicious,
but if we want to impose that an mmap access while low power is
enabled always triggers a fault, that's ok.

>  pm-get can take more than a second for resume for some devices and
>  will doing this in fault handler be safe ?
> 
>  Also, we will add this support only when wake-eventfd is provided so
>  still w/o wake-eventfd case, the mmap access will still generate fault.
>  So, we will have different behavior. Will that be acceptable ?

Let's keep it simple, generate a fault for all cases.

> >>> The
> >>> feature GET ioctl would be used to exit low power behavior and would be
> >>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
> >>>    
> >>  
> >>  I will use the GET ioctl for low power exit instead of returning the
> >>  current status.  
> > 
> > Note that Yishai is proposing a device DMA dirty logging feature where
> > the stop and start are exposed via SET on separate features, rather
> > than SET/GET.  We should probably maintain some consistency between
> > these use cases.  Possibly we might even want two separate pm enter
> > ioctls, one with the wake eventfd and one without.  I think this is the
> > sort of thing Jason is describing for future expansion of the dirty
> > tracking uAPI.  Thanks,
> > 
> > Alex
> >   
> 
>  Okay. So, we need to add 3 device features in total.
> 
>  VFIO_DEVICE_FEATURE_PM_ENTRY
>  VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
>  VFIO_DEVICE_FEATURE_PM_EXIT
> 
>  And only the second one need structure which will have only one field
>  for eventfd and we need to return error if wakeup-eventfd is not
>  provided in the second feature ?

Yes, we'd use eventfd_ctx and fail on a bad fileget.

>  Do we need to support GET operation also for these ?
>  We can skip GET operation since that won’t be very useful.

What would they do?  Thanks,

Alex
Abhishek Sahu July 11, 2022, 5:30 p.m. UTC | #6
On 7/11/2022 6:34 PM, Alex Williamson wrote:
> On Mon, 11 Jul 2022 15:13:13 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/8/2022 10:06 PM, Alex Williamson wrote:
>>> On Fri, 8 Jul 2022 15:09:22 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
>>>>> On Fri, 1 Jul 2022 16:38:10 +0530
>>>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>>>     
>>>>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>>>>>> for the power management in the header file. The implementation for the
>>>>>> same will be added in the subsequent patches.
>>>>>>
>>>>>> With the standard registers, all power states cannot be achieved. The
>>>>>> platform-based power management needs to be involved to go into the
>>>>>> lowest power state. For all the platform-based power management, this
>>>>>> device feature can be used.
>>>>>>
>>>>>> This device feature uses flags to specify the different operations. In
>>>>>> the future, if any more power management functionality is needed then
>>>>>> a new flag can be added to it. It supports both GET and SET operations.
>>>>>>
>>>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>>>> ---
>>>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 55 insertions(+)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>>> index 733a1cddde30..7e00de5c21ea 100644
>>>>>> --- a/include/uapi/linux/vfio.h
>>>>>> +++ b/include/uapi/linux/vfio.h
>>>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>>>  };
>>>>>>  
>>>>>> +/*
>>>>>> + * Perform power management-related operations for the VFIO device.
>>>>>> + *
>>>>>> + * The low power feature uses platform-based power management to move the
>>>>>> + * device into the low power state.  This low power state is device-specific.
>>>>>> + *
>>>>>> + * This device feature uses flags to specify the different operations.
>>>>>> + * It supports both the GET and SET operations.
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>>>>>> + *   state with platform-based power management.  This low power state will be
>>>>>> + *   internal to the VFIO driver and the user will not come to know which power
>>>>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>>>>>> + *   power state, then the user should not do any device access without moving
>>>>>> + *   the device out of the low power state.    
>>>>>
>>>>> Except we're wrapping device accesses to make this possible.  This
>>>>> should probably describe how any discrete access will wake the device
>>>>> but ongoing access through mmaps will generate user faults.
>>>>>     
>>>>
>>>>  Sure. I will add that details also.
>>>>  
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>>>>>> + *    state.  This flag should only be set if the user has previously put the
>>>>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.    
>>>>>
>>>>> Indenting.
>>>>>     
>>>>  
>>>>  I will fix this.
>>>>  
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>>>>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>>>>>> + *   the host side, then the device will be moved out of the low power state
>>>>>> + *   without the user's guest driver involvement.  Some devices require the
>>>>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>>>>>> + *   set, then the re-entry to the low power state will be disabled, and the
>>>>>> + *   host kernel will not move the device again into the low power state.
>>>>>> + *   The VFIO driver internally maintains a list of devices for which low
>>>>>> + *   power re-entry is disabled by default and for those devices, the
>>>>>> + *   re-entry will be disabled even if the user has not set this flag
>>>>>> + *   explicitly.    
>>>>>
>>>>> Wrong polarity.  The kernel should not maintain the policy.  By default
>>>>> every wakeup, whether from host kernel accesses or via user accesses
>>>>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
>>>>> to opt-out of that wakeup to let the kernel automatically re-enter low
>>>>> power and userspace needs to maintain the policy for which devices it
>>>>> wants that to occur.
>>>>>     
>>>>  
>>>>  Okay. So that means, in the kernel side, we don’t have to maintain
>>>>  the list which currently contains NVIDIA device ID. Also, in our
>>>>  updated approach, this opt-out of that wake-up means that user
>>>>  has not provided eventfd in the feature SET ioctl. Correct ?  
>>>
>>> Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
>>> eventfd, that's the opt-out for being notified of device wakes.  For
>>> example, pm-resume would have something like:
>>>
>>> 	
>>> 	if (vdev->pm_wake_eventfd) {
>>> 		eventfd_signal(vdev->pm_wake_eventfd, 1);
>>> 		vdev->pm_wake_eventfd = NULL;
>>> 		pm_runtime_get_noresume(dev);
>>> 	}
>>>
>>> (eventfd pseudo handling substantially simplified)
>>>
>>> So w/o a wake-up eventfd, the user would need to call the pm feature
>>> exit ioctl to elevate the pm reference to prevent it going back to low
>>> power.  The pm feature exit ioctl would be optional if a wake eventfd is
>>> provided, so some piece of the eventfd context would need to remain to
>>> determine whether a pm-get is necessary.
>>>   
>>>>>> + *
>>>>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>>>>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>>>>>> + *
>>>>>> + * - If the device is in a normal power state currently, then
>>>>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>>>>>> + *   power re-entry is disabled by default.  If the device is in the low power
>>>>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>>>>>> + *   according to the current transition.    
>>>>>
>>>>> Very confusing semantics.
>>>>>
>>>>> What if the feature SET ioctl took an eventfd and that eventfd was one
>>>>> time use.  Calling the ioctl would setup the eventfd to notify the user
>>>>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
>>>>> or region would be wrapped in pm-get/put and the pm-resume handler
>>>>> would perform the matching pm-get to balance the feature SET and signal
>>>>> the eventfd.     
>>>>
>>>>  This seems a better option. It will help in making the ioctl simpler
>>>>  and we don’t have to add a separate index for PME which I added in
>>>>  patch 6. 
>>>>  
>>>>> If the user opts-out by not providing a wakeup eventfd,
>>>>> then the pm-resume handler does not perform a pm-get. Possibly we
>>>>> could even allow mmap access if a wake-up eventfd is provided.    
>>>>
>>>>  Sorry. I am not clear on this mmap part. We currently invalidates
>>>>  mapping before going into runtime-suspend. Now, if use tries do
>>>>  mmap then do we need some extra handling in the fault handler ?
>>>>  Need your help in understanding this part.  
>>>
>>> The option that I'm thinking about is if the mmap fault handler is
>>> wrapped in a pm-get/put then we could actually populate the mmap.  In
>>> the case where the pm-get triggers the wake-eventfd in pm-resume, the
>>> device doesn't return to low power when the mmap fault handler calls
>>> pm-put.  This possibly allows that we could actually invalidate mmaps on
>>> pm-suspend rather than in the pm feature enter ioctl, essentially the
>>> same as we're doing for intx.  I wonder though if this allows the
>>> possibility that we just bounce between mmap fault and pm-suspend.  So
>>> long as some work can be done, for instance the pm-suspend occurs
>>> asynchronously to the pm-put, this might be ok.
>>>   
>>
>>  We can do this. But in the normal use case, the situation should
>>  never arise where user should access any mmaped region when user has
>>  already put the device into D3 (D3hot or D3cold). This can only happen
>>  if there is some bug in the guest driver or user is doing wrong
>>  sequence. Do we need to add handling to officially support this part ?
> 
> We cannot rely on userspace drivers to be bug free or non-malicious,
> but if we want to impose that an mmap access while low power is
> enabled always triggers a fault, that's ok.
> 
>>  pm-get can take more than a second for resume for some devices and
>>  will doing this in fault handler be safe ?
>>
>>  Also, we will add this support only when wake-eventfd is provided so
>>  still w/o wake-eventfd case, the mmap access will still generate fault.
>>  So, we will have different behavior. Will that be acceptable ?
> 
> Let's keep it simple, generate a fault for all cases.
> 

 Thanks Alex for confirmation.

>>>>> The
>>>>> feature GET ioctl would be used to exit low power behavior and would be
>>>>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>>>>>    
>>>>  
>>>>  I will use the GET ioctl for low power exit instead of returning the
>>>>  current status.  
>>>
>>> Note that Yishai is proposing a device DMA dirty logging feature where
>>> the stop and start are exposed via SET on separate features, rather
>>> than SET/GET.  We should probably maintain some consistency between
>>> these use cases.  Possibly we might even want two separate pm enter
>>> ioctls, one with the wake eventfd and one without.  I think this is the
>>> sort of thing Jason is describing for future expansion of the dirty
>>> tracking uAPI.  Thanks,
>>>
>>> Alex
>>>   
>>
>>  Okay. So, we need to add 3 device features in total.
>>
>>  VFIO_DEVICE_FEATURE_PM_ENTRY
>>  VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
>>  VFIO_DEVICE_FEATURE_PM_EXIT
>>
>>  And only the second one need structure which will have only one field
>>  for eventfd and we need to return error if wakeup-eventfd is not
>>  provided in the second feature ?
> 
> Yes, we'd use eventfd_ctx and fail on a bad fileget.
> 
>>  Do we need to support GET operation also for these ?
>>  We can skip GET operation since that won’t be very useful.
> 
> What would they do?  Thanks,
> 
> Alex
> 

 If we implement GET operation then it can return the
 current status. For example, for VFIO_DEVICE_FEATURE_PM_ENTRY
 can return the information whether user has put the device into
 low power previously. But this information is not much useful as such
 and it requires to add a structure where this information needs to
 be filled. Also, the GET will again cause the device wake-up.
 So, for these device features, we can support only SET operation.

 I checked the Yishai DMA logging patches and there start
 and stop seems to be supporting only SET operation and there is
 separate feature which supports only GET operation.

 Regards,
 Abhishek
diff mbox series

Patch

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..7e00de5c21ea 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,61 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Perform power management-related operations for the VFIO device.
+ *
+ * The low power feature uses platform-based power management to move the
+ * device into the low power state.  This low power state is device-specific.
+ *
+ * This device feature uses flags to specify the different operations.
+ * It supports both the GET and SET operations.
+ *
+ * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
+ *   state with platform-based power management.  This low power state will be
+ *   internal to the VFIO driver and the user will not come to know which power
+ *   state is chosen.  Once the user has moved the VFIO device into the low
+ *   power state, then the user should not do any device access without moving
+ *   the device out of the low power state.
+ *
+ * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
+ *    state.  This flag should only be set if the user has previously put the
+ *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.
+ *
+ * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
+ *
+ * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
+ *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
+ *   the host side, then the device will be moved out of the low power state
+ *   without the user's guest driver involvement.  Some devices require the
+ *   user's guest driver involvement for each low-power entry.  If this flag is
+ *   set, then the re-entry to the low power state will be disabled, and the
+ *   host kernel will not move the device again into the low power state.
+ *   The VFIO driver internally maintains a list of devices for which low
+ *   power re-entry is disabled by default and for those devices, the
+ *   re-entry will be disabled even if the user has not set this flag
+ *   explicitly.
+ *
+ * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
+ *
+ * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
+ *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
+ *
+ * - If the device is in a normal power state currently, then
+ *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
+ *   power re-entry is disabled by default.  If the device is in the low power
+ *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
+ *   according to the current transition.
+ */
+struct vfio_device_feature_power_management {
+	__u32	flags;
+#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
+#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
+#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
+	__u32	reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**