diff mbox series

[v5,1/5] vfio: Add the device features for the low power entry and exit

Message ID 20220719121523.21396-2-abhsahu@nvidia.com (mailing list archive)
State Superseded
Headers show
Series vfio/pci: power management changes | expand

Commit Message

Abhishek Sahu July 19, 2022, 12:15 p.m. UTC
This patch adds the following new device features for the low
power entry and exit in the header file. The implementation for the
same will be added in the subsequent patches.

- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
- VFIO_DEVICE_FEATURE_LOW_POWER_EXIT

With the standard registers, all power states cannot be achieved. The
platform-based power management needs to be involved to go into the
lowest power state. For doing low power entry and exit with
platform-based power management, these device features can be used.

The entry device feature has two variants. These two variants are mainly
to support the different behaviour for the low power entry.
If there is any access for the VFIO device on the host side, then the
device will be moved out of the low power state without the user's
guest driver involvement. Some devices (for example NVIDIA VGA or
3D controller) require the user's guest driver involvement for
each low-power entry. In the first variant, the host can move the
device into low power without any guest driver involvement while
in the second variant, the host will send a notification to the user
through eventfd and then the users guest driver needs to move
the device into low power.

These device features only support VFIO_DEVICE_FEATURE_SET operation.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

Comments

Alex Williamson July 21, 2022, 10:34 p.m. UTC | #1
On Tue, 19 Jul 2022 17:45:19 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch adds the following new device features for the low
> power entry and exit in the header file. The implementation for the
> same will be added in the subsequent patches.
> 
> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
> 
> With the standard registers, all power states cannot be achieved. The

We're talking about standard PCI PM registers here, let's make that
clear since we're adding a device agnostic interface here.

> platform-based power management needs to be involved to go into the
> lowest power state. For doing low power entry and exit with
> platform-based power management, these device features can be used.
> 
> The entry device feature has two variants. These two variants are mainly
> to support the different behaviour for the low power entry.
> If there is any access for the VFIO device on the host side, then the
> device will be moved out of the low power state without the user's
> guest driver involvement. Some devices (for example NVIDIA VGA or
> 3D controller) require the user's guest driver involvement for
> each low-power entry. In the first variant, the host can move the
> device into low power without any guest driver involvement while

Perhaps, "In the first variant, the host can return the device to low
power automatically.  The device will continue to attempt to reach low
power until the low power exit feature is called."

> in the second variant, the host will send a notification to the user
> through eventfd and then the users guest driver needs to move
> the device into low power.

"In the second variant, if the device exits low power due to an access,
the host kernel will signal the user via the provided eventfd and will
not return the device to low power without a subsequent call to one of
the low power entry features.  A call to the low power exit feature is
optional if the user provided eventfd is signaled."
 
> These device features only support VFIO_DEVICE_FEATURE_SET operation.

And PROBE.

> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..08fd3482d22b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
>  
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> + * with the platform-based power management.  This low power state will be

This is really "allow the device to be moved into a low power state"
rather than actually "move the device into" such a state though, right?

> + * internal to the VFIO driver and the user will not come to know which power
> + * state is chosen.  If any device access happens (either from the host or
> + * the guest) when the device is in the low power state, then the host will
> + * move the device out of the low power state first.  Once the access has been
> + * finished, then the host will move the device into the low power state again.
> + * If the user wants that the device should not go into the low power state
> + * again in this case, then the user should use the
> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the

This should probably just read "For single shot low power support with
wake-up notification, see
VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."

> + * low power entry.  The mmap'ed region access is not allowed till the low power
> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
> + * generate the access fault.
> + */
> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3

Note that Yishai's DMA logging set is competing for the same feature
entries.  We'll need to coordinate.

> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> + * with the platform-based power management and provide support for the wake-up
> + * notifications through eventfd.  This low power state will be internal to the
> + * VFIO driver and the user will not come to know which power state is chosen.
> + * If any device access happens (either from the host or the guest) when the
> + * device is in the low power state, then the host will move the device out of
> + * the low power state first and a notification will be sent to the guest
> + * through eventfd.  Once the access is finished, the host will not move back
> + * the device into the low power state.  The guest should move the device into
> + * the low power state again upon receiving the wakeup notification.  The
> + * notification will be generated only if the device physically went into the
> + * low power state.  If the low power entry has been disabled from the host
> + * side, then the device will not go into the low power state even after
> + * calling this device feature and then the device access does not require
> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
> + * happens.  The low power exit can happen either through
> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
> + * wake-up notification has been generated).

Seems this could leverage a lot more from the previous, simply stating
that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
with the exception that the user provides an eventfd for notification
when the device resumes from low power and will not try to re-enter a
low power state without a subsequent user call to one of the low power
entry feature ioctls.  It might also be worth covering the fact that
device accesses by the user, including region and ioctl access, will
also trigger the eventfd if that access triggers a resume.

As I'm thinking about this, that latter clause is somewhat subtle.
AIUI a user can call the low power entry with wakeup feature and
proceed to do various ioctl and region (not mmap) accesses that could
perpetually keep the device awake, or there may be dependent devices
such that the device may never go to low power.  It needs to be very
clear that only if the wakeup eventfd has the device entered into and
exited a low power state making the low power exit ioctl optional.

> + */
> +struct vfio_device_low_power_entry_with_wakeup {
> +	__s32 wakeup_eventfd;
> +	__u32 reserved;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
> + * state.

Any ioctl effectively does that, the key here is that the low power
state may not be re-entered after this ioctl.

>  This device feature should be called only if the user has previously
> + * put the device into the low power state either with
> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the

Doesn't really seem worth mentioning, we need to protect against misuse
regardless.

> + * device is not in the low power state currently, this device feature will
> + * return early with the success status.

This is an implementation detail, it doesn't need to be part of the
uAPI.  Thanks,

Alex

> + */
> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**
Abhishek Sahu July 25, 2022, 2:40 p.m. UTC | #2
On 7/22/2022 4:04 AM, Alex Williamson wrote:
> On Tue, 19 Jul 2022 17:45:19 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch adds the following new device features for the low
>> power entry and exit in the header file. The implementation for the
>> same will be added in the subsequent patches.
>>
>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
>> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
>>
>> With the standard registers, all power states cannot be achieved. The
> 
> We're talking about standard PCI PM registers here, let's make that
> clear since we're adding a device agnostic interface here.
> 
>> platform-based power management needs to be involved to go into the
>> lowest power state. For doing low power entry and exit with
>> platform-based power management, these device features can be used.
>>
>> The entry device feature has two variants. These two variants are mainly
>> to support the different behaviour for the low power entry.
>> If there is any access for the VFIO device on the host side, then the
>> device will be moved out of the low power state without the user's
>> guest driver involvement. Some devices (for example NVIDIA VGA or
>> 3D controller) require the user's guest driver involvement for
>> each low-power entry. In the first variant, the host can move the
>> device into low power without any guest driver involvement while
> 
> Perhaps, "In the first variant, the host can return the device to low
> power automatically.  The device will continue to attempt to reach low
> power until the low power exit feature is called."
> 
>> in the second variant, the host will send a notification to the user
>> through eventfd and then the users guest driver needs to move
>> the device into low power.
> 
> "In the second variant, if the device exits low power due to an access,
> the host kernel will signal the user via the provided eventfd and will
> not return the device to low power without a subsequent call to one of
> the low power entry features.  A call to the low power exit feature is
> optional if the user provided eventfd is signaled."
>  
>> These device features only support VFIO_DEVICE_FEATURE_SET operation.
> 
> And PROBE.
> 

 Thanks Alex.
 I will make the above changes in the commit message.

>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 55 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 733a1cddde30..08fd3482d22b 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>  };
>>  
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
>> + * with the platform-based power management.  This low power state will be
> 
> This is really "allow the device to be moved into a low power state"
> rather than actually "move the device into" such a state though, right?
> 
 
 Yes. It will just allow the device to be moved into a low power state.
 I have addressed all your suggestions in the uAPI description and
 added the updated description in the last.

 Can you please check that once and check if it looks okay.
 
>> + * internal to the VFIO driver and the user will not come to know which power
>> + * state is chosen.  If any device access happens (either from the host or
>> + * the guest) when the device is in the low power state, then the host will
>> + * move the device out of the low power state first.  Once the access has been
>> + * finished, then the host will move the device into the low power state again.
>> + * If the user wants that the device should not go into the low power state
>> + * again in this case, then the user should use the
>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the
> 
> This should probably just read "For single shot low power support with
> wake-up notification, see
> VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."
> 
>> + * low power entry.  The mmap'ed region access is not allowed till the low power
>> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
>> + * generate the access fault.
>> + */
>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> 
> Note that Yishai's DMA logging set is competing for the same feature
> entries.  We'll need to coordinate.
> 

 Since this is last week of rc, so will it possible to consider the updated
 patches for next kernel (I can try to send them after complete testing the
 by the end of this week). Otherwise, I can wait for next kernel and then
 I can rebase my patches.

>> +
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
>> + * with the platform-based power management and provide support for the wake-up
>> + * notifications through eventfd.  This low power state will be internal to the
>> + * VFIO driver and the user will not come to know which power state is chosen.
>> + * If any device access happens (either from the host or the guest) when the
>> + * device is in the low power state, then the host will move the device out of
>> + * the low power state first and a notification will be sent to the guest
>> + * through eventfd.  Once the access is finished, the host will not move back
>> + * the device into the low power state.  The guest should move the device into
>> + * the low power state again upon receiving the wakeup notification.  The
>> + * notification will be generated only if the device physically went into the
>> + * low power state.  If the low power entry has been disabled from the host
>> + * side, then the device will not go into the low power state even after
>> + * calling this device feature and then the device access does not require
>> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
>> + * happens.  The low power exit can happen either through
>> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
>> + * wake-up notification has been generated).
> 
> Seems this could leverage a lot more from the previous, simply stating
> that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> with the exception that the user provides an eventfd for notification
> when the device resumes from low power and will not try to re-enter a
> low power state without a subsequent user call to one of the low power
> entry feature ioctls.  It might also be worth covering the fact that
> device accesses by the user, including region and ioctl access, will
> also trigger the eventfd if that access triggers a resume.
> 
> As I'm thinking about this, that latter clause is somewhat subtle.
> AIUI a user can call the low power entry with wakeup feature and
> proceed to do various ioctl and region (not mmap) accesses that could
> perpetually keep the device awake, or there may be dependent devices
> such that the device may never go to low power.  It needs to be very
> clear that only if the wakeup eventfd has the device entered into and
> exited a low power state making the low power exit ioctl optional.
> 

 Yes. In my updated description, I have added more details.
 Can you please check if that helps.

>> + */
>> +struct vfio_device_low_power_entry_with_wakeup {
>> +	__s32 wakeup_eventfd;
>> +	__u32 reserved;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>> +
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
>> + * state.
> 
> Any ioctl effectively does that, the key here is that the low power
> state may not be re-entered after this ioctl.
> 
>>  This device feature should be called only if the user has previously
>> + * put the device into the low power state either with
>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the
> 
> Doesn't really seem worth mentioning, we need to protect against misuse
> regardless.
> 
>> + * device is not in the low power state currently, this device feature will
>> + * return early with the success status.
> 
> This is an implementation detail, it doesn't need to be part of the
> uAPI.  Thanks,
> 
> Alex
> 
>> + */
>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
> 

 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  This low power state will
  * be internal to the VFIO driver and the user will not come to know which
  * power state is chosen.  If any device access happens (either from the host
  * or the guest) when the device is in the low power state, then the host will
  * move the device out of the low power state first.  Once the access has been
  * finished, then the host will move the device into the low power state
  * again.  For single shot low power support with wake-up notification, see
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  The mmap'ed region
  * access is not allowed till the low power exit happens through
  * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will generate the access fault.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3

 /*
  * This device feature has the same behavior as
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
  * provides an eventfd for wake-up notification.  When the device moves out of
  * the low power state for the wake-up, the host will not try to re-enter a
  * low power state without a subsequent user call to one of the low power
  * entry device feature IOCTLs.  The low power exit can happen either through
  * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
  * wake-up notification has been generated).
  *
  * The notification through eventfd will be generated only if the device has
  * gone into the low power state after calling this device feature IOCTL.
  * There are various cases where the device will not go into the low power
  * state after calling this device feature IOCTL (for example, the low power
  * entry has been disabled from the host side, the user keeps the device busy
  * after calling this device feature IOCTL, there are dependent devices which
  * block the device low power entry, etc.) and in such cases, the device access
  * does not require wake-up.  Also, the low power exit through
  * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT is mandatory for the cases where the
  * wake-up notification has not been generated.
  */
 struct vfio_device_low_power_entry_with_wakeup {
	 __s32 wakeup_eventfd;
	 __u32 reserved;
 };

 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4

 /*
  * Upon VFIO_DEVICE_FEATURE_SET, prevent the VFIO device low power state entry
  * which has been previously allowed with VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
  * or VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5

 Thanks,
 Abhishek
Alex Williamson July 25, 2022, 10:09 p.m. UTC | #3
On Mon, 25 Jul 2022 20:10:44 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/22/2022 4:04 AM, Alex Williamson wrote:
> > On Tue, 19 Jul 2022 17:45:19 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> This patch adds the following new device features for the low
> >> power entry and exit in the header file. The implementation for the
> >> same will be added in the subsequent patches.
> >>
> >> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> >> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
> >> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
> >>
> >> With the standard registers, all power states cannot be achieved. The  
> > 
> > We're talking about standard PCI PM registers here, let's make that
> > clear since we're adding a device agnostic interface here.
> >   
> >> platform-based power management needs to be involved to go into the
> >> lowest power state. For doing low power entry and exit with
> >> platform-based power management, these device features can be used.
> >>
> >> The entry device feature has two variants. These two variants are mainly
> >> to support the different behaviour for the low power entry.
> >> If there is any access for the VFIO device on the host side, then the
> >> device will be moved out of the low power state without the user's
> >> guest driver involvement. Some devices (for example NVIDIA VGA or
> >> 3D controller) require the user's guest driver involvement for
> >> each low-power entry. In the first variant, the host can move the
> >> device into low power without any guest driver involvement while  
> > 
> > Perhaps, "In the first variant, the host can return the device to low
> > power automatically.  The device will continue to attempt to reach low
> > power until the low power exit feature is called."
> >   
> >> in the second variant, the host will send a notification to the user
> >> through eventfd and then the users guest driver needs to move
> >> the device into low power.  
> > 
> > "In the second variant, if the device exits low power due to an access,
> > the host kernel will signal the user via the provided eventfd and will
> > not return the device to low power without a subsequent call to one of
> > the low power entry features.  A call to the low power exit feature is
> > optional if the user provided eventfd is signaled."
> >    
> >> These device features only support VFIO_DEVICE_FEATURE_SET operation.  
> > 
> > And PROBE.
> >   
> 
>  Thanks Alex.
>  I will make the above changes in the commit message.
> 
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 55 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 733a1cddde30..08fd3482d22b 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>  };
> >>  
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> >> + * with the platform-based power management.  This low power state will be  
> > 
> > This is really "allow the device to be moved into a low power state"
> > rather than actually "move the device into" such a state though, right?
> >   
>  
>  Yes. It will just allow the device to be moved into a low power state.
>  I have addressed all your suggestions in the uAPI description and
>  added the updated description in the last.
> 
>  Can you please check that once and check if it looks okay.
>  
> >> + * internal to the VFIO driver and the user will not come to know which power
> >> + * state is chosen.  If any device access happens (either from the host or
> >> + * the guest) when the device is in the low power state, then the host will
> >> + * move the device out of the low power state first.  Once the access has been
> >> + * finished, then the host will move the device into the low power state again.
> >> + * If the user wants that the device should not go into the low power state
> >> + * again in this case, then the user should use the
> >> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the  
> > 
> > This should probably just read "For single shot low power support with
> > wake-up notification, see
> > VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."
> >   
> >> + * low power entry.  The mmap'ed region access is not allowed till the low power
> >> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
> >> + * generate the access fault.
> >> + */
> >> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3  
> > 
> > Note that Yishai's DMA logging set is competing for the same feature
> > entries.  We'll need to coordinate.
> >   
> 
>  Since this is last week of rc, so will it possible to consider the updated
>  patches for next kernel (I can try to send them after complete testing the
>  by the end of this week). Otherwise, I can wait for next kernel and then
>  I can rebase my patches.

I think we're close, though I would like to solicit uAPI feedback from
others on the list.  We can certainly sort out feature numbers
conflicts at commit time if Yishai's series becomes ready as well.  I'm
on PTO for the rest of the week starting tomorrow afternoon, but I'd
suggest trying to get this re-posted before the end of the week, asking
for more reviews for the uAPI, and we can evaluate early next week if
this is ready.

> >> +
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> >> + * with the platform-based power management and provide support for the wake-up
> >> + * notifications through eventfd.  This low power state will be internal to the
> >> + * VFIO driver and the user will not come to know which power state is chosen.
> >> + * If any device access happens (either from the host or the guest) when the
> >> + * device is in the low power state, then the host will move the device out of
> >> + * the low power state first and a notification will be sent to the guest
> >> + * through eventfd.  Once the access is finished, the host will not move back
> >> + * the device into the low power state.  The guest should move the device into
> >> + * the low power state again upon receiving the wakeup notification.  The
> >> + * notification will be generated only if the device physically went into the
> >> + * low power state.  If the low power entry has been disabled from the host
> >> + * side, then the device will not go into the low power state even after
> >> + * calling this device feature and then the device access does not require
> >> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
> >> + * happens.  The low power exit can happen either through
> >> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
> >> + * wake-up notification has been generated).  
> > 
> > Seems this could leverage a lot more from the previous, simply stating
> > that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> > with the exception that the user provides an eventfd for notification
> > when the device resumes from low power and will not try to re-enter a
> > low power state without a subsequent user call to one of the low power
> > entry feature ioctls.  It might also be worth covering the fact that
> > device accesses by the user, including region and ioctl access, will
> > also trigger the eventfd if that access triggers a resume.
> > 
> > As I'm thinking about this, that latter clause is somewhat subtle.
> > AIUI a user can call the low power entry with wakeup feature and
> > proceed to do various ioctl and region (not mmap) accesses that could
> > perpetually keep the device awake, or there may be dependent devices
> > such that the device may never go to low power.  It needs to be very
> > clear that only if the wakeup eventfd has the device entered into and
> > exited a low power state making the low power exit ioctl optional.
> >   
> 
>  Yes. In my updated description, I have added more details.
>  Can you please check if that helps.
> 
> >> + */
> >> +struct vfio_device_low_power_entry_with_wakeup {
> >> +	__s32 wakeup_eventfd;
> >> +	__u32 reserved;
> >> +};
> >> +
> >> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> >> +
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
> >> + * state.  
> > 
> > Any ioctl effectively does that, the key here is that the low power
> > state may not be re-entered after this ioctl.
> >   
> >>  This device feature should be called only if the user has previously
> >> + * put the device into the low power state either with
> >> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> >> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the  
> > 
> > Doesn't really seem worth mentioning, we need to protect against misuse
> > regardless.
> >   
> >> + * device is not in the low power state currently, this device feature will
> >> + * return early with the success status.  
> > 
> > This is an implementation detail, it doesn't need to be part of the
> > uAPI.  Thanks,
> > 
> > Alex
> >   
> >> + */
> >> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
> >> +
> >>  /* -------- API for Type1 VFIO IOMMU -------- */
> >>  
> >>  /**  
> >   
> 
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>   * state with the platform-based power management.  This low power state will
>   * be internal to the VFIO driver and the user will not come to know which
>   * power state is chosen.  If any device access happens (either from the host

Couldn't the user look in sysfs to determine the power state?  There
might also be external hardware monitors of the power state, so this
statement is a bit overreaching.  Maybe something along the lines of...

"Device use of lower power states depend on factors managed by the
runtime power management core, including system level support and
coordinating support among dependent devices.  Enabling device low
power entry does not guarantee lower power usage by the device, nor is
a mechanism provided through this feature to know the current power
state of the device."

>   * or the guest) when the device is in the low power state, then the host will

Let's not confine ourselves to a VM use case, "...from the host or
through the vfio uAPI".

>   * move the device out of the low power state first.  Once the access has been

"move the device out of the low power state as necessary prior to the
access."

>   * finished, then the host will move the device into the low power state

"Once the access is completed, the device may re-enter the low power
state."

>   * again.  For single shot low power support with wake-up notification, see
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  The mmap'ed region
>   * access is not allowed till the low power exit happens through
>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will generate the access fault.

"Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and
may only be resumed after calling LOW_POWER_EXIT."

>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> 
>  /*
>   * This device feature has the same behavior as
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>   * provides an eventfd for wake-up notification.  When the device moves out of
>   * the low power state for the wake-up, the host will not try to re-enter a

"...will not allow the device to re-enter a low power state..."

>   * low power state without a subsequent user call to one of the low power
>   * entry device feature IOCTLs.  The low power exit can happen either through
>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
>   * wake-up notification has been generated).
>   *
>   * The notification through eventfd will be generated only if the device has
>   * gone into the low power state after calling this device feature IOCTL.

"The notification through the provided eventfd will be generated only
when the device has entered and is resumed from a low power state after
calling this device feature IOCTL."


>   * There are various cases where the device will not go into the low power
>   * state after calling this device feature IOCTL (for example, the low power
>   * entry has been disabled from the host side, the user keeps the device busy
>   * after calling this device feature IOCTL, there are dependent devices which
>   * block the device low power entry, etc.) and in such cases, the device access
>   * does not require wake-up.  Also, the low power exit through

"A device that has not entered low power state, as managed through the
runtime power management core, will not generate a notification through
the provided eventfd on access."

>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT is mandatory for the cases where the
>   * wake-up notification has not been generated.

"Calling the LOW_POWER_EXIT feature is optional in the case where
notification has been signaled on the provided eventfd that a resume
from low power has occurred."

We should also reiterate the statement above about mmap access because
it would be reasonable for a user to assume that if any access wakes
the device, that should include mmap faults and therefore a resume
notification should occur for such an event.  We could implement that
for the one-shot mode, but we're choosing not to.

>   */
>  struct vfio_device_low_power_entry_with_wakeup {
> 	 __s32 wakeup_eventfd;
> 	 __u32 reserved;
>  };
> 
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> 
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, prevent the VFIO device low power state entry
>   * which has been previously allowed with VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>   * or VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
>   */

"Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states
as previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.  This
device feature ioctl may itself generate a wakeup eventfd notification
in the latter case if the device has previously entered a low power
state."

Thanks,
Alex
Abhishek Sahu July 26, 2022, 12:47 p.m. UTC | #4
On 7/26/2022 3:39 AM, Alex Williamson wrote:
> On Mon, 25 Jul 2022 20:10:44 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/22/2022 4:04 AM, Alex Williamson wrote:
>>> On Tue, 19 Jul 2022 17:45:19 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> This patch adds the following new device features for the low
>>>> power entry and exit in the header file. The implementation for the
>>>> same will be added in the subsequent patches.
>>>>
>>>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>>>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
>>>> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
>>>>
>>>> With the standard registers, all power states cannot be achieved. The  
>>>
>>> We're talking about standard PCI PM registers here, let's make that
>>> clear since we're adding a device agnostic interface here.
>>>   
>>>> platform-based power management needs to be involved to go into the
>>>> lowest power state. For doing low power entry and exit with
>>>> platform-based power management, these device features can be used.
>>>>
>>>> The entry device feature has two variants. These two variants are mainly
>>>> to support the different behaviour for the low power entry.
>>>> If there is any access for the VFIO device on the host side, then the
>>>> device will be moved out of the low power state without the user's
>>>> guest driver involvement. Some devices (for example NVIDIA VGA or
>>>> 3D controller) require the user's guest driver involvement for
>>>> each low-power entry. In the first variant, the host can move the
>>>> device into low power without any guest driver involvement while  
>>>
>>> Perhaps, "In the first variant, the host can return the device to low
>>> power automatically.  The device will continue to attempt to reach low
>>> power until the low power exit feature is called."
>>>   
>>>> in the second variant, the host will send a notification to the user
>>>> through eventfd and then the users guest driver needs to move
>>>> the device into low power.  
>>>
>>> "In the second variant, if the device exits low power due to an access,
>>> the host kernel will signal the user via the provided eventfd and will
>>> not return the device to low power without a subsequent call to one of
>>> the low power entry features.  A call to the low power exit feature is
>>> optional if the user provided eventfd is signaled."
>>>    
>>>> These device features only support VFIO_DEVICE_FEATURE_SET operation.  
>>>
>>> And PROBE.
>>>   
>>
>>  Thanks Alex.
>>  I will make the above changes in the commit message.
>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 55 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 733a1cddde30..08fd3482d22b 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>  };
>>>>  
>>>> +/*
>>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
>>>> + * with the platform-based power management.  This low power state will be  
>>>
>>> This is really "allow the device to be moved into a low power state"
>>> rather than actually "move the device into" such a state though, right?
>>>   
>>  
>>  Yes. It will just allow the device to be moved into a low power state.
>>  I have addressed all your suggestions in the uAPI description and
>>  added the updated description in the last.
>>
>>  Can you please check that once and check if it looks okay.
>>  
>>>> + * internal to the VFIO driver and the user will not come to know which power
>>>> + * state is chosen.  If any device access happens (either from the host or
>>>> + * the guest) when the device is in the low power state, then the host will
>>>> + * move the device out of the low power state first.  Once the access has been
>>>> + * finished, then the host will move the device into the low power state again.
>>>> + * If the user wants that the device should not go into the low power state
>>>> + * again in this case, then the user should use the
>>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the  
>>>
>>> This should probably just read "For single shot low power support with
>>> wake-up notification, see
>>> VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."
>>>   
>>>> + * low power entry.  The mmap'ed region access is not allowed till the low power
>>>> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
>>>> + * generate the access fault.
>>>> + */
>>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3  
>>>
>>> Note that Yishai's DMA logging set is competing for the same feature
>>> entries.  We'll need to coordinate.
>>>   
>>
>>  Since this is last week of rc, so will it possible to consider the updated
>>  patches for next kernel (I can try to send them after complete testing the
>>  by the end of this week). Otherwise, I can wait for next kernel and then
>>  I can rebase my patches.
> 
> I think we're close, though I would like to solicit uAPI feedback from
> others on the list.  We can certainly sort out feature numbers
> conflicts at commit time if Yishai's series becomes ready as well.  I'm
> on PTO for the rest of the week starting tomorrow afternoon, but I'd
> suggest trying to get this re-posted before the end of the week, asking
> for more reviews for the uAPI, and we can evaluate early next week if
> this is ready.
> 

 Sure. I will re-post the updated patch series after the complete testing.
 
>>>> +
>>>> +/*
>>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
>>>> + * with the platform-based power management and provide support for the wake-up
>>>> + * notifications through eventfd.  This low power state will be internal to the
>>>> + * VFIO driver and the user will not come to know which power state is chosen.
>>>> + * If any device access happens (either from the host or the guest) when the
>>>> + * device is in the low power state, then the host will move the device out of
>>>> + * the low power state first and a notification will be sent to the guest
>>>> + * through eventfd.  Once the access is finished, the host will not move back
>>>> + * the device into the low power state.  The guest should move the device into
>>>> + * the low power state again upon receiving the wakeup notification.  The
>>>> + * notification will be generated only if the device physically went into the
>>>> + * low power state.  If the low power entry has been disabled from the host
>>>> + * side, then the device will not go into the low power state even after
>>>> + * calling this device feature and then the device access does not require
>>>> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
>>>> + * happens.  The low power exit can happen either through
>>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
>>>> + * wake-up notification has been generated).  
>>>
>>> Seems this could leverage a lot more from the previous, simply stating
>>> that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>>> with the exception that the user provides an eventfd for notification
>>> when the device resumes from low power and will not try to re-enter a
>>> low power state without a subsequent user call to one of the low power
>>> entry feature ioctls.  It might also be worth covering the fact that
>>> device accesses by the user, including region and ioctl access, will
>>> also trigger the eventfd if that access triggers a resume.
>>>
>>> As I'm thinking about this, that latter clause is somewhat subtle.
>>> AIUI a user can call the low power entry with wakeup feature and
>>> proceed to do various ioctl and region (not mmap) accesses that could
>>> perpetually keep the device awake, or there may be dependent devices
>>> such that the device may never go to low power.  It needs to be very
>>> clear that only if the wakeup eventfd has the device entered into and
>>> exited a low power state making the low power exit ioctl optional.
>>>   
>>
>>  Yes. In my updated description, I have added more details.
>>  Can you please check if that helps.
>>
>>>> + */
>>>> +struct vfio_device_low_power_entry_with_wakeup {
>>>> +	__s32 wakeup_eventfd;
>>>> +	__u32 reserved;
>>>> +};
>>>> +
>>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>>>> +
>>>> +/*
>>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
>>>> + * state.  
>>>
>>> Any ioctl effectively does that, the key here is that the low power
>>> state may not be re-entered after this ioctl.
>>>   
>>>>  This device feature should be called only if the user has previously
>>>> + * put the device into the low power state either with
>>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
>>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the  
>>>
>>> Doesn't really seem worth mentioning, we need to protect against misuse
>>> regardless.
>>>   
>>>> + * device is not in the low power state currently, this device feature will
>>>> + * return early with the success status.  
>>>
>>> This is an implementation detail, it doesn't need to be part of the
>>> uAPI.  Thanks,
>>>
>>> Alex
>>>   
>>>> + */
>>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
>>>> +
>>>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>>>  
>>>>  /**  
>>>   
>>
>>  /*
>>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>>   * state with the platform-based power management.  This low power state will
>>   * be internal to the VFIO driver and the user will not come to know which
>>   * power state is chosen.  If any device access happens (either from the host
> 
> Couldn't the user look in sysfs to determine the power state?  There
> might also be external hardware monitors of the power state, so this
> statement is a bit overreaching.  Maybe something along the lines of...
> 
> "Device use of lower power states depend on factors managed by the
> runtime power management core, including system level support and
> coordinating support among dependent devices.  Enabling device low
> power entry does not guarantee lower power usage by the device, nor is
> a mechanism provided through this feature to know the current power
> state of the device."
> 
>>   * or the guest) when the device is in the low power state, then the host will
> 
> Let's not confine ourselves to a VM use case, "...from the host or
> through the vfio uAPI".
> 
>>   * move the device out of the low power state first.  Once the access has been
> 
> "move the device out of the low power state as necessary prior to the
> access."
> 
>>   * finished, then the host will move the device into the low power state
> 
> "Once the access is completed, the device may re-enter the low power
> state."
> 
>>   * again.  For single shot low power support with wake-up notification, see
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  The mmap'ed region
>>   * access is not allowed till the low power exit happens through
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will generate the access fault.
> 
> "Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and
> may only be resumed after calling LOW_POWER_EXIT."
> 
>>   */
>>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>>
>>  /*
>>   * This device feature has the same behavior as
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>>   * provides an eventfd for wake-up notification.  When the device moves out of
>>   * the low power state for the wake-up, the host will not try to re-enter a
> 
> "...will not allow the device to re-enter a low power state..."
> 
>>   * low power state without a subsequent user call to one of the low power
>>   * entry device feature IOCTLs.  The low power exit can happen either through
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
>>   * wake-up notification has been generated).
>>   *
>>   * The notification through eventfd will be generated only if the device has
>>   * gone into the low power state after calling this device feature IOCTL.
> 
> "The notification through the provided eventfd will be generated only
> when the device has entered and is resumed from a low power state after
> calling this device feature IOCTL."
> 
> 
>>   * There are various cases where the device will not go into the low power
>>   * state after calling this device feature IOCTL (for example, the low power
>>   * entry has been disabled from the host side, the user keeps the device busy
>>   * after calling this device feature IOCTL, there are dependent devices which
>>   * block the device low power entry, etc.) and in such cases, the device access
>>   * does not require wake-up.  Also, the low power exit through
> 
> "A device that has not entered low power state, as managed through the
> runtime power management core, will not generate a notification through
> the provided eventfd on access."
> 
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT is mandatory for the cases where the
>>   * wake-up notification has not been generated.
> 
> "Calling the LOW_POWER_EXIT feature is optional in the case where
> notification has been signaled on the provided eventfd that a resume
> from low power has occurred."
> 
> We should also reiterate the statement above about mmap access because
> it would be reasonable for a user to assume that if any access wakes
> the device, that should include mmap faults and therefore a resume
> notification should occur for such an event.  We could implement that
> for the one-shot mode, but we're choosing not to.
> 
>>   */
>>  struct vfio_device_low_power_entry_with_wakeup {
>> 	 __s32 wakeup_eventfd;
>> 	 __u32 reserved;
>>  };
>>
>>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>>
>>  /*
>>   * Upon VFIO_DEVICE_FEATURE_SET, prevent the VFIO device low power state entry
>>   * which has been previously allowed with VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>>   * or VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
>>   */
> 
> "Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states
> as previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.  This
> device feature ioctl may itself generate a wakeup eventfd notification
> in the latter case if the device has previously entered a low power
> state."
> 
> Thanks,
> Alex
> 

 Thanks Alex for your thorough review of uAPI.
 I have incorporated all the suggestions.
 Following is the updated uAPI.
 
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  Device use of lower power
  * states depends on factors managed by the runtime power management core,
  * including system level support and coordinating support among dependent
  * devices.  Enabling device low power entry does not guarantee lower power
  * usage by the device, nor is a mechanism provided through this feature to
  * know the current power state of the device.  If any device access happens
  * (either from the host or through the vfio uAPI) when the device is in the
  * low power state, then the host will move the device out of the low power
  * state as necessary prior to the access.  Once the access is completed, the
  * device may re-enter the low power state.  For single shot low power support
  * with wake-up notification, see
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
  * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
  * calling LOW_POWER_EXIT.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
 
 /*
  * This device feature has the same behavior as
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
  * provides an eventfd for wake-up notification.  When the device moves out of
  * the low power state for the wake-up, the host will not allow the device to
  * re-enter a low power state without a subsequent user call to one of the low
  * power entry device feature IOCTLs.  Access to mmap'd device regions is
  * disabled on LOW_POWER_ENTRY_WITH_WAKEUP and may only be resumed after the
  * low power exit.  The low power exit can happen either through LOW_POWER_EXIT
  * or through any other access (where the wake-up notification has been
  * generated).  The access to mmap'd device regions will not trigger low power
  * exit.
  *
  * The notification through the provided eventfd will be generated only when
  * the device has entered and is resumed from a low power state after
  * calling this device feature IOCTL.  A device that has not entered low power
  * state, as managed through the runtime power management core, will not
  * generate a notification through the provided eventfd on access.  Calling the
  * LOW_POWER_EXIT feature is optional in the case where notification has been
  * signaled on the provided eventfd that a resume from low power has occurred.
  */
 struct vfio_device_low_power_entry_with_wakeup {
 	__s32 wakeup_eventfd;
 	__u32 reserved;
 };
 
 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
 
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
  * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
  * This device feature IOCTL may itself generate a wakeup eventfd notification
  * in the latter case if the device has previously entered a low power state.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5

 Regards,
 Abhishek
Cornelia Huck July 26, 2022, 1:13 p.m. UTC | #5
On Tue, Jul 26 2022, Abhishek Sahu <abhsahu@nvidia.com> wrote:

>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
>   * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
>   * This device feature IOCTL may itself generate a wakeup eventfd notification
>   * in the latter case if the device has previously entered a low power state.

Nit: s/has/had/

>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5

I haven't followed this closely, and I'm not that familiar with power
management, but at least I can't spot anything obviously problematic.
Alex Williamson July 26, 2022, 2:17 p.m. UTC | #6
On Tue, 26 Jul 2022 18:17:18 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/26/2022 3:39 AM, Alex Williamson wrote:
> > On Mon, 25 Jul 2022 20:10:44 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 7/22/2022 4:04 AM, Alex Williamson wrote:  
> >>> On Tue, 19 Jul 2022 17:45:19 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >>>     
> >>>> This patch adds the following new device features for the low
> >>>> power entry and exit in the header file. The implementation for the
> >>>> same will be added in the subsequent patches.
> >>>>
> >>>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> >>>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
> >>>> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
> >>>>
> >>>> With the standard registers, all power states cannot be achieved. The    
> >>>
> >>> We're talking about standard PCI PM registers here, let's make that
> >>> clear since we're adding a device agnostic interface here.
> >>>     
> >>>> platform-based power management needs to be involved to go into the
> >>>> lowest power state. For doing low power entry and exit with
> >>>> platform-based power management, these device features can be used.
> >>>>
> >>>> The entry device feature has two variants. These two variants are mainly
> >>>> to support the different behaviour for the low power entry.
> >>>> If there is any access for the VFIO device on the host side, then the
> >>>> device will be moved out of the low power state without the user's
> >>>> guest driver involvement. Some devices (for example NVIDIA VGA or
> >>>> 3D controller) require the user's guest driver involvement for
> >>>> each low-power entry. In the first variant, the host can move the
> >>>> device into low power without any guest driver involvement while    
> >>>
> >>> Perhaps, "In the first variant, the host can return the device to low
> >>> power automatically.  The device will continue to attempt to reach low
> >>> power until the low power exit feature is called."
> >>>     
> >>>> in the second variant, the host will send a notification to the user
> >>>> through eventfd and then the users guest driver needs to move
> >>>> the device into low power.    
> >>>
> >>> "In the second variant, if the device exits low power due to an access,
> >>> the host kernel will signal the user via the provided eventfd and will
> >>> not return the device to low power without a subsequent call to one of
> >>> the low power entry features.  A call to the low power exit feature is
> >>> optional if the user provided eventfd is signaled."
> >>>      
> >>>> These device features only support VFIO_DEVICE_FEATURE_SET operation.    
> >>>
> >>> And PROBE.
> >>>     
> >>
> >>  Thanks Alex.
> >>  I will make the above changes in the commit message.
> >>  
> >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >>>> ---
> >>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 55 insertions(+)
> >>>>
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 733a1cddde30..08fd3482d22b 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>>>  };
> >>>>  
> >>>> +/*
> >>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> >>>> + * with the platform-based power management.  This low power state will be    
> >>>
> >>> This is really "allow the device to be moved into a low power state"
> >>> rather than actually "move the device into" such a state though, right?
> >>>     
> >>  
> >>  Yes. It will just allow the device to be moved into a low power state.
> >>  I have addressed all your suggestions in the uAPI description and
> >>  added the updated description in the last.
> >>
> >>  Can you please check that once and check if it looks okay.
> >>    
> >>>> + * internal to the VFIO driver and the user will not come to know which power
> >>>> + * state is chosen.  If any device access happens (either from the host or
> >>>> + * the guest) when the device is in the low power state, then the host will
> >>>> + * move the device out of the low power state first.  Once the access has been
> >>>> + * finished, then the host will move the device into the low power state again.
> >>>> + * If the user wants that the device should not go into the low power state
> >>>> + * again in this case, then the user should use the
> >>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the    
> >>>
> >>> This should probably just read "For single shot low power support with
> >>> wake-up notification, see
> >>> VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."
> >>>     
> >>>> + * low power entry.  The mmap'ed region access is not allowed till the low power
> >>>> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
> >>>> + * generate the access fault.
> >>>> + */
> >>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3    
> >>>
> >>> Note that Yishai's DMA logging set is competing for the same feature
> >>> entries.  We'll need to coordinate.
> >>>     
> >>
> >>  Since this is last week of rc, so will it possible to consider the updated
> >>  patches for next kernel (I can try to send them after complete testing the
> >>  by the end of this week). Otherwise, I can wait for next kernel and then
> >>  I can rebase my patches.  
> > 
> > I think we're close, though I would like to solicit uAPI feedback from
> > others on the list.  We can certainly sort out feature numbers
> > conflicts at commit time if Yishai's series becomes ready as well.  I'm
> > on PTO for the rest of the week starting tomorrow afternoon, but I'd
> > suggest trying to get this re-posted before the end of the week, asking
> > for more reviews for the uAPI, and we can evaluate early next week if
> > this is ready.
> >   
> 
>  Sure. I will re-post the updated patch series after the complete testing.
>  
> >>>> +
> >>>> +/*
> >>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> >>>> + * with the platform-based power management and provide support for the wake-up
> >>>> + * notifications through eventfd.  This low power state will be internal to the
> >>>> + * VFIO driver and the user will not come to know which power state is chosen.
> >>>> + * If any device access happens (either from the host or the guest) when the
> >>>> + * device is in the low power state, then the host will move the device out of
> >>>> + * the low power state first and a notification will be sent to the guest
> >>>> + * through eventfd.  Once the access is finished, the host will not move back
> >>>> + * the device into the low power state.  The guest should move the device into
> >>>> + * the low power state again upon receiving the wakeup notification.  The
> >>>> + * notification will be generated only if the device physically went into the
> >>>> + * low power state.  If the low power entry has been disabled from the host
> >>>> + * side, then the device will not go into the low power state even after
> >>>> + * calling this device feature and then the device access does not require
> >>>> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
> >>>> + * happens.  The low power exit can happen either through
> >>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
> >>>> + * wake-up notification has been generated).    
> >>>
> >>> Seems this could leverage a lot more from the previous, simply stating
> >>> that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> >>> with the exception that the user provides an eventfd for notification
> >>> when the device resumes from low power and will not try to re-enter a
> >>> low power state without a subsequent user call to one of the low power
> >>> entry feature ioctls.  It might also be worth covering the fact that
> >>> device accesses by the user, including region and ioctl access, will
> >>> also trigger the eventfd if that access triggers a resume.
> >>>
> >>> As I'm thinking about this, that latter clause is somewhat subtle.
> >>> AIUI a user can call the low power entry with wakeup feature and
> >>> proceed to do various ioctl and region (not mmap) accesses that could
> >>> perpetually keep the device awake, or there may be dependent devices
> >>> such that the device may never go to low power.  It needs to be very
> >>> clear that only if the wakeup eventfd has the device entered into and
> >>> exited a low power state making the low power exit ioctl optional.
> >>>     
> >>
> >>  Yes. In my updated description, I have added more details.
> >>  Can you please check if that helps.
> >>  
> >>>> + */
> >>>> +struct vfio_device_low_power_entry_with_wakeup {
> >>>> +	__s32 wakeup_eventfd;
> >>>> +	__u32 reserved;
> >>>> +};
> >>>> +
> >>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> >>>> +
> >>>> +/*
> >>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
> >>>> + * state.    
> >>>
> >>> Any ioctl effectively does that, the key here is that the low power
> >>> state may not be re-entered after this ioctl.
> >>>     
> >>>>  This device feature should be called only if the user has previously
> >>>> + * put the device into the low power state either with
> >>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> >>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the    
> >>>
> >>> Doesn't really seem worth mentioning, we need to protect against misuse
> >>> regardless.
> >>>     
> >>>> + * device is not in the low power state currently, this device feature will
> >>>> + * return early with the success status.    
> >>>
> >>> This is an implementation detail, it doesn't need to be part of the
> >>> uAPI.  Thanks,
> >>>
> >>> Alex
> >>>     
> >>>> + */
> >>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
> >>>> +
> >>>>  /* -------- API for Type1 VFIO IOMMU -------- */
> >>>>  
> >>>>  /**    
> >>>     
> >>
> >>  /*
> >>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
> >>   * state with the platform-based power management.  This low power state will
> >>   * be internal to the VFIO driver and the user will not come to know which
> >>   * power state is chosen.  If any device access happens (either from the host  
> > 
> > Couldn't the user look in sysfs to determine the power state?  There
> > might also be external hardware monitors of the power state, so this
> > statement is a bit overreaching.  Maybe something along the lines of...
> > 
> > "Device use of lower power states depend on factors managed by the
> > runtime power management core, including system level support and
> > coordinating support among dependent devices.  Enabling device low
> > power entry does not guarantee lower power usage by the device, nor is
> > a mechanism provided through this feature to know the current power
> > state of the device."
> >   
> >>   * or the guest) when the device is in the low power state, then the host will  
> > 
> > Let's not confine ourselves to a VM use case, "...from the host or
> > through the vfio uAPI".
> >   
> >>   * move the device out of the low power state first.  Once the access has been  
> > 
> > "move the device out of the low power state as necessary prior to the
> > access."
> >   
> >>   * finished, then the host will move the device into the low power state  
> > 
> > "Once the access is completed, the device may re-enter the low power
> > state."
> >   
> >>   * again.  For single shot low power support with wake-up notification, see
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  The mmap'ed region
> >>   * access is not allowed till the low power exit happens through
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will generate the access fault.  
> > 
> > "Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and
> > may only be resumed after calling LOW_POWER_EXIT."
> >   
> >>   */
> >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> >>
> >>  /*
> >>   * This device feature has the same behavior as
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
> >>   * provides an eventfd for wake-up notification.  When the device moves out of
> >>   * the low power state for the wake-up, the host will not try to re-enter a  
> > 
> > "...will not allow the device to re-enter a low power state..."
> >   
> >>   * low power state without a subsequent user call to one of the low power
> >>   * entry device feature IOCTLs.  The low power exit can happen either through
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
> >>   * wake-up notification has been generated).
> >>   *
> >>   * The notification through eventfd will be generated only if the device has
> >>   * gone into the low power state after calling this device feature IOCTL.  
> > 
> > "The notification through the provided eventfd will be generated only
> > when the device has entered and is resumed from a low power state after
> > calling this device feature IOCTL."
> > 
> >   
> >>   * There are various cases where the device will not go into the low power
> >>   * state after calling this device feature IOCTL (for example, the low power
> >>   * entry has been disabled from the host side, the user keeps the device busy
> >>   * after calling this device feature IOCTL, there are dependent devices which
> >>   * block the device low power entry, etc.) and in such cases, the device access
> >>   * does not require wake-up.  Also, the low power exit through  
> > 
> > "A device that has not entered low power state, as managed through the
> > runtime power management core, will not generate a notification through
> > the provided eventfd on access."
> >   
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT is mandatory for the cases where the
> >>   * wake-up notification has not been generated.  
> > 
> > "Calling the LOW_POWER_EXIT feature is optional in the case where
> > notification has been signaled on the provided eventfd that a resume
> > from low power has occurred."
> > 
> > We should also reiterate the statement above about mmap access because
> > it would be reasonable for a user to assume that if any access wakes
> > the device, that should include mmap faults and therefore a resume
> > notification should occur for such an event.  We could implement that
> > for the one-shot mode, but we're choosing not to.
> >   
> >>   */
> >>  struct vfio_device_low_power_entry_with_wakeup {
> >> 	 __s32 wakeup_eventfd;
> >> 	 __u32 reserved;
> >>  };
> >>
> >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> >>
> >>  /*
> >>   * Upon VFIO_DEVICE_FEATURE_SET, prevent the VFIO device low power state entry
> >>   * which has been previously allowed with VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> >>   * or VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
> >>   */  
> > 
> > "Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states
> > as previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> > VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.  This
> > device feature ioctl may itself generate a wakeup eventfd notification
> > in the latter case if the device has previously entered a low power
> > state."
> > 
> > Thanks,
> > Alex
> >   
> 
>  Thanks Alex for your thorough review of uAPI.
>  I have incorporated all the suggestions.
>  Following is the updated uAPI.
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>   * state with the platform-based power management.  Device use of lower power
>   * states depends on factors managed by the runtime power management core,
>   * including system level support and coordinating support among dependent
>   * devices.  Enabling device low power entry does not guarantee lower power
>   * usage by the device, nor is a mechanism provided through this feature to
>   * know the current power state of the device.  If any device access happens
>   * (either from the host or through the vfio uAPI) when the device is in the
>   * low power state, then the host will move the device out of the low power
>   * state as necessary prior to the access.  Once the access is completed, the
>   * device may re-enter the low power state.  For single shot low power support
>   * with wake-up notification, see
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
>   * calling LOW_POWER_EXIT.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>  
>  /*
>   * This device feature has the same behavior as
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>   * provides an eventfd for wake-up notification.  When the device moves out of
>   * the low power state for the wake-up, the host will not allow the device to
>   * re-enter a low power state without a subsequent user call to one of the low
>   * power entry device feature IOCTLs.  Access to mmap'd device regions is
>   * disabled on LOW_POWER_ENTRY_WITH_WAKEUP and may only be resumed after the
>   * low power exit.  The low power exit can happen either through LOW_POWER_EXIT
>   * or through any other access (where the wake-up notification has been
>   * generated).  The access to mmap'd device regions will not trigger low power
>   * exit.
>   *
>   * The notification through the provided eventfd will be generated only when
>   * the device has entered and is resumed from a low power state after
>   * calling this device feature IOCTL.  A device that has not entered low power
>   * state, as managed through the runtime power management core, will not
>   * generate a notification through the provided eventfd on access.  Calling the
>   * LOW_POWER_EXIT feature is optional in the case where notification has been
>   * signaled on the provided eventfd that a resume from low power has occurred.
>   */
>  struct vfio_device_low_power_entry_with_wakeup {
>  	__s32 wakeup_eventfd;
>  	__u32 reserved;
>  };
>  
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
>   * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
>   * This device feature IOCTL may itself generate a wakeup eventfd notification
>   * in the latter case if the device has previously entered a low power state.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5

Looks ok to me.  Thanks,

Alex
Jason Gunthorpe July 26, 2022, 5:23 p.m. UTC | #7
On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:
>  Thanks Alex for your thorough review of uAPI.
>  I have incorporated all the suggestions.
>  Following is the updated uAPI.
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>   * state with the platform-based power management.  Device use of lower power
>   * states depends on factors managed by the runtime power management core,
>   * including system level support and coordinating support among dependent
>   * devices.  Enabling device low power entry does not guarantee lower power
>   * usage by the device, nor is a mechanism provided through this feature to
>   * know the current power state of the device.  If any device access happens
>   * (either from the host or through the vfio uAPI) when the device is in the
>   * low power state, then the host will move the device out of the low power
>   * state as necessary prior to the access.  Once the access is completed, the
>   * device may re-enter the low power state.  For single shot low power support
>   * with wake-up notification, see
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
>   * calling LOW_POWER_EXIT.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>  
>  /*
>   * This device feature has the same behavior as
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>   * provides an eventfd for wake-up notification.

It feels like this should be one entry point instead of two.

A flag "automatic re-sleep" and an optional eventfd (-1 means not
provided) seems to capture both of these behaviors in a bit clearer
and extendable way.

Jason
Abhishek Sahu July 27, 2022, 6:07 a.m. UTC | #8
On 7/26/2022 10:53 PM, Jason Gunthorpe wrote:
> On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:
>>  Thanks Alex for your thorough review of uAPI.
>>  I have incorporated all the suggestions.
>>  Following is the updated uAPI.
>>  
>>  /*
>>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>>   * state with the platform-based power management.  Device use of lower power
>>   * states depends on factors managed by the runtime power management core,
>>   * including system level support and coordinating support among dependent
>>   * devices.  Enabling device low power entry does not guarantee lower power
>>   * usage by the device, nor is a mechanism provided through this feature to
>>   * know the current power state of the device.  If any device access happens
>>   * (either from the host or through the vfio uAPI) when the device is in the
>>   * low power state, then the host will move the device out of the low power
>>   * state as necessary prior to the access.  Once the access is completed, the
>>   * device may re-enter the low power state.  For single shot low power support
>>   * with wake-up notification, see
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
>>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
>>   * calling LOW_POWER_EXIT.
>>   */
>>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>>  
>>  /*
>>   * This device feature has the same behavior as
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>>   * provides an eventfd for wake-up notification.
> 
> It feels like this should be one entry point instead of two.
> 
> A flag "automatic re-sleep" and an optional eventfd (-1 means not
> provided) seems to capture both of these behaviors in a bit clearer
> and extendable way.
> 
> Jason

 We discussed about that in the earlier version of the patch series.
 Since we have different exit related handling, so to avoid confusion
 we proceeded with 2 separate variants for the low power entry. Also,
 we don't need any parameter for the first case.

 But, I can do the changes to make a single entry point, if we conclude
 for that. 

 From my side, I have explored how the uAPI looks like if
 we go with this approach.

 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  Device use of lower power
  * states depends on factors managed by the runtime power management core,
  * including system level support and coordinating support among dependent
  * devices.  Enabling device low power entry does not guarantee lower power
  * usage by the device, nor is a mechanism provided through this feature to
  * know the current power state of the device.  If any device access happens
  * (either from the host or through the vfio uAPI) when the device is in the
  * low power state, then the host will move the device out of the low power
  * state as necessary prior to the access.  Once the access is completed, the
  * device re-entry to a low power state will be controlled through
  * VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE flag.
  *
  * If LOW_POWER_REENTERY_DISABLE flag is not set, the device may re-enter the
  * low power state.  Access to mmap'd device regions is disabled on
  * LOW_POWER_ENTRY and may only be resumed after calling LOW_POWER_EXIT.
  *
  * If LOW_POWER_REENTERY_DISABLE flag is set, then user needs to provide an
  * eventfd for wake-up notification.  When the device moves out of the low
  * power state for the wake-up, the host will not allow the device to re-enter
  * a low power state without a subsequent user call to LOW_POWER_ENTRY.
  * Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and may only
  * be resumed after the low power exit.  The low power exit can happen either
  * through LOW_POWER_EXIT or through any other access (where the wake-up
  * notification has been generated).  The access to mmap'd device regions will
  * not trigger low power exit.
  *
  * The notification through the provided eventfd will be generated only when
  * the device has entered and is resumed from a low power state after
  * calling this device feature IOCTL.  A device that has not entered low power
  * state, as managed through the runtime power management core, will not
  * generate a notification through the provided eventfd on access.  Calling the
  * LOW_POWER_EXIT feature is optional in the case where notification has been
  * signaled on the provided eventfd that a resume from low power has occurred.
  *
  * The wakeup_eventfd needs to be valid only if LOW_POWER_REENTERY_DISABLE
  * flag is set, otherwise, it will be ignored.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
 
 struct vfio_device_low_power_entry_with_wakeup {
 	__u32 flags;
 #define VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE	(1 << 0)
 	__s32 wakeup_eventfd;
 };
 
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
  * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY device feature.
  * This device feature IOCTL may itself generate a wakeup eventfd notification
  * if the device had previously entered a low power state with
  * VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE flag set.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 4

 Thanks,
 Abhishek
Alex Williamson Aug. 1, 2022, 6:42 p.m. UTC | #9
On Wed, 27 Jul 2022 11:37:02 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/26/2022 10:53 PM, Jason Gunthorpe wrote:
> > On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:  
> >>  Thanks Alex for your thorough review of uAPI.
> >>  I have incorporated all the suggestions.
> >>  Following is the updated uAPI.
> >>  
> >>  /*
> >>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
> >>   * state with the platform-based power management.  Device use of lower power
> >>   * states depends on factors managed by the runtime power management core,
> >>   * including system level support and coordinating support among dependent
> >>   * devices.  Enabling device low power entry does not guarantee lower power
> >>   * usage by the device, nor is a mechanism provided through this feature to
> >>   * know the current power state of the device.  If any device access happens
> >>   * (either from the host or through the vfio uAPI) when the device is in the
> >>   * low power state, then the host will move the device out of the low power
> >>   * state as necessary prior to the access.  Once the access is completed, the
> >>   * device may re-enter the low power state.  For single shot low power support
> >>   * with wake-up notification, see
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
> >>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
> >>   * calling LOW_POWER_EXIT.
> >>   */
> >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> >>  
> >>  /*
> >>   * This device feature has the same behavior as
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
> >>   * provides an eventfd for wake-up notification.  
> > 
> > It feels like this should be one entry point instead of two.
> > 
> > A flag "automatic re-sleep" and an optional eventfd (-1 means not
> > provided) seems to capture both of these behaviors in a bit clearer
> > and extendable way.

I think the mutual exclusion between re-entrant mode and one-shot is
quite a bit more subtle in the version below, so I don't particularly
find this cleaner.  Potentially we could have variant drivers support
one w/o the other in the previously proposed model as well.  It's
interesting to see this suggestion since since we seem to have a theme
of making features single purpose elsewhere.  Thanks,

Alex 

> 
>  We discussed about that in the earlier version of the patch series.
>  Since we have different exit related handling, so to avoid confusion
>  we proceeded with 2 separate variants for the low power entry. Also,
>  we don't need any parameter for the first case.
> 
>  But, I can do the changes to make a single entry point, if we conclude
>  for that. 
> 
>  From my side, I have explored how the uAPI looks like if
>  we go with this approach.
> 
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>   * state with the platform-based power management.  Device use of lower power
>   * states depends on factors managed by the runtime power management core,
>   * including system level support and coordinating support among dependent
>   * devices.  Enabling device low power entry does not guarantee lower power
>   * usage by the device, nor is a mechanism provided through this feature to
>   * know the current power state of the device.  If any device access happens
>   * (either from the host or through the vfio uAPI) when the device is in the
>   * low power state, then the host will move the device out of the low power
>   * state as necessary prior to the access.  Once the access is completed, the
>   * device re-entry to a low power state will be controlled through
>   * VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE flag.
>   *
>   * If LOW_POWER_REENTERY_DISABLE flag is not set, the device may re-enter the
>   * low power state.  Access to mmap'd device regions is disabled on
>   * LOW_POWER_ENTRY and may only be resumed after calling LOW_POWER_EXIT.
>   *
>   * If LOW_POWER_REENTERY_DISABLE flag is set, then user needs to provide an
>   * eventfd for wake-up notification.  When the device moves out of the low
>   * power state for the wake-up, the host will not allow the device to re-enter
>   * a low power state without a subsequent user call to LOW_POWER_ENTRY.
>   * Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and may only
>   * be resumed after the low power exit.  The low power exit can happen either
>   * through LOW_POWER_EXIT or through any other access (where the wake-up
>   * notification has been generated).  The access to mmap'd device regions will
>   * not trigger low power exit.
>   *
>   * The notification through the provided eventfd will be generated only when
>   * the device has entered and is resumed from a low power state after
>   * calling this device feature IOCTL.  A device that has not entered low power
>   * state, as managed through the runtime power management core, will not
>   * generate a notification through the provided eventfd on access.  Calling the
>   * LOW_POWER_EXIT feature is optional in the case where notification has been
>   * signaled on the provided eventfd that a resume from low power has occurred.
>   *
>   * The wakeup_eventfd needs to be valid only if LOW_POWER_REENTERY_DISABLE
>   * flag is set, otherwise, it will be ignored.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>  
>  struct vfio_device_low_power_entry_with_wakeup {
>  	__u32 flags;
>  #define VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE	(1 << 0)
>  	__s32 wakeup_eventfd;
>  };
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
>   * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY device feature.
>   * This device feature IOCTL may itself generate a wakeup eventfd notification
>   * if the device had previously entered a low power state with
>   * VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE flag set.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 4
> 
>  Thanks,
>  Abhishek
>
Jason Gunthorpe Aug. 2, 2022, 2:04 p.m. UTC | #10
On Mon, Aug 01, 2022 at 12:42:53PM -0600, Alex Williamson wrote:
> On Wed, 27 Jul 2022 11:37:02 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
> > On 7/26/2022 10:53 PM, Jason Gunthorpe wrote:
> > > On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:  
> > >>  Thanks Alex for your thorough review of uAPI.
> > >>  I have incorporated all the suggestions.
> > >>  Following is the updated uAPI.
> > >>  
> > >>  /*
> > >>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
> > >>   * state with the platform-based power management.  Device use of lower power
> > >>   * states depends on factors managed by the runtime power management core,
> > >>   * including system level support and coordinating support among dependent
> > >>   * devices.  Enabling device low power entry does not guarantee lower power
> > >>   * usage by the device, nor is a mechanism provided through this feature to
> > >>   * know the current power state of the device.  If any device access happens
> > >>   * (either from the host or through the vfio uAPI) when the device is in the
> > >>   * low power state, then the host will move the device out of the low power
> > >>   * state as necessary prior to the access.  Once the access is completed, the
> > >>   * device may re-enter the low power state.  For single shot low power support
> > >>   * with wake-up notification, see
> > >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
> > >>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
> > >>   * calling LOW_POWER_EXIT.
> > >>   */
> > >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> > >>  
> > >>  /*
> > >>   * This device feature has the same behavior as
> > >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
> > >>   * provides an eventfd for wake-up notification.  
> > > 
> > > It feels like this should be one entry point instead of two.
> > > 
> > > A flag "automatic re-sleep" and an optional eventfd (-1 means not
> > > provided) seems to capture both of these behaviors in a bit clearer
> > > and extendable way.
> 
> I think the mutual exclusion between re-entrant mode and one-shot is
> quite a bit more subtle in the version below, so I don't particularly
> find this cleaner.  Potentially we could have variant drivers support
> one w/o the other in the previously proposed model as well.  It's
> interesting to see this suggestion since since we seem to have a theme
> of making features single purpose elsewhere.  Thanks,

It is still quite single purpose, just
VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE is some minor customization of
that single purpose.

Either the flag is set or not, it isn't subtle..

Jason
Alex Williamson Aug. 2, 2022, 3:41 p.m. UTC | #11
On Tue, 2 Aug 2022 11:04:52 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Aug 01, 2022 at 12:42:53PM -0600, Alex Williamson wrote:
> > On Wed, 27 Jul 2022 11:37:02 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> > > On 7/26/2022 10:53 PM, Jason Gunthorpe wrote:  
> > > > On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:    
> > > >>  Thanks Alex for your thorough review of uAPI.
> > > >>  I have incorporated all the suggestions.
> > > >>  Following is the updated uAPI.
> > > >>  
> > > >>  /*
> > > >>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
> > > >>   * state with the platform-based power management.  Device use of lower power
> > > >>   * states depends on factors managed by the runtime power management core,
> > > >>   * including system level support and coordinating support among dependent
> > > >>   * devices.  Enabling device low power entry does not guarantee lower power
> > > >>   * usage by the device, nor is a mechanism provided through this feature to
> > > >>   * know the current power state of the device.  If any device access happens
> > > >>   * (either from the host or through the vfio uAPI) when the device is in the
> > > >>   * low power state, then the host will move the device out of the low power
> > > >>   * state as necessary prior to the access.  Once the access is completed, the
> > > >>   * device may re-enter the low power state.  For single shot low power support
> > > >>   * with wake-up notification, see
> > > >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
> > > >>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
> > > >>   * calling LOW_POWER_EXIT.
> > > >>   */
> > > >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> > > >>  
> > > >>  /*
> > > >>   * This device feature has the same behavior as
> > > >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
> > > >>   * provides an eventfd for wake-up notification.    
> > > > 
> > > > It feels like this should be one entry point instead of two.
> > > > 
> > > > A flag "automatic re-sleep" and an optional eventfd (-1 means not
> > > > provided) seems to capture both of these behaviors in a bit clearer
> > > > and extendable way.  
> > 
> > I think the mutual exclusion between re-entrant mode and one-shot is
> > quite a bit more subtle in the version below, so I don't particularly
> > find this cleaner.  Potentially we could have variant drivers support
> > one w/o the other in the previously proposed model as well.  It's
> > interesting to see this suggestion since since we seem to have a theme
> > of making features single purpose elsewhere.  Thanks,  
> 
> It is still quite single purpose, just
> VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE is some minor customization of
> that single purpose.
> 
> Either the flag is set or not, it isn't subtle..

The subtlety is that there's a flag and a field and the flag can only
be set if the field is set, the flag can only be clear if the field is
clear, so we return -EINVAL for the other cases?  Why do we have both a
flag and a field?  This isn't like we're adding a feature later and the
flag needs to indicate that the field is present and valid.  It's just
not a very clean interface, imo.  Thanks,

Alex
Jason Gunthorpe Aug. 2, 2022, 4:35 p.m. UTC | #12
On Tue, Aug 02, 2022 at 09:41:28AM -0600, Alex Williamson wrote:

> The subtlety is that there's a flag and a field and the flag can only
> be set if the field is set, the flag can only be clear if the field is
> clear, so we return -EINVAL for the other cases?  Why do we have both a
> flag and a field?  This isn't like we're adding a feature later and the
> flag needs to indicate that the field is present and valid.  It's just
> not a very clean interface, imo.  Thanks,

That isn't how I read Abhishek's proposal.. The eventfd should always
work and should always behave as described "The notification through
the provided eventfd will be generated only when the device has
entered and is resumed from a low power state"

If userspace provides it without LOW_POWER_REENTERY_DISABLE then it
still generates the events.

The linkage to LOW_POWER_REENTERY_DISABLE is only that userspace
probably needs to use both elements together to generate the
auto-reentry behavior. Kernel should not enforce it.

Two fields, orthogonal behaviors.

Jason
Alex Williamson Aug. 2, 2022, 4:57 p.m. UTC | #13
On Tue, 2 Aug 2022 13:35:04 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Aug 02, 2022 at 09:41:28AM -0600, Alex Williamson wrote:
> 
> > The subtlety is that there's a flag and a field and the flag can only
> > be set if the field is set, the flag can only be clear if the field is
> > clear, so we return -EINVAL for the other cases?  Why do we have both a
> > flag and a field?  This isn't like we're adding a feature later and the
> > flag needs to indicate that the field is present and valid.  It's just
> > not a very clean interface, imo.  Thanks,  
> 
> That isn't how I read Abhishek's proposal.. The eventfd should always
> work and should always behave as described "The notification through
> the provided eventfd will be generated only when the device has
> entered and is resumed from a low power state"
> 
> If userspace provides it without LOW_POWER_REENTERY_DISABLE then it
> still generates the events.
> 
> The linkage to LOW_POWER_REENTERY_DISABLE is only that userspace
> probably needs to use both elements together to generate the
> auto-reentry behavior. Kernel should not enforce it.
> 
> Two fields, orthogonal behaviors.

What's the point of notifying userspace that the device was resumed if
it might already be suspended again by the time userspace responds to
the eventfd?  This really muddies the single-shot vs re-entrant
behavior.  If userspace allows the kernel to re-enter low power, then
they don't need to be notified on every wake.  If userspace wants to be
notified on wake, userspace can also re-initiate low power entry.
Thanks,

Alex
Jason Gunthorpe Aug. 2, 2022, 5:01 p.m. UTC | #14
On Tue, Aug 02, 2022 at 10:57:55AM -0600, Alex Williamson wrote:
> On Tue, 2 Aug 2022 13:35:04 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Aug 02, 2022 at 09:41:28AM -0600, Alex Williamson wrote:
> > 
> > > The subtlety is that there's a flag and a field and the flag can only
> > > be set if the field is set, the flag can only be clear if the field is
> > > clear, so we return -EINVAL for the other cases?  Why do we have both a
> > > flag and a field?  This isn't like we're adding a feature later and the
> > > flag needs to indicate that the field is present and valid.  It's just
> > > not a very clean interface, imo.  Thanks,  
> > 
> > That isn't how I read Abhishek's proposal.. The eventfd should always
> > work and should always behave as described "The notification through
> > the provided eventfd will be generated only when the device has
> > entered and is resumed from a low power state"
> > 
> > If userspace provides it without LOW_POWER_REENTERY_DISABLE then it
> > still generates the events.
> > 
> > The linkage to LOW_POWER_REENTERY_DISABLE is only that userspace
> > probably needs to use both elements together to generate the
> > auto-reentry behavior. Kernel should not enforce it.
> > 
> > Two fields, orthogonal behaviors.
> 
> What's the point of notifying userspace that the device was resumed if
> it might already be suspended again by the time userspace responds to
> the eventfd? 

I don't know - the eventfds is counting so it does let userspace
monitor frequency of auto-sleeping.

In any case the point is to make simple kernel APIs, not cover every
combination with a use case. Decoupling is simpler than coupling.

Jason
Abhishek Sahu Aug. 3, 2022, 6:32 a.m. UTC | #15
On 8/2/2022 10:31 PM, Jason Gunthorpe wrote:
> On Tue, Aug 02, 2022 at 10:57:55AM -0600, Alex Williamson wrote:
>> On Tue, 2 Aug 2022 13:35:04 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Tue, Aug 02, 2022 at 09:41:28AM -0600, Alex Williamson wrote:
>>>
>>>> The subtlety is that there's a flag and a field and the flag can only
>>>> be set if the field is set, the flag can only be clear if the field is
>>>> clear, so we return -EINVAL for the other cases?  Why do we have both a
>>>> flag and a field?  This isn't like we're adding a feature later and the
>>>> flag needs to indicate that the field is present and valid.  It's just
>>>> not a very clean interface, imo.  Thanks,  
>>>
>>> That isn't how I read Abhishek's proposal.. The eventfd should always
>>> work and should always behave as described "The notification through
>>> the provided eventfd will be generated only when the device has
>>> entered and is resumed from a low power state"
>>>
>>> If userspace provides it without LOW_POWER_REENTERY_DISABLE then it
>>> still generates the events.
>>>
>>> The linkage to LOW_POWER_REENTERY_DISABLE is only that userspace
>>> probably needs to use both elements together to generate the
>>> auto-reentry behavior. Kernel should not enforce it.
>>>
>>> Two fields, orthogonal behaviors.
>>
>> What's the point of notifying userspace that the device was resumed if
>> it might already be suspended again by the time userspace responds to
>> the eventfd? 
> 
> I don't know - the eventfds is counting so it does let userspace
> monitor frequency of auto-sleeping.
> 
> In any case the point is to make simple kernel APIs, not cover every
> combination with a use case. Decoupling is simpler than coupling.
> 
> Jason


 Thanks Alex and Jason for your inputs.

 It seems, I can use the original uAPI where we will have 2 variants
 of ENTRY.

 Since already kernel merge window is started, so I will wait for
 v5.20-rc1 (or v6.0-rc1) and then I will rebase and test my patches on
 the updated kernel.

 Regards,
 Abhishek
diff mbox series

Patch

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..08fd3482d22b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,61 @@  enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
+ * with the platform-based power management.  This low power state will be
+ * internal to the VFIO driver and the user will not come to know which power
+ * state is chosen.  If any device access happens (either from the host or
+ * the guest) when the device is in the low power state, then the host will
+ * move the device out of the low power state first.  Once the access has been
+ * finished, then the host will move the device into the low power state again.
+ * If the user wants that the device should not go into the low power state
+ * again in this case, then the user should use the
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the
+ * low power entry.  The mmap'ed region access is not allowed till the low power
+ * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
+ * generate the access fault.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
+ * with the platform-based power management and provide support for the wake-up
+ * notifications through eventfd.  This low power state will be internal to the
+ * VFIO driver and the user will not come to know which power state is chosen.
+ * If any device access happens (either from the host or the guest) when the
+ * device is in the low power state, then the host will move the device out of
+ * the low power state first and a notification will be sent to the guest
+ * through eventfd.  Once the access is finished, the host will not move back
+ * the device into the low power state.  The guest should move the device into
+ * the low power state again upon receiving the wakeup notification.  The
+ * notification will be generated only if the device physically went into the
+ * low power state.  If the low power entry has been disabled from the host
+ * side, then the device will not go into the low power state even after
+ * calling this device feature and then the device access does not require
+ * wake-up.  The mmap'ed region access is not allowed till the low power exit
+ * happens.  The low power exit can happen either through
+ * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
+ * wake-up notification has been generated).
+ */
+struct vfio_device_low_power_entry_with_wakeup {
+	__s32 wakeup_eventfd;
+	__u32 reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
+ * state.  This device feature should be called only if the user has previously
+ * put the device into the low power state either with
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the
+ * device is not in the low power state currently, this device feature will
+ * return early with the success status.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**