
[RFC,3/3] vfio/pci: use runtime PM for vfio-device into low power state

Message ID 20211115133640.2231-4-abhsahu@nvidia.com (mailing list archive)
State New, archived
Series: vfio/pci: Enable runtime power management support

Commit Message

Abhishek Sahu Nov. 15, 2021, 1:36 p.m. UTC
From: Abhishek Sahu <abhsahu@nvidia.com>

Currently, if runtime power management is enabled for a vfio-pci
device in the guest OS, the guest OS writes the PCI_PM_CTRL
register. This write request is handled in vfio_pm_config_write(),
which performs the actual register write of PCI_PM_CTRL. With this,
D3hot is the lowest power state that can be reached. If we use the
runtime PM framework instead, we can reach the D3cold state, which
saves the most power.

This patch uses the runtime PM framework whenever the vfio-pci device
is put into a low power state.

1. If runtime PM is enabled, then instead of directly writing the
   PCI_PM_CTRL register, decrement the device usage counter whenever
   the requested power state is non-D0. The kernel runtime PM framework
   will itself put the PCI device into a low power state when the
   device usage counter reaches zero. Similarly, when the power state
   is changed back to D0, increment the device usage counter and the
   kernel runtime PM framework will itself bring the PCI device out of
   the low power state.
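The usage-counter handling in point 1 can be modeled in plain C as
below. This is a hypothetical userspace sketch of the described logic
(the struct and function names are made up), not the actual kernel or
vfio-pci code:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the runtime PM usage counter: the PM core suspends the
 * device when the count drops to zero and resumes it when the count
 * rises above zero again. */
struct pm_model {
    int usage_count;   /* models dev->power.usage_count */
    bool suspended;    /* models the runtime-suspended state */
};

/* models pm_runtime_put(): may suspend once the count reaches zero */
static void pm_put(struct pm_model *pm)
{
    if (--pm->usage_count == 0)
        pm->suspended = true;
}

/* models pm_runtime_get_sync(): resumes the device if it was idle */
static void pm_get(struct pm_model *pm)
{
    if (pm->usage_count++ == 0)
        pm->suspended = false;
}

/* Guest writes PCI_PM_CTRL: a non-D0 request drops the count, a
 * return to D0 raises it again. */
static void pm_ctrl_write(struct pm_model *pm, int new_state)
{
    if (new_state != 0)      /* D1/D2/D3hot requested */
        pm_put(pm);
    else                     /* back to D0 */
        pm_get(pm);
}
```

The point is that vfio-pci never forces a power state itself; it only
moves the reference count and lets the PM core choose the deepest
reachable state.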

2. The guest OS reads the PCI_PM_CTRL register back to
   confirm the current power state, so virtual register bits can be
   used. For this, before decrementing the usage count, read the actual
   PCI_PM_CTRL register, update the power state related bits, and then
   update the vconfig bits corresponding to the PCI_PM_CTRL offset. For
   a PCI_PM_CTRL register read, return the virtual value if runtime PM
   has been requested. These vconfig bits are cleared when the power
   state is changed back to D0.

3. For the guest OS, the PCI power state will still be D3hot
   even if the actual PCI device is put into the D3cold state. In the
   D3hot state, the config space can still be read. So, if there is any
   request from the guest OS to read the config space, we need to move
   the actual PCI device back to D0. For this, increment the device
   usage count before reading/writing the config space and decrement it
   again afterwards. There can be back-to-back config register
   read/write requests, but since the autosuspend methods are used
   here, only the first access wakes up the PCI device; for the
   remaining accesses, the device will already be active.
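The burst behavior described in point 3 can be sketched as a small C
model; this is an illustrative simulation (all names invented), not
the driver code:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of a config access bracketed by a runtime PM reference.
 * With autosuspend, the device stays active between accesses, so a
 * burst of reads/writes causes exactly one real wake-up. */
struct dev_model {
    int usage_count;
    bool active;      /* device is in D0 */
    int resumes;      /* how many real wake-ups happened */
};

static void get_sync(struct dev_model *d)
{
    d->usage_count++;
    if (!d->active) {         /* only a suspended device is resumed */
        d->active = true;
        d->resumes++;
    }
}

static void put_autosuspend(struct dev_model *d)
{
    d->usage_count--;
    /* autosuspend: the device stays active until the timer expires,
     * so the next access finds it already awake */
}

static void config_access(struct dev_model *d)
{
    get_sync(d);
    /* ... read or write the config register here ... */
    put_autosuspend(d);
}
```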

4. The above-mentioned wake up is not needed if the register
   read/write is done entirely with virtual bits. To handle this
   condition, the actual resume of the device is only done in
   vfio_user_config_read()/vfio_user_config_write(). All config
   register accesses come through vfio_pci_config_rw(). So, in this
   function, use pm_runtime_get_noresume() so that only the usage count
   is incremented, without resuming the device. Inside
   vfio_user_config_read()/vfio_user_config_write(), use the routines
   together with pm_runtime_put_noidle() so that the PCI device won't
   be suspended in the lower level functions. Back in the top level
   vfio_pci_config_rw(), use pm_runtime_put_autosuspend(). The
   autosuspend timer is then started and the device will be suspended
   again. If the device is already runtime suspended, this routine
   returns early.
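The layered get/put pattern in point 4 can be modeled like this; again
a hypothetical userspace sketch with made-up names, only mirroring the
pm_runtime_get_noresume()/pm_runtime_put_noidle()/
pm_runtime_put_autosuspend() pairing described above:

```c
#include <assert.h>
#include <stdbool.h>

/* The outer path pins the device with a no-resume reference; only the
 * inner read/write helper resumes the device, and only when the
 * access is not served purely from virtual (vconfig) bits. */
struct rt_model {
    int usage_count;
    bool active;
    int resumes;
};

/* models pm_runtime_get_noresume(): count++ without waking the device */
static void get_noresume(struct rt_model *d) { d->usage_count++; }

/* models the inner helper: wake the device if hardware is touched,
 * then drop the reference with a put_noidle (no suspend attempt) */
static void inner_access(struct rt_model *d, bool virtual_only)
{
    if (virtual_only)
        return;               /* served from vconfig, no wake-up needed */
    d->usage_count++;
    if (!d->active) {
        d->active = true;
        d->resumes++;
    }
    d->usage_count--;         /* put_noidle: no suspend from here */
}

/* models pm_runtime_put_autosuspend() in the outer function */
static void put_autosuspend(struct rt_model *d) { d->usage_count--; }

static void config_rw(struct rt_model *d, bool virtual_only)
{
    get_noresume(d);
    inner_access(d, virtual_only);
    put_autosuspend(d);
}
```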

5. On the host side, D3cold will only be used if the platform
   supports it; otherwise some other state will be used. The config
   space can be read as long as the device is not in the D3cold state,
   so in that case we can skip resuming the PCI device. The wrapper
   function vfio_pci_config_pm_runtime_get() takes care of this
   condition and invokes pm_runtime_resume() only if the current power
   state is D3cold.
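The conditional resume in point 5 reduces to the check below; this is
a simplified illustrative model, not the real
vfio_pci_config_pm_runtime_get():

```c
#include <assert.h>

/* Config space stays accessible in D0-D3hot, so a full resume is only
 * forced when the device actually sits in D3cold. */
enum pm_state { D0, D1, D2, D3HOT, D3COLD };

struct cfg_model {
    enum pm_state state;
    int resumes;
};

static void config_pm_runtime_get(struct cfg_model *d)
{
    /* the usage count would also be taken here in the real driver */
    if (d->state == D3COLD) {   /* only D3cold blocks config access */
        d->state = D0;
        d->resumes++;
    }
}
```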

6. For vfio_pci_config_pm_runtime_get()/vfio_pci_config_pm_runtime_put(),
   the reference code is taken from
   pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then
   modified according to the vfio-pci driver's requirements.

7. vfio_pci_set_power_state() becomes unused after moving to runtime
   PM, so this function can be removed along with other related
   functions and structure fields.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 178 ++++++++++++++++++++++++++---
 drivers/vfio/pci/vfio_pci_core.c   |  64 ++---------
 include/linux/vfio_pci_core.h      |   3 +-
 3 files changed, 174 insertions(+), 71 deletions(-)

Comments

Alex Williamson Nov. 17, 2021, 5:53 p.m. UTC | #1
On Mon, 15 Nov 2021 19:06:40 +0530
<abhsahu@nvidia.com> wrote:

> From: Abhishek Sahu <abhsahu@nvidia.com>
> 
> Currently, if the runtime power management is enabled for vfio-pci
> device in guest OS, then guest OS will do the register write for
> PCI_PM_CTRL register. This write request will be handled in
> vfio_pm_config_write() where it will do the actual register write of
> PCI_PM_CTRL register. With this, the maximum D3hot state can be
> achieved for low power. If we can use the runtime PM framework, then
> we can achieve the D3cold state which will help in saving
> maximum power.
> 
> This patch uses runtime PM framework whenever vfio-pci device will
> be put in the low power state.
> 
> 1. If runtime PM is enabled, then instead of directly writing
>    PCI_PM_CTRL register, decrement the device usage counter whenever
>    the power state is non-D0. The kernel runtime PM framework will
>    itself put the PCI device in low power state when device usage
>    counter will become zero. Similarly, when the power state will be
>    again changed back to D0, then increment the device usage counter
>    and the kernel runtime PM framework will itself bring the PCI device
>    out of low power state.
> 
> 2. The guest OS will read the PCI_PM_CTRL register back to
>    confirm the current power state so virtual register bits can be
>    used. For this, before decrementing the usage count, read the actual
>    PCI_PM_CTRL register, update the power state related bits, and then
>    update the vconfig bits corresponding to PCI_PM_CTRL offset. For
>    PCI_PM_CTRL register read, return the virtual value if runtime PM is
>    requested. This vconfig bits will be cleared when the power state
>    will be changed back to D0.
> 
> 3. For the guest OS, the PCI power state will be still D3hot
>    even if put the actual PCI device into D3cold state. In the D3hot
>    state, the config space can still be read. So, if there is any request
>    from guest OS to read the config space, then we need to move the actual
>    PCI device again back to D0. For this, increment the device usage
>    count before reading/writing the config space and then decrement it
>    again after reading/writing the config space. There can be
>    back-to-back config register read/write request, but since the auto
>    suspend methods are being used here so only first access will
>    wake up the PCI device and for the remaining access, the device will
>    already be active.
> 
> 4. This above-mentioned wake up is not needed if the register
>    read/write is done completely with virtual bits. For handling this
>    condition, the actual resume of device will only being done in
>    vfio_user_config_read()/vfio_user_config_write(). All the config
>    register access will come vfio_pci_config_rw(). So, in this
>    function, use the pm_runtime_get_noresume() so that only usage count
>    will be incremented without resuming the device. Inside,
>    vfio_user_config_read()/vfio_user_config_write(), use the routines
>    with pm_runtime_put_noidle() so that the PCI device won’t be
>    suspended in the lower level functions. Again in the top level
>    vfio_pci_config_rw(), use the pm_runtime_put_autosuspend(). Now the
>    auto suspend timer will be started and the device will be suspended
>    again. If the device is already runtime suspended, then this routine
>    will return early.
> 
> 5. In the host side D3cold will only be used if the platform has
>    support for this, otherwise some other state will be used. The
>    config space can be read if the device is not in D3cold state. So in
>    this case, we can skip the resuming of PCI device. The wrapper
>    function vfio_pci_config_pm_runtime_get() takes care of this
>    condition and invoke the pm_runtime_resume() only if current power
>    state is D3cold.
> 
> 6. For vfio_pci_config_pm_runtime_get()/vfio_
>    pci_config_pm_runtime_put(), the reference code is taken from
>    pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then it
>    is modified according to vfio-pci driver requirement.
> 
> 7. vfio_pci_set_power_state() will be unused after moving to runtime
>    PM, so this function can be removed along with other related
>    functions and structure fields.


If we're transitioning a device to D3cold rather than D3hot as
requested by userspace, isn't that a user visible change?  For
instance, a device may report NoSoftRst- indicating that the device
does not do a soft reset on D3hot->D0 transition.  If we're instead
putting the device in D3cold, then a transition back to D0 has very
much undergone a reset.  On one hand we should at least virtualize the
NoSoftRst bit to allow the guest to restore the device, but I wonder if
that's really safe.  Is a better option to prevent entering D3cold if
the device isn't natively reporting NoSoftRst-?

We're also essentially making a policy decision on behalf of userspace
that favors power saving over latency.  Is that universally the correct
trade-off?  I can imagine this could be desirable for many use cases,
but if we're going to covertly replace D3hot with D3cold, it seems like
there should be an opt-in.  Is co-opting the PM capability for this
even really acceptable or should there be a device ioctl to request
D3cold and plumbing through QEMU such that a VM guest can make informed
choices regarding device power management?

Also if the device is not responsive to config space due to the user
placing it in D3 now, I'd expect there are other ioctl paths that need
to be blocked, maybe even MMIO paths that might be a gap for existing
D3hot support.  Thanks,

Alex
Abhishek Sahu Nov. 18, 2021, 3:21 p.m. UTC | #2
On 11/17/2021 11:23 PM, Alex Williamson wrote:

> On Mon, 15 Nov 2021 19:06:40 +0530
> <abhsahu@nvidia.com> wrote:
> 
>> From: Abhishek Sahu <abhsahu@nvidia.com>
>>
>> Currently, if the runtime power management is enabled for vfio-pci
>> device in guest OS, then guest OS will do the register write for
>> PCI_PM_CTRL register. This write request will be handled in
>> vfio_pm_config_write() where it will do the actual register write of
>> PCI_PM_CTRL register. With this, the maximum D3hot state can be
>> achieved for low power. If we can use the runtime PM framework, then
>> we can achieve the D3cold state which will help in saving
>> maximum power.
>>
>> This patch uses runtime PM framework whenever vfio-pci device will
>> be put in the low power state.
>>
>> 1. If runtime PM is enabled, then instead of directly writing
>>    PCI_PM_CTRL register, decrement the device usage counter whenever
>>    the power state is non-D0. The kernel runtime PM framework will
>>    itself put the PCI device in low power state when device usage
>>    counter will become zero. Similarly, when the power state will be
>>    again changed back to D0, then increment the device usage counter
>>    and the kernel runtime PM framework will itself bring the PCI device
>>    out of low power state.
>>
>> 2. The guest OS will read the PCI_PM_CTRL register back to
>>    confirm the current power state so virtual register bits can be
>>    used. For this, before decrementing the usage count, read the actual
>>    PCI_PM_CTRL register, update the power state related bits, and then
>>    update the vconfig bits corresponding to PCI_PM_CTRL offset. For
>>    PCI_PM_CTRL register read, return the virtual value if runtime PM is
>>    requested. This vconfig bits will be cleared when the power state
>>    will be changed back to D0.
>>
>> 3. For the guest OS, the PCI power state will be still D3hot
>>    even if put the actual PCI device into D3cold state. In the D3hot
>>    state, the config space can still be read. So, if there is any request
>>    from guest OS to read the config space, then we need to move the actual
>>    PCI device again back to D0. For this, increment the device usage
>>    count before reading/writing the config space and then decrement it
>>    again after reading/writing the config space. There can be
>>    back-to-back config register read/write request, but since the auto
>>    suspend methods are being used here so only first access will
>>    wake up the PCI device and for the remaining access, the device will
>>    already be active.
>>
>> 4. This above-mentioned wake up is not needed if the register
>>    read/write is done completely with virtual bits. For handling this
>>    condition, the actual resume of device will only being done in
>>    vfio_user_config_read()/vfio_user_config_write(). All the config
>>    register access will come vfio_pci_config_rw(). So, in this
>>    function, use the pm_runtime_get_noresume() so that only usage count
>>    will be incremented without resuming the device. Inside,
>>    vfio_user_config_read()/vfio_user_config_write(), use the routines
>>    with pm_runtime_put_noidle() so that the PCI device won’t be
>>    suspended in the lower level functions. Again in the top level
>>    vfio_pci_config_rw(), use the pm_runtime_put_autosuspend(). Now the
>>    auto suspend timer will be started and the device will be suspended
>>    again. If the device is already runtime suspended, then this routine
>>    will return early.
>>
>> 5. In the host side D3cold will only be used if the platform has
>>    support for this, otherwise some other state will be used. The
>>    config space can be read if the device is not in D3cold state. So in
>>    this case, we can skip the resuming of PCI device. The wrapper
>>    function vfio_pci_config_pm_runtime_get() takes care of this
>>    condition and invoke the pm_runtime_resume() only if current power
>>    state is D3cold.
>>
>> 6. For vfio_pci_config_pm_runtime_get()/vfio_
>>    pci_config_pm_runtime_put(), the reference code is taken from
>>    pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then it
>>    is modified according to vfio-pci driver requirement.
>>
>> 7. vfio_pci_set_power_state() will be unused after moving to runtime
>>    PM, so this function can be removed along with other related
>>    functions and structure fields.
> 
> 

 Thanks, Alex, for checking this series and providing your inputs.
 
> If we're transitioning a device to D3cold rather than D3hot as
> requested by userspace, isn't that a user visible change? 

  For most of the drivers in the Linux kernel, the D3hot vs D3cold
  choice is made in the PCI core layer, where pci_target_state()
  determines which D3 state to use. It checks
  platform_pci_power_manageable() and then calls
  platform_pci_choose_state() to find the target state.
  In a VM, the platform_pci_power_manageable() check fails if the
  guest is a Linux OS, so it uses the D3hot state.
 
  But there are a few drivers which do not use the PCI framework's
  generic power routines during runtime suspend/system suspend
  and instead set the PCI power state directly to D3hot.
  Also, the guest can be a non-Linux OS and, in that case,
  it will be difficult to know the behavior. So, it may impact
  these cases.

> For instance, a device may report NoSoftRst- indicating that the device
> does not do a soft reset on D3hot->D0 transition.  If we're instead
> putting the device in D3cold, then a transition back to D0 has very
> much undergone a reset.  On one hand we should at least virtualize the
> NoSoftRst bit to allow the guest to restore the device, but I wonder if
> that's really safe.  Is a better option to prevent entering D3cold if
> the device isn't natively reporting NoSoftRst-?
> 

 You mean to say NoSoftRst+ instead of NoSoftRst- as visible in
 the lspci output. For the NoSoftRst- case, we do a soft reset on
 the D3hot->D0 transition. But won't this case be handled internally
 in drivers/pci/pci-driver.c? For both system suspend and runtime
 suspend, we check the pci_dev->state_saved flag and do pci_save_state()
 irrespective of the NoSoftRst bit. For the NoSoftRst- case,
 pci_restore_bars() will be called in pci_raw_set_power_state(), which
 will reinitialize the device for the D3hot/D3cold->D0 case. Once the
 device is initialized on the host, then for the guest it should work
 without re-initializing again on the guest side. I am not sure if my
 understanding is correct.

> We're also essentially making a policy decision on behalf of userspace
> that favors power saving over latency.  Is that universally the correct
> trade-off? 

 For most drivers, the choice between D3hot and D3cold is not driven
 by latency concerns. On the Linux kernel side, I see that the
 PCI framework tries to use the D3cold state if the platform and
 device support it. But it's correct that covertly replacing D3hot
 with D3cold may be a concern for some drivers.

> I can imagine this could be desirable for many use cases,
> but if we're going to covertly replace D3hot with D3cold, it seems like
> there should be an opt-in.  Is co-opting the PM capability for this
> even really acceptable or should there be a device ioctl to request
> D3cold and plumbing through QEMU such that a VM guest can make informed
> choices regarding device power management?
> 

 Adding an ioctl is also an option, but in that case this support
 needs to be added in all hypervisors and the user must pass this
 information explicitly for each device. Another option could be a
 module parameter to explicitly enable D3cold support. If the module
 parameter is not set, then we can call pci_d3cold_disable() and,
 in that case, runtime PM should not use the D3cold state.
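 That module-parameter idea could be sketched roughly as below; the
 parameter name and helper are invented for illustration (no such
 parameter exists in vfio-pci), and the real gating would go through
 pci_d3cold_disable() rather than an open-coded check:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opt-in: if the (made-up) parameter is left unset,
 * D3cold is disabled and runtime PM falls back to D3hot as the
 * deepest reachable state. */
static bool enable_d3cold;   /* would be a module_param(bool) */

enum pm_state { D0 = 0, D3HOT = 3, D3COLD = 4 };

/* models the target-state choice after pci_d3cold_disable() has or
 * has not been called */
static enum pm_state target_low_power_state(bool platform_d3cold_ok)
{
    if (enable_d3cold && platform_d3cold_ok)
        return D3COLD;
    return D3HOT;            /* D3cold disabled or unsupported */
}
```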

 Also, I was checking whether we can pass this information through
 some virtualized register bit which would be defined only for passing
 the information between guest and host. On the guest side, if the
 target state is decided with pci_target_state(), then
 D3cold vs D3hot should not matter to the driver running
 in the guest, and in that case it depends upon platform support.
 We can set this virtualized bit to 1. But if the driver is either
 setting the D3hot state explicitly or has called pci_d3cold_disable()
 or a similar API available in the guest OS, then set this bit to 0
 and, in that case, D3cold can be disabled on the host side.
 But I don't know if it is possible to use some non-PCI-defined
 virtualized register bit.

 I am not sure what the best option is for making the choice
 regarding D3cold, but if we can have some option by which this
 can be done without the involvement of the user, then it will
 benefit a lot of cases. Currently, D3cold is supported on only
 very few desktops/servers, but in the future we will see it on
 most platforms.

> Also if the device is not responsive to config space due to the user
> placing it in D3 now, I'd expect there are other ioctl paths that need
> to be blocked, maybe even MMIO paths that might be a gap for existing
> D3hot support.  Thanks,
> 
> Alex
> 

 I was under the assumption that most of the ioctl code will be called
 by the hypervisor before the guest OS boots, and during that time the
 device will always be in D0. But if we have paths where an ioctl can
 be called when the device has been suspended by the guest OS, then
 can we use the runtime get/put APIs there also?

 Thanks,
 Abhishek
Alex Williamson Nov. 18, 2021, 9:09 p.m. UTC | #3
On Thu, 18 Nov 2021 20:51:41 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:
> On 11/17/2021 11:23 PM, Alex Williamson wrote:
>  Thanks Alex for checking this series and providing your inputs. 
>  
> > If we're transitioning a device to D3cold rather than D3hot as
> > requested by userspace, isn't that a user visible change?   
> 
>   For most of the driver, in linux kernel, the D3hot vs D3cold
>   state will be decided at PCI core layer. In the PCI core layer,
>   pci_target_state() determines which D3 state to choose. It checks
>   for platform_pci_power_manageable() and then it calls
>   platform_pci_choose_state() to find the target state.
>   In VM, the platform_pci_power_manageable() check will fail if the
>   guest is linux OS. So, it uses, D3hot state.

Right, but my statement is really more that the device PM registers
cannot be used to put the device into D3cold, so the write of the PM
register that we're trapping was the user/guest's intention to put the
device into D3hot.  We therefore need to be careful about differences
in the resulting device state when it comes out of D3cold vs D3hot.

>   But there are few drivers which does not use the PCI framework
>   generic power related routines during runtime suspend/system suspend
>   and set the PCI power state directly with D3hot.

Current vfio-pci being one of those ;)

>   Also, the guest can be non-Linux OS also and, in that case,
>   it will be difficult to know the behavior. So, it may impact
>   these cases.

That's what I'm worried about.

> > For instance, a device may report NoSoftRst- indicating that the device
> > does not do a soft reset on D3hot->D0 transition.  If we're instead
> > putting the device in D3cold, then a transition back to D0 has very
> > much undergone a reset.  On one hand we should at least virtualize the
> > NoSoftRst bit to allow the guest to restore the device, but I wonder if
> > that's really safe.  Is a better option to prevent entering D3cold if
> > the device isn't natively reporting NoSoftRst-?
> >   
> 
>  You mean to say NoSoftRst+ instead of NoSoftRst- as visible in

Oops yes.  The concern is if the user/guest is not expecting a soft
reset when using D3hot, but we transparently promote D3hot to D3cold,
which always implies a device reset.

>  the lspci output. For NoSoftRst- case, we do a soft reset on
>  D3hot->D0 transition. But, will this case not be handled internally
>  in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend,
>  we check for pci_dev->state_saved flag and do pci_save_state()
>  irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars()
>  will be called in pci_raw_set_power_state() which will reinitialize device
>  for D3hot/D3cold-> D0 case. Once the device is initialized in the host,
>  then for guest, it should work without re-initializing again in the
>  guest side. I am not sure, if my understanding is correct.

The soft reset is not limited to the state that the PCI subsystem can
save and restore.  Device specific state that the user/guest may
legitimately expect to be retained may be reset as well.

[PCIe v5 5.3.1.4]
	Functional context is required to be maintained by Functions in
	the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.

Unfortunately I don't see a specific definition of "functional
context", but I interpret that to include device specific state.  For
example, if a GPU contains specific frame buffer data and reports
NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
to be retained on D3hot->D0 transition?
 
> > We're also essentially making a policy decision on behalf of
> > userspace that favors power saving over latency.  Is that
> > universally the correct trade-off?   
> 
>  For most drivers, the D3hot vs D3cold should not be favored due
>  to latency reasons. In the linux kernel side, I am seeing, the
>  PCI framework try to use D3cold state if platform and device
>  supports that. But its correct that covertly replacing D3hot with
>  D3cold may be concern for some drivers.
> 
> > I can imagine this could be desirable for many use cases,
> > but if we're going to covertly replace D3hot with D3cold, it seems
> > like there should be an opt-in.  Is co-opting the PM capability for
> > this even really acceptable or should there be a device ioctl to
> > request D3cold and plumbing through QEMU such that a VM guest can
> > make informed choices regarding device power management?
> >   
> 
>  Making IOCTL is also an option but that case, this support needs to
>  be added in all hypervisors and user must pass this information
>  explicitly for each device. Another option could be to use
>  module parameter to explicitly enable D3cold support. If module
>  parameter is not set, then we can call pci_d3cold_disable() and
>  in that case, runtime PM should not use D3cold state.
> 
>  Also, I was checking we can pass this information though some
>  virtualized register bit which will be only defined for passing
>  the information between guest and host. In the guest side if the
>  target state is being decided with pci_target_state(), then
>  the D3cold vs D3hot should not matter for the driver running
>  in the guest side and in that case, it depends upon platform support.
>  We can set this virtualize bit to 1. But, if driver is either
>  setting D3hot state explicitly or has called pci_d3cold_disable() or
>  similar API available in the guest OS, then set this bit to 0 and
>  in that case, the D3cold state can be disabled in the host side.
>  But don't know if is possible to use some non PCI defined
>  virtualized register bit. 

If you're suggesting a device config space register, that's troublesome
because we can't guarantee that simply because a range of config space
isn't within a capability that it doesn't have some device specific
purpose.  However, we could certainly implement virtual registers in
the hypervisor that implement the ACPI support that an OS would use on
bare metal to implement D3cold.  Those could trigger this ioctl through
the vfio device.

>  I am not sure what should be best option to make choice
>  regarding d3cold but if we can have some option by which this
>  can be done without involvement of user, then it will benefit
>  for lot of cases. Currently, the D3cold is supported only in
>  very few desktops/servers but in future, we will see on
>  most of the platforms.  

I tend to see it as an interesting hack to promote D3hot to D3cold, and
potentially very useful.  However, we're also introducing potentially
unexpected device behavior, so I think it would probably need to be an
opt-in.  Possibly if the device reports NoSoftRst- we could use it by
default, but even virtualizing the NoSoftRst suggests that there's an
expectation that the guest driver has that support available.
 
> > Also if the device is not responsive to config space due to the user
> > placing it in D3 now, I'd expect there are other ioctl paths that
> > need to be blocked, maybe even MMIO paths that might be a gap for
> > existing D3hot support.  Thanks,
> 
>  I was in assumption that most of IOCTL code will be called by the
>  hypervisor before guest OS boot and during that time, the device
>  will be always in D0. But, if we have paths where IOCTL can be
>  called when the device has been suspended by guest OS, then can we
>  use runtime_get/put API’s there also ?

It's more a matter of preventing user actions that can cause harm
rather than expecting certain operations only in specific states.  We
could chose to either resume the device for those operations or fail
the operation.  We should probably also leverage the memory-disable
support to fault mmap access to MMIO when the device is in D3* as well.
Thanks,

Alex
Alex Williamson Nov. 19, 2021, 3:45 p.m. UTC | #4
On Thu, 18 Nov 2021 14:09:13 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 18 Nov 2021 20:51:41 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> > On 11/17/2021 11:23 PM, Alex Williamson wrote:
> >  Thanks Alex for checking this series and providing your inputs. 
> >    
> > > If we're transitioning a device to D3cold rather than D3hot as
> > > requested by userspace, isn't that a user visible change?     
> > 
> >   For most of the driver, in linux kernel, the D3hot vs D3cold
> >   state will be decided at PCI core layer. In the PCI core layer,
> >   pci_target_state() determines which D3 state to choose. It checks
> >   for platform_pci_power_manageable() and then it calls
> >   platform_pci_choose_state() to find the target state.
> >   In VM, the platform_pci_power_manageable() check will fail if the
> >   guest is linux OS. So, it uses, D3hot state.  
> 
> Right, but my statement is really more that the device PM registers
> cannot be used to put the device into D3cold, so the write of the PM
> register that we're trapping was the user/guest's intention to put the
> device into D3hot.  We therefore need to be careful about differences
> in the resulting device state when it comes out of D3cold vs D3hot.
> 
> >   But there are few drivers which does not use the PCI framework
> >   generic power related routines during runtime suspend/system suspend
> >   and set the PCI power state directly with D3hot.  
> 
> Current vfio-pci being one of those ;)
> 
> >   Also, the guest can be non-Linux OS also and, in that case,
> >   it will be difficult to know the behavior. So, it may impact
> >   these cases.  
> 
> That's what I'm worried about.
> 
> > > For instance, a device may report NoSoftRst- indicating that the device
> > > does not do a soft reset on D3hot->D0 transition.  If we're instead
> > > putting the device in D3cold, then a transition back to D0 has very
> > > much undergone a reset.  On one hand we should at least virtualize the
> > > NoSoftRst bit to allow the guest to restore the device, but I wonder if
> > > that's really safe.  Is a better option to prevent entering D3cold if
> > > the device isn't natively reporting NoSoftRst-?
> > >     
> > 
> >  You mean to say NoSoftRst+ instead of NoSoftRst- as visible in  
> 
> Oops yes.  The concern is if the user/guest is not expecting a soft
> reset when using D3hot, but we transparently promote D3hot to D3cold
> which will always implies a device reset.
> 
> >  the lspci output. For NoSoftRst- case, we do a soft reset on
> >  D3hot->D0 transition. But, will this case not be handled internally
> >  in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend,
> >  we check for pci_dev->state_saved flag and do pci_save_state()
> >  irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars()
> >  will be called in pci_raw_set_power_state() which will reinitialize device
> >  for D3hot/D3cold-> D0 case. Once the device is initialized in the host,
> >  then for guest, it should work without re-initializing again in the
> >  guest side. I am not sure, if my understanding is correct.  
> 
> The soft reset is not limited to the state that the PCI subsystem can
> save and restore.  Device specific state that the user/guest may
> legitimately expect to be retained may be reset as well.
> 
> [PCIe v5 5.3.1.4]
> 	Functional context is required to be maintained by Functions in
> 	the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.
> 
> Unfortunately I don't see a specific definition of "functional
> context", but I interpret that to include device specific state.  For
> example, if a GPU contains specific frame buffer data and reports
> NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
> to be retained on D3hot->D0 transition?
>  
> > > We're also essentially making a policy decision on behalf of
> > > userspace that favors power saving over latency.  Is that
> > > universally the correct trade-off?     
> > 
> >  For most drivers, the D3hot vs D3cold should not be favored due
> >  to latency reasons. In the linux kernel side, I am seeing, the
> >  PCI framework try to use D3cold state if platform and device
> >  supports that. But its correct that covertly replacing D3hot with
> >  D3cold may be concern for some drivers.
> >   
> > > I can imagine this could be desirable for many use cases,
> > > but if we're going to covertly replace D3hot with D3cold, it seems
> > > like there should be an opt-in.  Is co-opting the PM capability for
> > > this even really acceptable or should there be a device ioctl to
> > > request D3cold and plumbing through QEMU such that a VM guest can
> > > make informed choices regarding device power management?
> > >     
> > 
> >  Adding an IOCTL is also an option, but in that case, this support
> >  needs to be added in all hypervisors and the user must pass this
> >  information explicitly for each device. Another option could be a
> >  module parameter to explicitly enable D3cold support. If the module
> >  parameter is not set, then we can call pci_d3cold_disable() and,
> >  in that case, runtime PM should not use the D3cold state.
> > 
> >  Also, I was checking whether we can pass this information through
> >  some virtualized register bit which would be defined only for
> >  passing the information between guest and host. On the guest side,
> >  if the target state is being decided with pci_target_state(), then
> >  D3cold vs D3hot should not matter for the driver running in the
> >  guest, and in that case, it depends upon platform support. We can
> >  set this virtualized bit to 1. But if the driver is either setting
> >  the D3hot state explicitly or has called pci_d3cold_disable() or a
> >  similar API available in the guest OS, then we set this bit to 0,
> >  and in that case, the D3cold state can be disabled on the host
> >  side. But I don't know if it is possible to use some non-PCI-defined
> >  virtualized register bit.
> 
> If you're suggesting a device config space register, that's troublesome
> because we can't guarantee that simply because a range of config space
> isn't within a capability that it doesn't have some device specific
> purpose.  However, we could certainly implement virtual registers in
> the hypervisor that implement the ACPI support that an OS would use on
> bare metal to implement D3cold.  Those could trigger this ioctl through
> the vfio device.
> 
> >  I am not sure what the best option is for making the choice
> >  regarding D3cold, but if we can have some option by which this
> >  can be done without the involvement of the user, then it will
> >  benefit a lot of cases. Currently, D3cold is supported only on
> >  very few desktops/servers, but in the future, we will see it on
> >  most platforms.
> 
> I tend to see it as an interesting hack to promote D3hot to D3cold, and
> potentially very useful.  However, we're also introducing potentially
> unexpected device behavior, so I think it would probably need to be an
> opt-in.  Possibly if the device reports NoSoftRst- we could use it by
> default, but even virtualizing the NoSoftRst suggests that there's an
> expectation that the guest driver has that support available.
>  
> > > Also if the device is not responsive to config space due to the user
> > > placing it in D3 now, I'd expect there are other ioctl paths that
> > > need to be blocked, maybe even MMIO paths that might be a gap for
> > > existing D3hot support.  Thanks,  
> > 
> >  I was under the assumption that most of the IOCTL code will be
> >  called by the hypervisor before guest OS boot, and during that
> >  time, the device will always be in D0. But if we have paths where
> >  an IOCTL can be called when the device has been suspended by the
> >  guest OS, then can we use the runtime_get/put APIs there also?
> 
> It's more a matter of preventing user actions that can cause harm
> rather than expecting certain operations only in specific states.  We
> could choose to either resume the device for those operations or fail
> the operation.  We should probably also leverage the memory-disable
> support to fault mmap access to MMIO when the device is in D3* as well.

It also occurred to me last night that a guest triggering D3hot via the
PM registers must be a synchronous power state change, we can't use
auto-suspend.  This is necessary for nested assignment where the guest
might use a D3hot->D0 power state transition with NoSoftRst- devices in
order to perform a reset of the device.  With auto-suspend, the guest
would return the device to D0 before the physical device ever timed out
to enter a D3 state.  Thanks,

Alex
Abhishek Sahu Nov. 23, 2021, 7:27 a.m. UTC | #5
On 11/19/2021 9:15 PM, Alex Williamson wrote:
> On Thu, 18 Nov 2021 14:09:13 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Thu, 18 Nov 2021 20:51:41 +0530
>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>> On 11/17/2021 11:23 PM, Alex Williamson wrote:
>>>  Thanks Alex for checking this series and providing your inputs.
>>>
>>>> If we're transitioning a device to D3cold rather than D3hot as
>>>> requested by userspace, isn't that a user visible change?
>>>
>>>   For most of the drivers in the Linux kernel, the D3hot vs D3cold
>>>   state is decided at the PCI core layer. In the PCI core layer,
>>>   pci_target_state() determines which D3 state to choose. It checks
>>>   for platform_pci_power_manageable() and then calls
>>>   platform_pci_choose_state() to find the target state.
>>>   In a VM, the platform_pci_power_manageable() check will fail if
>>>   the guest is a Linux OS, so it uses the D3hot state.
>>
>> Right, but my statement is really more that the device PM registers
>> cannot be used to put the device into D3cold, so the write of the PM
>> register that we're trapping was the user/guest's intention to put the
>> device into D3hot.  We therefore need to be careful about differences
>> in the resulting device state when it comes out of D3cold vs D3hot.
>>>   But there are a few drivers which do not use the PCI framework's
>>>   generic power related routines during runtime suspend/system suspend
>>>   and set the PCI power state directly to D3hot.
>>
>> Current vfio-pci being one of those ;)
>>
>>>   Also, the guest can be a non-Linux OS, and in that case,
>>>   it will be difficult to know the behavior. So, it may impact
>>>   these cases.
>>
>> That's what I'm worried about.
>>
>>>> For instance, a device may report NoSoftRst- indicating that the device
>>>> does not do a soft reset on D3hot->D0 transition.  If we're instead
>>>> putting the device in D3cold, then a transition back to D0 has very
>>>> much undergone a reset.  On one hand we should at least virtualize the
>>>> NoSoftRst bit to allow the guest to restore the device, but I wonder if
>>>> that's really safe.  Is a better option to prevent entering D3cold if
>>>> the device isn't natively reporting NoSoftRst-?
>>>>
>>>
>>>  You mean to say NoSoftRst+ instead of NoSoftRst- as visible in
>>
>> Oops yes.  The concern is if the user/guest is not expecting a soft
>> reset when using D3hot, but we transparently promote D3hot to D3cold,
>> which always implies a device reset.
>>
>>>  the lspci output. For NoSoftRst- case, we do a soft reset on
>>>  D3hot->D0 transition. But, will this case not be handled internally
>>>  in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend,
>>>  we check for pci_dev->state_saved flag and do pci_save_state()
>>>  irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars()
>>>  will be called in pci_raw_set_power_state() which will reinitialize device
>>>  for D3hot/D3cold-> D0 case. Once the device is initialized in the host,
>>>  then for guest, it should work without re-initializing again in the
>>>  guest side. I am not sure if my understanding is correct.
>>
>> The soft reset is not limited to the state that the PCI subsystem can
>> save and restore.  Device specific state that the user/guest may
>> legitimately expect to be retained may be reset as well.
>>
>> [PCIe v5 5.3.1.4]
>>       Functional context is required to be maintained by Functions in
>>       the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.
>>
>> Unfortunately I don't see a specific definition of "functional
>> context", but I interpret that to include device specific state.  For
>> example, if a GPU contains specific frame buffer data and reports
>> NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
>> to be retained on D3hot->D0 transition?
>>

 Thanks Alex for your inputs.

 I got your point. Yes. That can happen.

>>>> We're also essentially making a policy decision on behalf of
>>>> userspace that favors power saving over latency.  Is that
>>>> universally the correct trade-off?
>>>
>>>  For most drivers, D3hot should not be favored over D3cold due
>>>  to latency reasons. On the Linux kernel side, I see that the
>>>  PCI framework tries to use the D3cold state if the platform and
>>>  device support it. But it's correct that covertly replacing D3hot
>>>  with D3cold may be a concern for some drivers.
>>>
>>>> I can imagine this could be desirable for many use cases,
>>>> but if we're going to covertly replace D3hot with D3cold, it seems
>>>> like there should be an opt-in.  Is co-opting the PM capability for
>>>> this even really acceptable or should there be a device ioctl to
>>>> request D3cold and plumbing through QEMU such that a VM guest can
>>>> make informed choices regarding device power management?
>>>>
>>>
>>>  Adding an IOCTL is also an option, but in that case, this support
>>>  needs to be added in all hypervisors and the user must pass this
>>>  information explicitly for each device. Another option could be a
>>>  module parameter to explicitly enable D3cold support. If the module
>>>  parameter is not set, then we can call pci_d3cold_disable() and,
>>>  in that case, runtime PM should not use the D3cold state.
>>>
>>>  Also, I was checking whether we can pass this information through
>>>  some virtualized register bit which would be defined only for
>>>  passing the information between guest and host. On the guest side,
>>>  if the target state is being decided with pci_target_state(), then
>>>  D3cold vs D3hot should not matter for the driver running in the
>>>  guest, and in that case, it depends upon platform support. We can
>>>  set this virtualized bit to 1. But if the driver is either setting
>>>  the D3hot state explicitly or has called pci_d3cold_disable() or a
>>>  similar API available in the guest OS, then we set this bit to 0,
>>>  and in that case, the D3cold state can be disabled on the host
>>>  side. But I don't know if it is possible to use some non-PCI-defined
>>>  virtualized register bit.
>>
>> If you're suggesting a device config space register, that's troublesome
>> because we can't guarantee that simply because a range of config space
>> isn't within a capability that it doesn't have some device specific
>> purpose.  However, we could certainly implement virtual registers in
>> the hypervisor that implement the ACPI support that an OS would use on
>> bare metal to implement D3cold.  Those could trigger this ioctl through
>> the vfio device.
>>

 Yes. I was suggesting using some bits in the PM_CTRL register.
 We can identify bits which are currently unused, like bits 22 and 23:

 23:22 Undefined - these bits were defined in previous specifications.
 They should be ignored by software.

 and virtualize these bits for passing D3cold related information
 from guest to host. These bits were defined for PCI-to-PCI
 bridges earlier. But this would be a hack and would cause issues
 if the PCI specification starts using these bits in future
 revisions. It would also need changes in the other guest OSes.
 So, it won't be a good idea.
 
>>>  I am not sure what the best option is for making the choice
>>>  regarding D3cold, but if we can have some option by which this
>>>  can be done without the involvement of the user, then it will
>>>  benefit a lot of cases. Currently, D3cold is supported only on
>>>  very few desktops/servers, but in the future, we will see it on
>>>  most platforms.
>>
>> I tend to see it as an interesting hack to promote D3hot to D3cold, and
>> potentially very useful.  However, we're also introducing potentially
>> unexpected device behavior, so I think it would probably need to be an
>> opt-in.  Possibly if the device reports NoSoftRst- we could use it by
>> default, but even virtualizing the NoSoftRst suggests that there's an
>> expectation that the guest driver has that support available.
>>

 Sure. Based on the discussion, the best option is to provide an
 IOCTL to the user for the transition to D3cold. The hypervisor can
 implement the required ACPI power resources for D3hot and D0, and
 then once the guest OS calls these power resource methods, it can
 invoke the IOCTL to the host side vfio-pci driver.

 The D0/D1/D2/D3hot transitions can happen in the existing way,
 by directly writing into the PM config register, and the IOCTL
 needs to handle the transition from D3hot to D3cold via the
 runtime PM framework.

 I will do more analysis regarding this by making the code changes
 and will update.

>>>> Also if the device is not responsive to config space due to the user
>>>> placing it in D3 now, I'd expect there are other ioctl paths that
>>>> need to be blocked, maybe even MMIO paths that might be a gap for
>>>> existing D3hot support.  Thanks,
>>>
>>>  I was under the assumption that most of the IOCTL code will be
>>>  called by the hypervisor before guest OS boot, and during that
>>>  time, the device will always be in D0. But if we have paths where
>>>  an IOCTL can be called when the device has been suspended by the
>>>  guest OS, then can we use the runtime_get/put APIs there also?
>>
>> It's more a matter of preventing user actions that can cause harm
>> rather than expecting certain operations only in specific states.  We
>> could choose to either resume the device for those operations or fail
>> the operation.  We should probably also leverage the memory-disable
>> support to fault mmap access to MMIO when the device is in D3* as well.
> 

 Sure. I can explore this as well.

> It also occurred to me last night that a guest triggering D3hot via the
> PM registers must be a synchronous power state change, we can't use
> auto-suspend.  This is necessary for nested assignment where the guest
> might use a D3hot->D0 power state transition with NoSoftRst- devices in
> order to perform a reset of the device.  With auto-suspend, the guest
> would return the device to D0 before the physical device ever timed out
> to enter a D3 state.  Thanks,
> 
> Alex
> 

 Okay. I think if we use the IOCTL based way to trigger D3cold, then
 autosuspend should not be required. Also, in that case, the transition
 to D3hot can happen in the existing way of writing directly into the
 PCI PM register. I will validate this use-case after making the
 changes with the IOCTL based design. I will make the changes in QEMU
 locally to validate my changes.

 Thanks,
 Abhishek

Patch

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index fb3a503a5b99..5576eb4308b4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -25,6 +25,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/slab.h>
+#include <linux/pm_runtime.h>
 
 #include <linux/vfio_pci_core.h>
 
@@ -119,12 +120,51 @@  struct perm_bits {
 #define	NO_WRITE	0
 #define	ALL_WRITE	0xFFFFFFFFU
 
-static int vfio_user_config_read(struct pci_dev *pdev, int offset,
+static void vfio_pci_config_pm_runtime_get(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct device *parent = dev->parent;
+
+	if (parent)
+		pm_runtime_get_sync(parent);
+
+	pm_runtime_get_noresume(dev);
+	/*
+	 * pdev->current_state is set to PCI_D3cold during suspending,
+	 * so wait until suspending completes
+	 */
+	pm_runtime_barrier(dev);
+	/*
+	 * Only need to resume devices in D3cold, because config
+	 * registers are still accessible for devices suspended but
+	 * not in D3cold.
+	 */
+	if (pdev->current_state == PCI_D3cold)
+		pm_runtime_resume(dev);
+}
+
+static void vfio_pci_config_pm_runtime_put(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct device *parent = dev->parent;
+
+	pm_runtime_mark_last_busy(dev);
+	pm_runtime_put_noidle(dev);
+
+	if (parent)
+		pm_runtime_put_noidle(parent);
+}
+
+static int vfio_user_config_read(struct vfio_pci_core_device *vdev, int offset,
 				 __le32 *val, int count)
 {
+	struct pci_dev *pdev = vdev->pdev;
 	int ret = -EINVAL;
 	u32 tmp_val = 0;
 
+	if (vdev->pm_runtime_suspended)
+		vfio_pci_config_pm_runtime_get(pdev);
+
 	switch (count) {
 	case 1:
 	{
@@ -147,15 +187,22 @@  static int vfio_user_config_read(struct pci_dev *pdev, int offset,
 
 	*val = cpu_to_le32(tmp_val);
 
+	if (vdev->pm_runtime_suspended)
+		vfio_pci_config_pm_runtime_put(pdev);
+
 	return ret;
 }
 
-static int vfio_user_config_write(struct pci_dev *pdev, int offset,
+static int vfio_user_config_write(struct vfio_pci_core_device *vdev, int offset,
 				  __le32 val, int count)
 {
+	struct pci_dev *pdev = vdev->pdev;
 	int ret = -EINVAL;
 	u32 tmp_val = le32_to_cpu(val);
 
+	if (vdev->pm_runtime_suspended)
+		vfio_pci_config_pm_runtime_get(pdev);
+
 	switch (count) {
 	case 1:
 		ret = pci_user_write_config_byte(pdev, offset, tmp_val);
@@ -168,6 +215,9 @@  static int vfio_user_config_write(struct pci_dev *pdev, int offset,
 		break;
 	}
 
+	if (vdev->pm_runtime_suspended)
+		vfio_pci_config_pm_runtime_put(pdev);
+
 	return ret;
 }
 
@@ -183,11 +233,10 @@  static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos,
 
 	/* Any non-virtualized bits? */
 	if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
-		struct pci_dev *pdev = vdev->pdev;
 		__le32 phys_val = 0;
 		int ret;
 
-		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		ret = vfio_user_config_read(vdev, pos, &phys_val, count);
 		if (ret)
 			return ret;
 
@@ -224,18 +273,17 @@  static int vfio_default_config_write(struct vfio_pci_core_device *vdev, int pos,
 
 	/* Non-virtualzed and writable bits go to hardware */
 	if (write & ~virt) {
-		struct pci_dev *pdev = vdev->pdev;
 		__le32 phys_val = 0;
 		int ret;
 
-		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		ret = vfio_user_config_read(vdev, pos, &phys_val, count);
 		if (ret)
 			return ret;
 
 		phys_val &= ~(write & ~virt);
 		phys_val |= (val & (write & ~virt));
 
-		ret = vfio_user_config_write(pdev, pos, phys_val, count);
+		ret = vfio_user_config_write(vdev, pos, phys_val, count);
 		if (ret)
 			return ret;
 	}
@@ -250,7 +298,7 @@  static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_read(vdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -275,7 +323,7 @@  static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_write(vdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -288,7 +336,7 @@  static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_read(vdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -692,13 +740,86 @@  static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
 	return 0;
 }
 
+static int vfio_perform_runtime_pm(struct vfio_pci_core_device *vdev, int pos,
+				   int count, struct perm_bits *perm,
+				   int offset, __le32 val, pci_power_t state)
+{
+	/*
+	 * If runtime PM is enabled, then instead of directly writing
+	 * PCI_PM_CTRL register, decrement the device usage counter whenever
+	 * the power state is non-D0. The kernel runtime PM framework will
+	 * itself put the PCI device in the low power state when device usage
+	 * counter will become zero. The guest OS will read the PCI_PM_CTRL
+	 * register back to confirm the current power state so virtual
+	 * register bits can be used. For this, read the actual PCI_PM_CTRL
+	 * register, update the power state related bits and then update the
+	 * vconfig bits corresponding to PCI_PM_CTRL offset. If the
+	 * pm_runtime_suspended status is set, then return the virtual
+	 * register value for PCI_PM_CTRL register read. All the bits
+	 * in PCI_PM_CTRL are being returned from virtual config, so that
+	 * this register read will not wake up the PCI device from suspended
+	 * state.
+	 *
+	 * Once power state will be changed back to D0, then clear the power
+	 * state related bits in vconfig. After this, increment the device
+	 * usage counter which will internally wake up the PCI device from
+	 * suspended state.
+	 */
+	if (state != PCI_D0 && !vdev->pm_runtime_suspended) {
+		__le32 virt_val = 0;
+
+		count = vfio_default_config_write(vdev, pos, count, perm,
+						  offset, val);
+		if (count < 0)
+			return count;
+
+		vfio_default_config_read(vdev, pos, 4, perm, offset, &virt_val);
+		virt_val &= ~cpu_to_le32(PCI_PM_CTRL_STATE_MASK);
+		virt_val |= (val & cpu_to_le32(PCI_PM_CTRL_STATE_MASK));
+		memcpy(vdev->vconfig + pos, &virt_val, 4);
+		vdev->pm_runtime_suspended = true;
+		pm_runtime_mark_last_busy(&vdev->pdev->dev);
+		pm_runtime_put_autosuspend(&vdev->pdev->dev);
+		return count;
+	}
+
+	if (vdev->pm_runtime_suspended && state == PCI_D0) {
+		vdev->pm_runtime_suspended = false;
+		*(__le16 *)&vdev->vconfig[pos] &=
+			~cpu_to_le16(PCI_PM_CTRL_STATE_MASK);
+		pm_runtime_get_sync(&vdev->pdev->dev);
+	}
+
+	return vfio_default_config_write(vdev, pos, count, perm, offset, val);
+}
+
+static int vfio_pm_config_read(struct vfio_pci_core_device *vdev, int pos,
+			       int count, struct perm_bits *perm,
+			       int offset, __le32 *val)
+{
+	/*
+	 * If pm_runtime_suspended status is set, then return the virtual
+	 * register value for PCI_PM_CTRL register read.
+	 */
+	if (vdev->pm_runtime_suspended &&
+	    offset >= PCI_PM_CTRL && offset < (PCI_PM_CTRL + 4)) {
+		memcpy(val, vdev->vconfig + pos, count);
+		return count;
+	}
+
+	return vfio_default_config_read(vdev, pos, count, perm, offset, val);
+}
+
 static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 				int count, struct perm_bits *perm,
 				int offset, __le32 val)
 {
-	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
-	if (count < 0)
-		return count;
+	if (offset != PCI_PM_CTRL) {
+		count = vfio_default_config_write(vdev, pos, count, perm,
+						  offset, val);
+		if (count < 0)
+			return count;
+	}
 
 	if (offset == PCI_PM_CTRL) {
 		pci_power_t state;
@@ -718,7 +839,8 @@  static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 			break;
 		}
 
-		vfio_pci_set_power_state(vdev, state);
+		count = vfio_perform_runtime_pm(vdev, pos, count, perm,
+						offset, val, state);
 	}
 
 	return count;
@@ -731,6 +853,7 @@  static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 		return -ENOMEM;
 
 	perm->writefn = vfio_pm_config_write;
+	perm->readfn = vfio_pm_config_read;
 
 	/*
 	 * We always virtualize the next field so we can remove
@@ -1921,13 +2044,31 @@  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 	size_t done = 0;
 	int ret = 0;
 	loff_t pos = *ppos;
+	bool runtime_put_required = false;
 
 	pos &= VFIO_PCI_OFFSET_MASK;
 
+	/*
+	 * For virtualized bits read/write, the device should not be resumed.
+	 * Increment the device usage count alone so that the device won't be
+	 * runtime suspended during config read/write.
+	 */
+	if (vdev->pm_runtime_suspended) {
+		pm_runtime_get_noresume(&vdev->pdev->dev);
+		runtime_put_required = true;
+	}
+
 	while (count) {
 		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
-		if (ret < 0)
+		if (ret < 0) {
+			/*
+			 * Decrement the device usage counter corresponding to
+			 * previous pm_runtime_get_noresume().
+			 */
+			if (runtime_put_required)
+				pm_runtime_put_autosuspend(&vdev->pdev->dev);
 			return ret;
+		}
 
 		count -= ret;
 		done += ret;
@@ -1937,5 +2078,12 @@  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 
 	*ppos += done;
 
+	/*
+	 * Decrement the device usage counter corresponding to previous
+	 * pm_runtime_get_noresume().
+	 */
+	if (runtime_put_required)
+		pm_runtime_put_autosuspend(&vdev->pdev->dev);
+
 	return done;
 }
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 511a52e92b32..79fa86914b6c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -187,57 +187,6 @@  static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
-static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	u16 pmcsr;
-
-	if (!pdev->pm_cap)
-		return;
-
-	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
-
-	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
-}
-
-/*
- * pci_set_power_state() wrapper handling devices which perform a soft reset on
- * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
- * restore when returned to D0.  Saved separately from pci_saved_state for use
- * by PM capability emulation and separately from pci_dev internal saved state
- * to avoid it being overwritten and consumed around other resets.
- */
-int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	bool needs_restore = false, needs_save = false;
-	int ret;
-
-	if (vdev->needs_pm_restore) {
-		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
-			pci_save_state(pdev);
-			needs_save = true;
-		}
-
-		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
-			needs_restore = true;
-	}
-
-	ret = pci_set_power_state(pdev, state);
-
-	if (!ret) {
-		/* D3 might be unsupported via quirk, skip unless in D3 */
-		if (needs_save && pdev->current_state >= PCI_D3hot) {
-			vdev->pm_save = pci_store_saved_state(pdev);
-		} else if (needs_restore) {
-			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
-			pci_restore_state(pdev);
-		}
-	}
-
-	return ret;
-}
-
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -323,6 +272,16 @@  void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	/* For needs_reset */
 	lockdep_assert_held(&vdev->vdev.dev_set->lock);
 
+	/*
+	 * The vfio device user can close the device after putting the device
+	 * into runtime suspended state so wake up the device first in
+	 * this case.
+	 */
+	if (vdev->pm_runtime_suspended) {
+		vdev->pm_runtime_suspended = false;
+		pm_runtime_get_sync(&pdev->dev);
+	}
+
 	/* Stop the device from further DMA */
 	pci_clear_master(pdev);
 
@@ -1809,7 +1768,6 @@  void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev)
 	mutex_destroy(&vdev->vma_lock);
 	vfio_uninit_group_dev(&vdev->vdev);
 	kfree(vdev->region);
-	kfree(vdev->pm_save);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device);
 
@@ -1855,8 +1813,6 @@  int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 	if (ret)
 		goto out_vf;
 
-	vfio_pci_probe_power_state(vdev);
-
 	/*
 	 * pci-core sets the device power state to an unknown value at
 	 * bootup and after being removed from a driver.  The only
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index aafe09c9fa64..2b1a556ce73f 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -123,9 +123,8 @@  struct vfio_pci_core_device {
 	bool			has_vga;
 	bool			needs_reset;
 	bool			nointx;
-	bool			needs_pm_restore;
+	bool                    pm_runtime_suspended;
 	struct pci_saved_state	*pci_saved_state;
-	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;