diff mbox series

[RFC,v2,3/5] vfio/pci: fix memory leak during D3hot to D0 tranistion

Message ID 20220124181726.19174-4-abhsahu@nvidia.com (mailing list archive)
State New, archived
Headers show
Series vfio/pci: Enable runtime power management support | expand

Commit Message

Abhishek Sahu Jan. 24, 2022, 6:17 p.m. UTC
If needs_pm_restore is set (PCI device does not have support for no
soft reset), then the current PCI state will be saved during D0->D3hot
transition and same will be restored back during D3hot->D0 transition.
For saving the PCI state locally, pci_store_saved_state() is being
used and the pci_load_and_free_saved_state() will free the allocated
memory.

But for reset related IOCTLs, vfio driver calls PCI reset related
API's which will internally change the PCI power state back to D0. So,
when the guest resumes, then it will get the current state as D0 and it
will skip the call to vfio_pci_set_power_state() for changing the
power state to D0 explicitly. In this case, the memory pointed by
pm_save will never be freed.

Also, in malicious sequence, the state changing to D3hot followed by
VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and
it can cause an OOM situation. This patch stores the power state locally
and uses the same for comparing the current power state. For the
places where D0 transition can happen, call vfio_pci_set_power_state()
to transition to D0 state. Since the vfio power state is still D3hot,
so this D0 transition will help in running the logic required
from D3hot->D0 transition. Also, to prevent any miss during
future development to detect this condition, this patch puts a
check and frees the memory after printing warning.

This locally saved power state will help in subsequent patches
also.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |  1 +
 2 files changed, 51 insertions(+), 3 deletions(-)

Comments

Alex Williamson Jan. 28, 2022, 12:05 a.m. UTC | #1
On Mon, 24 Jan 2022 23:47:24 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> If needs_pm_restore is set (PCI device does not have support for no
> soft reset), then the current PCI state will be saved during D0->D3hot
> transition and same will be restored back during D3hot->D0 transition.
> For saving the PCI state locally, pci_store_saved_state() is being
> used and the pci_load_and_free_saved_state() will free the allocated
> memory.
> 
> But for reset related IOCTLs, vfio driver calls PCI reset related
> API's which will internally change the PCI power state back to D0. So,
> when the guest resumes, then it will get the current state as D0 and it
> will skip the call to vfio_pci_set_power_state() for changing the
> power state to D0 explicitly. In this case, the memory pointed by
> pm_save will never be freed.
> 
> Also, in malicious sequence, the state changing to D3hot followed by
> VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and
> it can cause an OOM situation. This patch stores the power state locally
> and uses the same for comparing the current power state. For the
> places where D0 transition can happen, call vfio_pci_set_power_state()
> to transition to D0 state. Since the vfio power state is still D3hot,
> so this D0 transition will help in running the logic required
> from D3hot->D0 transition. Also, to prevent any miss during
> future development to detect this condition, this patch puts a
> check and frees the memory after printing warning.
> 
> This locally saved power state will help in subsequent patches
> also.

Ideally let's put fixes patches at the start of the series, or better
yet send them separately, and don't include changes that only make
sense in the context of a subsequent patch.

Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")

> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |  1 +
>  2 files changed, 51 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index c6e4fe9088c3..ee2fb8af57fa 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -206,6 +206,14 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>   * restore when returned to D0.  Saved separately from pci_saved_state for use
>   * by PM capability emulation and separately from pci_dev internal saved state
>   * to avoid it being overwritten and consumed around other resets.
> + *
> + * There are few cases where the PCI power state can be changed to D0
> + * without the involvement of this API. So, cache the power state locally
> + * and call this API to update the D0 state. It will help in running the
> + * logic that is needed for transitioning to the D0 state. For example,
> + * if needs_pm_restore is set, then the PCI state will be saved locally.
> + * The memory taken for saving this PCI state needs to be freed to
> + * prevent memory leak.
>   */
>  int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
>  {
> @@ -214,20 +222,34 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	int ret;
>  
>  	if (vdev->needs_pm_restore) {
> -		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
> +		if (vdev->power_state < PCI_D3hot && state >= PCI_D3hot) {
>  			pci_save_state(pdev);
>  			needs_save = true;
>  		}
>  
> -		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
> +		if (vdev->power_state >= PCI_D3hot && state <= PCI_D0)
>  			needs_restore = true;
>  	}
>  
>  	ret = pci_set_power_state(pdev, state);
>  
>  	if (!ret) {
> +		vdev->power_state = pdev->current_state;
> +
>  		/* D3 might be unsupported via quirk, skip unless in D3 */
> -		if (needs_save && pdev->current_state >= PCI_D3hot) {
> +		if (needs_save && vdev->power_state >= PCI_D3hot) {
> +			/*
> +			 * If somehow, the vfio driver was not able to free the
> +			 * memory allocated in pm_save, then free the earlier
> +			 * memory first before overwriting pm_save to prevent
> +			 * memory leak.
> +			 */
> +			if (vdev->pm_save) {
> +				pci_warn(pdev,
> +					 "Overwriting saved PCI state pointer so freeing the earlier memory\n");
> +				kfree(vdev->pm_save);
> +			}

The minimal fix for the described issue would simply be:

			kfree(vdev->pm_save);

It seems like the only purpose of the warning is try to make sure we
haven't missed any wake-up calls, where this would be a pretty small
breadcrumb to actually debug such an issue.

> +
>  			vdev->pm_save = pci_store_saved_state(pdev);
>  		} else if (needs_restore) {
>  			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
> @@ -326,6 +348,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	/* For needs_reset */
>  	lockdep_assert_held(&vdev->vdev.dev_set->lock);
>  
> +	/*
> +	 * If disable has been called while the power state is other than D0,
> +	 * then set the power state in vfio driver to D0. It will help
> +	 * in running the logic needed for D0 power state. The subsequent
> +	 * runtime PM API's will put the device into the low power state again.
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +

I do think we have an issue here, but the reason is that pci_pm_reset()
returns -EINVAL if we try to reset a device that isn't currently in D0.
Therefore any path where we're triggering a function reset that could
use a PM reset and we don't know if the device is in D0, should wake up
the device before we try that reset.

We're about to load the initial state of the device that was saved when
it was opened, so I don't think pdev->current_state vs
vdev->power_state matters here, we only care that the device is in D0.

>  	/* Stop the device from further DMA */
>  	pci_clear_master(pdev);
>  
> @@ -929,6 +959,15 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  
>  		vfio_pci_zap_and_down_write_memory_lock(vdev);
>  		ret = pci_try_reset_function(vdev->pdev);
> +
> +		/*
> +		 * If pci_try_reset_function() has been called while the power
> +		 * state is other than D0, then pci_try_reset_function() will
> +		 * internally set the device state to D0 without vfio driver
> +		 * interaction. Update the power state in vfio driver to perform
> +		 * the logic needed for D0 power state.
> +		 */
> +		vfio_pci_set_power_state(vdev, PCI_D0);

For the case where pci_try_reset_function() might use a PM reset, we
should set D0 before that call.  In doing so, the pdev->current_state
should match the actual device power state, so we still don't need to
stash power state on the vdev.

>  		up_write(&vdev->memory_lock);
>  
>  		return ret;
> @@ -2071,6 +2110,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  
>  err_undo:
>  	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> +		/*
> +		 * If pci_reset_bus() has been called while the power
> +		 * state is other than D0, then pci_reset_bus() will
> +		 * internally set the device state to D0 without vfio driver
> +		 * interaction. Update the power state in vfio driver to perform
> +		 * the logic needed for D0 power state.
> +		 */
> +		vfio_pci_set_power_state(cur, PCI_D0);

Here pci_reset_bus() will wakeup the device and I think the concern is
that around that bus reset we'll save and restore the device state, but
that's potentially bogus device state if waking it triggers a soft
reset.  We could again wake devices before the reset to make the state
correct, or we could test pm_save and perform the load and restore if
it exists.  Either of those would avoid needing to cache the power
state on the vdev.  Thanks,

Alex

>  		if (cur == cur_mem)
>  			is_mem = false;
>  		if (cur == cur_vma)
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index aafe09c9fa64..05db838e72cc 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>  	bool			needs_reset;
>  	bool			nointx;
>  	bool			needs_pm_restore;
> +	pci_power_t		power_state;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;
Abhishek Sahu Jan. 31, 2022, 11:34 a.m. UTC | #2
On 1/28/2022 5:35 AM, Alex Williamson wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Mon, 24 Jan 2022 23:47:24 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> If needs_pm_restore is set (PCI device does not have support for no
>> soft reset), then the current PCI state will be saved during D0->D3hot
>> transition and same will be restored back during D3hot->D0 transition.
>> For saving the PCI state locally, pci_store_saved_state() is being
>> used and the pci_load_and_free_saved_state() will free the allocated
>> memory.
>>
>> But for reset related IOCTLs, vfio driver calls PCI reset related
>> API's which will internally change the PCI power state back to D0. So,
>> when the guest resumes, then it will get the current state as D0 and it
>> will skip the call to vfio_pci_set_power_state() for changing the
>> power state to D0 explicitly. In this case, the memory pointed by
>> pm_save will never be freed.
>>
>> Also, in malicious sequence, the state changing to D3hot followed by
>> VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and
>> it can cause an OOM situation. This patch stores the power state locally
>> and uses the same for comparing the current power state. For the
>> places where D0 transition can happen, call vfio_pci_set_power_state()
>> to transition to D0 state. Since the vfio power state is still D3hot,
>> so this D0 transition will help in running the logic required
>> from D3hot->D0 transition. Also, to prevent any miss during
>> future development to detect this condition, this patch puts a
>> check and frees the memory after printing warning.
>>
>> This locally saved power state will help in subsequent patches
>> also.
> 
> Ideally let's put fixes patches at the start of the series, or better
> yet send them separately, and don't include changes that only make
> sense in the context of a subsequent patch.
> 
> Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")
> 

 Thanks Alex for reviewing this patch.
 I have added Fixes tag and sent this patch separately.

 Should I update this patch series or you are planning to review the
 other patches first of this patch series first. 

>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++--
>>  include/linux/vfio_pci_core.h    |  1 +
>>  2 files changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index c6e4fe9088c3..ee2fb8af57fa 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -206,6 +206,14 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>>   * restore when returned to D0.  Saved separately from pci_saved_state for use
>>   * by PM capability emulation and separately from pci_dev internal saved state
>>   * to avoid it being overwritten and consumed around other resets.
>> + *
>> + * There are few cases where the PCI power state can be changed to D0
>> + * without the involvement of this API. So, cache the power state locally
>> + * and call this API to update the D0 state. It will help in running the
>> + * logic that is needed for transitioning to the D0 state. For example,
>> + * if needs_pm_restore is set, then the PCI state will be saved locally.
>> + * The memory taken for saving this PCI state needs to be freed to
>> + * prevent memory leak.
>>   */
>>  int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
>>  {
>> @@ -214,20 +222,34 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>       int ret;
>>
>>       if (vdev->needs_pm_restore) {
>> -             if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
>> +             if (vdev->power_state < PCI_D3hot && state >= PCI_D3hot) {
>>                       pci_save_state(pdev);
>>                       needs_save = true;
>>               }
>>
>> -             if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
>> +             if (vdev->power_state >= PCI_D3hot && state <= PCI_D0)
>>                       needs_restore = true;
>>       }
>>
>>       ret = pci_set_power_state(pdev, state);
>>
>>       if (!ret) {
>> +             vdev->power_state = pdev->current_state;
>> +
>>               /* D3 might be unsupported via quirk, skip unless in D3 */
>> -             if (needs_save && pdev->current_state >= PCI_D3hot) {
>> +             if (needs_save && vdev->power_state >= PCI_D3hot) {
>> +                     /*
>> +                      * If somehow, the vfio driver was not able to free the
>> +                      * memory allocated in pm_save, then free the earlier
>> +                      * memory first before overwriting pm_save to prevent
>> +                      * memory leak.
>> +                      */
>> +                     if (vdev->pm_save) {
>> +                             pci_warn(pdev,
>> +                                      "Overwriting saved PCI state pointer so freeing the earlier memory\n");
>> +                             kfree(vdev->pm_save);
>> +                     }
> 
> The minimal fix for the described issue would simply be:
> 
>                         kfree(vdev->pm_save);
> 
> It seems like the only purpose of the warning is try to make sure we
> haven't missed any wake-up calls, where this would be a pretty small
> breadcrumb to actually debug such an issue.
> 

 I have removed the warning in the updated patch.

>> +
>>                       vdev->pm_save = pci_store_saved_state(pdev);
>>               } else if (needs_restore) {
>>                       pci_load_and_free_saved_state(pdev, &vdev->pm_save);
>> @@ -326,6 +348,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>       /* For needs_reset */
>>       lockdep_assert_held(&vdev->vdev.dev_set->lock);
>>
>> +     /*
>> +      * If disable has been called while the power state is other than D0,
>> +      * then set the power state in vfio driver to D0. It will help
>> +      * in running the logic needed for D0 power state. The subsequent
>> +      * runtime PM API's will put the device into the low power state again.
>> +      */
>> +     vfio_pci_set_power_state(vdev, PCI_D0);
>> +
> 
> I do think we have an issue here, but the reason is that pci_pm_reset()
> returns -EINVAL if we try to reset a device that isn't currently in D0.
> Therefore any path where we're triggering a function reset that could
> use a PM reset and we don't know if the device is in D0, should wake up
> the device before we try that reset.
> 
> We're about to load the initial state of the device that was saved when
> it was opened, so I don't think pdev->current_state vs
> vdev->power_state matters here, we only care that the device is in D0.
> 

 I have added this point in the commit message of the updated patch.

>>       /* Stop the device from further DMA */
>>       pci_clear_master(pdev);
>>
>> @@ -929,6 +959,15 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>
>>               vfio_pci_zap_and_down_write_memory_lock(vdev);
>>               ret = pci_try_reset_function(vdev->pdev);
>> +
>> +             /*
>> +              * If pci_try_reset_function() has been called while the power
>> +              * state is other than D0, then pci_try_reset_function() will
>> +              * internally set the device state to D0 without vfio driver
>> +              * interaction. Update the power state in vfio driver to perform
>> +              * the logic needed for D0 power state.
>> +              */
>> +             vfio_pci_set_power_state(vdev, PCI_D0);
> 
> For the case where pci_try_reset_function() might use a PM reset, we
> should set D0 before that call.  In doing so, the pdev->current_state
> should match the actual device power state, so we still don't need to
> stash power state on the vdev.
> 

 I have set D0 state before calling pci_try_reset_function() in
 the updated patch.

>>               up_write(&vdev->memory_lock);
>>
>>               return ret;
>> @@ -2071,6 +2110,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>
>>  err_undo:
>>       list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>> +             /*
>> +              * If pci_reset_bus() has been called while the power
>> +              * state is other than D0, then pci_reset_bus() will
>> +              * internally set the device state to D0 without vfio driver
>> +              * interaction. Update the power state in vfio driver to perform
>> +              * the logic needed for D0 power state.
>> +              */
>> +             vfio_pci_set_power_state(cur, PCI_D0);
> 
> Here pci_reset_bus() will wakeup the device and I think the concern is
> that around that bus reset we'll save and restore the device state, but
> that's potentially bogus device state if waking it triggers a soft
> reset.  We could again wake devices before the reset to make the state
> correct, or we could test pm_save and perform the load and restore if
> it exists.  Either of those would avoid needing to cache the power
> state on the vdev.  Thanks,
> 

 I have made the changes to wake-up the devices.

 Thanks
 Abhishek

> Alex
> 
>>               if (cur == cur_mem)
>>                       is_mem = false;
>>               if (cur == cur_vma)
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index aafe09c9fa64..05db838e72cc 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>>       bool                    needs_reset;
>>       bool                    nointx;
>>       bool                    needs_pm_restore;
>> +     pci_power_t             power_state;
>>       struct pci_saved_state  *pci_saved_state;
>>       struct pci_saved_state  *pm_save;
>>       int                     ioeventfds_nr;
>
Alex Williamson Jan. 31, 2022, 3:33 p.m. UTC | #3
On Mon, 31 Jan 2022 17:04:12 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 1/28/2022 5:35 AM, Alex Williamson wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Mon, 24 Jan 2022 23:47:24 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> If needs_pm_restore is set (PCI device does not have support for no
> >> soft reset), then the current PCI state will be saved during D0->D3hot
> >> transition and same will be restored back during D3hot->D0 transition.
> >> For saving the PCI state locally, pci_store_saved_state() is being
> >> used and the pci_load_and_free_saved_state() will free the allocated
> >> memory.
> >>
> >> But for reset related IOCTLs, vfio driver calls PCI reset related
> >> API's which will internally change the PCI power state back to D0. So,
> >> when the guest resumes, then it will get the current state as D0 and it
> >> will skip the call to vfio_pci_set_power_state() for changing the
> >> power state to D0 explicitly. In this case, the memory pointed by
> >> pm_save will never be freed.
> >>
> >> Also, in malicious sequence, the state changing to D3hot followed by
> >> VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and
> >> it can cause an OOM situation. This patch stores the power state locally
> >> and uses the same for comparing the current power state. For the
> >> places where D0 transition can happen, call vfio_pci_set_power_state()
> >> to transition to D0 state. Since the vfio power state is still D3hot,
> >> so this D0 transition will help in running the logic required
> >> from D3hot->D0 transition. Also, to prevent any miss during
> >> future development to detect this condition, this patch puts a
> >> check and frees the memory after printing warning.
> >>
> >> This locally saved power state will help in subsequent patches
> >> also.  
> > 
> > Ideally let's put fixes patches at the start of the series, or better
> > yet send them separately, and don't include changes that only make
> > sense in the context of a subsequent patch.
> > 
> > Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")
> >   
> 
>  Thanks Alex for reviewing this patch.
>  I have added Fixes tag and sent this patch separately.
> 
>  Should I update this patch series or you are planning to review the
>  other patches first of this patch series first. 

Thanks for splitting this out.  I'll keep the remainder of the series
on the review queue, I expect I'll have some comments and it will be
easy enough to imagine vfio_pci_core_device.power_state being declared
in another patch if there's still a worthwhile use for it.  Thanks,

Alex
diff mbox series

Patch

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c6e4fe9088c3..ee2fb8af57fa 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -206,6 +206,14 @@  static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
  * restore when returned to D0.  Saved separately from pci_saved_state for use
  * by PM capability emulation and separately from pci_dev internal saved state
  * to avoid it being overwritten and consumed around other resets.
+ *
+ * There are few cases where the PCI power state can be changed to D0
+ * without the involvement of this API. So, cache the power state locally
+ * and call this API to update the D0 state. It will help in running the
+ * logic that is needed for transitioning to the D0 state. For example,
+ * if needs_pm_restore is set, then the PCI state will be saved locally.
+ * The memory taken for saving this PCI state needs to be freed to
+ * prevent memory leak.
  */
 int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
 {
@@ -214,20 +222,34 @@  int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	int ret;
 
 	if (vdev->needs_pm_restore) {
-		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
+		if (vdev->power_state < PCI_D3hot && state >= PCI_D3hot) {
 			pci_save_state(pdev);
 			needs_save = true;
 		}
 
-		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
+		if (vdev->power_state >= PCI_D3hot && state <= PCI_D0)
 			needs_restore = true;
 	}
 
 	ret = pci_set_power_state(pdev, state);
 
 	if (!ret) {
+		vdev->power_state = pdev->current_state;
+
 		/* D3 might be unsupported via quirk, skip unless in D3 */
-		if (needs_save && pdev->current_state >= PCI_D3hot) {
+		if (needs_save && vdev->power_state >= PCI_D3hot) {
+			/*
+			 * If somehow, the vfio driver was not able to free the
+			 * memory allocated in pm_save, then free the earlier
+			 * memory first before overwriting pm_save to prevent
+			 * memory leak.
+			 */
+			if (vdev->pm_save) {
+				pci_warn(pdev,
+					 "Overwriting saved PCI state pointer so freeing the earlier memory\n");
+				kfree(vdev->pm_save);
+			}
+
 			vdev->pm_save = pci_store_saved_state(pdev);
 		} else if (needs_restore) {
 			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
@@ -326,6 +348,14 @@  void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	/* For needs_reset */
 	lockdep_assert_held(&vdev->vdev.dev_set->lock);
 
+	/*
+	 * If disable has been called while the power state is other than D0,
+	 * then set the power state in vfio driver to D0. It will help
+	 * in running the logic needed for D0 power state. The subsequent
+	 * runtime PM API's will put the device into the low power state again.
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+
 	/* Stop the device from further DMA */
 	pci_clear_master(pdev);
 
@@ -929,6 +959,15 @@  long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 
 		vfio_pci_zap_and_down_write_memory_lock(vdev);
 		ret = pci_try_reset_function(vdev->pdev);
+
+		/*
+		 * If pci_try_reset_function() has been called while the power
+		 * state is other than D0, then pci_try_reset_function() will
+		 * internally set the device state to D0 without vfio driver
+		 * interaction. Update the power state in vfio driver to perform
+		 * the logic needed for D0 power state.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
 		up_write(&vdev->memory_lock);
 
 		return ret;
@@ -2071,6 +2110,14 @@  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 
 err_undo:
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+		/*
+		 * If pci_reset_bus() has been called while the power
+		 * state is other than D0, then pci_reset_bus() will
+		 * internally set the device state to D0 without vfio driver
+		 * interaction. Update the power state in vfio driver to perform
+		 * the logic needed for D0 power state.
+		 */
+		vfio_pci_set_power_state(cur, PCI_D0);
 		if (cur == cur_mem)
 			is_mem = false;
 		if (cur == cur_vma)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index aafe09c9fa64..05db838e72cc 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@  struct vfio_pci_core_device {
 	bool			needs_reset;
 	bool			nointx;
 	bool			needs_pm_restore;
+	pci_power_t		power_state;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;