
PCI/IOV: update num_VFs earlier

Message ID 20190329080058.21736-1-pierre.cregut@orange.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show
Series PCI/IOV: update num_VFs earlier

Commit Message

CREGUT Pierre IMT/OLN March 29, 2019, 8 a.m. UTC
Ensure that iov->num_VFs is set before a netlink message is sent
when the number of VFs is changed. Only the path for num_VFs > 0
is affected. The path for num_VFs = 0 is already correct.

Monitoring programs can rely on netlink messages to track interface
changes and query their state in /sys. But when sriov_numvfs is set to a
positive value, the netlink message is sent before the value is available
in sysfs. The value read after the message is received is always zero.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
---
note: the behaviour can be tested with the following shell script, also
available on the bugzilla ($d being the phy device name):

ip monitor dev $d | grep --line-buffered "^[0-9]*:" | \
while read line; do cat /sys/class/net/$d/device/sriov_numvfs; done

 drivers/pci/iov.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Bjorn Helgaas April 5, 2019, 10:33 p.m. UTC | #1
On Fri, Mar 29, 2019 at 09:00:58AM +0100, Pierre Crégut wrote:
> Ensure that iov->num_VFs is set before a netlink message is sent
> when the number of VFs is changed. Only the path for num_VFs > 0
> is affected. The path for num_VFs = 0 is already correct.
> 
> Monitoring programs can relie on netlink messages to track interface
> change and query their state in /sys. But when sriov_numvfs is set to a
> positive value, the netlink message is sent before the value is available
> in sysfs. The value read after the message is received is always zero.

Thanks, Pierre!  Can you clue me in on where exactly the connection
from sriov_enable() to netlink is?

I see one side of the race is with sriov_numvfs_show(), but I don't
know where the netlink message is sent.  Is that connected with the
kobject_uevent(KOBJ_CHANGE)?

One thing this would help with is figuring out exactly how *much*
earlier we need to set iov->num_VFs.  It looks like the current patch
sets it before we actually enable the VFs, so a user could read
/sys/.../sriov_numvfs and get the wrong value.  Of course, that's
unavoidable; the question is whether it's OK to get the new value
*before* it actually takes effect, or whether we want to return a
stale value until after it takes effect.

> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
> Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
> ---
> note: the behaviour can be tested with the following shell script also
> available on the bugzilla (d being the phy device name):
> 
> ip monitor dev $d | grep --line-buffered "^[0-9]*:" | \
> while read line; do cat /sys/class/net/$d/device/sriov_numvfs; done
> 
>  drivers/pci/iov.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 3aa115ed3a65..a9655c10e87f 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -351,6 +351,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  		goto err_pcibios;
>  	}
>  
> +	iov->num_VFs = nr_virtfn;
>  	pci_iov_set_numvfs(dev, nr_virtfn);
>  	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
>  	pci_cfg_access_lock(dev);
> @@ -363,7 +364,6 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  		goto err_pcibios;
>  
>  	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> -	iov->num_VFs = nr_virtfn;
>  
>  	return 0;
>  
> @@ -379,6 +379,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  	if (iov->link != dev->devfn)
>  		sysfs_remove_link(&dev->dev.kobj, "dep_link");
>  
> +	iov->num_VFs = 0;
>  	pci_iov_set_numvfs(dev, 0);
>  	return rc;
>  }
> -- 
> 2.17.1
>
CREGUT Pierre IMT/OLN April 26, 2019, 8:11 a.m. UTC | #2
I also initially thought that kobject_uevent generated the netlink event
but this is not the case. This is generated by the specific driver in use.
For the Intel i40e driver, this is the call to i40e_do_reset_safe in
i40e_pci_sriov_configure that sends the event.
It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
finally calls the generic pci_enable_sriov function.

So the proposed patch works well for the i40e driver (x710 cards) because
the update to num_VFs is fast enough to be committed before the event is
received. It may not work with other cards. The same is true for the zero
value: there is no guarantee for other cards either.

The clean solution would be to lock the device in sriov_numvfs_show.
I guess that there are good reasons why locks have been avoided
in sysfs getter functions so let us explore other approaches.

We can either return a "not settled" value (-1) or (probably better)
not return a value but an error (-EAGAIN returned by the show
function).
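If the show function reported -EAGAIN, a monitoring program would simply
retry the read until the value settles. A minimal userspace sketch of that
retry loop (the callback type and helper names are hypothetical, for
illustration only; the stub at the bottom just simulates a few "not
settled" reads):

```c
#include <assert.h>
#include <errno.h>

/* Reader callback: returns 0 and stores the value, or -EAGAIN while
 * the "not settled" window is open. */
typedef int (*numvfs_read_fn)(void *ctx, unsigned short *out);

/* Retry a sysfs-style read up to max_tries times while it reports
 * -EAGAIN; returns 0 on success, -EAGAIN if the value never settled. */
static int read_numvfs_retry(numvfs_read_fn read_fn, void *ctx,
                             unsigned short *out, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        int rc = read_fn(ctx, out);
        if (rc != -EAGAIN)
            return rc;
        /* A real program would nanosleep() briefly before retrying. */
    }
    return -EAGAIN;
}

/* Test stub: fail with -EAGAIN the first 'fails' times, then succeed. */
struct stub { int fails; unsigned short val; };

static int stub_read(void *ctx, unsigned short *out)
{
    struct stub *s = ctx;

    if (s->fails-- > 0)
        return -EAGAIN;
    *out = s->val;
    return 0;
}
```

The point of the retry loop is that -EAGAIN gives the reader a concrete
signal to poll on, instead of the arbitrary settle delay described above.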

To distinguish this "not settled" situation we can either:
* overload the meaning of num_VFs (e.g. make it negative), but it is
  an unsigned short;
* add a bool to the pci_sriov struct (rather simple, but it modifies a
  well-established structure);
* use the fact that not_settled => device is locked and use
  mutex_is_locked as an over-approximation.

The latter is not perfect but requires minimal changes to
sriov_numvfs_show:

      if (mutex_is_locked(&dev->mutex))
          return -EAGAIN;

In all cases, the device could be locked or the boolean set just
after the test, but I don't think there is a case where causality
would be violated. Thank you in advance for your recommendations;
I will update the patch according to your instructions.
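The mutex_is_locked check above can be modeled in userspace, with
pthread_mutex_trylock standing in for mutex_is_locked (the actual patch
would only test the lock, not take it — this is a sketch of the behavior,
not kernel code, and all names are invented for the model):

```c
#include <assert.h>
#include <pthread.h>
#include <errno.h>

/* Userspace model: if the device lock is held (a reconfiguration is in
 * progress), report -EAGAIN instead of a possibly stale num_VFs. */
struct model_dev {
    pthread_mutex_t mutex;   /* models dev->mutex */
    unsigned short num_VFs;  /* models pdev->sriov->num_VFs */
};

static int model_numvfs_show(struct model_dev *d, unsigned short *out)
{
    /* trylock fails (EBUSY) while a writer holds the lock: not settled */
    if (pthread_mutex_trylock(&d->mutex) != 0)
        return -EAGAIN;
    *out = d->num_VFs;
    pthread_mutex_unlock(&d->mutex);
    return 0;
}
```

As noted above this is an over-approximation: any holder of the device
lock, not just an sriov_numvfs write, makes the read fail.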

Le 06/04/2019 à 00:33, Bjorn Helgaas a écrit :
> On Fri, Mar 29, 2019 at 09:00:58AM +0100, Pierre Crégut wrote:
>> Ensure that iov->num_VFs is set before a netlink message is sent
>> when the number of VFs is changed. Only the path for num_VFs > 0
>> is affected. The path for num_VFs = 0 is already correct.
>>
>> Monitoring programs can relie on netlink messages to track interface
>> change and query their state in /sys. But when sriov_numvfs is set to a
>> positive value, the netlink message is sent before the value is available
>> in sysfs. The value read after the message is received is always zero.
> Thanks, Pierre!  Can you clue me in on where exactly the connection
> from sriov_enable() to netlink is?
>
> I see one side of the race is with sriov_numvfs_show(), but I don't
> know where the netlink message is sent.  Is that connected with the
> kobject_uevent(KOBJ_CHANGE)?
>
> One thing this would help with is figuring out exactly how *much*
> earlier we need to set iov->num_VFs.  It looks like the current patch
> sets it before we actually enable the VFs, so a user could read
> /sys/.../sriov_numvfs and get the wrong value.  Of course, that's
> unavoidable; the question is whether it's OK to get the new value
> *before* it actually takes effect, or whether we want to return a
> stale value until after it takes effect.
>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
>> Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
>> ---
>> note: the behaviour can be tested with the following shell script also
>> available on the bugzilla (d being the phy device name):
>>
>> ip monitor dev $d | grep --line-buffered "^[0-9]*:" | \
>> while read line; do cat /sys/class/net/$d/device/sriov_numvfs; done
>>
>>   drivers/pci/iov.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index 3aa115ed3a65..a9655c10e87f 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -351,6 +351,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>   		goto err_pcibios;
>>   	}
>>   
>> +	iov->num_VFs = nr_virtfn;
>>   	pci_iov_set_numvfs(dev, nr_virtfn);
>>   	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
>>   	pci_cfg_access_lock(dev);
>> @@ -363,7 +364,6 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>   		goto err_pcibios;
>>   
>>   	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
>> -	iov->num_VFs = nr_virtfn;
>>   
>>   	return 0;
>>   
>> @@ -379,6 +379,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>>   	if (iov->link != dev->devfn)
>>   		sysfs_remove_link(&dev->dev.kobj, "dep_link");
>>   
>> +	iov->num_VFs = 0;
>>   	pci_iov_set_numvfs(dev, 0);
>>   	return rc;
>>   }
>> -- 
>> 2.17.1
>>
Bjorn Helgaas June 13, 2019, 11:51 p.m. UTC | #3
Hi Pierre,

I'm really sorry, I totally missed your response.

On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:
> I also initially thought that kobject_uevent generated the netlink event
> but this is not the case. This is generated by the specific driver in use.
> For the Intel i40e driver, this is the call to i40e_do_reset_safe in
> i40e_pci_sriov_configure that sends the event.
> It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
> finally calls the generic pci_enable_sriov function.
> 
> So the proposed patch works well for the i40e driver (x710 cards) because
> the update to num_VFs is fast enough to be committed before the event is
> received. It may not work with other cards. The same is true for the zero
> value and there is no guarantee for other cards.
> 
> The clean solution would be to lock the device in sriov_numvfs_show.
> I guess that there are good reasons why locks have been avoided
> in sysfs getter functions so let us explore other approaches.
> 
> We can either return a "not settled" value (-1) or (probably better)
> do not return a value but an error (-EAGAIN returned by the show
> function).
> 
> To distinguish this "not settled" situation we can either:
> * overload the meaning of num_VFs (eg make it negative)
>   but it is an unsigned short.
> * add a bool to pci_sriov struct (rather simple but modifies a well
>   established structure).
> * use the fact that not_settled => device is locked and use
>   mutex_is_locked as an over approximation.
> 
> The later is not perfect but requires minimal changes to
> sriov_numvfs_show:
> 
>      if (mutex_is_locked(&dev->mutex))
>          return -EAGAIN;

I think this looks like a good resolution.

> In all cases, the device could be locked or the boolean set just
> after the test. But I don't think there is a case where causality
> would be violated.Thank you in advance for your recommendations.  I will
> update the patch according to your instructions.
> 
> Le 06/04/2019 à 00:33, Bjorn Helgaas a écrit :
> > On Fri, Mar 29, 2019 at 09:00:58AM +0100, Pierre Crégut wrote:
> > > Ensure that iov->num_VFs is set before a netlink message is sent
> > > when the number of VFs is changed. Only the path for num_VFs > 0
> > > is affected. The path for num_VFs = 0 is already correct.
> > > 
> > > Monitoring programs can relie on netlink messages to track interface
> > > change and query their state in /sys. But when sriov_numvfs is set to a
> > > positive value, the netlink message is sent before the value is available
> > > in sysfs. The value read after the message is received is always zero.
> > Thanks, Pierre!  Can you clue me in on where exactly the connection
> > from sriov_enable() to netlink is?
> > 
> > I see one side of the race is with sriov_numvfs_show(), but I don't
> > know where the netlink message is sent.  Is that connected with the
> > kobject_uevent(KOBJ_CHANGE)?
> > 
> > One thing this would help with is figuring out exactly how *much*
> > earlier we need to set iov->num_VFs.  It looks like the current patch
> > sets it before we actually enable the VFs, so a user could read
> > /sys/.../sriov_numvfs and get the wrong value.  Of course, that's
> > unavoidable; the question is whether it's OK to get the new value
> > *before* it actually takes effect, or whether we want to return a
> > stale value until after it takes effect.
> > 
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
> > > Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
> > > ---
> > > note: the behaviour can be tested with the following shell script also
> > > available on the bugzilla (d being the phy device name):
> > > 
> > > ip monitor dev $d | grep --line-buffered "^[0-9]*:" | \
> > > while read line; do cat /sys/class/net/$d/device/sriov_numvfs; done
> > > 
> > >   drivers/pci/iov.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index 3aa115ed3a65..a9655c10e87f 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -351,6 +351,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   		goto err_pcibios;
> > >   	}
> > > +	iov->num_VFs = nr_virtfn;
> > >   	pci_iov_set_numvfs(dev, nr_virtfn);
> > >   	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
> > >   	pci_cfg_access_lock(dev);
> > > @@ -363,7 +364,6 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   		goto err_pcibios;
> > >   	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> > > -	iov->num_VFs = nr_virtfn;
> > >   	return 0;
> > > @@ -379,6 +379,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   	if (iov->link != dev->devfn)
> > >   		sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > +	iov->num_VFs = 0;
> > >   	pci_iov_set_numvfs(dev, 0);
> > >   	return rc;
> > >   }
> > > -- 
> > > 2.17.1
> > >
Bjorn Helgaas Oct. 1, 2019, 11:45 p.m. UTC | #4
On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:
> I also initially thought that kobject_uevent generated the netlink event
> but this is not the case. This is generated by the specific driver in use.
> For the Intel i40e driver, this is the call to i40e_do_reset_safe in
> i40e_pci_sriov_configure that sends the event.
> It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
> finally calls the generic pci_enable_sriov function.

I don't know anything about netlink.  The script from the bugzilla
(https://bugzilla.kernel.org/show_bug.cgi?id=202991) looks like it
runs

  ip monitor dev enp9s0f2

What are the actual netlink events you see?  Are they related to a
device being removed?

When we change num_VFs, I think we have to disable any existing VFs
before enabling the new num_VFs, so if you trigger on a netlink
"remove" event, I wouldn't be surprised that reading sriov_numvfs
would give a zero until the new VFs are enabled.

> So the proposed patch works well for the i40e driver (x710 cards) because
> the update to num_VFs is fast enough to be committed before the event is
> received. It may not work with other cards. The same is true for the zero
> value and there is no guarantee for other cards.
> 
> The clean solution would be to lock the device in sriov_numvfs_show.
> I guess that there are good reasons why locks have been avoided
> in sysfs getter functions so let us explore other approaches.
> 
> We can either return a "not settled" value (-1) or (probably better)
> do not return a value but an error (-EAGAIN returned by the show
> function).
> 
> To distinguish this "not settled" situation we can either:
> * overload the meaning of num_VFs (eg make it negative)
>   but it is an unsigned short.
> * add a bool to pci_sriov struct (rather simple but modifies a well
>   established structure).
> * use the fact that not_settled => device is locked and use
>   mutex_is_locked as an over approximation.
> 
> The later is not perfect but requires minimal changes to
> sriov_numvfs_show:
> 
>      if (mutex_is_locked(&dev->mutex))
>          return -EAGAIN;

I thought this was a good idea, but

  - It does break the device_lock() encapsulation a little bit:
    sriov_numvfs_store() uses device_lock(), which happens to be
    implemented as "mutex_lock(&dev->mutex)", but we really shouldn't
    rely on that implementation, and

  - The netlink events are being generated via the NIC driver, and I'm
    a little hesitant about changing the PCI core to deal with timing
    issues "over there".

> In all cases, the device could be locked or the boolean set just
> after the test. But I don't think there is a case where causality
> would be violated.Thank you in advance for your recommendations.  I will
> update the patch according to your instructions.
> 
> Le 06/04/2019 à 00:33, Bjorn Helgaas a écrit :
> > On Fri, Mar 29, 2019 at 09:00:58AM +0100, Pierre Crégut wrote:
> > > Ensure that iov->num_VFs is set before a netlink message is sent
> > > when the number of VFs is changed. Only the path for num_VFs > 0
> > > is affected. The path for num_VFs = 0 is already correct.
> > > 
> > > Monitoring programs can relie on netlink messages to track interface
> > > change and query their state in /sys. But when sriov_numvfs is set to a
> > > positive value, the netlink message is sent before the value is available
> > > in sysfs. The value read after the message is received is always zero.
> > Thanks, Pierre!  Can you clue me in on where exactly the connection
> > from sriov_enable() to netlink is?
> > 
> > I see one side of the race is with sriov_numvfs_show(), but I don't
> > know where the netlink message is sent.  Is that connected with the
> > kobject_uevent(KOBJ_CHANGE)?
> > 
> > One thing this would help with is figuring out exactly how *much*
> > earlier we need to set iov->num_VFs.  It looks like the current patch
> > sets it before we actually enable the VFs, so a user could read
> > /sys/.../sriov_numvfs and get the wrong value.  Of course, that's
> > unavoidable; the question is whether it's OK to get the new value
> > *before* it actually takes effect, or whether we want to return a
> > stale value until after it takes effect.
> > 
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
> > > Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
> > > ---
> > > note: the behaviour can be tested with the following shell script also
> > > available on the bugzilla (d being the phy device name):
> > > 
> > > ip monitor dev $d | grep --line-buffered "^[0-9]*:" | \
> > > while read line; do cat /sys/class/net/$d/device/sriov_numvfs; done
> > > 
> > >   drivers/pci/iov.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index 3aa115ed3a65..a9655c10e87f 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -351,6 +351,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   		goto err_pcibios;
> > >   	}
> > > +	iov->num_VFs = nr_virtfn;
> > >   	pci_iov_set_numvfs(dev, nr_virtfn);
> > >   	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
> > >   	pci_cfg_access_lock(dev);
> > > @@ -363,7 +364,6 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   		goto err_pcibios;
> > >   	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> > > -	iov->num_VFs = nr_virtfn;
> > >   	return 0;
> > > @@ -379,6 +379,7 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> > >   	if (iov->link != dev->devfn)
> > >   		sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > > +	iov->num_VFs = 0;
> > >   	pci_iov_set_numvfs(dev, 0);
> > >   	return rc;
> > >   }
> > > -- 
> > > 2.17.1
> > >
CREGUT Pierre IMT/OLN Oct. 3, 2019, 9:04 a.m. UTC | #5
Le 02/10/2019 à 01:45, Bjorn Helgaas a écrit :
> On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:
>> I also initially thought that kobject_uevent generated the netlink event
>> but this is not the case. This is generated by the specific driver in use.
>> For the Intel i40e driver, this is the call to i40e_do_reset_safe in
>> i40e_pci_sriov_configure that sends the event.
>> It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
>> finally calls the generic pci_enable_sriov function.
> I don't know anything about netlink.  The script from the bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=202991) looks like it
> runs
>
>    ip monitor dev enp9s0f2
>
> What are the actual netlink events you see?  Are they related to a
> device being removed?

We have netlink events both when num_vfs goes from 0 to N and from N to 0.
Indeed you have to go to 0 before going to M with M != N.
On an Intel card, when one goes from 0 to N, the netlink event is sent
"early": the value of num_vfs is still 0 and you get the impression that
the number of VFs has not changed. As the meaning of those events is
overloaded, you have to wait an arbitrary amount of time until it settles
(there will be no other event). There is no such problem when it goes
from N to 0 because of implementation details, but it may be different
for another brand.

> When we change num_VFs, I think we have to disable any existing VFs
> before enabling the new num_VFs, so if you trigger on a netlink
> "remove" event, I wouldn't be surprised that reading sriov_numvfs
> would give a zero until the new VFs are enabled.
Yes, but we are speaking of the event sent when num_vfs is changed
from 0 to N.
> [...]
> I thought this was a good idea, but
>
>    - It does break the device_lock() encapsulation a little bit:
>      sriov_numvfs_store() uses device_lock(), which happens to be
>      implemented as "mutex_lock(&dev->mutex)", but we really shouldn't
>      rely on that implementation, and
The use of device_lock was the cheapest solution. It is true that lock
and trylock are exposed by device.h but is_locked is not. To respect
the abstraction, we would have to lock the device (at least use trylock,
but that means locking just to access the value; in that case we may as
well make reading num_vfs blocking?).

The other solution is to record the state of freshness of num_vfs, but
it means a new Boolean in the pci_sriov data structure.

>    - The netlink events are being generated via the NIC driver, and I'm
>      a little hesitant about changing the PCI core to deal with timing
>      issues "over there".

NIC drivers send netlink events when their state changes, but it is the
core that changes the value of num_vfs. So I would think it is the
core's responsibility to make sure the exposed value makes sense, and it
would be better to ignore the details of the driver implementation.
That is why the initial patch, which moved the point where the value was
updated, was finally not such a good idea.

[...]
Bjorn Helgaas Oct. 3, 2019, 10:10 p.m. UTC | #6
[+cc Don, Alex, Jakub]

On Thu, Oct 03, 2019 at 11:04:45AM +0200, CREGUT Pierre IMT/OLN wrote:
> Le 02/10/2019 à 01:45, Bjorn Helgaas a écrit :
> > On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:
> > > I also initially thought that kobject_uevent generated the netlink event
> > > but this is not the case. This is generated by the specific driver in use.
> > > For the Intel i40e driver, this is the call to i40e_do_reset_safe in
> > > i40e_pci_sriov_configure that sends the event.
> > > It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
> > > finally calls the generic pci_enable_sriov function.
> > I don't know anything about netlink.  The script from the bugzilla
> > (https://bugzilla.kernel.org/show_bug.cgi?id=202991) looks like it
> > runs
> > 
> >    ip monitor dev enp9s0f2
> > 
> > What are the actual netlink events you see?  Are they related to a
> > device being removed?
> 
> We have netlink events both when num_vfs goes from 0 to N and from N to 0.
> Indeed you have to go to 0 before going to M with M != N.

Right.

> On an Intel card, when one goes from 0 to N, the netlink event is
> sent "early". The value of num_vfs is still 0 and you get the
> impression that the number of VFS has not changed. As the meaning of
> those events is overloaded, you have to wait an arbitrary amount of
> time until it settles (there will be no other event).  There is no
> such problem when it goes from N to 0 because of implementation
> details but it may be different for another brand.

I hadn't looked far enough.  I think the "remove" netlink events are
probably from the i40e_do_reset_safe() path, which eventually calls
free_netdev() and put_device().

The pci_enable_sriov() path calls the driver's ->probe method, and I
suspect the "add" netlink events are emitted there.

> > When we change num_VFs, I think we have to disable any existing VFs
> > before enabling the new num_VFs, so if you trigger on a netlink
> > "remove" event, I wouldn't be surprised that reading sriov_numvfs
> > would give a zero until the new VFs are enabled.
> Yes but we are speaking of the event sent when num_vfs is changed from 0 to
> N
> > [...]
> > I thought this was a good idea, but
> > 
> >    - It does break the device_lock() encapsulation a little bit:
> >      sriov_numvfs_store() uses device_lock(), which happens to be
> >      implemented as "mutex_lock(&dev->mutex)", but we really shouldn't
> >      rely on that implementation, and

> The use of device_lock was the cheapest solution. It is true that
> lock and trylock are exposed by device.h but not is_locked. To
> respect the abstraction, we would have to lock the device (at least
> use trylock but it means locking when we can access the value, in
> that case we may just make reading num_vfs blocking ?).
> 
> The other solution is to record the state of freshness of num_vfs
> but it means a new Boolean in the pci_sriov data-structure.

> 
> >    - The netlink events are being generated via the NIC driver, and I'm
> >      a little hesitant about changing the PCI core to deal with timing
> >      issues "over there".
> 
> NIC drivers send netlink events when their state change, but it is
> the core that changes the value of num_vfs. So I would think it is
> the core responsibility to make sure the exposed value makes sense
> and it would be better to ignore the details of the driver
> implementation.

Yes, I think you're right.  And I like your previous suggestion of
just locking the device in the reader.  I'm not enough of a sysfs
expert to know if there's a good reason to avoid a lock there.  Does
the following look reasonable to you?


commit 0940fc95da45
Author: Pierre Crégut <pierre.cregut@orange.com>
Date:   Wed Sep 11 09:27:36 2019 +0200

    PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
    
    When sriov_numvfs is being updated, drivers may notify about new devices
    before they are reflected in sriov->num_VFs, so concurrent sysfs reads
    previously returned stale values.
    
    Serialize the sysfs read vs the write so the read returns the correct
    num_VFs value.
    
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
    Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
    Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b3f972e8cfed..e77562aabbae 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -254,8 +254,14 @@ static ssize_t sriov_numvfs_show(struct device *dev,
 				 char *buf)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
+	u16 num_vfs;
+
+	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
+	device_lock(&pdev->dev);
+	num_vfs = pdev->sriov->num_VFs;
+	device_unlock(&pdev->dev);
 
-	return sprintf(buf, "%u\n", pdev->sriov->num_VFs);
+	return sprintf(buf, "%u\n", num_vfs);
 }
 
 /*
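The serialization in the patch above can be modeled in userspace with a
plain mutex: the writer updates num_VFs while holding the lock, and the
reader takes the same lock before copying the value, so it observes
either the old or the new count, never an intermediate state. A sketch
(all names invented for the model; the pthread mutex stands in for
device_lock()/device_unlock()):

```c
#include <assert.h>
#include <pthread.h>

struct sim_dev {
    pthread_mutex_t mutex;   /* stands in for device_lock()/unlock() */
    unsigned short num_VFs;
};

/* Models sriov_numvfs_store(): the whole reconfiguration, including
 * any driver notifications, runs under the device lock. */
static void sim_numvfs_store(struct sim_dev *d, unsigned short n)
{
    pthread_mutex_lock(&d->mutex);
    /* ... disable existing VFs, enable n new ones ... */
    d->num_VFs = n;
    pthread_mutex_unlock(&d->mutex);
}

/* Models the patched sriov_numvfs_show(): take the lock, copy the
 * value, drop the lock, then format the copy outside the lock. */
static unsigned short sim_numvfs_show(struct sim_dev *d)
{
    unsigned short n;

    pthread_mutex_lock(&d->mutex);
    n = d->num_VFs;
    pthread_mutex_unlock(&d->mutex);
    return n;
}
```

Copying into a local before formatting keeps the critical section
minimal, which is why the patch snprintf's a copy rather than
dereferencing pdev->sriov->num_VFs under the lock.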
Jakub Kicinski Oct. 3, 2019, 10:36 p.m. UTC | #7
On Thu, 3 Oct 2019 17:10:07 -0500, Bjorn Helgaas wrote:
> On Thu, Oct 03, 2019 at 11:04:45AM +0200, CREGUT Pierre IMT/OLN wrote:
> > Le 02/10/2019 à 01:45, Bjorn Helgaas a écrit :  
> > > On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:  
> > > > I also initially thought that kobject_uevent generated the netlink event
> > > > but this is not the case. This is generated by the specific driver in use.
> > > > For the Intel i40e driver, this is the call to i40e_do_reset_safe in
> > > > i40e_pci_sriov_configure that sends the event.
> > > > It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
> > > > finally calls the generic pci_enable_sriov function.  
> > > I don't know anything about netlink.  The script from the bugzilla
> > > (https://bugzilla.kernel.org/show_bug.cgi?id=202991) looks like it
> > > runs
> > > 
> > >    ip monitor dev enp9s0f2
> > > 
> > > What are the actual netlink events you see?  Are they related to a
> > > device being removed?  
> > 
> > We have netlink events both when num_vfs goes from 0 to N and from N to 0.
> > Indeed you have to go to 0 before going to M with M != N.  
> 
> Right.

FWIW I think this netlink event is an artefact of i40e implementation,
and is not something the networking stack generates. Hopefully Alex can
correct me if I'm wrong, but I don't think most drivers will generate
such an event.

> commit 0940fc95da45
> Author: Pierre Crégut <pierre.cregut@orange.com>
> Date:   Wed Sep 11 09:27:36 2019 +0200
> 
>     PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
>     
>     When sriov_numvfs is being updated, drivers may notify about new devices
>     before they are reflected in sriov->num_VFs, so concurrent sysfs reads
>     previously returned stale values.
>     
>     Serialize the sysfs read vs the write so the read returns the correct
>     num_VFs value.
>     
>     Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
>     Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
>     Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index b3f972e8cfed..e77562aabbae 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -254,8 +254,14 @@ static ssize_t sriov_numvfs_show(struct device *dev,
>  				 char *buf)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
> +	u16 num_vfs;
> +
> +	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
> +	device_lock(&pdev->dev);
> +	num_vfs = pdev->sriov->num_VFs;
> +	device_unlock(&pdev->dev);
>  
> -	return sprintf(buf, "%u\n", pdev->sriov->num_VFs);
> +	return sprintf(buf, "%u\n", num_vfs);
>  }
>  
>  /*

The change makes sense to me!
Duyck, Alexander H Oct. 3, 2019, 10:37 p.m. UTC | #8
On Thu, 2019-10-03 at 17:10 -0500, Bjorn Helgaas wrote:
> [+cc Don, Alex, Jakub]
> 
> On Thu, Oct 03, 2019 at 11:04:45AM +0200, CREGUT Pierre IMT/OLN wrote:
> > Le 02/10/2019 à 01:45, Bjorn Helgaas a écrit :
> > > On Fri, Apr 26, 2019 at 10:11:54AM +0200, CREGUT Pierre IMT/OLN wrote:
> > > > I also initially thought that kobject_uevent generated the netlink event
> > > > but this is not the case. This is generated by the specific driver in use.
> > > > For the Intel i40e driver, this is the call to i40e_do_reset_safe in
> > > > i40e_pci_sriov_configure that sends the event.
> > > > It is followed by i40e_pci_sriov_enable that calls i40e_alloc_vfs that
> > > > finally calls the generic pci_enable_sriov function.
> > > I don't know anything about netlink.  The script from the bugzilla
> > > (https://bugzilla.kernel.org/show_bug.cgi?id=202991) looks like it
> > > runs
> > > 
> > >    ip monitor dev enp9s0f2
> > > 
> > > What are the actual netlink events you see?  Are they related to a
> > > device being removed?
> > 
> > We have netlink events both when num_vfs goes from 0 to N and from N to 0.
> > Indeed you have to go to 0 before going to M with M != N.
> 
> Right.

It doesn't make sense to monitor netlink for SR-IOV changes. All you are
catching is the incidental effect of the reset causing the link to bounce.

For example if you run "ip link set enp9s0f2 down" and then run your test
you will never get any notifications when you alter the num_vfs because
the link won't bounce because the link is already down.

It isn't surprising that you would see no VFs enabled as the act of
bringing down the interface to reconfigure it will give you a netlink
event. At that point we haven't even enabled SR-IOV yet, we were just
shutting down the existing config before we attempt to enable SR-IOV.

> > On an Intel card, when one goes from 0 to N, the netlink event is
> > sent "early". The value of num_vfs is still 0 and you get the
> > impression that the number of VFs has not changed. As the meaning of
> > those events is overloaded, you have to wait an arbitrary amount of
> > time until it settles (there will be no other event).  There is no
> > such problem when it goes from N to 0 because of implementation
> > details but it may be different for another brand.
> 
> I hadn't looked far enough.  I think the "remove" netlink events are
> probably from the i40e_do_reset_safe() path, which eventually calls
> free_netdev() and put_device().
> 
> The pci_enable_sriov() path calls the driver's ->probe method, and I
> suspect the "add" netlink events are emitted there.

So the issue as I see it is that this is a naive approach to how to
monitor for VFs being added or removed. All the script in the bugzilla
really does is catch resets when the interface is up.

Ideally the driver shouldn't even have to do the reset, except that it
has to re-partition the device to split up its resources.

> > > When we change num_VFs, I think we have to disable any existing VFs
> > > before enabling the new num_VFs, so if you trigger on a netlink
> > > "remove" event, I wouldn't be surprised that reading sriov_numvfs
> > > would give a zero until the new VFs are enabled.
> > Yes but we are speaking of the event sent when num_vfs is changed
> > from 0 to N
> > > [...]
> > > I thought this was a good idea, but
> > > 
> > >    - It does break the device_lock() encapsulation a little bit:
> > >      sriov_numvfs_store() uses device_lock(), which happens to be
> > >      implemented as "mutex_lock(&dev->mutex)", but we really shouldn't
> > >      rely on that implementation, and
> > The use of device_lock was the cheapest solution. It is true that
> > lock and trylock are exposed by device.h but not is_locked. To
> > respect the abstraction, we would have to lock the device (or at
> > least use trylock, but that still means taking the lock before we
> > can read the value; in that case we may as well make reading
> > num_vfs blocking?).
> > 
> > The other solution is to record the state of freshness of num_vfs
> > but it means a new Boolean in the pci_sriov data-structure.
> > >    - The netlink events are being generated via the NIC driver, and I'm
> > >      a little hesitant about changing the PCI core to deal with timing
> > >      issues "over there".
> > 
> > NIC drivers send netlink events when their state changes, but it is
> > the core that changes the value of num_vfs. So I would think it is
> > the core responsibility to make sure the exposed value makes sense
> > and it would be better to ignore the details of the driver
> > implementation.
> 
> Yes, I think you're right.  And I like your previous suggestion of
> just locking the device in the reader.  I'm not enough of a sysfs
> expert to know if there's a good reason to avoid a lock there.  Does
> the following look reasonable to you?
> 
> 
> commit 0940fc95da45
> Author: Pierre Crégut <pierre.cregut@orange.com>
> Date:   Wed Sep 11 09:27:36 2019 +0200
> 
>     PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
>     
>     When sriov_numvfs is being updated, drivers may notify about new devices
>     before they are reflected in sriov->num_VFs, so concurrent sysfs reads
>     previously returned stale values.
>     
>     Serialize the sysfs read vs the write so the read returns the correct
>     num_VFs value.
>     
>     Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
>     Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
>     Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index b3f972e8cfed..e77562aabbae 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -254,8 +254,14 @@ static ssize_t sriov_numvfs_show(struct device *dev,
>  				 char *buf)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
> +	u16 num_vfs;
> +
> +	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
> +	device_lock(&pdev->dev);
> +	num_vfs = pdev->sriov->num_VFs;
> +	device_lock(&pdev->dev);
>  
> -	return sprintf(buf, "%u\n", pdev->sriov->num_VFs);
> +	return sprintf(buf, "%u\n", num_vfs);
>  }
>  
>  /*

I think this would probably be a good way to go. Then if the device has
some sort of issues enabling SR-IOV we don't have an unknown state when
this is being read. It is either set or it is not, and we prevent reading
the state while it is being altered.
Bjorn Helgaas Oct. 8, 2019, 9:38 p.m. UTC | #9
On Thu, Oct 03, 2019 at 05:10:07PM -0500, Bjorn Helgaas wrote:
> On Thu, Oct 03, 2019 at 11:04:45AM +0200, CREGUT Pierre IMT/OLN wrote:
> > ...

> > NIC drivers send netlink events when their state changes, but it is
> > the core that changes the value of num_vfs. So I would think it is
> > the core responsibility to make sure the exposed value makes sense
> > and it would be better to ignore the details of the driver
> > implementation.
> 
> Yes, I think you're right.  And I like your previous suggestion of
> just locking the device in the reader.  I'm not enough of a sysfs
> expert to know if there's a good reason to avoid a lock there.  Does
> the following look reasonable to you?

I applied the patch below to pci/virtualization for v5.5, thanks for
your great patience!

> commit 0940fc95da45
> Author: Pierre Crégut <pierre.cregut@orange.com>
> Date:   Wed Sep 11 09:27:36 2019 +0200
> 
>     PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
>     
>     When sriov_numvfs is being updated, drivers may notify about new devices
>     before they are reflected in sriov->num_VFs, so concurrent sysfs reads
>     previously returned stale values.
>     
>     Serialize the sysfs read vs the write so the read returns the correct
>     num_VFs value.
>     
>     Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
>     Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
>     Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index b3f972e8cfed..e77562aabbae 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -254,8 +254,14 @@ static ssize_t sriov_numvfs_show(struct device *dev,
>  				 char *buf)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
> +	u16 num_vfs;
> +
> +	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
> +	device_lock(&pdev->dev);
> +	num_vfs = pdev->sriov->num_VFs;
> +	device_lock(&pdev->dev);
>  
> -	return sprintf(buf, "%u\n", pdev->sriov->num_VFs);
> +	return sprintf(buf, "%u\n", num_vfs);
>  }
>  
>  /*
Donald Dutile Oct. 8, 2019, 10:06 p.m. UTC | #10
On 10/08/2019 05:38 PM, Bjorn Helgaas wrote:
> On Thu, Oct 03, 2019 at 05:10:07PM -0500, Bjorn Helgaas wrote:
>> On Thu, Oct 03, 2019 at 11:04:45AM +0200, CREGUT Pierre IMT/OLN wrote:
>>> ...
> 
>>> NIC drivers send netlink events when their state changes, but it is
>>> the core that changes the value of num_vfs. So I would think it is
>>> the core responsibility to make sure the exposed value makes sense
>>> and it would be better to ignore the details of the driver
>>> implementation.
>>
>> Yes, I think you're right.  And I like your previous suggestion of
>> just locking the device in the reader.  I'm not enough of a sysfs
>> expert to know if there's a good reason to avoid a lock there.  Does
>> the following look reasonable to you?
> 
> I applied the patch below to pci/virtualization for v5.5, thanks for
I hope not... see below

> your great patience!
> 
>> commit 0940fc95da45
>> Author: Pierre Crégut <pierre.cregut@orange.com>
>> Date:   Wed Sep 11 09:27:36 2019 +0200
>>
>>      PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
>>      
>>      When sriov_numvfs is being updated, drivers may notify about new devices
>>      before they are reflected in sriov->num_VFs, so concurrent sysfs reads
>>      previously returned stale values.
>>      
>>      Serialize the sysfs read vs the write so the read returns the correct
>>      num_VFs value.
>>      
>>      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
>>      Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
>>      Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
>>      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>>
>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>> index b3f972e8cfed..e77562aabbae 100644
>> --- a/drivers/pci/iov.c
>> +++ b/drivers/pci/iov.c
>> @@ -254,8 +254,14 @@ static ssize_t sriov_numvfs_show(struct device *dev,
>>   				 char *buf)
>>   {
>>   	struct pci_dev *pdev = to_pci_dev(dev);
>> +	u16 num_vfs;
>> +
>> +	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
>> +	device_lock(&pdev->dev);
                ^^^^^ lock
>> +	num_vfs = pdev->sriov->num_VFs;
>> +	device_lock(&pdev->dev);
                ^^^^ and lock again!
>>   
>> -	return sprintf(buf, "%u\n", pdev->sriov->num_VFs);
>> +	return sprintf(buf, "%u\n", num_vfs);
>>   }
>>   
>>   /*
Bjorn Helgaas Oct. 9, 2019, 12:31 p.m. UTC | #11
On Tue, Oct 08, 2019 at 06:06:46PM -0400, Don Dutile wrote:
> On 10/08/2019 05:38 PM, Bjorn Helgaas wrote:
> > On Thu, Oct 03, 2019 at 05:10:07PM -0500, Bjorn Helgaas wrote:
> > > On Thu, Oct 03, 2019 at 11:04:45AM +0200, CREGUT Pierre IMT/OLN wrote:
> > > > ...
> > 
> > > > NIC drivers send netlink events when their state changes, but it is
> > > > the core that changes the value of num_vfs. So I would think it is
> > > > the core responsibility to make sure the exposed value makes sense
> > > > and it would be better to ignore the details of the driver
> > > > implementation.
> > > 
> > > Yes, I think you're right.  And I like your previous suggestion of
> > > just locking the device in the reader.  I'm not enough of a sysfs
> > > expert to know if there's a good reason to avoid a lock there.  Does
> > > the following look reasonable to you?
> > 
> > I applied the patch below to pci/virtualization for v5.5, thanks for
> I hope not... see below
> 
> > your great patience!
> > 
> > > commit 0940fc95da45
> > > Author: Pierre Crégut <pierre.cregut@orange.com>
> > > Date:   Wed Sep 11 09:27:36 2019 +0200
> > > 
> > >      PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
> > >      When sriov_numvfs is being updated, drivers may notify about new devices
> > >      before they are reflected in sriov->num_VFs, so concurrent sysfs reads
> > >      previously returned stale values.
> > >      Serialize the sysfs read vs the write so the read returns the correct
> > >      num_VFs value.
> > >      Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
> > >      Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
> > >      Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
> > >      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > > 
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index b3f972e8cfed..e77562aabbae 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -254,8 +254,14 @@ static ssize_t sriov_numvfs_show(struct device *dev,
> > >   				 char *buf)
> > >   {
> > >   	struct pci_dev *pdev = to_pci_dev(dev);
> > > +	u16 num_vfs;
> > > +
> > > +	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
> > > +	device_lock(&pdev->dev);
>                ^^^^^ lock
> > > +	num_vfs = pdev->sriov->num_VFs;
> > > +	device_lock(&pdev->dev);
>                ^^^^ and lock again!

Oops, sorry, my fault.  Fixed.

> > > -	return sprintf(buf, "%u\n", pdev->sriov->num_VFs);
> > > +	return sprintf(buf, "%u\n", num_vfs);
> > >   }
> > >   /*
>
Donald Dutile Oct. 9, 2019, 2:20 p.m. UTC | #12
On 10/09/2019 08:31 AM, Bjorn Helgaas wrote:
> On Tue, Oct 08, 2019 at 06:06:46PM -0400, Don Dutile wrote:
>> On 10/08/2019 05:38 PM, Bjorn Helgaas wrote:
>>> On Thu, Oct 03, 2019 at 05:10:07PM -0500, Bjorn Helgaas wrote:
>>>> On Thu, Oct 03, 2019 at 11:04:45AM +0200, CREGUT Pierre IMT/OLN wrote:
>>>>> ...
>>>
>>>>> NIC drivers send netlink events when their state changes, but it is
>>>>> the core that changes the value of num_vfs. So I would think it is
>>>>> the core responsibility to make sure the exposed value makes sense
>>>>> and it would be better to ignore the details of the driver
>>>>> implementation.
>>>>
>>>> Yes, I think you're right.  And I like your previous suggestion of
>>>> just locking the device in the reader.  I'm not enough of a sysfs
>>>> expert to know if there's a good reason to avoid a lock there.  Does
>>>> the following look reasonable to you?
>>>
>>> I applied the patch below to pci/virtualization for v5.5, thanks for
>> I hope not... see below
>>
>>> your great patience!
>>>
>>>> commit 0940fc95da45
>>>> Author: Pierre Crégut <pierre.cregut@orange.com>
>>>> Date:   Wed Sep 11 09:27:36 2019 +0200
>>>>
>>>>       PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes
>>>>       When sriov_numvfs is being updated, drivers may notify about new devices
>>>>       before they are reflected in sriov->num_VFs, so concurrent sysfs reads
>>>>       previously returned stale values.
>>>>       Serialize the sysfs read vs the write so the read returns the correct
>>>>       num_VFs value.
>>>>       Link: https://bugzilla.kernel.org/show_bug.cgi?id=202991
>>>>       Link: https://lore.kernel.org/r/20190911072736.32091-1-pierre.cregut@orange.com
>>>>       Signed-off-by: Pierre Crégut <pierre.cregut@orange.com>
>>>>       Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>>>>
>>>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
>>>> index b3f972e8cfed..e77562aabbae 100644
>>>> --- a/drivers/pci/iov.c
>>>> +++ b/drivers/pci/iov.c
>>>> @@ -254,8 +254,14 @@ static ssize_t sriov_numvfs_show(struct device *dev,
>>>>    				 char *buf)
>>>>    {
>>>>    	struct pci_dev *pdev = to_pci_dev(dev);
>>>> +	u16 num_vfs;
>>>> +
>>>> +	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
>>>> +	device_lock(&pdev->dev);
>>                 ^^^^^ lock
>>>> +	num_vfs = pdev->sriov->num_VFs;
>>>> +	device_lock(&pdev->dev);
>>                 ^^^^ and lock again!
> 
> Oops, sorry, my fault.  Fixed.
> 
Thanks.
--dd

>>>> -	return sprintf(buf, "%u\n", pdev->sriov->num_VFs);
>>>> +	return sprintf(buf, "%u\n", num_vfs);
>>>>    }
>>>>    /*
>>
Patch

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 3aa115ed3a65..a9655c10e87f 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -351,6 +351,7 @@  static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 		goto err_pcibios;
 	}
 
+	iov->num_VFs = nr_virtfn;
 	pci_iov_set_numvfs(dev, nr_virtfn);
 	iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
 	pci_cfg_access_lock(dev);
@@ -363,7 +364,6 @@  static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 		goto err_pcibios;
 
 	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
-	iov->num_VFs = nr_virtfn;
 
 	return 0;
 
@@ -379,6 +379,7 @@  static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	if (iov->link != dev->devfn)
 		sysfs_remove_link(&dev->dev.kobj, "dep_link");
 
+	iov->num_VFs = 0;
 	pci_iov_set_numvfs(dev, 0);
 	return rc;
 }