diff mbox

[RFC] vfio/mdev: delay uevent after initialization complete

Message ID 20180209102716.32253-1-cohuck@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Cornelia Huck Feb. 9, 2018, 10:27 a.m. UTC
The registration code first registers the mdev device, and then
proceeds to populate sysfs. An userspace application that listens
for the ADD uevent is therefore likely to look for sysfs entries
that have not yet been created.

The canonical way to fix this is to use attribute groups that are
registered by the driver core before it sends the ADD uevent; I
unfortunately did not find a way to make this work in this case,
though.

An alternative approach is to suppress uevents before we register
with the core and generate the ADD uevent ourselves after the
sysfs infrastructure is in place.

Signed-off-by: Cornelia Huck <cohuck@redhat.com>
---

This feels like a band-aid, but I can't figure out how to handle creating
attribute groups when there's a callback in the parent involved.

This should address the issue with libvirt's processing of mdevs raised in
https://www.redhat.com/archives/libvir-list/2018-February/msg00023.html
- although libvirt will still need to deal with older kernels, of course.

Best to consider this an untested patch :)

---
 drivers/vfio/mdev/mdev_core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Alex Williamson Feb. 12, 2018, 9:20 p.m. UTC | #1
On Fri,  9 Feb 2018 11:27:16 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> The registration code first registers the mdev device, and then
> proceeds to populate sysfs. An userspace application that listens
> for the ADD uevent is therefore likely to look for sysfs entries
> that have not yet been created.
> 
> The canonical way to fix this is to use attribute groups that are
> registered by the driver core before it sends the ADD uevent; I
> unfortunately did not find a way to make this work in this case,
> though.
> 
> An alternative approach is to suppress uevents before we register
> with the core and generate the ADD uevent ourselves after the
> sysfs infrastructure is in place.
> 
> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> ---
> 
> This feels like a band-aid, but I can't figure out how to handle creating
> attribute groups when there's a callback in the parent involved.
> 
> This should address the issue with libvirt's processing of mdevs raised in
> https://www.redhat.com/archives/libvir-list/2018-February/msg00023.html
> - although libvirt will still need to deal with older kernels, of course.
> 
> Best to consider this an untested patch :)

I agree, this feels like a band-aide.  If every device in the kernel
needs to suppress udev events until until some key component is added,
that suggests that either udev is broken in general or not being used
as intended.  Zongyong submitted a different proposal to fix this
here[1].  That proposal seems a bit more sound and has precedence
elsewhere in the kernel.  What do you think of that approach?  We
don't need both afaict.  Thanks,

Alex

[1]https://patchwork.kernel.org/patch/10196197/

> ---
>  drivers/vfio/mdev/mdev_core.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 126991046eb7..942e880d8e7f 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -335,6 +335,8 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  	mdev->dev.release = mdev_device_release;
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
> +	/* don't notify userspace until we're ready */
> +	dev_set_uevent_suppress(&mdev->dev, 1);
>  	ret = device_register(&mdev->dev);
>  	if (ret) {
>  		put_device(&mdev->dev);
> @@ -350,6 +352,9 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  		mdev_device_remove_ops(mdev, true);
>  		goto create_failed;
>  	}
> +	/* all done, notify userspace */
> +	dev_set_uevent_suppress(&mdev->dev, 0);
> +	kobject_uevent(&mdev->dev.kobj, KOBJ_ADD);
>  
>  	mdev->type_kobj = kobj;
>  	dev_dbg(&mdev->dev, "MDEV: created\n");
> @@ -363,6 +368,10 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  	return ret;
>  
>  create_failed:
> +	/*
> +	 * If we reach this, uevents are still suppressed for mdev->dev,
> +	 * so we don't get a KOBJ_DEL uevent without a previous KOBJ_ADD.
> +	 */
>  	device_unregister(&mdev->dev);
>  
>  create_err:
Cornelia Huck Feb. 13, 2018, 1:09 p.m. UTC | #2
On Mon, 12 Feb 2018 14:20:57 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri,  9 Feb 2018 11:27:16 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > The registration code first registers the mdev device, and then
> > proceeds to populate sysfs. An userspace application that listens
> > for the ADD uevent is therefore likely to look for sysfs entries
> > that have not yet been created.
> > 
> > The canonical way to fix this is to use attribute groups that are
> > registered by the driver core before it sends the ADD uevent; I
> > unfortunately did not find a way to make this work in this case,
> > though.
> > 
> > An alternative approach is to suppress uevents before we register
> > with the core and generate the ADD uevent ourselves after the
> > sysfs infrastructure is in place.
> > 
> > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > ---
> > 
> > This feels like a band-aid, but I can't figure out how to handle creating
> > attribute groups when there's a callback in the parent involved.
> > 
> > This should address the issue with libvirt's processing of mdevs raised in
> > https://www.redhat.com/archives/libvir-list/2018-February/msg00023.html
> > - although libvirt will still need to deal with older kernels, of course.
> > 
> > Best to consider this an untested patch :)  
> 
> I agree, this feels like a band-aide.  If every device in the kernel
> needs to suppress udev events until until some key component is added,
> that suggests that either udev is broken in general or not being used
> as intended.  

I think udev is working exactly as designed - it's more a problem of
when the kernel is sending what kind of notification to userspace, and
the particular issue here is how the code sending the event (driver
core) and the code assembling part of the user interface (mdev)
interact.

> Zongyong submitted a different proposal to fix this
> here[1].  That proposal seems a bit more sound and has precedence
> elsewhere in the kernel.  What do you think of that approach?  We
> don't need both afaict.  Thanks,
> 
> Alex
> 
> [1]https://patchwork.kernel.org/patch/10196197/

Zongyong's patch is sending an additional CHANGE uevent, and I agree
that doing both does not make sense. However, I think the CHANGE uevent
is not quite suitable in this case, and delaying the ADD uevent is
better.

[Warning, the following may be a bit rambling.]

The Linux driver model works under the assumption that any device is
represented as an in-kernel object that exposes information and knobs
through sysfs. As long as the device exists, userspace can poke at the
sysfs entries and retrieve information or configure things.

The idea of the 'ADD' uevent is basically to let userspace know that
there is now a new device with its related sysfs entries, and it may
look at it and configure it. IOW, if I (as a userspace application) get
the ADD uevent, I expect to be able to look at the device's sysfs
entries and find all the files/directories that are usually there,
without having to wait.

This expectation is broken if a device is first registered with the
driver core (generating the ADD uevent) and the driver adds sysfs
attributes later. To fix this, the driver core added a way to specify
default attribute groups for the device, which are registered by the
driver core itself before it generates the ADD uevent. Unfortunately, I
did not see a way to do this here (which does not mean there isn't).
The alternative was to prevent the driver core from sending the ADD
uevent and do it from the mdev code when it was ready.

The 'CHANGE' uevent, on the other hand, tells userspace that something
has changed for the device (that already existed). I (as a userspace
application) would expect to see it if, for example, the information
exposed via sysfs has changed, or maybe even if new, optional, entries
have appeared and I might want to rescan. With Zongyong's patch,
userspace gets the CHANGE uevent for something that was already
expected to be there, and is now _really_ there. It does give userspace
an indication that it can now work with the device (which certainly
improves things), but I would prefer to get rid of the too-early uevent
completely so that userspace does not get notified at all before the
device is completely present in sysfs.

So, in short, my patch does 'don't tell userspace until we're really
done', and Zongyong's patch does 'tell userspace again when we're
really done'.
Alex Williamson Feb. 14, 2018, 12:15 a.m. UTC | #3
On Tue, 13 Feb 2018 14:09:01 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Mon, 12 Feb 2018 14:20:57 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Fri,  9 Feb 2018 11:27:16 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> > > The registration code first registers the mdev device, and then
> > > proceeds to populate sysfs. An userspace application that listens
> > > for the ADD uevent is therefore likely to look for sysfs entries
> > > that have not yet been created.
> > > 
> > > The canonical way to fix this is to use attribute groups that are
> > > registered by the driver core before it sends the ADD uevent; I
> > > unfortunately did not find a way to make this work in this case,
> > > though.
> > > 
> > > An alternative approach is to suppress uevents before we register
> > > with the core and generate the ADD uevent ourselves after the
> > > sysfs infrastructure is in place.
> > > 
> > > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > > ---
> > > 
> > > This feels like a band-aid, but I can't figure out how to handle creating
> > > attribute groups when there's a callback in the parent involved.
> > > 
> > > This should address the issue with libvirt's processing of mdevs raised in
> > > https://www.redhat.com/archives/libvir-list/2018-February/msg00023.html
> > > - although libvirt will still need to deal with older kernels, of course.
> > > 
> > > Best to consider this an untested patch :)    
> > 
> > I agree, this feels like a band-aide.  If every device in the kernel
> > needs to suppress udev events until until some key component is added,
> > that suggests that either udev is broken in general or not being used
> > as intended.    
> 
> I think udev is working exactly as designed - it's more a problem of
> when the kernel is sending what kind of notification to userspace, and
> the particular issue here is how the code sending the event (driver
> core) and the code assembling part of the user interface (mdev)
> interact.
> 
> > Zongyong submitted a different proposal to fix this
> > here[1].  That proposal seems a bit more sound and has precedence
> > elsewhere in the kernel.  What do you think of that approach?  We
> > don't need both afaict.  Thanks,
> > 
> > Alex
> > 
> > [1]https://patchwork.kernel.org/patch/10196197/  
> 
> Zongyong's patch is sending an additional CHANGE uevent, and I agree
> that doing both does not make sense. However, I think the CHANGE uevent
> is not quite suitable in this case, and delaying the ADD uevent is
> better.
> 
> [Warning, the following may be a bit rambling.]
> 
> The Linux driver model works under the assumption that any device is
> represented as an in-kernel object that exposes information and knobs
> through sysfs. As long as the device exists, userspace can poke at the
> sysfs entries and retrieve information or configure things.
> 
> The idea of the 'ADD' uevent is basically to let userspace know that
> there is now a new device with its related sysfs entries, and it may
> look at it and configure it. IOW, if I (as a userspace application) get
> the ADD uevent, I expect to be able to look at the device's sysfs
> entries and find all the files/directories that are usually there,
> without having to wait.
> 
> This expectation is broken if a device is first registered with the
> driver core (generating the ADD uevent) and the driver adds sysfs
> attributes later. To fix this, the driver core added a way to specify
> default attribute groups for the device, which are registered by the
> driver core itself before it generates the ADD uevent. Unfortunately, I
> did not see a way to do this here (which does not mean there isn't).
> The alternative was to prevent the driver core from sending the ADD
> uevent and do it from the mdev code when it was ready.
> 
> The 'CHANGE' uevent, on the other hand, tells userspace that something
> has changed for the device (that already existed). I (as a userspace
> application) would expect to see it if, for example, the information
> exposed via sysfs has changed, or maybe even if new, optional, entries
> have appeared and I might want to rescan. With Zongyong's patch,
> userspace gets the CHANGE uevent for something that was already
> expected to be there, and is now _really_ there. It does give userspace
> an indication that it can now work with the device (which certainly
> improves things), but I would prefer to get rid of the too-early uevent
> completely so that userspace does not get notified at all before the
> device is completely present in sysfs.
> 
> So, in short, my patch does 'don't tell userspace until we're really
> done', and Zongyong's patch does 'tell userspace again when we're
> really done'.

This all sounds reasonable, but don't we have this synchronization
problem _everywhere_?  I apologize for referencing this bug because it's
not public (https://bugzilla.redhat.com/show_bug.cgi?id=1376907) but
the gist of it is that soft-unplugging PCI devices using the remove
entry in sysfs and re-adding with rescan sysfs entry results in libvirt
seeing the ADD uevent before the PCI config attribute is created and it
balks on the device.  So at the PCI core we have this same issue and
developers are saying that there's no guarantee that sysfs entries
won't be added and removed at any time in the lifecycle of the device
and it's not the kernel's responsibility to provide that
synchronization.  So what are we to do?  If this issue exists in a far
more predominant device subsystem than mdev, do we need to reset
developer expectations about what sort of synchronization, if any, the
kernel is intending to provide?  Should we agree and document some best
practices around ordering of (or suppressing of) uevents so that we can
consistently cleanup subsystems, or define that such ordering is
officially not guaranteed?  Thanks,

Alex
Cornelia Huck Feb. 14, 2018, 5:20 p.m. UTC | #4
[cc:ing Greg for his opinion on this; retaining quoting for context]

On Tue, 13 Feb 2018 17:15:02 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 13 Feb 2018 14:09:01 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
> > On Mon, 12 Feb 2018 14:20:57 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri,  9 Feb 2018 11:27:16 +0100
> > > Cornelia Huck <cohuck@redhat.com> wrote:
> > >     
> > > > The registration code first registers the mdev device, and then
> > > > proceeds to populate sysfs. An userspace application that listens
> > > > for the ADD uevent is therefore likely to look for sysfs entries
> > > > that have not yet been created.
> > > > 
> > > > The canonical way to fix this is to use attribute groups that are
> > > > registered by the driver core before it sends the ADD uevent; I
> > > > unfortunately did not find a way to make this work in this case,
> > > > though.
> > > > 
> > > > An alternative approach is to suppress uevents before we register
> > > > with the core and generate the ADD uevent ourselves after the
> > > > sysfs infrastructure is in place.
> > > > 
> > > > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > > > ---
> > > > 
> > > > This feels like a band-aid, but I can't figure out how to handle creating
> > > > attribute groups when there's a callback in the parent involved.
> > > > 
> > > > This should address the issue with libvirt's processing of mdevs raised in
> > > > https://www.redhat.com/archives/libvir-list/2018-February/msg00023.html
> > > > - although libvirt will still need to deal with older kernels, of course.
> > > > 
> > > > Best to consider this an untested patch :)      
> > > 
> > > I agree, this feels like a band-aide.  If every device in the kernel
> > > needs to suppress udev events until until some key component is added,
> > > that suggests that either udev is broken in general or not being used
> > > as intended.      
> > 
> > I think udev is working exactly as designed - it's more a problem of
> > when the kernel is sending what kind of notification to userspace, and
> > the particular issue here is how the code sending the event (driver
> > core) and the code assembling part of the user interface (mdev)
> > interact.
> >   
> > > Zongyong submitted a different proposal to fix this
> > > here[1].  That proposal seems a bit more sound and has precedence
> > > elsewhere in the kernel.  What do you think of that approach?  We
> > > don't need both afaict.  Thanks,
> > > 
> > > Alex
> > > 
> > > [1]https://patchwork.kernel.org/patch/10196197/    
> > 
> > Zongyong's patch is sending an additional CHANGE uevent, and I agree
> > that doing both does not make sense. However, I think the CHANGE uevent
> > is not quite suitable in this case, and delaying the ADD uevent is
> > better.
> > 
> > [Warning, the following may be a bit rambling.]
> > 
> > The Linux driver model works under the assumption that any device is
> > represented as an in-kernel object that exposes information and knobs
> > through sysfs. As long as the device exists, userspace can poke at the
> > sysfs entries and retrieve information or configure things.
> > 
> > The idea of the 'ADD' uevent is basically to let userspace know that
> > there is now a new device with its related sysfs entries, and it may
> > look at it and configure it. IOW, if I (as a userspace application) get
> > the ADD uevent, I expect to be able to look at the device's sysfs
> > entries and find all the files/directories that are usually there,
> > without having to wait.
> > 
> > This expectation is broken if a device is first registered with the
> > driver core (generating the ADD uevent) and the driver adds sysfs
> > attributes later. To fix this, the driver core added a way to specify
> > default attribute groups for the device, which are registered by the
> > driver core itself before it generates the ADD uevent. Unfortunately, I
> > did not see a way to do this here (which does not mean there isn't).
> > The alternative was to prevent the driver core from sending the ADD
> > uevent and do it from the mdev code when it was ready.
> > 
> > The 'CHANGE' uevent, on the other hand, tells userspace that something
> > has changed for the device (that already existed). I (as a userspace
> > application) would expect to see it if, for example, the information
> > exposed via sysfs has changed, or maybe even if new, optional, entries
> > have appeared and I might want to rescan. With Zongyong's patch,
> > userspace gets the CHANGE uevent for something that was already
> > expected to be there, and is now _really_ there. It does give userspace
> > an indication that it can now work with the device (which certainly
> > improves things), but I would prefer to get rid of the too-early uevent
> > completely so that userspace does not get notified at all before the
> > device is completely present in sysfs.
> > 
> > So, in short, my patch does 'don't tell userspace until we're really
> > done', and Zongyong's patch does 'tell userspace again when we're
> > really done'.  
> 
> This all sounds reasonable, but don't we have this synchronization
> problem _everywhere_?  I apologize for referencing this bug because it's
> not public (https://bugzilla.redhat.com/show_bug.cgi?id=1376907) but
> the gist of it is that soft-unplugging PCI devices using the remove
> entry in sysfs and re-adding with rescan sysfs entry results in libvirt
> seeing the ADD uevent before the PCI config attribute is created and it
> balks on the device. 

Yikes. So it seems that _any_ PCI device is incomplete when the ADD
uevent is sent? Or does this just apply to an unfortunately large
number of drivers?

> So at the PCI core we have this same issue and
> developers are saying that there's no guarantee that sysfs entries
> won't be added and removed at any time in the lifecycle of the device
> and it's not the kernel's responsibility to provide that
> synchronization. 

I think this is mixing up two things:
- sysfs entries that are there by default. These not being in place
  when the ADD uevent is sent sounds broken.
- Optional sysfs entries that might change. There may be a case for
  those to be added/removed later on (probably paired with a CHANGE
  uevent.)

> So what are we to do?  If this issue exists in a far
> more predominant device subsystem than mdev, do we need to reset
> developer expectations about what sort of synchronization, if any, the
> kernel is intending to provide?  Should we agree and document some best
> practices around ordering of (or suppressing of) uevents so that we can
> consistently cleanup subsystems, or define that such ordering is
> officially not guaranteed?  Thanks,
> 
> Alex

I'd like to clarify expectations here as well. My understanding has
been what I outlined above, and what I think allows userspace to rely
on asynchronous notifications without having to resort to polling,
waiting, etc.
Cornelia Huck Feb. 26, 2018, 6:51 p.m. UTC | #5
On Wed, 14 Feb 2018 18:20:35 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

Ping. Does anyone have an opinion on how to proceed?

> [cc:ing Greg for his opinion on this; retaining quoting for context]
> 
> On Tue, 13 Feb 2018 17:15:02 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Tue, 13 Feb 2018 14:09:01 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> > > On Mon, 12 Feb 2018 14:20:57 -0700
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > >     
> > > > On Fri,  9 Feb 2018 11:27:16 +0100
> > > > Cornelia Huck <cohuck@redhat.com> wrote:
> > > >       
> > > > > The registration code first registers the mdev device, and then
> > > > > proceeds to populate sysfs. An userspace application that listens
> > > > > for the ADD uevent is therefore likely to look for sysfs entries
> > > > > that have not yet been created.
> > > > > 
> > > > > The canonical way to fix this is to use attribute groups that are
> > > > > registered by the driver core before it sends the ADD uevent; I
> > > > > unfortunately did not find a way to make this work in this case,
> > > > > though.
> > > > > 
> > > > > An alternative approach is to suppress uevents before we register
> > > > > with the core and generate the ADD uevent ourselves after the
> > > > > sysfs infrastructure is in place.
> > > > > 
> > > > > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > > > > ---
> > > > > 
> > > > > This feels like a band-aid, but I can't figure out how to handle creating
> > > > > attribute groups when there's a callback in the parent involved.
> > > > > 
> > > > > This should address the issue with libvirt's processing of mdevs raised in
> > > > > https://www.redhat.com/archives/libvir-list/2018-February/msg00023.html
> > > > > - although libvirt will still need to deal with older kernels, of course.
> > > > > 
> > > > > Best to consider this an untested patch :)        
> > > > 
> > > > I agree, this feels like a band-aide.  If every device in the kernel
> > > > needs to suppress udev events until until some key component is added,
> > > > that suggests that either udev is broken in general or not being used
> > > > as intended.        
> > > 
> > > I think udev is working exactly as designed - it's more a problem of
> > > when the kernel is sending what kind of notification to userspace, and
> > > the particular issue here is how the code sending the event (driver
> > > core) and the code assembling part of the user interface (mdev)
> > > interact.
> > >     
> > > > Zongyong submitted a different proposal to fix this
> > > > here[1].  That proposal seems a bit more sound and has precedence
> > > > elsewhere in the kernel.  What do you think of that approach?  We
> > > > don't need both afaict.  Thanks,
> > > > 
> > > > Alex
> > > > 
> > > > [1]https://patchwork.kernel.org/patch/10196197/      
> > > 
> > > Zongyong's patch is sending an additional CHANGE uevent, and I agree
> > > that doing both does not make sense. However, I think the CHANGE uevent
> > > is not quite suitable in this case, and delaying the ADD uevent is
> > > better.
> > > 
> > > [Warning, the following may be a bit rambling.]
> > > 
> > > The Linux driver model works under the assumption that any device is
> > > represented as an in-kernel object that exposes information and knobs
> > > through sysfs. As long as the device exists, userspace can poke at the
> > > sysfs entries and retrieve information or configure things.
> > > 
> > > The idea of the 'ADD' uevent is basically to let userspace know that
> > > there is now a new device with its related sysfs entries, and it may
> > > look at it and configure it. IOW, if I (as a userspace application) get
> > > the ADD uevent, I expect to be able to look at the device's sysfs
> > > entries and find all the files/directories that are usually there,
> > > without having to wait.
> > > 
> > > This expectation is broken if a device is first registered with the
> > > driver core (generating the ADD uevent) and the driver adds sysfs
> > > attributes later. To fix this, the driver core added a way to specify
> > > default attribute groups for the device, which are registered by the
> > > driver core itself before it generates the ADD uevent. Unfortunately, I
> > > did not see a way to do this here (which does not mean there isn't).
> > > The alternative was to prevent the driver core from sending the ADD
> > > uevent and do it from the mdev code when it was ready.
> > > 
> > > The 'CHANGE' uevent, on the other hand, tells userspace that something
> > > has changed for the device (that already existed). I (as a userspace
> > > application) would expect to see it if, for example, the information
> > > exposed via sysfs has changed, or maybe even if new, optional, entries
> > > have appeared and I might want to rescan. With Zongyong's patch,
> > > userspace gets the CHANGE uevent for something that was already
> > > expected to be there, and is now _really_ there. It does give userspace
> > > an indication that it can now work with the device (which certainly
> > > improves things), but I would prefer to get rid of the too-early uevent
> > > completely so that userspace does not get notified at all before the
> > > device is completely present in sysfs.
> > > 
> > > So, in short, my patch does 'don't tell userspace until we're really
> > > done', and Zongyong's patch does 'tell userspace again when we're
> > > really done'.    
> > 
> > This all sounds reasonable, but don't we have this synchronization
> > problem _everywhere_?  I apologize for referencing this bug because it's
> > not public (https://bugzilla.redhat.com/show_bug.cgi?id=1376907) but
> > the gist of it is that soft-unplugging PCI devices using the remove
> > entry in sysfs and re-adding with rescan sysfs entry results in libvirt
> > seeing the ADD uevent before the PCI config attribute is created and it
> > balks on the device.   
> 
> Yikes. So it seems that _any_ PCI device is incomplete when the ADD
> uevent is sent? Or does this just apply to an unfortunately large
> number of drivers?
> 
> > So at the PCI core we have this same issue and
> > developers are saying that there's no guarantee that sysfs entries
> > won't be added and removed at any time in the lifecycle of the device
> > and it's not the kernel's responsibility to provide that
> > synchronization.   
> 
> I think this is mixing up two things:
> - sysfs entries that are there by default. These not being in place
>   when the ADD uevent is sent sounds broken.
> - Optional sysfs entries that might change. There may be a case for
>   those to be added/removed later on (probably paired with a CHANGE
>   uevent.)
> 
> > So what are we to do?  If this issue exists in a far
> > more predominant device subsystem than mdev, do we need to reset
> > developer expectations about what sort of synchronization, if any, the
> > kernel is intending to provide?  Should we agree and document some best
> > practices around ordering of (or suppressing of) uevents so that we can
> > consistently cleanup subsystems, or define that such ordering is
> > officially not guaranteed?  Thanks,
> > 
> > Alex  
> 
> I'd like to clarify expectations here as well. My understanding has
> been what I outlined above, and what I think allows userspace to rely
> on asynchronous notifications without having to resort to polling,
> waiting, etc.
Greg Kroah-Hartman April 25, 2018, 3:42 p.m. UTC | #6
On Wed, Feb 14, 2018 at 06:20:35PM +0100, Cornelia Huck wrote:
> [cc:ing Greg for his opinion on this; retaining quoting for context]

Ick, just found this in my inbox, sorry for the delay...

> 
> On Tue, 13 Feb 2018 17:15:02 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Tue, 13 Feb 2018 14:09:01 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> > 
> > > On Mon, 12 Feb 2018 14:20:57 -0700
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > >   
> > > > On Fri,  9 Feb 2018 11:27:16 +0100
> > > > Cornelia Huck <cohuck@redhat.com> wrote:
> > > >     
> > > > > The registration code first registers the mdev device, and then
> > > > > proceeds to populate sysfs. An userspace application that listens
> > > > > for the ADD uevent is therefore likely to look for sysfs entries
> > > > > that have not yet been created.
> > > > > 
> > > > > The canonical way to fix this is to use attribute groups that are
> > > > > registered by the driver core before it sends the ADD uevent; I
> > > > > unfortunately did not find a way to make this work in this case,
> > > > > though.
> > > > > 
> > > > > An alternative approach is to suppress uevents before we register
> > > > > with the core and generate the ADD uevent ourselves after the
> > > > > sysfs infrastructure is in place.
> > > > > 
> > > > > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > > > > ---
> > > > > 
> > > > > This feels like a band-aid, but I can't figure out how to handle creating
> > > > > attribute groups when there's a callback in the parent involved.
> > > > > 
> > > > > This should address the issue with libvirt's processing of mdevs raised in
> > > > > https://www.redhat.com/archives/libvir-list/2018-February/msg00023.html
> > > > > - although libvirt will still need to deal with older kernels, of course.
> > > > > 
> > > > > Best to consider this an untested patch :)      
> > > > 
> > > > I agree, this feels like a band-aide.  If every device in the kernel
> > > > needs to suppress udev events until until some key component is added,
> > > > that suggests that either udev is broken in general or not being used
> > > > as intended.      
> > > 
> > > I think udev is working exactly as designed - it's more a problem of
> > > when the kernel is sending what kind of notification to userspace, and
> > > the particular issue here is how the code sending the event (driver
> > > core) and the code assembling part of the user interface (mdev)
> > > interact.
> > >   
> > > > Zongyong submitted a different proposal to fix this
> > > > here[1].  That proposal seems a bit more sound and has precedence
> > > > elsewhere in the kernel.  What do you think of that approach?  We
> > > > don't need both afaict.  Thanks,
> > > > 
> > > > Alex
> > > > 
> > > > [1]https://patchwork.kernel.org/patch/10196197/    
> > > 
> > > Zongyong's patch is sending an additional CHANGE uevent, and I agree
> > > that doing both does not make sense. However, I think the CHANGE uevent
> > > is not quite suitable in this case, and delaying the ADD uevent is
> > > better.
> > > 
> > > [Warning, the following may be a bit rambling.]
> > > 
> > > The Linux driver model works under the assumption that any device is
> > > represented as an in-kernel object that exposes information and knobs
> > > through sysfs. As long as the device exists, userspace can poke at the
> > > sysfs entries and retrieve information or configure things.
> > > 
> > > The idea of the 'ADD' uevent is basically to let userspace know that
> > > there is now a new device with its related sysfs entries, and it may
> > > look at it and configure it. IOW, if I (as a userspace application) get
> > > the ADD uevent, I expect to be able to look at the device's sysfs
> > > entries and find all the files/directories that are usually there,
> > > without having to wait.
> > > 
> > > This expectation is broken if a device is first registered with the
> > > driver core (generating the ADD uevent) and the driver adds sysfs
> > > attributes later. To fix this, the driver core added a way to specify
> > > default attribute groups for the device, which are registered by the
> > > driver core itself before it generates the ADD uevent. Unfortunately, I
> > > did not see a way to do this here (which does not mean there isn't).
> > > The alternative was to prevent the driver core from sending the ADD
> > > uevent and do it from the mdev code when it was ready.
> > > 
> > > The 'CHANGE' uevent, on the other hand, tells userspace that something
> > > has changed for the device (that already existed). I (as a userspace
> > > application) would expect to see it if, for example, the information
> > > exposed via sysfs has changed, or maybe even if new, optional, entries
> > > have appeared and I might want to rescan. With Zongyong's patch,
> > > userspace gets the CHANGE uevent for something that was already
> > > expected to be there, and is now _really_ there. It does give userspace
> > > an indication that it can now work with the device (which certainly
> > > improves things), but I would prefer to get rid of the too-early uevent
> > > completely so that userspace does not get notified at all before the
> > > device is completely present in sysfs.
> > > 
> > > So, in short, my patch does 'don't tell userspace until we're really
> > > done', and Zongyong's patch does 'tell userspace again when we're
> > > really done'.  
> > 
> > This all sounds reasonable, but don't we have this synchronization
> > problem _everywhere_?  I apologize for referencing this bug because it's
> > not public (https://bugzilla.redhat.com/show_bug.cgi?id=1376907) but
> > the gist of it is that soft-unplugging PCI devices using the remove
> > entry in sysfs and re-adding with rescan sysfs entry results in libvirt
> > seeing the ADD uevent before the PCI config attribute is created and it
> > balks on the device. 
> 
> Yikes. So it seems that _any_ PCI device is incomplete when the ADD
> uevent is sent? Or does this just apply to an unfortunately large
> number of drivers?

What do you mean by "incomplete"?  The device is added to the bus at
that point in time, yes.  BUT, if a driver decides to add their own
attributes to the sysfs node, then that will happen afterward, unless
the bus has properly set that up (there are default attributes it can
use for that type of thing.)

It's a common problem that has been there since the beginning, and now
there's a new uevent that is "KOBJ_BIND" that is emitted when the driver
is finished being "bound" to the device.  Is that what you are looking
for here?

> > So at the PCI core we have this same issue and
> > developers are saying that there's no guarantee that sysfs entries
> > won't be added and removed at any time in the lifecycle of the device
> > and it's not the kernel's responsibility to provide that
> > synchronization. 
> 
> I think this is mixing up two things:
> - sysfs entries that are there by default. These not being in place
>   when the ADD uevent is sent sounds broken.

Yes it is, fix that in your bus code, default device attributes should
always be set up before ADD happens.  Driver specific ones will be there
by the time BIND happens.

> - Optional sysfs entries that might change. There may be a case for
>   those to be added/removed later on (probably paired with a CHANGE
>   uevent.)

CHANGE is pretty rare, and best used for things that are polled.  Or
major system events, like docking station changes.  Not for things that
happen all the time (like power state changes.)

does that help?

thanks,

greg k-h
Cornelia Huck April 27, 2018, 12:36 p.m. UTC | #7
On Wed, 25 Apr 2018 17:42:11 +0200
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> On Wed, Feb 14, 2018 at 06:20:35PM +0100, Cornelia Huck wrote:
> > [cc:ing Greg for his opinion on this; retaining quoting for context]  
> 
> Ick, just found this in my inbox, sorry for the delay...
> 
> > 
> > On Tue, 13 Feb 2018 17:15:02 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Tue, 13 Feb 2018 14:09:01 +0100
> > > Cornelia Huck <cohuck@redhat.com> wrote:
> > >   
> > > > On Mon, 12 Feb 2018 14:20:57 -0700
> > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > >     
> > > > > On Fri,  9 Feb 2018 11:27:16 +0100
> > > > > Cornelia Huck <cohuck@redhat.com> wrote:
> > > > >       
> > > > > > The registration code first registers the mdev device, and then
> > > > > > proceeds to populate sysfs. An userspace application that listens
> > > > > > for the ADD uevent is therefore likely to look for sysfs entries
> > > > > > that have not yet been created.
> > > > > > 
> > > > > > The canonical way to fix this is to use attribute groups that are
> > > > > > registered by the driver core before it sends the ADD uevent; I
> > > > > > unfortunately did not find a way to make this work in this case,
> > > > > > though.
> > > > > > 
> > > > > > An alternative approach is to suppress uevents before we register
> > > > > > with the core and generate the ADD uevent ourselves after the
> > > > > > sysfs infrastructure is in place.
> > > > > > 
> > > > > > Signed-off-by: Cornelia Huck <cohuck@redhat.com>
> > > > > > ---
> > > > > > 
> > > > > > This feels like a band-aid, but I can't figure out how to handle creating
> > > > > > attribute groups when there's a callback in the parent involved.
> > > > > > 
> > > > > > This should address the issue with libvirt's processing of mdevs raised in
> > > > > > https://www.redhat.com/archives/libvir-list/2018-February/msg00023.html
> > > > > > - although libvirt will still need to deal with older kernels, of course.
> > > > > > 
> > > > > > Best to consider this an untested patch :)        
> > > > > 
> > > > > I agree, this feels like a band-aide.  If every device in the kernel
> > > > > needs to suppress udev events until until some key component is added,
> > > > > that suggests that either udev is broken in general or not being used
> > > > > as intended.        
> > > > 
> > > > I think udev is working exactly as designed - it's more a problem of
> > > > when the kernel is sending what kind of notification to userspace, and
> > > > the particular issue here is how the code sending the event (driver
> > > > core) and the code assembling part of the user interface (mdev)
> > > > interact.
> > > >     
> > > > > Zongyong submitted a different proposal to fix this
> > > > > here[1].  That proposal seems a bit more sound and has precedence
> > > > > elsewhere in the kernel.  What do you think of that approach?  We
> > > > > don't need both afaict.  Thanks,
> > > > > 
> > > > > Alex
> > > > > 
> > > > > [1]https://patchwork.kernel.org/patch/10196197/      
> > > > 
> > > > Zongyong's patch is sending an additional CHANGE uevent, and I agree
> > > > that doing both does not make sense. However, I think the CHANGE uevent
> > > > is not quite suitable in this case, and delaying the ADD uevent is
> > > > better.
> > > > 
> > > > [Warning, the following may be a bit rambling.]
> > > > 
> > > > The Linux driver model works under the assumption that any device is
> > > > represented as an in-kernel object that exposes information and knobs
> > > > through sysfs. As long as the device exists, userspace can poke at the
> > > > sysfs entries and retrieve information or configure things.
> > > > 
> > > > The idea of the 'ADD' uevent is basically to let userspace know that
> > > > there is now a new device with its related sysfs entries, and it may
> > > > look at it and configure it. IOW, if I (as a userspace application) get
> > > > the ADD uevent, I expect to be able to look at the device's sysfs
> > > > entries and find all the files/directories that are usually there,
> > > > without having to wait.
> > > > 
> > > > This expectation is broken if a device is first registered with the
> > > > driver core (generating the ADD uevent) and the driver adds sysfs
> > > > attributes later. To fix this, the driver core added a way to specify
> > > > default attribute groups for the device, which are registered by the
> > > > driver core itself before it generates the ADD uevent. Unfortunately, I
> > > > did not see a way to do this here (which does not mean there isn't).
> > > > The alternative was to prevent the driver core from sending the ADD
> > > > uevent and do it from the mdev code when it was ready.
> > > > 
> > > > The 'CHANGE' uevent, on the other hand, tells userspace that something
> > > > has changed for the device (that already existed). I (as a userspace
> > > > application) would expect to see it if, for example, the information
> > > > exposed via sysfs has changed, or maybe even if new, optional, entries
> > > > have appeared and I might want to rescan. With Zongyong's patch,
> > > > userspace gets the CHANGE uevent for something that was already
> > > > expected to be there, and is now _really_ there. It does give userspace
> > > > an indication that it can now work with the device (which certainly
> > > > improves things), but I would prefer to get rid of the too-early uevent
> > > > completely so that userspace does not get notified at all before the
> > > > device is completely present in sysfs.
> > > > 
> > > > So, in short, my patch does 'don't tell userspace until we're really
> > > > done', and Zongyong's patch does 'tell userspace again when we're
> > > > really done'.    
> > > 
> > > This all sounds reasonable, but don't we have this synchronization
> > > problem _everywhere_?  I apologize for referencing this bug because it's
> > > not public (https://bugzilla.redhat.com/show_bug.cgi?id=1376907) but
> > > the gist of it is that soft-unplugging PCI devices using the remove
> > > entry in sysfs and re-adding with rescan sysfs entry results in libvirt
> > > seeing the ADD uevent before the PCI config attribute is created and it
> > > balks on the device.   
> > 
> > Yikes. So it seems that _any_ PCI device is incomplete when the ADD
> > uevent is sent? Or does this just apply to an unfortunately large
> > number of drivers?  
> 
> What do you mean by "incomplete"?  The device is added to the bus at
> that point in time, yes.  BUT, if a driver decides to add their own
> attributes to the sysfs node, then that will happen afterward, unless
> the bus has properly set that up (there are default attributes it can
> use for that type of thing.)
> 
> It's a common problem that has been there since the beginning, and now
> there's a new uevent that is "KOBJ_BIND" that is emitted when the driver
> is finished being "bound" to the device.  Is that what you are looking
> for here?

Yes, that sounds exactly like the right thing (libvirt had been looking
for the config pci device attribute).

> 
> > > So at the PCI core we have this same issue and
> > > developers are saying that there's no guarantee that sysfs entries
> > > won't be added and removed at any time in the lifecycle of the device
> > > and it's not the kernel's responsibility to provide that
> > > synchronization.   
> > 
> > I think this is mixing up two things:
> > - sysfs entries that are there by default. These not being in place
> >   when the ADD uevent is sent sounds broken.  
> 
> Yes it is, fix that in your bus code, default device attributes should
> always be set up before ADD happens.  Driver specific ones will be there
> by the time BIND happens.

In this case (mdev), the mediated device is interacting with the
physical device as well (including specific attributes). I looked at
the code, but soon was lost in a maze of twisty passages, all alike.
Suppressing the uevent was a quick fix, but likely not the best one.

> 
> > - Optional sysfs entries that might change. There may be a case for
> >   those to be added/removed later on (probably paired with a CHANGE
> >   uevent.)  
> 
> CHANGE is pretty rare, and best used for things that are polled.  Or
> major system events, like docking station changes.  Not for things that
> happen all the time (like power state changes.)
> 
> does that help?

Yes, thanks!
diff mbox

Patch

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 126991046eb7..942e880d8e7f 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -335,6 +335,8 @@  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 	mdev->dev.release = mdev_device_release;
 	dev_set_name(&mdev->dev, "%pUl", uuid.b);
 
+	/* don't notify userspace until we're ready */
+	dev_set_uevent_suppress(&mdev->dev, 1);
 	ret = device_register(&mdev->dev);
 	if (ret) {
 		put_device(&mdev->dev);
@@ -350,6 +352,9 @@  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 		mdev_device_remove_ops(mdev, true);
 		goto create_failed;
 	}
+	/* all done, notify userspace */
+	dev_set_uevent_suppress(&mdev->dev, 0);
+	kobject_uevent(&mdev->dev.kobj, KOBJ_ADD);
 
 	mdev->type_kobj = kobj;
 	dev_dbg(&mdev->dev, "MDEV: created\n");
@@ -363,6 +368,10 @@  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 	return ret;
 
 create_failed:
+	/*
+	 * If we reach this, uevents are still suppressed for mdev->dev,
+	 * so we don't get a KOBJ_DEL uevent without a previous KOBJ_ADD.
+	 */
 	device_unregister(&mdev->dev);
 
 create_err: