diff mbox series

[8/8] vfio/mdev: Improve the create/remove sequence

Message ID 1553296835-37522-9-git-send-email-parav@mellanox.com (mailing list archive)
State New, archived
Headers show
Series vfio/mdev: Improve vfio/mdev core module | expand

Commit Message

Parav Pandit March 22, 2019, 11:20 p.m. UTC
There are five problems with current code structure.
1. mdev device is placed on the mdev bus before it is created in the
vendor driver. Once a device is placed on the mdev bus without creating
its supporting underlying vendor device, an open() can get triggered by
userspace on partially initialized device.
Below ladder diagram highlight it.

      cpu-0                                       cpu-1
      -----                                       -----
   create_store()
     mdev_create_device()
       device_register()
          ...
         vfio_mdev_probe()
         ...creates char device
                                        vfio_mdev_open()
                                          parent->ops->open(mdev)
                                            vfio_ap_mdev_open()
                                              matrix_mdev = NULL
        [...]
        parent->ops->create()
          vfio_ap_mdev_create()
            mdev_set_drvdata(mdev, matrix_mdev);
            /* Valid pointer set above */

2. Current creation sequence is,
   parent->ops_create()
   groups_register()

Remove sequence is,
   parent->ops->remove()
   groups_unregister()
However, remove sequence should be exact mirror of creation sequence.
Once this is achieved, all users of the mdev will be terminated first
before removing underlying vendor device.
(Follow standard linux driver model).
At that point vendor's remove() ops shouldn't failed because device is
taken off the bus that should terminate the users.

3. Additionally any new mdev driver that wants to work on mdev device
during probe() routine registered using mdev_register_driver() needs to
get stable mdev structure.

4. In following sequence, child devices created while removing mdev parent
device can be left out, or it may lead to race of removing half
initialized child mdev devices.

issue-1:
--------
       cpu-0                         cpu-1
       -----                         -----
                                  mdev_unregister_device()
                                     device_for_each_child()
                                        mdev_device_remove_cb()
                                            mdev_device_remove()
create_store()
  mdev_device_create()                   [...]
       device_register()
                                  parent_remove_sysfs_files()
                                  /* BUG: device added by cpu-0
                                   * whose parent is getting removed.
                                   */

issue-2:
--------
       cpu-0                         cpu-1
       -----                         -----
create_store()
  mdev_device_create()                   [...]
       device_register()

       [...]                      mdev_unregister_device()
                                     device_for_each_child()
                                        mdev_device_remove_cb()
                                            mdev_device_remove()

       mdev_create_sysfs_files()
       /* BUG: create is adding
        * sysfs files for a device
        * which is undergoing removal.
        */
                                 parent_remove_sysfs_files()

5. Below crash is observed when user initiated remove is in progress
and mdev_unregister_driver() completes parent unregistration.

       cpu-0                         cpu-1
       -----                         -----
remove_store()
   mdev_device_remove()
   active = false;
                                  mdev_unregister_device()
                                    remove type
   [...]
   mdev_remove_ops() crashes.

This is similar race like create() racing with mdev_unregister_device().

mtty mtty: MDEV: Registered
iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
mdev_device_remove sleep started
mtty mtty: MDEV: Unregistering
mtty_dev: Unloaded!
BUG: unable to handle kernel paging request at ffffffffc027d668
PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
Oops: 0000 [#1] SMP PTI
CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
Call Trace:
 mdev_device_remove+0xef/0x130 [mdev]
 remove_store+0x77/0xa0 [mdev]
 kernfs_fop_write+0x113/0x1a0
 __vfs_write+0x33/0x1b0
 ? rcu_read_lock_sched_held+0x64/0x70
 ? rcu_sync_lockdep_assert+0x2a/0x50
 ? __sb_start_write+0x121/0x1b0
 ? vfs_write+0x17c/0x1b0
 vfs_write+0xad/0x1b0
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 ksys_write+0x55/0xc0
 do_syscall_64+0x5a/0x210

Therefore, mdev core is improved in following ways to overcome above
issues.

1. Before placing mdev devices on the bus, perform vendor drivers
creation which supports the mdev creation.
This ensures that mdev specific all necessary fields are initialized
before a given mdev can be accessed by bus driver.

2. During remove flow, first remove the device from the bus. This
ensures that any bus specific devices and data is cleared.
Once device is taken of the mdev bus, perform remove() of mdev from the
vendor driver.

3. Linux core device model provides way to register and auto unregister
the device sysfs attribute groups at dev->groups.
Make use of this groups to let core create the groups and simplify code
to avoid explicit groups creation and removal.

4. Wait for any ongoing mdev create() and remove() to finish before
unregistering parent device using srcu. This continues to allow multiple
create and remove to progress in parallel. At the same time guard parent
removal while parent is being access by create() and remove callbacks.

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Signed-off-by: Parav Pandit <parav@mellanox.com>
---
 drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++------------------
 drivers/vfio/mdev/mdev_private.h |   7 +-
 drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
 3 files changed, 84 insertions(+), 71 deletions(-)

Comments

Maxim Levitsky March 25, 2019, 1:24 p.m. UTC | #1
On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> There are five problems with current code structure.
> 1. mdev device is placed on the mdev bus before it is created in the
> vendor driver. Once a device is placed on the mdev bus without creating
> its supporting underlying vendor device, an open() can get triggered by
> userspace on partially initialized device.
> Below ladder diagram highlight it.
> 
>       cpu-0                                       cpu-1
>       -----                                       -----
>    create_store()
>      mdev_create_device()
>        device_register()
>           ...
>          vfio_mdev_probe()
>          ...creates char device
>                                         vfio_mdev_open()
>                                           parent->ops->open(mdev)
>                                             vfio_ap_mdev_open()
>                                               matrix_mdev = NULL
>         [...]
>         parent->ops->create()
>           vfio_ap_mdev_create()
>             mdev_set_drvdata(mdev, matrix_mdev);
>             /* Valid pointer set above */

Agree.
You probably mean mdev_device_create here.

> 
> 2. Current creation sequence is,
>    parent->ops_create()
>    groups_register()
> 
> Remove sequence is,
>    parent->ops->remove()
>    groups_unregister()
> However, remove sequence should be exact mirror of creation sequence.
> Once this is achieved, all users of the mdev will be terminated first
> before removing underlying vendor device.
> (Follow standard linux driver model).
> At that point vendor's remove() ops shouldn't failed because device is
> taken off the bus that should terminate the users.
Agreee here too.



> 
> 3. Additionally any new mdev driver that wants to work on mdev device
> during probe() routine registered using mdev_register_driver() needs to
> get stable mdev structure.
> 
> 4. In following sequence, child devices created while removing mdev parent
> device can be left out, or it may lead to race of removing half
> initialized child mdev devices.
> 
> issue-1:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
>                                   mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
>                                   parent_remove_sysfs_files()
>                                   /* BUG: device added by cpu-0
>                                    * whose parent is getting removed.
>                                    */
> 
> issue-2:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
> 
>        [...]                      mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> 
>        mdev_create_sysfs_files()
>        /* BUG: create is adding
>         * sysfs files for a device
>         * which is undergoing removal.
>         */
>                                  parent_remove_sysfs_files()
Looks like an issue to me too.

> 
> 5. Below crash is observed when user initiated remove is in progress
> and mdev_unregister_driver() completes parent unregistration.
> 
>        cpu-0                         cpu-1
>        -----                         -----
> remove_store()
>    mdev_device_remove()
>    active = false;
>                                   mdev_unregister_device()
>                                     remove type
>    [...]
>    mdev_remove_ops() crashes.
> 
> This is similar race like create() racing with mdev_unregister_device().
> 
> mtty mtty: MDEV: Registered
> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> mdev_device_remove sleep started
> mtty mtty: MDEV: Unregistering
> mtty_dev: Unloaded!
> BUG: unable to handle kernel paging request at ffffffffc027d668
> PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> Oops: 0000 [#1] SMP PTI
> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> Call Trace:
>  mdev_device_remove+0xef/0x130 [mdev]
>  remove_store+0x77/0xa0 [mdev]
>  kernfs_fop_write+0x113/0x1a0
>  __vfs_write+0x33/0x1b0
>  ? rcu_read_lock_sched_held+0x64/0x70
>  ? rcu_sync_lockdep_assert+0x2a/0x50
>  ? __sb_start_write+0x121/0x1b0
>  ? vfs_write+0x17c/0x1b0
>  vfs_write+0xad/0x1b0
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  ksys_write+0x55/0xc0
>  do_syscall_64+0x5a/0x210
> 
> Therefore, mdev core is improved in following ways to overcome above
> issues.
> 
> 1. Before placing mdev devices on the bus, perform vendor drivers
> creation which supports the mdev creation.
> This ensures that mdev specific all necessary fields are initialized
> before a given mdev can be accessed by bus driver.
> 
> 2. During remove flow, first remove the device from the bus. This
> ensures that any bus specific devices and data is cleared.
> Once device is taken of the mdev bus, perform remove() of mdev from the
> vendor driver.
> 
> 
> 3. Linux core device model provides way to register and auto unregister
> the device sysfs attribute groups at dev->groups.
> to avoid explicit groups creation and removal.
> to avoid explicit groups creation and removal.
> 
> 4. Wait for any ongoing mdev create() and remove() to finish before
> unregistering parent device using srcu. This continues to allow multiple
> create and remove to progress in parallel. At the same time guard parent
> removal while parent is being access by create() and remove callbacks.
All these fixes seem reasonable and correct to me


> 
> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++-----------------
> -
>  drivers/vfio/mdev/mdev_private.h |   7 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
>  3 files changed, 84 insertions(+), 71 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 944a058..8fe0ed1 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
>  						  ref);
>  	struct device *dev = parent->dev;
>  
> +	cleanup_srcu_struct(&parent->unreg_srcu);
>  	kfree(parent);
>  	put_device(dev);
>  }
> @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct mdev_parent
> *parent)
>  		kref_put(&parent->ref, mdev_release_parent);
>  }
>  
> -static int mdev_device_create_ops(struct kobject *kobj,
> -				  struct mdev_device *mdev)
> +static int mdev_device_must_remove(struct mdev_device *mdev)

Tiny nitpic: maybe a better name? or a comment for this function that state that
it tries removes the device even if in use

>  {
> -	struct mdev_parent *parent = mdev->parent;
> +	struct mdev_parent *parent;
> +	struct mdev_type *type;
>  	int ret;
>  
> -	ret = parent->ops->create(kobj, mdev);
> -	if (ret)
> -		return ret;
> +	type = to_mdev_type(mdev->type_kobj);
>  
> -	ret = sysfs_create_groups(&mdev->dev.kobj,
> -				  parent->ops->mdev_attr_groups);
> +	mdev_remove_sysfs_files(&mdev->dev, type);
> +	device_del(&mdev->dev);
> +	parent = mdev->parent;
> +	ret = parent->ops->remove(mdev);
>  	if (ret)
> -		parent->ops->remove(mdev);
> +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
>  
> +	/* Balances with device_initialize() */
> +	put_device(&mdev->dev);
>  	return ret;
>  }
>  
> -/*
> - * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
> - * device is being unregistered from mdev device framework.
> - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> - *   indicates that if the mdev device is active, used by VMM or userspace
> - *   application, vendor driver could return error then don't remove the
> device.
> - * - 'force_remove' is set to 'true' when called from
> mdev_unregister_device()
> - *   which indicate that parent device is being removed from mdev device
> - *   framework so remove mdev device forcefully.
> - */
> -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> force_remove)
> -{
> -	struct mdev_parent *parent = mdev->parent;
> -	int ret;
> -
> -	/*
> -	 * Vendor driver can return error if VMM or userspace application is
> -	 * using this mdev device.
> -	 */
> -	ret = parent->ops->remove(mdev);
> -	if (ret && !force_remove)
> -		return ret;
> -
> -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> -	return 0;
> -}
> -
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
>  	if (dev_is_mdev(dev))
> -		mdev_device_remove(dev, true);
> -
> +		mdev_device_must_remove(to_mdev_device(dev));
>  	return 0;
>  }
>  
> @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const struct
> mdev_parent_ops *ops)
>  	}
>  
>  	kref_init(&parent->ref);
> +	init_srcu_struct(&parent->unreg_srcu);
>  
>  	parent->dev = dev;
>  	parent->ops = ops;
> @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const struct
> mdev_parent_ops *ops)
>  	if (ret)
>  		dev_warn(dev, "Failed to create compatibility class link\n");
>  
> +	rcu_assign_pointer(parent->self, parent);
>  	list_add(&parent->next, &parent_list);
>  	mutex_unlock(&parent_list_lock);
>  
> @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> -
>  	if (!parent) {
>  		mutex_unlock(&parent_list_lock);
>  		return;
>  	}
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +
>  	dev_info(dev, "MDEV: Unregistering\n");
>  
> -	list_del(&parent->next);
> +	/* Publish that this mdev parent is unregistering. So any new
> +	 * create/remove cannot start on this parent anymore by user.
> +	 */
> +	rcu_assign_pointer(parent->self, NULL);
> +
> +	/*
> +	 * Wait for any active create() or remove() mdev ops on the parent
> +	 * to complete.
> +	 */
> +	synchronize_srcu(&parent->unreg_srcu);
> +
> +	/* At this point it is confirmed that any pending user initiated
> +	 * create or remove callbacks accessing the parent are completed.
> +	 * It is safe to remove the parent now.
> +	 */
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
>  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> -	mutex_unlock(&parent_list_lock);
>  	mdev_put_parent(parent);
>  }
>  EXPORT_SYMBOL(mdev_unregister_device);
> @@ -278,14 +270,24 @@ static void mdev_device_release(struct device *dev)
>  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le
> uuid)
>  {
>  	int ret;
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	int srcu_idx;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		/* parent is undergoing unregistration */
> +		ret = -ENODEV;
> +		goto mdev_fail;
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj, struct
> device *dev, uuid_le uuid)
>  
>  	mdev->parent = parent;
>  
> +	device_initialize(&mdev->dev);
>  	mdev->dev.parent  = dev;
>  	mdev->dev.bus     = &mdev_bus_type;
>  	mdev->dev.release = mdev_device_release;
> +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
> -	ret = device_register(&mdev->dev);
> +	ret = type->parent->ops->create(kobj, mdev);
>  	if (ret)
> -		goto mdev_fail;
> +		goto create_fail;
>  
> -	ret = mdev_device_create_ops(kobj, mdev);
> +	ret = device_add(&mdev->dev);
>  	if (ret)
> -		goto create_fail;
> +		goto dev_fail;
>  
>  	ret = mdev_create_sysfs_files(&mdev->dev, type);
> -	if (ret) {
> -		mdev_device_remove_ops(mdev, true);
> -		goto create_fail;
> -	}
> +	if (ret)
> +		goto sysfs_fail;
>  
>  	mdev->type_kobj = kobj;
>  	mdev->active = true;
>  	dev_dbg(&mdev->dev, "MDEV: created\n");
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
>  	return 0;
>  
> +sysfs_fail:
> +	device_del(&mdev->dev);
> +dev_fail:
> +	type->parent->ops->remove(mdev);
>  create_fail:
> -	device_unregister(&mdev->dev);
> +	put_device(&mdev->dev);
>  mdev_fail:
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  	mdev_put_parent(parent);
>  	return ret;
>  }
>  
> -int mdev_device_remove(struct device *dev, bool force_remove)
> +int mdev_device_remove(struct device *dev)
>  {
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
> -	struct mdev_type *type;
> +	int srcu_idx;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		/* parent is undergoing unregistration */
> +		return -ENODEV;
> +	}
> +
> +	mutex_lock(&mdev_list_lock);
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
> -		return -EAGAIN;
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		return -ENODEV;
>  	}
> -
>  	mdev->active = false;
>  	mutex_unlock(&mdev_list_lock);
>  
> -	type = to_mdev_type(mdev->type_kobj);
> -	parent = mdev->parent;
> -
> -	ret = mdev_device_remove_ops(mdev, force_remove);
> -	if (ret) {
> -		mdev->active = true;
> -		return ret;
> -	}
> +	ret = mdev_device_must_remove(mdev);
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
> -	mdev_remove_sysfs_files(dev, type);
> -	device_unregister(dev);
>  	mdev_put_parent(parent);
> -
> -	return 0;
> +	return ret;
>  }
>  
>  static int __init mdev_init(void)
> diff --git a/drivers/vfio/mdev/mdev_private.h
> b/drivers/vfio/mdev/mdev_private.h
> index 84b2b6c..3d17db9 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -23,6 +23,11 @@ struct mdev_parent {
>  	struct list_head next;
>  	struct kset *mdev_types_kset;
>  	struct list_head type_list;
> +	/* Protects unregistration to wait until create/remove
> +	 * are completed.
> +	 */
> +	struct srcu_struct unreg_srcu;
> +	struct mdev_parent __rcu *self;
>  };
>  
>  struct mdev_device {
> @@ -58,6 +63,6 @@ struct mdev_type {
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
>  int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le
> uuid);
> -int  mdev_device_remove(struct device *dev, bool force_remove);
> +int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index c782fa9..68a8191 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev, struct
> device_attribute *attr,
>  	if (val && device_remove_file_self(dev, attr)) {
>  		int ret;
>  
> -		ret = mdev_device_remove(dev, false);
> -		if (ret) {
> -			device_create_file(dev, attr);
> +		ret = mdev_device_remove(dev);
> +		if (ret)
>  			return ret;
> -		}
>  	}
>  
>  	return count;

The patch looks OK to me, especially looking at the code after the changes were
apllied. I might have missed something though due to amount of changes done.

I lightly tested the whole patch series with my mdev driver, and it seems to
survive, but my testing doesn't test much of the error paths so there that.

I'll keep this applied so if I notice any errors I'll let you know.

If you could split this into few patches, this would be even better, but anyway
thanks a lot for this work!

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky
Parav Pandit March 25, 2019, 9:42 p.m. UTC | #2
> -----Original Message-----
> From: Maxim Levitsky <mlevitsk@redhat.com>
> Sent: Monday, March 25, 2019 8:24 AM
> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; kwankhede@nvidia.com;
> alex.williamson@redhat.com
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Fri, 2019-03-22 at 18:20 -0500, Parav Pandit wrote:
> > There are five problems with current code structure.
> > 1. mdev device is placed on the mdev bus before it is created in the
> > vendor driver. Once a device is placed on the mdev bus without
> > creating its supporting underlying vendor device, an open() can get
> > triggered by userspace on partially initialized device.
> > Below ladder diagram highlight it.
> >
> >       cpu-0                                       cpu-1
> >       -----                                       -----
> >    create_store()
> >      mdev_create_device()
> >        device_register()
> >           ...
> >          vfio_mdev_probe()
> >          ...creates char device
> >                                         vfio_mdev_open()
> >                                           parent->ops->open(mdev)
> >                                             vfio_ap_mdev_open()
> >                                               matrix_mdev = NULL
> >         [...]
> >         parent->ops->create()
> >           vfio_ap_mdev_create()
> >             mdev_set_drvdata(mdev, matrix_mdev);
> >             /* Valid pointer set above */
> 
> Agree.
> You probably mean mdev_device_create here.
> 
> >
> > 2. Current creation sequence is,
> >    parent->ops_create()
> >    groups_register()
> >
> > Remove sequence is,
> >    parent->ops->remove()
> >    groups_unregister()
> > However, remove sequence should be exact mirror of creation sequence.
> > Once this is achieved, all users of the mdev will be terminated first
> > before removing underlying vendor device.
> > (Follow standard linux driver model).
> > At that point vendor's remove() ops shouldn't failed because device is
> > taken off the bus that should terminate the users.
> Agreee here too.
> 
> 
> 
> >
> > 3. Additionally any new mdev driver that wants to work on mdev device
> > during probe() routine registered using mdev_register_driver() needs
> > to get stable mdev structure.
> >
> > 4. In following sequence, child devices created while removing mdev
> > parent device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> >
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> >
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> >
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> Looks like an issue to me too.
> 
> >
> > 5. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_driver() completes parent unregistration.
> >
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> >
> > This is similar race like create() racing with mdev_unregister_device().
> >
> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> >
> > Therefore, mdev core is improved in following ways to overcome above
> > issues.
> >
> > 1. Before placing mdev devices on the bus, perform vendor drivers
> > creation which supports the mdev creation.
> > This ensures that mdev specific all necessary fields are initialized
> > before a given mdev can be accessed by bus driver.
> >
> > 2. During remove flow, first remove the device from the bus. This
> > ensures that any bus specific devices and data is cleared.
> > Once device is taken of the mdev bus, perform remove() of mdev from
> > the vendor driver.
> >
> >
> > 3. Linux core device model provides way to register and auto
> > unregister the device sysfs attribute groups at dev->groups.
> > to avoid explicit groups creation and removal.
> > to avoid explicit groups creation and removal.
> >
> > 4. Wait for any ongoing mdev create() and remove() to finish before
> > unregistering parent device using srcu. This continues to allow
> > multiple create and remove to progress in parallel. At the same time
> > guard parent removal while parent is being access by create() and remove
> callbacks.
> All these fixes seem reasonable and correct to me
> 
> 
> >
> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++--------------
> ---
> > -
> >  drivers/vfio/mdev/mdev_private.h |   7 +-
> >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> >  3 files changed, 84 insertions(+), 71 deletions(-)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c
> > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> >  						  ref);
> >  	struct device *dev = parent->dev;
> >
> > +	cleanup_srcu_struct(&parent->unreg_srcu);
> >  	kfree(parent);
> >  	put_device(dev);
> >  }
> > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct
> > mdev_parent
> > *parent)
> >  		kref_put(&parent->ref, mdev_release_parent);  }
> >
> > -static int mdev_device_create_ops(struct kobject *kobj,
> > -				  struct mdev_device *mdev)
> > +static int mdev_device_must_remove(struct mdev_device *mdev)
> 
> Tiny nitpic: maybe a better name? or a comment for this function that state
> that it tries removes the device even if in use
> 
> >  {
> > -	struct mdev_parent *parent = mdev->parent;
> > +	struct mdev_parent *parent;
> > +	struct mdev_type *type;
> >  	int ret;
> >
> > -	ret = parent->ops->create(kobj, mdev);
> > -	if (ret)
> > -		return ret;
> > +	type = to_mdev_type(mdev->type_kobj);
> >
> > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > -				  parent->ops->mdev_attr_groups);
> > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > +	device_del(&mdev->dev);
> > +	parent = mdev->parent;
> > +	ret = parent->ops->remove(mdev);
> >  	if (ret)
> > -		parent->ops->remove(mdev);
> > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
> >
> > +	/* Balances with device_initialize() */
> > +	put_device(&mdev->dev);
> >  	return ret;
> >  }
> >
> > -/*
> > - * mdev_device_remove_ops gets called from sysfs's 'remove' and when
> > parent
> > - * device is being unregistered from mdev device framework.
> > - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> > - *   indicates that if the mdev device is active, used by VMM or userspace
> > - *   application, vendor driver could return error then don't remove the
> > device.
> > - * - 'force_remove' is set to 'true' when called from
> > mdev_unregister_device()
> > - *   which indicate that parent device is being removed from mdev device
> > - *   framework so remove mdev device forcefully.
> > - */
> > -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> > force_remove)
> > -{
> > -	struct mdev_parent *parent = mdev->parent;
> > -	int ret;
> > -
> > -	/*
> > -	 * Vendor driver can return error if VMM or userspace application is
> > -	 * using this mdev device.
> > -	 */
> > -	ret = parent->ops->remove(mdev);
> > -	if (ret && !force_remove)
> > -		return ret;
> > -
> > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> >mdev_attr_groups);
> > -	return 0;
> > -}
> > -
> >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> >  	if (dev_is_mdev(dev))
> > -		mdev_device_remove(dev, true);
> > -
> > +		mdev_device_must_remove(to_mdev_device(dev));
> >  	return 0;
> >  }
> >
> > @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const
> > struct mdev_parent_ops *ops)
> >  	}
> >
> >  	kref_init(&parent->ref);
> > +	init_srcu_struct(&parent->unreg_srcu);
> >
> >  	parent->dev = dev;
> >  	parent->ops = ops;
> > @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const
> > struct mdev_parent_ops *ops)
> >  	if (ret)
> >  		dev_warn(dev, "Failed to create compatibility class link\n");
> >
> > +	rcu_assign_pointer(parent->self, parent);
> >  	list_add(&parent->next, &parent_list);
> >  	mutex_unlock(&parent_list_lock);
> >
> > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
> >
> >  	mutex_lock(&parent_list_lock);
> >  	parent = __find_parent_device(dev);
> > -
> >  	if (!parent) {
> >  		mutex_unlock(&parent_list_lock);
> >  		return;
> >  	}
> > +	list_del(&parent->next);
> > +	mutex_unlock(&parent_list_lock);
> > +
> >  	dev_info(dev, "MDEV: Unregistering\n");
> >
> > -	list_del(&parent->next);
> > +	/* Publish that this mdev parent is unregistering. So any new
> > +	 * create/remove cannot start on this parent anymore by user.
> > +	 */
> > +	rcu_assign_pointer(parent->self, NULL);
> > +
> > +	/*
> > +	 * Wait for any active create() or remove() mdev ops on the parent
> > +	 * to complete.
> > +	 */
> > +	synchronize_srcu(&parent->unreg_srcu);
> > +
> > +	/* At this point it is confirmed that any pending user initiated
> > +	 * create or remove callbacks accessing the parent are completed.
> > +	 * It is safe to remove the parent now.
> > +	 */
> >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> >
> >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> >
> >  	parent_remove_sysfs_files(parent);
> >
> > -	mutex_unlock(&parent_list_lock);
> >  	mdev_put_parent(parent);
> >  }
> >  EXPORT_SYMBOL(mdev_unregister_device);
> > @@ -278,14 +270,24 @@ static void mdev_device_release(struct device
> > *dev)  int mdev_device_create(struct kobject *kobj, struct device
> > *dev, uuid_le
> > uuid)
> >  {
> >  	int ret;
> > +	struct mdev_parent *valid_parent;
> >  	struct mdev_device *mdev, *tmp;
> >  	struct mdev_parent *parent;
> >  	struct mdev_type *type = to_mdev_type(kobj);
> > +	int srcu_idx;
> >
> >  	parent = mdev_get_parent(type->parent);
> >  	if (!parent)
> >  		return -EINVAL;
> >
> > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > +	if (!valid_parent) {
> > +		/* parent is undergoing unregistration */
> > +		ret = -ENODEV;
> > +		goto mdev_fail;
> > +	}
> > +
> >  	mutex_lock(&mdev_list_lock);
> >
> >  	/* Check for duplicate */
> > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj,
> > struct device *dev, uuid_le uuid)
> >
> >  	mdev->parent = parent;
> >
> > +	device_initialize(&mdev->dev);
> >  	mdev->dev.parent  = dev;
> >  	mdev->dev.bus     = &mdev_bus_type;
> >  	mdev->dev.release = mdev_device_release;
> > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> >
> > -	ret = device_register(&mdev->dev);
> > +	ret = type->parent->ops->create(kobj, mdev);
> >  	if (ret)
> > -		goto mdev_fail;
> > +		goto create_fail;
> >
> > -	ret = mdev_device_create_ops(kobj, mdev);
> > +	ret = device_add(&mdev->dev);
> >  	if (ret)
> > -		goto create_fail;
> > +		goto dev_fail;
> >
> >  	ret = mdev_create_sysfs_files(&mdev->dev, type);
> > -	if (ret) {
> > -		mdev_device_remove_ops(mdev, true);
> > -		goto create_fail;
> > -	}
> > +	if (ret)
> > +		goto sysfs_fail;
> >
> >  	mdev->type_kobj = kobj;
> >  	mdev->active = true;
> >  	dev_dbg(&mdev->dev, "MDEV: created\n");
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >
> >  	return 0;
> >
> > +sysfs_fail:
> > +	device_del(&mdev->dev);
> > +dev_fail:
> > +	type->parent->ops->remove(mdev);
> >  create_fail:
> > -	device_unregister(&mdev->dev);
> > +	put_device(&mdev->dev);
> >  mdev_fail:
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >  	mdev_put_parent(parent);
> >  	return ret;
> >  }
> >
> > -int mdev_device_remove(struct device *dev, bool force_remove)
> > +int mdev_device_remove(struct device *dev)
> >  {
> > +	struct mdev_parent *valid_parent;
> >  	struct mdev_device *mdev;
> >  	struct mdev_parent *parent;
> > -	struct mdev_type *type;
> > +	int srcu_idx;
> >  	int ret;
> >
> >  	mdev = to_mdev_device(dev);
> > +	parent = mdev->parent;
> > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > +	if (!valid_parent) {
> > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > +		/* parent is undergoing unregistration */
> > +		return -ENODEV;
> > +	}
> > +
> > +	mutex_lock(&mdev_list_lock);
> >  	if (!mdev->active) {
> >  		mutex_unlock(&mdev_list_lock);
> > -		return -EAGAIN;
> > +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> > +		return -ENODEV;
> >  	}
> > -
> >  	mdev->active = false;
> >  	mutex_unlock(&mdev_list_lock);
> >
> > -	type = to_mdev_type(mdev->type_kobj);
> > -	parent = mdev->parent;
> > -
> > -	ret = mdev_device_remove_ops(mdev, force_remove);
> > -	if (ret) {
> > -		mdev->active = true;
> > -		return ret;
> > -	}
> > +	ret = mdev_device_must_remove(mdev);
> > +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> >
> > -	mdev_remove_sysfs_files(dev, type);
> > -	device_unregister(dev);
> >  	mdev_put_parent(parent);
> > -
> > -	return 0;
> > +	return ret;
> >  }
> >
> >  static int __init mdev_init(void)
> > diff --git a/drivers/vfio/mdev/mdev_private.h
> > b/drivers/vfio/mdev/mdev_private.h
> > index 84b2b6c..3d17db9 100644
> > --- a/drivers/vfio/mdev/mdev_private.h
> > +++ b/drivers/vfio/mdev/mdev_private.h
> > @@ -23,6 +23,11 @@ struct mdev_parent {
> >  	struct list_head next;
> >  	struct kset *mdev_types_kset;
> >  	struct list_head type_list;
> > +	/* Protects unregistration to wait until create/remove
> > +	 * are completed.
> > +	 */
> > +	struct srcu_struct unreg_srcu;
> > +	struct mdev_parent __rcu *self;
> >  };
> >
> >  struct mdev_device {
> > @@ -58,6 +63,6 @@ struct mdev_type {
> >  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type
> > *type);
> >
> >  int  mdev_device_create(struct kobject *kobj, struct device *dev,
> > uuid_le uuid); -int  mdev_device_remove(struct device *dev, bool
> > force_remove);
> > +int  mdev_device_remove(struct device *dev);
> >
> >  #endif /* MDEV_PRIVATE_H */
> > diff --git a/drivers/vfio/mdev/mdev_sysfs.c
> > b/drivers/vfio/mdev/mdev_sysfs.c index c782fa9..68a8191 100644
> > --- a/drivers/vfio/mdev/mdev_sysfs.c
> > +++ b/drivers/vfio/mdev/mdev_sysfs.c
> > @@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev,
> > struct device_attribute *attr,
> >  	if (val && device_remove_file_self(dev, attr)) {
> >  		int ret;
> >
> > -		ret = mdev_device_remove(dev, false);
> > -		if (ret) {
> > -			device_create_file(dev, attr);
> > +		ret = mdev_device_remove(dev);
> > +		if (ret)
> >  			return ret;
> > -		}
> >  	}
> >
> >  	return count;
> 
> The patch looks OK to me, especially looking at the code after the changes
> were apllied. I might have missed something though due to amount of
> changes done.
> 
> I lightly tested the whole patch series with my mdev driver, and it seems to
> survive, but my testing doesn't test much of the error paths so there that.
> 
> I'll keep this applied so if I notice any errors I'll let you know.
> 
> If you could split this into few patches, this would be even better, but
> anyway thanks a lot for this work!
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
Thanks a lot Maxim for the review.
This particular patch cannot be cut into more patches because it touches serialization between multiple functions.
So all of those have to be touched.
I tried to cut into two where previous patch does the mdev cleanup, but I guess for correctness I got to merge here.
Only the commit message is probably big, but I had to explain all 4-5 cases for this refactor.

I will send v1 with below changes.
1. drop patch-1 in the series.
2. simplify patch 7 to drop the bool part...
3. move the loop_removal code in mdev_remove() from patch-7 to 8 for correctness.
Alex Williamson March 25, 2019, 11:18 p.m. UTC | #3
On Fri, 22 Mar 2019 18:20:35 -0500
Parav Pandit <parav@mellanox.com> wrote:

> There are five problems with current code structure.
> 1. mdev device is placed on the mdev bus before it is created in the
> vendor driver. Once a device is placed on the mdev bus without creating
> its supporting underlying vendor device, an open() can get triggered by
> userspace on partially initialized device.
> Below ladder diagram highlight it.
> 
>       cpu-0                                       cpu-1
>       -----                                       -----
>    create_store()
>      mdev_create_device()
>        device_register()
>           ...
>          vfio_mdev_probe()
>          ...creates char device
>                                         vfio_mdev_open()
>                                           parent->ops->open(mdev)
>                                             vfio_ap_mdev_open()
>                                               matrix_mdev = NULL
>         [...]
>         parent->ops->create()
>           vfio_ap_mdev_create()
>             mdev_set_drvdata(mdev, matrix_mdev);
>             /* Valid pointer set above */
> 
> 2. Current creation sequence is,
>    parent->ops_create()
>    groups_register()
> 
> Remove sequence is,
>    parent->ops->remove()
>    groups_unregister()
> However, remove sequence should be exact mirror of creation sequence.
> Once this is achieved, all users of the mdev will be terminated first
> before removing underlying vendor device.
> (Follow standard linux driver model).
> At that point vendor's remove() ops shouldn't failed because device is
> taken off the bus that should terminate the users.
> 
> 3. Additionally any new mdev driver that wants to work on mdev device
> during probe() routine registered using mdev_register_driver() needs to
> get stable mdev structure.
> 
> 4. In following sequence, child devices created while removing mdev parent
> device can be left out, or it may lead to race of removing half
> initialized child mdev devices.
> 
> issue-1:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
>                                   mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
>                                   parent_remove_sysfs_files()
>                                   /* BUG: device added by cpu-0
>                                    * whose parent is getting removed.
>                                    */
> 
> issue-2:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
> 
>        [...]                      mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> 
>        mdev_create_sysfs_files()
>        /* BUG: create is adding
>         * sysfs files for a device
>         * which is undergoing removal.
>         */
>                                  parent_remove_sysfs_files()

In both cases above, it looks like the device will hold a reference to
the parent, so while there is a race, the parent object isn't released.

> 
> 5. Below crash is observed when user initiated remove is in progress
> and mdev_unregister_driver() completes parent unregistration.
> 
>        cpu-0                         cpu-1
>        -----                         -----
> remove_store()
>    mdev_device_remove()
>    active = false;
>                                   mdev_unregister_device()
>                                     remove type
>    [...]
>    mdev_remove_ops() crashes.
> 
> This is similar race like create() racing with mdev_unregister_device().

Not sure I catch this, the device should have a reference to the
parent, and we don't specifically clear parent->ops, so what's getting
removed that causes this oops?  Is .remove pointing at bad text
regardless?

> mtty mtty: MDEV: Registered
> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> mdev_device_remove sleep started
> mtty mtty: MDEV: Unregistering
> mtty_dev: Unloaded!
> BUG: unable to handle kernel paging request at ffffffffc027d668
> PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> Oops: 0000 [#1] SMP PTI
> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> Call Trace:
>  mdev_device_remove+0xef/0x130 [mdev]
>  remove_store+0x77/0xa0 [mdev]
>  kernfs_fop_write+0x113/0x1a0
>  __vfs_write+0x33/0x1b0
>  ? rcu_read_lock_sched_held+0x64/0x70
>  ? rcu_sync_lockdep_assert+0x2a/0x50
>  ? __sb_start_write+0x121/0x1b0
>  ? vfs_write+0x17c/0x1b0
>  vfs_write+0xad/0x1b0
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  ksys_write+0x55/0xc0
>  do_syscall_64+0x5a/0x210
> 
> Therefore, mdev core is improved in following ways to overcome above
> issues.
> 
> 1. Before placing mdev devices on the bus, perform vendor drivers
> creation which supports the mdev creation.
> This ensures that mdev specific all necessary fields are initialized
> before a given mdev can be accessed by bus driver.
> 
> 2. During remove flow, first remove the device from the bus. This
> ensures that any bus specific devices and data is cleared.
> Once device is taken of the mdev bus, perform remove() of mdev from the
> vendor driver.
> 
> 3. Linux core device model provides way to register and auto unregister
> the device sysfs attribute groups at dev->groups.
> Make use of this groups to let core create the groups and simplify code
> to avoid explicit groups creation and removal.
> 
> 4. Wait for any ongoing mdev create() and remove() to finish before
> unregistering parent device using srcu. This continues to allow multiple
> create and remove to progress in parallel. At the same time guard parent
> removal while parent is being access by create() and remove callbacks.

So there should be 4-5 separate patches here?  Wishful thinking?

> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++------------------
>  drivers/vfio/mdev/mdev_private.h |   7 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
>  3 files changed, 84 insertions(+), 71 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 944a058..8fe0ed1 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
>  						  ref);
>  	struct device *dev = parent->dev;
>  
> +	cleanup_srcu_struct(&parent->unreg_srcu);
>  	kfree(parent);
>  	put_device(dev);
>  }
> @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct mdev_parent *parent)
>  		kref_put(&parent->ref, mdev_release_parent);
>  }
>  
> -static int mdev_device_create_ops(struct kobject *kobj,
> -				  struct mdev_device *mdev)
> +static int mdev_device_must_remove(struct mdev_device *mdev)

Naming is off here, mdev_device_remove_common()?

>  {
> -	struct mdev_parent *parent = mdev->parent;
> +	struct mdev_parent *parent;
> +	struct mdev_type *type;
>  	int ret;
>  
> -	ret = parent->ops->create(kobj, mdev);
> -	if (ret)
> -		return ret;
> +	type = to_mdev_type(mdev->type_kobj);
>  
> -	ret = sysfs_create_groups(&mdev->dev.kobj,
> -				  parent->ops->mdev_attr_groups);
> +	mdev_remove_sysfs_files(&mdev->dev, type);
> +	device_del(&mdev->dev);
> +	parent = mdev->parent;
> +	ret = parent->ops->remove(mdev);
>  	if (ret)
> -		parent->ops->remove(mdev);
> +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);

Let the caller decide whether to be verbose with the error, parent
removal might want to warn, sysfs remove might just return an error.

>  
> +	/* Balances with device_initialize() */
> +	put_device(&mdev->dev);
>  	return ret;
>  }
>  
> -/*
> - * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
> - * device is being unregistered from mdev device framework.
> - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> - *   indicates that if the mdev device is active, used by VMM or userspace
> - *   application, vendor driver could return error then don't remove the device.
> - * - 'force_remove' is set to 'true' when called from mdev_unregister_device()
> - *   which indicate that parent device is being removed from mdev device
> - *   framework so remove mdev device forcefully.
> - */
> -static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> -{
> -	struct mdev_parent *parent = mdev->parent;
> -	int ret;
> -
> -	/*
> -	 * Vendor driver can return error if VMM or userspace application is
> -	 * using this mdev device.
> -	 */
> -	ret = parent->ops->remove(mdev);
> -	if (ret && !force_remove)
> -		return ret;
> -
> -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> -	return 0;
> -}

Seems like there's easily a separate patch in pushing the create/remove
ops into the calling function and separating for the iterator callback,
that would make this easier to review.

> -
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
>  	if (dev_is_mdev(dev))
> -		mdev_device_remove(dev, true);
> -
> +		mdev_device_must_remove(to_mdev_device(dev));
>  	return 0;
>  }
>  
> @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	}
>  
>  	kref_init(&parent->ref);
> +	init_srcu_struct(&parent->unreg_srcu);
>  
>  	parent->dev = dev;
>  	parent->ops = ops;
> @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	if (ret)
>  		dev_warn(dev, "Failed to create compatibility class link\n");
>  
> +	rcu_assign_pointer(parent->self, parent);
>  	list_add(&parent->next, &parent_list);
>  	mutex_unlock(&parent_list_lock);
>  
> @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> -
>  	if (!parent) {
>  		mutex_unlock(&parent_list_lock);
>  		return;
>  	}
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +
>  	dev_info(dev, "MDEV: Unregistering\n");
>  
> -	list_del(&parent->next);
> +	/* Publish that this mdev parent is unregistering. So any new
> +	 * create/remove cannot start on this parent anymore by user.
> +	 */

Comment style, we're not in netdev.

> +	rcu_assign_pointer(parent->self, NULL);
> +
> +	/*
> +	 * Wait for any active create() or remove() mdev ops on the parent
> +	 * to complete.
> +	 */
> +	synchronize_srcu(&parent->unreg_srcu);
> +
> +	/* At this point it is confirmed that any pending user initiated
> +	 * create or remove callbacks accessing the parent are completed.
> +	 * It is safe to remove the parent now.
> +	 */
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
>  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> -	mutex_unlock(&parent_list_lock);
>  	mdev_put_parent(parent);
>  }
>  EXPORT_SYMBOL(mdev_unregister_device);
> @@ -278,14 +270,24 @@ static void mdev_device_release(struct device *dev)
>  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  {
>  	int ret;
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	int srcu_idx;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		/* parent is undergoing unregistration */
> +		ret = -ENODEV;
> +		goto mdev_fail;
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  
>  	mdev->parent = parent;
>  
> +	device_initialize(&mdev->dev);
>  	mdev->dev.parent  = dev;
>  	mdev->dev.bus     = &mdev_bus_type;
>  	mdev->dev.release = mdev_device_release;
> +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
> -	ret = device_register(&mdev->dev);
> +	ret = type->parent->ops->create(kobj, mdev);
>  	if (ret)
> -		goto mdev_fail;
> +		goto create_fail;
>  
> -	ret = mdev_device_create_ops(kobj, mdev);
> +	ret = device_add(&mdev->dev);

Separating device_initialize() and device_add() also looks like a
separate patch, then the srcu could be added at the end.  Thanks,

Alex

>  	if (ret)
> -		goto create_fail;
> +		goto dev_fail;
>  
>  	ret = mdev_create_sysfs_files(&mdev->dev, type);
> -	if (ret) {
> -		mdev_device_remove_ops(mdev, true);
> -		goto create_fail;
> -	}
> +	if (ret)
> +		goto sysfs_fail;
>  
>  	mdev->type_kobj = kobj;
>  	mdev->active = true;
>  	dev_dbg(&mdev->dev, "MDEV: created\n");
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
>  	return 0;
>  
> +sysfs_fail:
> +	device_del(&mdev->dev);
> +dev_fail:
> +	type->parent->ops->remove(mdev);
>  create_fail:
> -	device_unregister(&mdev->dev);
> +	put_device(&mdev->dev);
>  mdev_fail:
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  	mdev_put_parent(parent);
>  	return ret;
>  }
>  
> -int mdev_device_remove(struct device *dev, bool force_remove)
> +int mdev_device_remove(struct device *dev)
>  {
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
> -	struct mdev_type *type;
> +	int srcu_idx;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		/* parent is undergoing unregistration */
> +		return -ENODEV;
> +	}
> +
> +	mutex_lock(&mdev_list_lock);
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
> -		return -EAGAIN;
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		return -ENODEV;
>  	}
> -
>  	mdev->active = false;
>  	mutex_unlock(&mdev_list_lock);
>  
> -	type = to_mdev_type(mdev->type_kobj);
> -	parent = mdev->parent;
> -
> -	ret = mdev_device_remove_ops(mdev, force_remove);
> -	if (ret) {
> -		mdev->active = true;
> -		return ret;
> -	}
> +	ret = mdev_device_must_remove(mdev);
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
> -	mdev_remove_sysfs_files(dev, type);
> -	device_unregister(dev);
>  	mdev_put_parent(parent);
> -
> -	return 0;
> +	return ret;
>  }
>  
>  static int __init mdev_init(void)
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 84b2b6c..3d17db9 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -23,6 +23,11 @@ struct mdev_parent {
>  	struct list_head next;
>  	struct kset *mdev_types_kset;
>  	struct list_head type_list;
> +	/* Protects unregistration to wait until create/remove
> +	 * are completed.
> +	 */
> +	struct srcu_struct unreg_srcu;
> +	struct mdev_parent __rcu *self;
>  };
>  
>  struct mdev_device {
> @@ -58,6 +63,6 @@ struct mdev_type {
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
>  int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
> -int  mdev_device_remove(struct device *dev, bool force_remove);
> +int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index c782fa9..68a8191 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
>  	if (val && device_remove_file_self(dev, attr)) {
>  		int ret;
>  
> -		ret = mdev_device_remove(dev, false);
> -		if (ret) {
> -			device_create_file(dev, attr);
> +		ret = mdev_device_remove(dev);
> +		if (ret)
>  			return ret;
> -		}
>  	}
>  
>  	return count;
Parav Pandit March 25, 2019, 11:34 p.m. UTC | #4
> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 6:19 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Fri, 22 Mar 2019 18:20:35 -0500
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > There are five problems with current code structure.
> > 1. mdev device is placed on the mdev bus before it is created in the
> > vendor driver. Once a device is placed on the mdev bus without
> > creating its supporting underlying vendor device, an open() can get
> > triggered by userspace on partially initialized device.
> > Below ladder diagram highlight it.
> >
> >       cpu-0                                       cpu-1
> >       -----                                       -----
> >    create_store()
> >      mdev_create_device()
> >        device_register()
> >           ...
> >          vfio_mdev_probe()
> >          ...creates char device
> >                                         vfio_mdev_open()
> >                                           parent->ops->open(mdev)
> >                                             vfio_ap_mdev_open()
> >                                               matrix_mdev = NULL
> >         [...]
> >         parent->ops->create()
> >           vfio_ap_mdev_create()
> >             mdev_set_drvdata(mdev, matrix_mdev);
> >             /* Valid pointer set above */
> >
> > 2. Current creation sequence is,
> >    parent->ops_create()
> >    groups_register()
> >
> > Remove sequence is,
> >    parent->ops->remove()
> >    groups_unregister()
> > However, remove sequence should be exact mirror of creation sequence.
> > Once this is achieved, all users of the mdev will be terminated first
> > before removing underlying vendor device.
> > (Follow standard linux driver model).
> > At that point vendor's remove() ops shouldn't failed because device is
> > taken off the bus that should terminate the users.
> >
> > 3. Additionally any new mdev driver that wants to work on mdev device
> > during probe() routine registered using mdev_register_driver() needs
> > to get stable mdev structure.
> >
> > 4. In following sequence, child devices created while removing mdev
> > parent device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> >
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> >
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> >
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> 
> In both cases above, it looks like the device will hold a reference to the
> parent, so while there is a race, the parent object isn't released.
Yes, parent object is not released but parent fields are not stable.

> 
> >
> > 5. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_driver() completes parent unregistration.
> >
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> >
> > This is similar race like create() racing with mdev_unregister_device().
> 
> Not sure I catch this, the device should have a reference to the parent, and
> we don't specifically clear parent->ops, so what's getting removed that
> causes this oops?  Is .remove pointing at bad text regardless?
> 
I guess the mdev_attr_groups being stale now.

> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> >
> > Therefore, mdev core is improved in following ways to overcome above
> > issues.
> >
> > 1. Before placing mdev devices on the bus, perform vendor drivers
> > creation which supports the mdev creation.
> > This ensures that mdev specific all necessary fields are initialized
> > before a given mdev can be accessed by bus driver.
> >
> > 2. During remove flow, first remove the device from the bus. This
> > ensures that any bus specific devices and data is cleared.
> > Once device is taken of the mdev bus, perform remove() of mdev from
> > the vendor driver.
> >
> > 3. Linux core device model provides way to register and auto
> > unregister the device sysfs attribute groups at dev->groups.
> > Make use of this groups to let core create the groups and simplify
> > code to avoid explicit groups creation and removal.
> >
> > 4. Wait for any ongoing mdev create() and remove() to finish before
> > unregistering parent device using srcu. This continues to allow
> > multiple create and remove to progress in parallel. At the same time
> > guard parent removal while parent is being access by create() and remove
> callbacks.
> 
> So there should be 4-5 separate patches here?  Wishful thinking?
> 
create, remove racing with unregister is handled using srcu.
Change-3 cannot be done without fixing the sequence so it should be in patch that fixes it.
Change described changes 1-2-3 are just one change. It is just the patch description to bring clarity.
Change-4 can be possibly done as split to different patch.

> > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > ---
> >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++--------------
> ----
> >  drivers/vfio/mdev/mdev_private.h |   7 +-
> >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> >  3 files changed, 84 insertions(+), 71 deletions(-)
> >
> > diff --git a/drivers/vfio/mdev/mdev_core.c
> > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > --- a/drivers/vfio/mdev/mdev_core.c
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> >  						  ref);
> >  	struct device *dev = parent->dev;
> >
> > +	cleanup_srcu_struct(&parent->unreg_srcu);
> >  	kfree(parent);
> >  	put_device(dev);
> >  }
> > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct
> mdev_parent *parent)
> >  		kref_put(&parent->ref, mdev_release_parent);  }
> >
> > -static int mdev_device_create_ops(struct kobject *kobj,
> > -				  struct mdev_device *mdev)
> > +static int mdev_device_must_remove(struct mdev_device *mdev)
> 
> Naming is off here, mdev_device_remove_common()?
> 
Yes, sounds better.

> >  {
> > -	struct mdev_parent *parent = mdev->parent;
> > +	struct mdev_parent *parent;
> > +	struct mdev_type *type;
> >  	int ret;
> >
> > -	ret = parent->ops->create(kobj, mdev);
> > -	if (ret)
> > -		return ret;
> > +	type = to_mdev_type(mdev->type_kobj);
> >
> > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > -				  parent->ops->mdev_attr_groups);
> > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > +	device_del(&mdev->dev);
> > +	parent = mdev->parent;
> > +	ret = parent->ops->remove(mdev);
> >  	if (ret)
> > -		parent->ops->remove(mdev);
> > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
> 
> Let the caller decide whether to be verbose with the error, parent removal
> might want to warn, sysfs remove might just return an error.
> 
I didn't follow. Caller meaning mdev_device_remove_common() or vendor driver?

> >
> > +	/* Balances with device_initialize() */
> > +	put_device(&mdev->dev);
> >  	return ret;
> >  }
> >
> > -/*
> > - * mdev_device_remove_ops gets called from sysfs's 'remove' and when
> > parent
> > - * device is being unregistered from mdev device framework.
> > - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> > - *   indicates that if the mdev device is active, used by VMM or userspace
> > - *   application, vendor driver could return error then don't remove the
> device.
> > - * - 'force_remove' is set to 'true' when called from
> mdev_unregister_device()
> > - *   which indicate that parent device is being removed from mdev device
> > - *   framework so remove mdev device forcefully.
> > - */
> > -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> > force_remove) -{
> > -	struct mdev_parent *parent = mdev->parent;
> > -	int ret;
> > -
> > -	/*
> > -	 * Vendor driver can return error if VMM or userspace application is
> > -	 * using this mdev device.
> > -	 */
> > -	ret = parent->ops->remove(mdev);
> > -	if (ret && !force_remove)
> > -		return ret;
> > -
> > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> >mdev_attr_groups);
> > -	return 0;
> > -}
> 
> Seems like there's easily a separate patch in pushing the create/remove ops
> into the calling function and separating for the iterator callback, that would
> make this easier to review.
> 
> > -
> >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> >  	if (dev_is_mdev(dev))
> > -		mdev_device_remove(dev, true);
> > -
> > +		mdev_device_must_remove(to_mdev_device(dev));
> >  	return 0;
> >  }
> >
> > @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const
> struct mdev_parent_ops *ops)
> >  	}
> >
> >  	kref_init(&parent->ref);
> > +	init_srcu_struct(&parent->unreg_srcu);
> >
> >  	parent->dev = dev;
> >  	parent->ops = ops;
> > @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const
> struct mdev_parent_ops *ops)
> >  	if (ret)
> >  		dev_warn(dev, "Failed to create compatibility class link\n");
> >
> > +	rcu_assign_pointer(parent->self, parent);
> >  	list_add(&parent->next, &parent_list);
> >  	mutex_unlock(&parent_list_lock);
> >
> > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
> >
> >  	mutex_lock(&parent_list_lock);
> >  	parent = __find_parent_device(dev);
> > -
> >  	if (!parent) {
> >  		mutex_unlock(&parent_list_lock);
> >  		return;
> >  	}
> > +	list_del(&parent->next);
> > +	mutex_unlock(&parent_list_lock);
> > +
> >  	dev_info(dev, "MDEV: Unregistering\n");
> >
> > -	list_del(&parent->next);
> > +	/* Publish that this mdev parent is unregistering. So any new
> > +	 * create/remove cannot start on this parent anymore by user.
> > +	 */
> 
> Comment style, we're not in netdev.
Yep. Will fix it.
> 
> > +	rcu_assign_pointer(parent->self, NULL);
> > +
> > +	/*
> > +	 * Wait for any active create() or remove() mdev ops on the parent
> > +	 * to complete.
> > +	 */
> > +	synchronize_srcu(&parent->unreg_srcu);
> > +
> > +	/* At this point it is confirmed that any pending user initiated
> > +	 * create or remove callbacks accessing the parent are completed.
> > +	 * It is safe to remove the parent now.
> > +	 */
> >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> >
> >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> >
> >  	parent_remove_sysfs_files(parent);
> >
> > -	mutex_unlock(&parent_list_lock);
> >  	mdev_put_parent(parent);
> >  }
> >  EXPORT_SYMBOL(mdev_unregister_device);
> > @@ -278,14 +270,24 @@ static void mdev_device_release(struct device
> > *dev)  int mdev_device_create(struct kobject *kobj, struct device
> > *dev, uuid_le uuid)  {
> >  	int ret;
> > +	struct mdev_parent *valid_parent;
> >  	struct mdev_device *mdev, *tmp;
> >  	struct mdev_parent *parent;
> >  	struct mdev_type *type = to_mdev_type(kobj);
> > +	int srcu_idx;
> >
> >  	parent = mdev_get_parent(type->parent);
> >  	if (!parent)
> >  		return -EINVAL;
> >
> > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > +	if (!valid_parent) {
> > +		/* parent is undergoing unregistration */
> > +		ret = -ENODEV;
> > +		goto mdev_fail;
> > +	}
> > +
> >  	mutex_lock(&mdev_list_lock);
> >
> >  	/* Check for duplicate */
> > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj,
> > struct device *dev, uuid_le uuid)
> >
> >  	mdev->parent = parent;
> >
> > +	device_initialize(&mdev->dev);
> >  	mdev->dev.parent  = dev;
> >  	mdev->dev.bus     = &mdev_bus_type;
> >  	mdev->dev.release = mdev_device_release;
> > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> >
> > -	ret = device_register(&mdev->dev);
> > +	ret = type->parent->ops->create(kobj, mdev);
> >  	if (ret)
> > -		goto mdev_fail;
> > +		goto create_fail;
> >
> > -	ret = mdev_device_create_ops(kobj, mdev);
> > +	ret = device_add(&mdev->dev);
> 
> Separating device_initialize() and device_add() also looks like a separate
> patch, then the srcu could be added at the end.  Thanks,
> 
> Alex

I saw little more core generated that way, but I think its fine.
Basically, create/remove callback sequencing that does the device_inititailze/add etc in one patch and 
User side race handling using srcu in another patch.
Sounds good?
Alex Williamson March 26, 2019, 12:05 a.m. UTC | #5
On Mon, 25 Mar 2019 23:34:28 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, March 25, 2019 6:19 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com
> > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> > 
> > On Fri, 22 Mar 2019 18:20:35 -0500
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > There are five problems with current code structure.
> > > 1. mdev device is placed on the mdev bus before it is created in the
> > > vendor driver. Once a device is placed on the mdev bus without
> > > creating its supporting underlying vendor device, an open() can get
> > > triggered by userspace on partially initialized device.
> > > Below ladder diagram highlight it.
> > >
> > >       cpu-0                                       cpu-1
> > >       -----                                       -----
> > >    create_store()
> > >      mdev_create_device()
> > >        device_register()
> > >           ...
> > >          vfio_mdev_probe()
> > >          ...creates char device
> > >                                         vfio_mdev_open()
> > >                                           parent->ops->open(mdev)
> > >                                             vfio_ap_mdev_open()
> > >                                               matrix_mdev = NULL
> > >         [...]
> > >         parent->ops->create()
> > >           vfio_ap_mdev_create()
> > >             mdev_set_drvdata(mdev, matrix_mdev);
> > >             /* Valid pointer set above */
> > >
> > > 2. Current creation sequence is,
> > >    parent->ops_create()
> > >    groups_register()
> > >
> > > Remove sequence is,
> > >    parent->ops->remove()
> > >    groups_unregister()
> > > However, remove sequence should be exact mirror of creation sequence.
> > > Once this is achieved, all users of the mdev will be terminated first
> > > before removing underlying vendor device.
> > > (Follow standard linux driver model).
> > > At that point vendor's remove() ops shouldn't failed because device is
> > > taken off the bus that should terminate the users.
> > >
> > > 3. Additionally any new mdev driver that wants to work on mdev device
> > > during probe() routine registered using mdev_register_driver() needs
> > > to get stable mdev structure.
> > >
> > > 4. In following sequence, child devices created while removing mdev
> > > parent device can be left out, or it may lead to race of removing half
> > > initialized child mdev devices.
> > >
> > > issue-1:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > >                                   mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >                                   parent_remove_sysfs_files()
> > >                                   /* BUG: device added by cpu-0
> > >                                    * whose parent is getting removed.
> > >                                    */
> > >
> > > issue-2:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >
> > >        [...]                      mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > >
> > >        mdev_create_sysfs_files()
> > >        /* BUG: create is adding
> > >         * sysfs files for a device
> > >         * which is undergoing removal.
> > >         */
> > >                                  parent_remove_sysfs_files()  
> > 
> > In both cases above, it looks like the device will hold a reference to the
> > parent, so while there is a race, the parent object isn't released.  
> Yes, parent object is not released but parent fields are not stable.
> 
> >   
> > >
> > > 5. Below crash is observed when user initiated remove is in progress
> > > and mdev_unregister_driver() completes parent unregistration.
> > >
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > remove_store()
> > >    mdev_device_remove()
> > >    active = false;
> > >                                   mdev_unregister_device()
> > >                                     remove type
> > >    [...]
> > >    mdev_remove_ops() crashes.
> > >
> > > This is similar race like create() racing with mdev_unregister_device().  
> > 
> > Not sure I catch this, the device should have a reference to the parent, and
> > we don't specifically clear parent->ops, so what's getting removed that
> > causes this oops?  Is .remove pointing at bad text regardless?
> >   
> I guess the mdev_attr_groups being stale now.
> 
> > > mtty mtty: MDEV: Registered
> > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > > mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > > mtty_dev: Unloaded!
> > > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > Oops: 0000 [#1] SMP PTI
> > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > >  mdev_device_remove+0xef/0x130 [mdev]
> > >  remove_store+0x77/0xa0 [mdev]
> > >  kernfs_fop_write+0x113/0x1a0
> > >  __vfs_write+0x33/0x1b0
> > >  ? rcu_read_lock_sched_held+0x64/0x70
> > >  ? rcu_sync_lockdep_assert+0x2a/0x50
> > >  ? __sb_start_write+0x121/0x1b0
> > >  ? vfs_write+0x17c/0x1b0
> > >  vfs_write+0xad/0x1b0
> > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > >  ksys_write+0x55/0xc0
> > >  do_syscall_64+0x5a/0x210
> > >
> > > Therefore, mdev core is improved in following ways to overcome above
> > > issues.
> > >
> > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > creation which supports the mdev creation.
> > > This ensures that mdev specific all necessary fields are initialized
> > > before a given mdev can be accessed by bus driver.
> > >
> > > 2. During remove flow, first remove the device from the bus. This
> > > ensures that any bus specific devices and data is cleared.
> > > Once device is taken of the mdev bus, perform remove() of mdev from
> > > the vendor driver.
> > >
> > > 3. Linux core device model provides way to register and auto
> > > unregister the device sysfs attribute groups at dev->groups.
> > > Make use of this groups to let core create the groups and simplify
> > > code to avoid explicit groups creation and removal.
> > >
> > > 4. Wait for any ongoing mdev create() and remove() to finish before
> > > unregistering parent device using srcu. This continues to allow
> > > multiple create and remove to progress in parallel. At the same time
> > > guard parent removal while parent is being access by create() and remove  
> > callbacks.
> > 
> > So there should be 4-5 separate patches here?  Wishful thinking?
> >   
> create, remove racing with unregister is handled using srcu.
> Change-3 cannot be done without fixing the sequence so it should be in patch that fixes it.
> Change described changes 1-2-3 are just one change. It is just the patch description to bring clarity.
> Change-4 can be possibly done as split to different patch.
> 
> > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > ---
> > >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++--------------  
> > ----  
> > >  drivers/vfio/mdev/mdev_private.h |   7 +-
> > >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> > >  3 files changed, 84 insertions(+), 71 deletions(-)
> > >
> > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > > --- a/drivers/vfio/mdev/mdev_core.c
> > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > >  						  ref);
> > >  	struct device *dev = parent->dev;
> > >
> > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > >  	kfree(parent);
> > >  	put_device(dev);
> > >  }
> > > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct  
> > mdev_parent *parent)  
> > >  		kref_put(&parent->ref, mdev_release_parent);  }
> > >
> > > -static int mdev_device_create_ops(struct kobject *kobj,
> > > -				  struct mdev_device *mdev)
> > > +static int mdev_device_must_remove(struct mdev_device *mdev)  
> > 
> > Naming is off here, mdev_device_remove_common()?
> >   
> Yes, sounds better.
> 
> > >  {
> > > -	struct mdev_parent *parent = mdev->parent;
> > > +	struct mdev_parent *parent;
> > > +	struct mdev_type *type;
> > >  	int ret;
> > >
> > > -	ret = parent->ops->create(kobj, mdev);
> > > -	if (ret)
> > > -		return ret;
> > > +	type = to_mdev_type(mdev->type_kobj);
> > >
> > > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > > -				  parent->ops->mdev_attr_groups);
> > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > +	device_del(&mdev->dev);
> > > +	parent = mdev->parent;
> > > +	ret = parent->ops->remove(mdev);
> > >  	if (ret)
> > > -		parent->ops->remove(mdev);
> > > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);  
> > 
> > Let the caller decide whether to be verbose with the error, parent removal
> > might want to warn, sysfs remove might just return an error.
> >   
> I didn't follow. Caller meaning mdev_device_remove_common() or vendor driver?

I mean the callback iterator on the parent remove can do a WARN_ON if
this returns an error while the device remove path can silently return
-EBUSY, the common function doesn't need to decide whether the parent
ops remove function deserves a dev_err.

> > >
> > > +	/* Balances with device_initialize() */
> > > +	put_device(&mdev->dev);
> > >  	return ret;
> > >  }
> > >
> > > -/*
> > > - * mdev_device_remove_ops gets called from sysfs's 'remove' and when
> > > parent
> > > - * device is being unregistered from mdev device framework.
> > > - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> > > - *   indicates that if the mdev device is active, used by VMM or userspace
> > > - *   application, vendor driver could return error then don't remove the  
> > device.  
> > > - * - 'force_remove' is set to 'true' when called from  
> > mdev_unregister_device()  
> > > - *   which indicate that parent device is being removed from mdev device
> > > - *   framework so remove mdev device forcefully.
> > > - */
> > > -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> > > force_remove) -{
> > > -	struct mdev_parent *parent = mdev->parent;
> > > -	int ret;
> > > -
> > > -	/*
> > > -	 * Vendor driver can return error if VMM or userspace application is
> > > -	 * using this mdev device.
> > > -	 */
> > > -	ret = parent->ops->remove(mdev);
> > > -	if (ret && !force_remove)
> > > -		return ret;
> > > -
> > > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> > >mdev_attr_groups);
> > > -	return 0;
> > > -}  
> > 
> > Seems like there's easily a separate patch in pushing the create/remove ops
> > into the calling function and separating for the iterator callback, that would
> > make this easier to review.
> >   
> > > -
> > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > >  	if (dev_is_mdev(dev))
> > > -		mdev_device_remove(dev, true);
> > > -
> > > +		mdev_device_must_remove(to_mdev_device(dev));
> > >  	return 0;
> > >  }
> > >
> > > @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const  
> > struct mdev_parent_ops *ops)  
> > >  	}
> > >
> > >  	kref_init(&parent->ref);
> > > +	init_srcu_struct(&parent->unreg_srcu);
> > >
> > >  	parent->dev = dev;
> > >  	parent->ops = ops;
> > > @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const  
> > struct mdev_parent_ops *ops)  
> > >  	if (ret)
> > >  		dev_warn(dev, "Failed to create compatibility class link\n");
> > >
> > > +	rcu_assign_pointer(parent->self, parent);
> > >  	list_add(&parent->next, &parent_list);
> > >  	mutex_unlock(&parent_list_lock);
> > >
> > > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
> > >
> > >  	mutex_lock(&parent_list_lock);
> > >  	parent = __find_parent_device(dev);
> > > -
> > >  	if (!parent) {
> > >  		mutex_unlock(&parent_list_lock);
> > >  		return;
> > >  	}
> > > +	list_del(&parent->next);
> > > +	mutex_unlock(&parent_list_lock);
> > > +
> > >  	dev_info(dev, "MDEV: Unregistering\n");
> > >
> > > -	list_del(&parent->next);
> > > +	/* Publish that this mdev parent is unregistering. So any new
> > > +	 * create/remove cannot start on this parent anymore by user.
> > > +	 */  
> > 
> > Comment style, we're not in netdev.  
> Yep. Will fix it.
> >   
> > > +	rcu_assign_pointer(parent->self, NULL);
> > > +
> > > +	/*
> > > +	 * Wait for any active create() or remove() mdev ops on the parent
> > > +	 * to complete.
> > > +	 */
> > > +	synchronize_srcu(&parent->unreg_srcu);
> > > +
> > > +	/* At this point it is confirmed that any pending user initiated
> > > +	 * create or remove callbacks accessing the parent are completed.
> > > +	 * It is safe to remove the parent now.
> > > +	 */
> > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > >
> > >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > >
> > >  	parent_remove_sysfs_files(parent);
> > >
> > > -	mutex_unlock(&parent_list_lock);
> > >  	mdev_put_parent(parent);
> > >  }
> > >  EXPORT_SYMBOL(mdev_unregister_device);
> > > @@ -278,14 +270,24 @@ static void mdev_device_release(struct device
> > > *dev)  int mdev_device_create(struct kobject *kobj, struct device
> > > *dev, uuid_le uuid)  {
> > >  	int ret;
> > > +	struct mdev_parent *valid_parent;
> > >  	struct mdev_device *mdev, *tmp;
> > >  	struct mdev_parent *parent;
> > >  	struct mdev_type *type = to_mdev_type(kobj);
> > > +	int srcu_idx;
> > >
> > >  	parent = mdev_get_parent(type->parent);
> > >  	if (!parent)
> > >  		return -EINVAL;
> > >
> > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > +	if (!valid_parent) {
> > > +		/* parent is undergoing unregistration */
> > > +		ret = -ENODEV;
> > > +		goto mdev_fail;
> > > +	}
> > > +
> > >  	mutex_lock(&mdev_list_lock);
> > >
> > >  	/* Check for duplicate */
> > > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj,
> > > struct device *dev, uuid_le uuid)
> > >
> > >  	mdev->parent = parent;
> > >
> > > +	device_initialize(&mdev->dev);
> > >  	mdev->dev.parent  = dev;
> > >  	mdev->dev.bus     = &mdev_bus_type;
> > >  	mdev->dev.release = mdev_device_release;
> > > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> > >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> > >
> > > -	ret = device_register(&mdev->dev);
> > > +	ret = type->parent->ops->create(kobj, mdev);
> > >  	if (ret)
> > > -		goto mdev_fail;
> > > +		goto create_fail;
> > >
> > > -	ret = mdev_device_create_ops(kobj, mdev);
> > > +	ret = device_add(&mdev->dev);  
> > 
> > Separating device_initialize() and device_add() also looks like a separate
> > patch, then the srcu could be added at the end.  Thanks,
> > 
> > Alex  
> 
> I saw little more core generated that way, but I think its fine.
> Basically, create/remove callback sequencing that does the device_inititailze/add etc in one patch and 
> User side race handling using srcu in another patch.
> Sounds good?

Splitting device_register into device_intialize/device_add solves the
first issue alone, that can be one patch.  Creating the common remove
function seems like a logical next patch.  The third patch could be
using the driver-core group attribute via those paths.  Another patch
could then incorporate the srcu code to gate the create/remove around
parent removal.  This basically matches your steps to address these
issues, it seems very split-able.  Thanks,

Alex
Parav Pandit March 26, 2019, 1:43 a.m. UTC | #6
> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 7:06 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Mon, 25 Mar 2019 23:34:28 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, March 25, 2019 6:19 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com
> > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove
> > > sequence
> > >
> > > On Fri, 22 Mar 2019 18:20:35 -0500
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > There are five problems with current code structure.
> > > > 1. mdev device is placed on the mdev bus before it is created in
> > > > the vendor driver. Once a device is placed on the mdev bus without
> > > > creating its supporting underlying vendor device, an open() can
> > > > get triggered by userspace on partially initialized device.
> > > > Below ladder diagram highlight it.
> > > >
> > > >       cpu-0                                       cpu-1
> > > >       -----                                       -----
> > > >    create_store()
> > > >      mdev_create_device()
> > > >        device_register()
> > > >           ...
> > > >          vfio_mdev_probe()
> > > >          ...creates char device
> > > >                                         vfio_mdev_open()
> > > >                                           parent->ops->open(mdev)
> > > >                                             vfio_ap_mdev_open()
> > > >                                               matrix_mdev = NULL
> > > >         [...]
> > > >         parent->ops->create()
> > > >           vfio_ap_mdev_create()
> > > >             mdev_set_drvdata(mdev, matrix_mdev);
> > > >             /* Valid pointer set above */
> > > >
> > > > 2. Current creation sequence is,
> > > >    parent->ops_create()
> > > >    groups_register()
> > > >
> > > > Remove sequence is,
> > > >    parent->ops->remove()
> > > >    groups_unregister()
> > > > However, remove sequence should be exact mirror of creation
> sequence.
> > > > Once this is achieved, all users of the mdev will be terminated
> > > > first before removing underlying vendor device.
> > > > (Follow standard linux driver model).
> > > > At that point vendor's remove() ops shouldn't failed because
> > > > device is taken off the bus that should terminate the users.
> > > >
> > > > 3. Additionally any new mdev driver that wants to work on mdev
> > > > device during probe() routine registered using
> > > > mdev_register_driver() needs to get stable mdev structure.
> > > >
> > > > 4. In following sequence, child devices created while removing
> > > > mdev parent device can be left out, or it may lead to race of
> > > > removing half initialized child mdev devices.
> > > >
> > > > issue-1:
> > > > --------
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > >                                   mdev_unregister_device()
> > > >                                      device_for_each_child()
> > > >                                         mdev_device_remove_cb()
> > > >                                             mdev_device_remove()
> > > > create_store()
> > > >   mdev_device_create()                   [...]
> > > >        device_register()
> > > >                                   parent_remove_sysfs_files()
> > > >                                   /* BUG: device added by cpu-0
> > > >                                    * whose parent is getting removed.
> > > >                                    */
> > > >
> > > > issue-2:
> > > > --------
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > > create_store()
> > > >   mdev_device_create()                   [...]
> > > >        device_register()
> > > >
> > > >        [...]                      mdev_unregister_device()
> > > >                                      device_for_each_child()
> > > >                                         mdev_device_remove_cb()
> > > >                                             mdev_device_remove()
> > > >
> > > >        mdev_create_sysfs_files()
> > > >        /* BUG: create is adding
> > > >         * sysfs files for a device
> > > >         * which is undergoing removal.
> > > >         */
> > > >                                  parent_remove_sysfs_files()
> > >
> > > In both cases above, it looks like the device will hold a reference
> > > to the parent, so while there is a race, the parent object isn't released.
> > Yes, parent object is not released but parent fields are not stable.
> >
> > >
> > > >
> > > > 5. Below crash is observed when user initiated remove is in
> > > > progress and mdev_unregister_driver() completes parent
> unregistration.
> > > >
> > > >        cpu-0                         cpu-1
> > > >        -----                         -----
> > > > remove_store()
> > > >    mdev_device_remove()
> > > >    active = false;
> > > >                                   mdev_unregister_device()
> > > >                                     remove type
> > > >    [...]
> > > >    mdev_remove_ops() crashes.
> > > >
> > > > This is similar race like create() racing with mdev_unregister_device().
> > >
> > > Not sure I catch this, the device should have a reference to the
> > > parent, and we don't specifically clear parent->ops, so what's
> > > getting removed that causes this oops?  Is .remove pointing at bad text
> regardless?
> > >
> > I guess the mdev_attr_groups being stale now.
> >
> > > > mtty mtty: MDEV: Registered
> > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group
> > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id
> > > > = 57 mdev_device_remove sleep started mtty mtty: MDEV:
> > > > Unregistering
> > > > mtty_dev: Unloaded!
> > > > BUG: unable to handle kernel paging request at ffffffffc027d668
> > > > PGD
> > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > Oops: 0000 [#1] SMP PTI
> > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > >  remove_store+0x77/0xa0 [mdev]
> > > >  kernfs_fop_write+0x113/0x1a0
> > > >  __vfs_write+0x33/0x1b0
> > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > >  vfs_write+0xad/0x1b0
> > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > >  ksys_write+0x55/0xc0
> > > >  do_syscall_64+0x5a/0x210
> > > >
> > > > Therefore, mdev core is improved in following ways to overcome
> > > > above issues.
> > > >
> > > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > > creation which supports the mdev creation.
> > > > This ensures that mdev specific all necessary fields are
> > > > initialized before a given mdev can be accessed by bus driver.
> > > >
> > > > 2. During remove flow, first remove the device from the bus. This
> > > > ensures that any bus specific devices and data is cleared.
> > > > Once device is taken of the mdev bus, perform remove() of mdev
> > > > from the vendor driver.
> > > >
> > > > 3. Linux core device model provides way to register and auto
> > > > unregister the device sysfs attribute groups at dev->groups.
> > > > Make use of this groups to let core create the groups and simplify
> > > > code to avoid explicit groups creation and removal.
> > > >
> > > > 4. Wait for any ongoing mdev create() and remove() to finish
> > > > before unregistering parent device using srcu. This continues to
> > > > allow multiple create and remove to progress in parallel. At the
> > > > same time guard parent removal while parent is being access by
> > > > create() and remove
> > > callbacks.
> > >
> > > So there should be 4-5 separate patches here?  Wishful thinking?
> > >
> > create, remove racing with unregister is handled using srcu.
> > Change-3 cannot be done without fixing the sequence so it should be in
> patch that fixes it.
> > Change described changes 1-2-3 are just one change. It is just the patch
> description to bring clarity.
> > Change-4 can be possibly done as split to different patch.
> >
> > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > ---
> > > >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++---------
> -----
> > > ----
> > > >  drivers/vfio/mdev/mdev_private.h |   7 +-
> > > >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> > > >  3 files changed, 84 insertions(+), 71 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > > >  						  ref);
> > > >  	struct device *dev = parent->dev;
> > > >
> > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > >  	kfree(parent);
> > > >  	put_device(dev);
> > > >  }
> > > > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct
> > > mdev_parent *parent)
> > > >  		kref_put(&parent->ref, mdev_release_parent);  }
> > > >
> > > > -static int mdev_device_create_ops(struct kobject *kobj,
> > > > -				  struct mdev_device *mdev)
> > > > +static int mdev_device_must_remove(struct mdev_device *mdev)
> > >
> > > Naming is off here, mdev_device_remove_common()?
> > >
> > Yes, sounds better.
> >
> > > >  {
> > > > -	struct mdev_parent *parent = mdev->parent;
> > > > +	struct mdev_parent *parent;
> > > > +	struct mdev_type *type;
> > > >  	int ret;
> > > >
> > > > -	ret = parent->ops->create(kobj, mdev);
> > > > -	if (ret)
> > > > -		return ret;
> > > > +	type = to_mdev_type(mdev->type_kobj);
> > > >
> > > > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > > > -				  parent->ops->mdev_attr_groups);
> > > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > > +	device_del(&mdev->dev);
> > > > +	parent = mdev->parent;
> > > > +	ret = parent->ops->remove(mdev);
> > > >  	if (ret)
> > > > -		parent->ops->remove(mdev);
> > > > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
> > >
> > > Let the caller decide whether to be verbose with the error, parent
> > > removal might want to warn, sysfs remove might just return an error.
> > >
> > I didn't follow. Caller meaning mdev_device_remove_common() or vendor
> driver?
> 
> I mean the callback iterator on the parent remove can do a WARN_ON if this
> returns an error while the device remove path can silently return -EBUSY, the
> common function doesn't need to decide whether the parent ops remove
> function deserves a dev_err.
> 
Ok. I understood. 
But device remove returning silent -EBUSY looks an error that should get logged in, because this is something not expected.
Its probably late for sysfs layer to return report an error by that time it prints device name, because put_device() is done.
So if remove() returns an error, I think its legitimate failure to do WARN_ON or dev_err().

> > > >
> > > > +	/* Balances with device_initialize() */
> > > > +	put_device(&mdev->dev);
> > > >  	return ret;
> > > >  }
> > > >
> > > > -/*
> > > > - * mdev_device_remove_ops gets called from sysfs's 'remove' and
> > > > when parent
> > > > - * device is being unregistered from mdev device framework.
> > > > - * - 'force_remove' is set to 'false' when called from sysfs's 'remove'
> which
> > > > - *   indicates that if the mdev device is active, used by VMM or
> userspace
> > > > - *   application, vendor driver could return error then don't remove
> the
> > > device.
> > > > - * - 'force_remove' is set to 'true' when called from
> > > mdev_unregister_device()
> > > > - *   which indicate that parent device is being removed from mdev
> device
> > > > - *   framework so remove mdev device forcefully.
> > > > - */
> > > > -static int mdev_device_remove_ops(struct mdev_device *mdev, bool
> > > > force_remove) -{
> > > > -	struct mdev_parent *parent = mdev->parent;
> > > > -	int ret;
> > > > -
> > > > -	/*
> > > > -	 * Vendor driver can return error if VMM or userspace application is
> > > > -	 * using this mdev device.
> > > > -	 */
> > > > -	ret = parent->ops->remove(mdev);
> > > > -	if (ret && !force_remove)
> > > > -		return ret;
> > > > -
> > > > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> > > >mdev_attr_groups);
> > > > -	return 0;
> > > > -}
> > >
> > > Seems like there's easily a separate patch in pushing the
> > > create/remove ops into the calling function and separating for the
> > > iterator callback, that would make this easier to review.
> > >
> > > > -
> > > >  static int mdev_device_remove_cb(struct device *dev, void *data)  {
> > > >  	if (dev_is_mdev(dev))
> > > > -		mdev_device_remove(dev, true);
> > > > -
> > > > +		mdev_device_must_remove(to_mdev_device(dev));
> > > >  	return 0;
> > > >  }
> > > >
> > > > @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev,
> > > > const
> > > struct mdev_parent_ops *ops)
> > > >  	}
> > > >
> > > >  	kref_init(&parent->ref);
> > > > +	init_srcu_struct(&parent->unreg_srcu);
> > > >
> > > >  	parent->dev = dev;
> > > >  	parent->ops = ops;
> > > > @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev,
> > > > const
> > > struct mdev_parent_ops *ops)
> > > >  	if (ret)
> > > >  		dev_warn(dev, "Failed to create compatibility class link\n");
> > > >
> > > > +	rcu_assign_pointer(parent->self, parent);
> > > >  	list_add(&parent->next, &parent_list);
> > > >  	mutex_unlock(&parent_list_lock);
> > > >
> > > > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device
> > > > *dev)
> > > >
> > > >  	mutex_lock(&parent_list_lock);
> > > >  	parent = __find_parent_device(dev);
> > > > -
> > > >  	if (!parent) {
> > > >  		mutex_unlock(&parent_list_lock);
> > > >  		return;
> > > >  	}
> > > > +	list_del(&parent->next);
> > > > +	mutex_unlock(&parent_list_lock);
> > > > +
> > > >  	dev_info(dev, "MDEV: Unregistering\n");
> > > >
> > > > -	list_del(&parent->next);
> > > > +	/* Publish that this mdev parent is unregistering. So any new
> > > > +	 * create/remove cannot start on this parent anymore by user.
> > > > +	 */
> > >
> > > Comment style, we're not in netdev.
> > Yep. Will fix it.
> > >
> > > > +	rcu_assign_pointer(parent->self, NULL);
> > > > +
> > > > +	/*
> > > > +	 * Wait for any active create() or remove() mdev ops on the parent
> > > > +	 * to complete.
> > > > +	 */
> > > > +	synchronize_srcu(&parent->unreg_srcu);
> > > > +
> > > > +	/* At this point it is confirmed that any pending user initiated
> > > > +	 * create or remove callbacks accessing the parent are completed.
> > > > +	 * It is safe to remove the parent now.
> > > > +	 */
> > > >  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> > > >
> > > >  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
> > > >
> > > >  	parent_remove_sysfs_files(parent);
> > > >
> > > > -	mutex_unlock(&parent_list_lock);
> > > >  	mdev_put_parent(parent);
> > > >  }
> > > >  EXPORT_SYMBOL(mdev_unregister_device);
> > > > @@ -278,14 +270,24 @@ static void mdev_device_release(struct
> > > > device
> > > > *dev)  int mdev_device_create(struct kobject *kobj, struct device
> > > > *dev, uuid_le uuid)  {
> > > >  	int ret;
> > > > +	struct mdev_parent *valid_parent;
> > > >  	struct mdev_device *mdev, *tmp;
> > > >  	struct mdev_parent *parent;
> > > >  	struct mdev_type *type = to_mdev_type(kobj);
> > > > +	int srcu_idx;
> > > >
> > > >  	parent = mdev_get_parent(type->parent);
> > > >  	if (!parent)
> > > >  		return -EINVAL;
> > > >
> > > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > > +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> > > > +	if (!valid_parent) {
> > > > +		/* parent is undergoing unregistration */
> > > > +		ret = -ENODEV;
> > > > +		goto mdev_fail;
> > > > +	}
> > > > +
> > > >  	mutex_lock(&mdev_list_lock);
> > > >
> > > >  	/* Check for duplicate */
> > > > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj,
> > > > struct device *dev, uuid_le uuid)
> > > >
> > > >  	mdev->parent = parent;
> > > >
> > > > +	device_initialize(&mdev->dev);
> > > >  	mdev->dev.parent  = dev;
> > > >  	mdev->dev.bus     = &mdev_bus_type;
> > > >  	mdev->dev.release = mdev_device_release;
> > > > +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
> > > >  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> > > >
> > > > -	ret = device_register(&mdev->dev);
> > > > +	ret = type->parent->ops->create(kobj, mdev);
> > > >  	if (ret)
> > > > -		goto mdev_fail;
> > > > +		goto create_fail;
> > > >
> > > > -	ret = mdev_device_create_ops(kobj, mdev);
> > > > +	ret = device_add(&mdev->dev);
> > >
> > > Separating device_initialize() and device_add() also looks like a
> > > separate patch, then the srcu could be added at the end.  Thanks,
> > >
> > > Alex
> >
> > I saw little more core generated that way, but I think its fine.
> > Basically, create/remove callback sequencing that does the
> > device_inititailze/add etc in one patch and User side race handling using
> srcu in another patch.
> > Sounds good?
> 
> Splitting device_register into device_intialize/device_add solves the first
> issue alone, that can be one patch.  
Yes, once this is done, mdev_device_create_ops() is just a one line wrapper to groups creation.
Hence I was considering to do in same patch, but its fine as a separate clean up patch.
More split details below.

> Creating the common remove function
> seems like a logical next patch.  The third patch could be using the driver-
> core group attribute via those paths.  Another patch could then incorporate
> the srcu code to gate the create/remove around parent removal.  This
> basically matches your steps to address these issues, it seems very split-able.
> Thanks,
> 
So I reworked to split this one patch to following smaller refactor and fixes.
1. use of device_inititalize/add/remove helpers without fixing the sequence as prep patch
2. fix the create/remove sequence
3. factor out groups creation
4. remove helper function
5. srcu fix

> Alex
Alex Williamson March 26, 2019, 2:16 a.m. UTC | #7
On Tue, 26 Mar 2019 01:43:44 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, March 25, 2019 7:06 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com
> > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> > 
> > On Mon, 25 Mar 2019 23:34:28 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Monday, March 25, 2019 6:19 PM
> > > > To: Parav Pandit <parav@mellanox.com>
> > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > kwankhede@nvidia.com
> > > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove
> > > > sequence
> > > >
> > > > On Fri, 22 Mar 2019 18:20:35 -0500
> > > > Parav Pandit <parav@mellanox.com> wrote:
> > > >  
> > > > > There are five problems with current code structure.
> > > > > 1. mdev device is placed on the mdev bus before it is created in
> > > > > the vendor driver. Once a device is placed on the mdev bus without
> > > > > creating its supporting underlying vendor device, an open() can
> > > > > get triggered by userspace on partially initialized device.
> > > > > Below ladder diagram highlight it.
> > > > >
> > > > >       cpu-0                                       cpu-1
> > > > >       -----                                       -----
> > > > >    create_store()
> > > > >      mdev_create_device()
> > > > >        device_register()
> > > > >           ...
> > > > >          vfio_mdev_probe()
> > > > >          ...creates char device
> > > > >                                         vfio_mdev_open()
> > > > >                                           parent->ops->open(mdev)
> > > > >                                             vfio_ap_mdev_open()
> > > > >                                               matrix_mdev = NULL
> > > > >         [...]
> > > > >         parent->ops->create()
> > > > >           vfio_ap_mdev_create()
> > > > >             mdev_set_drvdata(mdev, matrix_mdev);
> > > > >             /* Valid pointer set above */
> > > > >
> > > > > 2. Current creation sequence is,
> > > > >    parent->ops_create()
> > > > >    groups_register()
> > > > >
> > > > > Remove sequence is,
> > > > >    parent->ops->remove()
> > > > >    groups_unregister()
> > > > > However, remove sequence should be exact mirror of creation  
> > sequence.  
> > > > > Once this is achieved, all users of the mdev will be terminated
> > > > > first before removing underlying vendor device.
> > > > > (Follow standard linux driver model).
> > > > > At that point vendor's remove() ops shouldn't failed because
> > > > > device is taken off the bus that should terminate the users.
> > > > >
> > > > > 3. Additionally any new mdev driver that wants to work on mdev
> > > > > device during probe() routine registered using
> > > > > mdev_register_driver() needs to get stable mdev structure.
> > > > >
> > > > > 4. In following sequence, child devices created while removing
> > > > > mdev parent device can be left out, or it may lead to race of
> > > > > removing half initialized child mdev devices.
> > > > >
> > > > > issue-1:
> > > > > --------
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > >                                   mdev_unregister_device()
> > > > >                                      device_for_each_child()
> > > > >                                         mdev_device_remove_cb()
> > > > >                                             mdev_device_remove()
> > > > > create_store()
> > > > >   mdev_device_create()                   [...]
> > > > >        device_register()
> > > > >                                   parent_remove_sysfs_files()
> > > > >                                   /* BUG: device added by cpu-0
> > > > >                                    * whose parent is getting removed.
> > > > >                                    */
> > > > >
> > > > > issue-2:
> > > > > --------
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > > create_store()
> > > > >   mdev_device_create()                   [...]
> > > > >        device_register()
> > > > >
> > > > >        [...]                      mdev_unregister_device()
> > > > >                                      device_for_each_child()
> > > > >                                         mdev_device_remove_cb()
> > > > >                                             mdev_device_remove()
> > > > >
> > > > >        mdev_create_sysfs_files()
> > > > >        /* BUG: create is adding
> > > > >         * sysfs files for a device
> > > > >         * which is undergoing removal.
> > > > >         */
> > > > >                                  parent_remove_sysfs_files()  
> > > >
> > > > In both cases above, it looks like the device will hold a reference
> > > > to the parent, so while there is a race, the parent object isn't released.  
> > > Yes, parent object is not released but parent fields are not stable.
> > >  
> > > >  
> > > > >
> > > > > 5. Below crash is observed when user initiated remove is in
> > > > > progress and mdev_unregister_driver() completes parent  
> > unregistration.  
> > > > >
> > > > >        cpu-0                         cpu-1
> > > > >        -----                         -----
> > > > > remove_store()
> > > > >    mdev_device_remove()
> > > > >    active = false;
> > > > >                                   mdev_unregister_device()
> > > > >                                     remove type
> > > > >    [...]
> > > > >    mdev_remove_ops() crashes.
> > > > >
> > > > > This is similar race like create() racing with mdev_unregister_device().  
> > > >
> > > > Not sure I catch this, the device should have a reference to the
> > > > parent, and we don't specifically clear parent->ops, so what's
> > > > getting removed that causes this oops?  Is .remove pointing at bad text  
> > regardless?  
> > > >  
> > > I guess the mdev_attr_groups being stale now.
> > >  
> > > > > mtty mtty: MDEV: Registered
> > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group
> > > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id
> > > > > = 57 mdev_device_remove sleep started mtty mtty: MDEV:
> > > > > Unregistering
> > > > > mtty_dev: Unloaded!
> > > > > BUG: unable to handle kernel paging request at ffffffffc027d668
> > > > > PGD
> > > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > Oops: 0000 [#1] SMP PTI
> > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > >  remove_store+0x77/0xa0 [mdev]
> > > > >  kernfs_fop_write+0x113/0x1a0
> > > > >  __vfs_write+0x33/0x1b0
> > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > > >  vfs_write+0xad/0x1b0
> > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > >  ksys_write+0x55/0xc0
> > > > >  do_syscall_64+0x5a/0x210
> > > > >
> > > > > Therefore, mdev core is improved in following ways to overcome
> > > > > above issues.
> > > > >
> > > > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > > > creation which supports the mdev creation.
> > > > > This ensures that mdev specific all necessary fields are
> > > > > initialized before a given mdev can be accessed by bus driver.
> > > > >
> > > > > 2. During remove flow, first remove the device from the bus. This
> > > > > ensures that any bus specific devices and data is cleared.
> > > > > Once device is taken of the mdev bus, perform remove() of mdev
> > > > > from the vendor driver.
> > > > >
> > > > > 3. Linux core device model provides way to register and auto
> > > > > unregister the device sysfs attribute groups at dev->groups.
> > > > > Make use of this groups to let core create the groups and simplify
> > > > > code to avoid explicit groups creation and removal.
> > > > >
> > > > > 4. Wait for any ongoing mdev create() and remove() to finish
> > > > > before unregistering parent device using srcu. This continues to
> > > > > allow multiple create and remove to progress in parallel. At the
> > > > > same time guard parent removal while parent is being access by
> > > > > create() and remove  
> > > > callbacks.
> > > >
> > > > So there should be 4-5 separate patches here?  Wishful thinking?
> > > >  
> > > create, remove racing with unregister is handled using srcu.
> > > Change-3 cannot be done without fixing the sequence so it should be in  
> > patch that fixes it.  
> > > Change described changes 1-2-3 are just one change. It is just the patch  
> > description to bring clarity.  
> > > Change-4 can be possibly done as split to different patch.
> > >  
> > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > ---
> > > > >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++---------  
> > -----  
> > > > ----  
> > > > >  drivers/vfio/mdev/mdev_private.h |   7 +-
> > > > >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> > > > >  3 files changed, 84 insertions(+), 71 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
> > > > >  						  ref);
> > > > >  	struct device *dev = parent->dev;
> > > > >
> > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > >  	kfree(parent);
> > > > >  	put_device(dev);
> > > > >  }
> > > > > @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct  
> > > > mdev_parent *parent)  
> > > > >  		kref_put(&parent->ref, mdev_release_parent);  }
> > > > >
> > > > > -static int mdev_device_create_ops(struct kobject *kobj,
> > > > > -				  struct mdev_device *mdev)
> > > > > +static int mdev_device_must_remove(struct mdev_device *mdev)  
> > > >
> > > > Naming is off here, mdev_device_remove_common()?
> > > >  
> > > Yes, sounds better.
> > >  
> > > > >  {
> > > > > -	struct mdev_parent *parent = mdev->parent;
> > > > > +	struct mdev_parent *parent;
> > > > > +	struct mdev_type *type;
> > > > >  	int ret;
> > > > >
> > > > > -	ret = parent->ops->create(kobj, mdev);
> > > > > -	if (ret)
> > > > > -		return ret;
> > > > > +	type = to_mdev_type(mdev->type_kobj);
> > > > >
> > > > > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > > > > -				  parent->ops->mdev_attr_groups);
> > > > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > > > +	device_del(&mdev->dev);
> > > > > +	parent = mdev->parent;
> > > > > +	ret = parent->ops->remove(mdev);
> > > > >  	if (ret)
> > > > > -		parent->ops->remove(mdev);
> > > > > +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);  
> > > >
> > > > Let the caller decide whether to be verbose with the error, parent
> > > > removal might want to warn, sysfs remove might just return an error.
> > > >  
> > > I didn't follow. Caller meaning mdev_device_remove_common() or vendor  
> > driver?
> > 
> > I mean the callback iterator on the parent remove can do a WARN_ON if this
> > returns an error while the device remove path can silently return -EBUSY, the
> > common function doesn't need to decide whether the parent ops remove
> > function deserves a dev_err.
> >   
> Ok. I understood. 
> But device remove returning silent -EBUSY looks an error that should
> get logged in, because this is something not expected. Its probably
> late for sysfs layer to return report an error by that time it prints
> device name, because put_device() is done. So if remove() returns an
> error, I think its legitimate failure to do WARN_ON or dev_err().

Calling put_device() if the parent remove op fails looks like a bug
introduced by this series, the current code allows that failure leaving
the device in a coherent state and returning errno to the sysfs store
function.

> > > > >
> > > > > +	/* Balances with device_initialize() */
> > > > > +	put_device(&mdev->dev);
> > > > >  	return ret;
> > > > >  }
> > > > >
> > > > > -/*
> > > > > - * mdev_device_remove_ops gets called from sysfs's 'remove'
> > > > > and when parent
> > > > > - * device is being unregistered from mdev device framework.
> > > > > - * - 'force_remove' is set to 'false' when called from
> > > > > sysfs's 'remove'  
> > which  
> > > > > - *   indicates that if the mdev device is active, used by
> > > > > VMM or  
> > userspace  
> > > > > - *   application, vendor driver could return error then
> > > > > don't remove  
> > the  
> > > > device.  
> > > > > - * - 'force_remove' is set to 'true' when called from  
> > > > mdev_unregister_device()  
> > > > > - *   which indicate that parent device is being removed from
> > > > > mdev  
> > device  
> > > > > - *   framework so remove mdev device forcefully.
> > > > > - */
> > > > > -static int mdev_device_remove_ops(struct mdev_device *mdev,
> > > > > bool force_remove) -{
> > > > > -	struct mdev_parent *parent = mdev->parent;
> > > > > -	int ret;
> > > > > -
> > > > > -	/*
> > > > > -	 * Vendor driver can return error if VMM or
> > > > > userspace application is
> > > > > -	 * using this mdev device.
> > > > > -	 */
> > > > > -	ret = parent->ops->remove(mdev);
> > > > > -	if (ret && !force_remove)
> > > > > -		return ret;
> > > > > -
> > > > > -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops-
> > > > >mdev_attr_groups);
> > > > > -	return 0;
> > > > > -}  
> > > >
> > > > Seems like there's easily a separate patch in pushing the
> > > > create/remove ops into the calling function and separating for
> > > > the iterator callback, that would make this easier to review.
> > > >  
> > > > > -
> > > > >  static int mdev_device_remove_cb(struct device *dev, void
> > > > > *data)  { if (dev_is_mdev(dev))
> > > > > -		mdev_device_remove(dev, true);
> > > > > -
> > > > > +		mdev_device_must_remove(to_mdev_device(dev));
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > > @@ -194,6 +169,7 @@ int mdev_register_device(struct device
> > > > > *dev, const  
> > > > struct mdev_parent_ops *ops)  
> > > > >  	}
> > > > >
> > > > >  	kref_init(&parent->ref);
> > > > > +	init_srcu_struct(&parent->unreg_srcu);
> > > > >
> > > > >  	parent->dev = dev;
> > > > >  	parent->ops = ops;
> > > > > @@ -214,6 +190,7 @@ int mdev_register_device(struct device
> > > > > *dev, const  
> > > > struct mdev_parent_ops *ops)  
> > > > >  	if (ret)
> > > > >  		dev_warn(dev, "Failed to create
> > > > > compatibility class link\n");
> > > > >
> > > > > +	rcu_assign_pointer(parent->self, parent);
> > > > >  	list_add(&parent->next, &parent_list);
> > > > >  	mutex_unlock(&parent_list_lock);
> > > > >
> > > > > @@ -244,21 +221,36 @@ void mdev_unregister_device(struct
> > > > > device *dev)
> > > > >
> > > > >  	mutex_lock(&parent_list_lock);
> > > > >  	parent = __find_parent_device(dev);
> > > > > -
> > > > >  	if (!parent) {
> > > > >  		mutex_unlock(&parent_list_lock);
> > > > >  		return;
> > > > >  	}
> > > > > +	list_del(&parent->next);
> > > > > +	mutex_unlock(&parent_list_lock);
> > > > > +
> > > > >  	dev_info(dev, "MDEV: Unregistering\n");
> > > > >
> > > > > -	list_del(&parent->next);
> > > > > +	/* Publish that this mdev parent is unregistering.
> > > > > So any new
> > > > > +	 * create/remove cannot start on this parent anymore
> > > > > by user.
> > > > > +	 */  
> > > >
> > > > Comment style, we're not in netdev.  
> > > Yep. Will fix it.  
> > > >  
> > > > > +	rcu_assign_pointer(parent->self, NULL);
> > > > > +
> > > > > +	/*
> > > > > +	 * Wait for any active create() or remove() mdev ops
> > > > > on the parent
> > > > > +	 * to complete.
> > > > > +	 */
> > > > > +	synchronize_srcu(&parent->unreg_srcu);
> > > > > +
> > > > > +	/* At this point it is confirmed that any pending
> > > > > user initiated
> > > > > +	 * create or remove callbacks accessing the parent
> > > > > are completed.
> > > > > +	 * It is safe to remove the parent now.
> > > > > +	 */
> > > > >  	class_compat_remove_link(mdev_bus_compat_class, dev,
> > > > > NULL);
> > > > >
> > > > >  	device_for_each_child(dev, NULL,
> > > > > mdev_device_remove_cb);
> > > > >
> > > > >  	parent_remove_sysfs_files(parent);
> > > > >
> > > > > -	mutex_unlock(&parent_list_lock);
> > > > >  	mdev_put_parent(parent);
> > > > >  }
> > > > >  EXPORT_SYMBOL(mdev_unregister_device);
> > > > > @@ -278,14 +270,24 @@ static void mdev_device_release(struct
> > > > > device
> > > > > *dev)  int mdev_device_create(struct kobject *kobj, struct
> > > > > device *dev, uuid_le uuid)  {
> > > > >  	int ret;
> > > > > +	struct mdev_parent *valid_parent;
> > > > >  	struct mdev_device *mdev, *tmp;
> > > > >  	struct mdev_parent *parent;
> > > > >  	struct mdev_type *type = to_mdev_type(kobj);
> > > > > +	int srcu_idx;
> > > > >
> > > > >  	parent = mdev_get_parent(type->parent);
> > > > >  	if (!parent)
> > > > >  		return -EINVAL;
> > > > >
> > > > > +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> > > > > +	valid_parent = srcu_dereference(parent->self,
> > > > > &parent->unreg_srcu);
> > > > > +	if (!valid_parent) {
> > > > > +		/* parent is undergoing unregistration */
> > > > > +		ret = -ENODEV;
> > > > > +		goto mdev_fail;
> > > > > +	}
> > > > > +
> > > > >  	mutex_lock(&mdev_list_lock);
> > > > >
> > > > >  	/* Check for duplicate */
> > > > > @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject
> > > > > *kobj, struct device *dev, uuid_le uuid)
> > > > >
> > > > >  	mdev->parent = parent;
> > > > >
> > > > > +	device_initialize(&mdev->dev);
> > > > >  	mdev->dev.parent  = dev;
> > > > >  	mdev->dev.bus     = &mdev_bus_type;
> > > > >  	mdev->dev.release = mdev_device_release;
> > > > > +	mdev->dev.groups =
> > > > > type->parent->ops->mdev_attr_groups; dev_set_name(&mdev->dev,
> > > > > "%pUl", uuid.b);
> > > > >
> > > > > -	ret = device_register(&mdev->dev);
> > > > > +	ret = type->parent->ops->create(kobj, mdev);
> > > > >  	if (ret)
> > > > > -		goto mdev_fail;
> > > > > +		goto create_fail;
> > > > >
> > > > > -	ret = mdev_device_create_ops(kobj, mdev);
> > > > > +	ret = device_add(&mdev->dev);  
> > > >
> > > > Separating device_initialize() and device_add() also looks like
> > > > a separate patch, then the srcu could be added at the end.
> > > > Thanks,
> > > >
> > > > Alex  
> > >
> > > I saw little more core generated that way, but I think its fine.
> > > Basically, create/remove callback sequencing that does the
> > > device_inititailze/add etc in one patch and User side race
> > > handling using  
> > srcu in another patch.  
> > > Sounds good?  
> > 
> > Splitting device_register into device_intialize/device_add solves
> > the first issue alone, that can be one patch.    
> Yes, once this is done, mdev_device_create_ops() is just a one line
> wrapper to groups creation. Hence I was considering to do in same
> patch, but its fine as a separate clean up patch. More split details
> below.
> 
> > Creating the common remove function
> > seems like a logical next patch.  The third patch could be using
> > the driver- core group attribute via those paths.  Another patch
> > could then incorporate the srcu code to gate the create/remove
> > around parent removal.  This basically matches your steps to
> > address these issues, it seems very split-able. Thanks,
> >   
> So I reworked to split this one patch to following smaller refactor
> and fixes. 1. use of device_inititalize/add/remove helpers without
> fixing the sequence as prep patch 2. fix the create/remove sequence
> 3. factor out groups creation
> 4. remove helper function
> 5. srcu fix

Looks good, I think it will be much easier to review that way.  Thanks,

Alex
Parav Pandit March 26, 2019, 3:19 a.m. UTC | #8
> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, March 25, 2019 9:17 PM
> To: Parav Pandit <parav@mellanox.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Tue, 26 Mar 2019 01:43:44 +0000
> Parav Pandit <parav@mellanox.com> wrote:
> 
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, March 25, 2019 7:06 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com
> > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove
> > > sequence
> > >
> > > On Mon, 25 Mar 2019 23:34:28 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Monday, March 25, 2019 6:19 PM
> > > > > To: Parav Pandit <parav@mellanox.com>
> > > > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > kwankhede@nvidia.com
> > > > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove
> > > > > sequence
> > > > >
> > > > > On Fri, 22 Mar 2019 18:20:35 -0500 Parav Pandit
> > > > > <parav@mellanox.com> wrote:
> > > > >
> > > > > > There are five problems with current code structure.
> > > > > > 1. mdev device is placed on the mdev bus before it is created
> > > > > > in the vendor driver. Once a device is placed on the mdev bus
> > > > > > without creating its supporting underlying vendor device, an
> > > > > > open() can get triggered by userspace on partially initialized device.
> > > > > > Below ladder diagram highlight it.
> > > > > >
> > > > > >       cpu-0                                       cpu-1
> > > > > >       -----                                       -----
> > > > > >    create_store()
> > > > > >      mdev_create_device()
> > > > > >        device_register()
> > > > > >           ...
> > > > > >          vfio_mdev_probe()
> > > > > >          ...creates char device
> > > > > >                                         vfio_mdev_open()
> > > > > >                                           parent->ops->open(mdev)
> > > > > >                                             vfio_ap_mdev_open()
> > > > > >                                               matrix_mdev = NULL
> > > > > >         [...]
> > > > > >         parent->ops->create()
> > > > > >           vfio_ap_mdev_create()
> > > > > >             mdev_set_drvdata(mdev, matrix_mdev);
> > > > > >             /* Valid pointer set above */
> > > > > >
> > > > > > 2. Current creation sequence is,
> > > > > >    parent->ops_create()
> > > > > >    groups_register()
> > > > > >
> > > > > > Remove sequence is,
> > > > > >    parent->ops->remove()
> > > > > >    groups_unregister()
> > > > > > However, remove sequence should be exact mirror of creation
> > > sequence.
> > > > > > Once this is achieved, all users of the mdev will be
> > > > > > terminated first before removing underlying vendor device.
> > > > > > (Follow standard linux driver model).
> > > > > > At that point vendor's remove() ops shouldn't failed because
> > > > > > device is taken off the bus that should terminate the users.
> > > > > >
> > > > > > 3. Additionally any new mdev driver that wants to work on mdev
> > > > > > device during probe() routine registered using
> > > > > > mdev_register_driver() needs to get stable mdev structure.
> > > > > >
> > > > > > 4. In following sequence, child devices created while removing
> > > > > > mdev parent device can be left out, or it may lead to race of
> > > > > > removing half initialized child mdev devices.
> > > > > >
> > > > > > issue-1:
> > > > > > --------
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > >                                   mdev_unregister_device()
> > > > > >                                      device_for_each_child()
> > > > > >                                         mdev_device_remove_cb()
> > > > > >
> > > > > > mdev_device_remove()
> > > > > > create_store()
> > > > > >   mdev_device_create()                   [...]
> > > > > >        device_register()
> > > > > >                                   parent_remove_sysfs_files()
> > > > > >                                   /* BUG: device added by cpu-0
> > > > > >                                    * whose parent is getting removed.
> > > > > >                                    */
> > > > > >
> > > > > > issue-2:
> > > > > > --------
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > > create_store()
> > > > > >   mdev_device_create()                   [...]
> > > > > >        device_register()
> > > > > >
> > > > > >        [...]                      mdev_unregister_device()
> > > > > >                                      device_for_each_child()
> > > > > >                                         mdev_device_remove_cb()
> > > > > >
> > > > > > mdev_device_remove()
> > > > > >
> > > > > >        mdev_create_sysfs_files()
> > > > > >        /* BUG: create is adding
> > > > > >         * sysfs files for a device
> > > > > >         * which is undergoing removal.
> > > > > >         */
> > > > > >                                  parent_remove_sysfs_files()
> > > > >
> > > > > In both cases above, it looks like the device will hold a
> > > > > reference to the parent, so while there is a race, the parent object
> isn't released.
> > > > Yes, parent object is not released but parent fields are not stable.
> > > >
> > > > >
> > > > > >
> > > > > > 5. Below crash is observed when user initiated remove is in
> > > > > > progress and mdev_unregister_driver() completes parent
> > > unregistration.
> > > > > >
> > > > > >        cpu-0                         cpu-1
> > > > > >        -----                         -----
> > > > > > remove_store()
> > > > > >    mdev_device_remove()
> > > > > >    active = false;
> > > > > >                                   mdev_unregister_device()
> > > > > >                                     remove type
> > > > > >    [...]
> > > > > >    mdev_remove_ops() crashes.
> > > > > >
> > > > > > This is similar race like create() racing with
> mdev_unregister_device().
> > > > >
> > > > > Not sure I catch this, the device should have a reference to the
> > > > > parent, and we don't specifically clear parent->ops, so what's
> > > > > getting removed that causes this oops?  Is .remove pointing at
> > > > > bad text
> > > regardless?
> > > > >
> > > > I guess the mdev_attr_groups being stale now.
> > > >
> > > > > > mtty mtty: MDEV: Registered
> > > > > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to
> > > > > > group
> > > > > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV:
> > > > > > group_id = 57 mdev_device_remove sleep started mtty mtty: MDEV:
> > > > > > Unregistering
> > > > > > mtty_dev: Unloaded!
> > > > > > BUG: unable to handle kernel paging request at
> > > > > > ffffffffc027d668 PGD
> > > > > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > > > > Oops: 0000 [#1] SMP PTI
> > > > > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > > > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > > > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > > > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > > > > >  mdev_device_remove+0xef/0x130 [mdev]
> > > > > >  remove_store+0x77/0xa0 [mdev]
> > > > > >  kernfs_fop_write+0x113/0x1a0
> > > > > >  __vfs_write+0x33/0x1b0
> > > > > >  ? rcu_read_lock_sched_held+0x64/0x70
> > > > > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ?
> > > > > > __sb_start_write+0x121/0x1b0  ? vfs_write+0x17c/0x1b0
> > > > > >  vfs_write+0xad/0x1b0
> > > > > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > > > > >  ksys_write+0x55/0xc0
> > > > > >  do_syscall_64+0x5a/0x210
> > > > > >
> > > > > > Therefore, mdev core is improved in following ways to overcome
> > > > > > above issues.
> > > > > >
> > > > > > 1. Before placing mdev devices on the bus, perform vendor
> > > > > > drivers creation which supports the mdev creation.
> > > > > > This ensures that mdev specific all necessary fields are
> > > > > > initialized before a given mdev can be accessed by bus driver.
> > > > > >
> > > > > > 2. During remove flow, first remove the device from the bus.
> > > > > > This ensures that any bus specific devices and data is cleared.
> > > > > > Once device is taken of the mdev bus, perform remove() of mdev
> > > > > > from the vendor driver.
> > > > > >
> > > > > > 3. Linux core device model provides way to register and auto
> > > > > > unregister the device sysfs attribute groups at dev->groups.
> > > > > > Make use of this groups to let core create the groups and
> > > > > > simplify code to avoid explicit groups creation and removal.
> > > > > >
> > > > > > 4. Wait for any ongoing mdev create() and remove() to finish
> > > > > > before unregistering parent device using srcu. This continues
> > > > > > to allow multiple create and remove to progress in parallel.
> > > > > > At the same time guard parent removal while parent is being
> > > > > > access by
> > > > > > create() and remove
> > > > > callbacks.
> > > > >
> > > > > So there should be 4-5 separate patches here?  Wishful thinking?
> > > > >
> > > > create, remove racing with unregister is handled using srcu.
> > > > Change-3 cannot be done without fixing the sequence so it should
> > > > be in
> > > patch that fixes it.
> > > > Change described changes 1-2-3 are just one change. It is just the
> > > > patch
> > > description to bring clarity.
> > > > Change-4 can be possibly done as split to different patch.
> > > >
> > > > > > Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> > > > > > Signed-off-by: Parav Pandit <parav@mellanox.com>
> > > > > > ---
> > > > > >  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++----
> -----
> > > -----
> > > > > ----
> > > > > >  drivers/vfio/mdev/mdev_private.h |   7 +-
> > > > > >  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
> > > > > >  3 files changed, 84 insertions(+), 71 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vfio/mdev/mdev_core.c
> > > > > > b/drivers/vfio/mdev/mdev_core.c index 944a058..8fe0ed1 100644
> > > > > > --- a/drivers/vfio/mdev/mdev_core.c
> > > > > > +++ b/drivers/vfio/mdev/mdev_core.c
> > > > > > @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref
> *kref)
> > > > > >  						  ref);
> > > > > >  	struct device *dev = parent->dev;
> > > > > >
> > > > > > +	cleanup_srcu_struct(&parent->unreg_srcu);
> > > > > >  	kfree(parent);
> > > > > >  	put_device(dev);
> > > > > >  }
> > > > > > @@ -103,56 +104,30 @@ static inline void
> > > > > > mdev_put_parent(struct
> > > > > mdev_parent *parent)
> > > > > >  		kref_put(&parent->ref, mdev_release_parent);  }
> > > > > >
> > > > > > -static int mdev_device_create_ops(struct kobject *kobj,
> > > > > > -				  struct mdev_device *mdev)
> > > > > > +static int mdev_device_must_remove(struct mdev_device *mdev)
> > > > >
> > > > > Naming is off here, mdev_device_remove_common()?
> > > > >
> > > > Yes, sounds better.
> > > >
> > > > > >  {
> > > > > > -	struct mdev_parent *parent = mdev->parent;
> > > > > > +	struct mdev_parent *parent;
> > > > > > +	struct mdev_type *type;
> > > > > >  	int ret;
> > > > > >
> > > > > > -	ret = parent->ops->create(kobj, mdev);
> > > > > > -	if (ret)
> > > > > > -		return ret;
> > > > > > +	type = to_mdev_type(mdev->type_kobj);
> > > > > >
> > > > > > -	ret = sysfs_create_groups(&mdev->dev.kobj,
> > > > > > -				  parent->ops->mdev_attr_groups);
> > > > > > +	mdev_remove_sysfs_files(&mdev->dev, type);
> > > > > > +	device_del(&mdev->dev);
> > > > > > +	parent = mdev->parent;
> > > > > > +	ret = parent->ops->remove(mdev);
> > > > > >  	if (ret)
> > > > > > -		parent->ops->remove(mdev);
> > > > > > +		dev_err(&mdev->dev, "Remove failed: err=%d\n",
> ret);
> > > > >
> > > > > Let the caller decide whether to be verbose with the error,
> > > > > parent removal might want to warn, sysfs remove might just return
> an error.
> > > > >
> > > > I didn't follow. Caller meaning mdev_device_remove_common() or
> > > > vendor
> > > driver?
> > >
> > > I mean the callback iterator on the parent remove can do a WARN_ON
> > > if this returns an error while the device remove path can silently
> > > return -EBUSY, the common function doesn't need to decide whether
> > > the parent ops remove function deserves a dev_err.
> > >
> > Ok. I understood.
> > But device remove returning silent -EBUSY looks an error that should
> > get logged in, because this is something not expected. Its probably
> > late for sysfs layer to return report an error by that time it prints
> > device name, because put_device() is done. So if remove() returns an
> > error, I think its legitimate failure to do WARN_ON or dev_err().
> 
> Calling put_device() if the parent remove op fails looks like a bug introduced
> by this series, the current code allows that failure leaving the device in a
> coherent state and returning errno to the sysfs store function.
> 
Why should it fail?
We are taking off the device bus first as describe in commit log.
This ensures that everything is closed before calling the remove().
We cannot avoid put_device() and put_parent, it all buggy path...
Parav Pandit March 26, 2019, 5:53 a.m. UTC | #9
> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org <linux-kernel-
> owner@vger.kernel.org> On Behalf Of Parav Pandit
> Sent: Monday, March 25, 2019 10:19 PM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> kwankhede@nvidia.com
> Subject: RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> 
> 
> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, March 25, 2019 9:17 PM
> > To: Parav Pandit <parav@mellanox.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com
> > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> >
> > On Tue, 26 Mar 2019 01:43:44 +0000
> > Parav Pandit <parav@mellanox.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Alex Williamson <alex.williamson@redhat.com>

> > > > I mean the callback iterator on the parent remove can do a WARN_ON
> > > > if this returns an error while the device remove path can silently
> > > > return -EBUSY, the common function doesn't need to decide whether
> > > > the parent ops remove function deserves a dev_err.
> > > >
> > > Ok. I understood.
> > > But device remove returning silent -EBUSY looks an error that should
> > > get logged in, because this is something not expected. Its probably
> > > late for sysfs layer to return report an error by that time it
> > > prints device name, because put_device() is done. So if remove()
> > > returns an error, I think its legitimate failure to do WARN_ON or
> dev_err().
> >
> > Calling put_device() if the parent remove op fails looks like a bug
> > introduced by this series, the current code allows that failure
> > leaving the device in a coherent state and returning errno to the sysfs
> store function.
> >
> Why should it fail?
> We are taking off the device bus first as describe in commit log.
> This ensures that everything is closed before calling the remove().
> We cannot avoid put_device() and put_parent, it all buggy path...

I audited remove() callbacks of kvmgt.c, vfio_ccw_ops.c, vfio_ap_ops.c, mbochs.c, mdpy.c, mtty.c, who makes the remove possible once the device release is executed.
This should complete once the device is taken off the bus.
This was not the case before this sequence where remove() is done while device is open...hence the check was needed in past.
dev_err() is to help catch any errors/bugs in this area.

I doubt we need to retry remove() like vfio_del_group_dev(), in mdev_core if release() is not yet complete.
Kirti Wankhede March 26, 2019, 7:06 a.m. UTC | #10
On 3/23/2019 4:50 AM, Parav Pandit wrote:
> There are five problems with current code structure.
> 1. mdev device is placed on the mdev bus before it is created in the
> vendor driver. Once a device is placed on the mdev bus without creating
> its supporting underlying vendor device, an open() can get triggered by
> userspace on partially initialized device.
> Below ladder diagram highlight it.
> 
>       cpu-0                                       cpu-1
>       -----                                       -----
>    create_store()
>      mdev_create_device()
>        device_register()
>           ...
>          vfio_mdev_probe()
>          ...creates char device
>                                         vfio_mdev_open()
>                                           parent->ops->open(mdev)
>                                             vfio_ap_mdev_open()
>                                               matrix_mdev = NULL
>         [...]
>         parent->ops->create()
>           vfio_ap_mdev_create()
>             mdev_set_drvdata(mdev, matrix_mdev);
>             /* Valid pointer set above */
> 

VFIO interface uses sysfs path of device or PCI device's BDF where it
checks sysfs file for that device exist.
In case of VFIO mdev device, above situation will never happen as open
will only get called if sysfs entry for that device exist.

If you don't use VFIO interface then this situation can arise. In that
case probe() can be used for very basic initialization then create
actual char device from create().


> 2. Current creation sequence is,
>    parent->ops_create()
>    groups_register()
> 
> Remove sequence is,
>    parent->ops->remove()
>    groups_unregister()
> However, remove sequence should be exact mirror of creation sequence.
> Once this is achieved, all users of the mdev will be terminated first
> before removing underlying vendor device.
> (Follow standard linux driver model).
> At that point vendor's remove() ops shouldn't failed because device is
> taken off the bus that should terminate the users.
> 

If VMM or user space application is using mdev device,
parent->ops->remove() can return failure. In that case sysfs files
shouldn't be removed. Hence above sequence is followed for remove.

Standard linux driver model doesn't allow remove() to fail, but in
of mdev framework, interface is defined to handle such error case.


> 3. Additionally any new mdev driver that wants to work on mdev device
> during probe() routine registered using mdev_register_driver() needs to
> get stable mdev structure.
> 

Things that you are trying to handle with mdev structure from probe(),
couldn't that be moved to create()?


> 4. In following sequence, child devices created while removing mdev parent
> device can be left out, or it may lead to race of removing half
> initialized child mdev devices.
> 
> issue-1:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
>                                   mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
>                                   parent_remove_sysfs_files()
>                                   /* BUG: device added by cpu-0
>                                    * whose parent is getting removed.
>                                    */
> 
> issue-2:
> --------
>        cpu-0                         cpu-1
>        -----                         -----
> create_store()
>   mdev_device_create()                   [...]
>        device_register()
> 
>        [...]                      mdev_unregister_device()
>                                      device_for_each_child()
>                                         mdev_device_remove_cb()
>                                             mdev_device_remove()
> 
>        mdev_create_sysfs_files()
>        /* BUG: create is adding
>         * sysfs files for a device
>         * which is undergoing removal.
>         */
>                                  parent_remove_sysfs_files()
> 
> 5. Below crash is observed when user initiated remove is in progress
> and mdev_unregister_driver() completes parent unregistration.
> 
>        cpu-0                         cpu-1
>        -----                         -----
> remove_store()
>    mdev_device_remove()
>    active = false;
>                                   mdev_unregister_device()
>                                     remove type
>    [...]
>    mdev_remove_ops() crashes.
> 
> This is similar race like create() racing with mdev_unregister_device().
> 
> mtty mtty: MDEV: Registered
> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> mdev_device_remove sleep started
> mtty mtty: MDEV: Unregistering
> mtty_dev: Unloaded!
> BUG: unable to handle kernel paging request at ffffffffc027d668
> PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> Oops: 0000 [#1] SMP PTI
> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> Call Trace:
>  mdev_device_remove+0xef/0x130 [mdev]
>  remove_store+0x77/0xa0 [mdev]
>  kernfs_fop_write+0x113/0x1a0
>  __vfs_write+0x33/0x1b0
>  ? rcu_read_lock_sched_held+0x64/0x70
>  ? rcu_sync_lockdep_assert+0x2a/0x50
>  ? __sb_start_write+0x121/0x1b0
>  ? vfs_write+0x17c/0x1b0
>  vfs_write+0xad/0x1b0
>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>  ksys_write+0x55/0xc0
>  do_syscall_64+0x5a/0x210
> 
> Therefore, mdev core is improved in following ways to overcome above
> issues.
> 
> 1. Before placing mdev devices on the bus, perform vendor drivers
> creation which supports the mdev creation.
> This ensures that mdev specific all necessary fields are initialized
> before a given mdev can be accessed by bus driver.
> 
> 2. During remove flow, first remove the device from the bus. This
> ensures that any bus specific devices and data is cleared.
> Once device is taken of the mdev bus, perform remove() of mdev from the
> vendor driver.
>

If user space application is using the device and someone underneath
remove the device from bus, how would use space application know that
device is being removed?
If DMA is setup, user space application is accessing that memory and
device is removed from bus - how will you restrict to not to remove that
device? If remove() is not restricted then host might crash.
I know Linux kernel device core model doesn't allow remove() to fail,
but we had tackled that problem for mdev devices in this framework. I
prefer not to change this behavior. This will regress existing working
drivers.


> 3. Linux core device model provides way to register and auto unregister
> the device sysfs attribute groups at dev->groups.
> Make use of this groups to let core create the groups and simplify code
> to avoid explicit groups creation and removal.
> 
> 4. Wait for any ongoing mdev create() and remove() to finish before
> unregistering parent device using srcu. This continues to allow multiple
> create and remove to progress in parallel. At the same time guard parent
> removal while parent is being access by create() and remove callbacks.
> 

Agreed with this.
Alex already mentioned, it would be better to have separate patch for
this fix.

Thanks,
Kirti

> Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> ---
>  drivers/vfio/mdev/mdev_core.c    | 142 +++++++++++++++++++++------------------
>  drivers/vfio/mdev/mdev_private.h |   7 +-
>  drivers/vfio/mdev/mdev_sysfs.c   |   6 +-
>  3 files changed, 84 insertions(+), 71 deletions(-)
> 
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> index 944a058..8fe0ed1 100644
> --- a/drivers/vfio/mdev/mdev_core.c
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -84,6 +84,7 @@ static void mdev_release_parent(struct kref *kref)
>  						  ref);
>  	struct device *dev = parent->dev;
>  
> +	cleanup_srcu_struct(&parent->unreg_srcu);
>  	kfree(parent);
>  	put_device(dev);
>  }
> @@ -103,56 +104,30 @@ static inline void mdev_put_parent(struct mdev_parent *parent)
>  		kref_put(&parent->ref, mdev_release_parent);
>  }
>  
> -static int mdev_device_create_ops(struct kobject *kobj,
> -				  struct mdev_device *mdev)
> +static int mdev_device_must_remove(struct mdev_device *mdev)
>  {
> -	struct mdev_parent *parent = mdev->parent;
> +	struct mdev_parent *parent;
> +	struct mdev_type *type;
>  	int ret;
>  
> -	ret = parent->ops->create(kobj, mdev);
> -	if (ret)
> -		return ret;
> +	type = to_mdev_type(mdev->type_kobj);
>  
> -	ret = sysfs_create_groups(&mdev->dev.kobj,
> -				  parent->ops->mdev_attr_groups);
> +	mdev_remove_sysfs_files(&mdev->dev, type);
> +	device_del(&mdev->dev);
> +	parent = mdev->parent;
> +	ret = parent->ops->remove(mdev);
>  	if (ret)
> -		parent->ops->remove(mdev);
> +		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
>  
> +	/* Balances with device_initialize() */
> +	put_device(&mdev->dev);
>  	return ret;
>  }
>  
> -/*
> - * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
> - * device is being unregistered from mdev device framework.
> - * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
> - *   indicates that if the mdev device is active, used by VMM or userspace
> - *   application, vendor driver could return error then don't remove the device.
> - * - 'force_remove' is set to 'true' when called from mdev_unregister_device()
> - *   which indicate that parent device is being removed from mdev device
> - *   framework so remove mdev device forcefully.
> - */
> -static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> -{
> -	struct mdev_parent *parent = mdev->parent;
> -	int ret;
> -
> -	/*
> -	 * Vendor driver can return error if VMM or userspace application is
> -	 * using this mdev device.
> -	 */
> -	ret = parent->ops->remove(mdev);
> -	if (ret && !force_remove)
> -		return ret;
> -
> -	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> -	return 0;
> -}
> -
>  static int mdev_device_remove_cb(struct device *dev, void *data)
>  {
>  	if (dev_is_mdev(dev))
> -		mdev_device_remove(dev, true);
> -
> +		mdev_device_must_remove(to_mdev_device(dev));
>  	return 0;
>  }
>  
> @@ -194,6 +169,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	}
>  
>  	kref_init(&parent->ref);
> +	init_srcu_struct(&parent->unreg_srcu);
>  
>  	parent->dev = dev;
>  	parent->ops = ops;
> @@ -214,6 +190,7 @@ int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
>  	if (ret)
>  		dev_warn(dev, "Failed to create compatibility class link\n");
>  
> +	rcu_assign_pointer(parent->self, parent);
>  	list_add(&parent->next, &parent_list);
>  	mutex_unlock(&parent_list_lock);
>  
> @@ -244,21 +221,36 @@ void mdev_unregister_device(struct device *dev)
>  
>  	mutex_lock(&parent_list_lock);
>  	parent = __find_parent_device(dev);
> -
>  	if (!parent) {
>  		mutex_unlock(&parent_list_lock);
>  		return;
>  	}
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +
>  	dev_info(dev, "MDEV: Unregistering\n");
>  
> -	list_del(&parent->next);
> +	/* Publish that this mdev parent is unregistering. So any new
> +	 * create/remove cannot start on this parent anymore by user.
> +	 */
> +	rcu_assign_pointer(parent->self, NULL);
> +
> +	/*
> +	 * Wait for any active create() or remove() mdev ops on the parent
> +	 * to complete.
> +	 */
> +	synchronize_srcu(&parent->unreg_srcu);
> +
> +	/* At this point it is confirmed that any pending user initiated
> +	 * create or remove callbacks accessing the parent are completed.
> +	 * It is safe to remove the parent now.
> +	 */
>  	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
>  
>  	device_for_each_child(dev, NULL, mdev_device_remove_cb);
>  
>  	parent_remove_sysfs_files(parent);
>  
> -	mutex_unlock(&parent_list_lock);
>  	mdev_put_parent(parent);
>  }
>  EXPORT_SYMBOL(mdev_unregister_device);
> @@ -278,14 +270,24 @@ static void mdev_device_release(struct device *dev)
>  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  {
>  	int ret;
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev, *tmp;
>  	struct mdev_parent *parent;
>  	struct mdev_type *type = to_mdev_type(kobj);
> +	int srcu_idx;
>  
>  	parent = mdev_get_parent(type->parent);
>  	if (!parent)
>  		return -EINVAL;
>  
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		/* parent is undergoing unregistration */
> +		ret = -ENODEV;
> +		goto mdev_fail;
> +	}
> +
>  	mutex_lock(&mdev_list_lock);
>  
>  	/* Check for duplicate */
> @@ -310,68 +312,76 @@ int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>  
>  	mdev->parent = parent;
>  
> +	device_initialize(&mdev->dev);
>  	mdev->dev.parent  = dev;
>  	mdev->dev.bus     = &mdev_bus_type;
>  	mdev->dev.release = mdev_device_release;
> +	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
>  	dev_set_name(&mdev->dev, "%pUl", uuid.b);
>  
> -	ret = device_register(&mdev->dev);
> +	ret = type->parent->ops->create(kobj, mdev);
>  	if (ret)
> -		goto mdev_fail;
> +		goto create_fail;
>  
> -	ret = mdev_device_create_ops(kobj, mdev);
> +	ret = device_add(&mdev->dev);
>  	if (ret)
> -		goto create_fail;
> +		goto dev_fail;
>  
>  	ret = mdev_create_sysfs_files(&mdev->dev, type);
> -	if (ret) {
> -		mdev_device_remove_ops(mdev, true);
> -		goto create_fail;
> -	}
> +	if (ret)
> +		goto sysfs_fail;
>  
>  	mdev->type_kobj = kobj;
>  	mdev->active = true;
>  	dev_dbg(&mdev->dev, "MDEV: created\n");
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
>  	return 0;
>  
> +sysfs_fail:
> +	device_del(&mdev->dev);
> +dev_fail:
> +	type->parent->ops->remove(mdev);
>  create_fail:
> -	device_unregister(&mdev->dev);
> +	put_device(&mdev->dev);
>  mdev_fail:
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  	mdev_put_parent(parent);
>  	return ret;
>  }
>  
> -int mdev_device_remove(struct device *dev, bool force_remove)
> +int mdev_device_remove(struct device *dev)
>  {
> +	struct mdev_parent *valid_parent;
>  	struct mdev_device *mdev;
>  	struct mdev_parent *parent;
> -	struct mdev_type *type;
> +	int srcu_idx;
>  	int ret;
>  
>  	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
> +	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
> +	if (!valid_parent) {
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		/* parent is undergoing unregistration */
> +		return -ENODEV;
> +	}
> +
> +	mutex_lock(&mdev_list_lock);
>  	if (!mdev->active) {
>  		mutex_unlock(&mdev_list_lock);
> -		return -EAGAIN;
> +		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
> +		return -ENODEV;
>  	}
> -
>  	mdev->active = false;
>  	mutex_unlock(&mdev_list_lock);
>  
> -	type = to_mdev_type(mdev->type_kobj);
> -	parent = mdev->parent;
> -
> -	ret = mdev_device_remove_ops(mdev, force_remove);
> -	if (ret) {
> -		mdev->active = true;
> -		return ret;
> -	}
> +	ret = mdev_device_must_remove(mdev);
> +	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
>  
> -	mdev_remove_sysfs_files(dev, type);
> -	device_unregister(dev);
>  	mdev_put_parent(parent);
> -
> -	return 0;
> +	return ret;
>  }
>  
>  static int __init mdev_init(void)
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> index 84b2b6c..3d17db9 100644
> --- a/drivers/vfio/mdev/mdev_private.h
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -23,6 +23,11 @@ struct mdev_parent {
>  	struct list_head next;
>  	struct kset *mdev_types_kset;
>  	struct list_head type_list;
> +	/* Protects unregistration to wait until create/remove
> +	 * are completed.
> +	 */
> +	struct srcu_struct unreg_srcu;
> +	struct mdev_parent __rcu *self;
>  };
>  
>  struct mdev_device {
> @@ -58,6 +63,6 @@ struct mdev_type {
>  void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
>  
>  int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
> -int  mdev_device_remove(struct device *dev, bool force_remove);
> +int  mdev_device_remove(struct device *dev);
>  
>  #endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> index c782fa9..68a8191 100644
> --- a/drivers/vfio/mdev/mdev_sysfs.c
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -236,11 +236,9 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
>  	if (val && device_remove_file_self(dev, attr)) {
>  		int ret;
>  
> -		ret = mdev_device_remove(dev, false);
> -		if (ret) {
> -			device_create_file(dev, attr);
> +		ret = mdev_device_remove(dev);
> +		if (ret)
>  			return ret;
> -		}
>  	}
>  
>  	return count;
>
Alex Williamson March 26, 2019, 3:21 p.m. UTC | #11
On Tue, 26 Mar 2019 05:53:22 +0000
Parav Pandit <parav@mellanox.com> wrote:

> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org <linux-kernel-  
> > owner@vger.kernel.org> On Behalf Of Parav Pandit  
> > Sent: Monday, March 25, 2019 10:19 PM
> > To: Alex Williamson <alex.williamson@redhat.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > kwankhede@nvidia.com
> > Subject: RE: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> > 
> > 
> >   
> > > -----Original Message-----
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, March 25, 2019 9:17 PM
> > > To: Parav Pandit <parav@mellanox.com>
> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > kwankhede@nvidia.com
> > > Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> > >
> > > On Tue, 26 Mar 2019 01:43:44 +0000
> > > Parav Pandit <parav@mellanox.com> wrote:
> > >  
> > > > > -----Original Message-----
> > > > > From: Alex Williamson <alex.williamson@redhat.com>  
> 
> > > > > I mean the callback iterator on the parent remove can do a WARN_ON
> > > > > if this returns an error while the device remove path can silently
> > > > > return -EBUSY, the common function doesn't need to decide whether
> > > > > the parent ops remove function deserves a dev_err.
> > > > >  
> > > > Ok. I understood.
> > > > But device remove returning silent -EBUSY looks an error that should
> > > > get logged in, because this is something not expected. Its probably
> > > > late for sysfs layer to return report an error by that time it
> > > > prints device name, because put_device() is done. So if remove()
> > > > returns an error, I think its legitimate failure to do WARN_ON or  
> > dev_err().  
> > >
> > > Calling put_device() if the parent remove op fails looks like a bug
> > > introduced by this series, the current code allows that failure
> > > leaving the device in a coherent state and returning errno to the sysfs  
> > store function.  
> > >  
> > Why should it fail?
> > We are taking off the device bus first as describe in commit log.
> > This ensures that everything is closed before calling the remove().
> > We cannot avoid put_device() and put_parent, it all buggy path...  
> 
> I audited remove() callbacks of kvmgt.c, vfio_ccw_ops.c,
> vfio_ap_ops.c, mbochs.c, mdpy.c, mtty.c, who makes the remove
> possible once the device release is executed. This should complete
> once the device is taken off the bus. This was not the case before
> this sequence where remove() is done while device is open...hence the
> check was needed in past. dev_err() is to help catch any errors/bugs
> in this area.
> 
> I doubt we need to retry remove() like vfio_del_group_dev(), in
> mdev_core if release() is not yet complete.

I'm ok with this, I've always thought the 'force' semantics and
allowing remove to fail were not terribly inline with other drivers,
even if ultimately I wish drivers could nak a remove request to avoid
the ugliness of blocking.  But ultimately you'll need to come to an
agreement with Kirti, the drivers we have in-tree are not the complete
set of mdev drivers, but it also doesn't necessarily make sense to cater
to the lone out-of-tree driver either.  Thanks,

Alex
Alex Williamson March 26, 2019, 3:26 p.m. UTC | #12
On Tue, 26 Mar 2019 12:36:22 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > There are five problems with current code structure.
> > 1. mdev device is placed on the mdev bus before it is created in the
> > vendor driver. Once a device is placed on the mdev bus without creating
> > its supporting underlying vendor device, an open() can get triggered by
> > userspace on partially initialized device.
> > Below ladder diagram highlight it.
> > 
> >       cpu-0                                       cpu-1
> >       -----                                       -----
> >    create_store()
> >      mdev_create_device()
> >        device_register()
> >           ...
> >          vfio_mdev_probe()
> >          ...creates char device
> >                                         vfio_mdev_open()
> >                                           parent->ops->open(mdev)
> >                                             vfio_ap_mdev_open()
> >                                               matrix_mdev = NULL
> >         [...]
> >         parent->ops->create()
> >           vfio_ap_mdev_create()
> >             mdev_set_drvdata(mdev, matrix_mdev);
> >             /* Valid pointer set above */
> >   
> 
> VFIO interface uses sysfs path of device or PCI device's BDF where it
> checks sysfs file for that device exist.
> In case of VFIO mdev device, above situation will never happen as open
> will only get called if sysfs entry for that device exist.
> 
> If you don't use VFIO interface then this situation can arise. In that
> case probe() can be used for very basic initialization then create
> actual char device from create().
> 
> 
> > 2. Current creation sequence is,
> >    parent->ops_create()
> >    groups_register()
> > 
> > Remove sequence is,
> >    parent->ops->remove()
> >    groups_unregister()
> > However, remove sequence should be exact mirror of creation sequence.
> > Once this is achieved, all users of the mdev will be terminated first
> > before removing underlying vendor device.
> > (Follow standard linux driver model).
> > At that point vendor's remove() ops shouldn't failed because device is
> > taken off the bus that should terminate the users.
> >   
> 
> If VMM or user space application is using mdev device,
> parent->ops->remove() can return failure. In that case sysfs files
> shouldn't be removed. Hence above sequence is followed for remove.
> 
> Standard linux driver model doesn't allow remove() to fail, but in
> of mdev framework, interface is defined to handle such error case.
> 
> 
> > 3. Additionally any new mdev driver that wants to work on mdev device
> > during probe() routine registered using mdev_register_driver() needs to
> > get stable mdev structure.
> >   
> 
> Things that you are trying to handle with mdev structure from probe(),
> couldn't that be moved to create()?
> 
> 
> > 4. In following sequence, child devices created while removing mdev parent
> > device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> > 
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> > 
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> > 
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > 
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> > 
> > 5. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_driver() completes parent unregistration.
> > 
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> > 
> > This is similar race like create() racing with mdev_unregister_device().
> > 
> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mdev_device_remove sleep started
> > mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668
> > PGD af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted 5.0.0-rc7-vdevbus+ #2
> > Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev]
> > Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> > 
> > Therefore, mdev core is improved in following ways to overcome above
> > issues.
> > 
> > 1. Before placing mdev devices on the bus, perform vendor drivers
> > creation which supports the mdev creation.
> > This ensures that mdev specific all necessary fields are initialized
> > before a given mdev can be accessed by bus driver.
> > 
> > 2. During remove flow, first remove the device from the bus. This
> > ensures that any bus specific devices and data is cleared.
> > Once device is taken of the mdev bus, perform remove() of mdev from the
> > vendor driver.
> >  
> 
> If user space application is using the device and someone underneath
> remove the device from bus, how would use space application know that
> device is being removed?
> If DMA is setup, user space application is accessing that memory and
> device is removed from bus - how will you restrict to not to remove that
> device? If remove() is not restricted then host might crash.
> I know Linux kernel device core model doesn't allow remove() to fail,
> but we had tackled that problem for mdev devices in this framework. I
> prefer not to change this behavior. This will regress existing working
> drivers.


We have exactly this issue with vfio-pci, or really any vfio driver,
where the solution is that a remove request is blocked until the device
becomes unused by the user.  In fact there's a notification that
userspace can connect to so that we don't need to silently wait for
userspace to be done.  We could also potentially kill the userspace
application using the device, or if we ever implemented revoke support
for mmaps, we could unmap the device and the use could handle the
SIGBUS.  With Parav's suggestion to fix the ordering such that the
device is first removed from the bus, where the blocking opportunity
comes into play, it might be time to let go of this one-off
force/not-force behavior.  Thanks,

Alex
Parav Pandit March 26, 2019, 3:30 p.m. UTC | #13
> -----Original Message-----
> From: Kirti Wankhede <kwankhede@nvidia.com>
> Sent: Tuesday, March 26, 2019 2:06 AM
> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; alex.williamson@redhat.com
> Cc: Neo Jia <cjia@nvidia.com>
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> 
> 
> On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > There are five problems with current code structure.
> > 1. mdev device is placed on the mdev bus before it is created in the
> > vendor driver. Once a device is placed on the mdev bus without
> > creating its supporting underlying vendor device, an open() can get
> > triggered by userspace on partially initialized device.
> > Below ladder diagram highlight it.
> >
> >       cpu-0                                       cpu-1
> >       -----                                       -----
> >    create_store()
> >      mdev_create_device()
> >        device_register()
> >           ...
> >          vfio_mdev_probe()
> >          ...creates char device
> >                                         vfio_mdev_open()
> >                                           parent->ops->open(mdev)
> >                                             vfio_ap_mdev_open()
> >                                               matrix_mdev = NULL
> >         [...]
> >         parent->ops->create()
> >           vfio_ap_mdev_create()
> >             mdev_set_drvdata(mdev, matrix_mdev);
> >             /* Valid pointer set above */
> >
> 
> VFIO interface uses sysfs path of device or PCI device's BDF where it checks
> sysfs file for that device exist.
> In case of VFIO mdev device, above situation will never happen as open will
> only get called if sysfs entry for that device exist.
> 
> If you don't use VFIO interface then this situation can arise. In that case
> probe() can be used for very basic initialization then create actual char
> device from create().
> 
I explained you that create() cannot do the heavy lifting work of creating netdev and rdma dev because at that stage driver doesn't know whether its getting used for VM or host.
create() needs to create the device that probe() can work on in stable manner.

> 
> > 2. Current creation sequence is,
> >    parent->ops_create()
> >    groups_register()
> >
> > Remove sequence is,
> >    parent->ops->remove()
> >    groups_unregister()
> > However, remove sequence should be exact mirror of creation sequence.
> > Once this is achieved, all users of the mdev will be terminated first
> > before removing underlying vendor device.
> > (Follow standard linux driver model).
> > At that point vendor's remove() ops shouldn't failed because device is
> > taken off the bus that should terminate the users.
> >
> 
> If VMM or user space application is using mdev device,
> parent->ops->remove() can return failure. In that case sysfs files
> shouldn't be removed. Hence above sequence is followed for remove.
> 
> Standard linux driver model doesn't allow remove() to fail, but in of mdev
> framework, interface is defined to handle such error case.
> 
But the sequence is incorrect for wider use case.
> 
> > 3. Additionally any new mdev driver that wants to work on mdev device
> > during probe() routine registered using mdev_register_driver() needs
> > to get stable mdev structure.
> >
> 
> Things that you are trying to handle with mdev structure from probe(),
> couldn't that be moved to create()?
> 
No, as explained before and above.
That approach just doesn't look right.
 
> 
> > 4. In following sequence, child devices created while removing mdev
> > parent device can be left out, or it may lead to race of removing half
> > initialized child mdev devices.
> >
> > issue-1:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> >                                   mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >                                   parent_remove_sysfs_files()
> >                                   /* BUG: device added by cpu-0
> >                                    * whose parent is getting removed.
> >                                    */
> >
> > issue-2:
> > --------
> >        cpu-0                         cpu-1
> >        -----                         -----
> > create_store()
> >   mdev_device_create()                   [...]
> >        device_register()
> >
> >        [...]                      mdev_unregister_device()
> >                                      device_for_each_child()
> >                                         mdev_device_remove_cb()
> >                                             mdev_device_remove()
> >
> >        mdev_create_sysfs_files()
> >        /* BUG: create is adding
> >         * sysfs files for a device
> >         * which is undergoing removal.
> >         */
> >                                  parent_remove_sysfs_files()
> >
> > 5. Below crash is observed when user initiated remove is in progress
> > and mdev_unregister_driver() completes parent unregistration.
> >
> >        cpu-0                         cpu-1
> >        -----                         -----
> > remove_store()
> >    mdev_device_remove()
> >    active = false;
> >                                   mdev_unregister_device()
> >                                     remove type
> >    [...]
> >    mdev_remove_ops() crashes.
> >
> > This is similar race like create() racing with mdev_unregister_device().
> >
> > mtty mtty: MDEV: Registered
> > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> > vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> > mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > mtty_dev: Unloaded!
> > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > Oops: 0000 [#1] SMP PTI
> > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> >  mdev_device_remove+0xef/0x130 [mdev]
> >  remove_store+0x77/0xa0 [mdev]
> >  kernfs_fop_write+0x113/0x1a0
> >  __vfs_write+0x33/0x1b0
> >  ? rcu_read_lock_sched_held+0x64/0x70
> >  ? rcu_sync_lockdep_assert+0x2a/0x50
> >  ? __sb_start_write+0x121/0x1b0
> >  ? vfs_write+0x17c/0x1b0
> >  vfs_write+0xad/0x1b0
> >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >  ksys_write+0x55/0xc0
> >  do_syscall_64+0x5a/0x210
> >
> > Therefore, mdev core is improved in following ways to overcome above
> > issues.
> >
> > 1. Before placing mdev devices on the bus, perform vendor drivers
> > creation which supports the mdev creation.
> > This ensures that mdev specific all necessary fields are initialized
> > before a given mdev can be accessed by bus driver.
> >
> > 2. During remove flow, first remove the device from the bus. This
> > ensures that any bus specific devices and data is cleared.
> > Once device is taken of the mdev bus, perform remove() of mdev from
> > the vendor driver.
> >
> 
> If user space application is using the device and someone underneath
> remove the device from bus, how would use space application know that
> device is being removed?
vfio_mdev guards and wait for device to get closed.

One sample trace is below.
[<0>] vfio_del_group_dev+0x34a/0x3c0 [vfio]
[<0>] mdev_remove+0x21/0x40 [mdev]
[<0>] device_release_driver_internal+0xe8/0x1b0
[<0>] bus_remove_device+0xf9/0x170
[<0>] device_del+0x168/0x350
[<0>] mdev_device_remove_common+0x1e/0x60 [mdev]
[<0>] mdev_device_remove_cb+0x1a/0x30 [mdev]
[<0>] device_for_each_child+0x47/0x90
[<0>] mdev_unregister_device+0xdb/0x100 [mdev]
[<0>] mtty_dev_exit+0x17/0x843 [mtty]
[<0>] __x64_sys_delete_module+0x16b/0x240
[<0>] do_syscall_64+0x5a/0x210
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<0>] 0xffffffffffffffff

> If DMA is setup, user space application is accessing that memory and device
> is removed from bus - how will you restrict to not to remove that device? If
> remove() is not restricted then host might crash.
> I know Linux kernel device core model doesn't allow remove() to fail, but we
> had tackled that problem for mdev devices in this framework. I prefer not to
> change this behavior. This will regress existing working drivers.
> 
vfio layer ensures that open device cannot be removed from above trace.

Other drivers will follow similar method. In case of mlx5 driver which binds
to mdev follows standard driver model to terminate for this mdev device,
similar way for pci device.

> 
> > 3. Linux core device model provides way to register and auto
> > unregister the device sysfs attribute groups at dev->groups.
> > Make use of this groups to let core create the groups and simplify
> > code to avoid explicit groups creation and removal.
> >
> > 4. Wait for any ongoing mdev create() and remove() to finish before
> > unregistering parent device using srcu. This continues to allow
> > multiple create and remove to progress in parallel. At the same time
> > guard parent removal while parent is being access by create() and remove
> callbacks.
> >
> 
> Agreed with this.
> Alex already mentioned, it would be better to have separate patch for this
> fix.
> 
Patches are ready, I am waiting for above discussion to close before posting v1.
Parav Pandit March 27, 2019, 3:19 a.m. UTC | #14
Hi Alex,

> -----Original Message-----
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, March 26, 2019 10:27 AM
> To: Kirti Wankhede <kwankhede@nvidia.com>
> Cc: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> kernel@vger.kernel.org; Neo Jia <cjia@nvidia.com>
> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> 
> On Tue, 26 Mar 2019 12:36:22 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 3/23/2019 4:50 AM, Parav Pandit wrote:
> > > There are five problems with current code structure.
> > > 1. mdev device is placed on the mdev bus before it is created in the
> > > vendor driver. Once a device is placed on the mdev bus without
> > > creating its supporting underlying vendor device, an open() can get
> > > triggered by userspace on partially initialized device.
> > > Below ladder diagram highlight it.
> > >
> > >       cpu-0                                       cpu-1
> > >       -----                                       -----
> > >    create_store()
> > >      mdev_create_device()
> > >        device_register()
> > >           ...
> > >          vfio_mdev_probe()
> > >          ...creates char device
> > >                                         vfio_mdev_open()
> > >                                           parent->ops->open(mdev)
> > >                                             vfio_ap_mdev_open()
> > >                                               matrix_mdev = NULL
> > >         [...]
> > >         parent->ops->create()
> > >           vfio_ap_mdev_create()
> > >             mdev_set_drvdata(mdev, matrix_mdev);
> > >             /* Valid pointer set above */
> > >
> >
> > VFIO interface uses sysfs path of device or PCI device's BDF where it
> > checks sysfs file for that device exist.
> > In case of VFIO mdev device, above situation will never happen as open
> > will only get called if sysfs entry for that device exist.
> >
> > If you don't use VFIO interface then this situation can arise. In that
> > case probe() can be used for very basic initialization then create
> > actual char device from create().
> >
> >
> > > 2. Current creation sequence is,
> > >    parent->ops_create()
> > >    groups_register()
> > >
> > > Remove sequence is,
> > >    parent->ops->remove()
> > >    groups_unregister()
> > > However, remove sequence should be exact mirror of creation sequence.
> > > Once this is achieved, all users of the mdev will be terminated
> > > first before removing underlying vendor device.
> > > (Follow standard linux driver model).
> > > At that point vendor's remove() ops shouldn't failed because device
> > > is taken off the bus that should terminate the users.
> > >
> >
> > If VMM or user space application is using mdev device,
> > parent->ops->remove() can return failure. In that case sysfs files
> > shouldn't be removed. Hence above sequence is followed for remove.
> >
> > Standard linux driver model doesn't allow remove() to fail, but in of
> > mdev framework, interface is defined to handle such error case.
> >
> >
> > > 3. Additionally any new mdev driver that wants to work on mdev
> > > device during probe() routine registered using
> > > mdev_register_driver() needs to get stable mdev structure.
> > >
> >
> > Things that you are trying to handle with mdev structure from probe(),
> > couldn't that be moved to create()?
> >
> >
> > > 4. In following sequence, child devices created while removing mdev
> > > parent device can be left out, or it may lead to race of removing
> > > half initialized child mdev devices.
> > >
> > > issue-1:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > >                                   mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >                                   parent_remove_sysfs_files()
> > >                                   /* BUG: device added by cpu-0
> > >                                    * whose parent is getting removed.
> > >                                    */
> > >
> > > issue-2:
> > > --------
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > create_store()
> > >   mdev_device_create()                   [...]
> > >        device_register()
> > >
> > >        [...]                      mdev_unregister_device()
> > >                                      device_for_each_child()
> > >                                         mdev_device_remove_cb()
> > >                                             mdev_device_remove()
> > >
> > >        mdev_create_sysfs_files()
> > >        /* BUG: create is adding
> > >         * sysfs files for a device
> > >         * which is undergoing removal.
> > >         */
> > >                                  parent_remove_sysfs_files()
> > >
> > > 5. Below crash is observed when user initiated remove is in progress
> > > and mdev_unregister_driver() completes parent unregistration.
> > >
> > >        cpu-0                         cpu-1
> > >        -----                         -----
> > > remove_store()
> > >    mdev_device_remove()
> > >    active = false;
> > >                                   mdev_unregister_device()
> > >                                     remove type
> > >    [...]
> > >    mdev_remove_ops() crashes.
> > >
> > > This is similar race like create() racing with mdev_unregister_device().
> > >
> > > mtty mtty: MDEV: Registered
> > > iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group
> > > 57 vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id =
> > > 57 mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> > > mtty_dev: Unloaded!
> > > BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> > > af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> > > Oops: 0000 [#1] SMP PTI
> > > CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> > > 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> > > SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > > RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> > >  mdev_device_remove+0xef/0x130 [mdev]
> > >  remove_store+0x77/0xa0 [mdev]
> > >  kernfs_fop_write+0x113/0x1a0
> > >  __vfs_write+0x33/0x1b0
> > >  ? rcu_read_lock_sched_held+0x64/0x70
> > >  ? rcu_sync_lockdep_assert+0x2a/0x50  ? __sb_start_write+0x121/0x1b0
> > > ? vfs_write+0x17c/0x1b0
> > >  vfs_write+0xad/0x1b0
> > >  ? trace_hardirqs_on_thunk+0x1a/0x1c
> > >  ksys_write+0x55/0xc0
> > >  do_syscall_64+0x5a/0x210
> > >
> > > Therefore, mdev core is improved in following ways to overcome above
> > > issues.
> > >
> > > 1. Before placing mdev devices on the bus, perform vendor drivers
> > > creation which supports the mdev creation.
> > > This ensures that mdev specific all necessary fields are initialized
> > > before a given mdev can be accessed by bus driver.
> > >
> > > 2. During remove flow, first remove the device from the bus. This
> > > ensures that any bus specific devices and data is cleared.
> > > Once device is taken of the mdev bus, perform remove() of mdev from
> > > the vendor driver.
> > >
> >
> > If user space application is using the device and someone underneath
> > remove the device from bus, how would use space application know that
> > device is being removed?
> > If DMA is setup, user space application is accessing that memory and
> > device is removed from bus - how will you restrict to not to remove
> > that device? If remove() is not restricted then host might crash.
> > I know Linux kernel device core model doesn't allow remove() to fail,
> > but we had tackled that problem for mdev devices in this framework. I
> > prefer not to change this behavior. This will regress existing working
> > drivers.
> 
> 
> We have exactly this issue with vfio-pci, or really any vfio driver, where the
> solution is that a remove request is blocked until the device becomes
> unused by the user.  In fact there's a notification that userspace can connect
> to so that we don't need to silently wait for userspace to be done.  We could
> also potentially kill the userspace application using the device, or if we ever
> implemented revoke support for mmaps, we could unmap the device and
> the use could handle the SIGBUS.  With Parav's suggestion to fix the ordering
> such that the device is first removed from the bus, where the blocking
> opportunity comes into play, it might be time to let go of this one-off
> force/not-force behavior.  Thanks,
>
 
Yes. I think we should do it.
For now (for next few days), I am dropping this particular order fixing patch from the series.
From my last 8th patch, I am keeping only the fix for create/remove race with parent removal along with other fixes and cleanup.
Posting the v1 in sometime to make progress on already reviewed parts and part of the 8th patch.

I cannot split the remove_common() helper function to a different patch, because remove_cb() will bypass mdev->active check without srcu().
So as individual patch, its not correct behavior.
Hence, that small refactor is part of srcu fix.
Kirti Wankhede March 28, 2019, 5:20 p.m. UTC | #15
On 3/26/2019 9:00 PM, Parav Pandit wrote:
> 
> 
>> -----Original Message-----
>> From: Kirti Wankhede <kwankhede@nvidia.com>
>> Sent: Tuesday, March 26, 2019 2:06 AM
>> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
>> kernel@vger.kernel.org; alex.williamson@redhat.com
>> Cc: Neo Jia <cjia@nvidia.com>
>> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
>>
>>
>>
>> On 3/23/2019 4:50 AM, Parav Pandit wrote:
>>> There are five problems with current code structure.
>>> 1. mdev device is placed on the mdev bus before it is created in the
>>> vendor driver. Once a device is placed on the mdev bus without
>>> creating its supporting underlying vendor device, an open() can get
>>> triggered by userspace on partially initialized device.
>>> Below ladder diagram highlight it.
>>>
>>>       cpu-0                                       cpu-1
>>>       -----                                       -----
>>>    create_store()
>>>      mdev_create_device()
>>>        device_register()
>>>           ...
>>>          vfio_mdev_probe()
>>>          ...creates char device
>>>                                         vfio_mdev_open()
>>>                                           parent->ops->open(mdev)
>>>                                             vfio_ap_mdev_open()
>>>                                               matrix_mdev = NULL
>>>         [...]
>>>         parent->ops->create()
>>>           vfio_ap_mdev_create()
>>>             mdev_set_drvdata(mdev, matrix_mdev);
>>>             /* Valid pointer set above */
>>>
>>
>> VFIO interface uses sysfs path of device or PCI device's BDF where it checks
>> sysfs file for that device exist.
>> In case of VFIO mdev device, above situation will never happen as open will
>> only get called if sysfs entry for that device exist.
>>
>> If you don't use VFIO interface then this situation can arise. In that case
>> probe() can be used for very basic initialization then create actual char
>> device from create().
>>
> I explained you that create() cannot do the heavy lifting work of creating netdev and rdma dev because at that stage driver doesn't know whether its getting used for VM or host.
> create() needs to create the device that probe() can work on in stable manner.
> 

You can identify if its getting used by VM or host from create(). Since
probe() happens first, from create() you can check
mdev_dev(mdev)->driver->name, if its 'vfio_mdev' then its getting used
by VM, otherwise used by host.

>>
>>> 2. Current creation sequence is,
>>>    parent->ops_create()
>>>    groups_register()
>>>
>>> Remove sequence is,
>>>    parent->ops->remove()
>>>    groups_unregister()
>>> However, remove sequence should be exact mirror of creation sequence.
>>> Once this is achieved, all users of the mdev will be terminated first
>>> before removing underlying vendor device.
>>> (Follow standard linux driver model).
>>> At that point vendor's remove() ops shouldn't failed because device is
>>> taken off the bus that should terminate the users.
>>>
>>
>> If VMM or user space application is using mdev device,
>> parent->ops->remove() can return failure. In that case sysfs files
>> shouldn't be removed. Hence above sequence is followed for remove.
>>
>> Standard linux driver model doesn't allow remove() to fail, but in of mdev
>> framework, interface is defined to handle such error case.
>>
> But the sequence is incorrect for wider use case.
>>
>>> 3. Additionally any new mdev driver that wants to work on mdev device
>>> during probe() routine registered using mdev_register_driver() needs
>>> to get stable mdev structure.
>>>
>>
>> Things that you are trying to handle with mdev structure from probe(),
>> couldn't that be moved to create()?
>>
> No, as explained before and above.
> That approach just doesn't look right.
>

As I mentioned abouve, you can do that.


>>
>>> 4. In following sequence, child devices created while removing mdev
>>> parent device can be left out, or it may lead to race of removing half
>>> initialized child mdev devices.
>>>
>>> issue-1:
>>> --------
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>>                                   mdev_unregister_device()
>>>                                      device_for_each_child()
>>>                                         mdev_device_remove_cb()
>>>                                             mdev_device_remove()
>>> create_store()
>>>   mdev_device_create()                   [...]
>>>        device_register()
>>>                                   parent_remove_sysfs_files()
>>>                                   /* BUG: device added by cpu-0
>>>                                    * whose parent is getting removed.
>>>                                    */
>>>
>>> issue-2:
>>> --------
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>> create_store()
>>>   mdev_device_create()                   [...]
>>>        device_register()
>>>
>>>        [...]                      mdev_unregister_device()
>>>                                      device_for_each_child()
>>>                                         mdev_device_remove_cb()
>>>                                             mdev_device_remove()
>>>
>>>        mdev_create_sysfs_files()
>>>        /* BUG: create is adding
>>>         * sysfs files for a device
>>>         * which is undergoing removal.
>>>         */
>>>                                  parent_remove_sysfs_files()
>>>
>>> 5. Below crash is observed when user initiated remove is in progress
>>> and mdev_unregister_driver() completes parent unregistration.
>>>
>>>        cpu-0                         cpu-1
>>>        -----                         -----
>>> remove_store()
>>>    mdev_device_remove()
>>>    active = false;
>>>                                   mdev_unregister_device()
>>>                                     remove type
>>>    [...]
>>>    mdev_remove_ops() crashes.
>>>
>>> This is similar race like create() racing with mdev_unregister_device().
>>>
>>> mtty mtty: MDEV: Registered
>>> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
>>> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
>>> mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
>>> mtty_dev: Unloaded!
>>> BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
>>> af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
>>> Oops: 0000 [#1] SMP PTI
>>> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
>>> 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
>>> SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
>>> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
>>>  mdev_device_remove+0xef/0x130 [mdev]
>>>  remove_store+0x77/0xa0 [mdev]
>>>  kernfs_fop_write+0x113/0x1a0
>>>  __vfs_write+0x33/0x1b0
>>>  ? rcu_read_lock_sched_held+0x64/0x70
>>>  ? rcu_sync_lockdep_assert+0x2a/0x50
>>>  ? __sb_start_write+0x121/0x1b0
>>>  ? vfs_write+0x17c/0x1b0
>>>  vfs_write+0xad/0x1b0
>>>  ? trace_hardirqs_on_thunk+0x1a/0x1c
>>>  ksys_write+0x55/0xc0
>>>  do_syscall_64+0x5a/0x210
>>>
>>> Therefore, mdev core is improved in following ways to overcome above
>>> issues.
>>>
>>> 1. Before placing mdev devices on the bus, perform vendor drivers
>>> creation which supports the mdev creation.
>>> This ensures that mdev specific all necessary fields are initialized
>>> before a given mdev can be accessed by bus driver.
>>>
>>> 2. During remove flow, first remove the device from the bus. This
>>> ensures that any bus specific devices and data is cleared.
>>> Once device is taken of the mdev bus, perform remove() of mdev from
>>> the vendor driver.
>>>
>>
>> If user space application is using the device and someone underneath
>> remove the device from bus, how would use space application know that
>> device is being removed?
> vfio_mdev guards and wait for device to get closed.
> 
> One sample trace is below.
> [<0>] vfio_del_group_dev+0x34a/0x3c0 [vfio]
> [<0>] mdev_remove+0x21/0x40 [mdev]
> [<0>] device_release_driver_internal+0xe8/0x1b0
> [<0>] bus_remove_device+0xf9/0x170
> [<0>] device_del+0x168/0x350
> [<0>] mdev_device_remove_common+0x1e/0x60 [mdev]
> [<0>] mdev_device_remove_cb+0x1a/0x30 [mdev]
> [<0>] device_for_each_child+0x47/0x90
> [<0>] mdev_unregister_device+0xdb/0x100 [mdev]
> [<0>] mtty_dev_exit+0x17/0x843 [mtty]
> [<0>] __x64_sys_delete_module+0x16b/0x240
> [<0>] do_syscall_64+0x5a/0x210
> [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [<0>] 0xffffffffffffffff
> 
>> If DMA is setup, user space application is accessing that memory and device
>> is removed from bus - how will you restrict to not to remove that device? If
>> remove() is not restricted then host might crash.
>> I know Linux kernel device core model doesn't allow remove() to fail, but we
>> had tackled that problem for mdev devices in this framework. I prefer not to
>> change this behavior. This will regress existing working drivers.
>>
> vfio layer ensures that open device cannot be removed from above trace.
> 
> Other drivers will follow similar method. In case of mlx5 driver which binds
> to mdev follows standard driver model to terminate for this mdev device,
> similar way for pci device.
> 

But then remove() or write on 'remove' sysfs would block, which could be
indefinite. For example in case of VM, it will block until VM is not
shutdown.
With current approach, write on 'remove' sysfs doesn't block.

Thanks,
Kirti

>>
>>> 3. Linux core device model provides way to register and auto
>>> unregister the device sysfs attribute groups at dev->groups.
>>> Make use of this groups to let core create the groups and simplify
>>> code to avoid explicit groups creation and removal.
>>>
>>> 4. Wait for any ongoing mdev create() and remove() to finish before
>>> unregistering parent device using srcu. This continues to allow
>>> multiple create and remove to progress in parallel. At the same time
>>> guard parent removal while parent is being access by create() and remove
>> callbacks.
>>>
>>
>> Agreed with this.
>> Alex already mentioned, it would be better to have separate patch for this
>> fix.
>>
> Patches are ready, I am waiting for above discussion to close before posting v1.
>
Alex Williamson March 29, 2019, 2:49 p.m. UTC | #16
On Thu, 28 Mar 2019 22:50:38 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 3/26/2019 9:00 PM, Parav Pandit wrote:
> > 
> >   
> >> -----Original Message-----
> >> From: Kirti Wankhede <kwankhede@nvidia.com>
> >> Sent: Tuesday, March 26, 2019 2:06 AM
> >> To: Parav Pandit <parav@mellanox.com>; kvm@vger.kernel.org; linux-
> >> kernel@vger.kernel.org; alex.williamson@redhat.com
> >> Cc: Neo Jia <cjia@nvidia.com>
> >> Subject: Re: [PATCH 8/8] vfio/mdev: Improve the create/remove sequence
> >>
> >>
> >>
> >> On 3/23/2019 4:50 AM, Parav Pandit wrote:  
> >>> There are five problems with current code structure.
> >>> 1. mdev device is placed on the mdev bus before it is created in the
> >>> vendor driver. Once a device is placed on the mdev bus without
> >>> creating its supporting underlying vendor device, an open() can get
> >>> triggered by userspace on partially initialized device.
> >>> Below ladder diagram highlight it.
> >>>
> >>>       cpu-0                                       cpu-1
> >>>       -----                                       -----
> >>>    create_store()
> >>>      mdev_create_device()
> >>>        device_register()
> >>>           ...
> >>>          vfio_mdev_probe()
> >>>          ...creates char device
> >>>                                         vfio_mdev_open()
> >>>                                           parent->ops->open(mdev)
> >>>                                             vfio_ap_mdev_open()
> >>>                                               matrix_mdev = NULL
> >>>         [...]
> >>>         parent->ops->create()
> >>>           vfio_ap_mdev_create()
> >>>             mdev_set_drvdata(mdev, matrix_mdev);
> >>>             /* Valid pointer set above */
> >>>  
> >>
> >> VFIO interface uses sysfs path of device or PCI device's BDF where it checks
> >> sysfs file for that device exist.
> >> In case of VFIO mdev device, above situation will never happen as open will
> >> only get called if sysfs entry for that device exist.
> >>
> >> If you don't use VFIO interface then this situation can arise. In that case
> >> probe() can be used for very basic initialization then create actual char
> >> device from create().
> >>  
> > I explained you that create() cannot do the heavy lifting work of creating netdev and rdma dev because at that stage driver doesn't know whether its getting used for VM or host.
> > create() needs to create the device that probe() can work on in stable manner.
> >   
> 
> You can identify if its getting used by VM or host from create(). Since
> probe() happens first, from create() you can check
> mdev_dev(mdev)->driver->name, if its 'vfio_mdev' then its getting used
> by VM, otherwise used by host.

If this is suggesting that we should have different create paths based
on driver name, please no.  Mdev devices should not be special, they're
attached to a bus which can host multiple drivers and devices on that
bus should have the ability to switch between drivers.  Not to mention
that a strcmp of a driver name to infer the purpose of a device is just
ugly as can be.

> >>> 2. Current creation sequence is,
> >>>    parent->ops_create()
> >>>    groups_register()
> >>>
> >>> Remove sequence is,
> >>>    parent->ops->remove()
> >>>    groups_unregister()
> >>> However, remove sequence should be exact mirror of creation sequence.
> >>> Once this is achieved, all users of the mdev will be terminated first
> >>> before removing underlying vendor device.
> >>> (Follow standard linux driver model).
> >>> At that point vendor's remove() ops shouldn't failed because device is
> >>> taken off the bus that should terminate the users.
> >>>  
> >>
> >> If VMM or user space application is using mdev device,
> >> parent->ops->remove() can return failure. In that case sysfs files
> >> shouldn't be removed. Hence above sequence is followed for remove.
> >>
> >> Standard linux driver model doesn't allow remove() to fail, but in of mdev
> >> framework, interface is defined to handle such error case.
> >>  
> > But the sequence is incorrect for wider use case.  
> >>  
> >>> 3. Additionally any new mdev driver that wants to work on mdev device
> >>> during probe() routine registered using mdev_register_driver() needs
> >>> to get stable mdev structure.
> >>>  
> >>
> >> Things that you are trying to handle with mdev structure from probe(),
> >> couldn't that be moved to create()?
> >>  
> > No, as explained before and above.
> > That approach just doesn't look right.
> >  
> 
> As I mentioned abouve, you can do that.

But it would be wrong to do so.

> >>  
> >>> 4. In following sequence, child devices created while removing mdev
> >>> parent device can be left out, or it may lead to race of removing half
> >>> initialized child mdev devices.
> >>>
> >>> issue-1:
> >>> --------
> >>>        cpu-0                         cpu-1
> >>>        -----                         -----
> >>>                                   mdev_unregister_device()
> >>>                                      device_for_each_child()
> >>>                                         mdev_device_remove_cb()
> >>>                                             mdev_device_remove()
> >>> create_store()
> >>>   mdev_device_create()                   [...]
> >>>        device_register()
> >>>                                   parent_remove_sysfs_files()
> >>>                                   /* BUG: device added by cpu-0
> >>>                                    * whose parent is getting removed.
> >>>                                    */
> >>>
> >>> issue-2:
> >>> --------
> >>>        cpu-0                         cpu-1
> >>>        -----                         -----
> >>> create_store()
> >>>   mdev_device_create()                   [...]
> >>>        device_register()
> >>>
> >>>        [...]                      mdev_unregister_device()
> >>>                                      device_for_each_child()
> >>>                                         mdev_device_remove_cb()
> >>>                                             mdev_device_remove()
> >>>
> >>>        mdev_create_sysfs_files()
> >>>        /* BUG: create is adding
> >>>         * sysfs files for a device
> >>>         * which is undergoing removal.
> >>>         */
> >>>                                  parent_remove_sysfs_files()
> >>>
> >>> 5. Below crash is observed when user initiated remove is in progress
> >>> and mdev_unregister_driver() completes parent unregistration.
> >>>
> >>>        cpu-0                         cpu-1
> >>>        -----                         -----
> >>> remove_store()
> >>>    mdev_device_remove()
> >>>    active = false;
> >>>                                   mdev_unregister_device()
> >>>                                     remove type
> >>>    [...]
> >>>    mdev_remove_ops() crashes.
> >>>
> >>> This is similar race like create() racing with mdev_unregister_device().
> >>>
> >>> mtty mtty: MDEV: Registered
> >>> iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 57
> >>> vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 57
> >>> mdev_device_remove sleep started mtty mtty: MDEV: Unregistering
> >>> mtty_dev: Unloaded!
> >>> BUG: unable to handle kernel paging request at ffffffffc027d668 PGD
> >>> af9818067 P4D af9818067 PUD af981a067 PMD 8583c3067 PTE 0
> >>> Oops: 0000 [#1] SMP PTI
> >>> CPU: 15 PID: 3517 Comm: bash Kdump: loaded Not tainted
> >>> 5.0.0-rc7-vdevbus+ #2 Hardware name: Supermicro
> >>> SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
> >>> RIP: 0010:mdev_device_remove_ops+0x1a/0x50 [mdev] Call Trace:
> >>>  mdev_device_remove+0xef/0x130 [mdev]
> >>>  remove_store+0x77/0xa0 [mdev]
> >>>  kernfs_fop_write+0x113/0x1a0
> >>>  __vfs_write+0x33/0x1b0
> >>>  ? rcu_read_lock_sched_held+0x64/0x70
> >>>  ? rcu_sync_lockdep_assert+0x2a/0x50
> >>>  ? __sb_start_write+0x121/0x1b0
> >>>  ? vfs_write+0x17c/0x1b0
> >>>  vfs_write+0xad/0x1b0
> >>>  ? trace_hardirqs_on_thunk+0x1a/0x1c
> >>>  ksys_write+0x55/0xc0
> >>>  do_syscall_64+0x5a/0x210
> >>>
> >>> Therefore, mdev core is improved in following ways to overcome above
> >>> issues.
> >>>
> >>> 1. Before placing mdev devices on the bus, perform vendor drivers
> >>> creation which supports the mdev creation.
> >>> This ensures that mdev specific all necessary fields are initialized
> >>> before a given mdev can be accessed by bus driver.
> >>>
> >>> 2. During remove flow, first remove the device from the bus. This
> >>> ensures that any bus specific devices and data is cleared.
> >>> Once device is taken of the mdev bus, perform remove() of mdev from
> >>> the vendor driver.
> >>>  
> >>
> >> If user space application is using the device and someone underneath
> >> remove the device from bus, how would use space application know that
> >> device is being removed?  
> > vfio_mdev guards and wait for device to get closed.
> > 
> > One sample trace is below.
> > [<0>] vfio_del_group_dev+0x34a/0x3c0 [vfio]
> > [<0>] mdev_remove+0x21/0x40 [mdev]
> > [<0>] device_release_driver_internal+0xe8/0x1b0
> > [<0>] bus_remove_device+0xf9/0x170
> > [<0>] device_del+0x168/0x350
> > [<0>] mdev_device_remove_common+0x1e/0x60 [mdev]
> > [<0>] mdev_device_remove_cb+0x1a/0x30 [mdev]
> > [<0>] device_for_each_child+0x47/0x90
> > [<0>] mdev_unregister_device+0xdb/0x100 [mdev]
> > [<0>] mtty_dev_exit+0x17/0x843 [mtty]
> > [<0>] __x64_sys_delete_module+0x16b/0x240
> > [<0>] do_syscall_64+0x5a/0x210
> > [<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > [<0>] 0xffffffffffffffff
> >   
> >> If DMA is setup, user space application is accessing that memory and device
> >> is removed from bus - how will you restrict to not to remove that device? If
> >> remove() is not restricted then host might crash.
> >> I know Linux kernel device core model doesn't allow remove() to fail, but we
> >> had tackled that problem for mdev devices in this framework. I prefer not to
> >> change this behavior. This will regress existing working drivers.
> >>  
> > vfio layer ensures that open device cannot be removed from above trace.
> > 
> > Other drivers will follow similar method. In case of mlx5 driver which binds
> > to mdev follows standard driver model to terminate for this mdev device,
> > similar way for pci device.
> >   
> 
> But then remove() or write on 'remove' sysfs would block, which could be
> indefinite. For example in case of VM, it will block until VM is not
> shutdown.
> With current approach, write on 'remove' sysfs doesn't block.

OTOH, why should mdev be different than any other driver?  Blocking is
the current solution for all directly assigned vfio devices.  This is a
compromise between the device model not allowing an error return and
lack of support to be able to revoke mmaps to the device.  We already
have an interface in vfio to request a device from a cooperative user
(Maxim proposed adding this to the mdev interface), lacking a revoke
interface, that can be further escalated to killing the process.  What
we've heard previously when pursuing an error path from removing a
device is that all responsibility lies with the admin in using these
interfaces.  If a remove is requested, it should be honored.  If that
results in killing a task, the fault is on the admin.  Mdev is not its
own island to decide a different model. Thanks,

Alex
diff mbox series

Patch

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 944a058..8fe0ed1 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -84,6 +84,7 @@  static void mdev_release_parent(struct kref *kref)
 						  ref);
 	struct device *dev = parent->dev;
 
+	cleanup_srcu_struct(&parent->unreg_srcu);
 	kfree(parent);
 	put_device(dev);
 }
@@ -103,56 +104,30 @@  static inline void mdev_put_parent(struct mdev_parent *parent)
 		kref_put(&parent->ref, mdev_release_parent);
 }
 
-static int mdev_device_create_ops(struct kobject *kobj,
-				  struct mdev_device *mdev)
+static int mdev_device_must_remove(struct mdev_device *mdev)
 {
-	struct mdev_parent *parent = mdev->parent;
+	struct mdev_parent *parent;
+	struct mdev_type *type;
 	int ret;
 
-	ret = parent->ops->create(kobj, mdev);
-	if (ret)
-		return ret;
+	type = to_mdev_type(mdev->type_kobj);
 
-	ret = sysfs_create_groups(&mdev->dev.kobj,
-				  parent->ops->mdev_attr_groups);
+	mdev_remove_sysfs_files(&mdev->dev, type);
+	device_del(&mdev->dev);
+	parent = mdev->parent;
+	ret = parent->ops->remove(mdev);
 	if (ret)
-		parent->ops->remove(mdev);
+		dev_err(&mdev->dev, "Remove failed: err=%d\n", ret);
 
+	/* Balances with device_initialize() */
+	put_device(&mdev->dev);
 	return ret;
 }
 
-/*
- * mdev_device_remove_ops gets called from sysfs's 'remove' and when parent
- * device is being unregistered from mdev device framework.
- * - 'force_remove' is set to 'false' when called from sysfs's 'remove' which
- *   indicates that if the mdev device is active, used by VMM or userspace
- *   application, vendor driver could return error then don't remove the device.
- * - 'force_remove' is set to 'true' when called from mdev_unregister_device()
- *   which indicate that parent device is being removed from mdev device
- *   framework so remove mdev device forcefully.
- */
-static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
-{
-	struct mdev_parent *parent = mdev->parent;
-	int ret;
-
-	/*
-	 * Vendor driver can return error if VMM or userspace application is
-	 * using this mdev device.
-	 */
-	ret = parent->ops->remove(mdev);
-	if (ret && !force_remove)
-		return ret;
-
-	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
-	return 0;
-}
-
 static int mdev_device_remove_cb(struct device *dev, void *data)
 {
 	if (dev_is_mdev(dev))
-		mdev_device_remove(dev, true);
-
+		mdev_device_must_remove(to_mdev_device(dev));
 	return 0;
 }
 
@@ -194,6 +169,7 @@  int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	}
 
 	kref_init(&parent->ref);
+	init_srcu_struct(&parent->unreg_srcu);
 
 	parent->dev = dev;
 	parent->ops = ops;
@@ -214,6 +190,7 @@  int mdev_register_device(struct device *dev, const struct mdev_parent_ops *ops)
 	if (ret)
 		dev_warn(dev, "Failed to create compatibility class link\n");
 
+	rcu_assign_pointer(parent->self, parent);
 	list_add(&parent->next, &parent_list);
 	mutex_unlock(&parent_list_lock);
 
@@ -244,21 +221,36 @@  void mdev_unregister_device(struct device *dev)
 
 	mutex_lock(&parent_list_lock);
 	parent = __find_parent_device(dev);
-
 	if (!parent) {
 		mutex_unlock(&parent_list_lock);
 		return;
 	}
+	list_del(&parent->next);
+	mutex_unlock(&parent_list_lock);
+
 	dev_info(dev, "MDEV: Unregistering\n");
 
-	list_del(&parent->next);
+	/* Publish that this mdev parent is unregistering. So any new
+	 * create/remove cannot start on this parent anymore by user.
+	 */
+	rcu_assign_pointer(parent->self, NULL);
+
+	/*
+	 * Wait for any active create() or remove() mdev ops on the parent
+	 * to complete.
+	 */
+	synchronize_srcu(&parent->unreg_srcu);
+
+	/* At this point it is confirmed that any pending user initiated
+	 * create or remove callbacks accessing the parent are completed.
+	 * It is safe to remove the parent now.
+	 */
 	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
 
 	device_for_each_child(dev, NULL, mdev_device_remove_cb);
 
 	parent_remove_sysfs_files(parent);
 
-	mutex_unlock(&parent_list_lock);
 	mdev_put_parent(parent);
 }
 EXPORT_SYMBOL(mdev_unregister_device);
@@ -278,14 +270,24 @@  static void mdev_device_release(struct device *dev)
 int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 {
 	int ret;
+	struct mdev_parent *valid_parent;
 	struct mdev_device *mdev, *tmp;
 	struct mdev_parent *parent;
 	struct mdev_type *type = to_mdev_type(kobj);
+	int srcu_idx;
 
 	parent = mdev_get_parent(type->parent);
 	if (!parent)
 		return -EINVAL;
 
+	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
+	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
+	if (!valid_parent) {
+		/* parent is undergoing unregistration */
+		ret = -ENODEV;
+		goto mdev_fail;
+	}
+
 	mutex_lock(&mdev_list_lock);
 
 	/* Check for duplicate */
@@ -310,68 +312,76 @@  int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
 
 	mdev->parent = parent;
 
+	device_initialize(&mdev->dev);
 	mdev->dev.parent  = dev;
 	mdev->dev.bus     = &mdev_bus_type;
 	mdev->dev.release = mdev_device_release;
+	mdev->dev.groups = type->parent->ops->mdev_attr_groups;
 	dev_set_name(&mdev->dev, "%pUl", uuid.b);
 
-	ret = device_register(&mdev->dev);
+	ret = type->parent->ops->create(kobj, mdev);
 	if (ret)
-		goto mdev_fail;
+		goto create_fail;
 
-	ret = mdev_device_create_ops(kobj, mdev);
+	ret = device_add(&mdev->dev);
 	if (ret)
-		goto create_fail;
+		goto dev_fail;
 
 	ret = mdev_create_sysfs_files(&mdev->dev, type);
-	if (ret) {
-		mdev_device_remove_ops(mdev, true);
-		goto create_fail;
-	}
+	if (ret)
+		goto sysfs_fail;
 
 	mdev->type_kobj = kobj;
 	mdev->active = true;
 	dev_dbg(&mdev->dev, "MDEV: created\n");
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 
 	return 0;
 
+sysfs_fail:
+	device_del(&mdev->dev);
+dev_fail:
+	type->parent->ops->remove(mdev);
 create_fail:
-	device_unregister(&mdev->dev);
+	put_device(&mdev->dev);
 mdev_fail:
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 	mdev_put_parent(parent);
 	return ret;
 }
 
-int mdev_device_remove(struct device *dev, bool force_remove)
+int mdev_device_remove(struct device *dev)
 {
+	struct mdev_parent *valid_parent;
 	struct mdev_device *mdev;
 	struct mdev_parent *parent;
-	struct mdev_type *type;
+	int srcu_idx;
 	int ret;
 
 	mdev = to_mdev_device(dev);
+	parent = mdev->parent;
+	srcu_idx = srcu_read_lock(&parent->unreg_srcu);
+	valid_parent = srcu_dereference(parent->self, &parent->unreg_srcu);
+	if (!valid_parent) {
+		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
+		/* parent is undergoing unregistration */
+		return -ENODEV;
+	}
+
+	mutex_lock(&mdev_list_lock);
 	if (!mdev->active) {
 		mutex_unlock(&mdev_list_lock);
-		return -EAGAIN;
+		srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
+		return -ENODEV;
 	}
-
 	mdev->active = false;
 	mutex_unlock(&mdev_list_lock);
 
-	type = to_mdev_type(mdev->type_kobj);
-	parent = mdev->parent;
-
-	ret = mdev_device_remove_ops(mdev, force_remove);
-	if (ret) {
-		mdev->active = true;
-		return ret;
-	}
+	ret = mdev_device_must_remove(mdev);
+	srcu_read_unlock(&parent->unreg_srcu, srcu_idx);
 
-	mdev_remove_sysfs_files(dev, type);
-	device_unregister(dev);
 	mdev_put_parent(parent);
-
-	return 0;
+	return ret;
 }
 
 static int __init mdev_init(void)
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index 84b2b6c..3d17db9 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -23,6 +23,11 @@  struct mdev_parent {
 	struct list_head next;
 	struct kset *mdev_types_kset;
 	struct list_head type_list;
+	/* Protects unregistration to wait until create/remove
+	 * are completed.
+	 */
+	struct srcu_struct unreg_srcu;
+	struct mdev_parent __rcu *self;
 };
 
 struct mdev_device {
@@ -58,6 +63,6 @@  struct mdev_type {
 void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
 
 int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
-int  mdev_device_remove(struct device *dev, bool force_remove);
+int  mdev_device_remove(struct device *dev);
 
 #endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
index c782fa9..68a8191 100644
--- a/drivers/vfio/mdev/mdev_sysfs.c
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -236,11 +236,9 @@  static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
 	if (val && device_remove_file_self(dev, attr)) {
 		int ret;
 
-		ret = mdev_device_remove(dev, false);
-		if (ret) {
-			device_create_file(dev, attr);
+		ret = mdev_device_remove(dev);
+		if (ret)
 			return ret;
-		}
 	}
 
 	return count;