Message ID | 20190904023515.7107-13-andrew.smirnov@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Herbert Xu |
Headers | show |
Series | CAAM bugfixes, small improvements | expand |
On 04.09.2019 05:35, Andrey Smirnov wrote: > Returning -EBUSY from platform device's .remove() callback won't stop > the removal process, so the code in caam_jr_remove() is not going to > have the desired effect of preventing JR from being removed. > > In order to be able to deal with removal of the JR device, change the > code as follows: > > 1. To make sure that underlying struct device remains valid for as > long as we have a reference to it, add appropriate device > refcount management to caam_jr_alloc() and caam_jr_free() > > 2. To make sure that device removal doesn't happen in parallel to > use using the device in caam_jr_enqueue() augment the latter to > acquire/release device lock for the duration of the subroutine > > 3. In order to handle the case when caam_jr_enqueue() is executed > right after corresponding caam_jr_remove(), add code to check > that driver data has not been set to NULL. What happens if a driver is unbound while a transform is executing asynchronously? Doing get_device and put_device in caam_jr_alloc and caam_jr_free doesn't protect you against an explicit unbind of caam_jr, that path doesn't care about device reference counts. For example on imx6qp: # echo 2102000.jr1 > /sys/bus/platform/drivers/caam_jr/unbind CPU: 2 PID: 561 Comm: bash Not tainted 5.3.0-rc7-next-20190904 Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) Backtrace: [<c010f8a4>] (dump_backtrace) from [<c010fc5c>] (show_stack+0x20/0x24) [<c010fc3c>] (show_stack) from [<c0d21600>] (dump_stack+0xdc/0x114) [<c0d21524>] (dump_stack) from [<c0955774>] (caam_jr_remove+0x24/0xf8) [<c0955750>] (caam_jr_remove) from [<c06e2d98>] (platform_drv_remove+0x34/0x4c) [<c06e2d64>] (platform_drv_remove) from [<c06e0b74>] (device_release_driver_internal+0xf8/0x1b0) [<c06e0a7c>] (device_release_driver_internal) from [<c06e0c70>] (device_driver_detach+0x20/0x24) [<c06e0c50>] (device_driver_detach) from [<c06de5a4>] (unbind_store+0x6c/0xe0) [<c06de538>] (unbind_store) from [<c06dd950>] (drv_attr_store+0x30/0x3c) [<c06dd920>] (drv_attr_store) from [<c0364c18>] (sysfs_kf_write+0x50/0x5c) [<c0364bc8>] (sysfs_kf_write) from [<c0363e0c>] (kernfs_fop_write+0x10c/0x1f8) [<c0363d00>] (kernfs_fop_write) from [<c02c3a30>] (__vfs_write+0x44/0x1d0) [<c02c39ec>] (__vfs_write) from [<c02c68ec>] (vfs_write+0xb0/0x194) [<c02c683c>] (vfs_write) from [<c02c6b7c>] (ksys_write+0x64/0xd8) [<c02c6b18>] (ksys_write) from [<c02c6c08>] (sys_write+0x18/0x1c) [<c02c6bf0>] (sys_write) from [<c0101000>] (ret_fast_syscall+0x0/0x28) I think you need to do some form of slow wait loop in jrpriv until jrpriv->tfm_count reaches zero. Preventing new operations from starting would help but isn't strictly necessary for correctness. > diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c > index 47b389cb1c62..8a30bbd7f2aa 100644 > --- a/drivers/crypto/caam/jr.c > +++ b/drivers/crypto/caam/jr.c > @@ -124,14 +124,6 @@ static int caam_jr_remove(struct platform_device *pdev) > jrdev = &pdev->dev; > jrpriv = dev_get_drvdata(jrdev); > > - /* > - * Return EBUSY if job ring already allocated. > - */ > - if (atomic_read(&jrpriv->tfm_count)) { > - dev_err(jrdev, "Device is busy\n"); > - return -EBUSY; > - } > - > /* Unregister JR-based RNG & crypto algorithms */ > unregister_algs(); > > @@ -300,7 +292,7 @@ struct device *caam_jr_alloc(void) > > if (min_jrpriv) { > atomic_inc(&min_jrpriv->tfm_count); > - dev = min_jrpriv->dev; > + dev = get_device(min_jrpriv->dev); > } > spin_unlock(&driver_data.jr_alloc_lock); > > @@ -318,13 +310,16 @@ void caam_jr_free(struct device *rdev) > struct caam_drv_private_jr *jrpriv = dev_get_drvdata(rdev); > > atomic_dec(&jrpriv->tfm_count); > + put_device(rdev); > } > EXPORT_SYMBOL(caam_jr_free); > > /** > * caam_jr_enqueue() - Enqueue a job descriptor head. Returns 0 if OK, > * -EBUSY if the queue is full, -EIO if it cannot map the caller's > - * descriptor. > + * descriptor, -ENODEV if given device was removed and is no longer > + * valid > + * > * @dev: device of the job ring to be used. This device should have > * been assigned prior by caam_jr_register(). > * @desc: points to a job descriptor that execute our request. All > @@ -354,15 +349,32 @@ int caam_jr_enqueue(struct device *dev, u32 *desc, > u32 status, void *areq), > void *areq) > { > - struct caam_drv_private_jr *jrp = dev_get_drvdata(dev); > + struct caam_drv_private_jr *jrp; > struct caam_jrentry_info *head_entry; > int head, tail, desc_size; > dma_addr_t desc_dma; > > + /* > + * Lock the device to prevent it from being removed while we > + * are using it > + */ > + device_lock(dev); > + > + /* > + * If driver data is NULL, it is very likely that this device > + * was removed already. Nothing we can do here but bail out. > + */ > + jrp = dev_get_drvdata(dev); > + if (!jrp) { > + device_unlock(dev); > + return -ENODEV; > + } > + > desc_size = (caam32_to_cpu(*desc) & HDR_JD_LENGTH_MASK) * sizeof(u32); > desc_dma = dma_map_single(dev, desc, desc_size, DMA_TO_DEVICE); > if (dma_mapping_error(dev, desc_dma)) { > dev_err(dev, "caam_jr_enqueue(): can't map jobdesc\n"); > + device_unlock(dev); > return -EIO; > } > > @@ -375,6 +387,7 @@ int caam_jr_enqueue(struct device *dev, u32 *desc, > CIRC_SPACE(head, tail, JOBR_DEPTH) <= 0) { > spin_unlock_bh(&jrp->inplock); > dma_unmap_single(dev, desc_dma, desc_size, DMA_TO_DEVICE); > + device_unlock(dev); > return -EBUSY; > } > > @@ -411,6 +424,7 @@ int caam_jr_enqueue(struct device *dev, u32 *desc, > jrp->inpring_avail = rd_reg32(&jrp->rregs->inpring_avail); > > spin_unlock_bh(&jrp->inplock); > + device_unlock(dev); > > return 0; > }
On Fri, Sep 13, 2019 at 12:16 PM Leonard Crestez <leonard.crestez@nxp.com> wrote: > > On 04.09.2019 05:35, Andrey Smirnov wrote: > > Returning -EBUSY from platform device's .remove() callback won't stop > > the removal process, so the code in caam_jr_remove() is not going to > > have the desired effect of preventing JR from being removed. > > > > In order to be able to deal with removal of the JR device, change the > > code as follows: > > > > 1. To make sure that underlying struct device remains valid for as > > long as we have a reference to it, add appropriate device > > refcount management to caam_jr_alloc() and caam_jr_free() > > > > 2. To make sure that device removal doesn't happen in parallel to > > use using the device in caam_jr_enqueue() augment the latter to > > acquire/release device lock for the duration of the subroutine > > > > 3. In order to handle the case when caam_jr_enqueue() is executed > > right after corresponding caam_jr_remove(), add code to check > > that driver data has not been set to NULL. > > What happens if a driver is unbound while a transform is executing > asynchronously? > AFAICT caam_jr_shutdown() is going to disable JR's interrupt and kill corresponding tasklet, so I think that transform will never finish. We probably could handle that better by walking the list of unfinished jobs and calling their callbacks with an appropriate error code. > Doing get_device and put_device in caam_jr_alloc and caam_jr_free > doesn't protect you against an explicit unbind of caam_jr, that path > doesn't care about device reference counts. For example on imx6qp: > My thinking with get_device() and put_device() was to prevent struct device * from being freed while we are using it. The intent with explicit unbind was to protect against it by device_lock()/device_unlock() as a well as check that device data is not NULL. I think I did miss code to do that in caam_jr_free() before accessing tfm_count. > # echo 2102000.jr1 > /sys/bus/platform/drivers/caam_jr/unbind > A bit of a silly question, but is this with my patches applied or against the vanilla driver? > CPU: 2 PID: 561 Comm: bash Not tainted 5.3.0-rc7-next-20190904 > Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree) > Backtrace: > [<c010f8a4>] (dump_backtrace) from [<c010fc5c>] (show_stack+0x20/0x24) > [<c010fc3c>] (show_stack) from [<c0d21600>] (dump_stack+0xdc/0x114) > [<c0d21524>] (dump_stack) from [<c0955774>] (caam_jr_remove+0x24/0xf8) > [<c0955750>] (caam_jr_remove) from [<c06e2d98>] > (platform_drv_remove+0x34/0x4c) > [<c06e2d64>] (platform_drv_remove) from [<c06e0b74>] > (device_release_driver_internal+0xf8/0x1b0) > [<c06e0a7c>] (device_release_driver_internal) from [<c06e0c70>] > (device_driver_detach+0x20/0x24) > [<c06e0c50>] (device_driver_detach) from [<c06de5a4>] > (unbind_store+0x6c/0xe0) > [<c06de538>] (unbind_store) from [<c06dd950>] (drv_attr_store+0x30/0x3c) > [<c06dd920>] (drv_attr_store) from [<c0364c18>] (sysfs_kf_write+0x50/0x5c) > [<c0364bc8>] (sysfs_kf_write) from [<c0363e0c>] > (kernfs_fop_write+0x10c/0x1f8) > [<c0363d00>] (kernfs_fop_write) from [<c02c3a30>] (__vfs_write+0x44/0x1d0) > [<c02c39ec>] (__vfs_write) from [<c02c68ec>] (vfs_write+0xb0/0x194) > [<c02c683c>] (vfs_write) from [<c02c6b7c>] (ksys_write+0x64/0xd8) > [<c02c6b18>] (ksys_write) from [<c02c6c08>] (sys_write+0x18/0x1c) > [<c02c6bf0>] (sys_write) from [<c0101000>] (ret_fast_syscall+0x0/0x28) > > I think you need to do some form of slow wait loop in jrpriv until > jrpriv->tfm_count reaches zero. Hmm, what do we do if it never does? Why do you think it would be better than cancelling all outstanding jobs and resetting the HW? > Preventing new operations from starting > would help but isn't strictly necessary for correctness. Wouldn't it be strictly necessary if you want to wait for tfm_count reaching zero? Thanks, Andrey Smirnov
On 9/18/2019 6:13 AM, Andrey Smirnov wrote: >> I think you need to do some form of slow wait loop in jrpriv until >> jrpriv->tfm_count reaches zero. > Hmm, what do we do if it never does? Why do you think it would be > better than cancelling all outstanding jobs and resetting the HW? > Herbert, What should a driver do when: -user tries to unbind it AND -there are tfms referencing algorithms registered by this driver 1. If driver tries to unregister the algorithms during its .remove() callback, then this BUG_ON is hit: int crypto_unregister_alg(struct crypto_alg *alg) { [...] BUG_ON(refcount_read(&alg->cra_refcnt) != 1); 2. If driver exits without unregistering the algorithms, next time one of the tfms referencing those algorithms will be used bad things will happen. 3. There is no mechanism in crypto core for notifying users to stop using a tfm. Thanks, Horia
On Thu, Sep 19, 2019 at 11:19:22AM +0000, Horia Geanta wrote: > > What should a driver do when: > -user tries to unbind it AND > -there are tfms referencing algorithms registered by this driver > > 1. If driver tries to unregister the algorithms during its .remove() > callback, then this BUG_ON is hit: > > int crypto_unregister_alg(struct crypto_alg *alg) > { > [...] > BUG_ON(refcount_read(&alg->cra_refcnt) != 1); You must not unregister the algorithm until it's no longer in use. Normally we enforce this using module reference counts. For hardware devices that need to be unregistered without unloading the module at the same time, you will need your own reference counting to determine when unregistration can be carried out. But it must be carefully done so that it is not racy. > 2. If driver exits without unregistering the algorithms, > next time one of the tfms referencing those algorithms will be used > bad things will happen. Well remember hardware devices can always go away (e.g., dead or unplugged) so the driver must do something sensible when that happens. If this happened on a live tfm then further operations should ideally fail and any outstanding requests should also fail in a timely manner. > 3. There is no mechanism in crypto core for notifying users > to stop using a tfm. For software implementations we use the module reference count as the mechanism to signal that an algorithm is going away. In particular try_module_get (and consequently crypto_mod_get) will fail when the module is being unregistered. We should extend this to cover hardware devices. Perhaps we can introduce a callback in crypto_alg that crypto_mod_get can invoke which would then fail if the device is going away (or has gone away). Cheers,
diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c index 47b389cb1c62..8a30bbd7f2aa 100644 --- a/drivers/crypto/caam/jr.c +++ b/drivers/crypto/caam/jr.c @@ -124,14 +124,6 @@ static int caam_jr_remove(struct platform_device *pdev) jrdev = &pdev->dev; jrpriv = dev_get_drvdata(jrdev); - /* - * Return EBUSY if job ring already allocated. - */ - if (atomic_read(&jrpriv->tfm_count)) { - dev_err(jrdev, "Device is busy\n"); - return -EBUSY; - } - /* Unregister JR-based RNG & crypto algorithms */ unregister_algs(); @@ -300,7 +292,7 @@ struct device *caam_jr_alloc(void) if (min_jrpriv) { atomic_inc(&min_jrpriv->tfm_count); - dev = min_jrpriv->dev; + dev = get_device(min_jrpriv->dev); } spin_unlock(&driver_data.jr_alloc_lock); @@ -318,13 +310,16 @@ void caam_jr_free(struct device *rdev) struct caam_drv_private_jr *jrpriv = dev_get_drvdata(rdev); atomic_dec(&jrpriv->tfm_count); + put_device(rdev); } EXPORT_SYMBOL(caam_jr_free); /** * caam_jr_enqueue() - Enqueue a job descriptor head. Returns 0 if OK, * -EBUSY if the queue is full, -EIO if it cannot map the caller's - * descriptor. + * descriptor, -ENODEV if given device was removed and is no longer + * valid + * * @dev: device of the job ring to be used. This device should have * been assigned prior by caam_jr_register(). * @desc: points to a job descriptor that execute our request. All @@ -354,15 +349,32 @@ int caam_jr_enqueue(struct device *dev, u32 *desc, u32 status, void *areq), void *areq) { - struct caam_drv_private_jr *jrp = dev_get_drvdata(dev); + struct caam_drv_private_jr *jrp; struct caam_jrentry_info *head_entry; int head, tail, desc_size; dma_addr_t desc_dma; + /* + * Lock the device to prevent it from being removed while we + * are using it + */ + device_lock(dev); + + /* + * If driver data is NULL, it is very likely that this device + * was removed already. Nothing we can do here but bail out. + */ + jrp = dev_get_drvdata(dev); + if (!jrp) { + device_unlock(dev); + return -ENODEV; + } + desc_size = (caam32_to_cpu(*desc) & HDR_JD_LENGTH_MASK) * sizeof(u32); desc_dma = dma_map_single(dev, desc, desc_size, DMA_TO_DEVICE); if (dma_mapping_error(dev, desc_dma)) { dev_err(dev, "caam_jr_enqueue(): can't map jobdesc\n"); + device_unlock(dev); return -EIO; } @@ -375,6 +387,7 @@ int caam_jr_enqueue(struct device *dev, u32 *desc, CIRC_SPACE(head, tail, JOBR_DEPTH) <= 0) { spin_unlock_bh(&jrp->inplock); dma_unmap_single(dev, desc_dma, desc_size, DMA_TO_DEVICE); + device_unlock(dev); return -EBUSY; } @@ -411,6 +424,7 @@ int caam_jr_enqueue(struct device *dev, u32 *desc, jrp->inpring_avail = rd_reg32(&jrp->rregs->inpring_avail); spin_unlock_bh(&jrp->inplock); + device_unlock(dev); return 0; }
Returning -EBUSY from platform device's .remove() callback won't stop the removal process, so the code in caam_jr_remove() is not going to have the desired effect of preventing JR from being removed. In order to be able to deal with removal of the JR device, change the code as follows: 1. To make sure that underlying struct device remains valid for as long as we have a reference to it, add appropriate device refcount management to caam_jr_alloc() and caam_jr_free() 2. To make sure that device removal doesn't happen in parallel to use using the device in caam_jr_enqueue() augment the latter to acquire/release device lock for the duration of the subroutine 3. In order to handle the case when caam_jr_enqueue() is executed right after corresponding caam_jr_remove(), add code to check that driver data has not been set to NULL. Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com> Cc: Chris Healy <cphealy@gmail.com> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Horia Geantă <horia.geanta@nxp.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Iuliana Prodan <iuliana.prodan@nxp.com> Cc: linux-crypto@vger.kernel.org Cc: linux-kernel@vger.kernel.org --- drivers/crypto/caam/jr.c | 36 +++++++++++++++++++++++++----------- 1 file changed, 25 insertions(+), 11 deletions(-)