From patchwork Wed Nov 9 01:22:10 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Eric W. Biederman" X-Patchwork-Id: 9418343 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 1BE7E60585 for ; Wed, 9 Nov 2016 01:24:43 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0BEA42833F for ; Wed, 9 Nov 2016 01:24:43 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F145E28A33; Wed, 9 Nov 2016 01:24:42 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1F7772833F for ; Wed, 9 Nov 2016 01:24:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751695AbcKIBYl (ORCPT ); Tue, 8 Nov 2016 20:24:41 -0500 Received: from out02.mta.xmission.com ([166.70.13.232]:41800 "EHLO out02.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750927AbcKIBYk (ORCPT ); Tue, 8 Nov 2016 20:24:40 -0500 Received: from in02.mta.xmission.com ([166.70.13.52]) by out02.mta.xmission.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.87) (envelope-from ) id 1c4Hd0-0001pf-VT; Tue, 08 Nov 2016 18:24:39 -0700 Received: from 75-170-125-99.omah.qwest.net ([75.170.125.99] helo=x220.xmission.com) by in02.mta.xmission.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.87) (envelope-from ) id 1c4Hcv-0001IQ-0Z; Tue, 08 Nov 2016 18:24:38 -0700 From: ebiederm@xmission.com (Eric W. Biederman) To: Bart Van Assche Cc: James Bottomley , "Martin K. Petersen" , Greg Kroah-Hartman , Hannes Reinecke , Johannes Thumshirn , Sagi Grimberg , "linux-scsi\@vger.kernel.org" References: <7d35e3f1-6c58-26bc-297b-73993aa90f0b@sandisk.com> <1478618887.2824.2.camel@linux.vnet.ibm.com> <1478628101.2824.27.camel@linux.vnet.ibm.com> <87oa1pvl8f.fsf@xmission.com> <0129858f-382b-0903-7157-40e5ec8a9e6a@sandisk.com> Date: Tue, 08 Nov 2016 19:22:10 -0600 In-Reply-To: <0129858f-382b-0903-7157-40e5ec8a9e6a@sandisk.com> (Bart Van Assche's message of "Tue, 8 Nov 2016 15:33:33 -0800") Message-ID: <87shr1phvh.fsf@xmission.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1 (gnu/linux) MIME-Version: 1.0 X-XM-SPF: eid=1c4Hcv-0001IQ-0Z; ; ; mid=<87shr1phvh.fsf@xmission.com>; ; ; hst=in02.mta.xmission.com; ; ; ip=75.170.125.99; ; ; frm=ebiederm@xmission.com; ; ; spf=neutral X-XM-AID: U2FsdGVkX18rzNjkuzZHDnjLxgnpm/KKZHgawLwmR70= X-SA-Exim-Connect-IP: 75.170.125.99 X-SA-Exim-Mail-From: ebiederm@xmission.com Subject: Re: [PATCH] Avoid that SCSI device removal through sysfs triggers a deadlock X-SA-Exim-Version: 4.2.1 (built Thu, 05 May 2016 13:38:54 -0600) X-SA-Exim-Scanned: Yes (on in02.mta.xmission.com) Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Bart Van Assche writes: > On 11/08/2016 11:15 AM, Eric W. Biederman wrote: >> James Bottomley writes: >> >>> On Tue, 2016-11-08 at 08:52 -0800, Bart Van Assche wrote: >>>> On 11/08/2016 07:28 AM, James Bottomley wrote: >>>>> On Mon, 2016-11-07 at 16:32 -0800, Bart Van Assche wrote: >>>>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c >>>>>> index cf4c636..44ec536 100644 >>>>>> --- a/fs/kernfs/dir.c >>>>>> +++ b/fs/kernfs/dir.c >>>>>> @@ -1410,7 +1410,7 @@ int kernfs_remove_by_name_ns(struct >>>>>> kernfs_node >>>>>> *parent, const char *name, >>>>>> mutex_lock(&kernfs_mutex); >>>>>> >>>>>> kn = kernfs_find_ns(parent, name, ns); >>>>>> - if (kn) >>>>>> + if (kn && !(kn->flags & KERNFS_SUICIDED)) >>>>> >>>>> Actually, wrong flag, you need KERNFS_SUICIDAL. The reason is that >>>>> kernfs_mutex is actually dropped half way through __kernfs_remove, >>>>> so KERNFS_SUICIDED is not set atomically with this mutex. >>>> >>>> Hello James, >>>> >>>> Sorry but what you wrote is not correct. >>> >>> I think you agree it is dropped. I don't need to add the bit about the >>> reacquisition because the race is mediated by the first acquisition not >>> the second one, if you mediate on KERNFS_SUICIDAL, you only need to >>> worry about this because the mediation is in the first acquisition. If >>> you mediate on KERNFS_SUICIDED, you need to explain that the final >>> thing that means the race can't happen is the unbreak in the sysfs >>> delete path re-acquiring s_active ... the explanation of what's going >>> on and why gets about 2x more complex. >> >> Is it really the dropping of the lock that is causing this? >> I don't see that when I read those traces. >> >> I am going to put my vote in for doing all of the self removal >> sem-asynchronously by using task_work_add, and killing >> device_remove_self, sysfs_remove_self, kernfs_remove_self. Using >> task_work_add remains synchronous with userspace so userspace should not >> care, and we get the benefit of not having two different variants of the >> same code racing with each other. >> >> It might take a little more work but will leave code that is much more >> maintainable in the long run. > > Hello Eric, > > I'm completely in favor of keeping code maintainable. But what's not clear to me > is whether asynchronous I/O can be submitted to a sysfs attribute? If so, on > what context will task work queued through task_work_add() be executed if an aio > write is used to write into a sysfs attribute? I don't believe so. I believe the filesystem has to opt into aio. In either case I don't believe we have to worry about aio because either it won't be supported or it will be a synchronous write to sysfs. > Additionally, can a process enter the exiting state after writing into the sysfs > delete attribute started and before task_work_add() has been called? I'm asking > this because in that case task_work_add() will fail. Not before task_work_add has been called because task_work_add is what will be called synchronously from the write function of the delete attribute. Furthermore task_work_run is guaranteed to be called before you return to userspace. It is remotely possible that someone sends the process a SIGKILL while the delete attribute is being written. In that case task_work_run should run before the task gets too deep into the exiting state. But the SIGKILL will not be processed until the write syscall finishes and the kernel returns to the edge of userspace. So I can't imagine anything that would make task_work_add be a poor choice in these circumstances. And basically that fix for just the scsi bits I think is as simple as: drivers/scsi/scsi_sysfs.c | 20 +++++++++++++++++--- include/scsi/scsi_device.h | 1 + 2 files changed, 18 insertions(+), 3 deletions(-) Eric --- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index 07349270535d..fee88cde7c12 100644 --- a/drivers/scsi/scsi_sysfs.c +++ b/drivers/scsi/scsi_sysfs.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -705,13 +706,26 @@ store_rescan_field (struct device *dev, struct device_attribute *attr, } static DEVICE_ATTR(rescan, S_IWUSR, NULL, store_rescan_field); +static void sdev_remove_self(struct callback_head *cb) +{ + struct scsi_device *sdev = + container_of(cb, struct scsi_device, delete_work); + + scsi_remove_device(sdev); +} + static ssize_t sdev_store_delete(struct device *dev, struct device_attribute *attr, const char *buf, size_t count) { - if (device_remove_file_self(dev, attr)) - scsi_remove_device(to_scsi_device(dev)); - return count; + struct scsi_device *sdev = to_scsi_device(dev); + ssize_t err; + + init_task_work(&sdev->delete_work, sdev_remove_self); + err = task_work_add(current, &sdev->delete_work, false); + if (!err) + err = count; + return err; }; static DEVICE_ATTR(delete, S_IWUSR, NULL, sdev_store_delete); diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h index 8a9563144890..e82a151c8543 100644 --- a/include/scsi/scsi_device.h +++ b/include/scsi/scsi_device.h @@ -206,6 +206,7 @@ struct scsi_device { struct scsi_device_handler *handler; void *handler_data; + struct callback_head delete_work; unsigned char access_state; enum scsi_device_state sdev_state; unsigned long sdev_data[0];