Message ID | 20200812132034.14363-1-oneukum@suse.com (mailing list archive) |
---|---|
Headers | show |
Series | fix races in CDC-WDM | expand |
On 2020/08/12 22:20, Oliver Neukum wrote: > CDC-WDM was not written with multithreaded users or > uncooperative devices in mind. > This leads to race conditions and hangs in flush(). > In patch "[RFC 2/5] CDC-WDM: introduce a timeout in flush()", since multiple users can share "desc", wouldn't someone's usb_submit_urb() from wdm_write() gets unexpectedly interfered by someone else's usb_kill_urb(desc->command) from wdm_open() ? In patch "[RFC 2/5] CDC-WDM: introduce a timeout in flush()", don't we need to similarly apply timeout to wait_event_interruptible() in wdm_write(), for the same problem will happen if hardware remained silent forever? In patch "[RFC 3/5] CDC-WDM: making flush() interruptible", it is legal to return -EINTR from close(). But I think that returning -EINTR from close() is not recommended because it can confuse multithreaded application (retrying close() upon -EINTR is not safe). In patch "[RFC 5/5] CDC-WDM: remove use of intf->dev after potential disconnect", /* cannot dereference desc->intf if WDM_DISCONNECTING */ if (test_bit(WDM_DISCONNECTING, &desc->flags)) return -ENODEV; can be also removed, for this check is meant for not to dereference desc->intf after disconnect ? Guessing from symmetry, do we need to check WDM_DISCONNECTING or WDM_RESETTING in wait_event_interruptible_timeout() from wdm_flush() when wait_event_interruptible() in wdm_write() is not checking WDM_DISCONNECTING nor WDM_RESETTING ? Does it make sense to wait for response of someone else's usb_submit_urb() when someone is calling close(), for there is no guarantee that failure notice received via wdm_flush() via some file descriptor corresponds to usb_submit_urb() request from wdm_write() call from that file descriptor?
Am Mittwoch, den 12.08.2020, 23:29 +0900 schrieb Tetsuo Handa: > On 2020/08/12 22:20, Oliver Neukum wrote: > > CDC-WDM was not written with multithreaded users or > > uncooperative devices in mind. > > This leads to race conditions and hangs in flush(). Hi, thank you for waiting. Family emergency. > In patch "[RFC 2/5] CDC-WDM: introduce a timeout in flush()", since multiple users can > share "desc", wouldn't someone's usb_submit_urb() from wdm_write() gets unexpectedly > interfered by someone else's usb_kill_urb(desc->command) from wdm_open() ? Yes. Fixed. > In patch "[RFC 2/5] CDC-WDM: introduce a timeout in flush()", don't we need to similarly > apply timeout to wait_event_interruptible() in wdm_write(), for the same problem will > happen if hardware remained silent forever? Technically a device may take forever. Interrupting a wait in write() is not problematic. > In patch "[RFC 3/5] CDC-WDM: making flush() interruptible", it is legal to return -EINTR > from close(). But I think that returning -EINTR from close() is not recommended because > it can confuse multithreaded application (retrying close() upon -EINTR is not safe). Well, but what is the alternative? Should we ignore signals? > In patch "[RFC 5/5] CDC-WDM: remove use of intf->dev after potential disconnect", > > /* cannot dereference desc->intf if WDM_DISCONNECTING */ > if (test_bit(WDM_DISCONNECTING, &desc->flags)) > return -ENODEV; > > can be also removed, for this check is meant for not to dereference desc->intf > after disconnect ? It can be removed, but that would make error reporting worse. > Guessing from symmetry, do we need to check WDM_DISCONNECTING or WDM_RESETTING > in wait_event_interruptible_timeout() from wdm_flush() when wait_event_interruptible() > in wdm_write() is not checking WDM_DISCONNECTING nor WDM_RESETTING ? Added > Does it make sense to wait for response of someone else's usb_submit_urb() when > someone is calling close(), for there is no guarantee that failure notice received > via wdm_flush() via some file descriptor corresponds to usb_submit_urb() request from > wdm_write() call from that file descriptor? Well, user space may do multithreading. Whether it makes sense is another question. We just need to return results confirming to the standards. You noticed bugs. I think the next version will fix them. What else would we do? Regards Oliver
On 2020/09/10 18:09, Oliver Neukum wrote: >> Does it make sense to wait for response of someone else's usb_submit_urb() when >> someone is calling close(), for there is no guarantee that failure notice received >> via wdm_flush() via some file descriptor corresponds to usb_submit_urb() request from >> wdm_write() call from that file descriptor? > > Well, user space may do multithreading. Whether it makes sense is > another question. We just need to return results confirming to the > standards. You noticed bugs. I think the next version will fix them. My question is how do you handle if App1 and App2 (not multithreading but multiprocessing) shares the "desc" ? Unless >> In patch "[RFC 3/5] CDC-WDM: making flush() interruptible", it is legal to return -EINTR >> from close(). But I think that returning -EINTR from close() is not recommended because >> it can confuse multithreaded application (retrying close() upon -EINTR is not safe). > > Well, but what is the alternative? Should we ignore signals? > we return the error from write() request (i.e. give up trying to report errors from close() event), we can't pass results to the intended recipients.
Am Donnerstag, den 10.09.2020, 19:01 +0900 schrieb Tetsuo Handa: > On 2020/09/10 18:09, Oliver Neukum wrote: > > > Does it make sense to wait for response of someone else's usb_submit_urb() when > > > someone is calling close(), for there is no guarantee that failure notice received > > > via wdm_flush() via some file descriptor corresponds to usb_submit_urb() request from > > > wdm_write() call from that file descriptor? > > > > Well, user space may do multithreading. Whether it makes sense is > > another question. We just need to return results confirming to the > > standards. You noticed bugs. I think the next version will fix them. > > My question is how do you handle if App1 and App2 (not multithreading but > multiprocessing) shares the "desc" ? Unless Well, device locking is a function of user space. This is Unix. If you have two uncoordinated writers to a device, you cannot expect sensible results. You can expect that the driver does not hang, of course. > > > In patch "[RFC 3/5] CDC-WDM: making flush() interruptible", it is legal to return -EINTR > > > from close(). But I think that returning -EINTR from close() is not recommended because > > > it can confuse multithreaded application (retrying close() upon -EINTR is not safe). > > > > Well, but what is the alternative? Should we ignore signals? > > > > we return the error from write() request (i.e. give up trying to report errors from > close() event), we can't pass results to the intended recipients. That means * harming the single threaded for the sake of the few multithreaded * it would not work for O_NONBLOCK * if you use a device from multiple threads or tasks, locking is your problem Is there something we can do in flush()? Regards Oliver
On 2020/09/15 18:14, Oliver Neukum wrote: >>>> In patch "[RFC 3/5] CDC-WDM: making flush() interruptible", it is legal to return -EINTR >>>> from close(). But I think that returning -EINTR from close() is not recommended because >>>> it can confuse multithreaded application (retrying close() upon -EINTR is not safe). >>> >>> Well, but what is the alternative? Should we ignore signals? >>> >> >> we return the error from write() request (i.e. give up trying to report errors from >> close() event), we can't pass results to the intended recipients. > > That means > > * harming the single threaded for the sake of the few multithreaded What is wrong with the single threaded user? As I describe below, there is no guarantee that wdm_write() can complete immediately (even if O_NONBLOCK flag is set). > * it would not work for O_NONBLOCK It does work for O_NONBLOCK. Please see my proposal at https://lkml.kernel.org/r/b347b882-a986-24c6-2b37-0b1a092931b9@i-love.sakura.ne.jp . Since my proposal calls mutex_unlock() before start waiting for response, my proposal does not block mutex_lock_interruptible() from O_NONBLOCK write(). O_NONBLOCK cannot guarantee that "we do not wait for memory allocation request by memdup_user()/copy_from_user()/usb_submit_urb(GFP_KERNEL)". It is possible that O_NONBLOCK write() is blocked for many _minutes_ at (at least) these three locations. O_NONBLOCK only guarantees that "it does not wait when we can't start immediately". > * if you use a device from multiple threads or tasks, locking is your > problem You mean "a device" as "struct wdm_device *desc" ? Then, it does not matter because the mutex, buffer, status etc. are per "struct wdm_device *desc" objects. Nobody will be disturbed by somebody else using different "struct wdm_device *desc". > > Is there something we can do in flush()? I consider that wdm_flush() is a wrong place to return an error. It is possible that a userspace process reaches wdm_flush() due to being killed by SIGKILL (rather than via calling close() syscall). Then, that userspace process will never receive the error fetched from wdm_flush(). Also, if that userspace process is killed by the OOM killer, being able to terminate and release resources as soon as possible is more preferable than try to wait for response.
Am Dienstag, den 15.09.2020, 19:30 +0900 schrieb Tetsuo Handa: > On 2020/09/15 18:14, Oliver Neukum wrote > > Is there something we can do in flush()? > > I consider that wdm_flush() is a wrong place to return an error. It is possible that I am afraid that is a basic problem we need to resolve. As I understand it, flush() as a method exists precisely to report errors. Otherwise you could implement it in release(). But this is not called for every close(). Hence a driver is supposed to start IO upon write() and report the result to the next call, which can be either write() or close(), the latter corresponding to flush(). > a userspace process reaches wdm_flush() due to being killed by SIGKILL (rather than > via calling close() syscall). Then, that userspace process will never receive the error If you are killed by a signal you are in a race condition anyway. It cannot be handled. > fetched from wdm_flush(). Also, if that userspace process is killed by the OOM killer, > being able to terminate and release resources as soon as possible is more preferable > than try to wait for response. Right, so should we just proceed in case of a dieing task? How do we do that? Regards Oliver
On 2020/09/16 19:18, Oliver Neukum wrote: > Am Dienstag, den 15.09.2020, 19:30 +0900 schrieb Tetsuo Handa: >> On 2020/09/15 18:14, Oliver Neukum wrote > >>> Is there something we can do in flush()? >> >> I consider that wdm_flush() is a wrong place to return an error. It is possible that > > I am afraid that is a basic problem we need to resolve. As I understand > it, flush() as a method exists precisely to report errors. Otherwise > you could implement it in release(). But this is not called for every > close(). I think fsync() or ioctl() is a better method for reporting errors. But anyway I consider that any error from N'th write() request should be returned by N'th write() request. > > Hence a driver is supposed to start IO upon write() and report the > result to the next call, which can be either write() or close(), the > latter corresponding to flush(). Whether N'th write() request succeeded remains unknown until (N+1)'th write() request or close() request is issued? That sounds a strange design. If there is nothing more to write(), how userspace process knows whether N'th write() request succeeded? Wait for writability using poll() ? How do we guarantee that N'th write() request already set desc->werr before (N+1)'th next write() request is issued? If (N+1)'th write() request reached memdup_user() before desc->werr is set by callback of N'th write() request, (N+1)'th write() request will fail to report the error from N'th write() request. Yes, that error would be reported by (N+2)'th write() request, but the userspace process might have already discarded data needed for taking some actions (e.g. print error messages, retry the write() request with same argument). > >> a userspace process reaches wdm_flush() due to being killed by SIGKILL (rather than >> via calling close() syscall). Then, that userspace process will never receive the error > > If you are killed by a signal you are in a race condition > anyway. It cannot be handled. What is the purpose of sending the error to the userspace process via write() or close()? Isn't the purpose to allow userspace process to do something (e.g. print error messages, retry the write() request with same argument) ? If close() returned an error, it might be too late to retry the write() request with same argument. > >> fetched from wdm_flush(). Also, if that userspace process is killed by the OOM killer, >> being able to terminate and release resources as soon as possible is more preferable >> than try to wait for response. > > Right, so should we just proceed in case of a dieing task? How do we > do that? We need to avoid waiting at wdm_flush() if killed. And I think that wdm_flush() is a strange interface for reporting the error.
Am Mittwoch, den 16.09.2020, 20:14 +0900 schrieb Tetsuo Handa: > On 2020/09/16 19:18, Oliver Neukum wrote: > > Am Dienstag, den 15.09.2020, 19:30 +0900 schrieb Tetsuo Handa: > > > On 2020/09/15 18:14, Oliver Neukum wrote > > > > Is there something we can do in flush()? > > > > > > I consider that wdm_flush() is a wrong place to return an error. It is possible that > > > > I am afraid that is a basic problem we need to resolve. As I understand > > it, flush() as a method exists precisely to report errors. Otherwise > > you could implement it in release(). But this is not called for every > > close(). > > I think fsync() or ioctl() is a better method for reporting errors. Very well, I am implementing fsync(). However, I must say that the very existance of fsync() tells us that write() is not expected to push data out to devices and hence report results. > Whether N'th write() request succeeded remains unknown until (N+1)'th > write() request or close() request is issued? That sounds a strange design. Welcome to the world of Unix. This is necessary if you want to block as rarely as possible. > If there is nothing more to write(), how userspace process knows whether > N'th write() request succeeded? Wait for writability using poll() ? Technically fsync(). poll() will tell you that you can write again. That is a related but different concept. > What is the purpose of sending the error to the userspace process via write() or close()? Yes. However to do so, user space must be running. That the death of a process interferes with error handling is independent from that. > And I think that wdm_flush() is a strange interface for reporting the error. Well, POSIX is what it is.The close() syscall is supposed to return errors. Hence flush() must report them. Regards Oliver
On 2020/09/17 18:50, Oliver Neukum wrote: > Am Mittwoch, den 16.09.2020, 20:14 +0900 schrieb Tetsuo Handa: >> On 2020/09/16 19:18, Oliver Neukum wrote: >>> Am Dienstag, den 15.09.2020, 19:30 +0900 schrieb Tetsuo Handa: >>>> On 2020/09/15 18:14, Oliver Neukum wrote >>>>> Is there something we can do in flush()? >>>> >>>> I consider that wdm_flush() is a wrong place to return an error. It is possible that >>> >>> I am afraid that is a basic problem we need to resolve. As I understand >>> it, flush() as a method exists precisely to report errors. Otherwise >>> you could implement it in release(). But this is not called for every >>> close(). >> >> I think fsync() or ioctl() is a better method for reporting errors. > > Very well, I am implementing fsync(). However, I must say that the very > existance of fsync() tells us that write() is not expected to push > data out to devices and hence report results. If you ask userspace programs to be updated to call fsync(), we can ask userspace programs be updated to call ioctl(). I expected you to implement wdm_ioctl() for fetching last error code. Then, > >> Whether N'th write() request succeeded remains unknown until (N+1)'th >> write() request or close() request is issued? That sounds a strange design. > > Welcome to the world of Unix. This is necessary if you want to block > as rarely as possible. we can apply my proposal (wait for response at wdm_write() by default) as a baseline for not to break existing userspace programs (except latency), followed by a patch which allows userspace programs to use that wdm_ioctl() in order not to wait for response at wdm_write(), which is enabled by calling wdm_ioctl() (in order to recover latency caused by waiting for response at wdm_write()). >> What is the purpose of sending the error to the userspace process via write() or close()? > > Yes. However to do so, user space must be running. That the death > of a process interferes with error handling is independent from that. Why need to send the error to the userspace process when that process was killed? My question Isn't the purpose to allow userspace process to do something (e.g. print error messages, retry the write() request with same argument) ? If close() returned an error, it might be too late to retry the write() request with same argument. which is unrelated to the userspace process being killed. My suggestion is that we can send the error from wdm_write() (for synchronous mode) or wdm_ioctl() (for asynchronous mode). > >> And I think that wdm_flush() is a strange interface for reporting the error. > > Well, POSIX is what it is.The close() syscall is supposed to return > errors. Hence flush() must report them. If we check the error at wdm_write() or wdm_ioctl(), there is no error to report at wdm_flush(). Therefore, we can remove wdm_flush() completely. I can't read this series without squashing into single patch. Making changes which will be after all removed in [RFC 5/7] is sad. Please do [RFC 5/7] before [RFC 4/7]. Then, you won't need to make unneeded modifications. I'd like to see one cleanup patch, one possible unsafe dereference fix patch, and one deadlock avoidance patch. You did not answer to How do we guarantee that N'th write() request already set desc->werr before (N+1)'th next write() request is issued? If (N+1)'th write() request reached memdup_user() before desc->werr is set by callback of N'th write() request, (N+1)'th write() request will fail to report the error from N'th write() request. Yes, that error would be reported by (N+2)'th write() request, but the userspace process might have already discarded data needed for taking some actions (e.g. print error messages, retry the write() request with same argument). . At least I think that spin_lock_irq(&desc->iuspin); we = desc->werr; desc->werr = 0; spin_unlock_irq(&desc->iuspin); if (we < 0) return usb_translate_errors(we); in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags). In [RFC 2/7], I think that /* in case flush() had timed out */ usb_kill_urb(desc->command); which is called only when desc->count == 0 in wdm_open() is pointless, for since desc->count is incremented/decremented with wdm_mutex held, kill_urbs(desc) which is called when desc->count == 0 in wdm_release() must have already called usb_kill_urb(desc->command) before allowing wdm_open() to reach there. In addition, is /* using write lock to protect desc->count */ mutex_lock(&desc->wlock); required? Isn't wdm_mutex that is actually protecting desc->count from modification? If it is desc->wlock that is actually protecting desc->count, the !desc->count check in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with desc->wlock held. In [RFC 3/7], patch description says There is no need for flush() to be uninterruptible. close(2) is allowed to return -EAGAIN. Change it. but the code does if (rv < 0) return -EINTR; . Which error code do you want to use? (I still prefer removing wdm_flush() completely...)
Am Donnerstag, den 17.09.2020, 20:24 +0900 schrieb Tetsuo Handa: > On 2020/09/17 18:50, Oliver Neukum wrote: > > Am Mittwoch, den 16.09.2020, 20:14 +0900 schrieb Tetsuo Handa: > If you ask userspace programs to be updated to call fsync(), we can ask > userspace programs be updated to call ioctl(). > > I expected you to implement wdm_ioctl() for fetching last error code. Then, Again, we are not redefining APIs. The APIs for character devices are well defined by POSIX. Please see the man pages. Introducing a whole new ioctl() is out of question. The API and its semantics are clear. Write schedules a write: A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data. Fsync flushes data: fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. If user space does not call fsync(), the error is supposed to be reported by the next write() and if there is no next write(), close() shall report it. > we can apply my proposal (wait for response at wdm_write() by default) as a baseline > for not to break existing userspace programs (except latency), followed by a patch > which allows userspace programs to use that wdm_ioctl() in order not to wait for > response at wdm_write(), which is enabled by calling wdm_ioctl() (in order to > recover latency caused by waiting for response at wdm_write()). I am sorry, but the API is defined by POSIX. > > > What is the purpose of sending the error to the userspace process via write() or close()? > > > > Yes. However to do so, user space must be running. That the death > > of a process interferes with error handling is independent from that. > > Why need to send the error to the userspace process when that process was killed? > My question > > Isn't the purpose to allow userspace process to do something (e.g. print error messages, > retry the write() request with same argument) ? If close() returned an error, it might be > too late to retry the write() request with same argument. Yes. Technically you need to use fsync(). Hence I implemented it. > If we check the error at wdm_write() or wdm_ioctl(), there is no error to report at > wdm_flush(). Therefore, we can remove wdm_flush() completely. Again, the API is defined by POSIX. If user space calls write() and then close(), close() must report the error. > I can't read this series without squashing into single patch. Making changes which > will be after all removed in [RFC 5/7] is sad. Please do [RFC 5/7] before [RFC 4/7]. Done. > Then, you won't need to make unneeded modifications. I'd like to see one cleanup > patch, one possible unsafe dereference fix patch, and one deadlock avoidance patch. This needs to partially go into stable. Hence fixes must come first. > You did not answer to > > How do we guarantee that N'th write() request already set desc->werr before > (N+1)'th next write() request is issued? If (N+1)'th write() request reached > memdup_user() before desc->werr is set by callback of N'th write() request, > (N+1)'th write() request will fail to report the error from N'th write() request. > Yes, that error would be reported by (N+2)'th write() request, but the userspace > process might have already discarded data needed for taking some actions (e.g. > print error messages, retry the write() request with same argument). We don't, nor do we have to. You are right, error reporting can be improved. I implemented fsync() to do so. > . At least I think that > > spin_lock_irq(&desc->iuspin); > we = desc->werr; > desc->werr = 0; > spin_unlock_irq(&desc->iuspin); > if (we < 0) > return usb_translate_errors(we); > > in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags). Why? > In [RFC 2/7], I think that > > /* in case flush() had timed out */ > usb_kill_urb(desc->command); > > which is called only when desc->count == 0 in wdm_open() is pointless, for since > desc->count is incremented/decremented with wdm_mutex held, kill_urbs(desc) which > is called when desc->count == 0 in wdm_release() must have already called > usb_kill_urb(desc->command) before allowing wdm_open() to reach there. You are right. I am going to remove it. > In addition, is > > /* using write lock to protect desc->count */ > mutex_lock(&desc->wlock); > > required? Isn't wdm_mutex that is actually protecting desc->count from modification? > If it is desc->wlock that is actually protecting desc->count, the !desc->count check > in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with > desc->wlock held. Correct. So should wdm_mutex be dropped earlier? > In [RFC 3/7], patch description says > > There is no need for flush() to be uninterruptible. close(2) > is allowed to return -EAGAIN. Change it. > > but the code does > > if (rv < 0) > return -EINTR; > > . Which error code do you want to use? (I still prefer removing wdm_flush() completely...) -EINTR. Sorry. I shall change the description. Regards Oliver
On 2020/09/17 23:17, Oliver Neukum wrote: > The API and its semantics are clear. Write schedules a write: > > A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be > delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data. But I think that this leaves a room for allowing write() to imply fflush() (i.e. write() is allowed to wait for data to be committed to disk). > > Fsync flushes data: > > fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if > the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. > > If user space does not call fsync(), the error is supposed to be reported > by the next write() and if there is no next write(), close() shall report it. Where does "the next" (and not "the next after the next") write() come from? >> You did not answer to >> >> How do we guarantee that N'th write() request already set desc->werr before >> (N+1)'th next write() request is issued? If (N+1)'th write() request reached >> memdup_user() before desc->werr is set by callback of N'th write() request, >> (N+1)'th write() request will fail to report the error from N'th write() request. >> Yes, that error would be reported by (N+2)'th write() request, but the userspace >> process might have already discarded data needed for taking some actions (e.g. >> print error messages, retry the write() request with same argument). > > We don't, nor do we have to. You are right, error reporting can be > improved. I implemented fsync() to do so. You are saying that if user space does not call fsync(), the error is allowed to be reported by the next after the next (in other words, (N+2)'th) write() ? > >> . At least I think that >> >> spin_lock_irq(&desc->iuspin); >> we = desc->werr; >> desc->werr = 0; >> spin_unlock_irq(&desc->iuspin); >> if (we < 0) >> return usb_translate_errors(we); >> >> in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags). > > Why? Otherwise, we can't make sure (N+1)'th write() will report error from N'th write(). Since I don't know the characteristics of data passed via wdm_write() (I guess that the data is some stateful controlling commands rather than meaningless byte stream), I guess that (N+1)'th wdm_write() attempt should be made only after confirming that N'th wdm_write() attempt received wdm_callback() response. To preserve state / data used by N'th wdm_write() attempt, reporting the error from too late write() attempt would be useless. >> In addition, is >> >> /* using write lock to protect desc->count */ >> mutex_lock(&desc->wlock); >> >> required? Isn't wdm_mutex that is actually protecting desc->count from modification? >> If it is desc->wlock that is actually protecting desc->count, the !desc->count check >> in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with >> desc->wlock held. > > Correct. So should wdm_mutex be dropped earlier? If recover_from_urb_loss() can tolerate stale desc->count value, wdm_mutex already protects desc->count. I don't know how this module works. I don't know whether wdm_mutex and/or desc->wlock is held when recover_from_urb_loss() is called from wdm_resume(). It seems that desc->wlock is held but wdm_mutex is not held when recover_from_urb_loss() is called from wdm_post_reset(). By the way, after the fixes, we could replace spin_lock_irq(&desc->iuspin); rv = desc->werr; desc->werr = 0; spin_unlock_irq(&desc->iuspin); with rv = xchg(&desc->werr, 0); and avoid spin_lock_irq()/spin_unlock_irq() because there are many locations which needs to check and clear the error...
Am Freitag, den 18.09.2020, 01:17 +0900 schrieb Tetsuo Handa: > On 2020/09/17 23:17, Oliver Neukum wrote: > > The API and its semantics are clear. Write schedules a write: > > > > A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be > > delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data. > > But I think that this leaves a room for allowing write() to imply fflush() > (i.e. write() is allowed to wait for data to be committed to disk). That would be inferior and very bad for the non-blocking case. > > If user space does not call fsync(), the error is supposed to be reported > > by the next write() and if there is no next write(), close() shall report it. > > Where does "the next" (and not "the next after the next") write() come from? We would indeed by on spec. However, we perform best if we return an error as soon as possible. > You are saying that if user space does not call fsync(), the error is allowed to be > reported by the next after the next (in other words, (N+2)'th) write() ? Yes. The man page is clear on that. > > > . At least I think that > > > > > > spin_lock_irq(&desc->iuspin); > > > we = desc->werr; > > > desc->werr = 0; > > > spin_unlock_irq(&desc->iuspin); > > > if (we < 0) > > > return usb_translate_errors(we); > > > > > > in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags). > > > > Why? > > Otherwise, we can't make sure (N+1)'th write() will report error from N'th write(). We should move the test for reporting errors later, so that it is sure to be carried out? I am afraid I cannot follow that logic. > Since I don't know the characteristics of data passed via wdm_write() (I guess that > the data is some stateful controlling commands rather than meaningless byte stream), > I guess that (N+1)'th wdm_write() attempt should be made only after confirming that > N'th wdm_write() attempt received wdm_callback() response. To preserve state / data > used by N'th wdm_write() attempt, reporting the error from too late write() attempt > would be useless. We cannot make assumptions on how user space uses the driver. Somebody manually connecting and typing in commands letter by letter must also work. We can optimize for the common case, but we must operate according to the specs. > > > > > > In addition, is > > > > > > /* using write lock to protect desc->count */ > > > mutex_lock(&desc->wlock); > > > > > > required? Isn't wdm_mutex that is actually protecting desc->count from modification? > > > If it is desc->wlock that is actually protecting desc->count, the !desc->count check > > > in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with > > > desc->wlock held. > > > > Correct. So should wdm_mutex be dropped earlier? > > If recover_from_urb_loss() can tolerate stale desc->count value, wdm_mutex already It cannot. > protects desc->count. I don't know how this module works. I don't know whether > wdm_mutex and/or desc->wlock is held when recover_from_urb_loss() is called from > wdm_resume(). It seems that desc->wlock is held but wdm_mutex is not held when > recover_from_urb_loss() is called from wdm_post_reset(). Indeed. > > > > By the way, after the fixes, we could replace > > spin_lock_irq(&desc->iuspin); > rv = desc->werr; > desc->werr = 0; > spin_unlock_irq(&desc->iuspin); > > with > > rv = xchg(&desc->werr, 0); > > and avoid spin_lock_irq()/spin_unlock_irq() because there are many > locations which needs to check and clear the error... Have you checked whether this has implications on memory ordering? Regards Oliver
On 2020/09/21 19:52, Oliver Neukum wrote: >>> If user space does not call fsync(), the error is supposed to be reported >>> by the next write() and if there is no next write(), close() shall report it. >> >> Where does "the next" (and not "the next after the next") write() come from? > > We would indeed by on spec. However, we perform best if we return an > error as soon as possible. > >> You are saying that if user space does not call fsync(), the error is allowed to be >> reported by the next after the next (in other words, (N+2)'th) write() ? > > Yes. The man page is clear on that. To understand it, I must understand why it is safe to defer error reporting. > >>>> . At least I think that >>>> >>>> spin_lock_irq(&desc->iuspin); >>>> we = desc->werr; >>>> desc->werr = 0; >>>> spin_unlock_irq(&desc->iuspin); >>>> if (we < 0) >>>> return usb_translate_errors(we); >>>> >>>> in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags). >>> >>> Why? >> >> Otherwise, we can't make sure (N+1)'th write() will report error from N'th write(). > > We should move the test for reporting errors later, so that it is sure > to be carried out? I am afraid I cannot follow that logic. > >> Since I don't know the characteristics of data passed via wdm_write() (I guess that >> the data is some stateful controlling commands rather than meaningless byte stream), >> I guess that (N+1)'th wdm_write() attempt should be made only after confirming that >> N'th wdm_write() attempt received wdm_callback() response. To preserve state / data >> used by N'th wdm_write() attempt, reporting the error from too late write() attempt >> would be useless. > > We cannot make assumptions on how user space uses the driver. Somebody > manually connecting and typing in commands letter by letter must also > work. I'm not making any assumptions on how user space uses the driver. Up to now, your explanation makes me think that you are making assumptions on how user space uses cdc-wdm driver. I'm querying you about characteristics of data passed to wdm_write(). Without knowing the difference between writing to cdc-wdm driver and normal file on some filesystem, I can't judge whether it is acceptable to defer reporting errors. When an error occurred when writing to normal file on some filesystem, the userspace would simply treat that file as broken (e.g. delete such file). The userspace of this cdc-wdm driver might want that any error is immediately reported, for I'm thinking that each data passed to wdm_write() is stateful, for /* CDC-WMC r1.1 requires wMaxCommand to be "at least 256 decimal (0x100)" */ #define WDM_DEFAULT_BUFSIZE 256 and if (count > desc->wMaxCommand) count = desc->wMaxCommand; makes me think that wdm_write() has to be able to handle chunk of data in atomic manner. > > We can optimize for the common case, but we must operate according to > the specs. >> >> >> >>>> In addition, is >>>> >>>> /* using write lock to protect desc->count */ >>>> mutex_lock(&desc->wlock); >>>> >>>> required? Isn't wdm_mutex that is actually protecting desc->count from modification? >>>> If it is desc->wlock that is actually protecting desc->count, the !desc->count check >>>> in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with >>>> desc->wlock held. >>> >>> Correct. So should wdm_mutex be dropped earlier? >> >> If recover_from_urb_loss() can tolerate stale desc->count value, wdm_mutex already > > It cannot. > >> protects desc->count. I don't know how this module works. I don't know whether >> wdm_mutex and/or desc->wlock is held when recover_from_urb_loss() is called from >> wdm_resume(). It seems that desc->wlock is held but wdm_mutex is not held when >> recover_from_urb_loss() is called from wdm_post_reset(). > > Indeed. I don't like [RFC 8/8]. Please drop [RFC 8/8] because not only it is unrelated to hang up problem syzbot is reporting but also lock dependency is still unclear. The /* using write lock to protect desc->count */ comment is hardly helpful because wdm_disconnect() accesses desc->count with only wdm_mutex held. Detailed explanation about why releasing wdm_mutex early in wdm_open() is safe with regards to e.g. wdm_disconnect()/recover_from_urb_loss()/wdm_release() will be required for [RFC 8/8]. Also, pointless + /* in case flush() had timed out */ + usb_kill_urb(desc->command); is still in [RFC 2/8]. Please do squash into one patch, and add detailed/polite patch description like https://lkml.kernel.org/r/1597188375-4787-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp . Current series is too fragmented to understand, and nobody can review such series.
Am Dienstag, den 22.09.2020, 10:56 +0900 schrieb Tetsuo Handa: > On 2020/09/21 19:52, Oliver Neukum wrote: > > To understand it, I must understand why it is safe to defer error reporting. It is not. There is nothing to understand here. If user space needs a guarantee that data has been pushed out without an error, it will have to call fsync() Let me explain. POSIX, as described in the man page, sets some rules about what a driver must do and must not do. In other areas it makes recommendations a good driver should follow. One of them is that we block as little as possible, at the risk of delaying error reporting. A good driver will report errors as soon as possible while being compatible with doing IO asynchronously. > I'm querying you about characteristics of data passed to wdm_write(). > Without knowing the difference between writing to cdc-wdm driver and normal file on > some filesystem, I can't judge whether it is acceptable to defer reporting errors. That is simply not a decision you or I make. The man page clearly says that it is acceptable. If user space does not like that, it must call fsync() after write(). > When an error occurred when writing to normal file on some filesystem, the userspace > would simply treat that file as broken (e.g. delete such file). The userspace of this > cdc-wdm driver might want that any error is immediately reported, for I'm thinking that > each data passed to wdm_write() is stateful, for > > /* CDC-WMC r1.1 requires wMaxCommand to be "at least 256 decimal (0x100)" */ > #define WDM_DEFAULT_BUFSIZE 256 > > and > > if (count > desc->wMaxCommand) > count = desc->wMaxCommand; > > makes me think that wdm_write() has to be able to handle chunk of data in atomic > manner. No. the WDM spec says the clear opposite and the man page of write says so, too. > I don't like [RFC 8/8]. Please drop [RFC 8/8] because not only it is unrelated to > hang up problem syzbot is reporting but also lock dependency is still unclear. Very well. I will make a patch set that clearly fixes bugs and a secondary one with optimizations. Regards Oliver
On 2020/09/22 16:33, Oliver Neukum wrote: > Am Dienstag, den 22.09.2020, 10:56 +0900 schrieb Tetsuo Handa: >> On 2020/09/21 19:52, Oliver Neukum wrote: >> To understand it, I must understand why it is safe to defer error reporting. > > It is not. There is nothing to understand here. If user space needs > a guarantee that data has been pushed out without an error, it will > have to call fsync() >> I'm querying you about characteristics of data passed to wdm_write(). >> Without knowing the difference between writing to cdc-wdm driver and normal file on >> some filesystem, I can't judge whether it is acceptable to defer reporting errors. > > That is simply not a decision you or I make. The man page clearly > says that it is acceptable. If user space does not like that, it must > call fsync() after write(). Then, cdc-wdm driver did not implement fsync() was a big problem. Userspace needs to be careful not to give up upon -EINVAL when running on older kernels which did not implement wdm_fsync(). The remaining concern would be how to handle unresponding hardware, for blocking wdm_write()/wdm_read() and wdm_fsync() are using wait_event_interruptible(). If the caller do not have a mean to send a signal, the caller might hung up forever when the hardware stopped responding. Please add a comment that userspace needs to be careful when calling these functions.
Am Dienstag, den 22.09.2020, 17:34 +0900 schrieb Tetsuo Handa: > On 2020/09/22 16:33, Oliver Neukum wrote: > > Am Dienstag, den 22.09.2020, 10:56 +0900 schrieb Tetsuo Handa: > > > On 2020/09/21 19:52, Oliver Neukum wrote: > > > To understand it, I must understand why it is safe to defer error reporting. > > > > It is not. There is nothing to understand here. If user space needs > > a guarantee that data has been pushed out without an error, it will > > have to call fsync() > > > I'm querying you about characteristics of data passed to wdm_write(). > > > Without knowing the difference between writing to cdc-wdm driver and normal file on > > > some filesystem, I can't judge whether it is acceptable to defer reporting errors. > > > > That is simply not a decision you or I make. The man page clearly > > says that it is acceptable. If user space does not like that, it must > > call fsync() after write(). > > Then, cdc-wdm driver did not implement fsync() was a big problem. Userspace > needs to be careful not to give up upon -EINVAL when running on older kernels > which did not implement wdm_fsync(). Very well. So I'll call the lack of fsync() a bug, which should be fixed in stable. > The remaining concern would be how to handle unresponding hardware, for blocking > wdm_write()/wdm_read() and wdm_fsync() are using wait_event_interruptible(). If > the caller do not have a mean to send a signal, the caller might hung up forever > when the hardware stopped responding. Please add a comment that userspace needs to > be careful when calling these functions. wdm_flush() has such a comment. Yet no driver can make a guarantee that a device will make progress in IO. The driver must, however, provide a means of dealing with such cases. Usually that means handling signals. That is the normal semantics of a write() syscall. I believe we are covered on that. Regards Oliver