Message ID | 20180403175855.30858-1-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Apr 03, 2018 at 07:58:55PM +0200, Daniel Vetter wrote: > After a discussion with Wily I got the nagging feeling we might have > some cases of nasty busy loops. The window is fairly small since we > always have a non-faulting fastpath (using page_fault_dis|enable()) > first, usually followed by a pile of pending signal checks, before we > go into the slowpath copy_to|from_user that might blow up for real. > > Test patch to check what CI thinks of this theory. > > v2: Don't WARN on success (Ville). > > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Ok, results from CI are in, we're hitting this path with our test-suite, and nothing bad happens. From my very layman reading of the x86 uaccess code what happens is that we take a fault, don't insert the pte because there's a signal pending (VM_FAULT_NOPAGE case). But since it's a kernel fault we don't just restart the same instruction, but go do the fixup code. That calls copy_user_handle_tail, which does a dead-slow bytewise copy. If that fails again, then there's a different fixup handler which just gives up and results in a short read. In eb_relocate_slow we retry the entire thing upon short reads, and the first thing we do is check for signals. If that's the case we bail out, and change the errno from -EFAULT to -ERESTARTSYS. For most other parts (and everything else in the kernel I suspect) we get an -EFAULT or similar indicator of short reads/writes. Kernel at large seems indeed not prepared to deal with page faults not going through because of signals, but then I think we get away with that because you should pass graphics buffer addresses to random things anyway. Both GL and vk (and everything else) have explicit upload/download functions, and you can't just write() into gpu memory. I think. tldr; I agree with Wily that we're probably breaking kernel fault api expectations here, but I think it's ok ... Only way I can see us fix this reliably is if we have a pending_signal_disable/enable() pair, which makes all interruptible sleeps non-interruptible. I don't core think kernel devs would like that :-) Cheers, Daniel > --- > drivers/gpu/drm/i915/i915_gem.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > index 9650a7b10c5f..9766fa152e05 100644 > --- a/drivers/gpu/drm/i915/i915_gem.c > +++ b/drivers/gpu/drm/i915/i915_gem.c > @@ -1902,6 +1902,7 @@ int i915_gem_fault(struct vm_fault *vmf) > struct i915_vma *vma; > pgoff_t page_offset; > unsigned int flags; > + bool user_fault = vmf->flags & FAULT_FLAG_USER; > int ret; > > /* We don't use vmf->pgoff since that has the fake offset */ > @@ -2017,9 +2018,10 @@ int i915_gem_fault(struct vm_fault *vmf) > * handler to reset everything when re-faulting in > * i915_mutex_lock_interruptible. > */ > - case 0: > case -ERESTARTSYS: > case -EINTR: > + WARN_ON(!user_fault); > + case 0: > case -EBUSY: > /* > * EBUSY is ok: this just means that another thread > -- > 2.16.2 >
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 9650a7b10c5f..9766fa152e05 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -1902,6 +1902,7 @@ int i915_gem_fault(struct vm_fault *vmf) struct i915_vma *vma; pgoff_t page_offset; unsigned int flags; + bool user_fault = vmf->flags & FAULT_FLAG_USER; int ret; /* We don't use vmf->pgoff since that has the fake offset */ @@ -2017,9 +2018,10 @@ int i915_gem_fault(struct vm_fault *vmf) * handler to reset everything when re-faulting in * i915_mutex_lock_interruptible. */ - case 0: case -ERESTARTSYS: case -EINTR: + WARN_ON(!user_fault); + case 0: case -EBUSY: /* * EBUSY is ok: this just means that another thread
After a discussion with Wily I got the nagging feeling we might have some cases of nasty busy loops. The window is fairly small since we always have a non-faulting fastpath (using page_fault_dis|enable()) first, usually followed by a pile of pending signal checks, before we go into the slowpath copy_to|from_user that might blow up for real. Test patch to check what CI thinks of this theory. v2: Don't WARN on success (Ville). Cc: Matthew Wilcox <willy@infradead.org> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> --- drivers/gpu/drm/i915/i915_gem.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)