drm/i915: Convert WARNs during userptr revoke to SIGBUS

On Tue, Oct 13, 2015 at 02:23:57PM +0200, Daniel Vetter wrote:
> On Tue, Oct 13, 2015 at 12:44:05PM +0100, Chris Wilson wrote:
> > On Tue, Oct 13, 2015 at 01:26:36PM +0200, Daniel Vetter wrote:
> > > On Mon, Oct 12, 2015 at 10:31:35AM +0100, Chris Wilson wrote:
> > > > On Mon, Oct 12, 2015 at 10:06:23AM +0100, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 09/10/15 18:26, Chris Wilson wrote:
> > > > > >On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
> > > > > >>On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> > > > > >>>
> > > > > >>>On 09/10/15 09:55, Daniel Vetter wrote:
> > > > > >>>>On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> > > > > >>>>>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > > > > >>>>>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > > > > >>>>>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> > > > > >>>>>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> > > > > >>>>>>for bad access someone else has made. So essentially it's new ABI.
> > > > > >>>>>
> > > > > >>>>>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> > > > > >>>>>actions taken by other processes.
> > > > > >>>>
> > > > > >>>>Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> > > > > >>>>userspace wants SIGIO we just need to provide it with a pollable fd and
> > > > > >>>>then it can use fcntl to make that happen. That's imo a much better api
> > > > > >>>>than unconditionally throwing around signals. Also we already have the
> > > > > >>>>reset stats ioctl to tell userspace that its gpu context is toats. If
> > > > > >>>>anyone wants that to be pollable (or even send SIGIO) I think we should
> > > > > >>>>extend that, with all the usual "needs userspace&igt" stuff on top.
> > > > > >>>
> > > > > >>>I don't see that this notification can be optional. Process is confused
> > > > > >>>about its memory map use so should die. :)
> > > > > >>>
> > > > > >>>This is not a GPU error/hang - this is the process doing stupid things.
> > > > > >>>
> > > > > >>>MMU notifiers do not support decision making otherwise we could say
> > > > > >>>-ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> > > > > >>>help in all cases, would have to fail clone as well and who knows what.
> > > > > >>
> > > > > >>So what happens if the gpu just keeps using the memory? It'll all be
> > > > > >>horribly undefined behaviour and eventually it'll die on an -EFAULT in
> > > > > >>execbuf, but does anything else bad happen?
> > > > > >
> > > > > >We don't see an EFAULT unless a miracle occurs, and the stale pages
> > > > > >continue to be read/written by other processes (as well as the client).
> > > > > >Horribly undefined behaviour with a misinformation leak.
> > > > > 
> > > > > What other processes? Pages will still be referenced so won't be
> > > > > reused so there is not information leak across unrelated processes.
> > > > > Unless you meant ones involved in object sharing?
> > > > 
> > > > This client is trying to replace the userptr with a fresh set of pages.
> > > > The GPU and other processes continue to see the old pages i.e. old
> > > > information (misinformation :) leaks.
> > > 
> > > userptr explicitly doesn't support this. You need to tear down the
> > > existing userptr object and then create a new one if you change the
> > > mmap'ing. So that's really just a bug in userspace, and the question is
> > > how do we tell userspace best that it's done something stupid.
> > 
> > Pardon? Note this also affects munmap if you don't accept mremap (which
> > is not explicitly unsupported as it fits quite nicely within the
> > existing rules).
> > 
> > > My stance that the best one is to either kill any ctx using that object or
> > > at least indicate there's trouble using reset stats. But sending a
> > > SIGBUS/SIG_SEGV (which can only happen to the thread that does a memory
> > > access, not any other thread that's accidentally in the same process
> > > group) is imo abuse.
> > 
> > The signal is sent to everything that inherited the mm, not bound to the
> > single thread.
> > 
> > > Or we just need to make sure we do get the EFAULT on
> > > the next execbuffer.
> > > 
> > > Or maybe it just doesn't matter, i.e. where is the userspace which a) does
> > > silly stuff like this b) wants proper notification? Adding ABI just
> > > because we can't isn't going to get merged.
> > 
> > No client wants to be killed just because it does something stupid, it
> > is killed to protect the integrity of the system.
> 
> But killing the client won't get rid of the ctx/objects, so won't really
> solve all that much? Especially it won't get rid of the framebuffers,
> which is the real trouble here it seems. That's because logind keeps a
> duped copy of the fd for it's own purposes.

I mentioned that in the commit message - but do note that what is pinned
is what the client has mmapped at that time. Before it is changed, the
client is killed and so we do not get into the situation where the
output is invalid (just old).

> So the better fix would be to make sure we don't accidentally pin a
> userptr object, i.e. adding a check in intel_pin_and_fence_fb for that
> (plus igt testcase). I somehow missed in all this discussion that this is
> about pinned objects ;-)

Preventing wrapping a userptr in a fb is one possibility.

And we can return to this discussion the next time someone mentions
hitting the WARN. Or they complain about not being able to create fb.
-Chris

drm/i915: Convert WARNs during userptr revoke to SIGBUS

Commit Message

Comments

Patch