Message ID | 20180622155716.GE10465@dhcp22.suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > Hi, > > > this is an RFC and not tested at all. I am not very familiar with the > > > mmu notifiers semantics very much so this is a crude attempt to achieve > > > what I need basically. It might be completely wrong but I would like > > > to discuss what would be a better way if that is the case. > > > > > > get_maintainers gave me quite large list of people to CC so I had to trim > > > it down. If you think I have forgot somebody, please let me know > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > index 854bd51b9478..5285df9331fa 100644 > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > > mo->attached = false; > > > } > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > struct mm_struct *mm, > > > unsigned long start, > > > - unsigned long end) > > > + unsigned long end, > > > + bool blockable) > > > { > > > struct i915_mmu_notifier *mn = > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > LIST_HEAD(cancelled); > > > > > > if (RB_EMPTY_ROOT(&mn->objects.rb_root)) > > > - return; > > > + return 0; > > > > The principle wait here is for the HW (even after fixing all the locks > > to be not so coarse, we still have to wait for the HW to finish its > > access). > > Is this wait bound or it can take basically arbitrary amount of time? Arbitrary amount of time but in desktop use case you can assume that it should never go above 16ms for a 60frame per second rendering of your desktop (in GPU compute case this kind of assumption does not hold). Is the process exit_state already updated by the time this mmu notifier callbacks happen ? > > > The first pass would be then to not do anything here if > > !blockable. > > something like this? (incremental diff) What i wanted to do with HMM and mmu notifier is split the invalidation in 2 pass. First pass tell the drivers to stop/cancel pending jobs that depends on the range and invalidate internal driver states (like clear buffer object pages array in case of GPU but not GPU page table). While the second callback would do the actual wait on the GPU to be done and update the GPU page table. Now in this scheme in case the task is already in some exit state and that all CPU threads are frozen/kill then we can probably find a way to do the first path mostly lock less. AFAICR nor AMD nor Intel allow to share userptr bo hence a uptr bo should only ever be access through ioctl submited by the process. The second call can then be delayed and ping from time to time to see if GPU jobs are done. Note that what you propose might still be useful as in case there is no buffer object for a range then OOM can make progress in freeing a range of memory. It is very likely that significant virtual address range of a process and backing memory can be reclaim that way. This assume OOM reclaim vma by vma or in some form of granularity like reclaiming 1GB by 1GB. Or we could also update blocking callback to return range that are blocking that way OOM can reclaim around. Cheers, Jérôme
[Hmm, the cc list got mangled somehow - you have just made many people to work for suse ;) and to kvack.org in the preious one - fixed up hopefully] On Fri 22-06-18 17:07:21, Chris Wilson wrote: > Quoting Michal Hocko (2018-06-22 16:57:16) > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > Hi, > > > > this is an RFC and not tested at all. I am not very familiar with the > > > > mmu notifiers semantics very much so this is a crude attempt to achieve > > > > what I need basically. It might be completely wrong but I would like > > > > to discuss what would be a better way if that is the case. > > > > > > > > get_maintainers gave me quite large list of people to CC so I had to trim > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > index 854bd51b9478..5285df9331fa 100644 > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > > > mo->attached = false; > > > > } > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > struct mm_struct *mm, > > > > unsigned long start, > > > > - unsigned long end) > > > > + unsigned long end, > > > > + bool blockable) > > > > { > > > > struct i915_mmu_notifier *mn = > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > LIST_HEAD(cancelled); > > > > > > > > if (RB_EMPTY_ROOT(&mn->objects.rb_root)) > > > > - return; > > > > + return 0; > > > > > > The principle wait here is for the HW (even after fixing all the locks > > > to be not so coarse, we still have to wait for the HW to finish its > > > access). > > > > Is this wait bound or it can take basically arbitrary amount of time? > > Arbitrary. It waits for the last operation in the queue that needs that > set of backing pages, and that queue is unbounded and not even confined > to the local driver. (Though each operation should be bounded to be > completed within an interval or be cancelled, that interval is on the > order of 10s!) OK, I see. We should rather not wait that long so backoff is just better. The whole point of the oom_reaper is to tear down and free some memory. We do not really need to reclaim all of it. It would be great if we could do something like - kick the tear down of the device memory but have it done in the background. We wouldn't tear the vma down in that case but the whole process would start at least. I am not sure something like that is possible. > > > The first pass would be then to not do anything here if > > > !blockable. > > > > something like this? (incremental diff) > > Yup. Cool, I will start with that because even that is an improvement from the oom_reaper POV. Thanks!
[Resnding with the CC list fixed] On Fri 22-06-18 18:40:26, Michal Hocko wrote: > On Fri 22-06-18 12:18:46, Jerome Glisse wrote: > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > > Hi, > > > > > this is an RFC and not tested at all. I am not very familiar with the > > > > > mmu notifiers semantics very much so this is a crude attempt to achieve > > > > > what I need basically. It might be completely wrong but I would like > > > > > to discuss what would be a better way if that is the case. > > > > > > > > > > get_maintainers gave me quite large list of people to CC so I had to trim > > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > index 854bd51b9478..5285df9331fa 100644 > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > > > > mo->attached = false; > > > > > } > > > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > struct mm_struct *mm, > > > > > unsigned long start, > > > > > - unsigned long end) > > > > > + unsigned long end, > > > > > + bool blockable) > > > > > { > > > > > struct i915_mmu_notifier *mn = > > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > LIST_HEAD(cancelled); > > > > > > > > > > if (RB_EMPTY_ROOT(&mn->objects.rb_root)) > > > > > - return; > > > > > + return 0; > > > > > > > > The principle wait here is for the HW (even after fixing all the locks > > > > to be not so coarse, we still have to wait for the HW to finish its > > > > access). > > > > > > Is this wait bound or it can take basically arbitrary amount of time? > > > > Arbitrary amount of time but in desktop use case you can assume that > > it should never go above 16ms for a 60frame per second rendering of > > your desktop (in GPU compute case this kind of assumption does not > > hold). Is the process exit_state already updated by the time this mmu > > notifier callbacks happen ? > > What do you mean? The process is killed (by SIGKILL) at the time but we > do not know much more than that. The task might be stuck anywhere in the > kernel before handling that signal. > > > > > The first pass would be then to not do anything here if > > > > !blockable. > > > > > > something like this? (incremental diff) > > > > What i wanted to do with HMM and mmu notifier is split the invalidation > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that > > depends on the range and invalidate internal driver states (like clear > > buffer object pages array in case of GPU but not GPU page table). While > > the second callback would do the actual wait on the GPU to be done and > > update the GPU page table. > > What can you do after the first phase? Can I unmap the range? > > > Now in this scheme in case the task is already in some exit state and > > that all CPU threads are frozen/kill then we can probably find a way to > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to > > share userptr bo hence a uptr bo should only ever be access through > > ioctl submited by the process. > > > > The second call can then be delayed and ping from time to time to see > > if GPU jobs are done. > > > > > > Note that what you propose might still be useful as in case there is > > no buffer object for a range then OOM can make progress in freeing a > > range of memory. It is very likely that significant virtual address > > range of a process and backing memory can be reclaim that way. This > > assume OOM reclaim vma by vma or in some form of granularity like > > reclaiming 1GB by 1GB. Or we could also update blocking callback to > > return range that are blocking that way OOM can reclaim around. > > Exactly my point. What we have right now is all or nothing which is > obviously too coarse to be useful.
On Fri, Jun 22, 2018 at 06:42:43PM +0200, Michal Hocko wrote: > [Resnding with the CC list fixed] > > On Fri 22-06-18 18:40:26, Michal Hocko wrote: > > On Fri 22-06-18 12:18:46, Jerome Glisse wrote: > > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote: > > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote: > > > > > Quoting Michal Hocko (2018-06-22 16:02:42) > > > > > > Hi, > > > > > > this is an RFC and not tested at all. I am not very familiar with the > > > > > > mmu notifiers semantics very much so this is a crude attempt to achieve > > > > > > what I need basically. It might be completely wrong but I would like > > > > > > to discuss what would be a better way if that is the case. > > > > > > > > > > > > get_maintainers gave me quite large list of people to CC so I had to trim > > > > > > it down. If you think I have forgot somebody, please let me know > > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > index 854bd51b9478..5285df9331fa 100644 > > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c > > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo) > > > > > > mo->attached = false; > > > > > > } > > > > > > > > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > > struct mm_struct *mm, > > > > > > unsigned long start, > > > > > > - unsigned long end) > > > > > > + unsigned long end, > > > > > > + bool blockable) > > > > > > { > > > > > > struct i915_mmu_notifier *mn = > > > > > > container_of(_mn, struct i915_mmu_notifier, mn); > > > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, > > > > > > LIST_HEAD(cancelled); > > > > > > > > > > > > if (RB_EMPTY_ROOT(&mn->objects.rb_root)) > > > > > > - return; > > > > > > + return 0; > > > > > > > > > > The principle wait here is for the HW (even after fixing all the locks > > > > > to be not so coarse, we still have to wait for the HW to finish its > > > > > access). > > > > > > > > Is this wait bound or it can take basically arbitrary amount of time? > > > > > > Arbitrary amount of time but in desktop use case you can assume that > > > it should never go above 16ms for a 60frame per second rendering of > > > your desktop (in GPU compute case this kind of assumption does not > > > hold). Is the process exit_state already updated by the time this mmu > > > notifier callbacks happen ? > > > > What do you mean? The process is killed (by SIGKILL) at the time but we > > do not know much more than that. The task might be stuck anywhere in the > > kernel before handling that signal. I was wondering if another thread might still be dereferencing any of the structure concurrently with the OOM mmu notifier callback. Saddly yes, it would be simpler if we could make such assumption. > > > > > > > The first pass would be then to not do anything here if > > > > > !blockable. > > > > > > > > something like this? (incremental diff) > > > > > > What i wanted to do with HMM and mmu notifier is split the invalidation > > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that > > > depends on the range and invalidate internal driver states (like clear > > > buffer object pages array in case of GPU but not GPU page table). While > > > the second callback would do the actual wait on the GPU to be done and > > > update the GPU page table. > > > > What can you do after the first phase? Can I unmap the range? No you can't do anything but this force synchronization as any other thread that concurrently trie to do something with those would queue up. So it would serialize thing. Also main motivation on my side is multi-GPU, right now multi-GPU are not that common but this is changing quickly and what we see on high end (4, 8 or 16 GPUs per socket) is spreading into more configurations. Here in mutli GPU case splitting in two would avoid having to fully wait on first GPU before trying to invaidate on second GPU, so on and so forth. > > > > > Now in this scheme in case the task is already in some exit state and > > > that all CPU threads are frozen/kill then we can probably find a way to > > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to > > > share userptr bo hence a uptr bo should only ever be access through > > > ioctl submited by the process. > > > > > > The second call can then be delayed and ping from time to time to see > > > if GPU jobs are done. > > > > > > > > > Note that what you propose might still be useful as in case there is > > > no buffer object for a range then OOM can make progress in freeing a > > > range of memory. It is very likely that significant virtual address > > > range of a process and backing memory can be reclaim that way. This > > > assume OOM reclaim vma by vma or in some form of granularity like > > > reclaiming 1GB by 1GB. Or we could also update blocking callback to > > > return range that are blocking that way OOM can reclaim around. > > > > Exactly my point. What we have right now is all or nothing which is > > obviously too coarse to be useful. Yes i think it is a good step in the right direction. Cheers, Jérôme
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c index 5285df9331fa..e9ed0d2cfabc 100644 --- a/drivers/gpu/drm/i915/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c @@ -122,6 +122,7 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, container_of(_mn, struct i915_mmu_notifier, mn); struct i915_mmu_object *mo; struct interval_tree_node *it; + int ret = 0; LIST_HEAD(cancelled); if (RB_EMPTY_ROOT(&mn->objects.rb_root)) @@ -133,6 +134,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, spin_lock(&mn->lock); it = interval_tree_iter_first(&mn->objects, start, end); while (it) { + if (!blockable) { + ret = -EAGAIN; + goto out_unlock; + } /* The mmu_object is released late when destroying the * GEM object so it is entirely possible to gain a * reference on an object in the process of being freed @@ -154,8 +159,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn, spin_unlock(&mn->lock); /* TODO: can we skip waiting here? */ - if (!list_empty(&cancelled) && blockable) + if (!list_empty(&cancelled)) flush_workqueue(mn->wq); + + return ret; } static const struct mmu_notifier_ops i915_gem_userptr_notifier = {