Message ID | 20230811023137.659037-2-matthew.brost@intel.com (mailing list archive)
---|---
State | New, archived
Series | DRM scheduler changes for Xe
Hi Matt,

On 8/11/23 04:31, Matthew Brost wrote:
> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> seems a bit odd but let us explain the reasoning below.
>
> 1. In XE the submission order from multiple drm_sched_entity is not
> guaranteed to be the same completion even if targeting the same hardware
> engine. This is because in XE we have a firmware scheduler, the GuC,
> which allowed to reorder, timeslice, and preempt submissions. If a using
> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> apart as the TDR expects submission order == completion order. Using a
> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>
> 2. In XE submissions are done via programming a ring buffer (circular
> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
> control on the ring for free.

In XE, where does the limitation of MAX_SIZE_PER_JOB come from?

In Nouveau we currently do have such a limitation as well, but it is
derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would
always be 1. However, I think most jobs won't actually utilize the
whole ring.

Given that, it seems like it would be better to let the scheduler keep
track of empty ring "slots" instead, such that the scheduler can decide
whether a subsequent job will still fit on the ring and, if not,
re-evaluate once a previous job finishes. Of course each submitted job
would be required to carry the number of slots it requires on the ring.

What do you think of implementing this as an alternative flow control
mechanism? Implementation-wise this could be a union with the existing
hw_submission_limit.

- Danilo

>
> A problem with this design is currently a drm_gpu_scheduler uses a
> kthread for submission / job cleanup. This doesn't scale if a large
> number of drm_gpu_scheduler are used. To work around the scaling issue,
> use a worker rather than kthread for submission / job cleanup.
>
> v2:
>   - (Rob Clark) Fix msm build
>   - Pass in run work queue
> v3:
>   - (Boris) don't have loop in worker
> v4:
>   - (Tvrtko) break out submit ready, stop, start helpers into own patch
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
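For illustration, a minimal sketch of the slot-tracking flow control Danilo describes above; every name here is hypothetical and none of it exists in the DRM scheduler today:

```c
#include <stdbool.h>

/* Hypothetical per-ring accounting for the proposed flow control. */
struct ring_flow {
	unsigned int slots_total;	/* total slots the ring provides */
	unsigned int slots_used;	/* slots held by in-flight jobs */
};

/* Each job would carry the number of ring slots it actually needs. */
struct job {
	unsigned int ring_slots;
};

/* Scheduler-side check before pushing a job to the hardware ring. */
static bool ring_can_accept(const struct ring_flow *rf, const struct job *job)
{
	return rf->slots_used + job->ring_slots <= rf->slots_total;
}

/* On completion the slots are returned and pending jobs are re-evaluated. */
static void ring_complete(struct ring_flow *rf, const struct job *job)
{
	rf->slots_used -= job->ring_slots;
}
```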
On 8/16/23 16:05, Christian König wrote: > Am 16.08.23 um 13:30 schrieb Danilo Krummrich: >> Hi Matt, >> >> On 8/11/23 04:31, Matthew Brost wrote: >>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this >>> seems a bit odd but let us explain the reasoning below. >>> >>> 1. In XE the submission order from multiple drm_sched_entity is not >>> guaranteed to be the same completion even if targeting the same hardware >>> engine. This is because in XE we have a firmware scheduler, the GuC, >>> which allowed to reorder, timeslice, and preempt submissions. If a using >>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls >>> apart as the TDR expects submission order == completion order. Using a >>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >>> >>> 2. In XE submissions are done via programming a ring buffer (circular >>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the >>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow >>> control on the ring for free. >> >> In XE, where does the limitation of MAX_SIZE_PER_JOB come from? >> >> In Nouveau we currently do have such a limitation as well, but it is >> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would >> always be 1. However, I think most jobs won't actually utilize the >> whole ring. > > Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = > hw_submission_limit (or even hw_submission_limit - 1 when the hw can't > distinct full vs empty ring buffer). Not sure if I get you right, let me try to clarify what I was trying to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really limited by anything other than the RING_SIZE and hence we'd never allow more than 1 active job. However, it seems to be more efficient to base ring flow control on the actual size of each incoming job rather than the worst case, namely the maximum size of a job. > > Otherwise your scheduler might just overwrite the ring buffer by pushing > things to fast. > > Christian. > >> >> Given that, it seems like it would be better to let the scheduler keep >> track of empty ring "slots" instead, such that the scheduler can >> deceide whether a subsequent job will still fit on the ring and if not >> re-evaluate once a previous job finished. Of course each submitted job >> would be required to carry the number of slots it requires on the ring. >> >> What to you think of implementing this as alternative flow control >> mechanism? Implementation wise this could be a union with the existing >> hw_submission_limit. >> >> - Danilo >> >>> >>> A problem with this design is currently a drm_gpu_scheduler uses a >>> kthread for submission / job cleanup. This doesn't scale if a large >>> number of drm_gpu_scheduler are used. To work around the scaling issue, >>> use a worker rather than kthread for submission / job cleanup. >>> >>> v2: >>> - (Rob Clark) Fix msm build >>> - Pass in run work queue >>> v3: >>> - (Boris) don't have loop in worker >>> v4: >>> - (Tvrtko) break out submit ready, stop, start helpers into own patch >>> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >> >
Am 16.08.23 um 13:30 schrieb Danilo Krummrich: > Hi Matt, > > On 8/11/23 04:31, Matthew Brost wrote: >> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this >> seems a bit odd but let us explain the reasoning below. >> >> 1. In XE the submission order from multiple drm_sched_entity is not >> guaranteed to be the same completion even if targeting the same hardware >> engine. This is because in XE we have a firmware scheduler, the GuC, >> which allowed to reorder, timeslice, and preempt submissions. If a using >> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls >> apart as the TDR expects submission order == completion order. Using a >> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >> >> 2. In XE submissions are done via programming a ring buffer (circular >> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the >> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow >> control on the ring for free. > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from? > > In Nouveau we currently do have such a limitation as well, but it is > derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would > always be 1. However, I think most jobs won't actually utilize the > whole ring. Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = hw_submission_limit (or even hw_submission_limit - 1 when the hw can't distinct full vs empty ring buffer). Otherwise your scheduler might just overwrite the ring buffer by pushing things to fast. Christian. > > Given that, it seems like it would be better to let the scheduler keep > track of empty ring "slots" instead, such that the scheduler can > deceide whether a subsequent job will still fit on the ring and if not > re-evaluate once a previous job finished. Of course each submitted job > would be required to carry the number of slots it requires on the ring. > > What to you think of implementing this as alternative flow control > mechanism? Implementation wise this could be a union with the existing > hw_submission_limit. > > - Danilo > >> >> A problem with this design is currently a drm_gpu_scheduler uses a >> kthread for submission / job cleanup. This doesn't scale if a large >> number of drm_gpu_scheduler are used. To work around the scaling issue, >> use a worker rather than kthread for submission / job cleanup. >> >> v2: >> - (Rob Clark) Fix msm build >> - Pass in run work queue >> v3: >> - (Boris) don't have loop in worker >> v4: >> - (Tvrtko) break out submit ready, stop, start helpers into own patch >> >> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >
On Wed, Aug 16, 2023 at 02:30:38PM +0200, Danilo Krummrich wrote:
> On 8/16/23 16:05, Christian König wrote:
> > Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
> > > Hi Matt,
> > >
> > > On 8/11/23 04:31, Matthew Brost wrote:
> > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > seems a bit odd but let us explain the reasoning below.
> > > >
> > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > guaranteed to be the same completion even if targeting the same hardware
> > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > which allowed to reorder, timeslice, and preempt submissions. If a using
> > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > > apart as the TDR expects submission order == completion order. Using a
> > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
> > > >
> > > > 2. In XE submissions are done via programming a ring buffer (circular
> > > > buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
> > > > limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
> > > > control on the ring for free.
> > >
> > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
> > >

In Xe the job submission is a series of ring instructions done by the KMD.
The instructions are cache flushes, seqno writes, jump to user BB, etc...
The exact instructions for each job vary based on hw engine type, platform,
etc... We derive MAX_SIZE_PER_JOB from the largest set of instructions to
submit a job and have a define in the driver for this. I believe it is
currently set to 192 bytes (the actual define is MAX_JOB_SIZE_BYTES). So a
16k ring lets Xe have 85 jobs inflight at once.

> > > In Nouveau we currently do have such a limitation as well, but it is
> > > derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would
> > > always be 1. However, I think most jobs won't actually utilize the
> > > whole ring.
> >
> > Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB =
> > hw_submission_limit (or even hw_submission_limit - 1 when the hw can't

Yes, hw_submission_limit = RING_SIZE / MAX_SIZE_PER_JOB in Xe.

> > distinct full vs empty ring buffer).
>
> Not sure if I get you right, let me try to clarify what I was trying to say:
> I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really limited by
> anything other than the RING_SIZE and hence we'd never allow more than 1
> active job.

I'm confused how there isn't a limit on the size of the job in Nouveau?
Based on what you have said, a job could be larger than the ring then?

>
> However, it seems to be more efficient to base ring flow control on the
> actual size of each incoming job rather than the worst case, namely the
> maximum size of a job.
>

If this doesn't work for Nouveau, feel free to flow control the ring
differently, but this works rather well (and is simple) for Xe.

Matt

> >
> > Otherwise your scheduler might just overwrite the ring buffer by pushing
> > things to fast.
> >
> > Christian.
> >
> > >
> > > Given that, it seems like it would be better to let the scheduler
> > > keep track of empty ring "slots" instead, such that the scheduler
> > > can deceide whether a subsequent job will still fit on the ring and
> > > if not re-evaluate once a previous job finished.
Of course each > > > submitted job would be required to carry the number of slots it > > > requires on the ring. > > > > > > What to you think of implementing this as alternative flow control > > > mechanism? Implementation wise this could be a union with the > > > existing hw_submission_limit. > > > > > > - Danilo > > > > > > > > > > > A problem with this design is currently a drm_gpu_scheduler uses a > > > > kthread for submission / job cleanup. This doesn't scale if a large > > > > number of drm_gpu_scheduler are used. To work around the scaling issue, > > > > use a worker rather than kthread for submission / job cleanup. > > > > > > > > v2: > > > > - (Rob Clark) Fix msm build > > > > - Pass in run work queue > > > > v3: > > > > - (Boris) don't have loop in worker > > > > v4: > > > > - (Tvrtko) break out submit ready, stop, start helpers into own patch > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > >
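For reference, the arithmetic Matt describes, written out as a sketch; the macro names are paraphrased from the mail rather than taken from the Xe sources:

```c
#define RING_SIZE		(16 * 1024)	/* 16k software ring */
#define MAX_JOB_SIZE_BYTES	192		/* worst-case ring instructions per job */

/* 16384 / 192 == 85, i.e. up to 85 jobs in flight per ring. */
#define HW_SUBMISSION_LIMIT	(RING_SIZE / MAX_JOB_SIZE_BYTES)
```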
Am 16.08.23 um 14:30 schrieb Danilo Krummrich: > On 8/16/23 16:05, Christian König wrote: >> Am 16.08.23 um 13:30 schrieb Danilo Krummrich: >>> Hi Matt, >>> >>> On 8/11/23 04:31, Matthew Brost wrote: >>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first >>>> this >>>> seems a bit odd but let us explain the reasoning below. >>>> >>>> 1. In XE the submission order from multiple drm_sched_entity is not >>>> guaranteed to be the same completion even if targeting the same >>>> hardware >>>> engine. This is because in XE we have a firmware scheduler, the GuC, >>>> which allowed to reorder, timeslice, and preempt submissions. If a >>>> using >>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR >>>> falls >>>> apart as the TDR expects submission order == completion order. Using a >>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >>>> >>>> 2. In XE submissions are done via programming a ring buffer (circular >>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if >>>> the >>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get >>>> flow >>>> control on the ring for free. >>> >>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from? >>> >>> In Nouveau we currently do have such a limitation as well, but it is >>> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would >>> always be 1. However, I think most jobs won't actually utilize the >>> whole ring. >> >> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = >> hw_submission_limit (or even hw_submission_limit - 1 when the hw >> can't distinct full vs empty ring buffer). > > Not sure if I get you right, let me try to clarify what I was trying > to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really > limited by anything other than the RING_SIZE and hence we'd never > allow more than 1 active job. But that lets the hw run dry between submissions. That is usually a pretty horrible idea for performance. > > However, it seems to be more efficient to base ring flow control on > the actual size of each incoming job rather than the worst case, > namely the maximum size of a job. That doesn't sounds like a good idea to me. See we don't limit the number of submitted jobs based on the ring size, but rather we calculate the ring size based on the number of submitted jobs. In other words the hw_submission_limit defines the ring size, not the other way around. And you usually want the hw_submission_limit as low as possible for good scheduler granularity and to avoid extra overhead. Christian. > >> >> Otherwise your scheduler might just overwrite the ring buffer by >> pushing things to fast. >> >> Christian. >> >>> >>> Given that, it seems like it would be better to let the scheduler >>> keep track of empty ring "slots" instead, such that the scheduler >>> can deceide whether a subsequent job will still fit on the ring and >>> if not re-evaluate once a previous job finished. Of course each >>> submitted job would be required to carry the number of slots it >>> requires on the ring. >>> >>> What to you think of implementing this as alternative flow control >>> mechanism? Implementation wise this could be a union with the >>> existing hw_submission_limit. >>> >>> - Danilo >>> >>>> >>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>> kthread for submission / job cleanup. 
This doesn't scale if a large >>>> number of drm_gpu_scheduler are used. To work around the scaling >>>> issue, >>>> use a worker rather than kthread for submission / job cleanup. >>>> >>>> v2: >>>> - (Rob Clark) Fix msm build >>>> - Pass in run work queue >>>> v3: >>>> - (Boris) don't have loop in worker >>>> v4: >>>> - (Tvrtko) break out submit ready, stop, start helpers into own >>>> patch >>>> >>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>> >> >
On 8/16/23 16:38, Matthew Brost wrote: > On Wed, Aug 16, 2023 at 02:30:38PM +0200, Danilo Krummrich wrote: >> On 8/16/23 16:05, Christian König wrote: >>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich: >>>> Hi Matt, >>>> >>>> On 8/11/23 04:31, Matthew Brost wrote: >>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this >>>>> seems a bit odd but let us explain the reasoning below. >>>>> >>>>> 1. In XE the submission order from multiple drm_sched_entity is not >>>>> guaranteed to be the same completion even if targeting the same hardware >>>>> engine. This is because in XE we have a firmware scheduler, the GuC, >>>>> which allowed to reorder, timeslice, and preempt submissions. If a using >>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls >>>>> apart as the TDR expects submission order == completion order. Using a >>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >>>>> >>>>> 2. In XE submissions are done via programming a ring buffer (circular >>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the >>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow >>>>> control on the ring for free. >>>> >>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from? >>>> > > In Xe the job submission is series of ring instructions done by the KMD. > The instructions are cache flushes, seqno writes, jump to user BB, > etc... The exact instructions for each job vary based on hw engine type, > platform, etc... We dervive MAX_SIZE_PER_JOB from the largest set of > instructions to submit a job and have a define in the driver for this. I > believe it is currently set to 192 bytes (the actual define is > MAX_JOB_SIZE_BYTES). So a 16k ring lets Xe have 85 jobs inflight at > once. Ok, that sounds different to how Nouveau works. The "largest set of instructions to submit a job" really is a given by how the hardware works rather than an arbitrary limit? In Nouveau, userspace can submit an arbitrary amount of addresses of indirect bufferes containing the ring instructions. The ring on the kernel side takes the addresses of the indirect buffers rather than the instructions themself. Hence, technically there isn't really a limit on the amount of IBs submitted by a job except for the ring size. > >>>> In Nouveau we currently do have such a limitation as well, but it is >>>> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would >>>> always be 1. However, I think most jobs won't actually utilize the >>>> whole ring. >>> >>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = >>> hw_submission_limit (or even hw_submission_limit - 1 when the hw can't > > Yes, hw_submission_limit = RING_SIZE / MAX_SIZE_PER_JOB in Xe. > > >>> distinct full vs empty ring buffer). >> >> Not sure if I get you right, let me try to clarify what I was trying to say: >> I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really limited by >> anything other than the RING_SIZE and hence we'd never allow more than 1 >> active job. > > I'm confused how there isn't a limit on the size of the job in Nouveau? > Based on what you have said, a job could be larger than the ring then? As explained above, theoretically it could. It's only limited by the ring size. 
> >> >> However, it seems to be more efficient to base ring flow control on the >> actual size of each incoming job rather than the worst case, namely the >> maximum size of a job. >> > > If this doesn't work for Nouveau, feel free flow control the ring > differently but this works rather well (and simple) for Xe. > > Matt > >>> >>> Otherwise your scheduler might just overwrite the ring buffer by pushing >>> things to fast. >>> >>> Christian. >>> >>>> >>>> Given that, it seems like it would be better to let the scheduler >>>> keep track of empty ring "slots" instead, such that the scheduler >>>> can deceide whether a subsequent job will still fit on the ring and >>>> if not re-evaluate once a previous job finished. Of course each >>>> submitted job would be required to carry the number of slots it >>>> requires on the ring. >>>> >>>> What to you think of implementing this as alternative flow control >>>> mechanism? Implementation wise this could be a union with the >>>> existing hw_submission_limit. >>>> >>>> - Danilo >>>> >>>>> >>>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>>> kthread for submission / job cleanup. This doesn't scale if a large >>>>> number of drm_gpu_scheduler are used. To work around the scaling issue, >>>>> use a worker rather than kthread for submission / job cleanup. >>>>> >>>>> v2: >>>>> - (Rob Clark) Fix msm build >>>>> - Pass in run work queue >>>>> v3: >>>>> - (Boris) don't have loop in worker >>>>> v4: >>>>> - (Tvrtko) break out submit ready, stop, start helpers into own patch >>>>> >>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>> >>> >> >
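A rough sketch of the submission model Danilo describes for Nouveau, where the kernel-side ring holds addresses of indirect buffers rather than the instructions themselves; the struct names and fields are illustrative, not Nouveau's actual types:

```c
#include <stdint.h>

/* One ring entry points at a userspace-provided indirect buffer (IB). */
struct ib_entry {
	uint64_t addr;		/* GPU address of the indirect buffer */
	uint32_t length;	/* length of the IB in bytes */
};

/*
 * The software ring itself: a job may reference an arbitrary number of
 * IBs, so the only hard limit on the job size is the number of entries
 * the ring can hold.
 */
struct sw_ring {
	struct ib_entry *entries;
	uint32_t num_entries;	/* ring capacity in IB entries */
	uint32_t head, tail;
};
```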
On 8/16/23 16:59, Christian König wrote: > Am 16.08.23 um 14:30 schrieb Danilo Krummrich: >> On 8/16/23 16:05, Christian König wrote: >>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich: >>>> Hi Matt, >>>> >>>> On 8/11/23 04:31, Matthew Brost wrote: >>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first >>>>> this >>>>> seems a bit odd but let us explain the reasoning below. >>>>> >>>>> 1. In XE the submission order from multiple drm_sched_entity is not >>>>> guaranteed to be the same completion even if targeting the same >>>>> hardware >>>>> engine. This is because in XE we have a firmware scheduler, the GuC, >>>>> which allowed to reorder, timeslice, and preempt submissions. If a >>>>> using >>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR >>>>> falls >>>>> apart as the TDR expects submission order == completion order. Using a >>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >>>>> >>>>> 2. In XE submissions are done via programming a ring buffer (circular >>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if >>>>> the >>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get >>>>> flow >>>>> control on the ring for free. >>>> >>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from? >>>> >>>> In Nouveau we currently do have such a limitation as well, but it is >>>> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would >>>> always be 1. However, I think most jobs won't actually utilize the >>>> whole ring. >>> >>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = >>> hw_submission_limit (or even hw_submission_limit - 1 when the hw >>> can't distinct full vs empty ring buffer). >> >> Not sure if I get you right, let me try to clarify what I was trying >> to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really >> limited by anything other than the RING_SIZE and hence we'd never >> allow more than 1 active job. > > But that lets the hw run dry between submissions. That is usually a > pretty horrible idea for performance. Correct, that's the reason why I said it seems to be more efficient to base ring flow control on the actual size of each incoming job rather than the maximum size of a job. > >> >> However, it seems to be more efficient to base ring flow control on >> the actual size of each incoming job rather than the worst case, >> namely the maximum size of a job. > > That doesn't sounds like a good idea to me. See we don't limit the > number of submitted jobs based on the ring size, but rather we calculate > the ring size based on the number of submitted jobs. > My point isn't really about whether we derive the ring size from the job limit or the other way around. It's more about the job size (or its maximum size) being arbitrary. As mentioned in my reply to Matt: "In Nouveau, userspace can submit an arbitrary amount of addresses of indirect bufferes containing the ring instructions. The ring on the kernel side takes the addresses of the indirect buffers rather than the instructions themself. Hence, technically there isn't really a limit on the amount of IBs submitted by a job except for the ring size." 
So, my point is that I don't really want to limit the job size artificially just to be able to fit multiple jobs into the ring even if they're submitted at their "artificial" maximum size, but rather track how much of the ring the submitted job actually occupies. > In other words the hw_submission_limit defines the ring size, not the > other way around. And you usually want the hw_submission_limit as low as > possible for good scheduler granularity and to avoid extra overhead. I don't think you really mean "as low as possible", do you? Because one really is the minimum if you want to do work at all, but as you mentioned above a job limit of one can let the ring run dry. In the end my proposal comes down to tracking the actual size of a job rather than just assuming a pre-defined maximum job size, and hence a dynamic job limit. I don't think this would hurt the scheduler granularity. In fact, it should even contribute to the desire of not letting the ring run dry even better. Especially for sequences of small jobs, where the current implementation might wrongly assume the ring is already full although actually there would still be enough space left. > > Christian. > >> >>> >>> Otherwise your scheduler might just overwrite the ring buffer by >>> pushing things to fast. >>> >>> Christian. >>> >>>> >>>> Given that, it seems like it would be better to let the scheduler >>>> keep track of empty ring "slots" instead, such that the scheduler >>>> can deceide whether a subsequent job will still fit on the ring and >>>> if not re-evaluate once a previous job finished. Of course each >>>> submitted job would be required to carry the number of slots it >>>> requires on the ring. >>>> >>>> What to you think of implementing this as alternative flow control >>>> mechanism? Implementation wise this could be a union with the >>>> existing hw_submission_limit. >>>> >>>> - Danilo >>>> >>>>> >>>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>>> kthread for submission / job cleanup. This doesn't scale if a large >>>>> number of drm_gpu_scheduler are used. To work around the scaling >>>>> issue, >>>>> use a worker rather than kthread for submission / job cleanup. >>>>> >>>>> v2: >>>>> - (Rob Clark) Fix msm build >>>>> - Pass in run work queue >>>>> v3: >>>>> - (Boris) don't have loop in worker >>>>> v4: >>>>> - (Tvrtko) break out submit ready, stop, start helpers into own >>>>> patch >>>>> >>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>> >>> >> >
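To make the "sequences of small jobs" point concrete with invented numbers: assume a 16 KiB ring and a worst-case job size of 8 KiB, so hw_submission_limit is 2. A stream of 1 KiB jobs would then stall after two submissions even though 14 KiB of the ring is still free, whereas accounting for the actual 1 KiB per job would allow up to 16 of them in flight.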
Am 16.08.23 um 18:33 schrieb Danilo Krummrich: > On 8/16/23 16:59, Christian König wrote: >> Am 16.08.23 um 14:30 schrieb Danilo Krummrich: >>> On 8/16/23 16:05, Christian König wrote: >>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich: >>>>> Hi Matt, >>>>> >>>>> On 8/11/23 04:31, Matthew Brost wrote: >>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At >>>>>> first this >>>>>> seems a bit odd but let us explain the reasoning below. >>>>>> >>>>>> 1. In XE the submission order from multiple drm_sched_entity is not >>>>>> guaranteed to be the same completion even if targeting the same >>>>>> hardware >>>>>> engine. This is because in XE we have a firmware scheduler, the GuC, >>>>>> which allowed to reorder, timeslice, and preempt submissions. If >>>>>> a using >>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the >>>>>> TDR falls >>>>>> apart as the TDR expects submission order == completion order. >>>>>> Using a >>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >>>>>> >>>>>> 2. In XE submissions are done via programming a ring buffer >>>>>> (circular >>>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, >>>>>> if the >>>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we >>>>>> get flow >>>>>> control on the ring for free. >>>>> >>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from? >>>>> >>>>> In Nouveau we currently do have such a limitation as well, but it >>>>> is derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB >>>>> would always be 1. However, I think most jobs won't actually >>>>> utilize the whole ring. >>>> >>>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = >>>> hw_submission_limit (or even hw_submission_limit - 1 when the hw >>>> can't distinct full vs empty ring buffer). >>> >>> Not sure if I get you right, let me try to clarify what I was trying >>> to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't >>> really limited by anything other than the RING_SIZE and hence we'd >>> never allow more than 1 active job. >> >> But that lets the hw run dry between submissions. That is usually a >> pretty horrible idea for performance. > > Correct, that's the reason why I said it seems to be more efficient to > base ring flow control on the actual size of each incoming job rather > than the maximum size of a job. > >> >>> >>> However, it seems to be more efficient to base ring flow control on >>> the actual size of each incoming job rather than the worst case, >>> namely the maximum size of a job. >> >> That doesn't sounds like a good idea to me. See we don't limit the >> number of submitted jobs based on the ring size, but rather we >> calculate the ring size based on the number of submitted jobs. >> > > My point isn't really about whether we derive the ring size from the > job limit or the other way around. It's more about the job size (or > its maximum size) being arbitrary. > > As mentioned in my reply to Matt: > > "In Nouveau, userspace can submit an arbitrary amount of addresses of > indirect bufferes containing the ring instructions. The ring on the > kernel side takes the addresses of the indirect buffers rather than > the instructions themself. Hence, technically there isn't really a > limit on the amount of IBs submitted by a job except for the ring size." 
> > So, my point is that I don't really want to limit the job size > artificially just to be able to fit multiple jobs into the ring even > if they're submitted at their "artificial" maximum size, but rather > track how much of the ring the submitted job actually occupies. > >> In other words the hw_submission_limit defines the ring size, not the >> other way around. And you usually want the hw_submission_limit as low >> as possible for good scheduler granularity and to avoid extra overhead. > > I don't think you really mean "as low as possible", do you? No, I do mean as low as possible or in other words as few as possible. Ideally the scheduler would submit only the minimum amount of work to the hardware to keep the hardware busy. The hardware seems to work mostly the same for all vendors, but you somehow seem to think that filling the ring is somehow beneficial which is really not the case as far as I can see. Regards, Christian. > Because one really is the minimum if you want to do work at all, but > as you mentioned above a job limit of one can let the ring run dry. > > In the end my proposal comes down to tracking the actual size of a job > rather than just assuming a pre-defined maximum job size, and hence a > dynamic job limit. > > I don't think this would hurt the scheduler granularity. In fact, it > should even contribute to the desire of not letting the ring run dry > even better. Especially for sequences of small jobs, where the current > implementation might wrongly assume the ring is already full although > actually there would still be enough space left. > >> >> Christian. >> >>> >>>> >>>> Otherwise your scheduler might just overwrite the ring buffer by >>>> pushing things to fast. >>>> >>>> Christian. >>>> >>>>> >>>>> Given that, it seems like it would be better to let the scheduler >>>>> keep track of empty ring "slots" instead, such that the scheduler >>>>> can deceide whether a subsequent job will still fit on the ring >>>>> and if not re-evaluate once a previous job finished. Of course >>>>> each submitted job would be required to carry the number of slots >>>>> it requires on the ring. >>>>> >>>>> What to you think of implementing this as alternative flow control >>>>> mechanism? Implementation wise this could be a union with the >>>>> existing hw_submission_limit. >>>>> >>>>> - Danilo >>>>> >>>>>> >>>>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>>>> kthread for submission / job cleanup. This doesn't scale if a large >>>>>> number of drm_gpu_scheduler are used. To work around the scaling >>>>>> issue, >>>>>> use a worker rather than kthread for submission / job cleanup. >>>>>> >>>>>> v2: >>>>>> - (Rob Clark) Fix msm build >>>>>> - Pass in run work queue >>>>>> v3: >>>>>> - (Boris) don't have loop in worker >>>>>> v4: >>>>>> - (Tvrtko) break out submit ready, stop, start helpers into >>>>>> own patch >>>>>> >>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>> >>>> >>> >> >
On 8/17/23 07:33, Christian König wrote: > Am 16.08.23 um 18:33 schrieb Danilo Krummrich: >> On 8/16/23 16:59, Christian König wrote: >>> Am 16.08.23 um 14:30 schrieb Danilo Krummrich: >>>> On 8/16/23 16:05, Christian König wrote: >>>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich: >>>>>> Hi Matt, >>>>>> >>>>>> On 8/11/23 04:31, Matthew Brost wrote: >>>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >>>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At >>>>>>> first this >>>>>>> seems a bit odd but let us explain the reasoning below. >>>>>>> >>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not >>>>>>> guaranteed to be the same completion even if targeting the same >>>>>>> hardware >>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC, >>>>>>> which allowed to reorder, timeslice, and preempt submissions. If >>>>>>> a using >>>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the >>>>>>> TDR falls >>>>>>> apart as the TDR expects submission order == completion order. >>>>>>> Using a >>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >>>>>>> >>>>>>> 2. In XE submissions are done via programming a ring buffer >>>>>>> (circular >>>>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, >>>>>>> if the >>>>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we >>>>>>> get flow >>>>>>> control on the ring for free. >>>>>> >>>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from? >>>>>> >>>>>> In Nouveau we currently do have such a limitation as well, but it >>>>>> is derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB >>>>>> would always be 1. However, I think most jobs won't actually >>>>>> utilize the whole ring. >>>>> >>>>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = >>>>> hw_submission_limit (or even hw_submission_limit - 1 when the hw >>>>> can't distinct full vs empty ring buffer). >>>> >>>> Not sure if I get you right, let me try to clarify what I was trying >>>> to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't >>>> really limited by anything other than the RING_SIZE and hence we'd >>>> never allow more than 1 active job. >>> >>> But that lets the hw run dry between submissions. That is usually a >>> pretty horrible idea for performance. >> >> Correct, that's the reason why I said it seems to be more efficient to >> base ring flow control on the actual size of each incoming job rather >> than the maximum size of a job. >> >>> >>>> >>>> However, it seems to be more efficient to base ring flow control on >>>> the actual size of each incoming job rather than the worst case, >>>> namely the maximum size of a job. >>> >>> That doesn't sounds like a good idea to me. See we don't limit the >>> number of submitted jobs based on the ring size, but rather we >>> calculate the ring size based on the number of submitted jobs. >>> >> >> My point isn't really about whether we derive the ring size from the >> job limit or the other way around. It's more about the job size (or >> its maximum size) being arbitrary. >> >> As mentioned in my reply to Matt: >> >> "In Nouveau, userspace can submit an arbitrary amount of addresses of >> indirect bufferes containing the ring instructions. The ring on the >> kernel side takes the addresses of the indirect buffers rather than >> the instructions themself. 
>> Hence, technically there isn't really a
>> limit on the amount of IBs submitted by a job except for the ring size."
>>
>> So, my point is that I don't really want to limit the job size
>> artificially just to be able to fit multiple jobs into the ring even
>> if they're submitted at their "artificial" maximum size, but rather
>> track how much of the ring the submitted job actually occupies.
>>
>>> In other words the hw_submission_limit defines the ring size, not the
>>> other way around. And you usually want the hw_submission_limit as low
>>> as possible for good scheduler granularity and to avoid extra overhead.
>>
>> I don't think you really mean "as low as possible", do you?
>
> No, I do mean as low as possible or in other words as few as possible.
>
> Ideally the scheduler would submit only the minimum amount of work to
> the hardware to keep the hardware busy.
>
> The hardware seems to work mostly the same for all vendors, but you
> somehow seem to think that filling the ring is somehow beneficial which
> is really not the case as far as I can see.

I think that's a misunderstanding. I'm not trying to say that it is
*always* beneficial to fill up the ring as much as possible. But I think
it is under certain circumstances, exactly those circumstances I
described for Nouveau.

As mentioned, in Nouveau the size of a job is only really limited by the
ring size, which means that one job can (but does not necessarily) fill
up the whole ring. We both agree that this is inefficient, because it
potentially results in the HW running dry due to hw_submission_limit == 1.

I recognize you said that one should define hw_submission_limit and
adjust the other parts of the equation accordingly; the options I see
are:

(1) Increase the ring size while keeping the maximum job size.
(2) Decrease the maximum job size while keeping the ring size.
(3) Let the scheduler track the actual job size rather than the maximum
    job size.

(1) results in potentially wasted ring memory, because we're not always
reaching the maximum job size, but the scheduler assumes so.

(2) results in more IOCTLs from userspace for the same amount of IBs,
and more jobs result in more memory allocations and more work being
submitted to the workqueue (with Matt's patches).

(3) doesn't seem to have any of those drawbacks.

What would be your take on that?

Actually, if none of the other drivers is interested in a more precise
way of keeping track of the ring utilization, I'd be totally fine to do
it in a driver-specific way. However, unfortunately I don't see how this
would be possible.

My proposal would be to just keep the hw_submission_limit (maybe rename
it to submission_unit_limit) and add a submission_units field to struct
drm_sched_job. By default a job's submission_units field would be 0 and
the scheduler would behave the exact same way as it does now.

Accordingly, jobs with submission_units > 1 would contribute more than
one unit to the submission_unit_limit.

What do you think about that?

Besides all that, you said that filling up the ring just enough to not
let the HW run dry rather than filling it up entirely is desirable. Why
do you think so? I tend to think that in most cases it shouldn't make a
difference.

- Danilo

>
> Regards,
> Christian.
>
>> Because one really is the minimum if you want to do work at all, but
>> as you mentioned above a job limit of one can let the ring run dry.
>> >> In the end my proposal comes down to tracking the actual size of a job >> rather than just assuming a pre-defined maximum job size, and hence a >> dynamic job limit. >> >> I don't think this would hurt the scheduler granularity. In fact, it >> should even contribute to the desire of not letting the ring run dry >> even better. Especially for sequences of small jobs, where the current >> implementation might wrongly assume the ring is already full although >> actually there would still be enough space left. >> >>> >>> Christian. >>> >>>> >>>>> >>>>> Otherwise your scheduler might just overwrite the ring buffer by >>>>> pushing things to fast. >>>>> >>>>> Christian. >>>>> >>>>>> >>>>>> Given that, it seems like it would be better to let the scheduler >>>>>> keep track of empty ring "slots" instead, such that the scheduler >>>>>> can deceide whether a subsequent job will still fit on the ring >>>>>> and if not re-evaluate once a previous job finished. Of course >>>>>> each submitted job would be required to carry the number of slots >>>>>> it requires on the ring. >>>>>> >>>>>> What to you think of implementing this as alternative flow control >>>>>> mechanism? Implementation wise this could be a union with the >>>>>> existing hw_submission_limit. >>>>>> >>>>>> - Danilo >>>>>> >>>>>>> >>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>>>>> kthread for submission / job cleanup. This doesn't scale if a large >>>>>>> number of drm_gpu_scheduler are used. To work around the scaling >>>>>>> issue, >>>>>>> use a worker rather than kthread for submission / job cleanup. >>>>>>> >>>>>>> v2: >>>>>>> - (Rob Clark) Fix msm build >>>>>>> - Pass in run work queue >>>>>>> v3: >>>>>>> - (Boris) don't have loop in worker >>>>>>> v4: >>>>>>> - (Tvrtko) break out submit ready, stop, start helpers into >>>>>>> own patch >>>>>>> >>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>> >>>>> >>>> >>> >> >
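A minimal sketch of the submission_units idea proposed above; none of these fields exist in the DRM scheduler, and a real implementation would need proper locking:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the additional per-job field proposed for struct drm_sched_job. */
struct sched_job_sketch {
	uint32_t submission_units;	/* 0 keeps today's behaviour (counts as 1) */
};

/* Sketch of the scheduler side: hw_submission_limit renamed to a unit limit. */
struct sched_sketch {
	uint32_t submission_unit_limit;	/* was hw_submission_limit */
	uint32_t units_in_flight;
};

/* A job may be pushed once its units fit under the limit. */
static bool sched_can_push(const struct sched_sketch *s,
			   const struct sched_job_sketch *job)
{
	uint32_t units = job->submission_units ? job->submission_units : 1;

	return s->units_in_flight + units <= s->submission_unit_limit;
}
```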
On 8/17/23 15:35, Christian König wrote: > Am 17.08.23 um 13:13 schrieb Danilo Krummrich: >> On 8/17/23 07:33, Christian König wrote: >>> [SNIP] >>> The hardware seems to work mostly the same for all vendors, but you >>> somehow seem to think that filling the ring is somehow beneficial >>> which is really not the case as far as I can see. >> >> I think that's a misunderstanding. I'm not trying to say that it is >> *always* beneficial to fill up the ring as much as possible. But I >> think it is under certain circumstances, exactly those circumstances I >> described for Nouveau. > > As far as I can see this is not correct for Nouveau either. > >> >> As mentioned, in Nouveau the size of a job is only really limited by >> the ring size, which means that one job can (but does not necessarily) >> fill up the whole ring. We both agree that this is inefficient, >> because it potentially results into the HW run dry due to >> hw_submission_limit == 1. >> >> I recognize you said that one should define hw_submission_limit and >> adjust the other parts of the equation accordingly, the options I see >> are: >> >> (1) Increase the ring size while keeping the maximum job size. >> (2) Decrease the maximum job size while keeping the ring size. >> (3) Let the scheduler track the actual job size rather than the >> maximum job size. >> >> (1) results into potentially wasted ring memory, because we're not >> always reaching the maximum job size, but the scheduler assumes so. >> >> (2) results into more IOCTLs from userspace for the same amount of IBs >> and more jobs result into more memory allocations and more work being >> submitted to the workqueue (with Matt's patches). >> >> (3) doesn't seem to have any of those draw backs. >> >> What would be your take on that? >> >> Actually, if none of the other drivers is interested into a more >> precise way of keeping track of the ring utilization, I'd be totally >> fine to do it in a driver specific way. However, unfortunately I don't >> see how this would be possible. >> >> My proposal would be to just keep the hw_submission_limit (maybe >> rename it to submission_unit_limit) and add a submission_units field >> to struct drm_sched_job. By default a jobs submission_units field >> would be 0 and the scheduler would behave the exact same way as it >> does now. >> >> Accordingly, jobs with submission_units > 1 would contribute more than >> one unit to the submission_unit_limit. >> >> What do you think about that? > > I think you are approaching this from the completely wrong side. First of all, thanks for keeping up the discussion - I appreciate it. Some more comments / questions below. > > See the UAPI needs to be stable, so you need a maximum job size > otherwise it can happen that a combination of large and small > submissions work while a different combination doesn't. How is this related to the uAPI being stable? What do you mean by 'stable' in this context? The Nouveau uAPI allows userspace to pass EXEC jobs by supplying the ring ID (channel), in-/out-syncs and a certain amount of indirect push buffers. The amount of IBs per job is limited by the amount of IBs fitting into the ring. Just to be clear, when I say 'job size' I mean the amount of IBs per job. Maybe I should also mention that the rings we are talking about are software rings managed by a firmware scheduler. We can have an arbitrary amount of software rings and even multiple ones per FD. 
Given a constant ring size I really don't see why I should limit the maximum amount of IBs userspace can push per job just to end up with a hw_submission_limit > 1. For example, let's just assume the ring can take 128 IBs, why would I limit userspace to submit just e.g. 16 IBs at a time, such that the hw_submission_limit becomes 8? What is the advantage of doing that, rather than letting userspace submit *up to* 128 IBs per job and just letting the scheduler push IBs to the ring as long as there's actually space left on the ring? > > So what you usually do, and this is driver independent because simply a > requirement of the UAPI, is that you say here that's my maximum job size > as well as the number of submission which should be pushed to the hw at > the same time. And then get the resulting ring size by the product of > the two. Given the above, how is that a requirement of the uAPI? > > That the ring in this use case can't be fully utilized is not a draw > back, this is completely intentional design which should apply to all > drivers independent of the vendor. Why wouldn't we want to fully utilize the ring size? - Danilo > >> >> Besides all that, you said that filling up the ring just enough to not >> let the HW run dry rather than filling it up entirely is desirable. >> Why do you think so? I tend to think that in most cases it shouldn't >> make difference. > > That results in better scheduling behavior. It's mostly beneficial if > you don't have a hw scheduler, but as far as I can see there is no need > to pump everything to the hw as fast as possible. > > Regards, > Christian. > >> >> - Danilo >> >>> >>> Regards, >>> Christian. >>> >>>> Because one really is the minimum if you want to do work at all, but >>>> as you mentioned above a job limit of one can let the ring run dry. >>>> >>>> In the end my proposal comes down to tracking the actual size of a >>>> job rather than just assuming a pre-defined maximum job size, and >>>> hence a dynamic job limit. >>>> >>>> I don't think this would hurt the scheduler granularity. In fact, it >>>> should even contribute to the desire of not letting the ring run dry >>>> even better. Especially for sequences of small jobs, where the >>>> current implementation might wrongly assume the ring is already full >>>> although actually there would still be enough space left. >>>> >>>>> >>>>> Christian. >>>>> >>>>>> >>>>>>> >>>>>>> Otherwise your scheduler might just overwrite the ring buffer by >>>>>>> pushing things to fast. >>>>>>> >>>>>>> Christian. >>>>>>> >>>>>>>> >>>>>>>> Given that, it seems like it would be better to let the >>>>>>>> scheduler keep track of empty ring "slots" instead, such that >>>>>>>> the scheduler can deceide whether a subsequent job will still >>>>>>>> fit on the ring and if not re-evaluate once a previous job >>>>>>>> finished. Of course each submitted job would be required to >>>>>>>> carry the number of slots it requires on the ring. >>>>>>>> >>>>>>>> What to you think of implementing this as alternative flow >>>>>>>> control mechanism? Implementation wise this could be a union >>>>>>>> with the existing hw_submission_limit. >>>>>>>> >>>>>>>> - Danilo >>>>>>>> >>>>>>>>> >>>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a >>>>>>>>> large >>>>>>>>> number of drm_gpu_scheduler are used. To work around the >>>>>>>>> scaling issue, >>>>>>>>> use a worker rather than kthread for submission / job cleanup. 
>>>>>>>>> >>>>>>>>> v2: >>>>>>>>> - (Rob Clark) Fix msm build >>>>>>>>> - Pass in run work queue >>>>>>>>> v3: >>>>>>>>> - (Boris) don't have loop in worker >>>>>>>>> v4: >>>>>>>>> - (Tvrtko) break out submit ready, stop, start helpers into >>>>>>>>> own patch >>>>>>>>> >>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
Am 17.08.23 um 13:13 schrieb Danilo Krummrich: > On 8/17/23 07:33, Christian König wrote: >> [SNIP] >> The hardware seems to work mostly the same for all vendors, but you >> somehow seem to think that filling the ring is somehow beneficial >> which is really not the case as far as I can see. > > I think that's a misunderstanding. I'm not trying to say that it is > *always* beneficial to fill up the ring as much as possible. But I > think it is under certain circumstances, exactly those circumstances I > described for Nouveau. As far as I can see this is not correct for Nouveau either. > > As mentioned, in Nouveau the size of a job is only really limited by > the ring size, which means that one job can (but does not necessarily) > fill up the whole ring. We both agree that this is inefficient, > because it potentially results into the HW run dry due to > hw_submission_limit == 1. > > I recognize you said that one should define hw_submission_limit and > adjust the other parts of the equation accordingly, the options I see > are: > > (1) Increase the ring size while keeping the maximum job size. > (2) Decrease the maximum job size while keeping the ring size. > (3) Let the scheduler track the actual job size rather than the > maximum job size. > > (1) results into potentially wasted ring memory, because we're not > always reaching the maximum job size, but the scheduler assumes so. > > (2) results into more IOCTLs from userspace for the same amount of IBs > and more jobs result into more memory allocations and more work being > submitted to the workqueue (with Matt's patches). > > (3) doesn't seem to have any of those draw backs. > > What would be your take on that? > > Actually, if none of the other drivers is interested into a more > precise way of keeping track of the ring utilization, I'd be totally > fine to do it in a driver specific way. However, unfortunately I don't > see how this would be possible. > > My proposal would be to just keep the hw_submission_limit (maybe > rename it to submission_unit_limit) and add a submission_units field > to struct drm_sched_job. By default a jobs submission_units field > would be 0 and the scheduler would behave the exact same way as it > does now. > > Accordingly, jobs with submission_units > 1 would contribute more than > one unit to the submission_unit_limit. > > What do you think about that? I think you are approaching this from the completely wrong side. See the UAPI needs to be stable, so you need a maximum job size otherwise it can happen that a combination of large and small submissions work while a different combination doesn't. So what you usually do, and this is driver independent because simply a requirement of the UAPI, is that you say here that's my maximum job size as well as the number of submission which should be pushed to the hw at the same time. And then get the resulting ring size by the product of the two. That the ring in this use case can't be fully utilized is not a draw back, this is completely intentional design which should apply to all drivers independent of the vendor. > > Besides all that, you said that filling up the ring just enough to not > let the HW run dry rather than filling it up entirely is desirable. > Why do you think so? I tend to think that in most cases it shouldn't > make difference. That results in better scheduling behavior. It's mostly beneficial if you don't have a hw scheduler, but as far as I can see there is no need to pump everything to the hw as fast as possible. Regards, Christian. 
> > - Danilo > >> >> Regards, >> Christian. >> >>> Because one really is the minimum if you want to do work at all, but >>> as you mentioned above a job limit of one can let the ring run dry. >>> >>> In the end my proposal comes down to tracking the actual size of a >>> job rather than just assuming a pre-defined maximum job size, and >>> hence a dynamic job limit. >>> >>> I don't think this would hurt the scheduler granularity. In fact, it >>> should even contribute to the desire of not letting the ring run dry >>> even better. Especially for sequences of small jobs, where the >>> current implementation might wrongly assume the ring is already full >>> although actually there would still be enough space left. >>> >>>> >>>> Christian. >>>> >>>>> >>>>>> >>>>>> Otherwise your scheduler might just overwrite the ring buffer by >>>>>> pushing things to fast. >>>>>> >>>>>> Christian. >>>>>> >>>>>>> >>>>>>> Given that, it seems like it would be better to let the >>>>>>> scheduler keep track of empty ring "slots" instead, such that >>>>>>> the scheduler can deceide whether a subsequent job will still >>>>>>> fit on the ring and if not re-evaluate once a previous job >>>>>>> finished. Of course each submitted job would be required to >>>>>>> carry the number of slots it requires on the ring. >>>>>>> >>>>>>> What to you think of implementing this as alternative flow >>>>>>> control mechanism? Implementation wise this could be a union >>>>>>> with the existing hw_submission_limit. >>>>>>> >>>>>>> - Danilo >>>>>>> >>>>>>>> >>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>>>>>> kthread for submission / job cleanup. This doesn't scale if a >>>>>>>> large >>>>>>>> number of drm_gpu_scheduler are used. To work around the >>>>>>>> scaling issue, >>>>>>>> use a worker rather than kthread for submission / job cleanup. >>>>>>>> >>>>>>>> v2: >>>>>>>> - (Rob Clark) Fix msm build >>>>>>>> - Pass in run work queue >>>>>>>> v3: >>>>>>>> - (Boris) don't have loop in worker >>>>>>>> v4: >>>>>>>> - (Tvrtko) break out submit ready, stop, start helpers into >>>>>>>> own patch >>>>>>>> >>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
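Christian's sizing rule, expressed as a sketch with placeholder values: pick the maximum job size and the number of submissions to keep in flight, and derive the ring size from their product rather than the other way around:

```c
#define MAX_JOB_SIZE_BYTES	256	/* UAPI-visible upper bound per job (example value) */
#define HW_SUBMISSION_LIMIT	16	/* jobs to keep pushed to the hw at once (example value) */

/* The ring is sized so the submission limit can always be honoured. */
#define RING_SIZE		(MAX_JOB_SIZE_BYTES * HW_SUBMISSION_LIMIT)
```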
Am 17.08.23 um 14:48 schrieb Danilo Krummrich:
> On 8/17/23 15:35, Christian König wrote:
>> Am 17.08.23 um 13:13 schrieb Danilo Krummrich:
>>> On 8/17/23 07:33, Christian König wrote:
>>>> [SNIP]
>>>> My proposal would be to just keep the hw_submission_limit (maybe
>>>> rename it to submission_unit_limit) and add a submission_units
>>>> field to struct drm_sched_job. By default a jobs submission_units
>>>> field would be 0 and the scheduler would behave the exact same way
>>>> as it does now.
>>>>
>>>> Accordingly, jobs with submission_units > 1 would contribute more
>>>> than one unit to the submission_unit_limit.
>>>>
>>>> What do you think about that?
>>
>> I think you are approaching this from the completely wrong side.
>
> First of all, thanks for keeping up the discussion - I appreciate it.
> Some more comments / questions below.
>
>>
>> See the UAPI needs to be stable, so you need a maximum job size
>> otherwise it can happen that a combination of large and small
>> submissions work while a different combination doesn't.
>
> How is this related to the uAPI being stable? What do you mean by
> 'stable' in this context?

Stable as in you don't get inconsistent behavior, not stable in the
sense of backward compatibility. Sorry for the confusing wording :)

>
> The Nouveau uAPI allows userspace to pass EXEC jobs by supplying the
> ring ID (channel), in-/out-syncs and a certain amount of indirect push
> buffers. The amount of IBs per job is limited by the amount of IBs
> fitting into the ring. Just to be clear, when I say 'job size' I mean
> the amount of IBs per job.

Well that more or less sounds identical to all other hardware I know of,
e.g. AMD, Intel and the different ARM chips seem to all work like this.
But on those drivers the job size limit is not the ring size, but rather
a fixed value (at least as far as I know).

>
> Maybe I should also mention that the rings we are talking about are
> software rings managed by a firmware scheduler. We can have an
> arbitrary amount of software rings and even multiple ones per FD.
>
> Given a constant ring size I really don't see why I should limit the
> maximum amount of IBs userspace can push per job just to end up with a
> hw_submission_limit > 1.
>
> For example, let's just assume the ring can take 128 IBs, why would I
> limit userspace to submit just e.g. 16 IBs at a time, such that the
> hw_submission_limit becomes 8?

Well the question is what happens when you have two submissions back to
back which use more than half of the ring buffer?

I only see two possible outcomes:
1. You return -EBUSY (or similar) error code indicating that the hw
   can't receive more commands.
2. Wait on previously pushed commands to be executed.
(3. Your driver crashes because you accidentally overwrite stuff in the
   ring buffer which is still executed. I just assume that's prevented).

Resolution #1 with -EBUSY is actually something the UAPI should not do,
because your UAPI then depends on the specific timing of submissions
which is a really bad idea.

Resolution #2 is usually bad because it forces the hw to run dry between
submissions and so degrades performance.

>
> What is the advantage of doing that, rather than letting userspace
> submit *up to* 128 IBs per job and just letting the scheduler push IBs
> to the ring as long as there's actually space left on the ring?

Predictable behavior I think. Basically you want to organize things so
that the hw is at least kept busy all the time without depending on
actual timing.
> >> >> So what you usually do, and this is driver independent because simply >> a requirement of the UAPI, is that you say here that's my maximum job >> size as well as the number of submission which should be pushed to >> the hw at the same time. And then get the resulting ring size by the >> product of the two. > > Given the above, how is that a requirement of the uAPI? The requirement of the UAPI is actually pretty simple: You should get consistent results, independent of the timing (at least as long as you don't do stuff in parallel). Otherwise you can run into issues when on a certain configuration stuff suddenly runs faster or slower than expected. In other words you should not depend on that stuff finishes in a certain amount of time. > >> >> That the ring in this use case can't be fully utilized is not a draw >> back, this is completely intentional design which should apply to all >> drivers independent of the vendor. > > Why wouldn't we want to fully utilize the ring size? As far as I know everybody restricts the submission size to something fixed which is at least smaller than halve the ring size to avoid the problems mentioned above. Regards, Christian. > > - Danilo > >> >>> >>> Besides all that, you said that filling up the ring just enough to >>> not let the HW run dry rather than filling it up entirely is >>> desirable. Why do you think so? I tend to think that in most cases >>> it shouldn't make difference. >> >> That results in better scheduling behavior. It's mostly beneficial if >> you don't have a hw scheduler, but as far as I can see there is no >> need to pump everything to the hw as fast as possible. >> >> Regards, >> Christian. >> >>> >>> - Danilo >>> >>>> >>>> Regards, >>>> Christian. >>>> >>>>> Because one really is the minimum if you want to do work at all, >>>>> but as you mentioned above a job limit of one can let the ring run >>>>> dry. >>>>> >>>>> In the end my proposal comes down to tracking the actual size of a >>>>> job rather than just assuming a pre-defined maximum job size, and >>>>> hence a dynamic job limit. >>>>> >>>>> I don't think this would hurt the scheduler granularity. In fact, >>>>> it should even contribute to the desire of not letting the ring >>>>> run dry even better. Especially for sequences of small jobs, where >>>>> the current implementation might wrongly assume the ring is >>>>> already full although actually there would still be enough space >>>>> left. >>>>> >>>>>> >>>>>> Christian. >>>>>> >>>>>>> >>>>>>>> >>>>>>>> Otherwise your scheduler might just overwrite the ring buffer >>>>>>>> by pushing things to fast. >>>>>>>> >>>>>>>> Christian. >>>>>>>> >>>>>>>>> >>>>>>>>> Given that, it seems like it would be better to let the >>>>>>>>> scheduler keep track of empty ring "slots" instead, such that >>>>>>>>> the scheduler can deceide whether a subsequent job will still >>>>>>>>> fit on the ring and if not re-evaluate once a previous job >>>>>>>>> finished. Of course each submitted job would be required to >>>>>>>>> carry the number of slots it requires on the ring. >>>>>>>>> >>>>>>>>> What to you think of implementing this as alternative flow >>>>>>>>> control mechanism? Implementation wise this could be a union >>>>>>>>> with the existing hw_submission_limit. >>>>>>>>> >>>>>>>>> - Danilo >>>>>>>>> >>>>>>>>>> >>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler >>>>>>>>>> uses a >>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a >>>>>>>>>> large >>>>>>>>>> number of drm_gpu_scheduler are used. 
To work around the >>>>>>>>>> scaling issue, >>>>>>>>>> use a worker rather than kthread for submission / job cleanup. >>>>>>>>>> >>>>>>>>>> v2: >>>>>>>>>> - (Rob Clark) Fix msm build >>>>>>>>>> - Pass in run work queue >>>>>>>>>> v3: >>>>>>>>>> - (Boris) don't have loop in worker >>>>>>>>>> v4: >>>>>>>>>> - (Tvrtko) break out submit ready, stop, start helpers >>>>>>>>>> into own patch >>>>>>>>>> >>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
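To make the sizing scheme described above concrete, here is a minimal sketch of the conventional approach, with made-up names and numbers that are not taken from Xe, Nouveau or the actual scheduler code:

/*
 * Illustrative sketch of the conventional sizing: pick a fixed maximum
 * job size and the number of jobs the scheduler may keep in flight,
 * then derive the ring size from their product. Names and numbers are
 * assumptions for illustration only.
 */
#define MAX_JOB_SIZE_DW		256	/* assumed per-job maximum, in dwords */
#define HW_SUBMISSION_LIMIT	8	/* jobs allowed in flight at once */
#define RING_SIZE_DW		(HW_SUBMISSION_LIMIT * MAX_JOB_SIZE_DW)

/*
 * With this scheme the scheduler's job counting doubles as ring flow
 * control: as long as fewer than HW_SUBMISSION_LIMIT jobs are in
 * flight, a job of at most MAX_JOB_SIZE_DW dwords is guaranteed to fit.
 */
static bool ring_has_room(unsigned int jobs_in_flight)
{
	return jobs_in_flight < HW_SUBMISSION_LIMIT;
}

The case debated in the rest of the thread is what happens when the per-job maximum is as large as the ring itself, so that HW_SUBMISSION_LIMIT collapses to 1.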
On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote: > On 8/17/23 07:33, Christian König wrote: > > Am 16.08.23 um 18:33 schrieb Danilo Krummrich: > > > On 8/16/23 16:59, Christian König wrote: > > > > Am 16.08.23 um 14:30 schrieb Danilo Krummrich: > > > > > On 8/16/23 16:05, Christian König wrote: > > > > > > Am 16.08.23 um 13:30 schrieb Danilo Krummrich: > > > > > > > Hi Matt, > > > > > > > > > > > > > > On 8/11/23 04:31, Matthew Brost wrote: > > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 > > > > > > > > mapping between a drm_gpu_scheduler and > > > > > > > > drm_sched_entity. At first this > > > > > > > > seems a bit odd but let us explain the reasoning below. > > > > > > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not > > > > > > > > guaranteed to be the same completion even if > > > > > > > > targeting the same hardware > > > > > > > > engine. This is because in XE we have a firmware scheduler, the GuC, > > > > > > > > which allowed to reorder, timeslice, and preempt > > > > > > > > submissions. If a using > > > > > > > > shared drm_gpu_scheduler across multiple > > > > > > > > drm_sched_entity, the TDR falls > > > > > > > > apart as the TDR expects submission order == > > > > > > > > completion order. Using a > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. > > > > > > > > > > > > > > > > 2. In XE submissions are done via programming a > > > > > > > > ring buffer (circular > > > > > > > > buffer), a drm_gpu_scheduler provides a limit on > > > > > > > > number of jobs, if the > > > > > > > > limit of number jobs is set to RING_SIZE / > > > > > > > > MAX_SIZE_PER_JOB we get flow > > > > > > > > control on the ring for free. > > > > > > > > > > > > > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from? > > > > > > > > > > > > > > In Nouveau we currently do have such a limitation as > > > > > > > well, but it is derived from the RING_SIZE, hence > > > > > > > RING_SIZE / MAX_SIZE_PER_JOB would always be 1. > > > > > > > However, I think most jobs won't actually utilize > > > > > > > the whole ring. > > > > > > > > > > > > Well that should probably rather be RING_SIZE / > > > > > > MAX_SIZE_PER_JOB = hw_submission_limit (or even > > > > > > hw_submission_limit - 1 when the hw can't distinct full > > > > > > vs empty ring buffer). > > > > > > > > > > Not sure if I get you right, let me try to clarify what I > > > > > was trying to say: I wanted to say that in Nouveau > > > > > MAX_SIZE_PER_JOB isn't really limited by anything other than > > > > > the RING_SIZE and hence we'd never allow more than 1 active > > > > > job. > > > > > > > > But that lets the hw run dry between submissions. That is > > > > usually a pretty horrible idea for performance. > > > > > > Correct, that's the reason why I said it seems to be more efficient > > > to base ring flow control on the actual size of each incoming job > > > rather than the maximum size of a job. > > > > > > > > > > > > > > > > > However, it seems to be more efficient to base ring flow > > > > > control on the actual size of each incoming job rather than > > > > > the worst case, namely the maximum size of a job. > > > > > > > > That doesn't sounds like a good idea to me. See we don't limit > > > > the number of submitted jobs based on the ring size, but rather > > > > we calculate the ring size based on the number of submitted > > > > jobs. 
> > > > > > > > > > My point isn't really about whether we derive the ring size from the > > > job limit or the other way around. It's more about the job size (or > > > its maximum size) being arbitrary. > > > > > > As mentioned in my reply to Matt: > > > > > > "In Nouveau, userspace can submit an arbitrary amount of addresses > > > of indirect bufferes containing the ring instructions. The ring on > > > the kernel side takes the addresses of the indirect buffers rather > > > than the instructions themself. Hence, technically there isn't > > > really a limit on the amount of IBs submitted by a job except for > > > the ring size." > > > > > > So, my point is that I don't really want to limit the job size > > > artificially just to be able to fit multiple jobs into the ring even > > > if they're submitted at their "artificial" maximum size, but rather > > > track how much of the ring the submitted job actually occupies. > > > > > > > In other words the hw_submission_limit defines the ring size, > > > > not the other way around. And you usually want the > > > > hw_submission_limit as low as possible for good scheduler > > > > granularity and to avoid extra overhead. > > > > > > I don't think you really mean "as low as possible", do you? > > > > No, I do mean as low as possible or in other words as few as possible. > > > > Ideally the scheduler would submit only the minimum amount of work to > > the hardware to keep the hardware busy. > > > The hardware seems to work mostly the same for all vendors, but you > > somehow seem to think that filling the ring is somehow beneficial which > > is really not the case as far as I can see. > > I think that's a misunderstanding. I'm not trying to say that it is *always* > beneficial to fill up the ring as much as possible. But I think it is under > certain circumstances, exactly those circumstances I described for Nouveau. > > As mentioned, in Nouveau the size of a job is only really limited by the > ring size, which means that one job can (but does not necessarily) fill up > the whole ring. We both agree that this is inefficient, because it > potentially results into the HW run dry due to hw_submission_limit == 1. > > I recognize you said that one should define hw_submission_limit and adjust > the other parts of the equation accordingly, the options I see are: > > (1) Increase the ring size while keeping the maximum job size. > (2) Decrease the maximum job size while keeping the ring size. > (3) Let the scheduler track the actual job size rather than the maximum job > size. > > (1) results into potentially wasted ring memory, because we're not always > reaching the maximum job size, but the scheduler assumes so. > > (2) results into more IOCTLs from userspace for the same amount of IBs and > more jobs result into more memory allocations and more work being submitted > to the workqueue (with Matt's patches). > > (3) doesn't seem to have any of those draw backs. > > What would be your take on that? > > Actually, if none of the other drivers is interested into a more precise way > of keeping track of the ring utilization, I'd be totally fine to do it in a > driver specific way. However, unfortunately I don't see how this would be > possible. > > My proposal would be to just keep the hw_submission_limit (maybe rename it > to submission_unit_limit) and add a submission_units field to struct > drm_sched_job. By default a jobs submission_units field would be 0 and the > scheduler would behave the exact same way as it does now. 
> > Accordingly, jobs with submission_units > 1 would contribute more than one > unit to the submission_unit_limit. > > What do you think about that? > This seems reasonible to me and a very minimal change to the scheduler. Matt > Besides all that, you said that filling up the ring just enough to not let > the HW run dry rather than filling it up entirely is desirable. Why do you > think so? I tend to think that in most cases it shouldn't make difference. > > - Danilo > > > > > Regards, > > Christian. > > > > > Because one really is the minimum if you want to do work at all, but > > > as you mentioned above a job limit of one can let the ring run dry. > > > > > > In the end my proposal comes down to tracking the actual size of a > > > job rather than just assuming a pre-defined maximum job size, and > > > hence a dynamic job limit. > > > > > > I don't think this would hurt the scheduler granularity. In fact, it > > > should even contribute to the desire of not letting the ring run dry > > > even better. Especially for sequences of small jobs, where the > > > current implementation might wrongly assume the ring is already full > > > although actually there would still be enough space left. > > > > > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > > > Otherwise your scheduler might just overwrite the ring > > > > > > buffer by pushing things to fast. > > > > > > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > > Given that, it seems like it would be better to let > > > > > > > the scheduler keep track of empty ring "slots" > > > > > > > instead, such that the scheduler can deceide whether > > > > > > > a subsequent job will still fit on the ring and if > > > > > > > not re-evaluate once a previous job finished. Of > > > > > > > course each submitted job would be required to carry > > > > > > > the number of slots it requires on the ring. > > > > > > > > > > > > > > What to you think of implementing this as > > > > > > > alternative flow control mechanism? Implementation > > > > > > > wise this could be a union with the existing > > > > > > > hw_submission_limit. > > > > > > > > > > > > > > - Danilo > > > > > > > > > > > > > > > > > > > > > > > A problem with this design is currently a drm_gpu_scheduler uses a > > > > > > > > kthread for submission / job cleanup. This doesn't scale if a large > > > > > > > > number of drm_gpu_scheduler are used. To work > > > > > > > > around the scaling issue, > > > > > > > > use a worker rather than kthread for submission / job cleanup. > > > > > > > > > > > > > > > > v2: > > > > > > > > - (Rob Clark) Fix msm build > > > > > > > > - Pass in run work queue > > > > > > > > v3: > > > > > > > > - (Boris) don't have loop in worker > > > > > > > > v4: > > > > > > > > - (Tvrtko) break out submit ready, stop, > > > > > > > > start helpers into own patch > > > > > > > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
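As a rough sketch of what Danilo's submission_units proposal could look like, with the scheduler internals heavily simplified; the counter name and helpers below are assumptions for illustration, not actual drm_sched code:

struct drm_sched_job {
	/* ... existing members ... */
	u32 submission_units;	/* ring space this job occupies; 0 behaves like 1 */
};

/*
 * Treat the default of 0 as one unit so existing drivers keep today's
 * one-job-one-unit behaviour, as proposed above.
 */
static u32 drm_sched_job_units(struct drm_sched_job *job)
{
	return job->submission_units ?: 1;
}

/*
 * submission_unit_limit is the renamed hw_submission_limit;
 * units_in_flight is an assumed counter that would be increased by
 * drm_sched_job_units(job) when a job is pushed to the hardware and
 * decreased again when the job completes.
 */
static bool drm_sched_can_push(struct drm_gpu_scheduler *sched,
			       struct drm_sched_job *job)
{
	return atomic_read(&sched->units_in_flight) +
	       drm_sched_job_units(job) <= sched->submission_unit_limit;
}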
Am 18.08.23 um 05:08 schrieb Matthew Brost: > On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote: >> On 8/17/23 07:33, Christian König wrote: >>> Am 16.08.23 um 18:33 schrieb Danilo Krummrich: >>>> On 8/16/23 16:59, Christian König wrote: >>>>> Am 16.08.23 um 14:30 schrieb Danilo Krummrich: >>>>>> On 8/16/23 16:05, Christian König wrote: >>>>>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich: >>>>>>>> Hi Matt, >>>>>>>> >>>>>>>> On 8/11/23 04:31, Matthew Brost wrote: >>>>>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >>>>>>>>> mapping between a drm_gpu_scheduler and >>>>>>>>> drm_sched_entity. At first this >>>>>>>>> seems a bit odd but let us explain the reasoning below. >>>>>>>>> >>>>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not >>>>>>>>> guaranteed to be the same completion even if >>>>>>>>> targeting the same hardware >>>>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC, >>>>>>>>> which allowed to reorder, timeslice, and preempt >>>>>>>>> submissions. If a using >>>>>>>>> shared drm_gpu_scheduler across multiple >>>>>>>>> drm_sched_entity, the TDR falls >>>>>>>>> apart as the TDR expects submission order == >>>>>>>>> completion order. Using a >>>>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >>>>>>>>> >>>>>>>>> 2. In XE submissions are done via programming a >>>>>>>>> ring buffer (circular >>>>>>>>> buffer), a drm_gpu_scheduler provides a limit on >>>>>>>>> number of jobs, if the >>>>>>>>> limit of number jobs is set to RING_SIZE / >>>>>>>>> MAX_SIZE_PER_JOB we get flow >>>>>>>>> control on the ring for free. >>>>>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from? >>>>>>>> >>>>>>>> In Nouveau we currently do have such a limitation as >>>>>>>> well, but it is derived from the RING_SIZE, hence >>>>>>>> RING_SIZE / MAX_SIZE_PER_JOB would always be 1. >>>>>>>> However, I think most jobs won't actually utilize >>>>>>>> the whole ring. >>>>>>> Well that should probably rather be RING_SIZE / >>>>>>> MAX_SIZE_PER_JOB = hw_submission_limit (or even >>>>>>> hw_submission_limit - 1 when the hw can't distinct full >>>>>>> vs empty ring buffer). >>>>>> Not sure if I get you right, let me try to clarify what I >>>>>> was trying to say: I wanted to say that in Nouveau >>>>>> MAX_SIZE_PER_JOB isn't really limited by anything other than >>>>>> the RING_SIZE and hence we'd never allow more than 1 active >>>>>> job. >>>>> But that lets the hw run dry between submissions. That is >>>>> usually a pretty horrible idea for performance. >>>> Correct, that's the reason why I said it seems to be more efficient >>>> to base ring flow control on the actual size of each incoming job >>>> rather than the maximum size of a job. >>>> >>>>>> However, it seems to be more efficient to base ring flow >>>>>> control on the actual size of each incoming job rather than >>>>>> the worst case, namely the maximum size of a job. >>>>> That doesn't sounds like a good idea to me. See we don't limit >>>>> the number of submitted jobs based on the ring size, but rather >>>>> we calculate the ring size based on the number of submitted >>>>> jobs. >>>>> >>>> My point isn't really about whether we derive the ring size from the >>>> job limit or the other way around. It's more about the job size (or >>>> its maximum size) being arbitrary. 
>>>> >>>> As mentioned in my reply to Matt: >>>> >>>> "In Nouveau, userspace can submit an arbitrary amount of addresses >>>> of indirect bufferes containing the ring instructions. The ring on >>>> the kernel side takes the addresses of the indirect buffers rather >>>> than the instructions themself. Hence, technically there isn't >>>> really a limit on the amount of IBs submitted by a job except for >>>> the ring size." >>>> >>>> So, my point is that I don't really want to limit the job size >>>> artificially just to be able to fit multiple jobs into the ring even >>>> if they're submitted at their "artificial" maximum size, but rather >>>> track how much of the ring the submitted job actually occupies. >>>> >>>>> In other words the hw_submission_limit defines the ring size, >>>>> not the other way around. And you usually want the >>>>> hw_submission_limit as low as possible for good scheduler >>>>> granularity and to avoid extra overhead. >>>> I don't think you really mean "as low as possible", do you? >>> No, I do mean as low as possible or in other words as few as possible. >>> >>> Ideally the scheduler would submit only the minimum amount of work to >>> the hardware to keep the hardware busy. > >>> The hardware seems to work mostly the same for all vendors, but you >>> somehow seem to think that filling the ring is somehow beneficial which >>> is really not the case as far as I can see. >> I think that's a misunderstanding. I'm not trying to say that it is *always* >> beneficial to fill up the ring as much as possible. But I think it is under >> certain circumstances, exactly those circumstances I described for Nouveau. >> >> As mentioned, in Nouveau the size of a job is only really limited by the >> ring size, which means that one job can (but does not necessarily) fill up >> the whole ring. We both agree that this is inefficient, because it >> potentially results into the HW run dry due to hw_submission_limit == 1. >> >> I recognize you said that one should define hw_submission_limit and adjust >> the other parts of the equation accordingly, the options I see are: >> >> (1) Increase the ring size while keeping the maximum job size. >> (2) Decrease the maximum job size while keeping the ring size. >> (3) Let the scheduler track the actual job size rather than the maximum job >> size. >> >> (1) results into potentially wasted ring memory, because we're not always >> reaching the maximum job size, but the scheduler assumes so. >> >> (2) results into more IOCTLs from userspace for the same amount of IBs and >> more jobs result into more memory allocations and more work being submitted >> to the workqueue (with Matt's patches). >> >> (3) doesn't seem to have any of those draw backs. >> >> What would be your take on that? >> >> Actually, if none of the other drivers is interested into a more precise way >> of keeping track of the ring utilization, I'd be totally fine to do it in a >> driver specific way. However, unfortunately I don't see how this would be >> possible. >> >> My proposal would be to just keep the hw_submission_limit (maybe rename it >> to submission_unit_limit) and add a submission_units field to struct >> drm_sched_job. By default a jobs submission_units field would be 0 and the >> scheduler would behave the exact same way as it does now. >> >> Accordingly, jobs with submission_units > 1 would contribute more than one >> unit to the submission_unit_limit. >> >> What do you think about that? >> > This seems reasonible to me and a very minimal change to the scheduler. 
If you have a good use case for this then the approach sounds sane to me as well. My question is rather how exactly does Nouveau comes to have this use case? Allowing the full ring size in the UAPI sounds a bit questionable. Christian. > > Matt > >> Besides all that, you said that filling up the ring just enough to not let >> the HW run dry rather than filling it up entirely is desirable. Why do you >> think so? I tend to think that in most cases it shouldn't make difference. >> >> - Danilo >> >>> Regards, >>> Christian. >>> >>>> Because one really is the minimum if you want to do work at all, but >>>> as you mentioned above a job limit of one can let the ring run dry. >>>> >>>> In the end my proposal comes down to tracking the actual size of a >>>> job rather than just assuming a pre-defined maximum job size, and >>>> hence a dynamic job limit. >>>> >>>> I don't think this would hurt the scheduler granularity. In fact, it >>>> should even contribute to the desire of not letting the ring run dry >>>> even better. Especially for sequences of small jobs, where the >>>> current implementation might wrongly assume the ring is already full >>>> although actually there would still be enough space left. >>>> >>>>> Christian. >>>>> >>>>>>> Otherwise your scheduler might just overwrite the ring >>>>>>> buffer by pushing things to fast. >>>>>>> >>>>>>> Christian. >>>>>>> >>>>>>>> Given that, it seems like it would be better to let >>>>>>>> the scheduler keep track of empty ring "slots" >>>>>>>> instead, such that the scheduler can deceide whether >>>>>>>> a subsequent job will still fit on the ring and if >>>>>>>> not re-evaluate once a previous job finished. Of >>>>>>>> course each submitted job would be required to carry >>>>>>>> the number of slots it requires on the ring. >>>>>>>> >>>>>>>> What to you think of implementing this as >>>>>>>> alternative flow control mechanism? Implementation >>>>>>>> wise this could be a union with the existing >>>>>>>> hw_submission_limit. >>>>>>>> >>>>>>>> - Danilo >>>>>>>> >>>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a large >>>>>>>>> number of drm_gpu_scheduler are used. To work >>>>>>>>> around the scaling issue, >>>>>>>>> use a worker rather than kthread for submission / job cleanup. >>>>>>>>> >>>>>>>>> v2: >>>>>>>>> - (Rob Clark) Fix msm build >>>>>>>>> - Pass in run work queue >>>>>>>>> v3: >>>>>>>>> - (Boris) don't have loop in worker >>>>>>>>> v4: >>>>>>>>> - (Tvrtko) break out submit ready, stop, >>>>>>>>> start helpers into own patch >>>>>>>>> >>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
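For illustration, the kind of upper limit Christian is asking about would typically be enforced at the ioctl boundary; a hypothetical sketch, with a made-up limit and structure that do not reflect Nouveau's actual uAPI:

/*
 * Hypothetical cap on the number of IBs a single exec job may carry,
 * checked when the submission enters the kernel. The constant and the
 * structure are invented for this example.
 */
#define EXAMPLE_MAX_IBS_PER_EXEC	64

struct example_exec_args {
	__u32 channel;		/* software ring / queue id */
	__u32 push_count;	/* number of indirect buffers in this job */
	__u64 push_ptr;		/* userspace array of IB addresses */
};

static int example_exec_validate(const struct example_exec_args *args)
{
	/*
	 * Reject oversized jobs outright; as noted above, returning
	 * -EBUSY based on submission timing is what the uAPI must avoid.
	 */
	if (!args->push_count || args->push_count > EXAMPLE_MAX_IBS_PER_EXEC)
		return -EINVAL;
	return 0;
}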
On 8/17/23 18:17, Christian König wrote: > Am 17.08.23 um 14:48 schrieb Danilo Krummrich: >> On 8/17/23 15:35, Christian König wrote: >>> Am 17.08.23 um 13:13 schrieb Danilo Krummrich: >>>> On 8/17/23 07:33, Christian König wrote: >>>>> [SNIP] >>>>> My proposal would be to just keep the hw_submission_limit (maybe >>>>> rename it to submission_unit_limit) and add a submission_units >>>>> field to struct drm_sched_job. By default a jobs submission_units >>>>> field would be 0 and the scheduler would behave the exact same way >>>>> as it does now. >>>>> >>>>> Accordingly, jobs with submission_units > 1 would contribute more >>>>> than one unit to the submission_unit_limit. >>>>> >>>>> What do you think about that? >>> >>> I think you are approaching this from the completely wrong side. >> >> First of all, thanks for keeping up the discussion - I appreciate it. >> Some more comments / questions below. >> >>> >>> See the UAPI needs to be stable, so you need a maximum job size >>> otherwise it can happen that a combination of large and small >>> submissions work while a different combination doesn't. >> >> How is this related to the uAPI being stable? What do you mean by >> 'stable' in this context? > > Stable is in you don't get indifferent behavior, not stable is in the > sense of backward compatibility. Sorry for the confusing wording :) > >> >> The Nouveau uAPI allows userspace to pass EXEC jobs by supplying the >> ring ID (channel), in-/out-syncs and a certain amount of indirect push >> buffers. The amount of IBs per job is limited by the amount of IBs >> fitting into the ring. Just to be clear, when I say 'job size' I mean >> the amount of IBs per job. > > Well that more or less sounds identical to all other hardware I know of, > e.g. AMD, Intel and the different ARM chips seem to all work like this. > But on those drivers the job size limit is not the ring size, but rather > a fixed value (at least as far as I know). > >> >> Maybe I should also mention that the rings we are talking about are >> software rings managed by a firmware scheduler. We can have an >> arbitrary amount of software rings and even multiple ones per FD. >> >> Given a constant ring size I really don't see why I should limit the >> maximum amount of IBs userspace can push per job just to end up with a >> hw_submission_limit > 1. >> >> For example, let's just assume the ring can take 128 IBs, why would I >> limit userspace to submit just e.g. 16 IBs at a time, such that the >> hw_submission_limit becomes 8? > > Well the question is what happens when you have two submissions back to > back which use more than halve of the ring buffer? > > I only see two possible outcomes: > 1. You return -EBUSY (or similar) error code indicating the the hw can't > receive more commands. > 2. Wait on previously pushed commands to be executed. > (3. Your driver crash because you accidentally overwrite stuff in the > ring buffer which is still executed. I just assume that's prevented). > > Resolution #1 with -EBUSY is actually something the UAPI should not do, > because your UAPI then depends on the specific timing of submissions > which is a really bad idea. > > Resolution #2 is usually bad because it forces the hw to run dry between > submission and so degrade performance. I agree, that is a good reason for at least limiting the maximum job size to half of the ring size. 
However, there could still be cases where two subsequent jobs are submitted with just a single IB, which as is would still block subsequent jobs to be pushed to the ring although there is still plenty of space. Depending on the (CPU) scheduler latency, such a case can let the HW run dry as well. Surely, we could just continue decrease the maximum job size even further, but this would result in further overhead on user and kernel for larger IB counts. Tracking the actual job size seems to be the better solution for drivers where the job size can vary over a rather huge range. - Danilo > >> >> What is the advantage of doing that, rather than letting userspace >> submit *up to* 128 IBs per job and just letting the scheduler push IBs >> to the ring as long as there's actually space left on the ring? > > Predictable behavior I think. Basically you want organize things so that > the hw is at least kept busy all the time without depending on actual > timing. > >> >>> >>> So what you usually do, and this is driver independent because simply >>> a requirement of the UAPI, is that you say here that's my maximum job >>> size as well as the number of submission which should be pushed to >>> the hw at the same time. And then get the resulting ring size by the >>> product of the two. >> >> Given the above, how is that a requirement of the uAPI? > > The requirement of the UAPI is actually pretty simple: You should get > consistent results, independent of the timing (at least as long as you > don't do stuff in parallel). > > Otherwise you can run into issues when on a certain configuration stuff > suddenly runs faster or slower than expected. In other words you should > not depend on that stuff finishes in a certain amount of time. > >> >>> >>> That the ring in this use case can't be fully utilized is not a draw >>> back, this is completely intentional design which should apply to all >>> drivers independent of the vendor. >> >> Why wouldn't we want to fully utilize the ring size? > > As far as I know everybody restricts the submission size to something > fixed which is at least smaller than halve the ring size to avoid the > problems mentioned above. > > Regards, > Christian. > >> >> - Danilo >> >>> >>>> >>>> Besides all that, you said that filling up the ring just enough to >>>> not let the HW run dry rather than filling it up entirely is >>>> desirable. Why do you think so? I tend to think that in most cases >>>> it shouldn't make difference. >>> >>> That results in better scheduling behavior. It's mostly beneficial if >>> you don't have a hw scheduler, but as far as I can see there is no >>> need to pump everything to the hw as fast as possible. >>> >>> Regards, >>> Christian. >>> >>>> >>>> - Danilo >>>> >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Because one really is the minimum if you want to do work at all, >>>>>> but as you mentioned above a job limit of one can let the ring run >>>>>> dry. >>>>>> >>>>>> In the end my proposal comes down to tracking the actual size of a >>>>>> job rather than just assuming a pre-defined maximum job size, and >>>>>> hence a dynamic job limit. >>>>>> >>>>>> I don't think this would hurt the scheduler granularity. In fact, >>>>>> it should even contribute to the desire of not letting the ring >>>>>> run dry even better. Especially for sequences of small jobs, where >>>>>> the current implementation might wrongly assume the ring is >>>>>> already full although actually there would still be enough space >>>>>> left. >>>>>> >>>>>>> >>>>>>> Christian. 
>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>> Otherwise your scheduler might just overwrite the ring buffer >>>>>>>>> by pushing things to fast. >>>>>>>>> >>>>>>>>> Christian. >>>>>>>>> >>>>>>>>>> >>>>>>>>>> Given that, it seems like it would be better to let the >>>>>>>>>> scheduler keep track of empty ring "slots" instead, such that >>>>>>>>>> the scheduler can deceide whether a subsequent job will still >>>>>>>>>> fit on the ring and if not re-evaluate once a previous job >>>>>>>>>> finished. Of course each submitted job would be required to >>>>>>>>>> carry the number of slots it requires on the ring. >>>>>>>>>> >>>>>>>>>> What to you think of implementing this as alternative flow >>>>>>>>>> control mechanism? Implementation wise this could be a union >>>>>>>>>> with the existing hw_submission_limit. >>>>>>>>>> >>>>>>>>>> - Danilo >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler >>>>>>>>>>> uses a >>>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a >>>>>>>>>>> large >>>>>>>>>>> number of drm_gpu_scheduler are used. To work around the >>>>>>>>>>> scaling issue, >>>>>>>>>>> use a worker rather than kthread for submission / job cleanup. >>>>>>>>>>> >>>>>>>>>>> v2: >>>>>>>>>>> - (Rob Clark) Fix msm build >>>>>>>>>>> - Pass in run work queue >>>>>>>>>>> v3: >>>>>>>>>>> - (Boris) don't have loop in worker >>>>>>>>>>> v4: >>>>>>>>>>> - (Tvrtko) break out submit ready, stop, start helpers >>>>>>>>>>> into own patch >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >
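Using the numbers already floated in the thread, a small worked example of the corner case described above; the figures are illustrative only:

/*
 * Example figures: a software ring that holds 128 IB slots and a
 * per-job maximum capped at half the ring, as discussed above.
 */
#define RING_IBS		128
#define MAX_JOB_IBS		(RING_IBS / 2)			/* 64 */
#define HW_SUBMISSION_LIMIT	(RING_IBS / MAX_JOB_IBS)	/*  2 */

/*
 * With job-count based flow control, two jobs carrying a single IB each
 * consume both submission slots: 126 of the 128 IB slots remain free,
 * yet the scheduler will not push a third job until one of the first
 * two completes. Accounting the actual size (1 unit each) rather than
 * the worst case (64 units each) keeps the ring fed in this situation.
 */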
On 8/18/23 14:49, Matthew Brost wrote: > On Fri, Aug 18, 2023 at 07:40:41AM +0200, Christian König wrote: >> Am 18.08.23 um 05:08 schrieb Matthew Brost: >>> On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote: >>>> On 8/17/23 07:33, Christian König wrote: >>>>> Am 16.08.23 um 18:33 schrieb Danilo Krummrich: >>>>>> On 8/16/23 16:59, Christian König wrote: >>>>>>> Am 16.08.23 um 14:30 schrieb Danilo Krummrich: >>>>>>>> On 8/16/23 16:05, Christian König wrote: >>>>>>>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich: >>>>>>>>>> Hi Matt, >>>>>>>>>> >>>>>>>>>> On 8/11/23 04:31, Matthew Brost wrote: >>>>>>>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 >>>>>>>>>>> mapping between a drm_gpu_scheduler and >>>>>>>>>>> drm_sched_entity. At first this >>>>>>>>>>> seems a bit odd but let us explain the reasoning below. >>>>>>>>>>> >>>>>>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not >>>>>>>>>>> guaranteed to be the same completion even if >>>>>>>>>>> targeting the same hardware >>>>>>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC, >>>>>>>>>>> which allowed to reorder, timeslice, and preempt >>>>>>>>>>> submissions. If a using >>>>>>>>>>> shared drm_gpu_scheduler across multiple >>>>>>>>>>> drm_sched_entity, the TDR falls >>>>>>>>>>> apart as the TDR expects submission order == >>>>>>>>>>> completion order. Using a >>>>>>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. >>>>>>>>>>> >>>>>>>>>>> 2. In XE submissions are done via programming a >>>>>>>>>>> ring buffer (circular >>>>>>>>>>> buffer), a drm_gpu_scheduler provides a limit on >>>>>>>>>>> number of jobs, if the >>>>>>>>>>> limit of number jobs is set to RING_SIZE / >>>>>>>>>>> MAX_SIZE_PER_JOB we get flow >>>>>>>>>>> control on the ring for free. >>>>>>>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from? >>>>>>>>>> >>>>>>>>>> In Nouveau we currently do have such a limitation as >>>>>>>>>> well, but it is derived from the RING_SIZE, hence >>>>>>>>>> RING_SIZE / MAX_SIZE_PER_JOB would always be 1. >>>>>>>>>> However, I think most jobs won't actually utilize >>>>>>>>>> the whole ring. >>>>>>>>> Well that should probably rather be RING_SIZE / >>>>>>>>> MAX_SIZE_PER_JOB = hw_submission_limit (or even >>>>>>>>> hw_submission_limit - 1 when the hw can't distinct full >>>>>>>>> vs empty ring buffer). >>>>>>>> Not sure if I get you right, let me try to clarify what I >>>>>>>> was trying to say: I wanted to say that in Nouveau >>>>>>>> MAX_SIZE_PER_JOB isn't really limited by anything other than >>>>>>>> the RING_SIZE and hence we'd never allow more than 1 active >>>>>>>> job. >>>>>>> But that lets the hw run dry between submissions. That is >>>>>>> usually a pretty horrible idea for performance. >>>>>> Correct, that's the reason why I said it seems to be more efficient >>>>>> to base ring flow control on the actual size of each incoming job >>>>>> rather than the maximum size of a job. >>>>>> >>>>>>>> However, it seems to be more efficient to base ring flow >>>>>>>> control on the actual size of each incoming job rather than >>>>>>>> the worst case, namely the maximum size of a job. >>>>>>> That doesn't sounds like a good idea to me. See we don't limit >>>>>>> the number of submitted jobs based on the ring size, but rather >>>>>>> we calculate the ring size based on the number of submitted >>>>>>> jobs. 
>>>>>>> >>>>>> My point isn't really about whether we derive the ring size from the >>>>>> job limit or the other way around. It's more about the job size (or >>>>>> its maximum size) being arbitrary. >>>>>> >>>>>> As mentioned in my reply to Matt: >>>>>> >>>>>> "In Nouveau, userspace can submit an arbitrary amount of addresses >>>>>> of indirect bufferes containing the ring instructions. The ring on >>>>>> the kernel side takes the addresses of the indirect buffers rather >>>>>> than the instructions themself. Hence, technically there isn't >>>>>> really a limit on the amount of IBs submitted by a job except for >>>>>> the ring size." >>>>>> >>>>>> So, my point is that I don't really want to limit the job size >>>>>> artificially just to be able to fit multiple jobs into the ring even >>>>>> if they're submitted at their "artificial" maximum size, but rather >>>>>> track how much of the ring the submitted job actually occupies. >>>>>> >>>>>>> In other words the hw_submission_limit defines the ring size, >>>>>>> not the other way around. And you usually want the >>>>>>> hw_submission_limit as low as possible for good scheduler >>>>>>> granularity and to avoid extra overhead. >>>>>> I don't think you really mean "as low as possible", do you? >>>>> No, I do mean as low as possible or in other words as few as possible. >>>>> >>>>> Ideally the scheduler would submit only the minimum amount of work to >>>>> the hardware to keep the hardware busy. > >>>>> The hardware seems to work mostly the same for all vendors, but you >>>>> somehow seem to think that filling the ring is somehow beneficial which >>>>> is really not the case as far as I can see. >>>> I think that's a misunderstanding. I'm not trying to say that it is *always* >>>> beneficial to fill up the ring as much as possible. But I think it is under >>>> certain circumstances, exactly those circumstances I described for Nouveau. >>>> >>>> As mentioned, in Nouveau the size of a job is only really limited by the >>>> ring size, which means that one job can (but does not necessarily) fill up >>>> the whole ring. We both agree that this is inefficient, because it >>>> potentially results into the HW run dry due to hw_submission_limit == 1. >>>> >>>> I recognize you said that one should define hw_submission_limit and adjust >>>> the other parts of the equation accordingly, the options I see are: >>>> >>>> (1) Increase the ring size while keeping the maximum job size. >>>> (2) Decrease the maximum job size while keeping the ring size. >>>> (3) Let the scheduler track the actual job size rather than the maximum job >>>> size. >>>> >>>> (1) results into potentially wasted ring memory, because we're not always >>>> reaching the maximum job size, but the scheduler assumes so. >>>> >>>> (2) results into more IOCTLs from userspace for the same amount of IBs and >>>> more jobs result into more memory allocations and more work being submitted >>>> to the workqueue (with Matt's patches). >>>> >>>> (3) doesn't seem to have any of those draw backs. >>>> >>>> What would be your take on that? >>>> >>>> Actually, if none of the other drivers is interested into a more precise way >>>> of keeping track of the ring utilization, I'd be totally fine to do it in a >>>> driver specific way. However, unfortunately I don't see how this would be >>>> possible. >>>> >>>> My proposal would be to just keep the hw_submission_limit (maybe rename it >>>> to submission_unit_limit) and add a submission_units field to struct >>>> drm_sched_job. 
By default a jobs submission_units field would be 0 and the >>>> scheduler would behave the exact same way as it does now. >>>> >>>> Accordingly, jobs with submission_units > 1 would contribute more than one >>>> unit to the submission_unit_limit. >>>> >>>> What do you think about that? >>>> >>> This seems reasonible to me and a very minimal change to the scheduler. >> >> If you have a good use case for this then the approach sounds sane to me as >> well. >> > > Xe does not have a use case as the difference between the minimum size > of a job and the maximum is not all that large (maybe 100-192 bytes is > the range) so the accounting of a unit of 1 per job is just fine for now > even though it may waste space. > > In Nouveau it seems like the min / max for size of job can vary wildly > so it needs finer grained units to mke effective use of the ring space. > Updating the scheduler to support this is rather trivial, hence no real > opposition from me. Also I do see this valid use case that other driver > or even perhaps Xe may use someday. Yes, exactly that. > >> My question is rather how exactly does Nouveau comes to have this use case? >> Allowing the full ring size in the UAPI sounds a bit questionable. >> > > I agree allowing the user completely fill the ring is a bit > questionable, surely there has to be some upper limit. But lets say it > allows 1-64 IB, that still IMO could be used to justify finer grained > accouting in the DRM scheduler as stated above this make the difference > between the min / max job quite large. Yes, I agree that limiting the job size to at least ring size half makes sense to guarantee a contiguous flow. > > Matt > >> Christian. >> >>> >>> Matt >>> >>>> Besides all that, you said that filling up the ring just enough to not let >>>> the HW run dry rather than filling it up entirely is desirable. Why do you >>>> think so? I tend to think that in most cases it shouldn't make difference. >>>> >>>> - Danilo >>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Because one really is the minimum if you want to do work at all, but >>>>>> as you mentioned above a job limit of one can let the ring run dry. >>>>>> >>>>>> In the end my proposal comes down to tracking the actual size of a >>>>>> job rather than just assuming a pre-defined maximum job size, and >>>>>> hence a dynamic job limit. >>>>>> >>>>>> I don't think this would hurt the scheduler granularity. In fact, it >>>>>> should even contribute to the desire of not letting the ring run dry >>>>>> even better. Especially for sequences of small jobs, where the >>>>>> current implementation might wrongly assume the ring is already full >>>>>> although actually there would still be enough space left. >>>>>> >>>>>>> Christian. >>>>>>> >>>>>>>>> Otherwise your scheduler might just overwrite the ring >>>>>>>>> buffer by pushing things to fast. >>>>>>>>> >>>>>>>>> Christian. >>>>>>>>> >>>>>>>>>> Given that, it seems like it would be better to let >>>>>>>>>> the scheduler keep track of empty ring "slots" >>>>>>>>>> instead, such that the scheduler can deceide whether >>>>>>>>>> a subsequent job will still fit on the ring and if >>>>>>>>>> not re-evaluate once a previous job finished. Of >>>>>>>>>> course each submitted job would be required to carry >>>>>>>>>> the number of slots it requires on the ring. >>>>>>>>>> >>>>>>>>>> What to you think of implementing this as >>>>>>>>>> alternative flow control mechanism? Implementation >>>>>>>>>> wise this could be a union with the existing >>>>>>>>>> hw_submission_limit. 
>>>>>>>>>> >>>>>>>>>> - Danilo >>>>>>>>>> >>>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a >>>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a large >>>>>>>>>>> number of drm_gpu_scheduler are used. To work >>>>>>>>>>> around the scaling issue, >>>>>>>>>>> use a worker rather than kthread for submission / job cleanup. >>>>>>>>>>> >>>>>>>>>>> v2: >>>>>>>>>>> - (Rob Clark) Fix msm build >>>>>>>>>>> - Pass in run work queue >>>>>>>>>>> v3: >>>>>>>>>>> - (Boris) don't have loop in worker >>>>>>>>>>> v4: >>>>>>>>>>> - (Tvrtko) break out submit ready, stop, >>>>>>>>>>> start helpers into own patch >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >> >
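As a sketch of the driver side, assuming the submission_units field proposed earlier in the thread existed; the helper below is hypothetical and not taken from Nouveau or Xe:

/*
 * Hypothetical driver helper: a driver whose software ring is accounted
 * in IB slots charges one unit per IB, so a job in the 1-64 IB range
 * mentioned above occupies anywhere from 1 to 64 units of the
 * submission_unit_limit instead of always counting as one.
 */
static void example_job_set_units(struct drm_sched_job *job,
				  unsigned int num_ibs)
{
	job->submission_units = num_ibs;
}

A driver with a near-constant job size, like the Xe case described above, would simply leave the field at 0 and keep the current one-job-one-unit accounting.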
On Fri, Aug 18, 2023 at 07:40:41AM +0200, Christian König wrote: > Am 18.08.23 um 05:08 schrieb Matthew Brost: > > On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote: > > > On 8/17/23 07:33, Christian König wrote: > > > > Am 16.08.23 um 18:33 schrieb Danilo Krummrich: > > > > > On 8/16/23 16:59, Christian König wrote: > > > > > > Am 16.08.23 um 14:30 schrieb Danilo Krummrich: > > > > > > > On 8/16/23 16:05, Christian König wrote: > > > > > > > > Am 16.08.23 um 13:30 schrieb Danilo Krummrich: > > > > > > > > > Hi Matt, > > > > > > > > > > > > > > > > > > On 8/11/23 04:31, Matthew Brost wrote: > > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1 > > > > > > > > > > mapping between a drm_gpu_scheduler and > > > > > > > > > > drm_sched_entity. At first this > > > > > > > > > > seems a bit odd but let us explain the reasoning below. > > > > > > > > > > > > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not > > > > > > > > > > guaranteed to be the same completion even if > > > > > > > > > > targeting the same hardware > > > > > > > > > > engine. This is because in XE we have a firmware scheduler, the GuC, > > > > > > > > > > which allowed to reorder, timeslice, and preempt > > > > > > > > > > submissions. If a using > > > > > > > > > > shared drm_gpu_scheduler across multiple > > > > > > > > > > drm_sched_entity, the TDR falls > > > > > > > > > > apart as the TDR expects submission order == > > > > > > > > > > completion order. Using a > > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this problem. > > > > > > > > > > > > > > > > > > > > 2. In XE submissions are done via programming a > > > > > > > > > > ring buffer (circular > > > > > > > > > > buffer), a drm_gpu_scheduler provides a limit on > > > > > > > > > > number of jobs, if the > > > > > > > > > > limit of number jobs is set to RING_SIZE / > > > > > > > > > > MAX_SIZE_PER_JOB we get flow > > > > > > > > > > control on the ring for free. > > > > > > > > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from? > > > > > > > > > > > > > > > > > > In Nouveau we currently do have such a limitation as > > > > > > > > > well, but it is derived from the RING_SIZE, hence > > > > > > > > > RING_SIZE / MAX_SIZE_PER_JOB would always be 1. > > > > > > > > > However, I think most jobs won't actually utilize > > > > > > > > > the whole ring. > > > > > > > > Well that should probably rather be RING_SIZE / > > > > > > > > MAX_SIZE_PER_JOB = hw_submission_limit (or even > > > > > > > > hw_submission_limit - 1 when the hw can't distinct full > > > > > > > > vs empty ring buffer). > > > > > > > Not sure if I get you right, let me try to clarify what I > > > > > > > was trying to say: I wanted to say that in Nouveau > > > > > > > MAX_SIZE_PER_JOB isn't really limited by anything other than > > > > > > > the RING_SIZE and hence we'd never allow more than 1 active > > > > > > > job. > > > > > > But that lets the hw run dry between submissions. That is > > > > > > usually a pretty horrible idea for performance. > > > > > Correct, that's the reason why I said it seems to be more efficient > > > > > to base ring flow control on the actual size of each incoming job > > > > > rather than the maximum size of a job. > > > > > > > > > > > > However, it seems to be more efficient to base ring flow > > > > > > > control on the actual size of each incoming job rather than > > > > > > > the worst case, namely the maximum size of a job. 
> > > > > > That doesn't sounds like a good idea to me. See we don't limit > > > > > > the number of submitted jobs based on the ring size, but rather > > > > > > we calculate the ring size based on the number of submitted > > > > > > jobs. > > > > > > > > > > > My point isn't really about whether we derive the ring size from the > > > > > job limit or the other way around. It's more about the job size (or > > > > > its maximum size) being arbitrary. > > > > > > > > > > As mentioned in my reply to Matt: > > > > > > > > > > "In Nouveau, userspace can submit an arbitrary amount of addresses > > > > > of indirect bufferes containing the ring instructions. The ring on > > > > > the kernel side takes the addresses of the indirect buffers rather > > > > > than the instructions themself. Hence, technically there isn't > > > > > really a limit on the amount of IBs submitted by a job except for > > > > > the ring size." > > > > > > > > > > So, my point is that I don't really want to limit the job size > > > > > artificially just to be able to fit multiple jobs into the ring even > > > > > if they're submitted at their "artificial" maximum size, but rather > > > > > track how much of the ring the submitted job actually occupies. > > > > > > > > > > > In other words the hw_submission_limit defines the ring size, > > > > > > not the other way around. And you usually want the > > > > > > hw_submission_limit as low as possible for good scheduler > > > > > > granularity and to avoid extra overhead. > > > > > I don't think you really mean "as low as possible", do you? > > > > No, I do mean as low as possible or in other words as few as possible. > > > > > > > > Ideally the scheduler would submit only the minimum amount of work to > > > > the hardware to keep the hardware busy. > > > > > The hardware seems to work mostly the same for all vendors, but you > > > > somehow seem to think that filling the ring is somehow beneficial which > > > > is really not the case as far as I can see. > > > I think that's a misunderstanding. I'm not trying to say that it is *always* > > > beneficial to fill up the ring as much as possible. But I think it is under > > > certain circumstances, exactly those circumstances I described for Nouveau. > > > > > > As mentioned, in Nouveau the size of a job is only really limited by the > > > ring size, which means that one job can (but does not necessarily) fill up > > > the whole ring. We both agree that this is inefficient, because it > > > potentially results into the HW run dry due to hw_submission_limit == 1. > > > > > > I recognize you said that one should define hw_submission_limit and adjust > > > the other parts of the equation accordingly, the options I see are: > > > > > > (1) Increase the ring size while keeping the maximum job size. > > > (2) Decrease the maximum job size while keeping the ring size. > > > (3) Let the scheduler track the actual job size rather than the maximum job > > > size. > > > > > > (1) results into potentially wasted ring memory, because we're not always > > > reaching the maximum job size, but the scheduler assumes so. > > > > > > (2) results into more IOCTLs from userspace for the same amount of IBs and > > > more jobs result into more memory allocations and more work being submitted > > > to the workqueue (with Matt's patches). > > > > > > (3) doesn't seem to have any of those draw backs. > > > > > > What would be your take on that? 
> > > > > > Actually, if none of the other drivers is interested into a more precise way > > > of keeping track of the ring utilization, I'd be totally fine to do it in a > > > driver specific way. However, unfortunately I don't see how this would be > > > possible. > > > > > > My proposal would be to just keep the hw_submission_limit (maybe rename it > > > to submission_unit_limit) and add a submission_units field to struct > > > drm_sched_job. By default a jobs submission_units field would be 0 and the > > > scheduler would behave the exact same way as it does now. > > > > > > Accordingly, jobs with submission_units > 1 would contribute more than one > > > unit to the submission_unit_limit. > > > > > > What do you think about that? > > > > > This seems reasonible to me and a very minimal change to the scheduler. > > If you have a good use case for this then the approach sounds sane to me as > well. > Xe does not have a use case, as the difference between the minimum size of a job and the maximum is not all that large (maybe 100-192 bytes is the range), so the accounting of one unit per job is just fine for now, even though it may waste space. In Nouveau it seems like the min / max job size can vary wildly, so it needs finer grained units to make effective use of the ring space. Updating the scheduler to support this is rather trivial, hence no real opposition from me. Also, I do see this as a valid use case that other drivers, or even perhaps Xe, may use someday. > My question is rather how exactly does Nouveau comes to have this use case? > Allowing the full ring size in the UAPI sounds a bit questionable. > I agree that allowing the user to completely fill the ring is a bit questionable; surely there has to be some upper limit. But let's say it allows 1-64 IBs: that could still IMO be used to justify finer grained accounting in the DRM scheduler since, as stated above, it makes the difference between the min and max job size quite large. Matt > Christian. > > > > > Matt > > > > > Besides all that, you said that filling up the ring just enough to not let > > > the HW run dry rather than filling it up entirely is desirable. Why do you > > > think so? I tend to think that in most cases it shouldn't make difference. > > > > > > - Danilo > > > > > > > Regards, > > > > Christian. > > > > > > > > > Because one really is the minimum if you want to do work at all, but > > > > > as you mentioned above a job limit of one can let the ring run dry. > > > > > > > > > > In the end my proposal comes down to tracking the actual size of a > > > > > job rather than just assuming a pre-defined maximum job size, and > > > > > hence a dynamic job limit. > > > > > > > > > > I don't think this would hurt the scheduler granularity. In fact, it > > > > > should even contribute to the desire of not letting the ring run dry > > > > > even better. Especially for sequences of small jobs, where the > > > > > current implementation might wrongly assume the ring is already full > > > > > although actually there would still be enough space left. > > > > > > > > > > > Christian. > > > > > > > > > > > > > > Otherwise your scheduler might just overwrite the ring > > > > > > > > buffer by pushing things to fast. > > > > > > > > > > > > > > > > Christian.
> > > > > > > > > > > > > > > > > Given that, it seems like it would be better to let > > > > > > > > > the scheduler keep track of empty ring "slots" > > > > > > > > > instead, such that the scheduler can deceide whether > > > > > > > > > a subsequent job will still fit on the ring and if > > > > > > > > > not re-evaluate once a previous job finished. Of > > > > > > > > > course each submitted job would be required to carry > > > > > > > > > the number of slots it requires on the ring. > > > > > > > > > > > > > > > > > > What to you think of implementing this as > > > > > > > > > alternative flow control mechanism? Implementation > > > > > > > > > wise this could be a union with the existing > > > > > > > > > hw_submission_limit. > > > > > > > > > > > > > > > > > > - Danilo > > > > > > > > > > > > > > > > > > > A problem with this design is currently a drm_gpu_scheduler uses a > > > > > > > > > > kthread for submission / job cleanup. This doesn't scale if a large > > > > > > > > > > number of drm_gpu_scheduler are used. To work > > > > > > > > > > around the scaling issue, > > > > > > > > > > use a worker rather than kthread for submission / job cleanup. > > > > > > > > > > > > > > > > > > > > v2: > > > > > > > > > > - (Rob Clark) Fix msm build > > > > > > > > > > - Pass in run work queue > > > > > > > > > > v3: > > > > > > > > > > - (Boris) don't have loop in worker > > > > > > > > > > v4: > > > > > > > > > > - (Tvrtko) break out submit ready, stop, > > > > > > > > > > start helpers into own patch > > > > > > > > > > > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> >
Am 18.08.23 um 13:58 schrieb Danilo Krummrich: > [SNIP] >> I only see two possible outcomes: >> 1. You return -EBUSY (or similar) error code indicating the the hw >> can't receive more commands. >> 2. Wait on previously pushed commands to be executed. >> (3. Your driver crash because you accidentally overwrite stuff in the >> ring buffer which is still executed. I just assume that's prevented). >> >> Resolution #1 with -EBUSY is actually something the UAPI should not >> do, because your UAPI then depends on the specific timing of >> submissions which is a really bad idea. >> >> Resolution #2 is usually bad because it forces the hw to run dry >> between submission and so degrade performance. > > I agree, that is a good reason for at least limiting the maximum job > size to half of the ring size. > > However, there could still be cases where two subsequent jobs are > submitted with just a single IB, which as is would still block > subsequent jobs to be pushed to the ring although there is still > plenty of space. Depending on the (CPU) scheduler latency, such a case > can let the HW run dry as well. Yeah, that was intentionally not done as well. The crux here is that the more you push to the hw the worse the scheduling granularity becomes. It's just that neither Xe nor Nouveau relies that much on the scheduling granularity at all (because of hw queues). But Xe doesn't seem to need that feature and I would still try to avoid it because the more you have pushed to the hw the harder it is to get going again after a reset. > > Surely, we could just continue decrease the maximum job size even > further, but this would result in further overhead on user and kernel > for larger IB counts. Tracking the actual job size seems to be the > better solution for drivers where the job size can vary over a rather > huge range. I strongly disagree on that. A larger ring buffer is trivial to allocate and if userspace submissions are so small that the scheduler can't keep up submitting them then your ring buffer size is your smallest problem. In other words the submission overhead will completely kill your performance and you should probably consider stuffing more into a single submission. Regards, Christian. > > - Danilo
On 8/21/23 16:07, Christian König wrote: > Am 18.08.23 um 13:58 schrieb Danilo Krummrich: >> [SNIP] >>> I only see two possible outcomes: >>> 1. You return -EBUSY (or similar) error code indicating the the hw >>> can't receive more commands. >>> 2. Wait on previously pushed commands to be executed. >>> (3. Your driver crash because you accidentally overwrite stuff in the >>> ring buffer which is still executed. I just assume that's prevented). >>> >>> Resolution #1 with -EBUSY is actually something the UAPI should not >>> do, because your UAPI then depends on the specific timing of >>> submissions which is a really bad idea. >>> >>> Resolution #2 is usually bad because it forces the hw to run dry >>> between submission and so degrade performance. >> >> I agree, that is a good reason for at least limiting the maximum job >> size to half of the ring size. >> >> However, there could still be cases where two subsequent jobs are >> submitted with just a single IB, which as is would still block >> subsequent jobs to be pushed to the ring although there is still >> plenty of space. Depending on the (CPU) scheduler latency, such a case >> can let the HW run dry as well. > > Yeah, that was intentionally not done as well. The crux here is that the > more you push to the hw the worse the scheduling granularity becomes. > It's just that neither Xe nor Nouveau relies that much on the scheduling > granularity at all (because of hw queues). > > But Xe doesn't seem to need that feature and I would still try to avoid > it because the more you have pushed to the hw the harder it is to get > going again after a reset. > >> >> Surely, we could just continue decrease the maximum job size even >> further, but this would result in further overhead on user and kernel >> for larger IB counts. Tracking the actual job size seems to be the >> better solution for drivers where the job size can vary over a rather >> huge range. > > I strongly disagree on that. A larger ring buffer is trivial to allocate That sounds like a workaround to me. The problem, in the case above, isn't that the ring buffer does not have enough space, the problem is that we account for the maximum job size although the actual job size is much smaller. And enabling the scheduler to track the actual job size is trivial as well. > and if userspace submissions are so small that the scheduler can't keep > up submitting them then your ring buffer size is your smallest problem. > > In other words the submission overhead will completely kill your > performance and you should probably consider stuffing more into a single > submission. I fully agree and that is also the reason why I want to keep the maximum job size as large as possible. However, afaik with Vulkan it's the applications themselves deciding when and with how many command buffers a queue is submitted (@Faith: please correct me if I'm wrong). Hence, why not optimize for this case as well? It's not that it would make another case worse, right? - Danilo > > Regards, > Christian. > >> >> - Danilo >
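As a small worked example of the difference being argued about here, with made-up numbers (a 64 KiB ring and jobs anywhere between 1 KiB and 60 KiB; none of these constants come from Nouveau or Xe):

/*
 * Made-up numbers illustrating worst-case vs. actual-size accounting.
 */
#define RING_SIZE	(64 * 1024)
#define MAX_JOB_SIZE	(60 * 1024)	/* worst case a single job may need */

/*
 * Worst-case accounting: the scheduler never keeps more than one job in
 * flight, even when userspace submits a stream of 1 KiB jobs.
 */
#define HW_SUBMISSION_LIMIT	(RING_SIZE / MAX_JOB_SIZE)	/* == 1 */

/* Actual-size accounting: the same ring holds ~60 of those 1 KiB jobs. */
static unsigned int jobs_that_fit(unsigned int job_size)
{
	return RING_SIZE / job_size;
}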
Am 21.08.23 um 20:01 schrieb Danilo Krummrich: > On 8/21/23 16:07, Christian König wrote: >> Am 18.08.23 um 13:58 schrieb Danilo Krummrich: >>> [SNIP] >>>> I only see two possible outcomes: >>>> 1. You return -EBUSY (or similar) error code indicating the the hw >>>> can't receive more commands. >>>> 2. Wait on previously pushed commands to be executed. >>>> (3. Your driver crash because you accidentally overwrite stuff in >>>> the ring buffer which is still executed. I just assume that's >>>> prevented). >>>> >>>> Resolution #1 with -EBUSY is actually something the UAPI should not >>>> do, because your UAPI then depends on the specific timing of >>>> submissions which is a really bad idea. >>>> >>>> Resolution #2 is usually bad because it forces the hw to run dry >>>> between submission and so degrade performance. >>> >>> I agree, that is a good reason for at least limiting the maximum job >>> size to half of the ring size. >>> >>> However, there could still be cases where two subsequent jobs are >>> submitted with just a single IB, which as is would still block >>> subsequent jobs to be pushed to the ring although there is still >>> plenty of space. Depending on the (CPU) scheduler latency, such a >>> case can let the HW run dry as well. >> >> Yeah, that was intentionally not done as well. The crux here is that >> the more you push to the hw the worse the scheduling granularity >> becomes. It's just that neither Xe nor Nouveau relies that much on >> the scheduling granularity at all (because of hw queues). >> >> But Xe doesn't seem to need that feature and I would still try to >> avoid it because the more you have pushed to the hw the harder it is >> to get going again after a reset. >> >>> >>> Surely, we could just continue decrease the maximum job size even >>> further, but this would result in further overhead on user and >>> kernel for larger IB counts. Tracking the actual job size seems to >>> be the better solution for drivers where the job size can vary over >>> a rather huge range. >> >> I strongly disagree on that. A larger ring buffer is trivial to allocate > > That sounds like a workaround to me. The problem, in the case above, > isn't that the ring buffer does not have enough space, the problem is > that we account for the maximum job size although the actual job size > is much smaller. And enabling the scheduler to track the actual job > size is trivial as well. That's what I agree on, so far I just didn't see the reason for doing it but at least a few reason for not doing it. > >> and if userspace submissions are so small that the scheduler can't >> keep up submitting them then your ring buffer size is your smallest >> problem. >> >> In other words the submission overhead will completely kill your >> performance and you should probably consider stuffing more into a >> single submission. > > I fully agree and that is also the reason why I want to keep the > maximum job size as large as possible. > > However, afaik with Vulkan it's the applications themselves deciding > when and with how many command buffers a queue is submitted (@Faith: > please correct me if I'm wrong). Hence, why not optimize for this case > as well? It's not that it would make another case worse, right? As I said it does make both the scheduling granularity as well as the reset behavior worse. In general I think we should try to push just enough work to the hardware to keep it busy and not as much as possible. 
So as long as nobody from userspace comes and says we absolutely need to optimize this use case I would rather not do it. Regards, Christian. > > - Danilo > >> >> Regards, >> Christian. >> >>> >>> - Danilo >> >
On 8/21/23 20:12, Christian König wrote: > Am 21.08.23 um 20:01 schrieb Danilo Krummrich: >> On 8/21/23 16:07, Christian König wrote: >>> Am 18.08.23 um 13:58 schrieb Danilo Krummrich: >>>> [SNIP] >>>>> I only see two possible outcomes: >>>>> 1. You return -EBUSY (or similar) error code indicating the the hw >>>>> can't receive more commands. >>>>> 2. Wait on previously pushed commands to be executed. >>>>> (3. Your driver crash because you accidentally overwrite stuff in >>>>> the ring buffer which is still executed. I just assume that's >>>>> prevented). >>>>> >>>>> Resolution #1 with -EBUSY is actually something the UAPI should not >>>>> do, because your UAPI then depends on the specific timing of >>>>> submissions which is a really bad idea. >>>>> >>>>> Resolution #2 is usually bad because it forces the hw to run dry >>>>> between submission and so degrade performance. >>>> >>>> I agree, that is a good reason for at least limiting the maximum job >>>> size to half of the ring size. >>>> >>>> However, there could still be cases where two subsequent jobs are >>>> submitted with just a single IB, which as is would still block >>>> subsequent jobs to be pushed to the ring although there is still >>>> plenty of space. Depending on the (CPU) scheduler latency, such a >>>> case can let the HW run dry as well. >>> >>> Yeah, that was intentionally not done as well. The crux here is that >>> the more you push to the hw the worse the scheduling granularity >>> becomes. It's just that neither Xe nor Nouveau relies that much on >>> the scheduling granularity at all (because of hw queues). >>> >>> But Xe doesn't seem to need that feature and I would still try to >>> avoid it because the more you have pushed to the hw the harder it is >>> to get going again after a reset. >>> >>>> >>>> Surely, we could just continue decrease the maximum job size even >>>> further, but this would result in further overhead on user and >>>> kernel for larger IB counts. Tracking the actual job size seems to >>>> be the better solution for drivers where the job size can vary over >>>> a rather huge range. >>> >>> I strongly disagree on that. A larger ring buffer is trivial to allocate >> >> That sounds like a workaround to me. The problem, in the case above, >> isn't that the ring buffer does not have enough space, the problem is >> that we account for the maximum job size although the actual job size >> is much smaller. And enabling the scheduler to track the actual job >> size is trivial as well. > > That's what I agree on, so far I just didn't see the reason for doing it > but at least a few reason for not doing it. > >> >>> and if userspace submissions are so small that the scheduler can't >>> keep up submitting them then your ring buffer size is your smallest >>> problem. >>> >>> In other words the submission overhead will completely kill your >>> performance and you should probably consider stuffing more into a >>> single submission. >> >> I fully agree and that is also the reason why I want to keep the >> maximum job size as large as possible. >> >> However, afaik with Vulkan it's the applications themselves deciding >> when and with how many command buffers a queue is submitted (@Faith: >> please correct me if I'm wrong). Hence, why not optimize for this case >> as well? It's not that it would make another case worse, right? > > As I said it does make both the scheduling granularity as well as the > reset behavior worse. As you already mentioned Nouveau (and XE) don't really rely much on scheduling granularity. 
For Nouveau, the same is true for the reset behavior; if things go south, the channel is killed anyway and userspace would simply request a new ring in that case. Hence, I think Nouveau would profit from accounting for the actual job size, while drivers that benefit from always accounting for the maximum job size would keep doing so by default. Arbitrary ratios of how much a job's size counts towards the ring being considered full would also be possible. - Danilo > > In general I think we should try to push just enough work to the > hardware to keep it busy and not as much as possible. > > So as long as nobody from userspace comes and says we absolutely need to > optimize this use case I would rather not do it. > > Regards, > Christian. > >> >> - Danilo >> >>> >>> Regards, >>> Christian. >>> >>>> >>>> - Danilo >>> >> >
On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koenig@amd.com> wrote: > Am 21.08.23 um 20:01 schrieb Danilo Krummrich: > > On 8/21/23 16:07, Christian König wrote: > >> Am 18.08.23 um 13:58 schrieb Danilo Krummrich: > >>> [SNIP] > >>>> I only see two possible outcomes: > >>>> 1. You return -EBUSY (or similar) error code indicating the the hw > >>>> can't receive more commands. > >>>> 2. Wait on previously pushed commands to be executed. > >>>> (3. Your driver crash because you accidentally overwrite stuff in > >>>> the ring buffer which is still executed. I just assume that's > >>>> prevented). > >>>> > >>>> Resolution #1 with -EBUSY is actually something the UAPI should not > >>>> do, because your UAPI then depends on the specific timing of > >>>> submissions which is a really bad idea. > >>>> > >>>> Resolution #2 is usually bad because it forces the hw to run dry > >>>> between submission and so degrade performance. > >>> > >>> I agree, that is a good reason for at least limiting the maximum job > >>> size to half of the ring size. > >>> > >>> However, there could still be cases where two subsequent jobs are > >>> submitted with just a single IB, which as is would still block > >>> subsequent jobs to be pushed to the ring although there is still > >>> plenty of space. Depending on the (CPU) scheduler latency, such a > >>> case can let the HW run dry as well. > >> > >> Yeah, that was intentionally not done as well. The crux here is that > >> the more you push to the hw the worse the scheduling granularity > >> becomes. It's just that neither Xe nor Nouveau relies that much on > >> the scheduling granularity at all (because of hw queues). > >> > >> But Xe doesn't seem to need that feature and I would still try to > >> avoid it because the more you have pushed to the hw the harder it is > >> to get going again after a reset. > >> > >>> > >>> Surely, we could just continue decrease the maximum job size even > >>> further, but this would result in further overhead on user and > >>> kernel for larger IB counts. Tracking the actual job size seems to > >>> be the better solution for drivers where the job size can vary over > >>> a rather huge range. > >> > >> I strongly disagree on that. A larger ring buffer is trivial to > allocate > > > > That sounds like a workaround to me. The problem, in the case above, > > isn't that the ring buffer does not have enough space, the problem is > > that we account for the maximum job size although the actual job size > > is much smaller. And enabling the scheduler to track the actual job > > size is trivial as well. > > That's what I agree on, so far I just didn't see the reason for doing it > but at least a few reason for not doing it. > > > > >> and if userspace submissions are so small that the scheduler can't > >> keep up submitting them then your ring buffer size is your smallest > >> problem. > >> > >> In other words the submission overhead will completely kill your > >> performance and you should probably consider stuffing more into a > >> single submission. > > > > I fully agree and that is also the reason why I want to keep the > > maximum job size as large as possible. > > > > However, afaik with Vulkan it's the applications themselves deciding > > when and with how many command buffers a queue is submitted (@Faith: > > please correct me if I'm wrong). Hence, why not optimize for this case > > as well? It's not that it would make another case worse, right? 
> > As I said it does make both the scheduling granularity as well as the > reset behavior worse. > > In general I think we should try to push just enough work to the > hardware to keep it busy and not as much as possible. > > So as long as nobody from userspace comes and says we absolutely need to > optimize this use case I would rather not do it. > This is a place where nouveau's needs are legitimately different from AMD or Intel, I think. NVIDIA's command streamer model is very different from AMD and Intel. On AMD and Intel, each EXEC turns into a single small packet (on the order of 16B) which kicks off a command buffer. There may be a bit of cache management or something around it but that's it. From there, it's userspace's job to make one command buffer chain to another until it's finally done and then do a "return", whatever that looks like. NVIDIA's model is much more static. Each packet in the HW/FW ring is an address and a size and that much data is processed and then it grabs the next packet and processes. The result is that, if we use multiple buffers of commands, there's no way to chain them together. We just have to pass the whole list of buffers to the kernel. A single EXEC ioctl / job may have 500 such addr+size packets depending on how big the command buffer is. It gets worse on pre-Turing hardware where we have to split the batch for every single DrawIndirect or DispatchIndirect. Lest you think NVIDIA is just crazy here, it's a perfectly reasonable model if you assume that userspace is feeding the firmware. When that's happening, you just have a userspace thread that sits there and feeds the ringbuffer with whatever is next and you can marshal as much data through as you want. Sure, it'd be nice to have a 2nd level batch thing that gets launched from the FW ring and has all the individual launch commands but it's not at all necessary. What does that mean from a gpu_scheduler PoV? Basically, it means a variable packet size. What does this mean for implementation? IDK. One option would be to teach the scheduler about actual job sizes. Another would be to virtualize it and have another layer underneath the scheduler that does the actual feeding of the ring. Another would be to decrease the job size somewhat and then have the front-end submit as many jobs as it needs to service userspace and only put the out-fences on the last job. All the options kinda suck. ~Faith
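As a rough illustration of the "variable packet size" point, the ring footprint of a single job under this model scales with the number of pushbuf references it carries; the layout below is invented for the example and is not the actual NVIDIA/Nouveau format:

#include <linux/types.h>

/* One ring entry: points at one buffer of commands (illustrative layout). */
struct fw_ring_entry {
	u64 addr;	/* GPU address of the pushbuf */
	u32 size;	/* amount of command data to process from it */
};

/*
 * A single EXEC ioctl may reference anywhere from one to hundreds of
 * pushbufs (secondary command buffers, pre-Turing indirect draw splits),
 * so the ring space a job consumes scales with its entry count instead of
 * being a fixed ~16B kick-off packet as on AMD/Intel.
 */
static u32 job_ring_footprint(u32 num_entries)
{
	return num_entries * sizeof(struct fw_ring_entry);
}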
Am 21.08.23 um 21:07 schrieb Danilo Krummrich: > On 8/21/23 20:12, Christian König wrote: >> Am 21.08.23 um 20:01 schrieb Danilo Krummrich: >>> On 8/21/23 16:07, Christian König wrote: >>>> Am 18.08.23 um 13:58 schrieb Danilo Krummrich: >>>>> [SNIP] >>>>>> I only see two possible outcomes: >>>>>> 1. You return -EBUSY (or similar) error code indicating the the >>>>>> hw can't receive more commands. >>>>>> 2. Wait on previously pushed commands to be executed. >>>>>> (3. Your driver crash because you accidentally overwrite stuff in >>>>>> the ring buffer which is still executed. I just assume that's >>>>>> prevented). >>>>>> >>>>>> Resolution #1 with -EBUSY is actually something the UAPI should >>>>>> not do, because your UAPI then depends on the specific timing of >>>>>> submissions which is a really bad idea. >>>>>> >>>>>> Resolution #2 is usually bad because it forces the hw to run dry >>>>>> between submission and so degrade performance. >>>>> >>>>> I agree, that is a good reason for at least limiting the maximum >>>>> job size to half of the ring size. >>>>> >>>>> However, there could still be cases where two subsequent jobs are >>>>> submitted with just a single IB, which as is would still block >>>>> subsequent jobs to be pushed to the ring although there is still >>>>> plenty of space. Depending on the (CPU) scheduler latency, such a >>>>> case can let the HW run dry as well. >>>> >>>> Yeah, that was intentionally not done as well. The crux here is >>>> that the more you push to the hw the worse the scheduling >>>> granularity becomes. It's just that neither Xe nor Nouveau relies >>>> that much on the scheduling granularity at all (because of hw queues). >>>> >>>> But Xe doesn't seem to need that feature and I would still try to >>>> avoid it because the more you have pushed to the hw the harder it >>>> is to get going again after a reset. >>>> >>>>> >>>>> Surely, we could just continue decrease the maximum job size even >>>>> further, but this would result in further overhead on user and >>>>> kernel for larger IB counts. Tracking the actual job size seems to >>>>> be the better solution for drivers where the job size can vary >>>>> over a rather huge range. >>>> >>>> I strongly disagree on that. A larger ring buffer is trivial to >>>> allocate >>> >>> That sounds like a workaround to me. The problem, in the case above, >>> isn't that the ring buffer does not have enough space, the problem >>> is that we account for the maximum job size although the actual job >>> size is much smaller. And enabling the scheduler to track the actual >>> job size is trivial as well. >> >> That's what I agree on, so far I just didn't see the reason for doing >> it but at least a few reason for not doing it. >> >>> >>>> and if userspace submissions are so small that the scheduler can't >>>> keep up submitting them then your ring buffer size is your smallest >>>> problem. >>>> >>>> In other words the submission overhead will completely kill your >>>> performance and you should probably consider stuffing more into a >>>> single submission. >>> >>> I fully agree and that is also the reason why I want to keep the >>> maximum job size as large as possible. >>> >>> However, afaik with Vulkan it's the applications themselves deciding >>> when and with how many command buffers a queue is submitted (@Faith: >>> please correct me if I'm wrong). Hence, why not optimize for this >>> case as well? It's not that it would make another case worse, right? 
>> >> As I said it does make both the scheduling granularity as well as the >> reset behavior worse. > > As you already mentioned Nouveau (and XE) don't really rely much on > scheduling granularity. For Nouveau, the same is true for the reset > behavior; if things go south the channel is killed anyway. Userspace > would just request a new ring in this case. > > Hence, I think Nouveau would profit from accounting the actual job > size. And at the same time, other drivers having a benefit of always > accounting for the maximum job size would still do so, by default. > > Arbitrary ratios of how much the job size contributes to the ring > being considered as full would also be possible. That would indeed be rather interesting since for a bunch of drivers the limiting part is not the ring buffer size, but rather the utilization of engines. But no idea how to properly design that. You would have multiple values to check instead of just one. Christian. > > - Danilo > >> >> In general I think we should try to push just enough work to the >> hardware to keep it busy and not as much as possible. >> >> So as long as nobody from userspace comes and says we absolutely need >> to optimize this use case I would rather not do it. >> >> Regards, >> Christian. >> >>> >>> - Danilo >>> >>>> >>>> Regards, >>>> Christian. >>>> >>>>> >>>>> - Danilo >>>> >>> >> >
Am 21.08.23 um 21:46 schrieb Faith Ekstrand: > On Mon, Aug 21, 2023 at 1:13 PM Christian König > <christian.koenig@amd.com> wrote: > > [SNIP] > So as long as nobody from userspace comes and says we absolutely > need to > optimize this use case I would rather not do it. > > > This is a place where nouveau's needs are legitimately different from > AMD or Intel, I think. NVIDIA's command streamer model is very > different from AMD and Intel. On AMD and Intel, each EXEC turns into > a single small packet (on the order of 16B) which kicks off a command > buffer. There may be a bit of cache management or something around it > but that's it. From there, it's userspace's job to make one command > buffer chain to another until it's finally done and then do a > "return", whatever that looks like. > > NVIDIA's model is much more static. Each packet in the HW/FW ring is > an address and a size and that much data is processed and then it > grabs the next packet and processes. The result is that, if we use > multiple buffers of commands, there's no way to chain them together. > We just have to pass the whole list of buffers to the kernel. So far that is actually completely identical to what AMD has. > A single EXEC ioctl / job may have 500 such addr+size packets > depending on how big the command buffer is. And that is what I don't understand. Why would you need hundreds of such addr+size packets? This is basically identical to what AMD has (well on newer hw there is an extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't widely used as far as I know). Previously the limit was something like 4 which we extended because Bas came up with similar requirements for the AMD side from RADV. But essentially those approaches with hundreds of IBs don't sound like a good idea to me. > It gets worse on pre-Turing hardware where we have to split the batch > for every single DrawIndirect or DispatchIndirect. > > Lest you think NVIDIA is just crazy here, it's a perfectly reasonable > model if you assume that userspace is feeding the firmware. When > that's happening, you just have a userspace thread that sits there and > feeds the ringbuffer with whatever is next and you can marshal as much > data through as you want. Sure, it'd be nice to have a 2nd level batch > thing that gets launched from the FW ring and has all the individual > launch commands but it's not at all necessary. > > What does that mean from a gpu_scheduler PoV? Basically, it means a > variable packet size. > > What does this mean for implementation? IDK. One option would be to > teach the scheduler about actual job sizes. Another would be to > virtualize it and have another layer underneath the scheduler that > does the actual feeding of the ring. Another would be to decrease the > job size somewhat and then have the front-end submit as many jobs as > it needs to service userspace and only put the out-fences on the last > job. All the options kinda suck. Yeah, agree. The job-size tracking Danilo suggested is still the least painful. Christian. > > ~Faith
On Tue, Aug 22, 2023 at 4:51 AM Christian König <christian.koenig@amd.com> wrote: > Am 21.08.23 um 21:46 schrieb Faith Ekstrand: > > On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koenig@amd.com> > wrote: > >> [SNIP] >> So as long as nobody from userspace comes and says we absolutely need to >> optimize this use case I would rather not do it. >> > > This is a place where nouveau's needs are legitimately different from AMD > or Intel, I think. NVIDIA's command streamer model is very different from > AMD and Intel. On AMD and Intel, each EXEC turns into a single small > packet (on the order of 16B) which kicks off a command buffer. There may > be a bit of cache management or something around it but that's it. From > there, it's userspace's job to make one command buffer chain to another > until it's finally done and then do a "return", whatever that looks like. > > NVIDIA's model is much more static. Each packet in the HW/FW ring is an > address and a size and that much data is processed and then it grabs the > next packet and processes. The result is that, if we use multiple buffers > of commands, there's no way to chain them together. We just have to pass > the whole list of buffers to the kernel. > > > So far that is actually completely identical to what AMD has. > > A single EXEC ioctl / job may have 500 such addr+size packets depending on > how big the command buffer is. > > > And that is what I don't understand. Why would you need 100dreds of such > addr+size packets? > Well, we're not really in control of it. We can control our base pushbuf size and that's something we can tune but we're still limited by the client. We have to submit another pushbuf whenever: 1. We run out of space (power-of-two growth is also possible but the size is limited to a maximum of about 4MiB due to hardware limitations.) 2. The client calls a secondary command buffer. 3. Any usage of indirect draw or dispatch on pre-Turing hardware. At some point we need to tune our BO size a bit to avoid (1) while also avoiding piles of tiny BOs. However, (2) and (3) are out of our control. This is basically identical to what AMD has (well on newer hw there is an > extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't > widely used as far as I know). > According to Bas, RADV chains on recent hardware. > Previously the limit was something like 4 which we extended to because Bas > came up with similar requirements for the AMD side from RADV. > > But essentially those approaches with 100dreds of IBs doesn't sound like a > good idea to me. > No one's arguing that they like it. Again, the hardware isn't designed to have a kernel in the way. It's designed to be fed by userspace. But we're going to have the kernel in the middle for a while so we need to make it not suck too bad. ~Faith It gets worse on pre-Turing hardware where we have to split the batch for > every single DrawIndirect or DispatchIndirect. > > Lest you think NVIDIA is just crazy here, it's a perfectly reasonable > model if you assume that userspace is feeding the firmware. When that's > happening, you just have a userspace thread that sits there and feeds the > ringbuffer with whatever is next and you can marshal as much data through > as you want. Sure, it'd be nice to have a 2nd level batch thing that gets > launched from the FW ring and has all the individual launch commands but > it's not at all necessary. > > What does that mean from a gpu_scheduler PoV? Basically, it means a > variable packet size. 
> > What does this mean for implementation? IDK. One option would be to teach > the scheduler about actual job sizes. Another would be to virtualize it and > have another layer underneath the scheduler that does the actual feeding of > the ring. Another would be to decrease the job size somewhat and then have > the front-end submit as many jobs as it needs to service userspace and only > put the out-fences on the last job. All the options kinda suck. > > > Yeah, agree. The job size Danilo suggested is still the least painful. > > Christian. > > > ~Faith > > >
On Tue, Aug 22, 2023 at 6:55 PM Faith Ekstrand <faith@gfxstrand.net> wrote: > On Tue, Aug 22, 2023 at 4:51 AM Christian König <christian.koenig@amd.com> > wrote: > >> Am 21.08.23 um 21:46 schrieb Faith Ekstrand: >> >> On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koenig@amd.com> >> wrote: >> >>> [SNIP] >>> So as long as nobody from userspace comes and says we absolutely need to >>> optimize this use case I would rather not do it. >>> >> >> This is a place where nouveau's needs are legitimately different from AMD >> or Intel, I think. NVIDIA's command streamer model is very different from >> AMD and Intel. On AMD and Intel, each EXEC turns into a single small >> packet (on the order of 16B) which kicks off a command buffer. There may >> be a bit of cache management or something around it but that's it. From >> there, it's userspace's job to make one command buffer chain to another >> until it's finally done and then do a "return", whatever that looks like. >> >> NVIDIA's model is much more static. Each packet in the HW/FW ring is an >> address and a size and that much data is processed and then it grabs the >> next packet and processes. The result is that, if we use multiple buffers >> of commands, there's no way to chain them together. We just have to pass >> the whole list of buffers to the kernel. >> >> >> So far that is actually completely identical to what AMD has. >> >> A single EXEC ioctl / job may have 500 such addr+size packets depending >> on how big the command buffer is. >> >> >> And that is what I don't understand. Why would you need 100dreds of such >> addr+size packets? >> > > Well, we're not really in control of it. We can control our base pushbuf > size and that's something we can tune but we're still limited by the > client. We have to submit another pushbuf whenever: > > 1. We run out of space (power-of-two growth is also possible but the size > is limited to a maximum of about 4MiB due to hardware limitations.) > 2. The client calls a secondary command buffer. > 3. Any usage of indirect draw or dispatch on pre-Turing hardware. > > At some point we need to tune our BO size a bit to avoid (1) while also > avoiding piles of tiny BOs. However, (2) and (3) are out of our control. > > This is basically identical to what AMD has (well on newer hw there is an >> extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't >> widely used as far as I know). >> > > According to Bas, RADV chains on recent hardware. > well: 1) on GFX6 and older we can't chain at all 2) on Compute/DMA we can't chain at all 3) with VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT we can't chain between cmdbuffers 4) for some secondary use cases we can't chain. so we have to do the "submit multiple" dance in many cases. > > >> Previously the limit was something like 4 which we extended to because >> Bas came up with similar requirements for the AMD side from RADV. >> >> But essentially those approaches with 100dreds of IBs doesn't sound like >> a good idea to me. >> > > No one's arguing that they like it. Again, the hardware isn't designed to > have a kernel in the way. It's designed to be fed by userspace. But we're > going to have the kernel in the middle for a while so we need to make it > not suck too bad. > > ~Faith > > It gets worse on pre-Turing hardware where we have to split the batch for >> every single DrawIndirect or DispatchIndirect. >> >> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable >> model if you assume that userspace is feeding the firmware. 
When that's >> happening, you just have a userspace thread that sits there and feeds the >> ringbuffer with whatever is next and you can marshal as much data through >> as you want. Sure, it'd be nice to have a 2nd level batch thing that gets >> launched from the FW ring and has all the individual launch commands but >> it's not at all necessary. >> >> What does that mean from a gpu_scheduler PoV? Basically, it means a >> variable packet size. >> >> What does this mean for implementation? IDK. One option would be to >> teach the scheduler about actual job sizes. Another would be to virtualize >> it and have another layer underneath the scheduler that does the actual >> feeding of the ring. Another would be to decrease the job size somewhat and >> then have the front-end submit as many jobs as it needs to service >> userspace and only put the out-fences on the last job. All the options >> kinda suck. >> >> >> Yeah, agree. The job size Danilo suggested is still the least painful. >> >> Christian. >> >> >> ~Faith >> >> >>
On Thu, 17 Aug 2023 13:13:31 +0200 Danilo Krummrich <dakr@redhat.com> wrote: > I think that's a misunderstanding. I'm not trying to say that it is > *always* beneficial to fill up the ring as much as possible. But I think > it is under certain circumstances, exactly those circumstances I > described for Nouveau. > > As mentioned, in Nouveau the size of a job is only really limited by the > ring size, which means that one job can (but does not necessarily) fill > up the whole ring. We both agree that this is inefficient, because it > potentially results into the HW run dry due to hw_submission_limit == 1. > > I recognize you said that one should define hw_submission_limit and > adjust the other parts of the equation accordingly, the options I see are: > > (1) Increase the ring size while keeping the maximum job size. > (2) Decrease the maximum job size while keeping the ring size. > (3) Let the scheduler track the actual job size rather than the maximum > job size. > > (1) results into potentially wasted ring memory, because we're not > always reaching the maximum job size, but the scheduler assumes so. > > (2) results into more IOCTLs from userspace for the same amount of IBs > and more jobs result into more memory allocations and more work being > submitted to the workqueue (with Matt's patches). > > (3) doesn't seem to have any of those draw backs. > > What would be your take on that? > > Actually, if none of the other drivers is interested into a more precise > way of keeping track of the ring utilization, I'd be totally fine to do > it in a driver specific way. However, unfortunately I don't see how this > would be possible. I'm not entirely sure, but I think PowerVR is pretty close to your description: jobs size is dynamic size, and the ring buffer size is picked by the driver at queue initialization time. What we did was to set hw_submission_limit to an arbitrarily high value of 64k (we could have used something like ringbuf_size/min_job_size instead), and then have the control flow implemented with ->prepare_job() [1] (CCCB is the PowerVR ring buffer). This allows us to maximize ring buffer utilization while still allowing dynamic-size jobs. > > My proposal would be to just keep the hw_submission_limit (maybe rename > it to submission_unit_limit) and add a submission_units field to struct > drm_sched_job. By default a jobs submission_units field would be 0 and > the scheduler would behave the exact same way as it does now. > > Accordingly, jobs with submission_units > 1 would contribute more than > one unit to the submission_unit_limit. > > What do you think about that? > > Besides all that, you said that filling up the ring just enough to not > let the HW run dry rather than filling it up entirely is desirable. Why > do you think so? I tend to think that in most cases it shouldn't make > difference. [1]https://gitlab.freedesktop.org/frankbinns/powervr/-/blob/powervr-next/drivers/gpu/drm/imagination/pvr_queue.c#L502
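For readers not following the link, a condensed sketch of that ->prepare_job() based flow control; struct my_job, my_queue and the two ring-space helpers are hypothetical stand-ins for what the PowerVR code behind the pvr_queue.c link actually does:

#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <linux/types.h>
#include <drm/gpu_scheduler.h>

struct my_queue;

struct my_job {
	struct drm_sched_job base;
	struct my_queue *queue;
	u32 ring_units;		/* ring space this job will consume */
};

#define to_my_job(job) container_of(job, struct my_job, base)

/*
 * Hypothetical driver helpers: try to reserve space, or hand out a fence
 * that the driver signals from its job-done path once enough previously
 * queued work has retired from the ring.
 */
bool my_queue_reserve_ring_space(struct my_queue *queue, u32 units);
struct dma_fence *my_queue_get_ring_space_fence(struct my_queue *queue,
						u32 units);

static struct dma_fence *
my_sched_prepare_job(struct drm_sched_job *sched_job,
		     struct drm_sched_entity *s_entity)
{
	struct my_job *job = to_my_job(sched_job);

	/* Fits right now: no extra dependency for the scheduler to wait on. */
	if (my_queue_reserve_ring_space(job->queue, job->ring_units))
		return NULL;

	/*
	 * Otherwise block this job on a "ring space available" fence; the
	 * scheduler calls ->prepare_job() again once that fence signals.
	 */
	return my_queue_get_ring_space_fence(job->queue, job->ring_units);
}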
On 9/12/23 16:28, Boris Brezillon wrote: > On Thu, 17 Aug 2023 13:13:31 +0200 > Danilo Krummrich <dakr@redhat.com> wrote: > >> I think that's a misunderstanding. I'm not trying to say that it is >> *always* beneficial to fill up the ring as much as possible. But I think >> it is under certain circumstances, exactly those circumstances I >> described for Nouveau. >> >> As mentioned, in Nouveau the size of a job is only really limited by the >> ring size, which means that one job can (but does not necessarily) fill >> up the whole ring. We both agree that this is inefficient, because it >> potentially results into the HW run dry due to hw_submission_limit == 1. >> >> I recognize you said that one should define hw_submission_limit and >> adjust the other parts of the equation accordingly, the options I see are: >> >> (1) Increase the ring size while keeping the maximum job size. >> (2) Decrease the maximum job size while keeping the ring size. >> (3) Let the scheduler track the actual job size rather than the maximum >> job size. >> >> (1) results into potentially wasted ring memory, because we're not >> always reaching the maximum job size, but the scheduler assumes so. >> >> (2) results into more IOCTLs from userspace for the same amount of IBs >> and more jobs result into more memory allocations and more work being >> submitted to the workqueue (with Matt's patches). >> >> (3) doesn't seem to have any of those draw backs. >> >> What would be your take on that? >> >> Actually, if none of the other drivers is interested into a more precise >> way of keeping track of the ring utilization, I'd be totally fine to do >> it in a driver specific way. However, unfortunately I don't see how this >> would be possible. > > I'm not entirely sure, but I think PowerVR is pretty close to your > description: jobs size is dynamic size, and the ring buffer size is > picked by the driver at queue initialization time. What we did was to > set hw_submission_limit to an arbitrarily high value of 64k (we could > have used something like ringbuf_size/min_job_size instead), and then > have the control flow implemented with ->prepare_job() [1] (CCCB is the > PowerVR ring buffer). This allows us to maximize ring buffer utilization > while still allowing dynamic-size jobs. I guess this would work, but I think it would be better to bake this in, especially if more drivers do have this need. I already have an implementation [1] for doing that in the scheduler. My plan was to push that as soon as Matt sends out V3. [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95 > >> >> My proposal would be to just keep the hw_submission_limit (maybe rename >> it to submission_unit_limit) and add a submission_units field to struct >> drm_sched_job. By default a jobs submission_units field would be 0 and >> the scheduler would behave the exact same way as it does now. >> >> Accordingly, jobs with submission_units > 1 would contribute more than >> one unit to the submission_unit_limit. >> >> What do you think about that? >> >> Besides all that, you said that filling up the ring just enough to not >> let the HW run dry rather than filling it up entirely is desirable. Why >> do you think so? I tend to think that in most cases it shouldn't make >> difference. > > [1]https://gitlab.freedesktop.org/frankbinns/powervr/-/blob/powervr-next/drivers/gpu/drm/imagination/pvr_queue.c#L502 >
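To make the "submission units" idea concrete, a standalone model of the accounting follows. This is not the real drm_sched code and the linked gitlab commit is the authoritative version; the names merely follow the submission_units naming from the quoted proposal:

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical core-side state: a unit budget instead of a job count. */
struct unit_limit {
	atomic_t in_flight;	/* units currently pushed to the HW */
	u32 limit;		/* would replace the job-count hw_submission_limit */
};

/*
 * A job that does not declare a size counts as one unit, so drivers that
 * never set submission_units keep today's behaviour unchanged.
 */
static u32 job_units(u32 submission_units)
{
	return submission_units ?: 1;
}

static bool can_queue(struct unit_limit *u, u32 submission_units)
{
	return atomic_read(&u->in_flight) + job_units(submission_units) <=
	       u->limit;
}

static void job_pushed(struct unit_limit *u, u32 submission_units)
{
	atomic_add(job_units(submission_units), &u->in_flight);
}

static void job_retired(struct unit_limit *u, u32 submission_units)
{
	atomic_sub(job_units(submission_units), &u->in_flight);
}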
On Tue, 12 Sep 2023 16:33:01 +0200 Danilo Krummrich <dakr@redhat.com> wrote: > On 9/12/23 16:28, Boris Brezillon wrote: > > On Thu, 17 Aug 2023 13:13:31 +0200 > > Danilo Krummrich <dakr@redhat.com> wrote: > > > >> I think that's a misunderstanding. I'm not trying to say that it is > >> *always* beneficial to fill up the ring as much as possible. But I think > >> it is under certain circumstances, exactly those circumstances I > >> described for Nouveau. > >> > >> As mentioned, in Nouveau the size of a job is only really limited by the > >> ring size, which means that one job can (but does not necessarily) fill > >> up the whole ring. We both agree that this is inefficient, because it > >> potentially results into the HW run dry due to hw_submission_limit == 1. > >> > >> I recognize you said that one should define hw_submission_limit and > >> adjust the other parts of the equation accordingly, the options I see are: > >> > >> (1) Increase the ring size while keeping the maximum job size. > >> (2) Decrease the maximum job size while keeping the ring size. > >> (3) Let the scheduler track the actual job size rather than the maximum > >> job size. > >> > >> (1) results into potentially wasted ring memory, because we're not > >> always reaching the maximum job size, but the scheduler assumes so. > >> > >> (2) results into more IOCTLs from userspace for the same amount of IBs > >> and more jobs result into more memory allocations and more work being > >> submitted to the workqueue (with Matt's patches). > >> > >> (3) doesn't seem to have any of those draw backs. > >> > >> What would be your take on that? > >> > >> Actually, if none of the other drivers is interested into a more precise > >> way of keeping track of the ring utilization, I'd be totally fine to do > >> it in a driver specific way. However, unfortunately I don't see how this > >> would be possible. > > > > I'm not entirely sure, but I think PowerVR is pretty close to your > > description: jobs size is dynamic size, and the ring buffer size is > > picked by the driver at queue initialization time. What we did was to > > set hw_submission_limit to an arbitrarily high value of 64k (we could > > have used something like ringbuf_size/min_job_size instead), and then > > have the control flow implemented with ->prepare_job() [1] (CCCB is the > > PowerVR ring buffer). This allows us to maximize ring buffer utilization > > while still allowing dynamic-size jobs. > > I guess this would work, but I think it would be better to bake this in, > especially if more drivers do have this need. I already have an > implementation [1] for doing that in the scheduler. My plan was to push > that as soon as Matt sends out V3. > > [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95 PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that native fences waits are passed to the FW, and those add to the job size. When we know our job is ready for execution (all non-native deps are signaled), we evict already signaled native-deps (or native fences) to shrink the job size further more, but that's something we need to calculate late if we want the job size to be minimal. Of course, we can always over-estimate the job size, but if we go for a full-blown drm_sched integration, I wonder if it wouldn't be preferable to have a ->get_job_size() callback returning the number of units needed by job, and have the core pick 1 when the hook is not implemented.
On Tue, 12 Sep 2023 16:49:09 +0200 Boris Brezillon <boris.brezillon@collabora.com> wrote: > On Tue, 12 Sep 2023 16:33:01 +0200 > Danilo Krummrich <dakr@redhat.com> wrote: > > > On 9/12/23 16:28, Boris Brezillon wrote: > > > On Thu, 17 Aug 2023 13:13:31 +0200 > > > Danilo Krummrich <dakr@redhat.com> wrote: > > > > > >> I think that's a misunderstanding. I'm not trying to say that it is > > >> *always* beneficial to fill up the ring as much as possible. But I think > > >> it is under certain circumstances, exactly those circumstances I > > >> described for Nouveau. > > >> > > >> As mentioned, in Nouveau the size of a job is only really limited by the > > >> ring size, which means that one job can (but does not necessarily) fill > > >> up the whole ring. We both agree that this is inefficient, because it > > >> potentially results into the HW run dry due to hw_submission_limit == 1. > > >> > > >> I recognize you said that one should define hw_submission_limit and > > >> adjust the other parts of the equation accordingly, the options I see are: > > >> > > >> (1) Increase the ring size while keeping the maximum job size. > > >> (2) Decrease the maximum job size while keeping the ring size. > > >> (3) Let the scheduler track the actual job size rather than the maximum > > >> job size. > > >> > > >> (1) results into potentially wasted ring memory, because we're not > > >> always reaching the maximum job size, but the scheduler assumes so. > > >> > > >> (2) results into more IOCTLs from userspace for the same amount of IBs > > >> and more jobs result into more memory allocations and more work being > > >> submitted to the workqueue (with Matt's patches). > > >> > > >> (3) doesn't seem to have any of those draw backs. > > >> > > >> What would be your take on that? > > >> > > >> Actually, if none of the other drivers is interested into a more precise > > >> way of keeping track of the ring utilization, I'd be totally fine to do > > >> it in a driver specific way. However, unfortunately I don't see how this > > >> would be possible. > > > > > > I'm not entirely sure, but I think PowerVR is pretty close to your > > > description: jobs size is dynamic size, and the ring buffer size is > > > picked by the driver at queue initialization time. What we did was to > > > set hw_submission_limit to an arbitrarily high value of 64k (we could > > > have used something like ringbuf_size/min_job_size instead), and then > > > have the control flow implemented with ->prepare_job() [1] (CCCB is the > > > PowerVR ring buffer). This allows us to maximize ring buffer utilization > > > while still allowing dynamic-size jobs. > > > > I guess this would work, but I think it would be better to bake this in, > > especially if more drivers do have this need. I already have an > > implementation [1] for doing that in the scheduler. My plan was to push > > that as soon as Matt sends out V3. > > > > [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95 > > PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that > native fences waits are passed to the FW, and those add to the job size. > When we know our job is ready for execution (all non-native deps are > signaled), we evict already signaled native-deps (or native fences) to > shrink the job size further more, but that's something we need to > calculate late if we want the job size to be minimal. 
Of course, we can > always over-estimate the job size, but if we go for a full-blown > drm_sched integration, I wonder if it wouldn't be preferable to have a > ->get_job_size() callback returning the number of units needed by job, > and have the core pick 1 when the hook is not implemented. FWIW, I think last time I asked how to do that, I've been pointed to ->prepare_job() by someone (don't remember if it was Daniel or Christian), hence the PowerVR implementation. If that's still the preferred solution, there's some opportunity to have a generic layer to automate ringbuf utilization tracking and some helpers to prepare wait_for_ringbuf dma_fences that drivers could return from ->prepare_job() (those fences would then be signaled when the driver calls drm_ringbuf_job_done() and the next job waiting for ringbuf space now fits in the ringbuf).
On 9/12/23 16:49, Boris Brezillon wrote: > On Tue, 12 Sep 2023 16:33:01 +0200 > Danilo Krummrich <dakr@redhat.com> wrote: > >> On 9/12/23 16:28, Boris Brezillon wrote: >>> On Thu, 17 Aug 2023 13:13:31 +0200 >>> Danilo Krummrich <dakr@redhat.com> wrote: >>> >>>> I think that's a misunderstanding. I'm not trying to say that it is >>>> *always* beneficial to fill up the ring as much as possible. But I think >>>> it is under certain circumstances, exactly those circumstances I >>>> described for Nouveau. >>>> >>>> As mentioned, in Nouveau the size of a job is only really limited by the >>>> ring size, which means that one job can (but does not necessarily) fill >>>> up the whole ring. We both agree that this is inefficient, because it >>>> potentially results into the HW run dry due to hw_submission_limit == 1. >>>> >>>> I recognize you said that one should define hw_submission_limit and >>>> adjust the other parts of the equation accordingly, the options I see are: >>>> >>>> (1) Increase the ring size while keeping the maximum job size. >>>> (2) Decrease the maximum job size while keeping the ring size. >>>> (3) Let the scheduler track the actual job size rather than the maximum >>>> job size. >>>> >>>> (1) results into potentially wasted ring memory, because we're not >>>> always reaching the maximum job size, but the scheduler assumes so. >>>> >>>> (2) results into more IOCTLs from userspace for the same amount of IBs >>>> and more jobs result into more memory allocations and more work being >>>> submitted to the workqueue (with Matt's patches). >>>> >>>> (3) doesn't seem to have any of those draw backs. >>>> >>>> What would be your take on that? >>>> >>>> Actually, if none of the other drivers is interested into a more precise >>>> way of keeping track of the ring utilization, I'd be totally fine to do >>>> it in a driver specific way. However, unfortunately I don't see how this >>>> would be possible. >>> >>> I'm not entirely sure, but I think PowerVR is pretty close to your >>> description: jobs size is dynamic size, and the ring buffer size is >>> picked by the driver at queue initialization time. What we did was to >>> set hw_submission_limit to an arbitrarily high value of 64k (we could >>> have used something like ringbuf_size/min_job_size instead), and then >>> have the control flow implemented with ->prepare_job() [1] (CCCB is the >>> PowerVR ring buffer). This allows us to maximize ring buffer utilization >>> while still allowing dynamic-size jobs. >> >> I guess this would work, but I think it would be better to bake this in, >> especially if more drivers do have this need. I already have an >> implementation [1] for doing that in the scheduler. My plan was to push >> that as soon as Matt sends out V3. >> >> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95 > > PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that > native fences waits are passed to the FW, and those add to the job size. > When we know our job is ready for execution (all non-native deps are > signaled), we evict already signaled native-deps (or native fences) to > shrink the job size further more, but that's something we need to > calculate late if we want the job size to be minimal. 
Of course, we can > always over-estimate the job size, but if we go for a full-blown > drm_sched integration, I wonder if it wouldn't be preferable to have a > ->get_job_size() callback returning the number of units needed by job, > and have the core pick 1 when the hook is not implemented. > Sure, why not. Sounds reasonable to me.
On 9/12/23 17:13, Boris Brezillon wrote: > On Tue, 12 Sep 2023 16:49:09 +0200 > Boris Brezillon <boris.brezillon@collabora.com> wrote: > >> On Tue, 12 Sep 2023 16:33:01 +0200 >> Danilo Krummrich <dakr@redhat.com> wrote: >> >>> On 9/12/23 16:28, Boris Brezillon wrote: >>>> On Thu, 17 Aug 2023 13:13:31 +0200 >>>> Danilo Krummrich <dakr@redhat.com> wrote: >>>> >>>>> I think that's a misunderstanding. I'm not trying to say that it is >>>>> *always* beneficial to fill up the ring as much as possible. But I think >>>>> it is under certain circumstances, exactly those circumstances I >>>>> described for Nouveau. >>>>> >>>>> As mentioned, in Nouveau the size of a job is only really limited by the >>>>> ring size, which means that one job can (but does not necessarily) fill >>>>> up the whole ring. We both agree that this is inefficient, because it >>>>> potentially results into the HW run dry due to hw_submission_limit == 1. >>>>> >>>>> I recognize you said that one should define hw_submission_limit and >>>>> adjust the other parts of the equation accordingly, the options I see are: >>>>> >>>>> (1) Increase the ring size while keeping the maximum job size. >>>>> (2) Decrease the maximum job size while keeping the ring size. >>>>> (3) Let the scheduler track the actual job size rather than the maximum >>>>> job size. >>>>> >>>>> (1) results into potentially wasted ring memory, because we're not >>>>> always reaching the maximum job size, but the scheduler assumes so. >>>>> >>>>> (2) results into more IOCTLs from userspace for the same amount of IBs >>>>> and more jobs result into more memory allocations and more work being >>>>> submitted to the workqueue (with Matt's patches). >>>>> >>>>> (3) doesn't seem to have any of those draw backs. >>>>> >>>>> What would be your take on that? >>>>> >>>>> Actually, if none of the other drivers is interested into a more precise >>>>> way of keeping track of the ring utilization, I'd be totally fine to do >>>>> it in a driver specific way. However, unfortunately I don't see how this >>>>> would be possible. >>>> >>>> I'm not entirely sure, but I think PowerVR is pretty close to your >>>> description: jobs size is dynamic size, and the ring buffer size is >>>> picked by the driver at queue initialization time. What we did was to >>>> set hw_submission_limit to an arbitrarily high value of 64k (we could >>>> have used something like ringbuf_size/min_job_size instead), and then >>>> have the control flow implemented with ->prepare_job() [1] (CCCB is the >>>> PowerVR ring buffer). This allows us to maximize ring buffer utilization >>>> while still allowing dynamic-size jobs. >>> >>> I guess this would work, but I think it would be better to bake this in, >>> especially if more drivers do have this need. I already have an >>> implementation [1] for doing that in the scheduler. My plan was to push >>> that as soon as Matt sends out V3. >>> >>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95 >> >> PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that >> native fences waits are passed to the FW, and those add to the job size. >> When we know our job is ready for execution (all non-native deps are >> signaled), we evict already signaled native-deps (or native fences) to >> shrink the job size further more, but that's something we need to >> calculate late if we want the job size to be minimal. 
Of course, we can >> always over-estimate the job size, but if we go for a full-blown >> drm_sched integration, I wonder if it wouldn't be preferable to have a >> ->get_job_size() callback returning the number of units needed by job, >> and have the core pick 1 when the hook is not implemented. > > FWIW, I think last time I asked how to do that, I've been pointed to > ->prepare_job() by someone (don't remember if it was Daniel or > Christian), hence the PowerVR implementation. If that's still the > preferred solution, there's some opportunity to have a generic layer to > automate ringbuf utilization tracking and some helpers to prepare > wait_for_ringbuf dma_fences that drivers could return from > ->prepare_job() (those fences would then be signaled when the driver > calls drm_ringbuf_job_done() and the next job waiting for ringbuf space > now fits in the ringbuf). > I'm not sure I like that; it's basically a different implementation to work around the limitations of the implementation that is supposed to cover this case in general.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5f5efec6c444..3ebf9882edba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2331,7 +2331,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 			break;
 		}
 
-		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops, NULL,
 				   ring->num_hw_submission, 0,
 				   timeout, adev->reset_domain->wq,
 				   ring->sched_score, ring->name,
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 1ae87dfd19c4..8486a2923f1b 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -133,7 +133,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 {
 	int ret;
 
-	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
+	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
 			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
 			     msecs_to_jiffies(500), NULL, NULL,
 			     dev_name(gpu->dev), gpu->dev);
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index ffd91a5ee299..8d858aed0e56 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -488,7 +488,7 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 
 	INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
-	return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
+	return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
 			      lima_job_hang_limit,
 			      msecs_to_jiffies(timeout), NULL,
 			      NULL, name, pipe->ldev->dev);
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index b60199184409..79aa112118da 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -93,7 +93,7 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
 	 /* currently managing hangcheck ourselves: */
 	sched_timeout = MAX_SCHEDULE_TIMEOUT;
 
-	ret = drm_sched_init(&ring->sched, &msm_sched_ops,
+	ret = drm_sched_init(&ring->sched, &msm_sched_ops, NULL,
 			num_hw_submissions, 0, sched_timeout,
 			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
 	if (ret) {
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 3424a1bf6af3..2d607de5d393 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -407,7 +407,7 @@ int nouveau_sched_init(struct nouveau_drm *drm)
 	if (!drm->sched_wq)
 		return -ENOMEM;
 
-	return drm_sched_init(sched, &nouveau_sched_ops,
+	return drm_sched_init(sched, &nouveau_sched_ops, NULL,
 			      NOUVEAU_SCHED_HW_SUBMISSIONS, 0, job_hang_limit,
 			      NULL, NULL, "nouveau_sched", drm->dev->dev);
 }
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index a8b4827dc425..4efbc8aa3c2f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -831,7 +831,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 		js->queue[j].fence_context = dma_fence_context_alloc(1);
 
 		ret = drm_sched_init(&js->queue[j].sched,
-				     &panfrost_sched_ops,
+				     &panfrost_sched_ops, NULL,
 				     nentries, 0,
 				     msecs_to_jiffies(JOB_TIMEOUT_MS),
 				     pfdev->reset.wq,
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e4fa62abca41..614e8c97a622 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -48,7 +48,6 @@
  * through the jobs entity pointer.
  */
 
-#include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/completion.h>
@@ -256,6 +255,16 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
 }
 
+/**
+ * drm_sched_submit_queue - scheduler queue submission
+ * @sched: scheduler instance
+ */
+static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
+{
+	if (!READ_ONCE(sched->pause_submit))
+		queue_work(sched->submit_wq, &sched->work_submit);
+}
+
 /**
  * drm_sched_job_done - complete a job
  * @s_job: pointer to the job which is done
@@ -275,7 +284,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
 	dma_fence_get(&s_fence->finished);
 	drm_sched_fence_finished(s_fence, result);
 	dma_fence_put(&s_fence->finished);
-	wake_up_interruptible(&sched->wake_up_worker);
+	drm_sched_submit_queue(sched);
 }
 
 /**
@@ -868,7 +877,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
 void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
 {
 	if (drm_sched_can_queue(sched))
-		wake_up_interruptible(&sched->wake_up_worker);
+		drm_sched_submit_queue(sched);
 }
 
 /**
@@ -978,61 +987,42 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_blocked - check if the scheduler is blocked
- *
- * @sched: scheduler instance
- *
- * Returns true if blocked, otherwise false.
- */
-static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
-{
-	if (kthread_should_park()) {
-		kthread_parkme();
-		return true;
-	}
-
-	return false;
-}
-
 /**
  * drm_sched_main - main scheduler thread
  *
  * @param: scheduler instance
- *
- * Returns 0.
  */
-static int drm_sched_main(void *param)
+static void drm_sched_main(struct work_struct *w)
 {
-	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
+	struct drm_gpu_scheduler *sched =
+		container_of(w, struct drm_gpu_scheduler, work_submit);
+	struct drm_sched_entity *entity;
+	struct drm_sched_job *cleanup_job;
 	int r;
 
-	sched_set_fifo_low(current);
+	if (READ_ONCE(sched->pause_submit))
+		return;
 
-	while (!kthread_should_stop()) {
-		struct drm_sched_entity *entity = NULL;
-		struct drm_sched_fence *s_fence;
-		struct drm_sched_job *sched_job;
-		struct dma_fence *fence;
-		struct drm_sched_job *cleanup_job = NULL;
+	cleanup_job = drm_sched_get_cleanup_job(sched);
+	entity = drm_sched_select_entity(sched);
 
-		wait_event_interruptible(sched->wake_up_worker,
-					 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
-					 (!drm_sched_blocked(sched) &&
-					  (entity = drm_sched_select_entity(sched))) ||
-					 kthread_should_stop());
+	if (!entity && !cleanup_job)
+		return;	/* No more work */
 
-		if (cleanup_job)
-			sched->ops->free_job(cleanup_job);
+	if (cleanup_job)
+		sched->ops->free_job(cleanup_job);
 
-		if (!entity)
-			continue;
+	if (entity) {
+		struct dma_fence *fence;
+		struct drm_sched_fence *s_fence;
+		struct drm_sched_job *sched_job;
 
 		sched_job = drm_sched_entity_pop_job(entity);
-
 		if (!sched_job) {
 			complete_all(&entity->entity_idle);
-			continue;
+			if (!cleanup_job)
+				return;	/* No more work */
+			goto again;
 		}
 
 		s_fence = sched_job->s_fence;
@@ -1063,7 +1053,9 @@ static int drm_sched_main(void *param)
 
 		wake_up(&sched->job_scheduled);
 	}
-	return 0;
+
+again:
+	drm_sched_submit_queue(sched);
 }
 
 /**
@@ -1071,6 +1063,7 @@ static int drm_sched_main(void *param)
  *
  * @sched: scheduler instance
  * @ops: backend operations for this scheduler
+ * @submit_wq: workqueue to use for submission. If NULL, the system_wq is used
  * @hw_submission: number of hw submissions that can be in flight
  * @hang_limit: number of times to allow a job to hang before dropping it
  * @timeout: timeout value in jiffies for the scheduler
@@ -1084,14 +1077,16 @@ static int drm_sched_main(void *param)
  */
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
+		   struct workqueue_struct *submit_wq,
 		   unsigned hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev)
 {
-	int i, ret;
+	int i;
 
 	sched->ops = ops;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
+	sched->submit_wq = submit_wq ? : system_wq;
 	sched->timeout = timeout;
 	sched->timeout_wq = timeout_wq ? : system_wq;
 	sched->hang_limit = hang_limit;
@@ -1100,23 +1095,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
-	init_waitqueue_head(&sched->wake_up_worker);
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
+	INIT_WORK(&sched->work_submit, drm_sched_main);
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
-
-	/* Each scheduler will run on a seperate kernel thread */
-	sched->thread = kthread_run(drm_sched_main, sched, sched->name);
-	if (IS_ERR(sched->thread)) {
-		ret = PTR_ERR(sched->thread);
-		sched->thread = NULL;
-		DRM_DEV_ERROR(sched->dev, "Failed to create scheduler for %s.\n", name);
-		return ret;
-	}
+	sched->pause_submit = false;
 
 	sched->ready = true;
 	return 0;
@@ -1135,8 +1122,7 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	drm_sched_submit_stop(sched);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
@@ -1216,7 +1202,7 @@ EXPORT_SYMBOL(drm_sched_increase_karma);
  */
 bool drm_sched_submit_ready(struct drm_gpu_scheduler *sched)
 {
-	return !!sched->thread;
+	return sched->ready;
 }
 EXPORT_SYMBOL(drm_sched_submit_ready);
 
@@ -1228,7 +1214,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
  */
 void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
 {
-	kthread_park(sched->thread);
+	WRITE_ONCE(sched->pause_submit, true);
+	cancel_work_sync(&sched->work_submit);
 }
 EXPORT_SYMBOL(drm_sched_submit_stop);
 
@@ -1239,6 +1226,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
  */
 void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
 {
-	kthread_unpark(sched->thread);
+	WRITE_ONCE(sched->pause_submit, false);
+	queue_work(sched->submit_wq, &sched->work_submit);
 }
 EXPORT_SYMBOL(drm_sched_submit_start);
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 06238e6d7f5c..38e092ea41e6 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -388,7 +388,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	int ret;
 
 	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
-			     &v3d_bin_sched_ops,
+			     &v3d_bin_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_bin", v3d->drm.dev);
@@ -396,7 +396,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		return ret;
 
 	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
-			     &v3d_render_sched_ops,
+			     &v3d_render_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_render", v3d->drm.dev);
@@ -404,7 +404,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		goto fail;
 
 	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
-			     &v3d_tfu_sched_ops,
+			     &v3d_tfu_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_tfu", v3d->drm.dev);
@@ -413,7 +413,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 
 	if (v3d_has_csd(v3d)) {
 		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
-				     &v3d_csd_sched_ops,
+				     &v3d_csd_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_csd", v3d->drm.dev);
@@ -421,7 +421,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 			goto fail;
 
 		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
-				     &v3d_cache_clean_sched_ops,
+				     &v3d_cache_clean_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_cache_clean", v3d->drm.dev);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index f12c5aea5294..278106e358a8 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -473,17 +473,16 @@ struct drm_sched_backend_ops {
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @sched_rq: priority wise array of run queues.
- * @wake_up_worker: the wait queue on which the scheduler sleeps until a job
- *                  is ready to be scheduled.
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
  * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
+ * @submit_wq: workqueue used to queue @work_submit
  * @timeout_wq: workqueue used to queue @work_tdr
+ * @work_submit: schedules jobs and cleans up entities
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
- * @thread: the kthread on which the scheduler which run.
  * @pending_list: the list of jobs which are currently in the job queue.
  * @job_list_lock: lock to protect the pending_list.
  * @hang_limit: once the hangs by a job crosses this limit then it is marked
@@ -492,6 +491,7 @@ struct drm_sched_backend_ops {
  * @_score: score used when the driver doesn't provide one
 * @ready: marks if the underlying HW is ready to work
 * @free_guilty: A hit to time out handler to free the guilty job.
+ * @pause_submit: pause queuing of @work_submit on @submit_wq
 * @dev: system &struct device
 *
 * One scheduler is implemented for each hardware ring.
@@ -502,13 +502,13 @@ struct drm_gpu_scheduler {
 	long				timeout;
 	const char			*name;
 	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
-	wait_queue_head_t		wake_up_worker;
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
+	struct workqueue_struct		*submit_wq;
 	struct workqueue_struct		*timeout_wq;
+	struct work_struct		work_submit;
 	struct delayed_work		work_tdr;
-	struct task_struct		*thread;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
 	int				hang_limit;
@@ -516,11 +516,13 @@ struct drm_gpu_scheduler {
 	atomic_t			_score;
 	bool				ready;
 	bool				free_guilty;
+	bool				pause_submit;
 	struct device			*dev;
 };
 
 int drm_sched_init(struct drm_gpu_scheduler *sched,
		   const struct drm_sched_backend_ops *ops,
+		   struct workqueue_struct *submit_wq,
		   uint32_t hw_submission, unsigned hang_limit,
		   long timeout, struct workqueue_struct *timeout_wq,
		   atomic_t *score, const char *name, struct device *dev);
In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd but let us explain the reasoning below.

1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to be the same completion even if targeting the same hardware
engine. This is because in XE we have a firmware scheduler, the GuC,
which allowed to reorder, timeslice, and preempt submissions. If a using
shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
apart as the TDR expects submission order == completion order. Using a
dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.

2. In XE submissions are done via programming a ring buffer (circular
buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
control on the ring for free.

A problem with this design is currently a drm_gpu_scheduler uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_scheduler are used. To work around the scaling issue,
use a worker rather than kthread for submission / job cleanup.

v2:
  - (Rob Clark) Fix msm build
  - Pass in run work queue
v3:
  - (Boris) don't have loop in worker
v4:
  - (Tvrtko) break out submit ready, stop, start helpers into own patch

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   2 +-
 drivers/gpu/drm/lima/lima_sched.c          |   2 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c    |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 106 +++++++++------------
 drivers/gpu/drm/v3d/v3d_sched.c            |  10 +-
 include/drm/gpu_scheduler.h                |  12 ++-
 9 files changed, 65 insertions(+), 75 deletions(-)
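As a usage illustration of the new submit_wq argument: a driver that wants the Xe-style 1:1 entity/scheduler arrangement could hand each scheduler its own ordered workqueue, while existing drivers pass NULL and keep the system_wq behaviour. The function name and limit values below are placeholders for illustration, not taken from any real driver:

#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/* Placeholder helper; my_sched_ops is whatever backend ops the driver
 * already registers today. */
static int my_driver_sched_init(struct drm_gpu_scheduler *sched,
				const struct drm_sched_backend_ops *my_sched_ops,
				struct device *dev)
{
	struct workqueue_struct *submit_wq;
	int ret;

	/* One ordered queue per scheduler keeps its submissions serialized. */
	submit_wq = alloc_ordered_workqueue("my-sched-submit", 0);
	if (!submit_wq)
		return -ENOMEM;

	/* New third argument; passing NULL instead falls back to system_wq. */
	ret = drm_sched_init(sched, my_sched_ops, submit_wq,
			     64 /* hw_submission */, 0 /* hang_limit */,
			     msecs_to_jiffies(500) /* timeout */,
			     NULL /* timeout_wq */, NULL /* score */,
			     "my-sched", dev);
	if (ret)
		destroy_workqueue(submit_wq);

	return ret;
}

Since the queue is driver-provided, the driver also remains responsible for destroying it after drm_sched_fini(); nothing in this patch frees it on the scheduler's behalf.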