Message ID | 20240704081841.30212-4-nirmoy.das@intel.com (mailing list archive)
State      | New, archived
Series     | [v5,1/4] drm/ttm: Add a flag to allow drivers to skip clear-on-free
Gentle reminder!

On 7/4/2024 10:18 AM, Nirmoy Das wrote:
> On LNL because of flat CCS, driver creates a migrate job to clear
> CCS meta data. Extend that to also clear system pages using GPU.
> [snip - full patch and commit message quoted; see the patch below]
On 04/07/2024 09:18, Nirmoy Das wrote:
> On LNL because of flat CCS, driver creates a migrate job to clear
> CCS meta data. Extend that to also clear system pages using GPU.
> [snip]
>
> @@ -387,6 +387,13 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
>  		caching = ttm_uncached;
>  	}
>
> +	/* If the device can support gpu clear pages then set proper ttm

Nit: for multi-line block comment style is usually:

/*
 * If the device

> +	 * flag. Zeroed pages are only required for ttm_bo_type_device so
> +	 * unwanted data is leaked to userspace.

not leaked?

> +	 */
> +	if (ttm_bo->type == ttm_bo_type_device && xe->mem.gpu_page_clear)
> +		page_flags |= TTM_TT_FLAG_CLEARED_ON_FREE;
>
> @@ -653,6 +664,14 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
>
>  	int ret = 0;
>
> +	/*
> +	 * Clear TTM_TT_FLAG_CLEARED_ON_FREE on bo creation path when
> +	 * moving to system as the bo doesn't dma_mapping.
> +	 */

Ok, so I guess with deferred allocation you always get CPU clear, but
there is no funny business in the fault hander.

> +	if (!old_mem && ttm && !ttm_tt_is_populated(ttm)) {
> +		ttm->page_flags &= ~TTM_TT_FLAG_CLEARED_ON_FREE;
> +	}

Nit: Can drop the {}

> @@ -636,6 +636,13 @@ int xe_device_probe(struct xe_device *xe)
>  	if (err)
>  		goto err;
>
> +	/**

Nit: can drop the extra * since this is not actual kernel-doc.

> +	 * On iGFX device with flat CCS, we clear CCS metadata, let's extend that
> +	 * and use GPU to clear pages as well.
> +	 */
> +	if (xe_device_has_flat_ccs(xe) && !IS_DGFX(xe))
> +		xe->mem.gpu_page_clear = true;

Could potentially move this into xe_ttm_sys_mgr_init() ?

> +	/** @gpu_page_clear: clear pages offloaded to GPU */

@mem.gpu_page_clear?

Should we maybe also rename this to gpu_page_clear_sys or similar? Since
we technically already "clear pages offloaded to GPU" for vram.

> +	bool gpu_page_clear;
On 7/18/2024 6:27 PM, Matthew Auld wrote:
> On 04/07/2024 09:18, Nirmoy Das wrote:
>> [snip]
>>
>> +	/* If the device can support gpu clear pages then set proper ttm
>
> Nit: for multi-line block comment style is usually:
>
> /*
>  * If the device
>
>> +	 * flag. Zeroed pages are only required for ttm_bo_type_device so
>> +	 * unwanted data is leaked to userspace.
>
> not leaked?

Yes, should be not leaked.

>> +	/*
>> +	 * Clear TTM_TT_FLAG_CLEARED_ON_FREE on bo creation path when
>> +	 * moving to system as the bo doesn't dma_mapping.
>> +	 */
>
> Ok, so I guess with deferred allocation you always get CPU clear,

Yes

> but there is no funny business in the fault hander.

There is sort of one, so we have 3 cases:

1. DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING + CPU mmap --> CPU clear always

2. Not DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING + CPU mmap --> GPU clear
   I wasn't sure if this was correct but Thomas confirmed that this is
   fine as "cpu fault handler already waits for the kernel fence that
   the xe_migrate_clear() attaches"

3. Not DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING + exec_buf/GPU access --> GPU clear

>> +	if (!old_mem && ttm && !ttm_tt_is_populated(ttm)) {
>> +		ttm->page_flags &= ~TTM_TT_FLAG_CLEARED_ON_FREE;
>> +	}
>
> Nit: Can drop the {}

+1

>> +	/**
>
> Nit: can drop the extra * since this is not actual kernel-doc.

+1

>> +	if (xe_device_has_flat_ccs(xe) && !IS_DGFX(xe))
>> +		xe->mem.gpu_page_clear = true;
>
> Could potentially move this into xe_ttm_sys_mgr_init() ?

Yes, good idea.

>> +	/** @gpu_page_clear: clear pages offloaded to GPU */
>
> @mem.gpu_page_clear?
>
> Should we maybe also rename this to gpu_page_clear_sys or similar?
> Since we technically already "clear pages offloaded to GPU" for vram.

Yes, gpu_page_clear_sys should make it more clear.

Thanks,
Nirmoy
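The three cases Nirmoy lists above can be condensed into a small standalone model. This is illustrative only, not driver code: the defer_backing and cpu_mmap_first parameters only loosely mirror DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING and the CPU fault path, and the behavior encoded here is an assumption drawn from the discussion in this thread.

/*
 * Standalone model (not kernel code) of which path ends up zeroing a
 * freshly created system-memory BO, per the three cases discussed above.
 */
#include <stdbool.h>
#include <stdio.h>

enum clear_path {
	CLEAR_BY_CPU,	/* pages zeroed by the CPU allocator (ZERO_ALLOC) */
	CLEAR_BY_GPU,	/* pages zeroed by the migrate/clear job */
};

static enum clear_path first_clear_path(bool defer_backing, bool cpu_mmap_first)
{
	/*
	 * Case 1: deferred backing + CPU mmap first. Pages are allocated
	 * from the CPU fault handler with no dma mapping yet, so they come
	 * back CPU cleared.
	 */
	if (defer_backing && cpu_mmap_first)
		return CLEAR_BY_CPU;

	/*
	 * Cases 2 and 3: backing pages exist before first access, so the
	 * GPU clear job runs; a later CPU fault simply waits on the kernel
	 * fence that xe_migrate_clear() attached.
	 */
	return CLEAR_BY_GPU;
}

int main(void)
{
	printf("defer_backing + mmap : %s\n",
	       first_clear_path(true, true) == CLEAR_BY_CPU ? "CPU clear" : "GPU clear");
	printf("mmap                 : %s\n",
	       first_clear_path(false, true) == CLEAR_BY_CPU ? "CPU clear" : "GPU clear");
	printf("exec_buf / GPU use   : %s\n",
	       first_clear_path(false, false) == CLEAR_BY_CPU ? "CPU clear" : "GPU clear");
	return 0;
}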
diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 4d6315d2ae9a..b76a44fcf3b1 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -387,6 +387,13 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
 		caching = ttm_uncached;
 	}
 
+	/* If the device can support gpu clear pages then set proper ttm
+	 * flag. Zeroed pages are only required for ttm_bo_type_device so
+	 * unwanted data is leaked to userspace.
+	 */
+	if (ttm_bo->type == ttm_bo_type_device && xe->mem.gpu_page_clear)
+		page_flags |= TTM_TT_FLAG_CLEARED_ON_FREE;
+
 	err = ttm_tt_init(&tt->ttm, &bo->ttm, page_flags, caching, extra_pages);
 	if (err) {
 		kfree(tt);
@@ -408,6 +415,10 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
 	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
 		return 0;
 
+	/* Clear TTM_TT_FLAG_ZERO_ALLOC when GPU is set to clear pages */
+	if (tt->page_flags & TTM_TT_FLAG_CLEARED_ON_FREE)
+		tt->page_flags &= ~TTM_TT_FLAG_ZERO_ALLOC;
+
 	err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
 	if (err)
 		return err;
@@ -653,6 +664,14 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
 
 	int ret = 0;
 
+	/*
+	 * Clear TTM_TT_FLAG_CLEARED_ON_FREE on bo creation path when
+	 * moving to system as the bo doesn't dma_mapping.
+	 */
+	if (!old_mem && ttm && !ttm_tt_is_populated(ttm)) {
+		ttm->page_flags &= ~TTM_TT_FLAG_CLEARED_ON_FREE;
+	}
+
 	/* Bo creation path, moving to system or TT. */
 	if ((!old_mem && ttm) && !handle_system_ccs) {
 		if (new_mem->mem_type == XE_PL_TT)
@@ -676,7 +695,8 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
 		(!mem_type_is_vram(old_mem_type) && !tt_has_data);
 
 	needs_clear = (ttm && ttm->page_flags & TTM_TT_FLAG_ZERO_ALLOC) ||
-		(!ttm && ttm_bo->type == ttm_bo_type_device);
+		(!ttm && ttm_bo->type == ttm_bo_type_device) ||
+		(ttm && ttm->page_flags & TTM_TT_FLAG_CLEARED_ON_FREE);
 
 	if (new_mem->mem_type == XE_PL_TT) {
 		ret = xe_tt_map_sg(ttm);
@@ -790,6 +810,9 @@ static int xe_bo_move(struct ttm_buffer_object *ttm_bo, bool evict,
 			handle_system_ccs;
 		bool clear_bo_data = mem_type_is_vram(new_mem->mem_type);
 
+		if (ttm && (ttm->page_flags & TTM_TT_FLAG_CLEARED_ON_FREE))
+			clear_bo_data |= true;
+
 		fence = xe_migrate_clear(migrate, bo, new_mem,
 					 clear_bo_data, clear_ccs);
 	}
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 03492fbcb8fb..7c682a53f06e 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -636,6 +636,13 @@ int xe_device_probe(struct xe_device *xe)
 	if (err)
 		goto err;
 
+	/**
+	 * On iGFX device with flat CCS, we clear CCS metadata, let's extend that
+	 * and use GPU to clear pages as well.
+	 */
+	if (xe_device_has_flat_ccs(xe) && !IS_DGFX(xe))
+		xe->mem.gpu_page_clear = true;
+
 	err = xe_vram_probe(xe);
 	if (err)
 		goto err;
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 3bca6d344744..28eaf2ab1f25 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -325,6 +325,8 @@ struct xe_device {
 		struct xe_mem_region vram;
 		/** @mem.sys_mgr: system TTM manager */
 		struct ttm_resource_manager sys_mgr;
+		/** @gpu_page_clear: clear pages offloaded to GPU */
+		bool gpu_page_clear;
 	} mem;
 
 	/** @sriov: device level virtualization data */
On LNL, because of flat CCS, the driver already creates a migrate job to
clear CCS metadata. Extend that to also clear system pages using the GPU.
To avoid double page clearing, inform TTM to allocate pages without
__GFP_ZERO by clearing the TTM_TT_FLAG_ZERO_ALLOC flag, and set
TTM_TT_FLAG_CLEARED_ON_FREE so that the ttm pool skips its clear-on-free,
as XE now takes care of clearing the pages. If a BO is in system placement
and has a CPU mapping, GPU clear is avoided for such a BO as it has no dma
mapping.

To test the patch, a small test was created that submits a job after
binding buffers of various sizes; it shows good gains for larger buffers.
For smaller buffer sizes the results are not very reliable as they vary a
lot.

Without the patch
sudo ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719237905 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 9493 us
Time taken for size SZ_2M: 5503 us
Time taken for size SZ_64M: 13016 us
Time taken for size SZ_128M: 29464 us
Time taken for size SZ_256M: 38408 us
Time taken for size SZ_1G: 148758 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 3889 us
Time taken for size SZ_2M: 6091 us
Time taken for size SZ_64M: 20920 us
Time taken for size SZ_128M: 32394 us
Time taken for size SZ_256M: 61710 us
Time taken for size SZ_1G: 215437 us
Subtest basic-store-benchmark: SUCCESS (0.589s)

With the patch:
sudo ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719238062 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 11803 us
Time taken for size SZ_2M: 4237 us
Time taken for size SZ_64M: 8649 us
Time taken for size SZ_128M: 14682 us
Time taken for size SZ_256M: 22156 us
Time taken for size SZ_1G: 74457 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 5129 us
Time taken for size SZ_2M: 12563 us
Time taken for size SZ_64M: 14860 us
Time taken for size SZ_128M: 26064 us
Time taken for size SZ_256M: 47167 us
Time taken for size SZ_1G: 170304 us
Subtest basic-store-benchmark: SUCCESS (0.417s)

With the patch and init_on_alloc=0
sudo ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719238219 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 4803 us
Time taken for size SZ_2M: 9212 us
Time taken for size SZ_64M: 9643 us
Time taken for size SZ_128M: 13479 us
Time taken for size SZ_256M: 22429 us
Time taken for size SZ_1G: 83110 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 4003 us
Time taken for size SZ_2M: 4443 us
Time taken for size SZ_64M: 12960 us
Time taken for size SZ_128M: 13741 us
Time taken for size SZ_256M: 26841 us
Time taken for size SZ_1G: 84746 us
Subtest basic-store-benchmark: SUCCESS (0.290s)

v2: Handle regression on dgfx (Himal)
    Update commit message as no ttm API changes needed.
v3: Fix Kunit test.
v4: Handle data leak on cpu mmap (Thomas)

Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c           | 25 ++++++++++++++++++++++++-
 drivers/gpu/drm/xe/xe_device.c       |  7 +++++++
 drivers/gpu/drm/xe/xe_device_types.h |  2 ++
 3 files changed, 33 insertions(+), 1 deletion(-)
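As a reading aid for the commit message above, here is a small standalone model of the page_flags handshake it describes. It is illustrative only: the flag values below are placeholders rather than the kernel's TTM definitions, and the helpers are simplified stand-ins for the xe_ttm_tt_create(), xe_ttm_tt_populate() and xe_bo_move() hunks in the patch.

/*
 * Standalone model of the TTM_TT flag handshake described in the commit
 * message; flag values are placeholders, not the kernel definitions.
 */
#include <assert.h>
#include <stdbool.h>

#define FLAG_ZERO_ALLOC		(1u << 0)	/* allocate with __GFP_ZERO */
#define FLAG_CLEARED_ON_FREE	(1u << 1)	/* driver clears, pool skips clear-on-free */

struct tt_model { unsigned int page_flags; };

/* Mirrors xe_ttm_tt_create(): opt in to GPU clearing for user-visible BOs. */
static void model_tt_create(struct tt_model *tt, bool user_bo, bool gpu_page_clear)
{
	tt->page_flags = FLAG_ZERO_ALLOC;
	if (user_bo && gpu_page_clear)
		tt->page_flags |= FLAG_CLEARED_ON_FREE;
}

/* Mirrors xe_ttm_tt_populate(): skip CPU zeroing when the GPU will clear. */
static void model_tt_populate(struct tt_model *tt)
{
	if (tt->page_flags & FLAG_CLEARED_ON_FREE)
		tt->page_flags &= ~FLAG_ZERO_ALLOC;
}

/* Mirrors the needs_clear check in xe_bo_move(): a clear is still required,
 * it is just performed by the migrate job instead of the CPU. */
static bool model_needs_clear(const struct tt_model *tt)
{
	return (tt->page_flags & FLAG_ZERO_ALLOC) ||
	       (tt->page_flags & FLAG_CLEARED_ON_FREE);
}

int main(void)
{
	struct tt_model tt;

	model_tt_create(&tt, true, true);
	model_tt_populate(&tt);
	assert(!(tt.page_flags & FLAG_ZERO_ALLOC));	/* no double clearing */
	assert(model_needs_clear(&tt));			/* GPU clear still happens */
	return 0;
}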