Message ID | 20250324132952.1075209-1-andi.shyti@linux.intel.com (mailing list archive) |
---|---|
Series | CCS static load balance |
Acked-by: Michal Mrozek <michal.mrozek@intel.com>
Quoting Andi Shyti (2025-03-24 15:29:36)
> Hi,
>
> Back in v3, this patch series was turned down due to community
> policies regarding i915 GEM development. Since then, I have
> received several requests from userspace developers, which I
> initially declined in order to respect those policies.
>
> However, with the latest request from UMD users, I decided to
> give this series another chance. I believe that when a feature
> is genuinely needed, our goal should be to support it, not to
> dismiss user and customer needs blindly.

We had plenty of community bug reports when the move to fixed CCS mode
was initially implemented with some bugs.

After those bugs were fixed, nobody was reporting impactful performance
regressions.

Do you have a reference to some GitLab issues or maybe some external
project issues where regressions around here are discussed?

Regards, Joonas
The ccs mode setting support via sysfs is required by our customer.
Acked-by: Arshad Mehmood <arshad.mehmood@intel.com>
Hi Joonas,

thanks a lot for your reply!

On Tue, Mar 25, 2025 at 10:24:42AM +0200, Joonas Lahtinen wrote:
> Quoting Andi Shyti (2025-03-24 15:29:36)
> > Back in v3, this patch series was turned down due to community
> > policies regarding i915 GEM development. Since then, I have
> > received several requests from userspace developers, which I
> > initially declined in order to respect those policies.
> >
> > However, with the latest request from UMD users, I decided to
> > give this series another chance. I believe that when a feature
> > is genuinely needed, our goal should be to support it, not to
> > dismiss user and customer needs blindly.
>
> We had plenty of community bug reports when the move to fixed CCS mode
> was initially implemented with some bugs.
>
> After those bugs were fixed, nobody was reporting impactful performance
> regressions.
>
> Do you have a reference to some GitLab issues or maybe some external
> project issues where regressions around here are discussed?

AFAIK, there's no GitLab issue for this because we're not fixing
a bug here; we're adding a new sysfs interface.

All known issues and reports related to CCS load balancing have
already been addressed. What we're still missing is a way for
compute applications to tweak CCS load balancing settings.

I already shared the link [1], but if you take a look at that
code, you'll find 'execution_environment_drm.cpp' [2], where the
new interface is used.

If you're feeling lazy, I can point out the relevant parts,
otherwise, feel free to skip to the final greetings :-)

In 'void ExecutionEnvironment::configureCcsMode()', the app sets
up the path like this:

    const std::string drmPath = "/sys/class/drm";
    const std::string expectedFilePrefix = drmPath + "/card";
    ...
    auto gtFiles = Directory::getFiles(gtPath.c_str());
    auto expectedGtFilePrefix = gtPath + "/gt";
    ...
    std::string ccsFile = gtFile + "/ccs_mode";

Then it writes the desired CCS mode value:

    uint32_t ccsValue = 0;
    ssize_t ret = SysCalls::read(fd, &ccsValue, sizeof(uint32_t));
    ...
    do {
        ret = SysCalls::write(fd, &ccsMode, sizeof(uint32_t));
    } while (ret == -1 && errno == -EBUSY);

Arshad and Usha can definitely help if there are any technical
questions about how the application uses the interface.

Usha, would you please be able to share your use case?

Thanks,
Andi

[1] https://github.com/intel/compute-runtime
[2] https://github.com/intel/compute-runtime/blob/master/shared/source/execution_environment/execution_environment_drm.cpp
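For readers following the thread without the compute-runtime sources at hand, the write-and-retry step quoted above can be sketched as a small standalone helper. This is only an illustration under stated assumptions, not the compute-runtime implementation: the function name `write_with_busy_retry`, the bounded retry count, and the caller-supplied payload are all hypothetical, and the exact value format the sysfs file expects is not stated in this thread. The sketch compares `errno` against `EBUSY` directly, since POSIX error codes stored in `errno` are positive.

```cpp
#include <cassert>
#include <cerrno>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper (not part of compute-runtime or the kernel series):
// writes `payload` to the file at `path`, retrying while the write fails
// with EBUSY, for at most `max_retries` additional attempts.
// Returns true only if the whole payload was accepted by a single write.
bool write_with_busy_retry(const std::string &path,
                           const std::string &payload,
                           int max_retries) {
    assert(max_retries >= 0);

    int fd = ::open(path.c_str(), O_WRONLY);
    if (fd < 0) {
        return false; // file missing or not writable by this user
    }

    ssize_t ret = -1;
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        errno = 0;
        ret = ::write(fd, payload.data(), payload.size());
        // errno holds positive error codes, so compare against EBUSY itself.
        if (!(ret == -1 && errno == EBUSY)) {
            break; // success, or an error that retrying will not fix
        }
    }

    ::close(fd);
    return ret == static_cast<ssize_t>(payload.size());
}
```

A caller would pass the `ccs_mode` path it discovered while enumerating the card and gt directories, plus the payload in whatever encoding the interface expects. The retry loop is bounded here so that a persistently busy device cannot hang the caller; the loop quoted in the mail has no such limit.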
Quoting Andi Shyti (2025-03-25 12:52:58)
> On Tue, Mar 25, 2025 at 10:24:42AM +0200, Joonas Lahtinen wrote:

<SNIP>

> > Do you have a reference to some GitLab issues or maybe some external
> > project issues where regressions around here are discussed?
>
> AFAIK, there's no GitLab issue for this because we're not fixing
> a bug here; we're adding a new sysfs interface.

This sysfs interface was exactly designed to address performance
regressions coming from limiting the number of CCS to 1.

So unless we have a specific workload and end-user reporting a
regression on it, there's no incentive to spend any further time here.

<SNIP>

> Arshad and Usha can definitely help if there are any technical
> questions about how the application uses the interface.

I don't have any technical questions as I specified the interface
initially :)

This is not about technical opens about how the interface works.

To recap, when we initially implemented the 1CCS mode, we got active
feedback on the community on regressions. We were careful to verify that
all userspace would cleanly fall back to using 1CCS mode after it was
implemented.

And indeed, nobody has been asking for the 4CCS mode back after the 1CCS
mode bugs were fixed.

So as far as I see it, there are no users for this interface in upstream,
and thus we should not spend the time on it.

Regards, Joonas
Justification: To address a hardware bug causing stability issues when RCS and multiple CCS operate simultaneously in dynamic load balancing mode, CCG limited the CCS count to 1 as a software workaround. Many ECG customers run compute-only workloads without requiring rendering tasks. Therefore, it is important to provide customers with a runtime configuration option to increase the CCS count for compute-only workloads, in order to meet the performance requirements.
Acked-by: Usharani Ayyalasomayajula <usharani.ayyalasomayajula@intel.com>
Thanks,
Usha.
I’d like to provide additional context regarding the necessity of these patches. The shift from dynamic load balancing mode to fixed mode, with CCS usage restricted to a single unit, has led to a notable performance regression, with workloads experiencing an approximately 10% FPS drop.

For example, on DG2, the ResNet-50 inference benchmark previously achieved ~10,500 FPS in dynamic load balancing mode. However, after limiting CCS to 1 in fixed mode, performance dropped to ~9,200 FPS. With these patches, enabling all 4 CCS units via sysfs (in fixed mode) restores performance back to nearly 10,500 FPS, effectively matching the previous dynamic mode results.

Given customer expectations to maintain prior performance levels, these patches are essential to ensuring workloads utilizing multiple CCS units do not experience unnecessary degradation. The proposed sysfs interface provides configurability, allowing controlled re-enablement of all 4 CCS units while keeping fixed mode intact. Since fixed mode is now in use, having a configurable approach ensures flexibility to address different scenarios that may arise.

Let me know if you need further details.

Regards,
Arshad