Message ID | 20241228063245.61874-1-xueshuai@linux.alibaba.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation |
On 28.12.24 at 07:32, Shuai Xue wrote:
> It's observed that most GPU jobs utilize less than one server, typically
> with each GPU being used by an independent job. If a job consumed poisoned
> data, a SIGBUS signal will be sent to terminate it. Meanwhile, the
> gpu_recovery parameter is set to -1 by default, the amdgpu driver resets
> all GPUs on the server. As a result, all jobs are terminated. Setting
> gpu_recovery to 0 provides an opportunity to preemptively evacuate other
> jobs and subsequently manually reset all GPUs.

*BIG* NAK to this whole approach!

Setting gpu_recovery to 0 in a production environment is *NOT* supported at all and should never be done.

This is a pure debugging feature for JTAG debugging and can result in random crashes and/or compromised data.

Please don't tell me that you tried to use this in a production environment.

Regards,
Christian.

> However, this parameter is
> read-only, necessitating correct settings at driver load. And reloading the
> GPU driver in a production environment can be challenging due to reference
> counts maintained by various monitoring services.
>
> Set the gpu_recovery parameter with read-write permission to enable runtime
> modification. It will enables users to dynamically manage GPU recovery
> mechanisms based on real-time requirements or conditions.
>
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 38686203bea6..03dd902e1cec 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>  MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>  module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>
> +static int amdgpu_set_gpu_recovery(const char *buf,
> +				   const struct kernel_param *kp)
> +{
> +	unsigned long val;
> +	int ret;
> +
> +	ret = kstrtol(buf, 10, &val);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (val != 1 && val != 0 && val != -1) {
> +		pr_err("Invalid value for gpu_recovery: %ld, excepted 0,1,-1\n",
> +		       val);
> +		return -EINVAL;
> +	}
> +
> +	return param_set_int(buf, kp);
> +}
> +
> +static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
> +	.set = amdgpu_set_gpu_recovery,
> +	.get = param_get_int,
> +};
> +
>  /**
>   * DOC: gpu_recovery (int)
>   * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
>   */
>  MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
> -module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
> +module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
>
>  /**
>   * DOC: emu_mode (int)
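
Editorial note on the quoted setter: it declares `val` as `unsigned long` but passes it to `kstrtol()`, which takes a `long *`, and then compares it against -1. The range check itself can be illustrated outside the kernel. The following is only a userspace sketch, not the driver's code: `parse_gpu_recovery()` is a hypothetical name, and `strtol()` is used as a stand-in for `kstrtol()` (an assumption; the two differ in details such as leading-whitespace handling).

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the setter quoted above: parse a
 * decimal string into a signed long and accept only -1, 0 or 1.
 * Returns 0 on success or a negative errno value, as kernel setters do.
 */
int parse_gpu_recovery(const char *buf, int *out)
{
	char *end;
	long val;	/* signed, so that "-1" is represented directly */

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == buf)		/* no digits were consumed */
		return -EINVAL;
	if (*end == '\n')	/* tolerate one trailing newline, as sysfs writes often carry one */
		end++;
	if (*end != '\0')	/* reject trailing garbage such as "1x" */
		return -EINVAL;
	if (val < -1 || val > 1)
		return -EINVAL;

	*out = (int)val;
	return 0;
}
```

With a signed type, the accepted set can be written as a simple range check, and a `%ld` format specifier would match the variable being printed.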
On 2024/12/30 04:11, Christian König wrote:
> On 28.12.24 at 07:32, Shuai Xue wrote:
>> It's observed that most GPU jobs utilize less than one server, typically
>> with each GPU being used by an independent job. If a job consumed poisoned
>> data, a SIGBUS signal will be sent to terminate it. Meanwhile, the
>> gpu_recovery parameter is set to -1 by default, the amdgpu driver resets
>> all GPUs on the server. As a result, all jobs are terminated. Setting
>> gpu_recovery to 0 provides an opportunity to preemptively evacuate other
>> jobs and subsequently manually reset all GPUs.
>
> *BIG* NAK to this whole approach!
>
> Setting gpu_recovery to 0 in a production environment is *NOT* supported at all and should never be done.
>
> This is a pure debugging feature for JTAG debugging and can result in random crashes and/or compromised data.
>
> Please don't tell me that you tried to use this in a production environment.
>
> Regards,
> Christian.

Hi, Christian,

Thank you for your quick reply.

When an application encounters an uncorrected error, it will be terminated by a
SIGBUS signal. The related bad pages are retired. I did not figure out why
gpu_recovery=0 can result in random crashes and/or compromised data.

I tested with error injection in my dev environment:

1. Load the driver with gpu_recovery=0

# cat /sys/bus/pci/drivers/amdgpu/module/parameters/gpu_recovery
0

2. Inject an uncorrectable ECC error into UMC

# sudo amdgpuras -d 0 -b 2 -t 8
Poison inject, logical addr:0x7f2b495f9000 physical addr:0x27f5d4b000 vmid:5
Bus error

3. GPU 0000:0a:00.0 reports the error addresses with PA

# dmesg | grep 27f5
[424443.174154] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d43080 Row:0x1fd7 Col:0x0 Bank:0xa Channel:0x30
[424443.174156] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d4b080 Row:0x1fd7 Col:0x4 Bank:0xa Channel:0x30
[424443.174158] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d53080 Row:0x1fd7 Col:0x8 Bank:0xa Channel:0x30
[424443.174160] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d5b080 Row:0x1fd7 Col:0xc Bank:0xa Channel:0x30
[424443.174162] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f43080 Row:0x1fd7 Col:0x10 Bank:0xa Channel:0x30
[424443.174169] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f4b080 Row:0x1fd7 Col:0x14 Bank:0xa Channel:0x30
[424443.174172] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f53080 Row:0x1fd7 Col:0x18 Bank:0xa Channel:0x30
[424443.174174] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f5b080 Row:0x1fd7 Col:0x1c Bank:0xa Channel:0x30

4. All the related bad pages are AMDGPU_RAS_RETIRE_PAGE_RESERVED.

# cat /sys/devices/pci0000:05/0000:05:01.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/0000:0a:00.0/ras/gpu_vram_bad_pages | grep 27f5
0x027f5d43 : 0x00001000 : R
0x027f5d4b : 0x00001000 : R
0x027f5d53 : 0x00001000 : R
0x027f5d5b : 0x00001000 : R
0x027f5f43 : 0x00001000 : R
0x027f5f4b : 0x00001000 : R
0x027f5f53 : 0x00001000 : R
0x027f5f5b : 0x00001000 : R

AFAIK, the reserved bad pages will not be used any more. Please correct me if
I missed anything.

DRAM ECC issues are the most common problems. When one occurs, the kernel will
attempt to hard-offline the page, by trying to unmap the page or killing any
owner, or triggering IO errors if needed.

ECC errors are also common for HBM, and error isolation between users' jobs is a
basic requirement in public cloud. For NVIDIA GPUs, an ECC error can be
contained to a process.

> XID 94: Contained ECC error
> XID 95: UnContained ECC error
>
> For Xid 94, these errors are contained to one application, and the application
> that encountered this error must be restarted. All other applications running
> at the time of the Xid are unaffected. It is recommended to reset the GPU when
> convenient. Applications can continue to be run until the reset can be
> performed.
>
> For Xid 95, these errors affect multiple applications, and the affected GPU
> must be reset before applications can restart.
>
> https://docs.nvidia.com/deploy/xid-errors/

Does the AMD GPU provide a similar way to meet the error isolation requirement?

Best Regards,
Shuai
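
Editorial note on the test output above: each `gpu_vram_bad_pages` record appears to be a page frame number, a page size and a flag, and the first column matches the reported error addresses shifted right by 12 bits (for example, PA 0x27f5d4b080 falls in the page listed as 0x027f5d4b). The following is a minimal sketch of that correspondence, assuming the `pfn : size : flag` layout and 4 KiB pages; the struct and function names are hypothetical and not part of any amdgpu interface.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 4 KiB pages, matching the 0x00001000 size column above. */
#define BAD_PAGE_SHIFT 12

/* One record of the sysfs output, e.g. "0x027f5d4b : 0x00001000 : R". */
struct bad_page {
	uint64_t pfn;	/* page frame number (first column) */
	uint64_t size;	/* page size in bytes (second column) */
	char flag;	/* status flag, 'R' in the output above */
};

/* Parse one record; returns 0 on success, -1 if the line does not match. */
int parse_bad_page(const char *line, struct bad_page *bp)
{
	if (sscanf(line, "%" SCNx64 " : %" SCNx64 " : %c",
		   &bp->pfn, &bp->size, &bp->flag) != 3)
		return -1;
	return 0;
}

/* Return 1 if physical address pa lies inside the retired page, else 0. */
int bad_page_contains(const struct bad_page *bp, uint64_t pa)
{
	uint64_t start = bp->pfn << BAD_PAGE_SHIFT;

	return pa >= start && pa - start < bp->size;
}
```

Parsing the second record and calling `bad_page_contains()` with 0x27f5d4b080 reports a match, while 0x27f5d53080 belongs to the next listed page.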
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Shuai,

setting gpu_recovery=0 is not even remotely related to RAS. If that option affects RAS behavior in any way then that is a bug.

The purpose of setting gpu_recovery=0 is to disable resets after a submission timeout most likely caused by an unrecoverable HW error.

This is necessary for JTAG debugging in our labs during HW bringup and should *NEVER* be used on any production system.

We already discussed with upstream maintainers that we should probably mark the kernel as tainted to indicate that it might be in an unreliable HW state. I will push for this now since there seems to be a big misunderstanding what this option does.

Regards,
Christian.
On Fri, Jan 03, 2025 at 08:21:43AM +0000, Koenig, Christian wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Shuai,
>
> setting gpu_recovery=0 is not even remotely related to RAS. If that
> option affects RAS behavior in any way then that is a bug.
>
> The purpose of setting gpu_recovery=0 is to disable resets after a
> submission timeout most likely caused by an unrecoverable HW error.
>
> This is necessary for JTAG debugging in our labs during HW bringup and
> should *NEVER* be used on any production system.
>
> We already discussed with upstream maintainers that we should probably
> mark the kernel as tainted to indicate that it might be in an unreliable
> HW state. I will push for this now since there seems to be a big
> misunderstanding what this option does.

module_param_unsafe and friends really should be the default for module
options, since generally they're just for debugging and other hacks.
With multiple GPUs you can't control options per-device with module
options in a reasonable way, so that's all no-go.

So you might want to go large-scale relabelling module options while you're at
it.
-Sima

>
> Regards,
> Christian.
>
> ________________________________________
> From: Shuai Xue <xueshuai@linux.alibaba.com>
> Sent: Monday, 30 December 2024 09:50
> To: Koenig, Christian; Deucher, Alexander; Pan, Xinhui; airlied@gmail.com; simona@ffwll.ch; Lazar, Lijo; Ma, Le; hamza.mahfooz@amd.com; tzimmermann@suse.de; Liu, Shaoyun; Jun.Ma2@amd.com
> Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-kernel@vger.kernel.org; tianruidong@linux.alibaba.com
> Subject: Re: [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation
On 2025/1/3 16:21, Koenig, Christian wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Shuai,
>
> setting gpu_recovery=0 is not even remotely related to RAS. If that option affects RAS behavior in any way then that is a bug.
>
> The purpose of setting gpu_recovery=0 is to disable resets after a submission timeout most likely caused by an unrecoverable HW error.
>
> This is necessary for JTAG debugging in our labs during HW bringup and should *NEVER* be used on any production system.
>
> We already discussed with upstream maintainers that we should probably mark the kernel as tainted to indicate that it might be in an unreliable HW state. I will push for this now since there seems to be a big misunderstanding what this option does.
>
> Regards,
> Christian.

Hi, Christian,

Got the purpose of setting gpu_recovery=0. Thanks for your patient explanation.

When an ECC error occurs, the AMD GPU driver automatically resets all GPUs and all jobs
are terminated. My ultimate goal is to provide error isolation between independent
jobs that each use a different GPU. Any suggestions?

Thank you.
Best Regards,
On 07.01.25 at 08:06, Shuai Xue wrote:
> Hi, Christian,
>
> Got the purpose of setting gpu_recovery=0. Thanks for the your patient
> explanation.
>
> When a ECC error occurs, the AMD GPU driver auto resets all GPUs and
> all jobs are terminated. My ultimate goal is provide error isolation
> between independent jobs which use a different GPU. Any suggestion?

Not offhand. Hawking is the expert for this, but resetting all GPUs and all jobs on a RAS error is a must-have for system stability.

What we might be able to do is to isolate the GPU from the PCIe bus. The only problem with that is that you need a full system reset to get out of this again.

Regards,
Christian.

>
> Thank you.
> Best Regards,
On 06.01.25 at 13:09, Simona Vetter wrote:
> On Fri, Jan 03, 2025 at 08:21:43AM +0000, Koenig, Christian wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Hi Shuai,
>>
>> setting gpu_recovery=0 is not even remotely related to RAS. If that
>> option affects RAS behavior in any way then that is a bug.
>>
>> The purpose of setting gpu_recovery=0 is to disable resets after a
>> submission timeout most likely caused by an unrecoverable HW error.
>>
>> This is necessary for JTAG debugging in our labs during HW bringup and
>> should *NEVER* be used on any production system.
>>
>> We already discussed with upstream maintainers that we should probably
>> mark the kernel as tainted to indicate that it might be in an unreliable
>> HW state. I will push for this now since there seems to be a big
>> misunderstanding what this option does.
> module_param_unsafe and friends really should be the default for module
> options really, since generally they're just for debugging and other
> hacks. With multiple gpus you can't control options per-device with module
> options in a reasonable way, so that's all no-go. So might want to go
> large-scale relabelling module options while you're at it.

Oh, thanks a lot for this pointer. I wasn't aware of the unsafe module
option variants.

Going to prepare a patch to make sure that debugging options are not used
in a production environment.

Thanks,
Christian.

> -Sima
>
>> Regards,
>> Christian.
>>
>> ________________________________________
>> From: Shuai Xue <xueshuai@linux.alibaba.com>
>> Sent: Monday, 30 December 2024 09:50
>> To: Koenig, Christian; Deucher, Alexander; Pan, Xinhui; airlied@gmail.com; simona@ffwll.ch; Lazar, Lijo; Ma, Le; hamza.mahfooz@amd.com; tzimmermann@suse.de; Liu, Shaoyun; Jun.Ma2@amd.com
>> Cc: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org; linux-kernel@vger.kernel.org; tianruidong@linux.alibaba.com
>> Subject: Re: [PATCH] drm/amdgpu: Enable runtime modification of gpu_recovery parameter with validation
>>
>>
>>
>> On 2024/12/30 04:11, Christian König wrote:
>>> On 28.12.24 at 07:32, Shuai Xue wrote:
>>>> It's observed that most GPU jobs utilize less than one server, typically
>>>> with each GPU being used by an independent job. If a job consumes poisoned
>>>> data, a SIGBUS signal will be sent to terminate it. Meanwhile, with the
>>>> gpu_recovery parameter set to -1 by default, the amdgpu driver resets
>>>> all GPUs on the server. As a result, all jobs are terminated. Setting
>>>> gpu_recovery to 0 provides an opportunity to preemptively evacuate other
>>>> jobs and subsequently manually reset all GPUs.
>>> *BIG* NAK to this whole approach!
>>>
>>> Setting gpu_recovery to 0 in a production environment is *NOT* supported at all and should never be done.
>>>
>>> This is a pure debugging feature for JTAG debugging and can result in random crashes and/or compromised data.
>>>
>>> Please don't tell me that you tried to use this in a production environment.
>>>
>>> Regards,
>>> Christian.
>> Hi, Christian,
>>
>> Thank you for your quick reply.
>>
>> When an application encounters an uncorrected error, it will be terminated by a
>> SIGBUS signal. The related bad pages are retired. I did not figure out why
>> gpu_recovery=0 can result in random crashes and/or compromised data.
>>
>> I tested with error injection in my dev environment:
>>
>> 1. load driver with gpu_recovery=0
>> #cat /sys/bus/pci/drivers/amdgpu/module/parameters/gpu_recovery
>> 0
>>
>> 2. inject an Uncorrectable ECC error to UMC
>> #sudo amdgpuras -d 0 -b 2 -t 8
>> Poison inject, logical addr:0x7f2b495f9000 physical addr:0x27f5d4b000 vmid:5
>> Bus error
>>
>> 3. GPU 0000:0a:00.0 reports error address with PA
>> #dmesg | grep 27f5
>> [424443.174154] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d43080 Row:0x1fd7 Col:0x0 Bank:0xa Channel:0x30
>> [424443.174156] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d4b080 Row:0x1fd7 Col:0x4 Bank:0xa Channel:0x30
>> [424443.174158] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d53080 Row:0x1fd7 Col:0x8 Bank:0xa Channel:0x30
>> [424443.174160] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5d5b080 Row:0x1fd7 Col:0xc Bank:0xa Channel:0x30
>> [424443.174162] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f43080 Row:0x1fd7 Col:0x10 Bank:0xa Channel:0x30
>> [424443.174169] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f4b080 Row:0x1fd7 Col:0x14 Bank:0xa Channel:0x30
>> [424443.174172] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f53080 Row:0x1fd7 Col:0x18 Bank:0xa Channel:0x30
>> [424443.174174] amdgpu 0000:0a:00.0: amdgpu: Error Address(PA):0x27f5f5b080 Row:0x1fd7 Col:0x1c Bank:0xa Channel:0x30
>>
>> 4. All the related bad pages are AMDGPU_RAS_RETIRE_PAGE_RESERVED.
>> #cat /sys/devices/pci0000:05/0000:05:01.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/0000:0a:00.0/ras/gpu_vram_bad_pages | grep 27f5
>> 0x027f5d43 : 0x00001000 : R
>> 0x027f5d4b : 0x00001000 : R
>> 0x027f5d53 : 0x00001000 : R
>> 0x027f5d5b : 0x00001000 : R
>> 0x027f5f43 : 0x00001000 : R
>> 0x027f5f4b : 0x00001000 : R
>> 0x027f5f53 : 0x00001000 : R
>> 0x027f5f5b : 0x00001000 : R
>>
>> AFAIK, the reserved bad pages will not be used any more. Please correct me if
>> I missed anything.
>>
>> DRAM ECC issues are the most common problems. When one occurs, the kernel will
>> attempt to hard-offline the page, by trying to unmap the page or killing any
>> owner, or triggering IO errors if needed.
>>
>> ECC errors are also common for HBM, and error isolation from each user's job is a
>> basic requirement in public cloud. For NVIDIA GPUs, an ECC error can be
>> contained to a process.
>>
>>> XID 94: Contained ECC error
>>> XID 95: UnContained ECC error
>>>
>>> For Xid 94, these errors are contained to one application, and the application
>>> that encountered this error must be restarted. All other applications running
>>> at the time of the Xid are unaffected. It is recommended to reset the GPU when
>>> convenient. Applications can continue to be run until the reset can be
>>> performed.
>>>
>>> For Xid 95, these errors affect multiple applications, and the affected GPU
>>> must be reset before applications can restart.
>>>
>>> https://docs.nvidia.com/deploy/xid-errors/
>> Does AMD GPU provide a similar way to achieve error isolation requirement?
>>
>> Best Regards,
>> Shuai
>>
>>>> However, this parameter is
>>>> read-only, necessitating correct settings at driver load. And reloading the
>>>> GPU driver in a production environment can be challenging due to reference
>>>> counts maintained by various monitoring services.
>>>>
>>>> Set the gpu_recovery parameter with read-write permission to enable runtime
>>>> modification. It will enable users to dynamically manage GPU recovery
>>>> mechanisms based on real-time requirements or conditions.
>>>>
>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
>>>> 1 file changed, 25 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index 38686203bea6..03dd902e1cec 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
>>>> MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
>>>> module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
>>>> +static int amdgpu_set_gpu_recovery(const char *buf,
>>>> +				   const struct kernel_param *kp)
>>>> +{
>>>> +	unsigned long val;
>>>> +	int ret;
>>>> +
>>>> +	ret = kstrtol(buf, 10, &val);
>>>> +	if (ret < 0)
>>>> +		return ret;
>>>> +
>>>> +	if (val != 1 && val != 0 && val != -1) {
>>>> +		pr_err("Invalid value for gpu_recovery: %ld, excepted 0,1,-1\n",
>>>> +		       val);
>>>> +		return -EINVAL;
>>>> +	}
>>>> +
>>>> +	return param_set_int(buf, kp);
>>>> +}
>>>> +
>>>> +static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
>>>> +	.set = amdgpu_set_gpu_recovery,
>>>> +	.get = param_get_int,
>>>> +};
>>>> +
>>>> /**
>>>>  * DOC: gpu_recovery (int)
>>>>  * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
>>>>  */
>>>> MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
>>>> -module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
>>>> +module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
>>>> /**
>>>>  * DOC: emu_mode (int)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 38686203bea6..03dd902e1cec 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -563,12 +563,36 @@ module_param_named(lbpw, amdgpu_lbpw, int, 0444);
 MODULE_PARM_DESC(compute_multipipe, "Force compute queues to be spread across pipes (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 
+static int amdgpu_set_gpu_recovery(const char *buf,
+				   const struct kernel_param *kp)
+{
+	unsigned long val;
+	int ret;
+
+	ret = kstrtol(buf, 10, &val);
+	if (ret < 0)
+		return ret;
+
+	if (val != 1 && val != 0 && val != -1) {
+		pr_err("Invalid value for gpu_recovery: %ld, excepted 0,1,-1\n",
+		       val);
+		return -EINVAL;
+	}
+
+	return param_set_int(buf, kp);
+}
+
+static const struct kernel_param_ops amdgpu_gpu_recovery_ops = {
+	.set = amdgpu_set_gpu_recovery,
+	.get = param_get_int,
+};
+
 /**
  * DOC: gpu_recovery (int)
  * Set to enable GPU recovery mechanism (1 = enable, 0 = disable). The default is -1 (auto, disabled except SRIOV).
  */
 MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
-module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
+module_param_cb(gpu_recovery, &amdgpu_gpu_recovery_ops, &amdgpu_gpu_recovery, 0644);
 
 /**
  * DOC: emu_mode (int)
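For readers who want to try the range check from the diff outside the kernel, here is a minimal userspace sketch. It is an illustration under stated assumptions, not the driver's code: parse_gpu_recovery() is a hypothetical name, strtol() stands in for the kernel's kstrtol(), and errors are all folded into -EINVAL. It parses into a signed long, since kstrtol() stores its result through a long * and -1 has to be representable. It says nothing about whether changing gpu_recovery at runtime is supported; the maintainer replies in this thread reject that use.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the setter in the patch:
 * parse a decimal string and accept only -1, 0 or 1.
 * Returns 0 and stores the value in *out on success, -EINVAL otherwise.
 */
static int parse_gpu_recovery(const char *buf, int *out)
{
	char *end;
	long val;	/* signed, so that -1 compares as expected */

	errno = 0;
	val = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -EINVAL;

	/* Like kstrtol(), tolerate a single trailing newline and nothing else. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	if (val < -1 || val > 1)
		return -EINVAL;

	*out = (int)val;
	return 0;
}
```

A write such as "echo 1 > .../parameters/gpu_recovery" delivers the string "1\n" to the setter, which is why the trailing newline is accepted here.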
It's observed that most GPU jobs utilize less than one server, typically
with each GPU being used by an independent job. If a job consumes poisoned
data, a SIGBUS signal will be sent to terminate it. Meanwhile, with the
gpu_recovery parameter set to -1 by default, the amdgpu driver resets
all GPUs on the server. As a result, all jobs are terminated. Setting
gpu_recovery to 0 provides an opportunity to preemptively evacuate other
jobs and subsequently manually reset all GPUs. However, this parameter is
read-only, necessitating correct settings at driver load. And reloading the
GPU driver in a production environment can be challenging due to reference
counts maintained by various monitoring services.

Set the gpu_recovery parameter with read-write permission to enable runtime
modification. It will enable users to dynamically manage GPU recovery
mechanisms based on real-time requirements or conditions.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 26 ++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)