Message ID | 20200214004413.12450-1-john.stultz@linaro.org (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | [RFC] driver core: Extend returning EPROBE_DEFER for two minutes after late_initcall | expand |
On Thu, Feb 13, 2020 at 6:44 PM John Stultz <john.stultz@linaro.org> wrote: > > Due to commit e01afc3250255 ("PM / Domains: Stop deferring probe > at the end of initcall"), along with commit 25b4e70dcce9 > ("driver core: allow stopping deferred probe after init") after > late_initcall, drivers will stop getting EPROBE_DEFER, and > instead see an error causing the driver to fail to load. > > That change causes trouble when trying to use many clk drivers > as modules, as the clk modules may not load until much later > after init has started. If a dependent driver loads and gets an > error instead of EPROBE_DEFER, it won't try to reload later when > the dependency is met, and will thus fail to load. > > Instead of reverting that patch, this patch tries to extend the > time that EPROBE_DEFER is retruned by two minutes, to (hopefully) > ensure that everything has had a chance to load. I think regulators already has some delay like this. We should use the same timeouts. We also have the 'deferred_probe_timeout' cmdline option. It's deemed a debug option currently, but we could change that and change the default. > Specifically, on db845c, this change allows us to set > SDM_GPUCC_845, QCOM_CLK_RPMH and COMMON_CLK_QCOM as modules and > get a working system, where as without it the display will fail > to load. > > Cc: Alexander Graf <agraf@suse.de> > Cc: Rob Herring <robh@kernel.org> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > Cc: Kevin Hilman <khilman@kernel.org> > Cc: Ulf Hansson <ulf.hansson@linaro.org> > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Len Brown <len.brown@intel.com> > Cc: Todd Kjos <tkjos@google.com> > Cc: Bjorn Andersson <bjorn.andersson@linaro.org> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: linux-pm@vger.kernel.org > Fixes: e01afc3250255 ("PM / Domains: Stop deferring probe at the end of initcall") > Fixes: 25b4e70dcce9 ("driver core: allow stopping deferred probe after init") We can debate the design, but those work as designed. So Fixes? > Signed-off-by: John Stultz <john.stultz@linaro.org> > --- > drivers/base/dd.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index b25bcab2a26b..35ebae8b65be 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -311,6 +311,12 @@ static void deferred_probe_timeout_work_func(struct work_struct *work) > } > static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_work_func); > > +static void deferred_initcall_done_work_func(struct work_struct *work) > +{ > + initcalls_done = true; > +} > +static DECLARE_DELAYED_WORK(deferred_initcall_done_work, deferred_initcall_done_work_func); > + > /** > * deferred_probe_initcall() - Enable probing of deferred devices > * > @@ -327,7 +333,7 @@ static int deferred_probe_initcall(void) > driver_deferred_probe_trigger(); > /* Sort as many dependencies as possible before exiting initcalls */ > flush_work(&deferred_probe_work); > - initcalls_done = true; > + schedule_delayed_work(&deferred_initcall_done_work, 120 * HZ); > > /* > * Trigger deferred probe again, this time we won't defer anything > -- > 2.17.1 >
On Thu 13 Feb 16:44 PST 2020, John Stultz wrote: > Due to commit e01afc3250255 ("PM / Domains: Stop deferring probe > at the end of initcall"), along with commit 25b4e70dcce9 > ("driver core: allow stopping deferred probe after init") after > late_initcall, drivers will stop getting EPROBE_DEFER, and > instead see an error causing the driver to fail to load. > > That change causes trouble when trying to use many clk drivers > as modules, as the clk modules may not load until much later > after init has started. If a dependent driver loads and gets an > error instead of EPROBE_DEFER, it won't try to reload later when > the dependency is met, and will thus fail to load. > > Instead of reverting that patch, this patch tries to extend the > time that EPROBE_DEFER is retruned by two minutes, to (hopefully) > ensure that everything has had a chance to load. > > Specifically, on db845c, this change allows us to set > SDM_GPUCC_845, QCOM_CLK_RPMH and COMMON_CLK_QCOM as modules and > get a working system, where as without it the display will fail > to load. The purpose of 25b4e70dcce9 ("driver core: allow stopping deferred probe after init") is to ensure that when the kernel boots with a DeviceTree blob that references a resource (power-domain in this case) that either hasn't been compiled in, or simply doesn't exist yet, it should continue to boot - under the assumption that these resources probably aren't needed to provide a functional system. I don't think your patch maintains this behavior, because when userspace kicks in and load kernel modules during the first two minutes they will all end up in the probe deferral list. Past two minutes any event that registers a new driver (i.e. manual intervention) will kick of a new wave of probing, which will now continue as expected, ignoring any power-domains that is yet to be probed (either because they don't exist or they are further down the probe deferral list). You can improve the situation somewhat by calling driver_deferred_probe_trigger() in your deferred_initcall_done_work_func(), to remove the need for human intervention. But the outcome will still depend on the order in deferred_probe_active_list. That said, your patch does resolve an important problem for me! We have a number of drivers providing power-domains that are registered subject to timing in interaction with co-processors. So with a sufficiently small kernel (e.g. heavy use of kernel modules) it's likely that these are registered past late_initcall. Your extension of this to two minutes past late_initcall will for sure be sufficient to avoid this issue. Regards, Bjorn > > Cc: Alexander Graf <agraf@suse.de> > Cc: Rob Herring <robh@kernel.org> > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > Cc: Kevin Hilman <khilman@kernel.org> > Cc: Ulf Hansson <ulf.hansson@linaro.org> > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Len Brown <len.brown@intel.com> > Cc: Todd Kjos <tkjos@google.com> > Cc: Bjorn Andersson <bjorn.andersson@linaro.org> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Cc: linux-pm@vger.kernel.org > Fixes: e01afc3250255 ("PM / Domains: Stop deferring probe at the end of initcall") > Fixes: 25b4e70dcce9 ("driver core: allow stopping deferred probe after init") > Signed-off-by: John Stultz <john.stultz@linaro.org> > --- > drivers/base/dd.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index b25bcab2a26b..35ebae8b65be 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -311,6 +311,12 @@ static void deferred_probe_timeout_work_func(struct work_struct *work) > } > static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_work_func); > > +static void deferred_initcall_done_work_func(struct work_struct *work) > +{ > + initcalls_done = true; > +} > +static DECLARE_DELAYED_WORK(deferred_initcall_done_work, deferred_initcall_done_work_func); > + > /** > * deferred_probe_initcall() - Enable probing of deferred devices > * > @@ -327,7 +333,7 @@ static int deferred_probe_initcall(void) > driver_deferred_probe_trigger(); > /* Sort as many dependencies as possible before exiting initcalls */ > flush_work(&deferred_probe_work); > - initcalls_done = true; > + schedule_delayed_work(&deferred_initcall_done_work, 120 * HZ); > > /* > * Trigger deferred probe again, this time we won't defer anything > -- > 2.17.1 >
On Thu, Feb 13, 2020 at 5:51 PM Rob Herring <robh@kernel.org> wrote: > On Thu, Feb 13, 2020 at 6:44 PM John Stultz <john.stultz@linaro.org> wrote: > > Due to commit e01afc3250255 ("PM / Domains: Stop deferring probe > > at the end of initcall"), along with commit 25b4e70dcce9 > > ("driver core: allow stopping deferred probe after init") after > > late_initcall, drivers will stop getting EPROBE_DEFER, and > > instead see an error causing the driver to fail to load. > > > > That change causes trouble when trying to use many clk drivers > > as modules, as the clk modules may not load until much later > > after init has started. If a dependent driver loads and gets an > > error instead of EPROBE_DEFER, it won't try to reload later when > > the dependency is met, and will thus fail to load. > > > > Instead of reverting that patch, this patch tries to extend the > > time that EPROBE_DEFER is retruned by two minutes, to (hopefully) > > ensure that everything has had a chance to load. > > I think regulators already has some delay like this. We should use the > same timeouts. Sounds good. My memory was a bit foggy from the time I initially brought this up, and I looked briefly before sending this out and didn't find the regulator change you had mentioned. If you have a specific pointer (or patch name or something) let me know, otherwise I'll dig around later tonight/tomorrow. > We also have the 'deferred_probe_timeout' cmdline option. It's deemed > a debug option currently, but we could change that and change the > default. I looked at that code, but couldn't really make heads or tails of it. The initcalls_done is checked and returns before the deferred_probe_timeout is looked at, so I was guessing the deferred_probe_timeout was addressing a bit more subtle issue than what I was going for. If its really the same functionality, I'm happy to try to rework it. > > Specifically, on db845c, this change allows us to set > > SDM_GPUCC_845, QCOM_CLK_RPMH and COMMON_CLK_QCOM as modules and > > get a working system, where as without it the display will fail > > to load. > > > > Cc: Alexander Graf <agraf@suse.de> > > Cc: Rob Herring <robh@kernel.org> > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > > Cc: Kevin Hilman <khilman@kernel.org> > > Cc: Ulf Hansson <ulf.hansson@linaro.org> > > Cc: Pavel Machek <pavel@ucw.cz> > > Cc: Len Brown <len.brown@intel.com> > > Cc: Todd Kjos <tkjos@google.com> > > Cc: Bjorn Andersson <bjorn.andersson@linaro.org> > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Cc: linux-pm@vger.kernel.org > > Fixes: e01afc3250255 ("PM / Domains: Stop deferring probe at the end of initcall") > > Fixes: 25b4e70dcce9 ("driver core: allow stopping deferred probe after init") > > We can debate the design, but those work as designed. So Fixes? > Well, clk module loading would have worked, and then stopped working with e01afc3250255, so it is a regression of sorts. And really the tags are mostly for making sure patches get applied to trees that backported these commits (and it's not my intention to shame a patch as broken. :) thanks -john
On Thu, Feb 13, 2020 at 6:19 PM Bjorn Andersson <bjorn.andersson@linaro.org> wrote: > The purpose of 25b4e70dcce9 ("driver core: allow stopping deferred probe > after init") is to ensure that when the kernel boots with a DeviceTree > blob that references a resource (power-domain in this case) that either > hasn't been compiled in, or simply doesn't exist yet, it should continue > to boot - under the assumption that these resources probably aren't > needed to provide a functional system. > > I don't think your patch maintains this behavior, because when userspace > kicks in and load kernel modules during the first two minutes they will > all end up in the probe deferral list. Past two minutes any event that > registers a new driver (i.e. manual intervention) will kick of a new > wave of probing, which will now continue as expected, ignoring any > power-domains that is yet to be probed (either because they don't exist > or they are further down the probe deferral list). Hmm. I'll have to look at that again. I worry the logic is overloaded a bit, because the logic in __driver_deferred_probe_check_state() will only return -EPROBE_DEFER before late_initcall otherwise it returns -ETIMEDOUT or 0. So if we call__genpd_dev_pm_attach() after late_initcall and the pd isn't ready, the driver probe will fail permanently and not function. I'd think in the case you describe (correct me if I'm misunderstanding you), modules that load in the first two minutes would hit EPROBE_DEFER only if a dependency is missing, and will continue to try to probe next round. But once the two minutes are up, they will catch ETIMEDOUT and fail permanently. > You can improve the situation somewhat by calling > driver_deferred_probe_trigger() in your > deferred_initcall_done_work_func(), to remove the need for human > intervention. But the outcome will still depend on the order in > deferred_probe_active_list. Ok. I'll take a look at that. Thanks so much for the feedback! -john
On Thu 13 Feb 20:05 PST 2020, John Stultz wrote: > On Thu, Feb 13, 2020 at 6:19 PM Bjorn Andersson > <bjorn.andersson@linaro.org> wrote: > > The purpose of 25b4e70dcce9 ("driver core: allow stopping deferred probe > > after init") is to ensure that when the kernel boots with a DeviceTree > > blob that references a resource (power-domain in this case) that either > > hasn't been compiled in, or simply doesn't exist yet, it should continue > > to boot - under the assumption that these resources probably aren't > > needed to provide a functional system. > > > > I don't think your patch maintains this behavior, because when userspace > > kicks in and load kernel modules during the first two minutes they will > > all end up in the probe deferral list. Past two minutes any event that > > registers a new driver (i.e. manual intervention) will kick of a new > > wave of probing, which will now continue as expected, ignoring any > > power-domains that is yet to be probed (either because they don't exist > > or they are further down the probe deferral list). > > Hmm. I'll have to look at that again. I worry the logic is overloaded > a bit, because the logic in __driver_deferred_probe_check_state() will > only return -EPROBE_DEFER before late_initcall otherwise it returns > -ETIMEDOUT or 0. So if we call__genpd_dev_pm_attach() after > late_initcall and the pd isn't ready, the driver probe will fail > permanently and not function. > Correct. And the motivation for this is that if you use a dtb from the future it might describe a power-domain provider that is not yet implemented in the booting kernel and as such the purpose is to fail fast - in a way that drivers can ignore, rather than probe deferring indefinitely. > I'd think in the case you describe (correct me if I'm misunderstanding > you), modules that load in the first two minutes would hit > EPROBE_DEFER only if a dependency is missing, and will continue to try > to probe next round. But once the two minutes are up, they will catch > ETIMEDOUT and fail permanently. > This extends the time that probe deferral is functional from late_initcall to 2 minutes from boot, which should solve all practical problems you and I have with the current situation. But the specific detail that your patch is missing is that drivers that probe defer will end up on the deferral list and this list is only processed whenever drivers are added or some driver succeeds to probe. So before the 2 minutes the deferral dance will stop and you need one of these events to kick off the dance again. > > You can improve the situation somewhat by calling > > driver_deferred_probe_trigger() in your > > deferred_initcall_done_work_func(), to remove the need for human > > intervention. But the outcome will still depend on the order in > > deferred_probe_active_list. > > Ok. I'll take a look at that. > Cool > Thanks so much for the feedback! Thank you for working on this, I've spent days debugging subtle issues because of this feature... Regards, Bjorn
On Thu, Feb 13, 2020 at 8:42 PM John Stultz <john.stultz@linaro.org> wrote: > > On Thu, Feb 13, 2020 at 5:51 PM Rob Herring <robh@kernel.org> wrote: > > On Thu, Feb 13, 2020 at 6:44 PM John Stultz <john.stultz@linaro.org> wrote: > > > Due to commit e01afc3250255 ("PM / Domains: Stop deferring probe > > > at the end of initcall"), along with commit 25b4e70dcce9 > > > ("driver core: allow stopping deferred probe after init") after > > > late_initcall, drivers will stop getting EPROBE_DEFER, and > > > instead see an error causing the driver to fail to load. > > > > > > That change causes trouble when trying to use many clk drivers > > > as modules, as the clk modules may not load until much later > > > after init has started. If a dependent driver loads and gets an > > > error instead of EPROBE_DEFER, it won't try to reload later when > > > the dependency is met, and will thus fail to load. > > > > > > Instead of reverting that patch, this patch tries to extend the > > > time that EPROBE_DEFER is retruned by two minutes, to (hopefully) > > > ensure that everything has had a chance to load. > > > > I think regulators already has some delay like this. We should use the > > same timeouts. > > Sounds good. My memory was a bit foggy from the time I initially > brought this up, and I looked briefly before sending this out and > didn't find the regulator change you had mentioned. If you have a > specific pointer (or patch name or something) let me know, otherwise > I'll dig around later tonight/tomorrow. > > > We also have the 'deferred_probe_timeout' cmdline option. It's deemed > > a debug option currently, but we could change that and change the > > default. > > I looked at that code, but couldn't really make heads or tails of it. > The initcalls_done is checked and returns before the > deferred_probe_timeout is looked at, so I was guessing the > deferred_probe_timeout was addressing a bit more subtle issue than > what I was going for. If its really the same functionality, I'm happy > to try to rework it. Subsystems have to opt-in in the current code to stop deferring at init done. Though this was extended with driver_deferred_probe_check_state_continue() which defers until the timeout (commit 62a6bc3a1e4f4). The only purpose of the timeout was to dump out what devices are still deferred. I don't see any point in having 2 timeouts, so make deferred_probe_timeout do what you want it to do. I think that's mainly default to a sensible timeout value and have the subsystems you care about honor the timeout. Perhaps we don't need both driver_deferred_probe_check_state() and driver_deferred_probe_check_state_continue() and can just have the latter if all the subsystems need to honor the timeout. I think it would be good if the console printed something periodically about waiting for drivers. That way the system doesn't appear hung for 30 or 120 sec before giving up. Also, there was one issue in particular that I ran into. Consoles have to be built in and probed before init is done. So if your UART has any dependencies, then those dependencies also have to be built-in. I don't think anyone working on the 'everything is a module' problem has fixed that. > > > Specifically, on db845c, this change allows us to set > > > SDM_GPUCC_845, QCOM_CLK_RPMH and COMMON_CLK_QCOM as modules and > > > get a working system, where as without it the display will fail > > > to load. > > > > > > Cc: Alexander Graf <agraf@suse.de> > > > Cc: Rob Herring <robh@kernel.org> > > > Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> > > > Cc: Kevin Hilman <khilman@kernel.org> > > > Cc: Ulf Hansson <ulf.hansson@linaro.org> > > > Cc: Pavel Machek <pavel@ucw.cz> > > > Cc: Len Brown <len.brown@intel.com> > > > Cc: Todd Kjos <tkjos@google.com> > > > Cc: Bjorn Andersson <bjorn.andersson@linaro.org> > > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > > Cc: linux-pm@vger.kernel.org > > > Fixes: e01afc3250255 ("PM / Domains: Stop deferring probe at the end of initcall") > > > Fixes: 25b4e70dcce9 ("driver core: allow stopping deferred probe after init") > > > > We can debate the design, but those work as designed. So Fixes? > > > > Well, clk module loading would have worked, and then stopped working > with e01afc3250255, so it is a regression of sorts. If there was a clock driver upstream where this worked, then I agree. > And really the > tags are mostly for making sure patches get applied to trees that > backported these commits (and it's not my intention to shame a patch > as broken. :) But 'Fixes' implies 'apply to stable' which is not the same as 'apply backport to a distro kernel'. Rob
diff --git a/drivers/base/dd.c b/drivers/base/dd.c index b25bcab2a26b..35ebae8b65be 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -311,6 +311,12 @@ static void deferred_probe_timeout_work_func(struct work_struct *work) } static DECLARE_DELAYED_WORK(deferred_probe_timeout_work, deferred_probe_timeout_work_func); +static void deferred_initcall_done_work_func(struct work_struct *work) +{ + initcalls_done = true; +} +static DECLARE_DELAYED_WORK(deferred_initcall_done_work, deferred_initcall_done_work_func); + /** * deferred_probe_initcall() - Enable probing of deferred devices * @@ -327,7 +333,7 @@ static int deferred_probe_initcall(void) driver_deferred_probe_trigger(); /* Sort as many dependencies as possible before exiting initcalls */ flush_work(&deferred_probe_work); - initcalls_done = true; + schedule_delayed_work(&deferred_initcall_done_work, 120 * HZ); /* * Trigger deferred probe again, this time we won't defer anything
Due to commit e01afc3250255 ("PM / Domains: Stop deferring probe at the end of initcall"), along with commit 25b4e70dcce9 ("driver core: allow stopping deferred probe after init") after late_initcall, drivers will stop getting EPROBE_DEFER, and instead see an error causing the driver to fail to load. That change causes trouble when trying to use many clk drivers as modules, as the clk modules may not load until much later after init has started. If a dependent driver loads and gets an error instead of EPROBE_DEFER, it won't try to reload later when the dependency is met, and will thus fail to load. Instead of reverting that patch, this patch tries to extend the time that EPROBE_DEFER is retruned by two minutes, to (hopefully) ensure that everything has had a chance to load. Specifically, on db845c, this change allows us to set SDM_GPUCC_845, QCOM_CLK_RPMH and COMMON_CLK_QCOM as modules and get a working system, where as without it the display will fail to load. Cc: Alexander Graf <agraf@suse.de> Cc: Rob Herring <robh@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Kevin Hilman <khilman@kernel.org> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <len.brown@intel.com> Cc: Todd Kjos <tkjos@google.com> Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: linux-pm@vger.kernel.org Fixes: e01afc3250255 ("PM / Domains: Stop deferring probe at the end of initcall") Fixes: 25b4e70dcce9 ("driver core: allow stopping deferred probe after init") Signed-off-by: John Stultz <john.stultz@linaro.org> --- drivers/base/dd.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)