Message ID | 1519433209-14581-1-git-send-email-fkan@apm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sat, Feb 24, 2018 at 1:46 AM, Feng Kan <fkan@apm.com> wrote: > This is not a patch, but rather a question regarding the deferred > probe's effect on PCIe PM ordering. This happens on our system > which defer the probing of root bridge due to the IOMMU not being > ready. Because of the deferred action, the bridge is moved to the > end of the dpm_list which results in incorrect suspend and resume > sequence. > > In the cases I have seen, the bridge is always reordered because of > startup sequence. They are always place after the endpoint. If that > is the case the following code should be able to prevent such cases. > However, is there some cases here that would violate such situation? The code in dd.c assumes that the device being probed has no children (or consumers, for that matter) and so it is safe to move it to the end of the list. If the device has children (or consumers), moving it to the end of the list by itself doesn't work, which is the case for you. > --- > drivers/base/dd.c | 20 +++++++++++--------- > 1 file changed, 11 insertions(+), 9 deletions(-) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index de6fd09..5b96d5c 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -116,15 +116,17 @@ static void deferred_probe_work_func(struct work_struct *work) > */ > mutex_unlock(&deferred_probe_mutex); > > - /* > - * Force the device to the end of the dpm_list since > - * the PM code assumes that the order we add things to > - * the list is a good order for suspend but deferred > - * probe makes that very unsafe. > - */ > - device_pm_lock(); > - device_pm_move_last(dev); > - device_pm_unlock(); > + if (!dev_is_pci(dev)) { You can't do this here. It is not clear why all PCI devices should be an exception in the first place. There may be PCI devices without children that need to be moved to the end of the list and you'd break that case. > + /* > + * Force the device to the end of the dpm_list since > + * the PM code assumes that the order we add things to > + * the list is a good order for suspend but deferred > + * probe makes that very unsafe. > + */ > + device_pm_lock(); > + device_pm_move_last(dev); > + device_pm_unlock(); > + } > > dev_dbg(dev, "Retrying from deferred list\n"); > if (initcall_debug && !initcalls_done) > -- You can try to replace the device_pm_move_last(dev) in deferred_probe_work_func(struct() with device_reorder_to_tail(), but that has to be called under device_links_read_lock/unloc () and device_pm_lock/unloc() (in the right order).
On Sat, Feb 24, 2018 at 10:35 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Sat, Feb 24, 2018 at 1:46 AM, Feng Kan <fkan@apm.com> wrote: >> This is not a patch, but rather a question regarding the deferred >> probe's effect on PCIe PM ordering. This happens on our system >> which defer the probing of root bridge due to the IOMMU not being >> ready. Because of the deferred action, the bridge is moved to the >> end of the dpm_list which results in incorrect suspend and resume >> sequence. >> >> In the cases I have seen, the bridge is always reordered because of >> startup sequence. They are always place after the endpoint. If that >> is the case the following code should be able to prevent such cases. >> However, is there some cases here that would violate such situation? > > The code in dd.c assumes that the device being probed has no children > (or consumers, for that matter) and so it is safe to move it to the > end of the list. > > If the device has children (or consumers), moving it to the end of the > list by itself doesn't work, which is the case for you. > >> --- >> drivers/base/dd.c | 20 +++++++++++--------- >> 1 file changed, 11 insertions(+), 9 deletions(-) >> >> diff --git a/drivers/base/dd.c b/drivers/base/dd.c >> index de6fd09..5b96d5c 100644 >> --- a/drivers/base/dd.c >> +++ b/drivers/base/dd.c >> @@ -116,15 +116,17 @@ static void deferred_probe_work_func(struct work_struct *work) >> */ >> mutex_unlock(&deferred_probe_mutex); >> >> - /* >> - * Force the device to the end of the dpm_list since >> - * the PM code assumes that the order we add things to >> - * the list is a good order for suspend but deferred >> - * probe makes that very unsafe. >> - */ >> - device_pm_lock(); >> - device_pm_move_last(dev); >> - device_pm_unlock(); >> + if (!dev_is_pci(dev)) { > > You can't do this here. > > It is not clear why all PCI devices should be an exception in the > first place. There may be PCI devices without children that need to > be moved to the end of the list and you'd break that case. > >> + /* >> + * Force the device to the end of the dpm_list since >> + * the PM code assumes that the order we add things to >> + * the list is a good order for suspend but deferred >> + * probe makes that very unsafe. >> + */ >> + device_pm_lock(); >> + device_pm_move_last(dev); >> + device_pm_unlock(); >> + } >> >> dev_dbg(dev, "Retrying from deferred list\n"); >> if (initcall_debug && !initcalls_done) >> -- > > You can try to replace the device_pm_move_last(dev) in > deferred_probe_work_func(struct() with device_reorder_to_tail(), but > that has to be called under device_links_read_lock/unloc () and > device_pm_lock/unloc() (in the right order). Alternatively, you can replace your !dev_is_pci(dev) check with a check for the presence of children or consumers and only move the device to the end of the PM list if there are not any. That would kind of make sense, but it may break other assumptions in the deferred probing mechanism which I don't recall ATM. Or avoid deferred probing of the host bridge driver entirely. I guess you can use it with limited functionality until the IOMMU driver is ready and switch over to the fully functional operation mode when that happens, but that would need to hook into the IOMMU code somehow.
On Sat, Feb 24, 2018 at 2:59 PM, Lukas Wunner <lukas@wunner.de> wrote: > On Sat, Feb 24, 2018 at 10:47:26AM +0100, Rafael J. Wysocki wrote: >> On Sat, Feb 24, 2018 at 10:35 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: >> > On Sat, Feb 24, 2018 at 1:46 AM, Feng Kan <fkan@apm.com> wrote: >> >> This is not a patch, but rather a question regarding the deferred >> >> probe's effect on PCIe PM ordering. This happens on our system >> >> which defer the probing of root bridge due to the IOMMU not being >> >> ready. Because of the deferred action, the bridge is moved to the >> >> end of the dpm_list which results in incorrect suspend and resume >> >> sequence. >> >> >> >> In the cases I have seen, the bridge is always reordered because of >> >> startup sequence. They are always place after the endpoint. If that >> >> is the case the following code should be able to prevent such cases. >> >> However, is there some cases here that would violate such situation? >> > >> > The code in dd.c assumes that the device being probed has no children >> > (or consumers, for that matter) and so it is safe to move it to the >> > end of the list. >> > >> > If the device has children (or consumers), moving it to the end of the >> > list by itself doesn't work, which is the case for you. >> > >> > You can try to replace the device_pm_move_last(dev) in >> > deferred_probe_work_func(struct() with device_reorder_to_tail(), but >> > that has to be called under device_links_read_lock/unloc () and >> > device_pm_lock/unloc() (in the right order). >> >> Alternatively, you can replace your !dev_is_pci(dev) check with a >> check for the presence of children or consumers and only move the >> device to the end of the PM list if there are not any. That would >> kind of make sense, but it may break other assumptions in the deferred >> probing mechanism which I don't recall ATM. >> >> Or avoid deferred probing of the host bridge driver entirely. I guess >> you can use it with limited functionality until the IOMMU driver is >> ready and switch over to the fully functional operation mode when that >> happens, but that would need to hook into the IOMMU code somehow. > > Or model the root bridge's dependency on the iommu using a device link: > https://www.kernel.org/doc/html/latest/driver-api/device_link.html Apparently, there are children registered under the bridge device before the driver for it is probed. That is kind of unusual and I'm not really sure why it happens at all, because (in theory) the bridge driver should be responsible for assigning resources to the devices on the bus, among other things. Depending on what the reason is, your suggestion may or may not work.
On Sun, Feb 25, 2018 at 1:29 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: > On Sat, Feb 24, 2018 at 2:59 PM, Lukas Wunner <lukas@wunner.de> wrote: >> On Sat, Feb 24, 2018 at 10:47:26AM +0100, Rafael J. Wysocki wrote: >>> On Sat, Feb 24, 2018 at 10:35 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: >>> > On Sat, Feb 24, 2018 at 1:46 AM, Feng Kan <fkan@apm.com> wrote: >>> >> This is not a patch, but rather a question regarding the deferred >>> >> probe's effect on PCIe PM ordering. This happens on our system >>> >> which defer the probing of root bridge due to the IOMMU not being >>> >> ready. Because of the deferred action, the bridge is moved to the >>> >> end of the dpm_list which results in incorrect suspend and resume >>> >> sequence. >>> >> >>> >> In the cases I have seen, the bridge is always reordered because of >>> >> startup sequence. They are always place after the endpoint. If that >>> >> is the case the following code should be able to prevent such cases. >>> >> However, is there some cases here that would violate such situation? >>> > >>> > The code in dd.c assumes that the device being probed has no children >>> > (or consumers, for that matter) and so it is safe to move it to the >>> > end of the list. >>> > >>> > If the device has children (or consumers), moving it to the end of the >>> > list by itself doesn't work, which is the case for you. >>> > >>> > You can try to replace the device_pm_move_last(dev) in >>> > deferred_probe_work_func(struct() with device_reorder_to_tail(), but >>> > that has to be called under device_links_read_lock/unloc () and >>> > device_pm_lock/unloc() (in the right order). >>> >>> Alternatively, you can replace your !dev_is_pci(dev) check with a >>> check for the presence of children or consumers and only move the >>> device to the end of the PM list if there are not any. That would >>> kind of make sense, but it may break other assumptions in the deferred >>> probing mechanism which I don't recall ATM. >>> >>> Or avoid deferred probing of the host bridge driver entirely. I guess >>> you can use it with limited functionality until the IOMMU driver is >>> ready and switch over to the fully functional operation mode when that >>> happens, but that would need to hook into the IOMMU code somehow. >> >> Or model the root bridge's dependency on the iommu using a device link: >> https://www.kernel.org/doc/html/latest/driver-api/device_link.html > > Apparently, there are children registered under the bridge device > before the driver for it is probed. Yes, so the order seems to be like this: 1. root port and endporint is enumerated and added to dpm_list 2. root port probed and deferred 3. iommu probed and successful 4. root port probed again and successful -> moved to last in dpm_list 5. endpoint probed and successful I guess another approach would be to prevent 1 from happening by not adding to dpm_list until the probe is successful. > > That is kind of unusual and I'm not really sure why it happens at all, > because (in theory) the bridge driver should be responsible for > assigning resources to the devices on the bus, among other things. > Depending on what the reason is, your suggestion may or may not work.
On Sun, Feb 25, 2018 at 5:28 PM, Feng Kan <fkan@apm.com> wrote: > On Sun, Feb 25, 2018 at 1:29 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: >> On Sat, Feb 24, 2018 at 2:59 PM, Lukas Wunner <lukas@wunner.de> wrote: >>> On Sat, Feb 24, 2018 at 10:47:26AM +0100, Rafael J. Wysocki wrote: >>>> On Sat, Feb 24, 2018 at 10:35 AM, Rafael J. Wysocki <rafael@kernel.org> wrote: >>>> > On Sat, Feb 24, 2018 at 1:46 AM, Feng Kan <fkan@apm.com> wrote: >>>> >> This is not a patch, but rather a question regarding the deferred >>>> >> probe's effect on PCIe PM ordering. This happens on our system >>>> >> which defer the probing of root bridge due to the IOMMU not being >>>> >> ready. Because of the deferred action, the bridge is moved to the >>>> >> end of the dpm_list which results in incorrect suspend and resume >>>> >> sequence. >>>> >> >>>> >> In the cases I have seen, the bridge is always reordered because of >>>> >> startup sequence. They are always place after the endpoint. If that >>>> >> is the case the following code should be able to prevent such cases. >>>> >> However, is there some cases here that would violate such situation? >>>> > >>>> > The code in dd.c assumes that the device being probed has no children >>>> > (or consumers, for that matter) and so it is safe to move it to the >>>> > end of the list. >>>> > >>>> > If the device has children (or consumers), moving it to the end of the >>>> > list by itself doesn't work, which is the case for you. >>>> > >>>> > You can try to replace the device_pm_move_last(dev) in >>>> > deferred_probe_work_func(struct() with device_reorder_to_tail(), but >>>> > that has to be called under device_links_read_lock/unloc () and >>>> > device_pm_lock/unloc() (in the right order). >>>> >>>> Alternatively, you can replace your !dev_is_pci(dev) check with a >>>> check for the presence of children or consumers and only move the >>>> device to the end of the PM list if there are not any. That would >>>> kind of make sense, but it may break other assumptions in the deferred >>>> probing mechanism which I don't recall ATM. >>>> >>>> Or avoid deferred probing of the host bridge driver entirely. I guess >>>> you can use it with limited functionality until the IOMMU driver is >>>> ready and switch over to the fully functional operation mode when that >>>> happens, but that would need to hook into the IOMMU code somehow. >>> >>> Or model the root bridge's dependency on the iommu using a device link: >>> https://www.kernel.org/doc/html/latest/driver-api/device_link.html >> >> Apparently, there are children registered under the bridge device >> before the driver for it is probed. > > Yes, so the order seems to be like this: > 1. root port and endporint is enumerated and added to dpm_list > 2. root port probed and deferred > 3. iommu probed and successful > 4. root port probed again and successful -> moved to last in dpm_list > 5. endpoint probed and successful > > I guess another approach would be to prevent 1 from happening by not > adding to dpm_list until the probe is successful. No, system-wide suspend/resume of PCIe devices is handled by the PCI bus type layer in part, so they have to be in dpm_list from the outset. Root ports, as their parents, need to be there before the endpoints.
diff --git a/drivers/base/dd.c b/drivers/base/dd.c index de6fd09..5b96d5c 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -116,15 +116,17 @@ static void deferred_probe_work_func(struct work_struct *work) */ mutex_unlock(&deferred_probe_mutex); - /* - * Force the device to the end of the dpm_list since - * the PM code assumes that the order we add things to - * the list is a good order for suspend but deferred - * probe makes that very unsafe. - */ - device_pm_lock(); - device_pm_move_last(dev); - device_pm_unlock(); + if (!dev_is_pci(dev)) { + /* + * Force the device to the end of the dpm_list since + * the PM code assumes that the order we add things to + * the list is a good order for suspend but deferred + * probe makes that very unsafe. + */ + device_pm_lock(); + device_pm_move_last(dev); + device_pm_unlock(); + } dev_dbg(dev, "Retrying from deferred list\n"); if (initcall_debug && !initcalls_done)