Message ID | 5783e36561bb77a1deb6ba67e5a9824488cc69c6.camel@kernel.crashing.org (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
Series | [RFC,v2] arm64: acpi/pci: invoke _DSM whether to preserve firmware PCI setup | expand |
On Thu, Jun 13, 2019 at 05:54:56PM +1000, Benjamin Herrenschmidt wrote: > The current arm64 PCI code for ACPI platforms will unconditionally > reassign all resources. > > This is not only suboptimal, it's also wrong for a number of cases, for > example, this could invalidate a UEFI framebuffer address, or a runtime > firmware could be using some of the devices in their original location. > > There is an ACPI method defined today for P2P bridges (_DSM #5) that > can indicate that a bridge resources set by firmware. There is current > discussions to extend that method to cover host bridges, and define > a value of "0" as meaning that the resources must be preserved. The ECR does extend the r3.2 spec so the _DSM can apply to host bridges. Apart from that change, the ECR clarifies the language without changing the sense. The meaning of "0" doesn't change. > This patch adds the resource assignment policy to struct > pci_host_bridge and sets it based on the presence of that method and if > present the value returned, and honors it on arm64. > > No other architectures are currently affected, and the default is kept > to "reassign everything" on arm64 for now via an #ifdef, though we do > plan to get rid of that in a separate patch. > > The setting in pci_host_bridge "looks" generic because I intend in > subsquent work to consolidate the resource allocation policy accross > architectures and I intend for that setting to be the canonical > location used by the generic code to decide what to do. > > This is based on some earlier work by > Ard Biesheuvel <ard.biesheuvel@linaro.org> > > Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> > --- > > arch/arm64/kernel/pci.c | 12 ++++++++++-- > drivers/acpi/pci_root.c | 42 ++++++++++++++++++++++++++++++++++++++++ > include/linux/pci-acpi.h | 7 ++++--- > include/linux/pci.h | 10 ++++++++++ > 4 files changed, 66 insertions(+), 5 deletions(-) > > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c > index bb85e2f4603f..b209a506f390 100644 > --- a/arch/arm64/kernel/pci.c > +++ b/arch/arm64/kernel/pci.c > @@ -168,6 +168,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root) > struct acpi_pci_generic_root_info *ri; > struct pci_bus *bus, *child; > struct acpi_pci_root_ops *root_ops; > + struct pci_host_bridge *hb; The only other use of "struct pci_host_bridge *" in this file uses "bridge" as the variable, so I'd follow suit. > ri = kzalloc(sizeof(*ri), GFP_KERNEL); > if (!ri) > @@ -193,8 +194,15 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root) > if (!bus) > return NULL; > > - pci_bus_size_bridges(bus); > - pci_bus_assign_resources(bus); > + hb = pci_find_host_bridge(bus); > + > + /* If the policy is normal or probe only, claim existing resources */ > + if (hb->rsrc_policy != pci_rsrc_reassign) > + pci_bus_claim_resources(bus); > + > + /* If the policy is not probe only, assign what's left unassigned */ > + if (hb->rsrc_policy != pci_rsrc_probe_only) > + pci_assign_unassigned_root_bus_resources(bus); > > list_for_each_entry(child, &bus->children, node) > pcie_bus_configure_settings(child); > diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c > index 39f5d172e84f..410f7f2b2587 100644 > --- a/drivers/acpi/pci_root.c > +++ b/drivers/acpi/pci_root.c > @@ -881,6 +881,7 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, > int node = acpi_get_node(device->handle); > struct pci_bus *bus; > struct pci_host_bridge *host_bridge; > + union acpi_object *obj; > > info->root = root; > info->bridge = device; > @@ -917,6 +918,47 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, > if (!(root->osc_control_set & OSC_PCI_EXPRESS_LTR_CONTROL)) > host_bridge->native_ltr = 0; > > + /* > + * Invoke the PCI device specific method (_DSM) #5 'Ignore PCI Boot > + * Configuration', on the host bridge. This tells us whether the > + * firmware wants us to preserve or reassign the configuration of > + * the PCI resource tree for this root bridge. > + * > + * There are three possible outcomes here: > + * > + * - _DSM #5 is absent. This is the default. Currently it will be > + * architecture specific in order to maintain existing behaviours > + * but the plan is to move arm64 into the fold: x86 and ia64 will > + * claim the existing config, and reassign if needed. arm64 will > + * always reassign. The spec (PCI FW r3.2) says a _DSM that returns 0 means "OS must not ignore config done by firmware". The ECR in the works changes the wording to something like "OS must preserve config done by firmware", which is equivalent but clearer. The r3.2 spec goes on to suggest that a missing _DSM means the same thing ("OS must not ignore firmware config"). I think that part is crap, and the ECR removes that wording. I don't accept the _DSM #5 section in the PCI FW spec as being normative about what a missing _DSM #5 means. This section didn't even exist until r3.2, and all it says is "this situation is the same as the legacy situation where this _DSM is not provided". That's just hand-waving; it's not a requirement. I'm not aware of any spec that says the OS can't change PCI resources (if there is such a spec, please cite it). So my opinion is that a missing _DSM means nothing, and the default situation is that the OS can change PCI resources as necessary. The ECR input from Windows was that in the absence of a _DSM #5, they keep the boot configuration unless a FW bug causes an overlap, a hot-add requires rebalancing, or the system includes external (e.g., Thunderbolt) devices. That's what I think Linux should do, too: keep the config from firmware unless we have a reason to change it. > + * - _DSM #5 exists and is 1. This is the FW telling us to ignore > + * the configuration it performed. This is currently only supported > + * on arm64. The r3.2 spec actually says "the OS *may* ignore config done by firmware". There's no *requirement* that the OS change anything. IMHO *this* is the same as the case where there's no _DSM at all. > + * - _DSM #5 exists and is 0. This should be the same as the default > + * (_DSM #5 absent). However there are some assumptions flying around > + * that this means we must keep the FW configuration intact. So we > + * treat that as "probe only" for the time being. This is currently > + * only supported on arm64. PCI FW r3.2 says 0 means "the OS must not ignore config done by firmware." That means we must keep the FW configuration intact. > + obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1, > + IGNORE_PCI_BOOT_CONFIG_DSM, NULL); > + if (obj && obj->type == ACPI_TYPE_INTEGER) { > + if (obj->integer.value == 1) > + host_bridge->rsrc_policy = pci_rsrc_reassign; > + else > + host_bridge->rsrc_policy = pci_rsrc_probe_only; > + } else { > + /* Default is arch specific ... for now */ > +#ifdef CONFIG_ARM64 > + host_bridge->rsrc_policy = pci_rsrc_reassign; > +#else > + host_bridge->rsrc_policy = pci_rsrc_normal; > +#endif > + } I think this needs to be a single bit, not a 3-choice thing. I don't think it's possible to clearly explain how pci_rsrc_normal is different from pci_rsrc_reassign. We either need to preserve the config or we don't. A middle ground of "we don't need to preserve the config and in fact we *must* reassign resources" is pointless because we don't know *why* we have to reassign things, and we don't know what sort of change would be correct. > + ACPI_FREE(obj); > + > pci_scan_child_bus(bus); > pci_set_host_bridge_release(host_bridge, acpi_pci_root_release_info, > info); > diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h > index 8082b612f561..62b7fdcc661c 100644 > --- a/include/linux/pci-acpi.h > +++ b/include/linux/pci-acpi.h > @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { } > #endif > > extern const guid_t pci_acpi_dsm_guid; > -#define DEVICE_LABEL_DSM 0x07 > -#define RESET_DELAY_DSM 0x08 > -#define FUNCTION_DELAY_DSM 0x09 > +#define IGNORE_PCI_BOOT_CONFIG_DSM 0x05 > +#define DEVICE_LABEL_DSM 0x07 > +#define RESET_DELAY_DSM 0x08 > +#define FUNCTION_DELAY_DSM 0x09 > > #else /* CONFIG_ACPI */ > static inline void acpi_pci_add_bus(struct pci_bus *bus) { } > diff --git a/include/linux/pci.h b/include/linux/pci.h > index dd436da7eccc..7ff5cedb30cf 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -486,6 +486,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev) > return (pdev->error_state != pci_channel_io_normal); > } > > +enum pci_host_rsrc_policy { > + pci_rsrc_normal, /* Probe and (re)assign what's missing/broken */ > + pci_rsrc_probe_only, /* Probe only */ > + pci_rsrc_reassign, /* Reassign resources */ > +}; > + > struct pci_host_bridge { > struct device dev; > struct pci_bus *bus; /* Root bus */ > @@ -506,6 +512,10 @@ struct pci_host_bridge { > unsigned int native_shpc_hotplug:1; /* OS may use SHPC hotplug */ > unsigned int native_pme:1; /* OS may use PCIe PME */ > unsigned int native_ltr:1; /* OS may use PCIe LTR */ > + > + /* Resource assignment/allocation policy */ > + enum pci_host_rsrc_policy rsrc_policy; > + > /* Resource alignment requirements */ > resource_size_t (*align_resource)(struct pci_dev *dev, > const struct resource *res, > >
On Thu, 2019-06-13 at 14:02 -0500, Bjorn Helgaas wrote: > > + /* > > + * Invoke the PCI device specific method (_DSM) #5 'Ignore PCI Boot > > + * Configuration', on the host bridge. This tells us whether the > > + * firmware wants us to preserve or reassign the configuration of > > + * the PCI resource tree for this root bridge. > > + * > > + * There are three possible outcomes here: > > + * > > + * - _DSM #5 is absent. This is the default. Currently it will be > > + * architecture specific in order to maintain existing behaviours > > + * but the plan is to move arm64 into the fold: x86 and ia64 will > > + * claim the existing config, and reassign if needed. arm64 will > > + * always reassign. > > The spec (PCI FW r3.2) says a _DSM that returns 0 means "OS must not > ignore config done by firmware". The ECR in the works changes the > wording to something like "OS must preserve config done by firmware", > which is equivalent but clearer. > > The r3.2 spec goes on to suggest that a missing _DSM means the same > thing ("OS must not ignore firmware config"). I think that part is > crap, and the ECR removes that wording. > > I don't accept the _DSM #5 section in the PCI FW spec as being > normative about what a missing _DSM #5 means. This section didn't > even exist until r3.2, and all it says is "this situation is the same > as the legacy situation where this _DSM is not provided". That's just > hand-waving; it's not a requirement. > > I'm not aware of any spec that says the OS can't change PCI resources > (if there is such a spec, please cite it). > > So my opinion is that a missing _DSM means nothing, and the default > situation is that the OS can change PCI resources as necessary. Sure, but what do you think of my wording in the patch ? This is pretty much what I've done: Keep the current (default) behaviour, which on x86/ia64 is to claim what's there if it looks sensible and change thing as deemed fit (reassigned what's not sensible, extent HP bridge resources etc...). This patch doesn't change the arm64 behaviour in that case which is to currently reallocate everything but Lorenzo seems to agree that this isn't great and we should change towards a more x86-like default, which I'm working on, but I want to keep that change separate from this patch, if anything for bisectability of potential regressions. > The ECR input from Windows was that in the absence of a _DSM #5, they > keep the boot configuration unless a FW bug causes an overlap, a > hot-add requires rebalancing, or the system includes external (e.g., > Thunderbolt) devices. That's what I think Linux should do, too: keep > the config from firmware unless we have a reason to change it. Right, and this is what x86 does, and this is what we'll eventually do on arm64 as above. > > + * - _DSM #5 exists and is 1. This is the FW telling us to ignore > > + * the configuration it performed. This is currently only supported > > + * on arm64. > > The r3.2 spec actually says "the OS *may* ignore config done by > firmware". There's no *requirement* that the OS change anything. > > IMHO *this* is the same as the case where there's no _DSM at all. May or may not be. Seattle platforms here tells us explicitely that we should ignore the configuration, unless I'm mistaken. Lorenzo do you know better here ? > > + * - _DSM #5 exists and is 0. This should be the same as the default > > + * (_DSM #5 absent). However there are some assumptions flying around > > + * that this means we must keep the FW configuration intact. So we > > + * treat that as "probe only" for the time being. This is currently > > + * only supported on arm64. > > PCI FW r3.2 says 0 means "the OS must not ignore config done by > firmware." That means we must keep the FW configuration intact. So "must not ignore" doesn't mean "keep intact" in my book. In fact, to be honest, the wording in the spec pretty much means the same for all 3 cases, ie, you "may ignore" or "must not ignore" is bullshit. I can perfectly "not ignore but decide it's bad and change it anyway". This is all a terrible wording. Like all firmware specs, it's designed for people to misinterpret it and get things wrong :-) What we know is that there is one platform that implements _DSM #5 today and returns 1 with the intention, unless I misread what I was told, to tell us to explicitly ignore what FW did. > > + obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1, > > + IGNORE_PCI_BOOT_CONFIG_DSM, NULL); > > + if (obj && obj->type == ACPI_TYPE_INTEGER) { > > + if (obj->integer.value == 1) > > + host_bridge->rsrc_policy = pci_rsrc_reassign; > > + else > > + host_bridge->rsrc_policy = pci_rsrc_probe_only; > > + } else { > > + /* Default is arch specific ... for now */ > > +#ifdef CONFIG_ARM64 > > + host_bridge->rsrc_policy = pci_rsrc_reassign; > > +#else > > + host_bridge->rsrc_policy = pci_rsrc_normal; > > +#endif > > + } > > I think this needs to be a single bit, not a 3-choice thing. Well, no :-) The enum in host_bridge needs to be 3 choices because it's meant to represent the 3 policies that Linux has today sprinkled accross architectures in rather inconsistent ways, but I'm not going to break that in one patch and there are reasons for all 3 to exist. We can discuss that separately if you want. I plan to start cleaning up code to use that field to represent the policies properly accross the board as I also consolidate code accross archs/platforms. However, what can possibly be debated here is how _DSM #5 maps to those 3. You think it should map to only two of them, I tend to disagree with your arguments above but short of sitting on the spec commitee and clarifying the wording, it's a matter of how has the pointiest hat and the rounder crystal ball here. > I don't > think it's possible to clearly explain how pci_rsrc_normal is > different from pci_rsrc_reassign. We either need to preserve the > config or we don't. No. "reassign" is "must reassign" or "must ignore FW settings". It may not be what we want to do with ACPI _DSM #5 = 1, but it's definitely what all current ARM32, powerpc4xx, and in general most embedded platforms want today. In fact we have more platforms in the tree who want just that. Now you may want to make the argument that those platforms shouldn't want that, but changing it would probably break a good half of them in one go, that's what they've always done and I'm not here to mess around with this. Once I've consolidated more code, it might become easier on a per platform basis to chose to switch to the "use what's there and reassign what's broken or missing" model (ie, "pci_rsrc_normal"), but that will have to be done on a case by case basis with appropriate testing etc.. > A middle ground of "we don't need to preserve the config and in fact > we *must* reassign resources" is pointless because we don't know *why* > we have to reassign things, and we don't know what sort of change > would be correct. For the ACPI case, maybe ... for all the other cases this is pretty moot. Reassign all is what they do today, and changing it will probably break them. Cheers, Ben. > > + ACPI_FREE(obj); > > + > > pci_scan_child_bus(bus); > > pci_set_host_bridge_release(host_bridge, acpi_pci_root_release_info, > > info); > > diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h > > index 8082b612f561..62b7fdcc661c 100644 > > --- a/include/linux/pci-acpi.h > > +++ b/include/linux/pci-acpi.h > > @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { } > > #endif > > > > extern const guid_t pci_acpi_dsm_guid; > > -#define DEVICE_LABEL_DSM 0x07 > > -#define RESET_DELAY_DSM 0x08 > > -#define FUNCTION_DELAY_DSM 0x09 > > +#define IGNORE_PCI_BOOT_CONFIG_DSM 0x05 > > +#define DEVICE_LABEL_DSM 0x07 > > +#define RESET_DELAY_DSM 0x08 > > +#define FUNCTION_DELAY_DSM 0x09 > > > > #else /* CONFIG_ACPI */ > > static inline void acpi_pci_add_bus(struct pci_bus *bus) { } > > diff --git a/include/linux/pci.h b/include/linux/pci.h > > index dd436da7eccc..7ff5cedb30cf 100644 > > --- a/include/linux/pci.h > > +++ b/include/linux/pci.h > > @@ -486,6 +486,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev) > > return (pdev->error_state != pci_channel_io_normal); > > } > > > > +enum pci_host_rsrc_policy { > > + pci_rsrc_normal, /* Probe and (re)assign what's missing/broken */ > > + pci_rsrc_probe_only, /* Probe only */ > > + pci_rsrc_reassign, /* Reassign resources */ > > +}; > > + > > struct pci_host_bridge { > > struct device dev; > > struct pci_bus *bus; /* Root bus */ > > @@ -506,6 +512,10 @@ struct pci_host_bridge { > > unsigned int native_shpc_hotplug:1; /* OS may use SHPC hotplug */ > > unsigned int native_pme:1; /* OS may use PCIe PME */ > > unsigned int native_ltr:1; /* OS may use PCIe LTR */ > > + > > + /* Resource assignment/allocation policy */ > > + enum pci_host_rsrc_policy rsrc_policy; > > + > > /* Resource alignment requirements */ > > resource_size_t (*align_resource)(struct pci_dev *dev, > > const struct resource *res, > > > >
On Thu, 2019-06-13 at 14:02 -0500, Bjorn Helgaas wrote: > > PCI FW r3.2 says 0 means "the OS must not ignore config done by > firmware." That means we must keep the FW configuration intact. So I tried implementing what you seem to want and it doesn't make sense imho. I think the wording of the spec is terrible and doesn't convey the intent here. What is it that _DSM #5 is about ? Is it about telling us that the FW config shall not be trusted ? That seem to be its original intent based on the existing wording and the fact that "1" says "may ignore". Or was it always intended to be some kind of inverted logic with 0 meaning that we *must* preserve what FW did ? But preserving what FW did was always the default for x86 and Windows... It's just that we happen to do something wrong today on Linux/ARM64 which is to always reassign everything. The way Linux resource assignment works accross platforms has generally been based on one of these 3 methods: - The standard x86 method, which is to claim what's there when it doesn't look completely busted and has been assigned, then assign what's left. This allows for FW doing partial assignment, and allows to work around a number of BIOS bugs. - The "probe only" method. This was created independently on powerpc and some other archs afaik. At least for powerpc, the reason for that is some interesting virtualization cases where we just cannot touch or change or move anything. The effect is to not reassign even what we dont like, and not call pci_assign_unassign_resources(). - The "reassign everything" method. This is used by almost all embedded patforms accross archs. All arm32, all arm64 today (but we agree that's wrong), all embedded powerpc etc... This is basically meant for us not trusting whatever random uboot or other embedded FW, if any, and do a full from-scratch assignment. There are issues in how that is implemented accross the various platforms/archs, some for example still honor existing bus numbers and some don't, but I doubt it's intentional etc... but that method is there to stay. Now, the questions are two fold - How do we map _DSM #5 to these, at least on arm64, maybe in the long run we can also look at affecting x86, but that's less urgent. - How do I ensure the above fixes my Amazon platform ? :-) There's one obvious thing here. If we don't want to break existing things, then the absence of _DSM #5 must not change our existing behaviour. I think we can all agree on this. We might explore changing arm64 default behaviour as a second step since we all agree it's not doing what it should, but we also know that it will probably break some things so we need to be careful, understand the issues and work around them. This isn't the scope of the initial _DSM #5 patch. That leaves us with the _DSM #5 present cases. Now we have two values. What do they mean ? As I already said before, the wording with "must not ignore" and "may ignore" is completely useless and doesn't tell us a thing about the intention here. We don't know why the FW folks may have put a given value here, and what they expect us to do about it. What we do know is that Seattle returns 1 and needs reassignment, and we do know that the Amazon platforms return 0 and will want us to not touch the existing setup. However, does 1 means "business as usual" or does it mean "reassign everything" ? Does 0 means "probe only" ? Or do we still do an assignment pass for things that the FW may have left unassigned ? Today in Linux, the "probe only" logic tends to not call pci_assign_unassigned_resources for example. From a pure reading of the spec, there's an argument to be made that both 0 and 1 values can lead to the same code that reads what's there and reassign what's missing. So this is a mess, a usual when it comes to specs written by committees, but at this stage I'm at a loss as to what you want me to do. Cheers, Ben.
On Fri, 14 Jun 2019 at 01:07, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > On Thu, 2019-06-13 at 14:02 -0500, Bjorn Helgaas wrote: > > > > PCI FW r3.2 says 0 means "the OS must not ignore config done by > > firmware." That means we must keep the FW configuration intact. > > So I tried implementing what you seem to want and it doesn't make sense > imho. > > I think the wording of the spec is terrible and doesn't convey the > intent here. > > What is it that _DSM #5 is about ? Is it about telling us that the FW > config shall not be trusted ? That seem to be its original intent based > on the existing wording and the fact that "1" says "may ignore". > > Or was it always intended to be some kind of inverted logic with 0 > meaning that we *must* preserve what FW did ? > The original purpose was for firmware running on 64-bit hosts to not create a PCI resource assignment that was incompatible with the OS booting in 32-bit mode. So the expectation was that a 32-bit OS would reuse whatever config the firmware created, and the 64-bit version would be enlightened to understand that it could reassign resource assignments to make better use of the available resource windows, but this required a mechanism to describe which resources should be left alone by the OS. So I don't think anybody cares about that use case anymore, and I have no idea how widespread its use was when people did. > But preserving what FW did was always the default for x86 and > Windows... It's just that we happen to do something wrong today on > Linux/ARM64 which is to always reassign everything. > > The way Linux resource assignment works accross platforms has generally > been based on one of these 3 methods: > > - The standard x86 method, which is to claim what's there when it > doesn't look completely busted and has been assigned, then assign > what's left. This allows for FW doing partial assignment, and allows to > work around a number of BIOS bugs. > > - The "probe only" method. This was created independently on powerpc > and some other archs afaik. At least for powerpc, the reason for that > is some interesting virtualization cases where we just cannot touch or > change or move anything. The effect is to not reassign even what we > dont like, and not call pci_assign_unassign_resources(). > > - The "reassign everything" method. This is used by almost all > embedded patforms accross archs. All arm32, all arm64 today (but we > agree that's wrong), all embedded powerpc etc... This is basically > meant for us not trusting whatever random uboot or other embedded FW, > if any, and do a full from-scratch assignment. There are issues in how > that is implemented accross the various platforms/archs, some for > example still honor existing bus numbers and some don't, but I doubt > it's intentional etc... but that method is there to stay. > > Now, the questions are two fold > > - How do we map _DSM #5 to these, at least on arm64, maybe in the > long run we can also look at affecting x86, but that's less urgent. > > - How do I ensure the above fixes my Amazon platform ? :-) > It would help if you could explain what exactly is wrong with your Amazon platform :-) > There's one obvious thing here. If we don't want to break existing > things, then the absence of _DSM #5 must not change our existing > behaviour. I think we can all agree on this. > > We might explore changing arm64 default behaviour as a second step > since we all agree it's not doing what it should, but we also know that > it will probably break some things so we need to be careful, understand > the issues and work around them. This isn't the scope of the initial > _DSM #5 patch. > > That leaves us with the _DSM #5 present cases. > > Now we have two values. What do they mean ? As I already said before, > the wording with "must not ignore" and "may ignore" is completely > useless and doesn't tell us a thing about the intention here. We don't > know why the FW folks may have put a given value here, and what they > expect us to do about it. > > What we do know is that Seattle returns 1 and needs reassignment, and > we do know that the Amazon platforms return 0 and will want us to not > touch the existing setup. > > However, does 1 means "business as usual" or does it mean "reassign > everything" ? > > Does 0 means "probe only" ? Or do we still do an assignment pass for > things that the FW may have left unassigned ? > > Today in Linux, the "probe only" logic tends to not call > pci_assign_unassigned_resources for example. > > From a pure reading of the spec, there's an argument to be made that > both 0 and 1 values can lead to the same code that reads what's there > and reassign what's missing. > > So this is a mess, a usual when it comes to specs written by > committees, but at this stage I'm at a loss as to what you want me to > do. >
On Fri, 2019-06-14 at 09:42 +0200, Ard Biesheuvel wrote: > The original purpose was for firmware running on 64-bit hosts to not > create a PCI resource assignment that was incompatible with the OS > booting in 32-bit mode. > > So the expectation was that a 32-bit OS would reuse whatever config > the firmware created, and the 64-bit version would be enlightened to > understand that it could reassign resource assignments to make better > use of the available resource windows, but this required a mechanism > to describe which resources should be left alone by the OS. > > So I don't think anybody cares about that use case anymore, and I have > no idea how widespread its use was when people did. Ok. At least my thinkpad happily assigns stuff in 64-bit space. AFAIK even 32-bit distros can deal with it with PAE no ? > > - The "probe only" method. This was created independently on powerpc > > and some other archs afaik. At least for powerpc, the reason for that > > is some interesting virtualization cases where we just cannot touch or > > change or move anything. The effect is to not reassign even what we > > dont like, and not call pci_assign_unassign_resources(). > > > > - The "reassign everything" method. This is used by almost all > > embedded patforms accross archs. All arm32, all arm64 today (but we > > agree that's wrong), all embedded powerpc etc... This is basically > > meant for us not trusting whatever random uboot or other embedded FW, > > if any, and do a full from-scratch assignment. There are issues in how > > that is implemented accross the various platforms/archs, some for > > example still honor existing bus numbers and some don't, but I doubt > > it's intentional etc... but that method is there to stay. > > > > Now, the questions are two fold > > > > - How do we map _DSM #5 to these, at least on arm64, maybe in the > > long run we can also look at affecting x86, but that's less urgent. > > > > - How do I ensure the above fixes my Amazon platform ? :-) > > > > It would help if you could explain what exactly is wrong with your > Amazon platform :-) Linux can't change the switch configuration. I may have mentioned earlier it has to do with platform sec policies. But that's not totally relevant, we shoudn't be changing resources anyway since in theory runtime FW might rely on where some system devices were assigned at boot. EFI fb is another example of that. The biggest issue for me right now is that the spec says pretty much at _DSM #5 = 0 is equivalent to _DSM #5 absent, and Bjorn seems keen on having it this way, but for arm64, we specifically want to distinguish those 2 cases. We want to honor _DSM #5 = 0, and at least initially, leave the rest alone. Now, we *also* want to look at switching the rest to the "normal" (for ACPI platforms at least) mechanism of using what FW setup and fixing up if necessary, but that's not what the code does today, we know just switching to pci_bus_claim_resources() will break some platforms, and we need more testing and possibly quirks to get there, so it's material for a separate patch. But in the meantime, I need to differenciate. Also using "probe_only" for _DSM #5 = 0 isn't a good idea, at least as implemented today in the rest of the kernel, probe_only also means we shouldn't assign what was left unassigned. However _DSM #5 allows this. So we'll need to find some more subtle way to convey these. Bjorn: At this point, because of all the above, I'm keen on going back to my original patch (slightly modified Ard's patch), possibly rewording a thing or two and addressing Lorenzo comment. We can look at a better and more generic implementation of _DSM #5 including for child nodes after I have consolidated more of the resource management code. Looking at the spec (and followup discussions for specs updates), I'm quite keen on treating _DSM #5 = 1 as "wipe out the resource for that endpoint/bridge and realloc something better. There are reasons for that, but we can probably discuss that later. Cheers, Ben.
On Fri, Jun 14, 2019 at 06:36:32PM +1000, Benjamin Herrenschmidt wrote: [...] > The biggest issue for me right now is that the spec says pretty much at > _DSM #5 = 0 is equivalent to _DSM #5 absent, and Bjorn seems keen on > having it this way, but for arm64, we specifically want to distinguish > those 2 cases. > > We want to honor _DSM #5 = 0, and at least initially, leave the rest > alone. > > Now, we *also* want to look at switching the rest to the "normal" (for > ACPI platforms at least) mechanism of using what FW setup and fixing up > if necessary, but that's not what the code does today, we know just > switching to pci_bus_claim_resources() will break some platforms, and > we need more testing and possibly quirks to get there, so it's material > for a separate patch. > > But in the meantime, I need to differenciate. > > Also using "probe_only" for _DSM #5 = 0 isn't a good idea, at least as > implemented today in the rest of the kernel, probe_only also means we > shouldn't assign what was left unassigned. However _DSM #5 allows this. I am not sure about this. PCI_PROBE_ONLY cannot stop an OS from reassigning BARs that are clearly misconfigured, it does not make any sense. It can't stop an OS from writing those BARs anyway, since they must be sized, why firmware would prevent an OS from reassigning BARs that are programmed with values that can be deemed 100% bogus ? Or put it differently, why must an OS preserve those values willy-nilly ? For me, PCI_PROBE_ONLY and _DSM == 0 on a host bridge must be considered equivalent. I agree with Bjorn on his reading of _DSM #5 and I think that the original patch that claims on _DSM #5 == 0 is a good starting point. I would like to make it a default even without _DSM #5 == 0 so that claim and reassign on claim failure works irrespective of _DSM #5, it is now or never, I think we can give it a shot, with an incremental patch. Lorenzo > So we'll need to find some more subtle way to convey these. > > Bjorn: At this point, because of all the above, I'm keen on going back > to my original patch (slightly modified Ard's patch), possibly > rewording a thing or two and addressing Lorenzo comment. > > We can look at a better and more generic implementation of _DSM #5 > including for child nodes after I have consolidated more of the > resource management code. > > Looking at the spec (and followup discussions for specs updates), I'm > quite keen on treating _DSM #5 = 1 as "wipe out the resource for that > endpoint/bridge and realloc something better. There are reasons for > that, but we can probably discuss that later. > > Cheers, > Ben. > >
On Fri, 2019-06-14 at 10:57 +0100, Lorenzo Pieralisi wrote: > > Also using "probe_only" for _DSM #5 = 0 isn't a good idea, at least as > > implemented today in the rest of the kernel, probe_only also means we > > shouldn't assign what was left unassigned. However _DSM #5 allows this. > > I am not sure about this. PCI_PROBE_ONLY cannot stop an OS from > reassigning BARs that are clearly misconfigured, it does not make > any sense. PCI_PROBE_ONLY is a linux thing which, as implemented today, implies no assignment at all. I believe it originates as a merge of variants of the same thing, at least one of them being one I created for powerpc back in the days due to our proprietary hypervisor not letting you touch any of the PCI config space. If a device looks broken, disable it, don't use it, but don't reassign either. At least that the semantics we have today. And as such they don't match _DSM #5 = 0. > It can't stop an OS from writing those BARs anyway, > since they must be sized, why firmware would prevent an OS from > reassigning BARs that are programmed with values that can be > deemed 100% bogus ? Or put it differently, why must an OS preserve > those values willy-nilly ? Don't ask me ... IBM firmware :-) At least that was the idea back then. That said I suppose some platforms may also have set that flag to indicate that they aren't sure what other "ghost" things might be in the address space, ie Linux doesnt have a clear view of what's free to allocate devices to for example. > For me, PCI_PROBE_ONLY and _DSM == 0 on a host bridge must be considered > equivalent. Well, that's not what PCI_PROBE_ONLY is today in Linux. It might be what you would like it to be but it's not what it is :-) And I'd like to avoid making arm64 different than everybody else here because I want to consolidate things. Fundamentally, is what _DSM #5 == 0 does any different from our standard (not PROBE_ONLY) mode of operation on server platforms anyway ? Ie, we read what's there, and we leave it alone unless it's broken or unassigned ? This is precisely the definition of _DSM #5 == 0 no ? PROBE_ONLY is .. something else. > I agree with Bjorn on his reading of _DSM #5 and I think that > the original patch that claims on _DSM #5 == 0 is a good > starting point. The original patch is a good starting point, we agree. The only point of disagreement with Bjorn at this stage is what the "default" is in absence of _DSM #5. The spec says it should be the same as _DSM #5 == 0, but we know today it will introduce a much wider ranging change to arm64 to treat it that way. At the very least, changing the default should be a different patch. > I would like to make it a default even without > _DSM #5 == 0 so that claim and reassign on claim failure works > irrespective of _DSM #5, it is now or never, I think we can give > it a shot, with an incremental patch. We should. In fact, I was thinking about it on the way home tonight and was going to ask you and Ard to try this out and send me the debug level log output of anything that looks wrong on any platform: diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c index bb85e2f4603f..0af1f1b4e4d8 100644 --- a/arch/arm64/kernel/pci.c +++ b/arch/arm64/kernel/pci.c @@ -193,8 +193,8 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root) if (!bus) return NULL; - pci_bus_size_bridges(bus); - pci_bus_assign_resources(bus); + pci_bus_claim_resources(bus); + pci_assign_unassigned_root_bus_resources(bus); list_for_each_entry(child, &bus->children, node) pcie_bus_configure_settings(child); Cheers, Ben.
On Fri, 2019-06-14 at 20:36 +1000, Benjamin Herrenschmidt wrote: > On Fri, 2019-06-14 at 10:57 +0100, Lorenzo Pieralisi wrote: > > > Also using "probe_only" for _DSM #5 = 0 isn't a good idea, at least as > > > implemented today in the rest of the kernel, probe_only also means we > > > shouldn't assign what was left unassigned. However _DSM #5 allows this. > > > > I am not sure about this. PCI_PROBE_ONLY cannot stop an OS from > > reassigning BARs that are clearly misconfigured, it does not make > > any sense. > > PCI_PROBE_ONLY is a linux thing which, as implemented today, implies no > assignment at all. I believe it originates as a merge of variants of > the same thing, at least one of them being one I created for powerpc > back in the days due to our proprietary hypervisor not letting you > touch any of the PCI config space. Well... you could "touch" things and even BAR size but mostly we don't even do that on powerpc on these systems these days, we read the BAR values (and a bunch of other things) from OFW and manufacture the pci_dev. The generic code still use cfg space to fill up the blanks. sparc64 uses the same technique. This least to another conversation we hinted at earlier.. we should probably have a way to do the same at least for BARs on ACPI systems so we don't have to temporarily disable access to a device to size them. This can be problematic is the device in question need to be used during the sizing. It can happen with some system devices used behind the scene by FW, or the device on which the console is (how many time did I crash bcs I had too verbose printk's in the PCI code during BAR sizing of the framebuffer or the serial card...) Cheers, Ben.
On Fri, Jun 14, 2019 at 06:36:32PM +1000, Benjamin Herrenschmidt wrote: > Linux can't change the switch configuration. I may have mentioned > earlier it has to do with platform sec policies. But that's not totally > relevant, we shoudn't be changing resources anyway since in theory > runtime FW might rely on where some system devices were assigned at > boot. EFI fb is another example of that. "We shouldn't be changing resources anyway" is not really useful guidance. I completely agree that we shouldn't change things *unnecessarily*, but it's up to the OS to define what makes it necessary -- it might be for rebalancing for hotplug, to make space for SR-IOV, to respect user requests to increase alignment, etc. IMO the real value of _DSM #5 is as a mechanism to advertise dependencies runtime firmware has on devices, e.g., SMM firmware might want to log errors to a PCI remote management device. If the OS moved that managment device, the SMM logging would itself cause errors. > The biggest issue for me right now is that the spec says pretty much at > _DSM #5 = 0 is equivalent to _DSM #5 absent, and Bjorn seems keen on > having it this way, but for arm64, we specifically want to distinguish > those 2 cases. Nope, my opinion is exactly the opposite. Sorry that I'm not communicating this clearly. It's true that the r3.2 spec *does* say _DSM #5 = 0 is equivalent to the situation where it's absent, but that's based on the assumption that the OS is never allowed to change PCI configuration. I think that assumption is complete nonsense and should be disregarded. _DSM #5 = 0: OS must preserve this device's BARs (current spec says "OS must not ignore") _DSM #5 = 1: OS *may* change this device's BARs (current spec says "OS may ignore") Other than _DSM #5, there's no spec I'm aware of that restricts the OS's ability to change BARs. Therefore I think "_DSM #5 = 1" is equivalent to _DSM #5 being absent, and in both cases the OS is free to change BARs as it sees fit. > Looking at the spec (and followup discussions for specs updates), I'm > quite keen on treating _DSM #5 = 1 as "wipe out the resource for that > endpoint/bridge and realloc something better. There are reasons for > that, but we can probably discuss that later. I disagree on the "wipe out all resources" part of this because we have no idea how to realloc something better. We should of course look for problems (overlapping devices, etc) and fix them. But starting from scratch and reallocating won't reliably produce anything different from the original, supposedly broken, configuration. Bjorn
On Fri, Jun 14, 2019 at 08:43:19PM +1000, Benjamin Herrenschmidt wrote: > This least to another conversation we hinted at earlier.. we should > probably have a way to do the same at least for BARs on ACPI systems so > we don't have to temporarily disable access to a device to size them. The PCI Enhanced Allocation capability provides a way to do this. I don't know how widely used it is, but it's theoretically possible.
On Fri, 2019-06-14 at 08:09 -0500, Bjorn Helgaas wrote: > On Fri, Jun 14, 2019 at 06:36:32PM +1000, Benjamin Herrenschmidt wrote: > > Linux can't change the switch configuration. I may have mentioned > > earlier it has to do with platform sec policies. But that's not totally > > relevant, we shoudn't be changing resources anyway since in theory > > runtime FW might rely on where some system devices were assigned at > > boot. EFI fb is another example of that. > > "We shouldn't be changing resources anyway" is not really useful > guidance. I completely agree that we shouldn't change things > *unnecessarily*, but it's up to the OS to define what makes it > necessary -- it might be for rebalancing for hotplug, to make space > for SR-IOV, to respect user requests to increase alignment, etc. > > IMO the real value of _DSM #5 is as a mechanism to advertise > dependencies runtime firmware has on devices, e.g., SMM firmware might > want to log errors to a PCI remote management device. If the OS moved > that managment device, the SMM logging would itself cause errors. This I agree with, though that wasn't specifically why it was created but this is where it's going yes. However, I think the resource management code in Linux is some way from being able to handle that at device granularity. It's actually a very difficult problem to solve in an optimal way. For example you could have a bridge that can be moved but a child of that bridge is fixed. That means that now, the bridge must only be moved within the constraints that it still contains that child. This in turn is very tricky to properly define because the bridge can have other children who aren't fixed, but whose alignment constraints will also limit where the bridge can go. I don't think we want to practically create such a mechanism, we have to be realistic. So that means if something needs to be fixed, all of its parents need too. Due to how the spec is written, if that _DSM #5 == 0 object is a bridge, then both all of its parents and all of its descendants are now fixed (except hotplugged descendants), unless said descendants are themselves tagged with _DSM #5 == 1. But then, as I explain below, this does more than just "cancel" _DSM #5 == 0 ... what a mess :-) That said, I have been thinking about ways we could better honor that _DSM #5 at various levels of the tree, and I think it's doable. There's going to still be some arguments to sort out about policy details as above, and we might end up trying to get the SIG to clarify things further but I think we can get somewhere reasonably sane in practice once we've changed arm64 to use an x86-style allocation rather than trying to do everything from scratch. Working on it ... > > The biggest issue for me right now is that the spec says pretty much at > > _DSM #5 = 0 is equivalent to _DSM #5 absent, and Bjorn seems keen on > > having it this way, but for arm64, we specifically want to distinguish > > those 2 cases. > > Nope, my opinion is exactly the opposite. Sorry that I'm not > communicating this clearly. > > It's true that the r3.2 spec *does* say _DSM #5 = 0 is equivalent to > the situation where it's absent, but that's based on the assumption > that the OS is never allowed to change PCI configuration. I think > that assumption is complete nonsense and should be disregarded. > > _DSM #5 = 0: OS must preserve this device's BARs > (current spec says "OS must not ignore") > > _DSM #5 = 1: OS *may* change this device's BARs > (current spec says "OS may ignore") > > Other than _DSM #5, there's no spec I'm aware of that restricts the > OS's ability to change BARs. Therefore I think "_DSM #5 = 1" is > equivalent to _DSM #5 being absent, and in both cases the OS is free > to change BARs as it sees fit. I think we are conflating too many different things together here, it's not binary. If you look at the reasons why DSM#5 was created in the first place, it had to do with firmwares setting up a 32-bit compatible layout which is suboptimal for 64-bit, and thus marking with DSM #1 things that could (and probably should) be reallocated elsewhere. Based on that pattern, it's tempting (and I'm tempted...) to handle _DSM #5 == 1, at least at a non-root-bridge level, by wiping the resources (or setting IORESOURCE_UNSET), to force Linux to create a potentially more optimal allocation. However, that shouldn't be our default either :-) So my personal opinion is that DSM #5 0, 1 and absent are actually 3 different things. > > Looking at the spec (and followup discussions for specs updates), I'm > > quite keen on treating _DSM #5 = 1 as "wipe out the resource for that > > endpoint/bridge and realloc something better. There are reasons for > > that, but we can probably discuss that later. > > I disagree on the "wipe out all resources" part of this because we > have no idea how to realloc something better. We should of course > look for problems (overlapping devices, etc) and fix them. But > starting from scratch and reallocating won't reliably produce anything > different from the original, supposedly broken, configuration. Well, again, it depends *why* _DSM #5 was set to 1 in the first place. As I said above, there are actually more than 2 cases here and _DSM is trying to convey too much with too little information. We lack the firmware intent, it's not being conveyed well. From my understanding of the spec history, _DSM #5 == 1 was specifically created because the FW knew it was setting up some resources in a sub optimal way. The 32-bit example is spelled out. I can imagine others, such as FW whacking a fixed BAR value on a boot device to get going knowing full well it's sub optimal. Now unless we start building some pretty serious smarts into our allocation scheme to "decide" whether a valid resource could be done better elsewhere or not, which we aren't near being able to do I suspect, that case will not work *unless* we just wipe such resources and let Linux do what it thinks is best with it. All those interpretations are valid on either the current or the "proposed" spec, and it still leaves us in a bind because there is not enough intent being conveyed. It needs to be more than a binary choice ideally. That said, I'm not in a hurry to deal with the _DSM #5 == 1 case, so we can defer that conversation. Maybe we can create a mechanism where we do a "dummy run" of our allocation code on a snapshot of resources with the assumption that _DSM #5 == 1 ones get wiped, and create some metrics to decide whether the end result is "better" than what's already there, and based on that chose what to apply, but it sounds really really messy. Cheers, Ben.
On Fri, 2019-06-14 at 08:12 -0500, Bjorn Helgaas wrote: > On Fri, Jun 14, 2019 at 08:43:19PM +1000, Benjamin Herrenschmidt > wrote: > > > This least to another conversation we hinted at earlier.. we should > > probably have a way to do the same at least for BARs on ACPI > > systems so > > we don't have to temporarily disable access to a device to size > > them. > > The PCI Enhanced Allocation capability provides a way to do this. I > don't know how widely used it is, but it's theoretically possible. Ok, I have to read about this. I haven't seen a device with that on yet, it looks messy at a quick glance. Can ACPI convey the information ? On powerpc and sparc64 we have ways to read the BAR values from the device-tree created by OF when it assigned them. Cheers, Ben.
On Fri, Jun 14, 2019 at 11:48:18PM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2019-06-14 at 08:12 -0500, Bjorn Helgaas wrote: > > On Fri, Jun 14, 2019 at 08:43:19PM +1000, Benjamin Herrenschmidt > > wrote: > > > > > This least to another conversation we hinted at earlier.. we should > > > probably have a way to do the same at least for BARs on ACPI > > > systems so > > > we don't have to temporarily disable access to a device to size > > > them. > > > > The PCI Enhanced Allocation capability provides a way to do this. I > > don't know how widely used it is, but it's theoretically possible. > > Ok, I have to read about this. I haven't seen a device with that on > yet, it looks messy at a quick glance. > > Can ACPI convey the information ? On powerpc and sparc64 we have ways > to read the BAR values from the device-tree created by OF when it > assigned them. I agree, EA is messy. I don't think it's feasible to do this in ACPI. It's a pretty fundamental principle of PCI that you can discover what resources a device needs and uses by looking at its config space. In general PCIe requires ECAM, which gives the OS direct access to config space, although it does allow exceptions for architecture-specific firmware interfaces for accessing config space, e.g., ia64 SAL (PCIe r4.0, sec 7.2.2). Bjorn
On Fri, 2019-06-14 at 15:13 -0500, Bjorn Helgaas wrote: > Ok, I have to read about this. I haven't seen a device with that on > > yet, it looks messy at a quick glance. > > > > Can ACPI convey the information ? On powerpc and sparc64 we have ways > > to read the BAR values from the device-tree created by OF when it > > assigned them. > > I agree, EA is messy. > > I don't think it's feasible to do this in ACPI. It's a pretty > fundamental principle of PCI that you can discover what resources a > device needs and uses by looking at its config space. In general PCIe > requires ECAM, which gives the OS direct access to config space, > although it does allow exceptions for architecture-specific firmware > interfaces for accessing config space, e.g., ia64 SAL (PCIe r4.0, sec > 7.2.2). So this isn't something I need, but if others do, we can find a reasonable compromise here and push it to the spec. It's actually fairly easy: If a device is used by FW (SMM, SMCCC or whatever other runtime thingy) to the extent that temporarily disabling it for BAR sizing can cause random boot failures (if the wrong event happens at the wrong time), it would be easy for FW to "mark" that device as such (_DSM #5 == 2 ? just kidding...) and provide some forms of properties/datas that expose the resources that were assigned. On OF, the properties for that already exist, so just adding something like "no-bar-sizing" or such in the node for the device would do. It's easy because FW only has to "represent" endpoints that have such properties and leave everything else to the OS. There is no need to mandate a full representation of all PCI devices. There are a few details to be careful of, for example, if any bridge in the parent chain of such an endpoint has BARs (not windows, actual BARs), then they should have those properties too, otherwise sizing them will temporarily disable the path to the device since BAR sizing should be done with memory/io decode off. But otherwise, it's a pretty trivial thing to specify and implement I suspect. A lot easier than requiring HW to implement EA is my gut feeling :-) As I said, I don't have a pressing need for that now (could have come in handy back in the powermac days ... oh well). But if enough people do, I'm happy to help ironing something out. Cheers, Ben
diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c index bb85e2f4603f..b209a506f390 100644 --- a/arch/arm64/kernel/pci.c +++ b/arch/arm64/kernel/pci.c @@ -168,6 +168,7 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root) struct acpi_pci_generic_root_info *ri; struct pci_bus *bus, *child; struct acpi_pci_root_ops *root_ops; + struct pci_host_bridge *hb; ri = kzalloc(sizeof(*ri), GFP_KERNEL); if (!ri) @@ -193,8 +194,15 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root) if (!bus) return NULL; - pci_bus_size_bridges(bus); - pci_bus_assign_resources(bus); + hb = pci_find_host_bridge(bus); + + /* If the policy is normal or probe only, claim existing resources */ + if (hb->rsrc_policy != pci_rsrc_reassign) + pci_bus_claim_resources(bus); + + /* If the policy is not probe only, assign what's left unassigned */ + if (hb->rsrc_policy != pci_rsrc_probe_only) + pci_assign_unassigned_root_bus_resources(bus); list_for_each_entry(child, &bus->children, node) pcie_bus_configure_settings(child); diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c index 39f5d172e84f..410f7f2b2587 100644 --- a/drivers/acpi/pci_root.c +++ b/drivers/acpi/pci_root.c @@ -881,6 +881,7 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, int node = acpi_get_node(device->handle); struct pci_bus *bus; struct pci_host_bridge *host_bridge; + union acpi_object *obj; info->root = root; info->bridge = device; @@ -917,6 +918,47 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root, if (!(root->osc_control_set & OSC_PCI_EXPRESS_LTR_CONTROL)) host_bridge->native_ltr = 0; + /* + * Invoke the PCI device specific method (_DSM) #5 'Ignore PCI Boot + * Configuration', on the host bridge. This tells us whether the + * firmware wants us to preserve or reassign the configuration of + * the PCI resource tree for this root bridge. + * + * There are three possible outcomes here: + * + * - _DSM #5 is absent. This is the default. Currently it will be + * architecture specific in order to maintain existing behaviours + * but the plan is to move arm64 into the fold: x86 and ia64 will + * claim the existing config, and reassign if needed. arm64 will + * always reassign. + * + * - _DSM #5 exists and is 1. This is the FW telling us to ignore + * the configuration it performed. This is currently only supported + * on arm64. + * + * - _DSM #5 exists and is 0. This should be the same as the default + * (_DSM #5 absent). However there are some assumptions flying around + * that this means we must keep the FW configuration intact. So we + * treat that as "probe only" for the time being. This is currently + * only supported on arm64. + */ + obj = acpi_evaluate_dsm(ACPI_HANDLE(bus->bridge), &pci_acpi_dsm_guid, 1, + IGNORE_PCI_BOOT_CONFIG_DSM, NULL); + if (obj && obj->type == ACPI_TYPE_INTEGER) { + if (obj->integer.value == 1) + host_bridge->rsrc_policy = pci_rsrc_reassign; + else + host_bridge->rsrc_policy = pci_rsrc_probe_only; + } else { + /* Default is arch specific ... for now */ +#ifdef CONFIG_ARM64 + host_bridge->rsrc_policy = pci_rsrc_reassign; +#else + host_bridge->rsrc_policy = pci_rsrc_normal; +#endif + } + ACPI_FREE(obj); + pci_scan_child_bus(bus); pci_set_host_bridge_release(host_bridge, acpi_pci_root_release_info, info); diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h index 8082b612f561..62b7fdcc661c 100644 --- a/include/linux/pci-acpi.h +++ b/include/linux/pci-acpi.h @@ -107,9 +107,10 @@ static inline void acpiphp_check_host_bridge(struct acpi_device *adev) { } #endif extern const guid_t pci_acpi_dsm_guid; -#define DEVICE_LABEL_DSM 0x07 -#define RESET_DELAY_DSM 0x08 -#define FUNCTION_DELAY_DSM 0x09 +#define IGNORE_PCI_BOOT_CONFIG_DSM 0x05 +#define DEVICE_LABEL_DSM 0x07 +#define RESET_DELAY_DSM 0x08 +#define FUNCTION_DELAY_DSM 0x09 #else /* CONFIG_ACPI */ static inline void acpi_pci_add_bus(struct pci_bus *bus) { } diff --git a/include/linux/pci.h b/include/linux/pci.h index dd436da7eccc..7ff5cedb30cf 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -486,6 +486,12 @@ static inline int pci_channel_offline(struct pci_dev *pdev) return (pdev->error_state != pci_channel_io_normal); } +enum pci_host_rsrc_policy { + pci_rsrc_normal, /* Probe and (re)assign what's missing/broken */ + pci_rsrc_probe_only, /* Probe only */ + pci_rsrc_reassign, /* Reassign resources */ +}; + struct pci_host_bridge { struct device dev; struct pci_bus *bus; /* Root bus */ @@ -506,6 +512,10 @@ struct pci_host_bridge { unsigned int native_shpc_hotplug:1; /* OS may use SHPC hotplug */ unsigned int native_pme:1; /* OS may use PCIe PME */ unsigned int native_ltr:1; /* OS may use PCIe LTR */ + + /* Resource assignment/allocation policy */ + enum pci_host_rsrc_policy rsrc_policy; + /* Resource alignment requirements */ resource_size_t (*align_resource)(struct pci_dev *dev, const struct resource *res,
The current arm64 PCI code for ACPI platforms will unconditionally reassign all resources. This is not only suboptimal, it's also wrong for a number of cases, for example, this could invalidate a UEFI framebuffer address, or a runtime firmware could be using some of the devices in their original location. There is an ACPI method defined today for P2P bridges (_DSM #5) that can indicate that a bridge resources set by firmware. There is current discussions to extend that method to cover host bridges, and define a value of "0" as meaning that the resources must be preserved. This patch adds the resource assignment policy to struct pci_host_bridge and sets it based on the presence of that method and if present the value returned, and honors it on arm64. No other architectures are currently affected, and the default is kept to "reassign everything" on arm64 for now via an #ifdef, though we do plan to get rid of that in a separate patch. The setting in pci_host_bridge "looks" generic because I intend in subsquent work to consolidate the resource allocation policy accross architectures and I intend for that setting to be the canonical location used by the generic code to decide what to do. This is based on some earlier work by Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> --- arch/arm64/kernel/pci.c | 12 ++++++++++-- drivers/acpi/pci_root.c | 42 ++++++++++++++++++++++++++++++++++++++++ include/linux/pci-acpi.h | 7 ++++--- include/linux/pci.h | 10 ++++++++++ 4 files changed, 66 insertions(+), 5 deletions(-)