[1/1] pci: fix dmar fault for kdump kernel

Message ID	1412925191-27970-1-git-send-email-zhen-hual@hp.com (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
Headers	show Return-Path: <linux-pci-owner@kernel.org> From: "Li, Zhen-Hua" <zhen-hual@hp.com> To: Bjorn Helgaas <bhelgaas@google.com>, <linux-pci@vger.kernel.org>, <linux-kernel@vger.kernel.org> Cc: <linda.knippers@hp.com>, <jerry.hoemann@hp.com>, <lisa.mitchell@hp.com>, <alexander.duyck@gmail.com>, "Li, Zhen-Hua" <zhen-hual@hp.com>, Takao Indoh <indou.takao@jp.fujitsu.com>, Randy Wright <rwright@hp.com> Subject: [PATCH 1/1] pci: fix dmar fault for kdump kernel Date: Fri, 10 Oct 2014 15:13:11 +0800 Message-Id: <1412925191-27970-1-git-send-email-zhen-hual@hp.com> Sender: linux-pci-owner@vger.kernel.org Precedence: bulk

Li, Zhen-Hua Oct. 10, 2014, 7:13 a.m. UTC

On a HP system with Intel vt-d supported and many PCI devices on it, 
when kernel crashed and the kdump kernel boots with intel_iommu=on, 
there may be some unexpected DMA requests on this adapter, which will 
cause DMA Remapping faults like:
    dmar: DRHD: handling fault status reg 102
    dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
    DMAR:[fault reason 01] Present bit in root entry is clear

This bug may happen on *any* PCI device.
Analysis for this bug:

The present bit is set in this function:

static struct context_entry * device_to_context_entry(
                struct intel_iommu *iommu, u8 bus, u8 devfn)
{
    ......
                set_root_present(root);
    ......
}

Calling tree:
    device driver
        intel_alloc_coherent
            __intel_map_single
                domain_context_mapping
                    domain_context_mapping_one
                        device_to_context_entry

This means, the present bit in root entry will not be set until the device 
driver is loaded.

But in the kdump kernel, hardware devices are not aware that control has 
transferred to the second kernel, and those drivers must initialize again. 
Consequently there may be unexpected DMA requests from devices activity
initiated in the first kernel leading to the DMA Remapping errors in the 
second kernel.

To fix this DMAR fault, we need to reset the bus that this device on. Reset 
the device itself does not work.

A patch for this bug that has been sent before:
https://lkml.org/lkml/2014/9/30/55
As in discussion, this bug may happen on *any* device, so we need to reset all 
pci devices.

There was an original version(Takao Indoh) that resets the pcie devices:
https://lkml.org/lkml/2013/5/14/9

Update of this new version, comparing with Takao Indoh's version:
    Add support for legacy PCI devices.
    Use pci_try_reset_bus instead of do_downstream_device_reset in original version

Randy Wright corrects some misunderstanding in this description.

Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
Signed-off-by: Randy Wright <rwright@hp.com>
---
 drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

Li, Zhen-Hua Oct. 14, 2014, 9:34 a.m. UTC | #1

I tested on the latest stable version 3.17, it works well.

On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
> On a HP system with Intel vt-d supported and many PCI devices on it,
> when kernel crashed and the kdump kernel boots with intel_iommu=on,
> there may be some unexpected DMA requests on this adapter, which will
> cause DMA Remapping faults like:
>      dmar: DRHD: handling fault status reg 102
>      dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>      DMAR:[fault reason 01] Present bit in root entry is clear
>
> This bug may happen on *any* PCI device.
> Analysis for this bug:
>
> The present bit is set in this function:
>
> static struct context_entry * device_to_context_entry(
>                  struct intel_iommu *iommu, u8 bus, u8 devfn)
> {
>      ......
>                  set_root_present(root);
>      ......
> }
>
> Calling tree:
>      device driver
>          intel_alloc_coherent
>              __intel_map_single
>                  domain_context_mapping
>                      domain_context_mapping_one
>                          device_to_context_entry
>
> This means, the present bit in root entry will not be set until the device
> driver is loaded.
>
> But in the kdump kernel, hardware devices are not aware that control has
> transferred to the second kernel, and those drivers must initialize again.
> Consequently there may be unexpected DMA requests from devices activity
> initiated in the first kernel leading to the DMA Remapping errors in the
> second kernel.
>
> To fix this DMAR fault, we need to reset the bus that this device on. Reset
> the device itself does not work.
>
> A patch for this bug that has been sent before:
> https://lkml.org/lkml/2014/9/30/55
> As in discussion, this bug may happen on *any* device, so we need to reset all
> pci devices.
>
> There was an original version(Takao Indoh) that resets the pcie devices:
> https://lkml.org/lkml/2013/5/14/9
>
> Update of this new version, comparing with Takao Indoh's version:
>      Add support for legacy PCI devices.
>      Use pci_try_reset_bus instead of do_downstream_device_reset in original version
>
> Randy Wright corrects some misunderstanding in this description.
>
> Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
> Signed-off-by: Randy Wright <rwright@hp.com>
> ---
>   drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 84 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 2c9ac70..8cb146c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -23,6 +23,7 @@
>   #include <linux/device.h>
>   #include <linux/pm_runtime.h>
>   #include <linux/pci_hotplug.h>
> +#include <linux/crash_dump.h>
>   #include <asm-generic/pci-bridge.h>
>   #include <asm/setup.h>
>   #include "pci.h"
> @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>   }
>   EXPORT_SYMBOL(pci_fixup_cardbus);
>
> +/*
> + * Return true if dev is PCI root port or downstream port whose child is PCI
> + * endpoint except VGA device.
> + */
> +static int __pci_dev_need_reset(struct pci_dev *dev)
> +{
> +	struct pci_bus *subordinate;
> +	struct pci_dev *child;
> +
> +	if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
> +		return 0;
> +
> +	if (pci_is_pcie(dev)) {
> +		if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +		    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> +			return 0;
> +	}
> +
> +	subordinate = dev->subordinate;
> +	list_for_each_entry(child, &subordinate->devices, bus_list) {
> +		/* Don't reset switch, bridge, VGA device */
> +		if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
> +		    ((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) ||
> +		    ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +			return 0;
> +
> +		if (pci_is_pcie(child)) {
> +			if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
> +			    (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE))
> +				return 0;
> +		}
> +	}
> +
> +	return 1;
> +}
> +
> +struct pci_dev_reset_entry {
> +	struct list_head list;
> +	struct pci_dev *dev;
> +};
> +int __init pci_reset_endpoints(void)
> +{
> +	struct pci_dev *dev = NULL;
> +	struct pci_dev_reset_entry *pdev_entry, *tmp;
> +	struct pci_bus *subordinate = NULL;
> +	int has_it;
> +
> +	LIST_HEAD(pdev_list);
> +
> +	if (likely(!is_kdump_kernel()))
> +		return 0;
> +
> +	for_each_pci_dev(dev) {
> +		subordinate = dev->subordinate;
> +		if (!subordinate || list_empty(&subordinate->devices))
> +			continue;
> +
> +		has_it = 0;
> +		list_for_each_entry(pdev_entry, &pdev_list, list) {
> +			if (dev == pdev_entry->dev) {
> +				has_it = 1;
> +				break;
> +			}
> +		}
> +		if (has_it)
> +			continue;
> +
> +		if (__pci_dev_need_reset(dev)) {
> +			pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
> +			pdev_entry->dev = dev;
> +			list_add(&pdev_entry->list, &pdev_list);
> +		}
> +	}
> +
> +	list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
> +		pci_try_reset_bus(pdev_entry->dev->subordinate);
> +		kfree(pdev_entry);
> +	}
> +
> +	return 0;
> +}
> +fs_initcall_sync(pci_reset_endpoints);
> +
>   static int __init pci_setup(char *str)
>   {
>   	while (str) {
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh Oct. 15, 2014, 8:14 a.m. UTC | #2

(2014/10/14 18:34), Li, ZhenHua wrote:
> I tested on the latest stable version 3.17, it works well.
> 
> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>> On a HP system with Intel vt-d supported and many PCI devices on it,
>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>> there may be some unexpected DMA requests on this adapter, which will
>> cause DMA Remapping faults like:
>>      dmar: DRHD: handling fault status reg 102
>>      dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>      DMAR:[fault reason 01] Present bit in root entry is clear
>>
>> This bug may happen on *any* PCI device.
>> Analysis for this bug:
>>
>> The present bit is set in this function:
>>
>> static struct context_entry * device_to_context_entry(
>>                  struct intel_iommu *iommu, u8 bus, u8 devfn)
>> {
>>      ......
>>                  set_root_present(root);
>>      ......
>> }
>>
>> Calling tree:
>>      device driver
>>          intel_alloc_coherent
>>              __intel_map_single
>>                  domain_context_mapping
>>                      domain_context_mapping_one
>>                          device_to_context_entry
>>
>> This means, the present bit in root entry will not be set until the device
>> driver is loaded.
>>
>> But in the kdump kernel, hardware devices are not aware that control has
>> transferred to the second kernel, and those drivers must initialize again.
>> Consequently there may be unexpected DMA requests from devices activity
>> initiated in the first kernel leading to the DMA Remapping errors in the
>> second kernel.
>>
>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>> the device itself does not work.
>>
>> A patch for this bug that has been sent before:
>> https://lkml.org/lkml/2014/9/30/55
>> As in discussion, this bug may happen on *any* device, so we need to reset all
>> pci devices.
>>
>> There was an original version(Takao Indoh) that resets the pcie devices:
>> https://lkml.org/lkml/2013/5/14/9

As far as I can remember, the original patch was nacked by
the following reasons:

1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
   so there would still be a window where ongoing DMA could cause an
   IOMMU error.

2) Basically Bjorn is thinking device reset should be done in the
   1st kernel before jumping into 2nd kernel.

And Bill Sumner proposed another idea.
http://comments.gmane.org/gmane.linux.kernel.iommu/4828
I don't know the current status of this patch, but I think Jerry Hoemann
is working on this.

Thanks,
Takao Indoh


>>
>> Update of this new version, comparing with Takao Indoh's version:
>>      Add support for legacy PCI devices.
>>      Use pci_try_reset_bus instead of do_downstream_device_reset in original version
>>
>> Randy Wright corrects some misunderstanding in this description.
>>
>> Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>> Signed-off-by: Randy Wright <rwright@hp.com>
>> ---
>>   drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 84 insertions(+)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 2c9ac70..8cb146c 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -23,6 +23,7 @@
>>   #include <linux/device.h>
>>   #include <linux/pm_runtime.h>
>>   #include <linux/pci_hotplug.h>
>> +#include <linux/crash_dump.h>
>>   #include <asm-generic/pci-bridge.h>
>>   #include <asm/setup.h>
>>   #include "pci.h"
>> @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>   }
>>   EXPORT_SYMBOL(pci_fixup_cardbus);
>>
>> +/*
>> + * Return true if dev is PCI root port or downstream port whose child is PCI
>> + * endpoint except VGA device.
>> + */
>> +static int __pci_dev_need_reset(struct pci_dev *dev)
>> +{
>> +    struct pci_bus *subordinate;
>> +    struct pci_dev *child;
>> +
>> +    if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>> +        return 0;
>> +
>> +    if (pci_is_pcie(dev)) {
>> +        if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> +            (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>> +            return 0;
>> +    }
>> +
>> +    subordinate = dev->subordinate;
>> +    list_for_each_entry(child, &subordinate->devices, bus_list) {
>> +        /* Don't reset switch, bridge, VGA device */
>> +        if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
>> +            ((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) ||
>> +            ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>> +            return 0;
>> +
>> +        if (pci_is_pcie(child)) {
>> +            if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>> +                (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE))
>> +                return 0;
>> +        }
>> +    }
>> +
>> +    return 1;
>> +}
>> +
>> +struct pci_dev_reset_entry {
>> +    struct list_head list;
>> +    struct pci_dev *dev;
>> +};
>> +int __init pci_reset_endpoints(void)
>> +{
>> +    struct pci_dev *dev = NULL;
>> +    struct pci_dev_reset_entry *pdev_entry, *tmp;
>> +    struct pci_bus *subordinate = NULL;
>> +    int has_it;
>> +
>> +    LIST_HEAD(pdev_list);
>> +
>> +    if (likely(!is_kdump_kernel()))
>> +        return 0;
>> +
>> +    for_each_pci_dev(dev) {
>> +        subordinate = dev->subordinate;
>> +        if (!subordinate || list_empty(&subordinate->devices))
>> +            continue;
>> +
>> +        has_it = 0;
>> +        list_for_each_entry(pdev_entry, &pdev_list, list) {
>> +            if (dev == pdev_entry->dev) {
>> +                has_it = 1;
>> +                break;
>> +            }
>> +        }
>> +        if (has_it)
>> +            continue;
>> +
>> +        if (__pci_dev_need_reset(dev)) {
>> +            pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>> +            pdev_entry->dev = dev;
>> +            list_add(&pdev_entry->list, &pdev_list);
>> +        }
>> +    }
>> +
>> +    list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>> +        pci_try_reset_bus(pdev_entry->dev->subordinate);
>> +        kfree(pdev_entry);
>> +    }
>> +
>> +    return 0;
>> +}
>> +fs_initcall_sync(pci_reset_endpoints);
>> +
>>   static int __init pci_setup(char *str)
>>   {
>>       while (str) {
>>
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Li, Zhen-Hua Oct. 15, 2014, 8:31 a.m. UTC | #3

I still think resetting the devices is necessary.  Reading your original
mails ,  I think reason 2 you mentioned, resetting the devices in the
first kernel,  may be a good solution.

Well, I am now working on Bill's patch that you mentioned. As it is not
accepted yet, I will create a new version for latest stable kernel 3.17.

Thanks
Zhenhua


On 10/15/2014 04:14 PM, Takao Indoh wrote:
> (2014/10/14 18:34), Li, ZhenHua wrote:
>> I tested on the latest stable version 3.17, it works well.
>>
>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>>> On a HP system with Intel vt-d supported and many PCI devices on it,
>>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>>> there may be some unexpected DMA requests on this adapter, which will
>>> cause DMA Remapping faults like:
>>>       dmar: DRHD: handling fault status reg 102
>>>       dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>>       DMAR:[fault reason 01] Present bit in root entry is clear
>>>
>>> This bug may happen on *any* PCI device.
>>> Analysis for this bug:
>>>
>>> The present bit is set in this function:
>>>
>>> static struct context_entry * device_to_context_entry(
>>>                   struct intel_iommu *iommu, u8 bus, u8 devfn)
>>> {
>>>       ......
>>>                   set_root_present(root);
>>>       ......
>>> }
>>>
>>> Calling tree:
>>>       device driver
>>>           intel_alloc_coherent
>>>               __intel_map_single
>>>                   domain_context_mapping
>>>                       domain_context_mapping_one
>>>                           device_to_context_entry
>>>
>>> This means, the present bit in root entry will not be set until the device
>>> driver is loaded.
>>>
>>> But in the kdump kernel, hardware devices are not aware that control has
>>> transferred to the second kernel, and those drivers must initialize again.
>>> Consequently there may be unexpected DMA requests from devices activity
>>> initiated in the first kernel leading to the DMA Remapping errors in the
>>> second kernel.
>>>
>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>> the device itself does not work.
>>>
>>> A patch for this bug that has been sent before:
>>> https://lkml.org/lkml/2014/9/30/55
>>> As in discussion, this bug may happen on *any* device, so we need to reset all
>>> pci devices.
>>>
>>> There was an original version(Takao Indoh) that resets the pcie devices:
>>> https://lkml.org/lkml/2013/5/14/9
> 
> As far as I can remember, the original patch was nacked by
> the following reasons:
> 
> 1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
>     so there would still be a window where ongoing DMA could cause an
>     IOMMU error.
> 
> 2) Basically Bjorn is thinking device reset should be done in the
>     1st kernel before jumping into 2nd kernel.
> 
> And Bill Sumner proposed another idea.
> http://comments.gmane.org/gmane.linux.kernel.iommu/4828
> I don't know the current status of this patch, but I think Jerry Hoemann
> is working on this.
> 
> Thanks,
> Takao Indoh
> 
> 
>>>
>>> Update of this new version, comparing with Takao Indoh's version:
>>>       Add support for legacy PCI devices.
>>>       Use pci_try_reset_bus instead of do_downstream_device_reset in original version
>>>
>>> Randy Wright corrects some misunderstanding in this description.
>>>
>>> Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
>>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>>> Signed-off-by: Randy Wright <rwright@hp.com>
>>> ---
>>>    drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 84 insertions(+)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 2c9ac70..8cb146c 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -23,6 +23,7 @@
>>>    #include <linux/device.h>
>>>    #include <linux/pm_runtime.h>
>>>    #include <linux/pci_hotplug.h>
>>> +#include <linux/crash_dump.h>
>>>    #include <asm-generic/pci-bridge.h>
>>>    #include <asm/setup.h>
>>>    #include "pci.h"
>>> @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>>    }
>>>    EXPORT_SYMBOL(pci_fixup_cardbus);
>>>
>>> +/*
>>> + * Return true if dev is PCI root port or downstream port whose child is PCI
>>> + * endpoint except VGA device.
>>> + */
>>> +static int __pci_dev_need_reset(struct pci_dev *dev)
>>> +{
>>> +    struct pci_bus *subordinate;
>>> +    struct pci_dev *child;
>>> +
>>> +    if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>>> +        return 0;
>>> +
>>> +    if (pci_is_pcie(dev)) {
>>> +        if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>>> +            (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>>> +            return 0;
>>> +    }
>>> +
>>> +    subordinate = dev->subordinate;
>>> +    list_for_each_entry(child, &subordinate->devices, bus_list) {
>>> +        /* Don't reset switch, bridge, VGA device */
>>> +        if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
>>> +            ((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) ||
>>> +            ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>>> +            return 0;
>>> +
>>> +        if (pci_is_pcie(child)) {
>>> +            if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>> +                (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE))
>>> +                return 0;
>>> +        }
>>> +    }
>>> +
>>> +    return 1;
>>> +}
>>> +
>>> +struct pci_dev_reset_entry {
>>> +    struct list_head list;
>>> +    struct pci_dev *dev;
>>> +};
>>> +int __init pci_reset_endpoints(void)
>>> +{
>>> +    struct pci_dev *dev = NULL;
>>> +    struct pci_dev_reset_entry *pdev_entry, *tmp;
>>> +    struct pci_bus *subordinate = NULL;
>>> +    int has_it;
>>> +
>>> +    LIST_HEAD(pdev_list);
>>> +
>>> +    if (likely(!is_kdump_kernel()))
>>> +        return 0;
>>> +
>>> +    for_each_pci_dev(dev) {
>>> +        subordinate = dev->subordinate;
>>> +        if (!subordinate || list_empty(&subordinate->devices))
>>> +            continue;
>>> +
>>> +        has_it = 0;
>>> +        list_for_each_entry(pdev_entry, &pdev_list, list) {
>>> +            if (dev == pdev_entry->dev) {
>>> +                has_it = 1;
>>> +                break;
>>> +            }
>>> +        }
>>> +        if (has_it)
>>> +            continue;
>>> +
>>> +        if (__pci_dev_need_reset(dev)) {
>>> +            pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>>> +            pdev_entry->dev = dev;
>>> +            list_add(&pdev_entry->list, &pdev_list);
>>> +        }
>>> +    }
>>> +
>>> +    list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>>> +        pci_try_reset_bus(pdev_entry->dev->subordinate);
>>> +        kfree(pdev_entry);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +fs_initcall_sync(pci_reset_endpoints);
>>> +
>>>    static int __init pci_setup(char *str)
>>>    {
>>>        while (str) {
>>>
>>
>>
>>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Li, Zhen-Hua Oct. 20, 2014, 2:19 a.m. UTC | #4

Hi  Takao Indoh,

According to this discussion
	https://lkml.org/lkml/2014/10/17/107

It seems that we can not do the resetting on the first kernel.  It can
only be called during kdump kernel boots.

Thanks
Zhenhua
On 10/15/2014 04:14 PM, Takao Indoh wrote:
> (2014/10/14 18:34), Li, ZhenHua wrote:
>> I tested on the latest stable version 3.17, it works well.
>>
>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>>> On a HP system with Intel vt-d supported and many PCI devices on it,
>>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>>> there may be some unexpected DMA requests on this adapter, which will
>>> cause DMA Remapping faults like:
>>>       dmar: DRHD: handling fault status reg 102
>>>       dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>>       DMAR:[fault reason 01] Present bit in root entry is clear
>>>
>>> This bug may happen on *any* PCI device.
>>> Analysis for this bug:
>>>
>>> The present bit is set in this function:
>>>
>>> static struct context_entry * device_to_context_entry(
>>>                   struct intel_iommu *iommu, u8 bus, u8 devfn)
>>> {
>>>       ......
>>>                   set_root_present(root);
>>>       ......
>>> }
>>>
>>> Calling tree:
>>>       device driver
>>>           intel_alloc_coherent
>>>               __intel_map_single
>>>                   domain_context_mapping
>>>                       domain_context_mapping_one
>>>                           device_to_context_entry
>>>
>>> This means, the present bit in root entry will not be set until the device
>>> driver is loaded.
>>>
>>> But in the kdump kernel, hardware devices are not aware that control has
>>> transferred to the second kernel, and those drivers must initialize again.
>>> Consequently there may be unexpected DMA requests from devices activity
>>> initiated in the first kernel leading to the DMA Remapping errors in the
>>> second kernel.
>>>
>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>> the device itself does not work.
>>>
>>> A patch for this bug that has been sent before:
>>> https://lkml.org/lkml/2014/9/30/55
>>> As in discussion, this bug may happen on *any* device, so we need to reset all
>>> pci devices.
>>>
>>> There was an original version(Takao Indoh) that resets the pcie devices:
>>> https://lkml.org/lkml/2013/5/14/9
> 
> As far as I can remember, the original patch was nacked by
> the following reasons:
> 
> 1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
>     so there would still be a window where ongoing DMA could cause an
>     IOMMU error.
> 
> 2) Basically Bjorn is thinking device reset should be done in the
>     1st kernel before jumping into 2nd kernel.
> 
> And Bill Sumner proposed another idea.
> http://comments.gmane.org/gmane.linux.kernel.iommu/4828
> I don't know the current status of this patch, but I think Jerry Hoemann
> is working on this.
> 
> Thanks,
> Takao Indoh
> 
> 
>>>
>>> Update of this new version, comparing with Takao Indoh's version:
>>>       Add support for legacy PCI devices.
>>>       Use pci_try_reset_bus instead of do_downstream_device_reset in original version
>>>
>>> Randy Wright corrects some misunderstanding in this description.
>>>
>>> Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
>>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>>> Signed-off-by: Randy Wright <rwright@hp.com>
>>> ---
>>>    drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 84 insertions(+)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 2c9ac70..8cb146c 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -23,6 +23,7 @@
>>>    #include <linux/device.h>
>>>    #include <linux/pm_runtime.h>
>>>    #include <linux/pci_hotplug.h>
>>> +#include <linux/crash_dump.h>
>>>    #include <asm-generic/pci-bridge.h>
>>>    #include <asm/setup.h>
>>>    #include "pci.h"
>>> @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>>    }
>>>    EXPORT_SYMBOL(pci_fixup_cardbus);
>>>
>>> +/*
>>> + * Return true if dev is PCI root port or downstream port whose child is PCI
>>> + * endpoint except VGA device.
>>> + */
>>> +static int __pci_dev_need_reset(struct pci_dev *dev)
>>> +{
>>> +    struct pci_bus *subordinate;
>>> +    struct pci_dev *child;
>>> +
>>> +    if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>>> +        return 0;
>>> +
>>> +    if (pci_is_pcie(dev)) {
>>> +        if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>>> +            (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>>> +            return 0;
>>> +    }
>>> +
>>> +    subordinate = dev->subordinate;
>>> +    list_for_each_entry(child, &subordinate->devices, bus_list) {
>>> +        /* Don't reset switch, bridge, VGA device */
>>> +        if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
>>> +            ((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) ||
>>> +            ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>>> +            return 0;
>>> +
>>> +        if (pci_is_pcie(child)) {
>>> +            if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>> +                (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE))
>>> +                return 0;
>>> +        }
>>> +    }
>>> +
>>> +    return 1;
>>> +}
>>> +
>>> +struct pci_dev_reset_entry {
>>> +    struct list_head list;
>>> +    struct pci_dev *dev;
>>> +};
>>> +int __init pci_reset_endpoints(void)
>>> +{
>>> +    struct pci_dev *dev = NULL;
>>> +    struct pci_dev_reset_entry *pdev_entry, *tmp;
>>> +    struct pci_bus *subordinate = NULL;
>>> +    int has_it;
>>> +
>>> +    LIST_HEAD(pdev_list);
>>> +
>>> +    if (likely(!is_kdump_kernel()))
>>> +        return 0;
>>> +
>>> +    for_each_pci_dev(dev) {
>>> +        subordinate = dev->subordinate;
>>> +        if (!subordinate || list_empty(&subordinate->devices))
>>> +            continue;
>>> +
>>> +        has_it = 0;
>>> +        list_for_each_entry(pdev_entry, &pdev_list, list) {
>>> +            if (dev == pdev_entry->dev) {
>>> +                has_it = 1;
>>> +                break;
>>> +            }
>>> +        }
>>> +        if (has_it)
>>> +            continue;
>>> +
>>> +        if (__pci_dev_need_reset(dev)) {
>>> +            pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>>> +            pdev_entry->dev = dev;
>>> +            list_add(&pdev_entry->list, &pdev_list);
>>> +        }
>>> +    }
>>> +
>>> +    list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>>> +        pci_try_reset_bus(pdev_entry->dev->subordinate);
>>> +        kfree(pdev_entry);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +fs_initcall_sync(pci_reset_endpoints);
>>> +
>>>    static int __init pci_setup(char *str)
>>>    {
>>>        while (str) {
>>>
>>
>>
>>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Takao Indoh Oct. 21, 2014, 8:23 a.m. UTC | #5

Hi ZhenHua,

(2014/10/20 11:19), Li, ZhenHua wrote:
> Hi  Takao Indoh,
> 
> According to this discussion
> 	https://lkml.org/lkml/2014/10/17/107
> 
> It seems that we can not do the resetting on the first kernel.  It can
> only be called during kdump kernel boots.

Sounds like that. Do you know any example cases which cannot be fixed by
Bill's patch?

Thanks,
Takao Indoh


> 
> Thanks
> Zhenhua
> On 10/15/2014 04:14 PM, Takao Indoh wrote:
>> (2014/10/14 18:34), Li, ZhenHua wrote:
>>> I tested on the latest stable version 3.17, it works well.
>>>
>>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>>>> On a HP system with Intel vt-d supported and many PCI devices on it,
>>>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>>>> there may be some unexpected DMA requests on this adapter, which will
>>>> cause DMA Remapping faults like:
>>>>        dmar: DRHD: handling fault status reg 102
>>>>        dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>>>        DMAR:[fault reason 01] Present bit in root entry is clear
>>>>
>>>> This bug may happen on *any* PCI device.
>>>> Analysis for this bug:
>>>>
>>>> The present bit is set in this function:
>>>>
>>>> static struct context_entry * device_to_context_entry(
>>>>                    struct intel_iommu *iommu, u8 bus, u8 devfn)
>>>> {
>>>>        ......
>>>>                    set_root_present(root);
>>>>        ......
>>>> }
>>>>
>>>> Calling tree:
>>>>        device driver
>>>>            intel_alloc_coherent
>>>>                __intel_map_single
>>>>                    domain_context_mapping
>>>>                        domain_context_mapping_one
>>>>                            device_to_context_entry
>>>>
>>>> This means, the present bit in root entry will not be set until the device
>>>> driver is loaded.
>>>>
>>>> But in the kdump kernel, hardware devices are not aware that control has
>>>> transferred to the second kernel, and those drivers must initialize again.
>>>> Consequently there may be unexpected DMA requests from devices activity
>>>> initiated in the first kernel leading to the DMA Remapping errors in the
>>>> second kernel.
>>>>
>>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>>> the device itself does not work.
>>>>
>>>> A patch for this bug that has been sent before:
>>>> https://lkml.org/lkml/2014/9/30/55
>>>> As in discussion, this bug may happen on *any* device, so we need to reset all
>>>> pci devices.
>>>>
>>>> There was an original version(Takao Indoh) that resets the pcie devices:
>>>> https://lkml.org/lkml/2013/5/14/9
>>
>> As far as I can remember, the original patch was nacked by
>> the following reasons:
>>
>> 1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
>>      so there would still be a window where ongoing DMA could cause an
>>      IOMMU error.
>>
>> 2) Basically Bjorn is thinking device reset should be done in the
>>      1st kernel before jumping into 2nd kernel.
>>
>> And Bill Sumner proposed another idea.
>> http://comments.gmane.org/gmane.linux.kernel.iommu/4828
>> I don't know the current status of this patch, but I think Jerry Hoemann
>> is working on this.
>>
>> Thanks,
>> Takao Indoh
>>
>>
>>>>
>>>> Update of this new version, comparing with Takao Indoh's version:
>>>>        Add support for legacy PCI devices.
>>>>        Use pci_try_reset_bus instead of do_downstream_device_reset in original version
>>>>
>>>> Randy Wright corrects some misunderstanding in this description.
>>>>
>>>> Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
>>>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>>>> Signed-off-by: Randy Wright <rwright@hp.com>
>>>> ---
>>>>     drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>     1 file changed, 84 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 2c9ac70..8cb146c 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -23,6 +23,7 @@
>>>>     #include <linux/device.h>
>>>>     #include <linux/pm_runtime.h>
>>>>     #include <linux/pci_hotplug.h>
>>>> +#include <linux/crash_dump.h>
>>>>     #include <asm-generic/pci-bridge.h>
>>>>     #include <asm/setup.h>
>>>>     #include "pci.h"
>>>> @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>>>     }
>>>>     EXPORT_SYMBOL(pci_fixup_cardbus);
>>>>
>>>> +/*
>>>> + * Return true if dev is PCI root port or downstream port whose child is PCI
>>>> + * endpoint except VGA device.
>>>> + */
>>>> +static int __pci_dev_need_reset(struct pci_dev *dev)
>>>> +{
>>>> +    struct pci_bus *subordinate;
>>>> +    struct pci_dev *child;
>>>> +
>>>> +    if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>>>> +        return 0;
>>>> +
>>>> +    if (pci_is_pcie(dev)) {
>>>> +        if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>>>> +            (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>>>> +            return 0;
>>>> +    }
>>>> +
>>>> +    subordinate = dev->subordinate;
>>>> +    list_for_each_entry(child, &subordinate->devices, bus_list) {
>>>> +        /* Don't reset switch, bridge, VGA device */
>>>> +        if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
>>>> +            ((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) ||
>>>> +            ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>>>> +            return 0;
>>>> +
>>>> +        if (pci_is_pcie(child)) {
>>>> +            if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>>> +                (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE))
>>>> +                return 0;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return 1;
>>>> +}
>>>> +
>>>> +struct pci_dev_reset_entry {
>>>> +    struct list_head list;
>>>> +    struct pci_dev *dev;
>>>> +};
>>>> +int __init pci_reset_endpoints(void)
>>>> +{
>>>> +    struct pci_dev *dev = NULL;
>>>> +    struct pci_dev_reset_entry *pdev_entry, *tmp;
>>>> +    struct pci_bus *subordinate = NULL;
>>>> +    int has_it;
>>>> +
>>>> +    LIST_HEAD(pdev_list);
>>>> +
>>>> +    if (likely(!is_kdump_kernel()))
>>>> +        return 0;
>>>> +
>>>> +    for_each_pci_dev(dev) {
>>>> +        subordinate = dev->subordinate;
>>>> +        if (!subordinate || list_empty(&subordinate->devices))
>>>> +            continue;
>>>> +
>>>> +        has_it = 0;
>>>> +        list_for_each_entry(pdev_entry, &pdev_list, list) {
>>>> +            if (dev == pdev_entry->dev) {
>>>> +                has_it = 1;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +        if (has_it)
>>>> +            continue;
>>>> +
>>>> +        if (__pci_dev_need_reset(dev)) {
>>>> +            pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>>>> +            pdev_entry->dev = dev;
>>>> +            list_add(&pdev_entry->list, &pdev_list);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>>>> +        pci_try_reset_bus(pdev_entry->dev->subordinate);
>>>> +        kfree(pdev_entry);
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +fs_initcall_sync(pci_reset_endpoints);
>>>> +
>>>>     static int __init pci_setup(char *str)
>>>>     {
>>>>         while (str) {
>>>>
>>>
>>>
>>>
>>
>>
> 
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas Oct. 22, 2014, 2:47 a.m. UTC | #6

[+cc Joerg, Eric, Tom, David, iommu list]

On Wed, Oct 15, 2014 at 2:14 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
> (2014/10/14 18:34), Li, ZhenHua wrote:
>> I tested on the latest stable version 3.17, it works well.
>>
>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>>> On a HP system with Intel vt-d supported and many PCI devices on it,
>>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>>> there may be some unexpected DMA requests on this adapter, which will
>>> cause DMA Remapping faults like:
>>>      dmar: DRHD: handling fault status reg 102
>>>      dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>>      DMAR:[fault reason 01] Present bit in root entry is clear
>>>
>>> This bug may happen on *any* PCI device.
>>> Analysis for this bug:
>>>
>>> The present bit is set in this function:
>>>
>>> static struct context_entry * device_to_context_entry(
>>>                  struct intel_iommu *iommu, u8 bus, u8 devfn)
>>> {
>>>      ......
>>>                  set_root_present(root);
>>>      ......
>>> }
>>>
>>> Calling tree:
>>>      device driver
>>>          intel_alloc_coherent
>>>              __intel_map_single
>>>                  domain_context_mapping
>>>                      domain_context_mapping_one
>>>                          device_to_context_entry
>>>
>>> This means, the present bit in root entry will not be set until the device
>>> driver is loaded.
>>>
>>> But in the kdump kernel, hardware devices are not aware that control has
>>> transferred to the second kernel, and those drivers must initialize again.
>>> Consequently there may be unexpected DMA requests from devices activity
>>> initiated in the first kernel leading to the DMA Remapping errors in the
>>> second kernel.
>>>
>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>> the device itself does not work.

You have not explained why the DMAR faults are a problem.  The fault
is just an indication that the IOMMU prevented a DMA from completing.
If the DMA is an artifact of the crashed kernel, we probably don't
*want* it to complete, so taking a DMAR fault seems like exactly the
right thing.

If the problem is that we're being flooded with messages, it's easy
enough to just tone down the printks.

>>> A patch for this bug that has been sent before:
>>> https://lkml.org/lkml/2014/9/30/55
>>> As in discussion, this bug may happen on *any* device, so we need to reset all
>>> pci devices.
>>>
>>> There was an original version(Takao Indoh) that resets the pcie devices:
>>> https://lkml.org/lkml/2013/5/14/9
>
> As far as I can remember, the original patch was nacked by
> the following reasons:
>
> 1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
>    so there would still be a window where ongoing DMA could cause an
>    IOMMU error.
>
> 2) Basically Bjorn is thinking device reset should be done in the
>    1st kernel before jumping into 2nd kernel.

If you're referring to this: https://lkml.org/lkml/2013/6/12/16, what
I said was "It would be at least conceivable to reset the devices ...
before the kexec."  That's not a requirement to do it in the first
kernel, just an idea that I thought should be investigated.  And Eric
has good reasons for *not* doing the reset in the first kernel, so it
turned out not to be a very good idea.

My fundamental problem with this whole reset thing is that it's a
sledgehammer approach and it's ugly.  Using the IOMMU seems like a
much more elegant approach.

So if we are forced to accept the reset solution, I want to at least
have a concise explanation of why we can't use the IOMMU.

The changelog above is perfectly accurate, but it's really not very
useful because it only explains the code without exploring any of the
interesting issues.

Bjorn

> And Bill Sumner proposed another idea.
> http://comments.gmane.org/gmane.linux.kernel.iommu/4828
> I don't know the current status of this patch, but I think Jerry Hoemann
> is working on this.
>
> Thanks,
> Takao Indoh
>
>
>>>
>>> Update of this new version, comparing with Takao Indoh's version:
>>>      Add support for legacy PCI devices.
>>>      Use pci_try_reset_bus instead of do_downstream_device_reset in original version
>>>
>>> Randy Wright corrects some misunderstanding in this description.
>>>
>>> Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
>>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>>> Signed-off-by: Randy Wright <rwright@hp.com>
>>> ---
>>>   drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 84 insertions(+)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 2c9ac70..8cb146c 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -23,6 +23,7 @@
>>>   #include <linux/device.h>
>>>   #include <linux/pm_runtime.h>
>>>   #include <linux/pci_hotplug.h>
>>> +#include <linux/crash_dump.h>
>>>   #include <asm-generic/pci-bridge.h>
>>>   #include <asm/setup.h>
>>>   #include "pci.h"
>>> @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>>   }
>>>   EXPORT_SYMBOL(pci_fixup_cardbus);
>>>
>>> +/*
>>> + * Return true if dev is PCI root port or downstream port whose child is PCI
>>> + * endpoint except VGA device.
>>> + */
>>> +static int __pci_dev_need_reset(struct pci_dev *dev)
>>> +{
>>> +    struct pci_bus *subordinate;
>>> +    struct pci_dev *child;
>>> +
>>> +    if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>>> +        return 0;
>>> +
>>> +    if (pci_is_pcie(dev)) {
>>> +        if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>>> +            (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>>> +            return 0;
>>> +    }
>>> +
>>> +    subordinate = dev->subordinate;
>>> +    list_for_each_entry(child, &subordinate->devices, bus_list) {
>>> +        /* Don't reset switch, bridge, VGA device */
>>> +        if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
>>> +            ((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) ||
>>> +            ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>>> +            return 0;
>>> +
>>> +        if (pci_is_pcie(child)) {
>>> +            if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>> +                (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE))
>>> +                return 0;
>>> +        }
>>> +    }
>>> +
>>> +    return 1;
>>> +}
>>> +
>>> +struct pci_dev_reset_entry {
>>> +    struct list_head list;
>>> +    struct pci_dev *dev;
>>> +};
>>> +int __init pci_reset_endpoints(void)
>>> +{
>>> +    struct pci_dev *dev = NULL;
>>> +    struct pci_dev_reset_entry *pdev_entry, *tmp;
>>> +    struct pci_bus *subordinate = NULL;
>>> +    int has_it;
>>> +
>>> +    LIST_HEAD(pdev_list);
>>> +
>>> +    if (likely(!is_kdump_kernel()))
>>> +        return 0;
>>> +
>>> +    for_each_pci_dev(dev) {
>>> +        subordinate = dev->subordinate;
>>> +        if (!subordinate || list_empty(&subordinate->devices))
>>> +            continue;
>>> +
>>> +        has_it = 0;
>>> +        list_for_each_entry(pdev_entry, &pdev_list, list) {
>>> +            if (dev == pdev_entry->dev) {
>>> +                has_it = 1;
>>> +                break;
>>> +            }
>>> +        }
>>> +        if (has_it)
>>> +            continue;
>>> +
>>> +        if (__pci_dev_need_reset(dev)) {
>>> +            pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>>> +            pdev_entry->dev = dev;
>>> +            list_add(&pdev_entry->list, &pdev_list);
>>> +        }
>>> +    }
>>> +
>>> +    list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>>> +        pci_try_reset_bus(pdev_entry->dev->subordinate);
>>> +        kfree(pdev_entry);
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +fs_initcall_sync(pci_reset_endpoints);
>>> +
>>>   static int __init pci_setup(char *str)
>>>   {
>>>       while (str) {
>>>
>>
>>
>>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Li, Zhen-Hua Oct. 22, 2014, 3:02 a.m. UTC | #7

On 10/22/2014 10:47 AM, Bjorn Helgaas wrote:
> [+cc Joerg, Eric, Tom, David, iommu list]
>
> On Wed, Oct 15, 2014 at 2:14 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>> (2014/10/14 18:34), Li, ZhenHua wrote:
>>> I tested on the latest stable version 3.17, it works well.
>>>
>>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>>>> On a HP system with Intel vt-d supported and many PCI devices on it,
>>>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>>>> there may be some unexpected DMA requests on this adapter, which will
>>>> cause DMA Remapping faults like:
>>>>       dmar: DRHD: handling fault status reg 102
>>>>       dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>>>       DMAR:[fault reason 01] Present bit in root entry is clear
>>>>
>>>> This bug may happen on *any* PCI device.
>>>> Analysis for this bug:
>>>>
>>>> The present bit is set in this function:
>>>>
>>>> static struct context_entry * device_to_context_entry(
>>>>                   struct intel_iommu *iommu, u8 bus, u8 devfn)
>>>> {
>>>>       ......
>>>>                   set_root_present(root);
>>>>       ......
>>>> }
>>>>
>>>> Calling tree:
>>>>       device driver
>>>>           intel_alloc_coherent
>>>>               __intel_map_single
>>>>                   domain_context_mapping
>>>>                       domain_context_mapping_one
>>>>                           device_to_context_entry
>>>>
>>>> This means, the present bit in root entry will not be set until the device
>>>> driver is loaded.
>>>>
>>>> But in the kdump kernel, hardware devices are not aware that control has
>>>> transferred to the second kernel, and those drivers must initialize again.
>>>> Consequently there may be unexpected DMA requests from devices activity
>>>> initiated in the first kernel leading to the DMA Remapping errors in the
>>>> second kernel.
>>>>
>>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>>> the device itself does not work.
>
> You have not explained why the DMAR faults are a problem.  The fault
> is just an indication that the IOMMU prevented a DMA from completing.
> If the DMA is an artifact of the crashed kernel, we probably don't
> *want* it to complete, so taking a DMAR fault seems like exactly the
> right thing.
Well, I still need more time to think about other contents you mentioned 
and explained in these mails. But about the DMA fault, I think it is not 
"the iommu prevented a DMA from completing", it is the iommu could not 
help system complete the dma, so the iommu reported an error.

Also I agree with you that these DMA should not be completed.
But I still think,  these dma, programmed by the old kernel, should be 
stopped because they are some kind of illegal for the kdump kernel, no 
matter what iommu did.

>
> If the problem is that we're being flooded with messages, it's easy
> enough to just tone down the printks.
>
>>>> A patch for this bug that has been sent before:
>>>> https://lkml.org/lkml/2014/9/30/55
>>>> As in discussion, this bug may happen on *any* device, so we need to reset all
>>>> pci devices.
>>>>
>>>> There was an original version(Takao Indoh) that resets the pcie devices:
>>>> https://lkml.org/lkml/2013/5/14/9
>>
>> As far as I can remember, the original patch was nacked by
>> the following reasons:
>>
>> 1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
>>     so there would still be a window where ongoing DMA could cause an
>>     IOMMU error.
>>
>> 2) Basically Bjorn is thinking device reset should be done in the
>>     1st kernel before jumping into 2nd kernel.
>
> If you're referring to this: https://lkml.org/lkml/2013/6/12/16, what
> I said was "It would be at least conceivable to reset the devices ...
> before the kexec."  That's not a requirement to do it in the first
> kernel, just an idea that I thought should be investigated.  And Eric
> has good reasons for *not* doing the reset in the first kernel, so it
> turned out not to be a very good idea.
>
> My fundamental problem with this whole reset thing is that it's a
> sledgehammer approach and it's ugly.  Using the IOMMU seems like a
> much more elegant approach.
>
> So if we are forced to accept the reset solution, I want to at least
> have a concise explanation of why we can't use the IOMMU.
>
> The changelog above is perfectly accurate, but it's really not very
> useful because it only explains the code without exploring any of the
> interesting issues.
>
> Bjorn
>
>> And Bill Sumner proposed another idea.
>> http://comments.gmane.org/gmane.linux.kernel.iommu/4828
>> I don't know the current status of this patch, but I think Jerry Hoemann
>> is working on this.
>>
>> Thanks,
>> Takao Indoh
>>
>>
>>>>
>>>> Update of this new version, comparing with Takao Indoh's version:
>>>>       Add support for legacy PCI devices.
>>>>       Use pci_try_reset_bus instead of do_downstream_device_reset in original version
>>>>
>>>> Randy Wright corrects some misunderstanding in this description.
>>>>
>>>> Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
>>>> Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com>
>>>> Signed-off-by: Randy Wright <rwright@hp.com>
>>>> ---
>>>>    drivers/pci/pci.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 84 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 2c9ac70..8cb146c 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -23,6 +23,7 @@
>>>>    #include <linux/device.h>
>>>>    #include <linux/pm_runtime.h>
>>>>    #include <linux/pci_hotplug.h>
>>>> +#include <linux/crash_dump.h>
>>>>    #include <asm-generic/pci-bridge.h>
>>>>    #include <asm/setup.h>
>>>>    #include "pci.h"
>>>> @@ -4423,6 +4424,89 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>>>    }
>>>>    EXPORT_SYMBOL(pci_fixup_cardbus);
>>>>
>>>> +/*
>>>> + * Return true if dev is PCI root port or downstream port whose child is PCI
>>>> + * endpoint except VGA device.
>>>> + */
>>>> +static int __pci_dev_need_reset(struct pci_dev *dev)
>>>> +{
>>>> +    struct pci_bus *subordinate;
>>>> +    struct pci_dev *child;
>>>> +
>>>> +    if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>>>> +        return 0;
>>>> +
>>>> +    if (pci_is_pcie(dev)) {
>>>> +        if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>>>> +            (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>>>> +            return 0;
>>>> +    }
>>>> +
>>>> +    subordinate = dev->subordinate;
>>>> +    list_for_each_entry(child, &subordinate->devices, bus_list) {
>>>> +        /* Don't reset switch, bridge, VGA device */
>>>> +        if ((child->hdr_type == PCI_HEADER_TYPE_BRIDGE) ||
>>>> +            ((child->class >> 16) == PCI_BASE_CLASS_BRIDGE) ||
>>>> +            ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
>>>> +            return 0;
>>>> +
>>>> +        if (pci_is_pcie(child)) {
>>>> +            if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>>> +                (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE))
>>>> +                return 0;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return 1;
>>>> +}
>>>> +
>>>> +struct pci_dev_reset_entry {
>>>> +    struct list_head list;
>>>> +    struct pci_dev *dev;
>>>> +};
>>>> +int __init pci_reset_endpoints(void)
>>>> +{
>>>> +    struct pci_dev *dev = NULL;
>>>> +    struct pci_dev_reset_entry *pdev_entry, *tmp;
>>>> +    struct pci_bus *subordinate = NULL;
>>>> +    int has_it;
>>>> +
>>>> +    LIST_HEAD(pdev_list);
>>>> +
>>>> +    if (likely(!is_kdump_kernel()))
>>>> +        return 0;
>>>> +
>>>> +    for_each_pci_dev(dev) {
>>>> +        subordinate = dev->subordinate;
>>>> +        if (!subordinate || list_empty(&subordinate->devices))
>>>> +            continue;
>>>> +
>>>> +        has_it = 0;
>>>> +        list_for_each_entry(pdev_entry, &pdev_list, list) {
>>>> +            if (dev == pdev_entry->dev) {
>>>> +                has_it = 1;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +        if (has_it)
>>>> +            continue;
>>>> +
>>>> +        if (__pci_dev_need_reset(dev)) {
>>>> +            pdev_entry = kmalloc(sizeof(*pdev_entry), GFP_KERNEL);
>>>> +            pdev_entry->dev = dev;
>>>> +            list_add(&pdev_entry->list, &pdev_list);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    list_for_each_entry_safe(pdev_entry, tmp, &pdev_list, list) {
>>>> +        pci_try_reset_bus(pdev_entry->dev->subordinate);
>>>> +        kfree(pdev_entry);
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +fs_initcall_sync(pci_reset_endpoints);
>>>> +
>>>>    static int __init pci_setup(char *str)
>>>>    {
>>>>        while (str) {
>>>>
>>>
>>>
>>>
>>
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Alexander Duyck Oct. 22, 2014, 4:54 p.m. UTC | #8

On 10/21/2014 07:47 PM, Bjorn Helgaas wrote:
> [+cc Joerg, Eric, Tom, David, iommu list]
>
> On Wed, Oct 15, 2014 at 2:14 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>> (2014/10/14 18:34), Li, ZhenHua wrote:
>>> I tested on the latest stable version 3.17, it works well.
>>>
>>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>>>> On a HP system with Intel vt-d supported and many PCI devices on it,
>>>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>>>> there may be some unexpected DMA requests on this adapter, which will
>>>> cause DMA Remapping faults like:
>>>>      dmar: DRHD: handling fault status reg 102
>>>>      dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>>>      DMAR:[fault reason 01] Present bit in root entry is clear
>>>>
>>>> This bug may happen on *any* PCI device.
>>>> Analysis for this bug:
>>>>
>>>> The present bit is set in this function:
>>>>
>>>> static struct context_entry * device_to_context_entry(
>>>>                  struct intel_iommu *iommu, u8 bus, u8 devfn)
>>>> {
>>>>      ......
>>>>                  set_root_present(root);
>>>>      ......
>>>> }
>>>>
>>>> Calling tree:
>>>>      device driver
>>>>          intel_alloc_coherent
>>>>              __intel_map_single
>>>>                  domain_context_mapping
>>>>                      domain_context_mapping_one
>>>>                          device_to_context_entry
>>>>
>>>> This means, the present bit in root entry will not be set until the device
>>>> driver is loaded.
>>>>
>>>> But in the kdump kernel, hardware devices are not aware that control has
>>>> transferred to the second kernel, and those drivers must initialize again.
>>>> Consequently there may be unexpected DMA requests from devices activity
>>>> initiated in the first kernel leading to the DMA Remapping errors in the
>>>> second kernel.
>>>>
>>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>>> the device itself does not work.
> You have not explained why the DMAR faults are a problem.  The fault
> is just an indication that the IOMMU prevented a DMA from completing.
> If the DMA is an artifact of the crashed kernel, we probably don't
> *want* it to complete, so taking a DMAR fault seems like exactly the
> right thing.
>
> If the problem is that we're being flooded with messages, it's easy
> enough to just tone down the printks.

As I recall what we have seen in the past with the network controllers
is that they get stuck in a state where they can no longer perform any
DMA due to the fact that some of the transactions have returned errors
from the IOMMU being reset.  The only way out is to perform a PCIe reset
on the part after the IOMMU has been enabled which doesn't occur
automatically unless AER or EEH is enabled in the system.

One thought would be to take a look at the IOMMU reset code.  Is there
any way to go through and make sure that all of the PCI devices that
make use of the IOMMU have the bus mastering disabled prior to the IOMMU
being reset?  For example could we suspend all of the parts in order to
force them to hold off any transactions, and then resume them after the
IOMMU has been reset?  If we could do at least that much that would
prevent the errors and should allow for a graceful reset.

Thanks,

Alex


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Bjorn Helgaas Oct. 22, 2014, 5:24 p.m. UTC | #9

On Wed, Oct 22, 2014 at 10:54 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On 10/21/2014 07:47 PM, Bjorn Helgaas wrote:
>> [+cc Joerg, Eric, Tom, David, iommu list]
>>
>> On Wed, Oct 15, 2014 at 2:14 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>>> (2014/10/14 18:34), Li, ZhenHua wrote:
>>>> I tested on the latest stable version 3.17, it works well.
>>>>
>>>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:

>>>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>>>> the device itself does not work.
>> You have not explained why the DMAR faults are a problem.  The fault
>> is just an indication that the IOMMU prevented a DMA from completing.
>> If the DMA is an artifact of the crashed kernel, we probably don't
>> *want* it to complete, so taking a DMAR fault seems like exactly the
>> right thing.
>>
>> If the problem is that we're being flooded with messages, it's easy
>> enough to just tone down the printks.
>
> As I recall what we have seen in the past with the network controllers
> is that they get stuck in a state where they can no longer perform any
> DMA due to the fact that some of the transactions have returned errors
> from the IOMMU being reset.  The only way out is to perform a PCIe reset
> on the part after the IOMMU has been enabled which doesn't occur
> automatically unless AER or EEH is enabled in the system.

OK, now we're talking about a real issue, the sort of thing that
should be in the changelog for a change like this.

I'm uneasy about the strategy of "it hurts when an IOMMU fault occurs,
therefore we need to avoid all IOMMU faults."  Isn't the whole *point*
of an IOMMU to generate faults?  It seems like we need to be able to
handle faults gracefully.

If having AER or EEH enabled in the kdump kernel is part of what's
required to recover, I don't see a problem with requiring that.

Don't we have to be able to recover from IOMMU faults for the device
pass-through case anyway?  If a NIC is passed through to a malicious
guest, I assume the guest can cause IOMMU faults.  I assume we handle
this today by resetting the NIC when the guest exits.

> One thought would be to take a look at the IOMMU reset code.  Is there
> any way to go through and make sure that all of the PCI devices that
> make use of the IOMMU have the bus mastering disabled prior to the IOMMU
> being reset?  For example could we suspend all of the parts in order to
> force them to hold off any transactions, and then resume them after the
> IOMMU has been reset?  If we could do at least that much that would
> prevent the errors and should allow for a graceful reset.
>
> Thanks,
>
> Alex
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Li, Zhen-Hua Oct. 23, 2014, 7:26 a.m. UTC | #10

On 10/22/2014 10:47 AM, Bjorn Helgaas wrote:
> [+cc Joerg, Eric, Tom, David, iommu list]
>
> On Wed, Oct 15, 2014 at 2:14 AM, Takao Indoh <indou.takao@jp.fujitsu.com> wrote:
>> (2014/10/14 18:34), Li, ZhenHua wrote:
>>> I tested on the latest stable version 3.17, it works well.
>>>
>>> On 10/10/2014 03:13 PM, Li, Zhen-Hua wrote:
>>>> On a HP system with Intel vt-d supported and many PCI devices on it,
>>>> when kernel crashed and the kdump kernel boots with intel_iommu=on,
>>>> there may be some unexpected DMA requests on this adapter, which will
>>>> cause DMA Remapping faults like:
>>>>       dmar: DRHD: handling fault status reg 102
>>>>       dmar: DMAR:[DMA Read] Request device [41:00.0] fault addr fff81000
>>>>       DMAR:[fault reason 01] Present bit in root entry is clear
>>>>
>>>> This bug may happen on *any* PCI device.
>>>> Analysis for this bug:
>>>>
>>>> The present bit is set in this function:
>>>>
>>>> static struct context_entry * device_to_context_entry(
>>>>                   struct intel_iommu *iommu, u8 bus, u8 devfn)
>>>> {
>>>>       ......
>>>>                   set_root_present(root);
>>>>       ......
>>>> }
>>>>
>>>> Calling tree:
>>>>       device driver
>>>>           intel_alloc_coherent
>>>>               __intel_map_single
>>>>                   domain_context_mapping
>>>>                       domain_context_mapping_one
>>>>                           device_to_context_entry
>>>>
>>>> This means, the present bit in root entry will not be set until the device
>>>> driver is loaded.
>>>>
>>>> But in the kdump kernel, hardware devices are not aware that control has
>>>> transferred to the second kernel, and those drivers must initialize again.
>>>> Consequently there may be unexpected DMA requests from devices activity
>>>> initiated in the first kernel leading to the DMA Remapping errors in the
>>>> second kernel.
>>>>
>>>> To fix this DMAR fault, we need to reset the bus that this device on. Reset
>>>> the device itself does not work.
>
> You have not explained why the DMAR faults are a problem.  The fault
> is just an indication that the IOMMU prevented a DMA from completing.
> If the DMA is an artifact of the crashed kernel, we probably don't
> *want* it to complete, so taking a DMAR fault seems like exactly the
> right thing.
>
> If the problem is that we're being flooded with messages, it's easy
> enough to just tone down the printks.
>
>>>> A patch for this bug that has been sent before:
>>>> https://lkml.org/lkml/2014/9/30/55
>>>> As in discussion, this bug may happen on *any* device, so we need to reset all
>>>> pci devices.
>>>>
>>>> There was an original version(Takao Indoh) that resets the pcie devices:
>>>> https://lkml.org/lkml/2013/5/14/9
>>
>> As far as I can remember, the original patch was nacked by
>> the following reasons:
>>
>> 1) On sparc, the IOMMU is initialized before PCI devices are enumerated,
>>     so there would still be a window where ongoing DMA could cause an
>>     IOMMU error.
>>
>> 2) Basically Bjorn is thinking device reset should be done in the
>>     1st kernel before jumping into 2nd kernel.
>
> If you're referring to this: https://lkml.org/lkml/2013/6/12/16, what
> I said was "It would be at least conceivable to reset the devices ...
> before the kexec."  That's not a requirement to do it in the first
> kernel, just an idea that I thought should be investigated.  And Eric
> has good reasons for *not* doing the reset in the first kernel, so it
> turned out not to be a very good idea.
>
> My fundamental problem with this whole reset thing is that it's a
> sledgehammer approach and it's ugly.  Using the IOMMU seems like a
> much more elegant approach.
>
> So if we are forced to accept the reset solution, I want to at least
> have a concise explanation of why we can't use the IOMMU.
>
In my understanding, the reset is necessary, because no one knows if
there are any other problems if the devices are not reset in the kdump
kernel.  OS must boot with the devices in a clean state.

But this will not prevent us from using Bill's iommu patch. I think the
most important part for Bill's patch is to dump the data, not to fix
the dmar faults, though it also fixed that.

Zhenhua
> The changelog above is perfectly accurate, but it's really not very
> useful because it only explains the code without exploring any of the
> interesting issues.
>
> Bjorn
>
>> And Bill Sumner proposed another idea.
>> http://comments.gmane.org/gmane.linux.kernel.iommu/4828
>> I don't know the current status of this patch, but I think Jerry Hoemann
>> is working on this.
>>
>> Thanks,
>> Takao Indoh
>>
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[1/1] pci: fix dmar fault for kdump kernel

Commit Message

Comments

Patch