
[v8,02/10] iommu/vt-d: Items required for kdump

Message ID 1421046388-27925-3-git-send-email-zhen-hual@hp.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Headers show

Commit Message

Li, Zhen-Hua Jan. 12, 2015, 7:06 a.m. UTC
Add the structure type domain_values_entry used for kdump;
add the context entry functions needed for kdump.

Bill Sumner:
    Original version;

Li, Zhenhua:
    Changed the names of the new functions to make them consistent with
    the current context get/set functions.

Signed-off-by: Bill Sumner <billsumnerlinux@gmail.com>
Signed-off-by: Li, Zhen-Hua <zhen-hual@hp.com>
---
 drivers/iommu/intel-iommu.c | 70 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

Comments

Joerg Roedel Jan. 12, 2015, 3:22 p.m. UTC | #1
On Mon, Jan 12, 2015 at 03:06:20PM +0800, Li, Zhen-Hua wrote:
> +
> +#ifdef CONFIG_CRASH_DUMP
> +
> +/*
> + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
> + *
> + * Fixes the crashdump kernel to deal with an active iommu and legacy
> + * DMA from the (old) panicked kernel in a manner similar to how legacy
> + * DMA is handled when no hardware iommu was in use by the old kernel --
> + * allow the legacy DMA to continue into its current buffers.
> + *
> + * In the crashdump kernel, this code:
> + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA).
> + * 2. Do not re-enable IOMMU's translating.
> + * 3. In kdump kernel, use the old root entry table.
> + * 4. Leaves the current translations in-place so that legacy DMA will
> + *    continue to use its current buffers.
> + * 5. Allocates to the device drivers in the crashdump kernel
> + *    portions of the iova address ranges that are different
> + *    from the iova address ranges that were being used by the old kernel
> + *    at the time of the panic.
> + *
> + */

It looks like you are still copying the io-page-tables from the old
kernel into the kdump kernel, is that right? With the approach that was
proposed you only need to copy over the context entries 1-1. They are
still pointing to the page-tables in the old kernel's memory (which is
just fine).

The root-entry of the old kernel is also re-used, and when the kdump
kernel starts to use a device, its context entry is updated to point to
a newly allocated page-table.
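The proposal above can be sketched as a minimal userspace model (the structure layout, masks, and function names here are illustrative assumptions, not the actual intel-iommu definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model: a context entry is two 64-bit words; bit 0 of the
 * low word is the present bit and bits 12+ hold the page-table address. */
struct context_entry { uint64_t lo, hi; };

#define CTX_PRESENT   1ULL
#define CTX_ADDR_MASK (~0xfffULL)

/* Copy the old kernel's context entries 1-1: the copied entries still
 * point at the page tables left in old-kernel memory. */
static void copy_context_entries(struct context_entry *new_tbl,
				 const struct context_entry *old_tbl,
				 int n)
{
	for (int i = 0; i < n; i++)
		new_tbl[i] = old_tbl[i];
}

/* Only when the kdump kernel starts to use a device is its entry
 * rewritten to point at a newly allocated page table. */
static void context_set_new_pgtable(struct context_entry *ce, uint64_t pa)
{
	ce->lo = (pa & CTX_ADDR_MASK) | CTX_PRESENT;
}
```

Until the rewrite happens, in-flight legacy DMA keeps resolving through the old tables, which is exactly the point of the scheme.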


	Joerg

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vivek Goyal Jan. 12, 2015, 3:29 p.m. UTC | #2
On Mon, Jan 12, 2015 at 04:22:08PM +0100, Joerg Roedel wrote:
> On Mon, Jan 12, 2015 at 03:06:20PM +0800, Li, Zhen-Hua wrote:
> > +
> > +#ifdef CONFIG_CRASH_DUMP
> > +
> > +/*
> > + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
> > + *
> > + * Fixes the crashdump kernel to deal with an active iommu and legacy
> > + * DMA from the (old) panicked kernel in a manner similar to how legacy
> > + * DMA is handled when no hardware iommu was in use by the old kernel --
> > + * allow the legacy DMA to continue into its current buffers.
> > + *
> > + * In the crashdump kernel, this code:
> > + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA).
> > + * 2. Do not re-enable IOMMU's translating.
> > + * 3. In kdump kernel, use the old root entry table.
> > + * 4. Leaves the current translations in-place so that legacy DMA will
> > + *    continue to use its current buffers.
> > + * 5. Allocates to the device drivers in the crashdump kernel
> > + *    portions of the iova address ranges that are different
> > + *    from the iova address ranges that were being used by the old kernel
> > + *    at the time of the panic.
> > + *
> > + */
> 
> It looks like you are still copying the io-page-tables from the old
> kernel into the kdump kernel, is that right? With the approach that was
> proposed you only need to copy over the context entries 1-1. They are
> still pointing to the page-tables in the old kernel's memory (which is
> just fine).

Kdump has the notion of a backup region, where certain parts of the old
kernel's memory can be moved to a different location (the first 640K on
x86 as of now) so that the new kernel can make use of that memory.

So we will just have to make sure that no part of this old page table
falls into the backup region.

Thanks
Vivek
Joerg Roedel Jan. 12, 2015, 4:06 p.m. UTC | #3
On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote:
> Kdump has the notion of backup region. Where certain parts of old kernels
> memory can be moved to a different location (first 640K on x86 as of now)
> and new kernel can make use of this memory now.
> 
> So we will have to just make sure that no parts of this old page table
> fall into backup region.

Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough
yet ;)
Sadly, your above statement is true for all hardware-accessible data
structures in IOMMU code. I am thinking about how we can solve this; is
there an easy way to allocate memory that is not in any backup region?

Thanks,

	Joerg

Vivek Goyal Jan. 12, 2015, 4:15 p.m. UTC | #4
On Mon, Jan 12, 2015 at 05:06:46PM +0100, Joerg Roedel wrote:
> On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote:
> > Kdump has the notion of backup region. Where certain parts of old kernels
> > memory can be moved to a different location (first 640K on x86 as of now)
> > and new kernel can make use of this memory now.
> > 
> > So we will have to just make sure that no parts of this old page table
> > fall into backup region.
> 
> Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough
> yet ;)
> Sadly, your above statement is true for all hardware-accessible data
> structures in IOMMU code. I think about how we can solve this, is there
> an easy way to allocate memory that is not in any backup region?

Hmm..., there does not seem to be any easy way to do this. In fact, as
of now, the kernel does not even know where the backup region is. All
these details are managed completely by user space (except for the new
kexec_file_load() syscall).

That means we are left with ugly options now.

- Define per-arch kexec backup regions in the kernel and export them to
  user space, and let kexec-tools make use of that definition (instead
  of defining its own). That way, memory allocation code in the kernel
  can look at this backup area and skip it for certain allocations.
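The "skip the backup area" idea reduces to an interval-overlap test against an exported region descriptor. A rough sketch, assuming a hypothetical per-arch descriptor (there is no such kernel API today, as noted above):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-arch backup region descriptor, inclusive bounds. */
struct backup_region { uint64_t start, end; };

/* x86 example: the first 640K is what gets backed up as of now. */
static const struct backup_region x86_backup = { 0x0, 0x9ffff };

/* An allocator could reject (and retry elsewhere) any candidate
 * physical range that intersects the backup region. */
static int range_hits_backup(uint64_t pa, uint64_t len,
			     const struct backup_region *r)
{
	uint64_t last = pa + len - 1;

	return pa <= r->end && last >= r->start;
}
```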

Thanks
Vivek
Joerg Roedel Jan. 12, 2015, 4:48 p.m. UTC | #5
On Mon, Jan 12, 2015 at 11:15:38AM -0500, Vivek Goyal wrote:
> On Mon, Jan 12, 2015 at 05:06:46PM +0100, Joerg Roedel wrote:
> > On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote:
> > > Kdump has the notion of backup region. Where certain parts of old kernels
> > > memory can be moved to a different location (first 640K on x86 as of now)
> > > and new kernel can make use of this memory now.
> > > 
> > > So we will have to just make sure that no parts of this old page table
> > > fall into backup region.
> > 
> > Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough
> > yet ;)
> > Sadly, your above statement is true for all hardware-accessible data
> > structures in IOMMU code. I think about how we can solve this, is there
> > an easy way to allocate memory that is not in any backup region?
> 
> Hmm..., there does not seem to be any easy way to do this. In fact, as of
> now, kernel does not even know where is backup region. All these details are
> managed by user space completely (except for new kexec_file_load() syscall).
> 
> That means we are left with ugly options now.
> 
> - Define per arch kexec backup regions in kernel and export it to user
>   space and let kexec-tools make use of that definition (instead of
>   defining its own). That way memory allocation code in kernel can look
>   at this backup area and skip it for certain allocations.

Yes, that makes sense. In fact, I think all allocations for DMA memory
need to take this into account to avoid potentially serious data
corruption.
If any memory for a disk superblock gets allocated in backup memory and
a kdump happens, the new kernel might zero out that area and the disk
controller then writes the zeroes to disk instead of the superblock.


	Joerg

Li, Zhen-Hua Jan. 13, 2015, 8:12 a.m. UTC | #6
On 01/12/2015 11:22 PM, Joerg Roedel wrote:
> On Mon, Jan 12, 2015 at 03:06:20PM +0800, Li, Zhen-Hua wrote:
>> +
>> +#ifdef CONFIG_CRASH_DUMP
>> +
>> +/*
>> + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
>> + *
>> + * Fixes the crashdump kernel to deal with an active iommu and legacy
>> + * DMA from the (old) panicked kernel in a manner similar to how legacy
>> + * DMA is handled when no hardware iommu was in use by the old kernel --
>> + * allow the legacy DMA to continue into its current buffers.
>> + *
>> + * In the crashdump kernel, this code:
>> + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA).
>> + * 2. Do not re-enable IOMMU's translating.
>> + * 3. In kdump kernel, use the old root entry table.
>> + * 4. Leaves the current translations in-place so that legacy DMA will
>> + *    continue to use its current buffers.
>> + * 5. Allocates to the device drivers in the crashdump kernel
>> + *    portions of the iova address ranges that are different
>> + *    from the iova address ranges that were being used by the old kernel
>> + *    at the time of the panic.
>> + *
>> + */
>
> It looks like you are still copying the io-page-tables from the old
> kernel into the kdump kernel, is that right? With the approach that was
> proposed you only need to copy over the context entries 1-1. They are
> still pointing to the page-tables in the old kernel's memory (which is
> just fine).
>
> The root-entry of the old kernel is also re-used, and when the kdump
> kernel starts to use a device, its context entry is updated to point to
> a newly allocated page-table.
>

Current status:
1. Use the old root entry table.
2. Copy the old context entry tables.
3. Copy the old page tables.
4. Allocate new page tables when a device driver is loading.

I remember this is the approach we discussed. Checking the old mails,
I found we did not clearly discuss the old page tables; we only agreed
on allocating new page tables when a device driver is loading.

To me, both of these options are fine: copying the old page tables, or
reusing the old ones. My debugging shows both of them work fine.

If we reuse the old page tables, the patchset will need less code. The
functions copy_page_addr and copy_page_table are no longer needed, and
copy_context_entry and copy_context_entry_table will be reduced to
several lines. The patchset will be much simpler.
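With the old page tables reused in place, the copy step collapses to duplicating the context entry itself; a hedged userspace sketch (illustrative types, not the patchset's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative two-word context entry, as in the VT-d layout. */
struct context_entry { uint64_t lo, hi; };

/* Reusing the old page tables: copy_context_entry reduces to a plain
 * 128-bit copy, since the address field inside the entry may keep
 * pointing into old-kernel memory. */
static void copy_context_entry(struct context_entry *dst,
			       const struct context_entry *src)
{
	*dst = *src;
}
```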

Zhenhua


Baoquan He Jan. 13, 2015, 11:41 a.m. UTC | #7
On 01/12/15 at 10:29am, Vivek Goyal wrote:
> On Mon, Jan 12, 2015 at 04:22:08PM +0100, Joerg Roedel wrote:
> > It looks like you are still copying the io-page-tables from the old
> > kernel into the kdump kernel, is that right? With the approach that was
> > proposed you only need to copy over the context entries 1-1. They are
> > still pointing to the page-tables in the old kernel's memory (which is
> > just fine).
> 
> Kdump has the notion of backup region. Where certain parts of old kernels
> memory can be moved to a different location (first 640K on x86 as of now)
> and new kernel can make use of this memory now.

Hi Vivek,

About the backup region I am a bit confused. Take x86: we usually copy
the first 640K to a backup region, and this first 640K will then be
used as a usable memory region in the 2nd kernel, since its content has
been copied to the backup region. That backup region is taken from the
crashkernel reserved memory and is not passed to the 2nd kernel as a
usable memory region.

Did you mean the old page table could fall into the first 640K memory
region, or into that reserved backup region? As I understand it, the
backup region is taken from crashkernel memory, which is not used by
1st-kernel processes.

Thanks
Baoquan

> 
> So we will have to just make sure that no parts of this old page table
> fall into backup region.
> 
> Thanks
> Vivek
Joerg Roedel Jan. 13, 2015, 11:52 a.m. UTC | #8
On Tue, Jan 13, 2015 at 04:12:29PM +0800, Li, ZhenHua wrote:

> Current status:
> 1. Use old root entry table,
> 2. Copy old context entry tables.
> 3. Copy old page tables.
> 4. Allocate new page table when device driver is loading.
> 
> I remember this is the approach we discussed. Checking the old mails,
> I found we did not clearly discuss the old page tables; we only agreed
> on allocating new page tables when a device driver is loading.
> 
> To me, both of these options are fine: copying the old page tables, or
> reusing the old ones. My debugging shows both of them work fine.
> 
> If we use the old page tables, the patchset will use less code.
> Functions copy_page_addr, copy_page_table are no longer needed, and
> functions copy_context_entry and copy_context_entry_table will be
> reduced to several lines. The patchset will be much simpler.

Yes, please do it this way. There is no point in copying the old
page-tables. We never make changes to them, so they can stay where they
are.

The problem with the backup region still exists, but it is not
specific to IOMMUs and should be addressed separately.


	Joerg


Patch

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 8d5c400..a71de3f 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -40,6 +40,7 @@ 
 #include <linux/pci-ats.h>
 #include <linux/memblock.h>
 #include <linux/dma-contiguous.h>
+#include <linux/crash_dump.h>
 #include <asm/irq_remapping.h>
 #include <asm/cacheflush.h>
 #include <asm/iommu.h>
@@ -208,6 +209,12 @@  get_context_addr_from_root(struct root_entry *root)
 		NULL);
 }
 
+static inline unsigned long
+get_context_phys_from_root(struct root_entry *root)
+{
+	return  root_present(root) ? (root->val & VTD_PAGE_MASK) : 0;
+}
+
 /*
  * low 64 bits:
  * 0: present
@@ -228,6 +235,32 @@  static inline bool context_present(struct context_entry *context)
 {
 	return (context->lo & 1);
 }
+
+static inline int context_fault_enable(struct context_entry *c)
+{
+	return((c->lo >> 1) & 0x1);
+}
+
+static inline int context_translation_type(struct context_entry *c)
+{
+	return((c->lo >> 2) & 0x3);
+}
+
+static inline u64 context_address_root(struct context_entry *c)
+{
+	return((c->lo >> VTD_PAGE_SHIFT));
+}
+
+static inline int context_address_width(struct context_entry *c)
+{
+	return((c->hi >> 0) & 0x7);
+}
+
+static inline int context_domain_id(struct context_entry *c)
+{
+	return((c->hi >> 8) & 0xffff);
+}
+
 static inline void context_set_present(struct context_entry *context)
 {
 	context->lo |= 1;
@@ -313,6 +346,43 @@  static inline int first_pte_in_page(struct dma_pte *pte)
 	return !((unsigned long)pte & ~VTD_PAGE_MASK);
 }
 
+
+#ifdef CONFIG_CRASH_DUMP
+
+/*
+ * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
+ *
+ * Fixes the crashdump kernel to deal with an active iommu and legacy
+ * DMA from the (old) panicked kernel in a manner similar to how legacy
+ * DMA is handled when no hardware iommu was in use by the old kernel --
+ * allow the legacy DMA to continue into its current buffers.
+ *
+ * In the crashdump kernel, this code:
+ * 1. Skips disabling the IOMMU's translation of IO Virtual Addresses (IOVA).
+ * 2. Does not re-enable the IOMMU's translation.
+ * 3. Uses the old root entry table in the kdump kernel.
+ * 4. Leaves the current translations in-place so that legacy DMA will
+ *    continue to use its current buffers.
+ * 5. Allocates to the device drivers in the crashdump kernel
+ *    portions of the iova address ranges that are different
+ *    from the iova address ranges that were being used by the old kernel
+ *    at the time of the panic.
+ *
+ */
+
+struct domain_values_entry {
+	struct list_head link;		/* link entries into a list */
+	struct iova_domain iovad;	/* iova's that belong to this domain */
+	struct dma_pte	*pgd;		/* virtual address */
+	int    did;			/* domain id */
+	int    gaw;			/* max guest address width */
+	int    iommu_superpage;		/* Level of superpages supported:
+					   0 == 4KiB (no superpages), 1 == 2MiB,
+					   2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
+};
+
+#endif /* CONFIG_CRASH_DUMP */
+
 /*
  * This domain is a statically identity mapping domain.
  *	1. This domain creats a static 1:1 mapping to all usable memory.
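The accessors added by the patch decode fields of a VT-d context entry with plain shifts and masks. A userspace sketch mirroring that logic on a two-word struct (the test values below are made up, not taken from real hardware):

```c
#include <assert.h>
#include <stdint.h>

#define VTD_PAGE_SHIFT 12

struct context_entry { uint64_t lo, hi; };

/* Same shifts and masks as the helpers added by the patch:
 * lo: bit 0 present, bit 1 fault-enable, bits 3:2 translation type,
 *     bits 63:12 address root; hi: bits 2:0 address width,
 *     bits 23:8 domain id. */
static int context_present(const struct context_entry *c)
{
	return c->lo & 1;
}

static int context_fault_enable(const struct context_entry *c)
{
	return (c->lo >> 1) & 0x1;
}

static int context_translation_type(const struct context_entry *c)
{
	return (c->lo >> 2) & 0x3;
}

static uint64_t context_address_root(const struct context_entry *c)
{
	return c->lo >> VTD_PAGE_SHIFT;
}

static int context_address_width(const struct context_entry *c)
{
	return c->hi & 0x7;
}

static int context_domain_id(const struct context_entry *c)
{
	return (c->hi >> 8) & 0xffff;
}
```

This is what lets the kdump kernel walk the copied context entries and recover each device's domain id, address width, and page-table root.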