Message ID | 1421046388-27925-3-git-send-email-zhen-hual@hp.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On Mon, Jan 12, 2015 at 03:06:20PM +0800, Li, Zhen-Hua wrote: > + > +#ifdef CONFIG_CRASH_DUMP > + > +/* > + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU > + * > + * Fixes the crashdump kernel to deal with an active iommu and legacy > + * DMA from the (old) panicked kernel in a manner similar to how legacy > + * DMA is handled when no hardware iommu was in use by the old kernel -- > + * allow the legacy DMA to continue into its current buffers. > + * > + * In the crashdump kernel, this code: > + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA). > + * 2. Do not re-enable IOMMU's translating. > + * 3. In kdump kernel, use the old root entry table. > + * 4. Leaves the current translations in-place so that legacy DMA will > + * continue to use its current buffers. > + * 5. Allocates to the device drivers in the crashdump kernel > + * portions of the iova address ranges that are different > + * from the iova address ranges that were being used by the old kernel > + * at the time of the panic. > + * > + */ It looks like you are still copying the io-page-tables from the old kernel into the kdump kernel, is that right? With the approach that was proposed you only need to copy over the context entries 1-1. They are still pointing to the page-tables in the old kernels memory (which is just fine). The root-entry of the old kernel is also re-used, and when the kdump kernel starts to use a device, its context entry is updated to point to a newly allocated page-table. Joerg -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 12, 2015 at 04:22:08PM +0100, Joerg Roedel wrote: > On Mon, Jan 12, 2015 at 03:06:20PM +0800, Li, Zhen-Hua wrote: > > + > > +#ifdef CONFIG_CRASH_DUMP > > + > > +/* > > + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU > > + * > > + * Fixes the crashdump kernel to deal with an active iommu and legacy > > + * DMA from the (old) panicked kernel in a manner similar to how legacy > > + * DMA is handled when no hardware iommu was in use by the old kernel -- > > + * allow the legacy DMA to continue into its current buffers. > > + * > > + * In the crashdump kernel, this code: > > + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA). > > + * 2. Do not re-enable IOMMU's translating. > > + * 3. In kdump kernel, use the old root entry table. > > + * 4. Leaves the current translations in-place so that legacy DMA will > > + * continue to use its current buffers. > > + * 5. Allocates to the device drivers in the crashdump kernel > > + * portions of the iova address ranges that are different > > + * from the iova address ranges that were being used by the old kernel > > + * at the time of the panic. > > + * > > + */ > > It looks like you are still copying the io-page-tables from the old > kernel into the kdump kernel, is that right? With the approach that was > proposed you only need to copy over the context entries 1-1. They are > still pointing to the page-tables in the old kernels memory (which is > just fine). Kdump has the notion of backup region. Where certain parts of old kernels memory can be moved to a different location (first 640K on x86 as of now) and new kernel can make use of this memory now. So we will have to just make sure that no parts of this old page table fall into backup region. Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote: > Kdump has the notion of backup region. Where certain parts of old kernels > memory can be moved to a different location (first 640K on x86 as of now) > and new kernel can make use of this memory now. > > So we will have to just make sure that no parts of this old page table > fall into backup region. Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough yet ;) Sadly, your above statement is true for all hardware-accessible data structures in IOMMU code. I think about how we can solve this, is there an easy way to allocate memory that is not in any backup region? Thanks, Joerg -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 12, 2015 at 05:06:46PM +0100, Joerg Roedel wrote: > On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote: > > Kdump has the notion of backup region. Where certain parts of old kernels > > memory can be moved to a different location (first 640K on x86 as of now) > > and new kernel can make use of this memory now. > > > > So we will have to just make sure that no parts of this old page table > > fall into backup region. > > Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough > yet ;) > Sadly, your above statement is true for all hardware-accessible data > structures in IOMMU code. I think about how we can solve this, is there > an easy way to allocate memory that is not in any backup region? Hmm..., there does not seem to be any easy way to do this. In fact, as of now, kernel does not even know where is backup region. All these details are managed by user space completely (except for new kexec_file_load() syscall). That means we are left with ugly options now. - Define per arch kexec backup regions in kernel and export it to user space and let kexec-tools make use of that deinition (instead of defining its own). That way memory allocation code in kernel can look at this backup area and skip it for certain allocations. Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jan 12, 2015 at 11:15:38AM -0500, Vivek Goyal wrote: > On Mon, Jan 12, 2015 at 05:06:46PM +0100, Joerg Roedel wrote: > > On Mon, Jan 12, 2015 at 10:29:19AM -0500, Vivek Goyal wrote: > > > Kdump has the notion of backup region. Where certain parts of old kernels > > > memory can be moved to a different location (first 640K on x86 as of now) > > > and new kernel can make use of this memory now. > > > > > > So we will have to just make sure that no parts of this old page table > > > fall into backup region. > > > > Uuh, looks like the 'iommu-with-kdump-issue' isn't complicated enough > > yet ;) > > Sadly, your above statement is true for all hardware-accessible data > > structures in IOMMU code. I think about how we can solve this, is there > > an easy way to allocate memory that is not in any backup region? > > Hmm..., there does not seem to be any easy way to do this. In fact, as of > now, kernel does not even know where is backup region. All these details are > managed by user space completely (except for new kexec_file_load() syscall). > > That means we are left with ugly options now. > > - Define per arch kexec backup regions in kernel and export it to user > space and let kexec-tools make use of that deinition (instead of > defining its own). That way memory allocation code in kernel can look > at this backup area and skip it for certain allocations. Yes, that makes sense. In fact, I think all allocations for DMA memory need to take this into account to avoid potentially serious data corruption. If any memory for a disk superblock gets allocated in backup memory and a kdump happens, the new kernel might zero out that area and the disk controler then writes the zeroes to disk instead of the superblock. Joerg -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/12/2015 11:22 PM, Joerg Roedel wrote: > On Mon, Jan 12, 2015 at 03:06:20PM +0800, Li, Zhen-Hua wrote: >> + >> +#ifdef CONFIG_CRASH_DUMP >> + >> +/* >> + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU >> + * >> + * Fixes the crashdump kernel to deal with an active iommu and legacy >> + * DMA from the (old) panicked kernel in a manner similar to how legacy >> + * DMA is handled when no hardware iommu was in use by the old kernel -- >> + * allow the legacy DMA to continue into its current buffers. >> + * >> + * In the crashdump kernel, this code: >> + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA). >> + * 2. Do not re-enable IOMMU's translating. >> + * 3. In kdump kernel, use the old root entry table. >> + * 4. Leaves the current translations in-place so that legacy DMA will >> + * continue to use its current buffers. >> + * 5. Allocates to the device drivers in the crashdump kernel >> + * portions of the iova address ranges that are different >> + * from the iova address ranges that were being used by the old kernel >> + * at the time of the panic. >> + * >> + */ > > It looks like you are still copying the io-page-tables from the old > kernel into the kdump kernel, is that right? With the approach that was > proposed you only need to copy over the context entries 1-1. They are > still pointing to the page-tables in the old kernels memory (which is > just fine). > > The root-entry of the old kernel is also re-used, and when the kdump > kernel starts to use a device, its context entry is updated to point to > a newly allocated page-table. > Current status: 1. Use old root entry table, 2. Copy old context entry tables. 3. Copy old page tables. 4. Allocate new page table when device driver is loading. I remember the progress we discussed was this. Checking the old mails, founding we did not clearly discuss about the old page tables, only agrred with allocating new page tables when device driver is loading. To me, both of these two options are fine, copying old page tables, or re-use the old ones. My debugging shows both of them works fine. If we use the old page tables, the patchset will use less code. Functions copy_page_addr, copy_page_table are no longer needed, and functions copy_context_entry and copy_context_entry_table will be reduced to several lines. The patchset will be much simpler. Zhenhua -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 01/12/15 at 10:29am, Vivek Goyal wrote: > On Mon, Jan 12, 2015 at 04:22:08PM +0100, Joerg Roedel wrote: > > It looks like you are still copying the io-page-tables from the old > > kernel into the kdump kernel, is that right? With the approach that was > > proposed you only need to copy over the context entries 1-1. They are > > still pointing to the page-tables in the old kernels memory (which is > > just fine). > > Kdump has the notion of backup region. Where certain parts of old kernels > memory can be moved to a different location (first 640K on x86 as of now) > and new kernel can make use of this memory now. Hi Vivek, About backup region I am a bit confusing. Just say x86, we usually copy it to a backup region. And this first 640K will be used as a usable memory region in 2nd kernel since its content has been copied to backup region. And that backup region is taken from crashkernel reserved memory and not passed to 2nd kernel as usable memory region. Here did you mean the old page table could fall into first 640K memory region or that reserved backup region? Since as I understand the backup region is taken from crashkernel memory which is not used by 1st kernel process. Thanks Baoquan > > So we will have to just make sure that no parts of this old page table > fall into backup region. > > Thanks > Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jan 13, 2015 at 04:12:29PM +0800, Li, ZhenHua wrote: > Current status: > 1. Use old root entry table, > 2. Copy old context entry tables. > 3. Copy old page tables. > 4. Allocate new page table when device driver is loading. > > I remember the progress we discussed was this. Checking the old mails, > founding we did not clearly discuss about the old page tables, only > agrred with allocating new page tables when device driver is loading. > > To me, both of these two options are fine, copying old page tables, or > re-use the old ones. My debugging shows both of them works fine. > > If we use the old page tables, the patchset will use less code. > Functions copy_page_addr, copy_page_table are no longer needed, and > functions copy_context_entry and copy_context_entry_table will be > reduced to several lines. The patchset will be much simpler. Yes, please do it this way. There is no point in copying the old page-tables. We never make changes to them, so they can stay where they are. The problem with the backup region still exists, but it is not specific to IOMMUs. This problem and should be addressed seperatly. Joerg -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c index 8d5c400..a71de3f 100644 --- a/drivers/iommu/intel-iommu.c +++ b/drivers/iommu/intel-iommu.c @@ -40,6 +40,7 @@ #include <linux/pci-ats.h> #include <linux/memblock.h> #include <linux/dma-contiguous.h> +#include <linux/crash_dump.h> #include <asm/irq_remapping.h> #include <asm/cacheflush.h> #include <asm/iommu.h> @@ -208,6 +209,12 @@ get_context_addr_from_root(struct root_entry *root) NULL); } +static inline unsigned long +get_context_phys_from_root(struct root_entry *root) +{ + return root_present(root) ? (root->val & VTD_PAGE_MASK) : 0; +} + /* * low 64 bits: * 0: present @@ -228,6 +235,32 @@ static inline bool context_present(struct context_entry *context) { return (context->lo & 1); } + +static inline int context_fault_enable(struct context_entry *c) +{ + return((c->lo >> 1) & 0x1); +} + +static inline int context_translation_type(struct context_entry *c) +{ + return((c->lo >> 2) & 0x3); +} + +static inline u64 context_address_root(struct context_entry *c) +{ + return((c->lo >> VTD_PAGE_SHIFT)); +} + +static inline int context_address_width(struct context_entry *c) +{ + return((c->hi >> 0) & 0x7); +} + +static inline int context_domain_id(struct context_entry *c) +{ + return((c->hi >> 8) & 0xffff); +} + static inline void context_set_present(struct context_entry *context) { context->lo |= 1; @@ -313,6 +346,43 @@ static inline int first_pte_in_page(struct dma_pte *pte) return !((unsigned long)pte & ~VTD_PAGE_MASK); } + +#ifdef CONFIG_CRASH_DUMP + +/* + * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU + * + * Fixes the crashdump kernel to deal with an active iommu and legacy + * DMA from the (old) panicked kernel in a manner similar to how legacy + * DMA is handled when no hardware iommu was in use by the old kernel -- + * allow the legacy DMA to continue into its current buffers. + * + * In the crashdump kernel, this code: + * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA). + * 2. Do not re-enable IOMMU's translating. + * 3. In kdump kernel, use the old root entry table. + * 4. Leaves the current translations in-place so that legacy DMA will + * continue to use its current buffers. + * 5. Allocates to the device drivers in the crashdump kernel + * portions of the iova address ranges that are different + * from the iova address ranges that were being used by the old kernel + * at the time of the panic. + * + */ + +struct domain_values_entry { + struct list_head link; /* link entries into a list */ + struct iova_domain iovad; /* iova's that belong to this domain */ + struct dma_pte *pgd; /* virtual address */ + int did; /* domain id */ + int gaw; /* max guest address width */ + int iommu_superpage; /* Level of superpages supported: + 0 == 4KiB (no superpages), 1 == 2MiB, + 2 == 1GiB, 3 == 512GiB, 4 == 1TiB */ +}; + +#endif /* CONFIG_CRASH_DUMP */ + /* * This domain is a statically identity mapping domain. * 1. This domain creats a static 1:1 mapping to all usable memory.