[v2,08/13] x86/irq: Install posted MSI notification handler

Message ID	20240405223110.1609888-9-jacob.jun.pan@linux.intel.com (mailing list archive)
State	New
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C6B69178CC3; Fri, 5 Apr 2024 22:26:51 +0000 (UTC) From: Jacob Pan <jacob.jun.pan@linux.intel.com> To: LKML <linux-kernel@vger.kernel.org>, X86 Kernel <x86@kernel.org>, Peter Zijlstra <peterz@infradead.org>, iommu@lists.linux.dev, Thomas Gleixner <tglx@linutronix.de>, "Lu Baolu" <baolu.lu@linux.intel.com>, kvm@vger.kernel.org, Dave Hansen <dave.hansen@intel.com>, Joerg Roedel <joro@8bytes.org>, "H. Peter Anvin" <hpa@zytor.com>, "Borislav Petkov" <bp@alien8.de>, "Ingo Molnar" <mingo@redhat.com> Cc: Paul Luse <paul.e.luse@intel.com>, Dan Williams <dan.j.williams@intel.com>, Jens Axboe <axboe@kernel.dk>, Raj Ashok <ashok.raj@intel.com>, "Tian, Kevin" <kevin.tian@intel.com>, maz@kernel.org, seanjc@google.com, "Robin Murphy" <robin.murphy@arm.com>, jim.harris@samsung.com, a.manzanares@samsung.com, "Bjorn Helgaas" <helgaas@kernel.org>, guang.zeng@intel.com, robert.hoo.linux@gmail.com, Jacob Pan <jacob.jun.pan@linux.intel.com> Subject: [PATCH v2 08/13] x86/irq: Install posted MSI notification handler Date: Fri, 5 Apr 2024 15:31:05 -0700 Message-Id: <20240405223110.1609888-9-jacob.jun.pan@linux.intel.com> In-Reply-To: <20240405223110.1609888-1-jacob.jun.pan@linux.intel.com> References: <20240405223110.1609888-1-jacob.jun.pan@linux.intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Coalesced Interrupt Delivery with posted MSI \| expand [v2,00/13] Coalesced Interrupt Delivery with posted MSI [v2,01/13] x86/irq: Move posted interrupt descriptor out of vmx code [v2,02/13] x86/irq: Unionize PID.PIR for 64bit access w/o casting [v2,03/13] x86/irq: Remove bitfields in posted interrupt descriptor [v2,04/13] x86/irq: Add a Kconfig option for posted MSI [v2,05/13] x86/irq: Reserve a per CPU IDT vector for posted MSIs [v2,06/13] x86/irq: Set up per host CPU posted interrupt descriptors [v2,07/13] x86/irq: Factor out calling ISR from common_interrupt [v2,08/13] x86/irq: Install posted MSI notification handler [v2,09/13] x86/irq: Factor out common code for checking pending interrupts [v2,10/13] x86/irq: Extend checks for pending vectors to posted interrupts [v2,11/13] iommu/vt-d: Make posted MSI an opt-in cmdline option [v2,12/13] iommu/vt-d: Add an irq_chip for posted MSIs [v2,13/13] iommu/vt-d: Enable posted mode for device MSIs

Message ID

20240405223110.1609888-9-jacob.jun.pan@linux.intel.com (mailing list archive)

State

New

Headers

From: Jacob Pan <jacob.jun.pan@linux.intel.com>
To: LKML <linux-kernel@vger.kernel.org>,
	X86 Kernel <x86@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	iommu@lists.linux.dev,
	Thomas Gleixner <tglx@linutronix.de>,
	"Lu Baolu" <baolu.lu@linux.intel.com>,
	kvm@vger.kernel.org,
	Dave Hansen <dave.hansen@intel.com>,
	Joerg Roedel <joro@8bytes.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Ingo Molnar" <mingo@redhat.com>
Cc: Paul Luse <paul.e.luse@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Jens Axboe <axboe@kernel.dk>,
	Raj Ashok <ashok.raj@intel.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	maz@kernel.org,
	seanjc@google.com,
	"Robin Murphy" <robin.murphy@arm.com>,
	jim.harris@samsung.com,
	a.manzanares@samsung.com,
	"Bjorn Helgaas" <helgaas@kernel.org>,
	guang.zeng@intel.com,
	robert.hoo.linux@gmail.com,
	Jacob Pan <jacob.jun.pan@linux.intel.com>
Subject: [PATCH v2 08/13] x86/irq: Install posted MSI notification handler
Date: Fri,  5 Apr 2024 15:31:05 -0700
Message-Id: <20240405223110.1609888-9-jacob.jun.pan@linux.intel.com>
In-Reply-To: <20240405223110.1609888-1-jacob.jun.pan@linux.intel.com>
References: <20240405223110.1609888-1-jacob.jun.pan@linux.intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

Coalesced Interrupt Delivery with posted MSI | expand

Commit Message

Jacob Pan April 5, 2024, 10:31 p.m. UTC

All MSI vectors are multiplexed into a single notification vector when
posted MSI is enabled. It is the responsibility of the notification
vector handler to demultiplex MSI vectors. In this handler, for each
pending bit, MSI vector handlers are dispatched without IDT delivery.

For example, the interrupt flow will change as follows:
(3 MSIs of different vectors arrive in a a high frequency burst)

BEFORE:
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()
interrupt(MSI)
    irq_enter()
    handler() /* EOI */
    irq_exit()
        process_softirq()

AFTER:
interrupt /* Posted MSI notification vector */
    irq_enter()
	atomic_xchg(PIR)
	handler()
	handler()
	handler()
	pi_clear_on()
    apic_eoi()
    irq_exit()
        process_softirq()

Except for the leading MSI, CPU notifications are skipped/coalesced.

For MSIs arrive at a low frequency, the demultiplexing loop does not
wait for more interrupts to coalesce. Therefore, there's no additional
latency other than the processing time.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>

---
v2:
   - Delete extra inline attribute
   - Fix pir pointer in xchg (Zeng Guang)
---
 arch/x86/include/asm/hardirq.h  |   3 +
 arch/x86/include/asm/idtentry.h |   3 +
 arch/x86/kernel/idt.c           |   3 +
 arch/x86/kernel/irq.c           | 113 ++++++++++++++++++++++++++++++++
 4 files changed, 122 insertions(+)

Comments

Tian, Kevin April 11, 2024, 7:52 a.m. UTC | #1

> From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Sent: Saturday, April 6, 2024 6:31 AM
> 
> +
> +/*
> + * De-multiplexing posted interrupts is on the performance path, the code
> + * below is written to optimize the cache performance based on the
> following
> + * considerations:
> + * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
> + *   accessed by both CPU and IOMMU.
> + * 2.During posted MSI processing, the CPU needs to do 64-bit read and
> xchg
> + *   for checking and clearing posted interrupt request (PIR), a 256 bit field
> + *   within the PID.
> + * 3.On the other side, the IOMMU does atomic swaps of the entire PID
> cache
> + *   line when posting interrupts and setting control bits.
> + * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
> + * 5.Each time the IOMMU does interrupt posting to the PIR will evict the
> PID
> + *   cache line. The cache line states after each operation are as follows:
> + *   CPU		IOMMU			PID Cache line state
> + *   ---------------------------------------------------------------
> + *...read64					exclusive
> + *...lock xchg64				modified
> + *...			post/atomic swap	invalid
> + *...-------------------------------------------------------------
> + *

According to VT-d spec: 5.2.3 Interrupt-Posting Hardware Operation:

"
- Read contents of the Posted Interrupt Descriptor, claiming exclusive
  ownership of its hosting cache-line.
  ...
- Modify the following descriptor field values atomically:
  ...
- Promote the cache-line to be globally observable, so that the modifications
  are visible to other caching agents. Hardware may write-back the cache-line
  anytime after this step.
"

sounds that the PID cache line is not evicted after IOMMU posting?

Thomas Gleixner April 11, 2024, 4:54 p.m. UTC | #2

On Fri, Apr 05 2024 at 15:31, Jacob Pan wrote:
>  
>  #ifdef CONFIG_SMP
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index fc37c8d83daf..f445bec516a0 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -163,6 +163,9 @@ static const __initconst struct idt_data apic_idts[] = {
>  # endif
>  	INTG(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
>  	INTG(ERROR_APIC_VECTOR,			asm_sysvec_error_interrupt),
> +# ifdef CONFIG_X86_POSTED_MSI
> +	INTG(POSTED_MSI_NOTIFICATION_VECTOR,	asm_sysvec_posted_msi_notification),
> +# endif
>  #endif
>  };

Obviously lacks FRED support...

Jacob Pan April 11, 2024, 5:38 p.m. UTC | #3

Hi Kevin,

On Thu, 11 Apr 2024 07:52:28 +0000, "Tian, Kevin" <kevin.tian@intel.com>
wrote:

> > From: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Sent: Saturday, April 6, 2024 6:31 AM
> > 
> > +
> > +/*
> > + * De-multiplexing posted interrupts is on the performance path, the
> > code
> > + * below is written to optimize the cache performance based on the
> > following
> > + * considerations:
> > + * 1.Posted interrupt descriptor (PID) fits in a cache line that is
> > frequently
> > + *   accessed by both CPU and IOMMU.
> > + * 2.During posted MSI processing, the CPU needs to do 64-bit read and
> > xchg
> > + *   for checking and clearing posted interrupt request (PIR), a 256
> > bit field
> > + *   within the PID.
> > + * 3.On the other side, the IOMMU does atomic swaps of the entire PID
> > cache
> > + *   line when posting interrupts and setting control bits.
> > + * 4.The CPU can access the cache line a magnitude faster than the
> > IOMMU.
> > + * 5.Each time the IOMMU does interrupt posting to the PIR will evict
> > the PID
> > + *   cache line. The cache line states after each operation are as
> > follows:
> > + *   CPU		IOMMU			PID Cache line
> > state
> > + *   ---------------------------------------------------------------
> > + *...read64					exclusive
> > + *...lock xchg64				modified
> > + *...			post/atomic swap	invalid
> > + *...-------------------------------------------------------------
> > + *  
> 
> According to VT-d spec: 5.2.3 Interrupt-Posting Hardware Operation:
> 
> "
> - Read contents of the Posted Interrupt Descriptor, claiming exclusive
>   ownership of its hosting cache-line.
>   ...
> - Modify the following descriptor field values atomically:
>   ...
> - Promote the cache-line to be globally observable, so that the
> modifications are visible to other caching agents. Hardware may
> write-back the cache-line anytime after this step.
> "
> 
> sounds that the PID cache line is not evicted after IOMMU posting?
IOMMU follows the same MESI protocol defined in SDM. VT-d spec. also says
"This atomic read-modify-write operation will always snoop processor caches"

So if the PID cache line is in modified state (caused by writing ON bit,
clear PIR, etc.), IOMMU request ownership will evict the cache.


Thanks,

Jacob

Jacob Pan April 11, 2024, 6:29 p.m. UTC | #4

Hi Thomas,

On Thu, 11 Apr 2024 18:54:29 +0200, Thomas Gleixner <tglx@linutronix.de>
wrote:

> On Fri, Apr 05 2024 at 15:31, Jacob Pan wrote:
> >  
> >  #ifdef CONFIG_SMP
> > diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> > index fc37c8d83daf..f445bec516a0 100644
> > --- a/arch/x86/kernel/idt.c
> > +++ b/arch/x86/kernel/idt.c
> > @@ -163,6 +163,9 @@ static const __initconst struct idt_data
> > apic_idts[] = { # endif
> >  	INTG(SPURIOUS_APIC_VECTOR,
> > asm_sysvec_spurious_apic_interrupt), INTG(ERROR_APIC_VECTOR,
> > 		asm_sysvec_error_interrupt), +# ifdef
> > CONFIG_X86_POSTED_MSI
> > +	INTG(POSTED_MSI_NOTIFICATION_VECTOR,
> > asm_sysvec_posted_msi_notification), +# endif
> >  #endif
> >  };  
> 
> Obviously lacks FRED support...
Good point, forgot FRED is merged :)

Will add an entry to entry_fred.c. I would not be able to test performance
though.

Thanks,

Jacob

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 01ea52859462..ee0c05366dec 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@  typedef struct {
 	unsigned int irq_hv_reenlightenment_count;
 	unsigned int hyperv_stimer0_count;
 #endif
+#ifdef CONFIG_X86_POSTED_MSI
+	unsigned int posted_msi_notification_count;
+#endif
 } ____cacheline_aligned irq_cpustat_t;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 749c7411d2f1..5a6fc7593cfc 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -701,6 +701,9 @@  DECLARE_IDTENTRY_SYSVEC(ERROR_APIC_VECTOR,		sysvec_error_interrupt);
 DECLARE_IDTENTRY_SYSVEC(SPURIOUS_APIC_VECTOR,		sysvec_spurious_apic_interrupt);
 DECLARE_IDTENTRY_SYSVEC(LOCAL_TIMER_VECTOR,		sysvec_apic_timer_interrupt);
 DECLARE_IDTENTRY_SYSVEC(X86_PLATFORM_IPI_VECTOR,	sysvec_x86_platform_ipi);
+# ifdef CONFIG_X86_POSTED_MSI
+DECLARE_IDTENTRY_SYSVEC(POSTED_MSI_NOTIFICATION_VECTOR,	sysvec_posted_msi_notification);
+# endif
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index fc37c8d83daf..f445bec516a0 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -163,6 +163,9 @@  static const __initconst struct idt_data apic_idts[] = {
 # endif
 	INTG(SPURIOUS_APIC_VECTOR,		asm_sysvec_spurious_apic_interrupt),
 	INTG(ERROR_APIC_VECTOR,			asm_sysvec_error_interrupt),
+# ifdef CONFIG_X86_POSTED_MSI
+	INTG(POSTED_MSI_NOTIFICATION_VECTOR,	asm_sysvec_posted_msi_notification),
+# endif
 #endif
 };
 
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index c54de9378943..8772b0a3abf6 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -183,6 +183,13 @@  int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+#endif
+#ifdef CONFIG_X86_POSTED_MSI
+	seq_printf(p, "%*s: ", prec, "PMN");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ",
+			   irq_stats(j)->posted_msi_notification_count);
+	seq_puts(p, "  Posted MSI notification event\n");
 #endif
 	return 0;
 }
@@ -361,6 +368,112 @@  void intel_posted_msi_init(void)
 	destination = x2apic_enabled() ? apic_id : apic_id << 8;
 	this_cpu_write(posted_interrupt_desc.ndst, destination);
 }
+
+/*
+ * De-multiplexing posted interrupts is on the performance path, the code
+ * below is written to optimize the cache performance based on the following
+ * considerations:
+ * 1.Posted interrupt descriptor (PID) fits in a cache line that is frequently
+ *   accessed by both CPU and IOMMU.
+ * 2.During posted MSI processing, the CPU needs to do 64-bit read and xchg
+ *   for checking and clearing posted interrupt request (PIR), a 256 bit field
+ *   within the PID.
+ * 3.On the other side, the IOMMU does atomic swaps of the entire PID cache
+ *   line when posting interrupts and setting control bits.
+ * 4.The CPU can access the cache line a magnitude faster than the IOMMU.
+ * 5.Each time the IOMMU does interrupt posting to the PIR will evict the PID
+ *   cache line. The cache line states after each operation are as follows:
+ *   CPU		IOMMU			PID Cache line state
+ *   ---------------------------------------------------------------
+ *...read64					exclusive
+ *...lock xchg64				modified
+ *...			post/atomic swap	invalid
+ *...-------------------------------------------------------------
+ *
+ * To reduce L1 data cache miss, it is important to avoid contention with
+ * IOMMU's interrupt posting/atomic swap. Therefore, a copy of PIR is used
+ * to dispatch interrupt handlers.
+ *
+ * In addition, the code is trying to keep the cache line state consistent
+ * as much as possible. e.g. when making a copy and clearing the PIR
+ * (assuming non-zero PIR bits are present in the entire PIR), it does:
+ *		read, read, read, read, xchg, xchg, xchg, xchg
+ * instead of:
+ *		read, xchg, read, xchg, read, xchg, read, xchg
+ */
+static __always_inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
+{
+	int i, vec = FIRST_EXTERNAL_VECTOR;
+	unsigned long pir_copy[4];
+	bool handled = false;
+
+	for (i = 0; i < 4; i++)
+		pir_copy[i] = pir[i];
+
+	for (i = 0; i < 4; i++) {
+		if (!pir_copy[i])
+			continue;
+
+		pir_copy[i] = arch_xchg(&pir[i], 0);
+		handled = true;
+	}
+
+	if (handled) {
+		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
+			call_irq_handler(vec, regs);
+	}
+
+	return handled;
+}
+
+/*
+ * Performance data shows that 3 is good enough to harvest 90+% of the benefit
+ * on high IRQ rate workload.
+ */
+#define MAX_POSTED_MSI_COALESCING_LOOP 3
+
+/*
+ * For MSIs that are delivered as posted interrupts, the CPU notifications
+ * can be coalesced if the MSIs arrive in high frequency bursts.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
+{
+	struct pt_regs *old_regs = set_irq_regs(regs);
+	struct pi_desc *pid;
+	int i = 0;
+
+	pid = this_cpu_ptr(&posted_interrupt_desc);
+
+	inc_irq_stat(posted_msi_notification_count);
+	irq_enter();
+
+	/*
+	 * Max coalescing count includes the extra round of handle_pending_pir
+	 * after clearing the outstanding notification bit. Hence, at most
+	 * MAX_POSTED_MSI_COALESCING_LOOP - 1 loops are executed here.
+	 */
+	while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
+		if (!handle_pending_pir(pid->pir64, regs))
+			break;
+	}
+
+	/*
+	 * Clear outstanding notification bit to allow new IRQ notifications,
+	 * do this last to maximize the window of interrupt coalescing.
+	 */
+	pi_clear_on(pid);
+
+	/*
+	 * There could be a race of PI notification and the clearing of ON bit,
+	 * process PIR bits one last time such that handling the new interrupts
+	 * are not delayed until the next IRQ.
+	 */
+	handle_pending_pir(pid->pir64, regs);
+
+	apic_eoi();
+	irq_exit();
+	set_irq_regs(old_regs);
+}
 #endif /* X86_POSTED_MSI */
 
 #ifdef CONFIG_HOTPLUG_CPU

[v2,08/13] x86/irq: Install posted MSI notification handler

Commit Message

Comments

Patch