ACPI/NFIT: Add no_deepflush param to dynamic control flush operation

Message ID 20220629083118.8737-1-dennis.wu@intel.com (mailing list archive)
State: New, archived
Series: ACPI/NFIT: Add no_deepflush param to dynamic control flush operation

Commit Message

dennis.wu June 29, 2022, 8:31 a.m. UTC
Reason: In the current BTT implementation, deep flush is always
used, and deep flush is very expensive. Customers who already know
that ADR protects the WPQ data in the memory controller have no
need to call deep flush: BTT without deep flush improves
performance by 300%~600% across different FIO jobs.

How: Add a "no_deepflush" module parameter to nfit. With
"modprobe nfit no_deepflush=1", customers get higher performance
at the cost of strict data security. Before reloading nfit, you
may need to run "ndctl disable-region".

Next: In BTT, use a flag to control whether data is written with
deep flush in the "no_deepflush=0" case.
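
For example, assuming nfit is built as a loadable module (if it is
built in, the parameter has to go on the kernel command line
instead), the switch could look roughly like:

  ndctl disable-region all
  modprobe -r nfit
  modprobe nfit no_deepflush=1
  ndctl enable-region all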

Signed-off-by: Dennis.Wu <dennis.wu@intel.com>
---
 drivers/acpi/nfit/core.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig June 29, 2022, 3:27 p.m. UTC | #1
On Wed, Jun 29, 2022 at 04:31:18PM +0800, Dennis.Wu wrote:
> Reason: In the current BTT implementation, deep flush is always
> used, and deep flush is very expensive. Customers who already know
> that ADR protects the WPQ data in the memory controller have no
> need to call deep flush: BTT without deep flush improves
> performance by 300%~600% across different FIO jobs.
> 
> How: Add a "no_deepflush" module parameter to nfit. With
> "modprobe nfit no_deepflush=1", customers get higher performance
> at the cost of strict data security. Before reloading nfit, you
> may need to run "ndctl disable-region".

This goes back to my question from years ago:  why do we ever
do this deep flush in the Linux nvdimm stack to start with?
Dan Williams July 13, 2022, 1:25 a.m. UTC | #2
Christoph Hellwig wrote:
> On Wed, Jun 29, 2022 at 04:31:18PM +0800, Dennis.Wu wrote:
> > Reason: In the current BTT implementation, deep flush is always
> > used, and deep flush is very expensive. Customers who already know
> > that ADR protects the WPQ data in the memory controller have no
> > need to call deep flush: BTT without deep flush improves
> > performance by 300%~600% across different FIO jobs.
> > 
> > How: Add a "no_deepflush" module parameter to nfit. With
> > "modprobe nfit no_deepflush=1", customers get higher performance
> > at the cost of strict data security. Before reloading nfit, you
> > may need to run "ndctl disable-region".
> 
> This goes back to my question from years ago:  why do we ever
> do this deep flush in the Linux nvdimm stack to start with?

The rationale is to push the data to a smaller failure domain, similar
to flushing disk write-caches. Otherwise, if you trust your memory
power supplies like you trust your disks then just rely on them to take
care of the data.

Otherwise, the kernel should default to taking as much care as
possible and pushing writes to the smallest failure domain.
Christoph Hellwig July 20, 2022, 6:24 a.m. UTC | #3
On Tue, Jul 12, 2022 at 06:25:30PM -0700, Dan Williams wrote:
> > This goes back to my question from years ago:  why do we ever
> > do this deep flush in the Linux nvdimm stack to start with?
> 
> The rationale is to push the data to a smaller failure domain, similar
> to flushing disk write-caches.

Flushing disk caches is not about a smaller failure domain.  Flushing
disk caches is about making data durable _at all_.

> Otherwise, if you trust your memory power
> supplies like you trust your disks then just rely on them to take care
> of the data.

Well, all the benchmarketing schemes around pmem seem to
trust it.  Why would kernel block I/O be different from device dax,
MAP_SYNC?

> Otherwise, the kernel should default to taking as much care as
> possible and pushing writes to the smallest failure domain.

In which case we need to remove the device dax direct map and MAP_SYNC.
Reducing the failure domain is not what fsync or REQ_OP_FLUSH are
about; they are about making changes durable.  How durable is up to
your device implementation.  But if you trust it only a little, you
should not offer that halfway option to start with.
dennis.wu Sept. 20, 2022, 3:08 a.m. UTC | #4
Hi, Dan,
Will we add this patch to some new kernel version?

Thank you a lot!
Dennis Wu

Christoph Hellwig Sept. 20, 2022, 11:46 a.m. UTC | #5
On Tue, Sep 20, 2022 at 03:08:03AM +0000, Wu, Dennis wrote:
> Hi, Dan,
> Will we add this patch to some new kernel version?

How about addressing my comments and explaining what the exact
data integrity model is here first?
Dan Williams Sept. 20, 2022, 7:30 p.m. UTC | #6
Christoph Hellwig wrote:
[..]
> > Otherwise, the kernel should default to taking as much care as
> > possible and pushing writes to the smallest failure domain.
> 
> In which case we need to remove the device dax direct map and MAP_SYNC.
> Reducing the failure domain is not what fsync or REQ_OP_FLUSH are
> about; they are about making changes durable.  How durable is up to
> your device implementation.  But if you trust it only a little, you
> should not offer that halfway option to start with.

That's a good point, but (with my Linux kernel developer hat on) I would
flip it around and make this the platform vendor's problem. If the
platform vendor has validated ADR* and that platform power supplies
maintain stable power in a power-loss scenario, then 'deepflush' is a
complete nop, so why publish a flush mechanism?

In other words, unless the platform vendor has no confidence in the
standard durability model (persistence / durability at global visibility
outside the CPU cache), it should skip publishing these flush hints in
the ACPI NFIT table.

The recourse for an end user whose vendor has published this mechanism
in error is to talk to their BIOS vendor to turn off the flush
capability, or use the ACPI table override mechanism to edit out the
flush capability.

I will also note that CXL has done away with this software flush concept
and defines a standard Global Persistent Flush (GPF) mechanism in the
protocol that fires at impending power-loss events.

* ADR: Asynchronous DRAM refresh, a platform signal to flush write
  buffers in the device upon detection of power-loss.
dennis.wu Oct. 20, 2022, 6:23 a.m. UTC | #7
Hi Dan,

ADR is a platform capability that is supported by default on Intel platforms. (eADR is the extended capability that keeps the CPU cache in the persistent domain; it does not have good support from PRC customers.)

So far, customers using PMem do not consider deep flush, since 100% data security does not rely on a single server alone; it also leverages software (CRC, checksums) and the distributed system.
Deep flush is the smallest persistent domain and can assure that data is persisted in any situation, which might be useful for some financial scenarios (we have not seen this in the PRC market so far).
That's why an option lets the user pick the data-security level.

GPF, as I understand it, still leverages ADR from the original design; it is still a platform capability and still has a chance to fail.

Thanks a lot and best regards,
Dennis Wu 


Patch

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index e5d7f2bda13f..ec0ad48b0283 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -52,6 +52,10 @@  static bool force_labels;
 module_param(force_labels, bool, 0444);
 MODULE_PARM_DESC(force_labels, "Opt-in to labels despite missing methods");
 
+static bool no_deepflush;
+module_param(no_deepflush, bool, 0644);
+MODULE_PARM_DESC(no_deepflush, "skip deep flush if ADR or no strict security");
+
 LIST_HEAD(acpi_descs);
 DEFINE_MUTEX(acpi_desc_lock);
 
@@ -981,8 +985,10 @@  static void *add_table(struct acpi_nfit_desc *acpi_desc,
 			return err;
 		break;
 	case ACPI_NFIT_TYPE_FLUSH_ADDRESS:
-		if (!add_flush(acpi_desc, prev, table))
-			return err;
+		if (!no_deepflush) {
+			if (!add_flush(acpi_desc, prev, table))
+				return err;
+		}
 		break;
 	case ACPI_NFIT_TYPE_SMBIOS:
 		dev_dbg(dev, "smbios\n");
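
For context, the table entry whose handling no_deepflush=1 skips here,
ACPI_NFIT_TYPE_FLUSH_ADDRESS, is the NFIT "Flush Hint Address Structure"
from the ACPI specification. The kernel's ACPICA headers describe it
along these lines (field names reproduced from memory; treat this as a
sketch, not the authoritative definition):

  /* NFIT entry type 6: Flush Hint Address Structure */
  struct acpi_nfit_flush_address {
          struct acpi_nfit_header header; /* entry type (6) and length */
          u32 device_handle;              /* NFIT device handle these hints belong to */
          u16 hint_count;                 /* number of flush hint addresses that follow */
          u8 reserved[6];
          u64 hint_address[];             /* one write-to-flush address per WPQ flush hint */
  };

Normally add_flush() records the hint addresses for the region, and the
nvdimm deep-flush path later writes to them; with no_deepflush=1 the
entries are never recorded, so regions come up without flush hints and
the deep-flush step has nothing to write.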