Message ID | 149161025237.38725.13508986873214668503.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
---|---
State | New, archived
> Subject: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-
> bypass assumptions
>
> Before we rework the "pmem api" to stop abusing __copy_user_nocache()
> for memcpy_to_pmem() we need to fix cases where we may strand dirty data
> in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
> for arbitrary data transfers from userspace. There is no guarantee that
> these transfers, performed by dax_iomap_actor(), will have aligned
> destinations or aligned transfer lengths. Backstop the usage of
> __copy_user_nocache() with explicit cache management in these unaligned
> cases.
>
> Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
> that is saved for a later patch that moves the entirety of the "pmem
> api" into the pmem driver directly.
 :
> ---
> v2: Change the condition for flushing the last cacheline of the
>     destination from 8-byte to 4-byte misalignment (Toshi)
 :
>  arch/x86/include/asm/pmem.h |   41 ++++++++++++++++++++++++++++++----
 :
> @@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void
> *addr, size_t bytes,
>  	/* TODO: skip the write-back by always using non-temporal stores */
>  	len = copy_from_iter_nocache(addr, bytes, i);
>
> -	if (__iter_needs_pmem_wb(i))
> +	/*
> +	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
> +	 * non-temporal stores for the bulk of the transfer, but we need
> +	 * to manually flush if the transfer is unaligned. In the
> +	 * non-iovec case the entire destination needs to be flushed.
> +	 */
> +	if (iter_is_iovec(i)) {
> +		unsigned long dest = (unsigned long) addr;
> +
> +		/*
> +		 * If the destination is not 8-byte aligned then
> +		 * __copy_user_nocache (on x86_64) uses cached copies
> +		 */
> +		if (dest & 8) {
> +			arch_wb_cache_pmem(addr, 1);
> +			dest = ALIGN(dest, 8);
> +		}
> +
> +		/*
> +		 * If the remaining transfer length, after accounting
> +		 * for destination alignment, is not 4-byte aligned
> +		 * then __copy_user_nocache() falls back to cached
> +		 * copies for the trailing bytes in the final cacheline
> +		 * of the transfer.
> +		 */
> +		if ((bytes - (dest - (unsigned long) addr)) & 4)
> +			arch_wb_cache_pmem(addr + bytes - 1, 1);
> +	} else
>  		arch_wb_cache_pmem(addr, bytes);
>
>  	return len;

Thanks for the update. I think the alignment check should be based on
the following note in copy_user_nocache.

 * Note: Cached memory copy is used when destination or size is not
 * naturally aligned. That is:
 * - Require 8-byte alignment when size is 8 bytes or larger.
 * - Require 4-byte alignment when size is 4 bytes.

So, I think the code may be something like this. I also made the
following changes:
 - Mask with 7, not 8.
 - ALIGN with cacheline size, instead of 8.
 - Add (bytes > flushed) test since calculation with unsigned long
   still results in a negative value (as a positive value).

	if (bytes < 8) {
		if ((dest & 3) || (bytes != 4))
			arch_wb_cache_pmem(addr, 1);
	} else {
		if (dest & 7) {
			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
			arch_wb_cache_pmem(addr, 1);
		}

		flushed = dest - (unsigned long) addr;
		if ((bytes > flushed) && ((bytes - flushed) & 7))
			arch_wb_cache_pmem(addr + bytes - 1, 1);
	}

Thanks,
-Toshi
On Mon, Apr 10, 2017 at 11:53 AM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> Subject: [PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-
>> bypass assumptions
>>
>> Before we rework the "pmem api" to stop abusing __copy_user_nocache()
>> for memcpy_to_pmem() we need to fix cases where we may strand dirty data
>> in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
>> for arbitrary data transfers from userspace. There is no guarantee that
>> these transfers, performed by dax_iomap_actor(), will have aligned
>> destinations or aligned transfer lengths. Backstop the usage of
>> __copy_user_nocache() with explicit cache management in these unaligned
>> cases.
>>
>> Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
>> that is saved for a later patch that moves the entirety of the "pmem
>> api" into the pmem driver directly.
> :
>> ---
>> v2: Change the condition for flushing the last cacheline of the
>>     destination from 8-byte to 4-byte misalignment (Toshi)
> :
>>  arch/x86/include/asm/pmem.h |   41 ++++++++++++++++++++++++++++++----
> :
>> @@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void
>> *addr, size_t bytes,
>>  	/* TODO: skip the write-back by always using non-temporal stores */
>>  	len = copy_from_iter_nocache(addr, bytes, i);
>>
>> -	if (__iter_needs_pmem_wb(i))
>> +	/*
>> +	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
>> +	 * non-temporal stores for the bulk of the transfer, but we need
>> +	 * to manually flush if the transfer is unaligned. In the
>> +	 * non-iovec case the entire destination needs to be flushed.
>> +	 */
>> +	if (iter_is_iovec(i)) {
>> +		unsigned long dest = (unsigned long) addr;
>> +
>> +		/*
>> +		 * If the destination is not 8-byte aligned then
>> +		 * __copy_user_nocache (on x86_64) uses cached copies
>> +		 */
>> +		if (dest & 8) {
>> +			arch_wb_cache_pmem(addr, 1);
>> +			dest = ALIGN(dest, 8);
>> +		}
>> +
>> +		/*
>> +		 * If the remaining transfer length, after accounting
>> +		 * for destination alignment, is not 4-byte aligned
>> +		 * then __copy_user_nocache() falls back to cached
>> +		 * copies for the trailing bytes in the final cacheline
>> +		 * of the transfer.
>> +		 */
>> +		if ((bytes - (dest - (unsigned long) addr)) & 4)
>> +			arch_wb_cache_pmem(addr + bytes - 1, 1);
>> +	} else
>>  		arch_wb_cache_pmem(addr, bytes);
>>
>>  	return len;
>
> Thanks for the update. I think the alignment check should be based on
> the following note in copy_user_nocache.
>
>  * Note: Cached memory copy is used when destination or size is not
>  * naturally aligned. That is:
>  * - Require 8-byte alignment when size is 8 bytes or larger.
>  * - Require 4-byte alignment when size is 4 bytes.
>
> So, I think the code may be something like this. I also made the
> following changes:

Thanks!

> - Mask with 7, not 8.

Yes, good catch.

> - ALIGN with cacheline size, instead of 8.
> - Add (bytes > flushed) test since calculation with unsigned long
>   still results in a negative value (as a positive value).
>
> 	if (bytes < 8) {
> 		if ((dest & 3) || (bytes != 4))
> 			arch_wb_cache_pmem(addr, 1);
> 	} else {
> 		if (dest & 7) {
> 			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);

Why align the destination to the next cacheline? As far as I can see
the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
to the next 8-byte boundary.
> > Thanks for the update. I think the alignment check should be based on
> > the following note in copy_user_nocache.
> >
> >  * Note: Cached memory copy is used when destination or size is not
> >  * naturally aligned. That is:
> >  * - Require 8-byte alignment when size is 8 bytes or larger.
> >  * - Require 4-byte alignment when size is 4 bytes.
> >
> > So, I think the code may be something like this. I also made the
> > following changes:
>
> Thanks!
>
> > - Mask with 7, not 8.
>
> Yes, good catch.
>
> > - ALIGN with cacheline size, instead of 8.
> > - Add (bytes > flushed) test since calculation with unsigned long
> >   still results in a negative value (as a positive value).
> >
> > 	if (bytes < 8) {
> > 		if ((dest & 3) || (bytes != 4))
> > 			arch_wb_cache_pmem(addr, 1);
> > 	} else {
> > 		if (dest & 7) {
> > 			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>
> Why align the destination to the next cacheline? As far as I can see
> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
> to the next 8-byte boundary.

The clflush here flushes for the cacheline size. So, we do not need to
flush the same cacheline again when the unaligned tail is in the same
line.

Thanks,
-Toshi
On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> > Thanks for the update. I think the alignment check should be based on
>> > the following note in copy_user_nocache.
>> >
>> >  * Note: Cached memory copy is used when destination or size is not
>> >  * naturally aligned. That is:
>> >  * - Require 8-byte alignment when size is 8 bytes or larger.
>> >  * - Require 4-byte alignment when size is 4 bytes.
>> >
>> > So, I think the code may be something like this. I also made the
>> > following changes:
>>
>> Thanks!
>>
>> > - Mask with 7, not 8.
>>
>> Yes, good catch.
>>
>> > - ALIGN with cacheline size, instead of 8.
>> > - Add (bytes > flushed) test since calculation with unsigned long
>> >   still results in a negative value (as a positive value).
>> >
>> > 	if (bytes < 8) {
>> > 		if ((dest & 3) || (bytes != 4))
>> > 			arch_wb_cache_pmem(addr, 1);
>> > 	} else {
>> > 		if (dest & 7) {
>> > 			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>>
>> Why align the destination to the next cacheline? As far as I can see
>> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
>> to the next 8-byte boundary.
>
> The clflush here flushes for the cacheline size. So, we do not need to
> flush the same cacheline again when the unaligned tail is in the same
> line.

Ok, makes sense. Last question, can't we reduce the check to be:

	if ((bytes > flushed) && ((bytes - flushed) & 3))

...since if 'bytes' was 4-byte aligned we would have performed
non-temporal stores. Can I add your Signed-off-by: on v3?
> On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> >> > Thanks for the update. I think the alignment check should be based on
> >> > the following note in copy_user_nocache.
> >> >
> >> >  * Note: Cached memory copy is used when destination or size is not
> >> >  * naturally aligned. That is:
> >> >  * - Require 8-byte alignment when size is 8 bytes or larger.
> >> >  * - Require 4-byte alignment when size is 4 bytes.
> >> >
> >> > So, I think the code may be something like this. I also made the
> >> > following changes:
> >>
> >> Thanks!
> >>
> >> > - Mask with 7, not 8.
> >>
> >> Yes, good catch.
> >>
> >> > - ALIGN with cacheline size, instead of 8.
> >> > - Add (bytes > flushed) test since calculation with unsigned long
> >> >   still results in a negative value (as a positive value).
> >> >
> >> > 	if (bytes < 8) {
> >> > 		if ((dest & 3) || (bytes != 4))
> >> > 			arch_wb_cache_pmem(addr, 1);
> >> > 	} else {
> >> > 		if (dest & 7) {
> >> > 			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
> >>
> >> Why align the destination to the next cacheline? As far as I can see
> >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
> >> to the next 8-byte boundary.
> >
> > The clflush here flushes for the cacheline size. So, we do not need to
> > flush the same cacheline again when the unaligned tail is in the same
> > line.
>
> Ok, makes sense. Last question, can't we reduce the check to be:
>
> 	if ((bytes > flushed) && ((bytes - flushed) & 3))
>
> ...since if 'bytes' was 4-byte aligned we would have performed
> non-temporal stores.

That is not documented behavior of copy_user_nocache, but as long as
the pmem version of copy_user_nocache follows the same implemented
behavior, yes, that works.

> Can I add your Signed-off-by: on v3?

Sure.

Thanks,
-Toshi
On Mon, Apr 10, 2017 at 2:45 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> On Mon, Apr 10, 2017 at 2:28 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> wrote:
>> >> > Thanks for the update. I think the alignment check should be based on
>> >> > the following note in copy_user_nocache.
>> >> >
>> >> >  * Note: Cached memory copy is used when destination or size is not
>> >> >  * naturally aligned. That is:
>> >> >  * - Require 8-byte alignment when size is 8 bytes or larger.
>> >> >  * - Require 4-byte alignment when size is 4 bytes.
>> >> >
>> >> > So, I think the code may be something like this. I also made the
>> >> > following changes:
>> >>
>> >> Thanks!
>> >>
>> >> > - Mask with 7, not 8.
>> >>
>> >> Yes, good catch.
>> >>
>> >> > - ALIGN with cacheline size, instead of 8.
>> >> > - Add (bytes > flushed) test since calculation with unsigned long
>> >> >   still results in a negative value (as a positive value).
>> >> >
>> >> > 	if (bytes < 8) {
>> >> > 		if ((dest & 3) || (bytes != 4))
>> >> > 			arch_wb_cache_pmem(addr, 1);
>> >> > 	} else {
>> >> > 		if (dest & 7) {
>> >> > 			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>> >>
>> >> Why align the destination to the next cacheline? As far as I can see
>> >> the ALIGN_DESTINATION macro in arch/x86/include/asm/asm.h only aligns
>> >> to the next 8-byte boundary.
>> >
>> > The clflush here flushes for the cacheline size. So, we do not need to
>> > flush the same cacheline again when the unaligned tail is in the same
>> > line.
>>
>> Ok, makes sense. Last question, can't we reduce the check to be:
>>
>> 	if ((bytes > flushed) && ((bytes - flushed) & 3))
>>
>> ...since if 'bytes' was 4-byte aligned we would have performed
>> non-temporal stores.
>
> That is not documented behavior of copy_user_nocache, but as long as
> the pmem version of copy_user_nocache follows the same implemented
> behavior, yes, that works.

Hmm, sorry this comment confuses me, I'm only referring to the current
version of __copy_user_nocache not the new pmem version. The way I
read the current code we only ever jump to the cached copy loop
(.L_1b_cache_copy_loop) if the trailing byte-count is 4-byte
misaligned.
> >> > The clflush here flushes for the cacheline size. So, we do not need
> >> > to flush the same cacheline again when the unaligned tail is in the
> >> > same line.
> >>
> >> Ok, makes sense. Last question, can't we reduce the check to be:
> >>
> >> 	if ((bytes > flushed) && ((bytes - flushed) & 3))
> >>
> >> ...since if 'bytes' was 4-byte aligned we would have performed
> >> non-temporal stores.
> >
> > That is not documented behavior of copy_user_nocache, but as long as
> > the pmem version of copy_user_nocache follows the same implemented
> > behavior, yes, that works.
>
> Hmm, sorry this comment confuses me, I'm only referring to the current
> version of __copy_user_nocache not the new pmem version. The way I
> read the current code we only ever jump to the cached copy loop
> (.L_1b_cache_copy_loop) if the trailing byte-count is 4-byte
> misaligned.

Yes, you are right and that's how the code is implemented. I added this
trailing 4-byte handling for the >=8B case, which is shared with the <8B
case, since it was easy to do. But I considered it a bonus. This
function also needs to handle a 4B-aligned destination if it is to state
that it handles 4B alignment for the >=8B case as well. Otherwise, it's
inconsistent. Since I did not see much point in supporting such a case,
I simply documented in the Note that 8-byte alignment is required for
the >=8B case.

Thanks,
-Toshi
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 2c1ebeb4d737..cf4e68faedc4 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -55,7 +55,8 @@ static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
  * @size: number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
- * instruction.
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
  */
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
@@ -69,15 +70,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
 		clwb(p);
 }
 
-/*
- * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
- * iterators, so for other types (bvec & kvec) we must do a cache write-back.
- */
-static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
-{
-	return iter_is_iovec(i) == false;
-}
-
 /**
  * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
  * @addr:	PMEM destination address
@@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
 	/* TODO: skip the write-back by always using non-temporal stores */
 	len = copy_from_iter_nocache(addr, bytes, i);
 
-	if (__iter_needs_pmem_wb(i))
+	/*
+	 * In the iovec case on x86_64 copy_from_iter_nocache() uses
+	 * non-temporal stores for the bulk of the transfer, but we need
+	 * to manually flush if the transfer is unaligned. In the
+	 * non-iovec case the entire destination needs to be flushed.
+	 */
+	if (iter_is_iovec(i)) {
+		unsigned long dest = (unsigned long) addr;
+
+		/*
+		 * If the destination is not 8-byte aligned then
+		 * __copy_user_nocache (on x86_64) uses cached copies
+		 */
+		if (dest & 8) {
+			arch_wb_cache_pmem(addr, 1);
+			dest = ALIGN(dest, 8);
+		}
+
+		/*
+		 * If the remaining transfer length, after accounting
+		 * for destination alignment, is not 4-byte aligned
+		 * then __copy_user_nocache() falls back to cached
+		 * copies for the trailing bytes in the final cacheline
+		 * of the transfer.
+		 */
+		if ((bytes - (dest - (unsigned long) addr)) & 4)
+			arch_wb_cache_pmem(addr + bytes - 1, 1);
+	} else
 		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage of
__copy_user_nocache() with explicit cache management in these unaligned
cases.

Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.

Fixes: 5de490daec8b ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: <stable@vger.kernel.org>
Cc: <x86@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[toshi: trailing bytes flush only needed in the 4B misalign case]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
v2: Change the condition for flushing the last cacheline of the
    destination from 8-byte to 4-byte misalignment (Toshi)

 arch/x86/include/asm/pmem.h |   41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)