pmem: fix __copy_user_flushcache

Message ID	alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=4Bxr=CZ=vger.kernel.org=linux-fsdevel-owner@kernel.org> Date: Wed, 16 Sep 2020 14:56:38 -0400 (EDT) From: Mikulas Patocka <mpatocka@redhat.com> To: Dan Williams <dan.j.williams@intel.com> cc: Linus Torvalds <torvalds@linux-foundation.org>, Alexander Viro <viro@zeniv.linux.org.uk>, Andrew Morton <akpm@linux-foundation.org>, Vishal Verma <vishal.l.verma@intel.com>, Dave Jiang <dave.jiang@intel.com>, Ira Weiny <ira.weiny@intel.com>, Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>, Eric Sandeen <esandeen@redhat.com>, Dave Chinner <dchinner@redhat.com>, "Kani, Toshi" <toshi.kani@hpe.com>, "Norton, Scott J" <scott.norton@hpe.com>, "Tadakamadla, Rajesh (DCIG/CDI/HPS Perf)" <rajesh.tadakamadla@hpe.com>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, linux-nvdimm <linux-nvdimm@lists.01.org> Subject: [PATCH] pmem: fix __copy_user_flushcache In-Reply-To: <CAPcyv4gD0ZFkfajKTDnJhEEjf+5Av-GH+cHRFoyhzGe8bNEgAA@mail.gmail.com> Message-ID: <alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com> References: <alpine.LRH.2.02.2009140852030.22422@file01.intranet.prod.int.rdu2.redhat.com> <CAPcyv4gh=QaDB61_9_QTgtt-pZuTFdR6td0orE0VMH6=6SA2vw@mail.gmail.com> <alpine.LRH.2.02.2009151216050.16057@file01.intranet.prod.int.rdu2.redhat.com> <alpine.LRH.2.02.2009151332280.3851@file01.intranet.prod.int.rdu2.redhat.com> <alpine.LRH.2.02.2009160649560.20720@file01.intranet.prod.int.rdu2.redhat.com> <CAPcyv4gW6AvR+RaShHdQzOaEPv9nrq5myXDmywuoCTYDZxk-hw@mail.gmail.com> <alpine.LRH.2.02.2009161254400.745@file01.intranet.prod.int.rdu2.redhat.com> <CAPcyv4gD0ZFkfajKTDnJhEEjf+5Av-GH+cHRFoyhzGe8bNEgAA@mail.gmail.com> User-Agent: Alpine 2.02 (LRH 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk
Series	pmem: fix __copy_user_flushcache \| expand pmem: fix __copy_user_flushcache

Mikulas Patocka Sept. 16, 2020, 6:56 p.m. UTC

On Wed, 16 Sep 2020, Dan Williams wrote:

> On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> >
> >
> > On Wed, 16 Sep 2020, Dan Williams wrote:
> >
> > > On Wed, Sep 16, 2020 at 3:57 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > >
> > > >
> > > >
> > > > I'm submitting this patch that adds the required exports (so that we could
> > > > use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue
> > > > it for the next merge window.
> > >
> > > Why? This should go with the first user, and it's not clear that it
> > > needs to be relative to the current dax_operations export scheme.
> >
> > Before nvfs gets included in the kernel, I need to distribute it as a
> > module. So, it would make my maintenance easier. But if you don't want to
> > export it now, no problem, I can just copy __copy_user_flushcache from the
> > kernel to the module.
> 
> That sounds a better plan than exporting symbols with no in-kernel consumer.

BTW, this function is buggy. Here I'm submitting the patch.



From: Mikulas Patocka <mpatocka@redhat.com>

If we copy less than 8 bytes and if the destination crosses a cache line,
__copy_user_flushcache would invalidate only the first cache line. This
patch makes it invalidate the second cache line as well.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 arch/x86/lib/usercopy_64.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Dan Williams Sept. 18, 2020, 1:53 a.m. UTC | #1

On Wed, Sep 16, 2020 at 11:57 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
>
> On Wed, 16 Sep 2020, Dan Williams wrote:
>
> > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > >
> > >
> > >
> > > On Wed, 16 Sep 2020, Dan Williams wrote:
> > >
> > > > On Wed, Sep 16, 2020 at 3:57 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > I'm submitting this patch that adds the required exports (so that we could
> > > > > use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue
> > > > > it for the next merge window.
> > > >
> > > > Why? This should go with the first user, and it's not clear that it
> > > > needs to be relative to the current dax_operations export scheme.
> > >
> > > Before nvfs gets included in the kernel, I need to distribute it as a
> > > module. So, it would make my maintenance easier. But if you don't want to
> > > export it now, no problem, I can just copy __copy_user_flushcache from the
> > > kernel to the module.
> >
> > That sounds a better plan than exporting symbols with no in-kernel consumer.
>
> BTW, this function is buggy. Here I'm submitting the patch.
>
>
>
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> If we copy less than 8 bytes and if the destination crosses a cache line,
> __copy_user_flushcache would invalidate only the first cache line. This
> patch makes it invalidate the second cache line as well.

Good catch.

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
>
> ---
>  arch/x86/lib/usercopy_64.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c   2020-09-05 10:01:27.000000000 +0200
> +++ linux-2.6/arch/x86/lib/usercopy_64.c        2020-09-16 20:48:31.000000000 +0200
> @@ -120,7 +120,7 @@ long __copy_user_flushcache(void *dst, c
>          */
>         if (size < 8) {
>                 if (!IS_ALIGNED(dest, 4) || size != 4)
> -                       clean_cache_range(dst, 1);
> +                       clean_cache_range(dst, size);
>         } else {
>                 if (!IS_ALIGNED(dest, 8)) {
>                         dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
>

You can add:

Fixes: 0aed55af8834 ("x86, uaccess: introduce
copy_from_iter_flushcache for pmem / cache-bypass operations")
Reviewed-by: Dan Williams <dan.j.wiilliams@intel.com>

Mikulas Patocka Sept. 18, 2020, 12:25 p.m. UTC | #2

Hi

I'd like to ask about this problem: when we write to a file, the kernel 
takes the write inode lock. When we read from a file, no lock is taken - 
thus the read syscall can read data that are halfway modified by the write 
syscall.

The standard specifies the effects of the write syscall are atomic - see 
this:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07

> 2.9.7 Thread Interactions with Regular File Operations
> 
> All of the following functions shall be atomic with respect to each 
> other in the effects specified in POSIX.1-2017 when they operate on 
> regular files or symbolic links:
> 
> chmod()    fchownat()  lseek()      readv()     unlink()
> chown()    fcntl()     lstat()      pwrite()    unlinkat()
> close()    fstat()     open()       rename()    utime()
> creat()    fstatat()   openat()     renameat()  utimensat()
> dup2()     ftruncate() pread()      stat()      utimes()
> fchmod()   lchown()    read()       symlink()   write()
> fchmodat() link()      readlink()   symlinkat() writev()
> fchown()   linkat()    readlinkat() truncate()
> 
> If two threads each call one of these functions, each call shall either 
> see all of the specified effects of the other call, or none of them. The 
> requirement on the close() function shall also apply whenever a file 
> descriptor is successfully closed, however caused (for example, as a 
> consequence of calling close(), calling dup2(), or of process 
> termination).

Should the read call take the read inode lock to make it atomic w.r.t. the 
write syscall? (I know - taking the read lock causes big performance hit 
due to cache line bouncing)

I've created this program to test it - it has two threads, one writing and 
the other reading and verifying. When I run it on OpenBSD or FreeBSD, it 
passes, on Linux it fails with "we read modified bytes".

Mikulas



#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <pthread.h>

#define L	65536

static int h;
static pthread_barrier_t barrier;
static pthread_t thr;

static char rpattern[L];
static char wpattern[L];

static void *reader(__attribute__((unused)) void *ptr)
{
	while (1) {
		int r;
		size_t i;
		r = pthread_barrier_wait(&barrier);
		if (r > 0) fprintf(stderr, "pthread_barrier_wait: %s\n", strerror(r)), exit(1);
		r = pread(h, rpattern, L, 0);
		if (r != L) perror("pread"), exit(1);
		for (i = 0; i < L; i++) {
			if (rpattern[i] != rpattern[0])
				fprintf(stderr, "we read modified bytes\n"), exit(1);
		}
	}
	return NULL;
}

int main(__attribute__((unused)) int argc, char *argv[])
{
	int r;
	h = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (h < 0) perror("open"), exit(1);
	r = pwrite(h, wpattern, L, 0);
	if (r != L) perror("pwrite"), exit(1);
	r = pthread_barrier_init(&barrier, NULL, 2);
	if (r) fprintf(stderr, "pthread_barrier_init: %s\n", strerror(r)), exit(1);
	r = pthread_create(&thr, NULL, reader, NULL);
	if (r) fprintf(stderr, "pthread_create: %s\n", strerror(r)), exit(1);
	while (1) {
		size_t i;
		for (i = 0; i < L; i++)
			wpattern[i]++;
		r = pthread_barrier_wait(&barrier);
		if (r > 0) fprintf(stderr, "pthread_barrier_wait: %s\n", strerror(r)), exit(1);
		r = pwrite(h, wpattern, L, 0);
		if (r != L) perror("pwrite"), exit(1);
	}
}

Jan Kara Sept. 18, 2020, 1:13 p.m. UTC | #3

On Fri 18-09-20 08:25:28, Mikulas Patocka wrote:
> I'd like to ask about this problem: when we write to a file, the kernel 
> takes the write inode lock. When we read from a file, no lock is taken - 
> thus the read syscall can read data that are halfway modified by the write 
> syscall.
> 
> The standard specifies the effects of the write syscall are atomic - see 
> this:
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07

Yes, but no Linux filesystem (except for XFS AFAIK) follows the POSIX spec
in this regard. Mostly because the mixed read-write performance sucks when
you follow it (not that it would absolutely have to suck - you can use
clever locking with range locks but nobody does it currently). In practice,
the read-write atomicity works on Linux only on per-page basis for buffered
IO.

								Honza

Linus Torvalds Sept. 18, 2020, 6:02 p.m. UTC | #4

On Fri, Sep 18, 2020 at 6:13 AM Jan Kara <jack@suse.cz> wrote:
>
> Yes, but no Linux filesystem (except for XFS AFAIK) follows the POSIX spec
> in this regard.

Yeah, and we never have. As you say, performance sucks, and nobody has
ever cared.

So the standard in this case is just something that we'll never
follow, and should just be ignored. It's irrelevant.

There are other places we don't follow POSIX either.

Feel free to document it (I think it's currently just a "everybody
knows" kind of undocumented thing).

             Linus

Dave Chinner Sept. 20, 2020, 11:41 p.m. UTC | #5

On Fri, Sep 18, 2020 at 03:13:17PM +0200, Jan Kara wrote:
> On Fri 18-09-20 08:25:28, Mikulas Patocka wrote:
> > I'd like to ask about this problem: when we write to a file, the kernel 
> > takes the write inode lock. When we read from a file, no lock is taken - 
> > thus the read syscall can read data that are halfway modified by the write 
> > syscall.
> > 
> > The standard specifies the effects of the write syscall are atomic - see 
> > this:
> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
> 
> Yes, but no Linux filesystem (except for XFS AFAIK) follows the POSIX spec
> in this regard. Mostly because the mixed read-write performance sucks when
> you follow it (not that it would absolutely have to suck - you can use
> clever locking with range locks but nobody does it currently). In practice,
> the read-write atomicity works on Linux only on per-page basis for buffered
> IO.

We come across this from time to time with POSIX compliant
applications being ported from other Unixes that rely on a write
from one thread being seen atomically from a read from another
thread. There are quite a few custom enterprise apps around that
rely on this POSIX behaviour, especially stuff that has come from
different Unixes that actually provided Posix compliant behaviour.

IOWs, from an upstream POV, POSIX atomic write behaviour doesn't
matter very much. From an enterprise distro POV it's often a
different story....

Cheers,

Dave.

pmem: fix __copy_user_flushcache

Commit Message

Comments

Patch