[Bug,208827,fio,io_uring] io_uring write data crc32c verify failed

https://bugzilla.kernel.org/show_bug.cgi?id=208827

--- Comment #5 from Dave Chinner (david@fromorbit.com) ---
On Mon, Aug 10, 2020 at 05:08:07PM +1000, Dave Chinner wrote:
> [cc Jens]
> 
> [Jens, data corruption w/ io_uring and simple fio reproducer. see
> the bz link below.]
> 
> On Mon, Aug 10, 2020 at 01:56:05PM +1000, Dave Chinner wrote:
> > On Mon, Aug 10, 2020 at 10:09:32AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 07, 2020 at 03:12:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > --- Comment #1 from Dave Chinner (david@fromorbit.com) ---
> > > > On Thu, Aug 06, 2020 at 04:57:58AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > > https://bugzilla.kernel.org/show_bug.cgi?id=208827
> > > > > 
> > > > >             Bug ID: 208827
> > > > >            Summary: [fio io_uring] io_uring write data crc32c verify
> > > > >                     failed
> > > > >            Product: File System
> > > > >            Version: 2.5
> > > > >     Kernel Version: xfs-linux xfs-5.9-merge-7 + v5.8-rc4
> > > 
> > > FWIW, I can reproduce this with a vanilla 5.8 release kernel,
> > > so this isn't related to contents of the XFS dev tree at all...
> > > 
> > > In fact, this bug isn't a recent regression. AFAICT, it was
> > > introduced between 5.4 and 5.5: 5.4 did not reproduce, 5.5 did
> > > reproduce. More info once I've finished bisecting it....
> > 
> > f67676d160c6ee2ed82917fadfed6d29cab8237c is the first bad commit
> > commit f67676d160c6ee2ed82917fadfed6d29cab8237c
> > Author: Jens Axboe <axboe@kernel.dk>
> > Date:   Mon Dec 2 11:03:47 2019 -0700
> > 
> >     io_uring: ensure async punted read/write requests copy iovec
> 
> ....
> 
> Ok, I went back to vanilla 5.8 to continue debugging and adding
> tracepoints, and it's proving strangely difficult to reproduce now.

Which turns out to be caused by a tracepoint I inserted to try to
narrow down whether this was an invalidation race. I put this in
invalidate_complete_page:

                 * First and last FULL page! Partial pages are deliberately

by making the invalidation wait for the pages to go fully to the
clean state before starting.
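The change described here is in the kernel. As a loose userspace analogue only (not the change made in this thread), an application can get the same ordering, pages clean before they are dropped, by waiting for writeback with fdatasync() before calling posix_fadvise() with POSIX_FADV_DONTNEED. A minimal sketch; `drop_cache_when_clean` is a hypothetical helper name:

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: flush dirty pages and wait for that writeback to
 * complete (fdatasync), so the cached range is clean before asking the
 * kernel to drop it with POSIX_FADV_DONTNEED.
 * Returns 0 on success, -1 on error. */
int drop_cache_when_clean(int fd)
{
    if (fdatasync(fd) != 0)
        return -1;
    /* posix_fadvise() returns an error number instead of setting errno. */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
        return -1;
    return 0;
}
```

Waiting in userspace costs a synchronous flush on every call, which is the same trade-off as making the kernel's invalidation wait for writeback.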

This, however, only fixes the specific symptom being tripped over
here. To test this further, I removed this writeback from
POSIX_FADV_DONTNEED completely so that writeback would only be
triggered by controlled background writeback. And, as I expected,
whenever background writeback ran to write back these dirty files,
the verification failures came back. This reproduces quite reliably.
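The write / POSIX_FADV_DONTNEED / re-read cycle being exercised can be sketched in a single thread, assuming a simple layout where each 4 KiB block begins with the file offset it was written at (a stand-in for fio's verify data, not fio's actual format). `write_drop_verify` and `stamp_block` are hypothetical names. On Linux, POSIX_FADV_DONTNEED starts non-waiting writeback of dirty pages in the range and then invalidates the cached pages it can; a serial loop like this is not expected to hit the race, it only shows where the verification sits.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

/* Fill a block and stamp its first 8 bytes with the file offset it is
 * written at (hypothetical layout, simpler than fio's verify header). */
static void stamp_block(unsigned char *buf, uint64_t off)
{
    memset(buf, (int)((off / BLK) & 0xff), BLK);
    memcpy(buf, &off, sizeof(off));
}

/* Write nblocks stamped blocks, ask the kernel to drop them from the page
 * cache, then read them back. Returns the number of blocks whose stamped
 * offset differs from the offset they were read from, or -1 on I/O error. */
int write_drop_verify(int fd, int nblocks)
{
    unsigned char buf[BLK];

    for (int i = 0; i < nblocks; i++) {
        uint64_t off = (uint64_t)i * BLK;
        stamp_block(buf, off);
        if (pwrite(fd, buf, BLK, (off_t)off) != BLK)
            return -1;
    }

    /* Whole file: len == 0 means "to the end of the file". */
    if (posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) != 0)
        return -1;

    int bad = 0;
    for (int i = 0; i < nblocks; i++) {
        uint64_t off = (uint64_t)i * BLK, stamped;
        if (pread(fd, buf, BLK, (off_t)off) != BLK)
            return -1;
        memcpy(&stamped, buf, sizeof(stamped));
        if (stamped != off)
            bad++;  /* data in the page does not belong at this offset */
    }
    return bad;
}
```

On a correct kernel this returns 0; a nonzero count means a page read back from offset X carried data stamped for some other offset.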

So it looks like there is some kind of writeback completion vs page
invalidation race condition occurring, but more work is needed to
isolate it further. I don't know what part the async read plays in
the corruption yet, because I don't know how we are getting pages in
the cache where page->index != the file offset stamped in the data.
That smells of leaking PageUptodate flags...
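The "file offset stamped in the data" check can be sketched as follows, assuming a hypothetical 16-byte header holding the block's offset and a CRC32C (Castagnoli polynomial, reflected form 0x82F63B78) of the payload; fio's real verify header carries more fields. Checking the CRC first separates two failure modes: a torn or overwritten payload fails the CRC, while an intact block returned for the wrong offset passes the CRC but fails the offset comparison, which matches the page->index mismatch described above. `struct blk_hdr`, `make_block` and `check_block` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical per-block header, not fio's struct layout. */
struct blk_hdr {
    uint64_t offset;  /* file offset this block was written at */
    uint32_t crc;     /* crc32c of the payload following the header */
    uint32_t pad;
};

enum verify_result { VERIFY_OK, VERIFY_BAD_CRC, VERIFY_WRONG_OFFSET };

/* Build a block of len bytes (len > sizeof(struct blk_hdr)) for offset off. */
void make_block(unsigned char *blk, size_t len, uint64_t off)
{
    struct blk_hdr h = { .offset = off, .pad = 0 };

    for (size_t i = sizeof h; i < len; i++)
        blk[i] = (unsigned char)(off + i);
    h.crc = crc32c(blk + sizeof h, len - sizeof h);
    memcpy(blk, &h, sizeof h);
}

/* Verify a block that was read back from file offset read_off. */
enum verify_result check_block(const unsigned char *blk, size_t len,
                               uint64_t read_off)
{
    struct blk_hdr h;

    memcpy(&h, blk, sizeof h);
    if (crc32c(blk + sizeof h, len - sizeof h) != h.crc)
        return VERIFY_BAD_CRC;       /* payload torn or overwritten */
    if (h.offset != read_off)
        return VERIFY_WRONG_OFFSET;  /* intact block from another offset */
    return VERIFY_OK;
}
```

As a sanity check, `crc32c("123456789", 9)` returns 0xE3069283, the standard CRC-32C check value.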

-Dave.

Message ID: <bug-208827-201763-xNQFRS4LfF@https.bugzilla.kernel.org/>
State: New, archived
From: bugzilla-daemon@bugzilla.kernel.org
To: linux-xfs@vger.kernel.org
Subject: [Bug 208827] [fio io_uring] io_uring write data crc32c verify failed
Date: Mon, 10 Aug 2020 09:09:04 +0000
In-Reply-To: <bug-208827-201763@https.bugzilla.kernel.org/>