Message ID: 20231202091432.8349-1-libaokun1@huawei.com (mailing list archive)
Series: mm/ext4: avoid data corruption when extending DIO write race with buffered read
Hello! On Sat 02-12-23 17:14:30, Baokun Li wrote: > Recently, while running some pressure tests on MYSQL, noticed that > occasionally a "corrupted data in log event" error would be reported. > After analyzing the error, I found that extending DIO write and buffered > read were competing, resulting in some zero-filled page end being read. > Since ext4 buffered read doesn't hold an inode lock, and there is no > field in the page to indicate the valid data size, it seems to me that > it is impossible to solve this problem perfectly without changing these > two things. Yes, combining buffered reads with direct IO writes is a recipe for problems and pretty much in the "don't do it" territory. So honestly I'd consider this a MYSQL bug. Were you able to identify why does MYSQL use buffered read in this case? It is just something specific to the test you're doing? > In this series, the first patch reads the inode size twice, and takes the > smaller of the two values as the copyout limit to avoid copying data that > was not actually read (0-padding) into the user buffer and causing data > corruption. This greatly reduces the probability of problems under 4k > page. However, the problem is still easily triggered under 64k page. > > The second patch waits for the existing dio write to complete and > invalidate the stale page cache before performing a new buffered read > in ext4, avoiding data corruption by copying the stale page cache to > the user buffer. This makes it much less likely that the problem will > be triggered in a 64k page. > > Do we have a plan to add a lock to the ext4 buffered read or a field in > the page that indicates the size of the valid data in the page? Or does > anyone have a better idea? No, there are no plans to address this AFAIK. Because such locking will slow down all the well behaved applications to fix a corner case for application doing unsupported things. Sure we must not crash the kernel, corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app combines buffered and direct IO but returning zeros instead of valid data is in my opinion fully within the range of acceptable behavior for such case. Honza
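For reference, the clamping idea described for the first patch can be sketched roughly as below. This is only an illustration of the approach in the cover letter, not the actual patch; the helper name and parameters are made up for the example.

```c
#include <linux/fs.h>

/*
 * Sketch of "read i_size twice and take the smaller value as the copyout
 * limit": isize_seen is the size sampled earlier, and the copy to
 * userspace is clamped so that bytes beyond the smaller of the two
 * samples are never copied out.
 */
static size_t clamp_copy_to_isize(struct inode *inode, loff_t pos,
				  size_t len, loff_t isize_seen)
{
	loff_t isize = min(isize_seen, i_size_read(inode));

	if (pos >= isize)
		return 0;
	return min_t(loff_t, len, isize - pos);
}
```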
On 2023/12/4 20:11, Jan Kara wrote: > Hello! Thank you for your reply! > > On Sat 02-12-23 17:14:30, Baokun Li wrote: >> Recently, while running some pressure tests on MYSQL, noticed that >> occasionally a "corrupted data in log event" error would be reported. >> After analyzing the error, I found that extending DIO write and buffered >> read were competing, resulting in some zero-filled page end being read. >> Since ext4 buffered read doesn't hold an inode lock, and there is no >> field in the page to indicate the valid data size, it seems to me that >> it is impossible to solve this problem perfectly without changing these >> two things. > Yes, combining buffered reads with direct IO writes is a recipe for > problems and pretty much in the "don't do it" territory. So honestly I'd > consider this a MYSQL bug. Were you able to identify why does MYSQL use > buffered read in this case? It is just something specific to the test > you're doing? The problem is with a one-master-twoslave MYSQL database with three physical machines, and using sysbench pressure testing on each of the three machines, the problem occurs about once every two to three hours. The problem is with the relay log file, and when the problem occurs, the middle dozens of bytes of the file are read as all zeros, while the data on disk is not. This is a journal-like file where a write process gets the data from the master node and writes it locally, and another replay process reads the file and performs the replay operation accordingly (some SQL statements). The problem is that when replaying, it finds that the data read is corrupted, not valid SQL data, while the data on disk is normal. It's not confirmed that buffered reads vs direct IO writes is actually causing this issue, but this is the only scenario that we can reproduce with our local simplified scripts. Also, after merging in patch 1, the MYSQL pressure test scenario has now been tested for 5 days and has not been reproduced. I'll double-check the problem scenario, although buffered reads with buffered writes doesn't seem to have this problem. >> In this series, the first patch reads the inode size twice, and takes the >> smaller of the two values as the copyout limit to avoid copying data that >> was not actually read (0-padding) into the user buffer and causing data >> corruption. This greatly reduces the probability of problems under 4k >> page. However, the problem is still easily triggered under 64k page. >> >> The second patch waits for the existing dio write to complete and >> invalidate the stale page cache before performing a new buffered read >> in ext4, avoiding data corruption by copying the stale page cache to >> the user buffer. This makes it much less likely that the problem will >> be triggered in a 64k page. >> >> Do we have a plan to add a lock to the ext4 buffered read or a field in >> the page that indicates the size of the valid data in the page? Or does >> anyone have a better idea? > No, there are no plans to address this AFAIK. Because such locking will > slow down all the well behaved applications to fix a corner case for > application doing unsupported things. Sure we must not crash the kernel, > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > combines buffered and direct IO but returning zeros instead of valid data > is in my opinion fully within the range of acceptable behavior for such > case. > > Honza I also feel that a scenario like buffered reads + DIO writes is strange. 
But theoretically, when a read doesn't return an error, the data it returns shouldn't be wrong. And I tested that xfs guarantees data consistency in this scenario, which is why I thought it might be a bug. Thanks!
On Mon 04-12-23 21:50:18, Baokun Li wrote: > On 2023/12/4 20:11, Jan Kara wrote: > > On Sat 02-12-23 17:14:30, Baokun Li wrote: > > > Recently, while running some pressure tests on MYSQL, noticed that > > > occasionally a "corrupted data in log event" error would be reported. > > > After analyzing the error, I found that extending DIO write and buffered > > > read were competing, resulting in some zero-filled page end being read. > > > Since ext4 buffered read doesn't hold an inode lock, and there is no > > > field in the page to indicate the valid data size, it seems to me that > > > it is impossible to solve this problem perfectly without changing these > > > two things. > > Yes, combining buffered reads with direct IO writes is a recipe for > > problems and pretty much in the "don't do it" territory. So honestly I'd > > consider this a MYSQL bug. Were you able to identify why does MYSQL use > > buffered read in this case? It is just something specific to the test > > you're doing? > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, the > middle dozens of bytes of the file are read as all zeros, while the data on > disk is not. This is a journal-like file where a write process gets the data > from > the master node and writes it locally, and another replay process reads the > file and performs the replay operation accordingly (some SQL statements). > The problem is that when replaying, it finds that the data read is > corrupted, > not valid SQL data, while the data on disk is normal. > > It's not confirmed that buffered reads vs direct IO writes is actually > causing this issue, but this is the only scenario that we can reproduce > with our local simplified scripts. Also, after merging in patch 1, the > MYSQL pressure test scenario has now been tested for 5 days and has not > been reproduced. > > I'll double-check the problem scenario, although buffered reads with > buffered writes doesn't seem to have this problem. Yeah, from what you write it seems that the replay code is using buffered reads on the journal file. I guess you could confirm that with a bit of kernel tracing but the symptoms look pretty convincing. Did you try talking to MYSQL guys about why they are doing this? > > > In this series, the first patch reads the inode size twice, and takes the > > > smaller of the two values as the copyout limit to avoid copying data that > > > was not actually read (0-padding) into the user buffer and causing data > > > corruption. This greatly reduces the probability of problems under 4k > > > page. However, the problem is still easily triggered under 64k page. > > > > > > The second patch waits for the existing dio write to complete and > > > invalidate the stale page cache before performing a new buffered read > > > in ext4, avoiding data corruption by copying the stale page cache to > > > the user buffer. This makes it much less likely that the problem will > > > be triggered in a 64k page. > > > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in > > > the page that indicates the size of the valid data in the page? Or does > > > anyone have a better idea? > > No, there are no plans to address this AFAIK. 
Because such locking will > > slow down all the well behaved applications to fix a corner case for > > application doing unsupported things. Sure we must not crash the kernel, > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > > combines buffered and direct IO but returning zeros instead of valid data > > is in my opinion fully within the range of acceptable behavior for such > > case. > > > I also feel that a scenario like buffered reads + DIO writes is strange. > But theoretically when read doesn't return an error, the data read > shouldn't be wrong. And I tested that xfs guarantees data consistency in > this scenario, which is why I thought it might be buggy. Yes, XFS has inherited stronger consistency guarantees from IRIX times than Linux filesystems traditionally had. We generally don't even guarantee buffered read vs buffered write atomicity (i.e., buffered read can see a torn buffered write). Honza
On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, > the middle dozens of bytes of the file are read as all zeros, while > the data on disk is not. This is a journal-like file where a write > process gets the data from the master node and writes it locally, > and another replay process reads the file and performs the replay > operation accordingly (some SQL statements). The problem is that > when replaying, it finds that the data read is corrupted, not valid > SQL data, while the data on disk is normal. You mentioned "scripts" --- are these locally developped scripts by any chance? The procedure suggested in a few places that I looked up don't involve needing to read the replay log. For example from[1]: On the master server: root@repl-master:~# mysql -uroot -p; mysql> CREATE USER ‘slave’@’12.34.56.789‘ IDENTIFIED BY ‘SLAVE_PASSWORD‘; mysql> GRANT REPLICATION SLAVE ON . TO ‘slave’@’12.34.56.222 ‘; mysql> FLUSH PRIVILEGES; mysql> FLUSH TABLES WITH READ LOCK; This will make the master server read-only, with all pending writes flushed out (so you don't need to worry about the replay log), and then you move the data from the master to slave: root@repl-master:~# mysqldump -u root -p –all-databases –master-data > data.sql root@repl-master:~# scp data.sql root@12.34.56.222 Then on the slave: root@repl-slave:~# mysql -uroot -p < data.sql root@repl-slave:~# mysql -uroot -p; mysql> STOP SLAVE; ... and then on the master: root@repl-master:~# mysql -uroot -p; mysql> UNLOCK TABLES; ... and back on the slave: root@repl-slave:~# mysql -uroot -p; mysql> START SLAVE; [1] https://hevodata.com/learn/mysql-master-slave-replication/ ... or you could buy the product advertised at [1] which is easier for the database administrators, but results in $$$ flowing to the Hevo company. :-) In any case, I'm pretty sure that the official documented way of setting up a failover replication setup doesn't involve buffered reads of the replay file. It is certainly the case that mysqldump uses buffered reads, but that's why you have to temporary make the database read-only using "FLUSH TABLES WITH READ LOCK" before taking a database snapshot, and then re-enable database updates the "UNLOCK TABLES" SQL commands. Cheers, - Ted
On 2023/12/4 22:41, Jan Kara wrote: > On Mon 04-12-23 21:50:18, Baokun Li wrote: >> On 2023/12/4 20:11, Jan Kara wrote: >>> On Sat 02-12-23 17:14:30, Baokun Li wrote: >>>> Recently, while running some pressure tests on MYSQL, noticed that >>>> occasionally a "corrupted data in log event" error would be reported. >>>> After analyzing the error, I found that extending DIO write and buffered >>>> read were competing, resulting in some zero-filled page end being read. >>>> Since ext4 buffered read doesn't hold an inode lock, and there is no >>>> field in the page to indicate the valid data size, it seems to me that >>>> it is impossible to solve this problem perfectly without changing these >>>> two things. >>> Yes, combining buffered reads with direct IO writes is a recipe for >>> problems and pretty much in the "don't do it" territory. So honestly I'd >>> consider this a MYSQL bug. Were you able to identify why does MYSQL use >>> buffered read in this case? It is just something specific to the test >>> you're doing? >> The problem is with a one-master-twoslave MYSQL database with three >> physical machines, and using sysbench pressure testing on each of the >> three machines, the problem occurs about once every two to three hours. >> >> The problem is with the relay log file, and when the problem occurs, the >> middle dozens of bytes of the file are read as all zeros, while the data on >> disk is not. This is a journal-like file where a write process gets the data >> from >> the master node and writes it locally, and another replay process reads the >> file and performs the replay operation accordingly (some SQL statements). >> The problem is that when replaying, it finds that the data read is >> corrupted, >> not valid SQL data, while the data on disk is normal. >> >> It's not confirmed that buffered reads vs direct IO writes is actually >> causing this issue, but this is the only scenario that we can reproduce >> with our local simplified scripts. Also, after merging in patch 1, the >> MYSQL pressure test scenario has now been tested for 5 days and has not >> been reproduced. >> >> I'll double-check the problem scenario, although buffered reads with >> buffered writes doesn't seem to have this problem. > Yeah, from what you write it seems that the replay code is using buffered > reads on the journal file. I guess you could confirm that with a bit of > kernel tracing but the symptoms look pretty convincing. Did you try talking > to MYSQL guys about why they are doing this? The operations performed on the relay log file are buffered reads and writes, which I confirmed with the following bpftrace script: ``` #include <linux/fs.h> #include <linux/path.h> #include <linux/dcache.h> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { printf("read path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name)); } kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { printf("write path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name)); } ``` I suspect there are DIO writes causing the problem, but I haven't caught any DIO writes to such files via bpftrace. >>>> In this series, the first patch reads the inode size twice, and takes the >>>> smaller of the two values as the copyout limit to avoid copying data that >>>> was not actually read (0-padding) into the user buffer and causing data >>>> corruption. 
This greatly reduces the probability of problems under 4k >>>> page. However, the problem is still easily triggered under 64k page. >>>> >>>> The second patch waits for the existing dio write to complete and >>>> invalidate the stale page cache before performing a new buffered read >>>> in ext4, avoiding data corruption by copying the stale page cache to >>>> the user buffer. This makes it much less likely that the problem will >>>> be triggered in a 64k page. >>>> >>>> Do we have a plan to add a lock to the ext4 buffered read or a field in >>>> the page that indicates the size of the valid data in the page? Or does >>>> anyone have a better idea? >>> No, there are no plans to address this AFAIK. Because such locking will >>> slow down all the well behaved applications to fix a corner case for >>> application doing unsupported things. Sure we must not crash the kernel, >>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app >>> combines buffered and direct IO but returning zeros instead of valid data >>> is in my opinion fully within the range of acceptable behavior for such >>> case. >>> >> I also feel that a scenario like buffered reads + DIO writes is strange. >> But theoretically when read doesn't return an error, the data read >> shouldn't be wrong. And I tested that xfs guarantees data consistency in >> this scenario, which is why I thought it might be buggy. > Yes, XFS has inherited stronger consistency guarantees from IRIX times than > Linux filesystems traditionally had. We generally don't even guarantee > buffered read vs buffered write atomicity (i.e., buffered read can see a > torn buffered write). > > Honza I'm a bit confused here, buffered read vs buffered write uses the same page and appears to be protected by a memory barrier, how does the inconsistency occur? Thanks!
On 2023/12/5 12:17, Theodore Ts'o wrote: > On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: >> The problem is with a one-master-twoslave MYSQL database with three >> physical machines, and using sysbench pressure testing on each of the >> three machines, the problem occurs about once every two to three hours. >> >> The problem is with the relay log file, and when the problem occurs, >> the middle dozens of bytes of the file are read as all zeros, while >> the data on disk is not. This is a journal-like file where a write >> process gets the data from the master node and writes it locally, >> and another replay process reads the file and performs the replay >> operation accordingly (some SQL statements). The problem is that >> when replaying, it finds that the data read is corrupted, not valid >> SQL data, while the data on disk is normal. > You mentioned "scripts" --- are these locally developped scripts by > any chance? This refers to the sql commands to be replayed in the relay log file. I don't know much about this file, but you can read the official documentation. https://dev.mysql.com/doc/refman/8.0/en/replica-logs-relaylog.html > The procedure suggested in a few places that I looked up > don't involve needing to read the replay log. For example from[1]: > > On the master server: > > root@repl-master:~# mysql -uroot -p; > mysql> CREATE USER ‘slave’@’12.34.56.789‘ IDENTIFIED BY ‘SLAVE_PASSWORD‘; > mysql> GRANT REPLICATION SLAVE ON . TO ‘slave’@’12.34.56.222 ‘; > mysql> FLUSH PRIVILEGES; > mysql> FLUSH TABLES WITH READ LOCK; > > This will make the master server read-only, with all pending writes > flushed out (so you don't need to worry about the replay log), and > then you move the data from the master to slave: > > root@repl-master:~# mysqldump -u root -p –all-databases –master-data > data.sql > root@repl-master:~# scp data.sql root@12.34.56.222 > > Then on the slave: > > root@repl-slave:~# mysql -uroot -p < data.sql > root@repl-slave:~# mysql -uroot -p; > mysql> STOP SLAVE; > > ... and then on the master: > > root@repl-master:~# mysql -uroot -p; > mysql> UNLOCK TABLES; > > ... and back on the slave: > > root@repl-slave:~# mysql -uroot -p; > mysql> START SLAVE; > > [1] https://hevodata.com/learn/mysql-master-slave-replication/ > > ... or you could buy the product advertised at [1] which is easier for > the database administrators, but results in $$$ flowing to the Hevo > company. :-) > > In any case, I'm pretty sure that the official documented way of > setting up a failover replication setup doesn't involve buffered reads > of the replay file. > > It is certainly the case that mysqldump uses buffered reads, but > that's why you have to temporary make the database read-only using > "FLUSH TABLES WITH READ LOCK" before taking a database snapshot, and > then re-enable database updates the "UNLOCK TABLES" SQL commands. > > Cheers, > > - Ted Thank you very much for your detailed explanation! But the downstream users do have buffered reads to read the relay log file, as I confirmed with bpftrace. Here's an introduction to turning on relay logging, but I'm not sure if you can access this link: https://blog.csdn.net/javaanddonet/article/details/112596148 Thanks!
On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: > On 2023/12/4 20:11, Jan Kara wrote: > > Hello! > Thank you for your reply! > > > > On Sat 02-12-23 17:14:30, Baokun Li wrote: > > > Recently, while running some pressure tests on MYSQL, noticed that > > > occasionally a "corrupted data in log event" error would be reported. > > > After analyzing the error, I found that extending DIO write and buffered > > > read were competing, resulting in some zero-filled page end being read. > > > Since ext4 buffered read doesn't hold an inode lock, and there is no > > > field in the page to indicate the valid data size, it seems to me that > > > it is impossible to solve this problem perfectly without changing these > > > two things. > > Yes, combining buffered reads with direct IO writes is a recipe for > > problems and pretty much in the "don't do it" territory. So honestly I'd > > consider this a MYSQL bug. Were you able to identify why does MYSQL use > > buffered read in this case? It is just something specific to the test > > you're doing? > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, the > middle dozens of bytes of the file are read as all zeros, while the data on > disk is not. This is a journal-like file where a write process gets the data > from > the master node and writes it locally, and another replay process reads the > file and performs the replay operation accordingly (some SQL statements). > The problem is that when replaying, it finds that the data read is > corrupted, > not valid SQL data, while the data on disk is normal. > > It's not confirmed that buffered reads vs direct IO writes is actually > causing > this issue, but this is the only scenario that we can reproduce with our > local > simplified scripts. Also, after merging in patch 1, the MYSQL pressure test > scenario has now been tested for 5 days and has not been reproduced. Mixing overlapping buffered read with direct writes - especially partial block extending DIO writes - is a recipe for data corruption. It's not a matter of if, it's a matter of when. Fundamentally, when you have overlapping write IO involving DIO, the result of the overlapping IOs is undefined. One cannot control submission order, the order that the overlapping IO hit the media, or completion ordering that might clear flags like unwritten extents. The only guarantee that we give in this case is that we won't expose stale data from the disk to the user read. As such, it is the responsibility of the application to avoid overlapping IOs when doing direct IO. The fact that you['ve observed data corruption due to overlapping IO ranges from the one application indicates that this is, indeed, an application level problem and can only be solved by fixing the application.... > I'll double-check the problem scenario, although buffered reads with > buffered > writes doesn't seem to have this problem. > > > In this series, the first patch reads the inode size twice, and takes the > > > smaller of the two values as the copyout limit to avoid copying data that > > > was not actually read (0-padding) into the user buffer and causing data > > > corruption. This greatly reduces the probability of problems under 4k > > > page. However, the problem is still easily triggered under 64k page. 
> > > > > > The second patch waits for the existing dio write to complete and > > > invalidate the stale page cache before performing a new buffered read > > > in ext4, avoiding data corruption by copying the stale page cache to > > > the user buffer. This makes it much less likely that the problem will > > > be triggered in a 64k page. > > > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in > > > the page that indicates the size of the valid data in the page? Or does > > > anyone have a better idea? > > No, there are no plans to address this AFAIK. Because such locking will > > slow down all the well behaved applications to fix a corner case for > > application doing unsupported things. Sure we must not crash the kernel, > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > > combines buffered and direct IO but returning zeros instead of valid data > > is in my opinion fully within the range of acceptable behavior for such > > case. > > > > Honza > I also feel that a scenario like buffered reads + DIO writes is strange. But > theoretically when read doesn't return an error, the data read shouldn't be > wrong. The data that was read isn't wrong - it just wasn't what the application expected. > And I tested that xfs guarantees data consistency in this scenario, which is > why I > thought it might be buggy. XFS certainly does not guarantee data consistency between buffered reads and DIO writes, especially for overlapping IO ranges using DIO (see above). IOWs, the fact that the data inconsistency doesn't reproduce on XFS doesn't mean that XFS is providing some guarantee of consistency for this IO pattern. All it means is that ext4 and XFS behave differently in an situation where the operational result is indeterminate and all we guarantee is that we won't expose stale data read from disk.... Cheers, Dave.
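As a purely illustrative example of "fixing the application" (nothing here is taken from MySQL; the shared published_len scheme and all names are assumptions, and error handling is omitted), a reader can be kept from ever touching a range that the O_DIRECT writer has not finished yet:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

static _Atomic uint64_t published_len;	/* bytes of the file known to be fully written */

static void writer_append(int dio_fd, const void *buf, size_t len, uint64_t pos)
{
	/* buf, len and pos must satisfy the usual O_DIRECT alignment rules */
	if (pwrite(dio_fd, buf, len, pos) == (ssize_t)len)
		/* publish the new valid length only after the DIO write completed */
		atomic_store_explicit(&published_len, pos + len, memory_order_release);
}

static ssize_t reader_read(int fd, void *buf, size_t len, uint64_t pos)
{
	uint64_t safe = atomic_load_explicit(&published_len, memory_order_acquire);

	if (pos >= safe)
		return 0;		/* nothing safe to read yet */
	if (pos + len > safe)
		len = safe - pos;	/* never overlap an in-flight DIO write */
	return pread(fd, buf, len, pos);
}
```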
On Wed, Dec 06, 2023 at 07:35:35PM +1100, Dave Chinner wrote: > Mixing overlapping buffered read with direct writes - especially partial block > extending DIO writes - is a recipe for data corruption. It's not a > matter of if, it's a matter of when. > > Fundamentally, when you have overlapping write IO involving DIO, the > result of the overlapping IOs is undefined. One cannot control > submission order, the order that the overlapping IO hit the > media, or completion ordering that might clear flags like unwritten > extents. The only guarantee that we give in this case is that we > won't expose stale data from the disk to the user read. Btw, one thing we could do to kill these races forever is to track if there are any buffered openers for an inode and just fall back to buffered I/O for that case. With that and inode_dio_wait for when opening for buffered I/O we'd avoid the races and various crazy workarounds entirely. nfs and ocfs2 do (or did, I haven't checked for a while) something like that.
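For illustration only, the scheme Christoph describes might look roughly like the sketch below. The structure, field and function names (foo_inode_info, buffered_openers, foo_file_open, foo_dio_must_fall_back) are invented for the example and are not an existing kernel interface; inode_dio_wait() and generic_file_open() are the only real helpers used, and the matching decrement in ->release is omitted.

```c
#include <linux/fs.h>
#include <linux/atomic.h>

struct foo_inode_info {
	atomic_t	buffered_openers;	/* hypothetical counter */
	struct inode	vfs_inode;
};

#define FOO_I(inode) container_of(inode, struct foo_inode_info, vfs_inode)

static int foo_file_open(struct inode *inode, struct file *filp)
{
	if (!(filp->f_flags & O_DIRECT)) {
		atomic_inc(&FOO_I(inode)->buffered_openers);
		/* don't let buffered I/O start while DIO is still in flight */
		inode_dio_wait(inode);
	}
	return generic_file_open(inode, filp);
}

static bool foo_dio_must_fall_back(struct inode *inode)
{
	/* any buffered opener forces O_DIRECT I/O through the page cache */
	return atomic_read(&FOO_I(inode)->buffered_openers) != 0;
}
```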
On Wed, Dec 06, 2023 at 01:02:43AM -0800, Christoph Hellwig wrote: > On Wed, Dec 06, 2023 at 07:35:35PM +1100, Dave Chinner wrote: > > Mixing overlapping buffered read with direct writes - especially partial block > > extending DIO writes - is a recipe for data corruption. It's not a > > matter of if, it's a matter of when. > > > > Fundamentally, when you have overlapping write IO involving DIO, the > > result of the overlapping IOs is undefined. One cannot control > > submission order, the order that the overlapping IO hit the > > media, or completion ordering that might clear flags like unwritten > > extents. The only guarantee that we give in this case is that we > > won't expose stale data from the disk to the user read. > > Btw, one thing we could do to kill these races forever is to track if > there are any buffered openers for an inode and just fall back to > buffered I/O for that case. With that and inode_dio_wait for > when opening for buffered I/O we'd avoid the races and various crazy > workarounds entirely. That's basically what Solaris did 20-25 years ago. The inode held a flag that indicated what IO was being done, and if the "buffered" flag was set (either through mmap() based access or buffered read/write syscalls) then direct IO would also do buffered IO until the flag was cleared and the cache cleaned and invalidated. That had .... problems. Largely they were performance problems - unpredictable IO latency and CPU overhead for IO meant applications would randomly miss SLAs. The application would see IO suddenly lose all concurrency, go real slow and/or burn lots more CPU when the inode switched to buffered mode. I'm not sure that's a particularly viable model given that the raw IO throughput of even cheap modern SSDs largely exceeds the capability of buffered IO through the page cache. The differences in concurrency, latency and throughput between buffered and DIO modes will be even more stark today than they were 20 years ago.... -Dave.
On 2023/12/6 16:35, Dave Chinner wrote: > On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: >> On 2023/12/4 20:11, Jan Kara wrote: >>> Hello! >> Thank you for your reply! >>> On Sat 02-12-23 17:14:30, Baokun Li wrote: >>>> Recently, while running some pressure tests on MYSQL, noticed that >>>> occasionally a "corrupted data in log event" error would be reported. >>>> After analyzing the error, I found that extending DIO write and buffered >>>> read were competing, resulting in some zero-filled page end being read. >>>> Since ext4 buffered read doesn't hold an inode lock, and there is no >>>> field in the page to indicate the valid data size, it seems to me that >>>> it is impossible to solve this problem perfectly without changing these >>>> two things. >>> Yes, combining buffered reads with direct IO writes is a recipe for >>> problems and pretty much in the "don't do it" territory. So honestly I'd >>> consider this a MYSQL bug. Were you able to identify why does MYSQL use >>> buffered read in this case? It is just something specific to the test >>> you're doing? >> The problem is with a one-master-twoslave MYSQL database with three >> physical machines, and using sysbench pressure testing on each of the >> three machines, the problem occurs about once every two to three hours. >> >> The problem is with the relay log file, and when the problem occurs, the >> middle dozens of bytes of the file are read as all zeros, while the data on >> disk is not. This is a journal-like file where a write process gets the data >> from >> the master node and writes it locally, and another replay process reads the >> file and performs the replay operation accordingly (some SQL statements). >> The problem is that when replaying, it finds that the data read is >> corrupted, >> not valid SQL data, while the data on disk is normal. >> >> It's not confirmed that buffered reads vs direct IO writes is actually >> causing >> this issue, but this is the only scenario that we can reproduce with our >> local >> simplified scripts. Also, after merging in patch 1, the MYSQL pressure test >> scenario has now been tested for 5 days and has not been reproduced. > Mixing overlapping buffered read with direct writes - especially partial block > extending DIO writes - is a recipe for data corruption. It's not a > matter of if, it's a matter of when. > > Fundamentally, when you have overlapping write IO involving DIO, the > result of the overlapping IOs is undefined. One cannot control > submission order, the order that the overlapping IO hit the > media, or completion ordering that might clear flags like unwritten > extents. The only guarantee that we give in this case is that we > won't expose stale data from the disk to the user read. > > As such, it is the responsibility of the application to avoid > overlapping IOs when doing direct IO. The fact that you['ve observed > data corruption due to overlapping IO ranges from the one > application indicates that this is, indeed, an application level > problem and can only be solved by fixing the application.... OK, thank you very much for your detailed explanation! >> I'll double-check the problem scenario, although buffered reads with >> buffered >> writes doesn't seem to have this problem. >>>> In this series, the first patch reads the inode size twice, and takes the >>>> smaller of the two values as the copyout limit to avoid copying data that >>>> was not actually read (0-padding) into the user buffer and causing data >>>> corruption.
This greatly reduces the probability of problems under 4k >>>> page. However, the problem is still easily triggered under 64k page. >>>> >>>> The second patch waits for the existing dio write to complete and >>>> invalidate the stale page cache before performing a new buffered read >>>> in ext4, avoiding data corruption by copying the stale page cache to >>>> the user buffer. This makes it much less likely that the problem will >>>> be triggered in a 64k page. >>>> >>>> Do we have a plan to add a lock to the ext4 buffered read or a field in >>>> the page that indicates the size of the valid data in the page? Or does >>>> anyone have a better idea? >>> No, there are no plans to address this AFAIK. Because such locking will >>> slow down all the well behaved applications to fix a corner case for >>> application doing unsupported things. Sure we must not crash the kernel, >>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app >>> combines buffered and direct IO but returning zeros instead of valid data >>> is in my opinion fully within the range of acceptable behavior for such >>> case. >>> >>> Honza >> I also feel that a scenario like buffered reads + DIO writes is strange. But >> theoretically when read doesn't return an error, the data read shouldn't be >> wrong. > The data that was read isn't wrong - it just wasn't what the > application expected. > >> And I tested that xfs guarantees data consistency in this scenario, which is >> why I >> thought it might be buggy. > XFS certainly does not guarantee data consistency between buffered > reads and DIO writes, especially for overlapping IO ranges using DIO > (see above). > > IOWs, the fact that the data inconsistency doesn't reproduce on XFS > doesn't mean that XFS is providing some guarantee of consistency for > this IO pattern. All it means is that ext4 and XFS behave differently > in an situation where the operational result is indeterminate and all > we guarantee is that we won't expose stale data read from disk.... > > Cheers, > > Dave. Here's my test script, just put all three shell scripts in the same directory, change the disk and dir, then run `sh test-io.sh`, and the inconsistency is reproduced under ext4 in just one round, and after switching the filesystem to xfs, the problem is no longer reproduced. So it looks like xfs guarantees consistency in this scenario. But you're right, a test case can't trigger a problem doesn't mean the problem doesn't exist. Thank you very much for the correction. 
------ test-write.sh
disk="/dev/sda"
dir="/tmp/test"

write_t()
{
	pos=0
	touch $dir/writefile
	while [ $pos -lt 16777216 ]
	do
		xfs_io -c "open -d $dir/writefile" -c "pwrite -q -b 1024 -i $dir/testfile -s $pos $pos 1024" &> /dev/null
		pos=`ls -l $dir/writefile | awk '{print $5}'`
	done
}

write_t
------------------------------------------------------------

------ test-read.sh
disk="/dev/sda"
dir="/tmp/test"

read_t()
{
	pos=0
	touch $dir/readfile
	while [ $pos -lt 16777216 ]
	do
		xfs_io -c "open -d $dir/readfile" -c "pwrite -q -b 65536 -i $dir/writefile -s $pos $pos 65536" &> /dev/null
		pos=`ls -l $dir/readfile | awk '{print $5}'`
	done
}

read_t
------------------------------------------------------------

------ test-io.sh
disk="/dev/sda"
dir="/tmp/test"

mkdir -p $dir
umount $dir
mkfs.ext4 -F $disk
#mkfs.xfs -f -m crc=1,rmapbt=0,reflink=0 $disk
mount $disk $dir

xfs_io -f -c "pwrite -S 0xff 0 16777216" $dir/testfile
sync

round=0
while true
do
	while true
	do
		echo "******* round $round ********"
		touch $dir/writefile
		taskset 0x00000001 sh test-read.sh &
		pid_read=$!
		taskset 0x0000000f sh test-write.sh &
		pid_write=$!
		wait $pid_read
		wait $pid_write
		sync
		md_raw=`md5sum $dir/testfile | awk '{print $1}'`
		md_write=`md5sum $dir/writefile | awk '{print $1}'`
		md_read=`md5sum $dir/readfile | awk '{print $1}'`
		if [ $md_write != $md_raw ]
		then
			echo "write error csum"
			echo "raw csum: $md_raw"
			echo "write csum: $md_write"
			exit 1
		elif [ $md_read != $md_write ]
		then
			echo "read error csum"
			echo "write csum: $md_write"
			echo "read csum: $md_read"
			exit 1
		fi
		rm -f $dir/writefile
		rm -f $dir/readfile
		sync
		round=$(($round+1))
	done
done
------------------------------------------------------------

Thanks!
On Wed, Dec 06, 2023 at 09:34:49PM +1100, Dave Chinner wrote: > Largely they were performance problems - unpredictable IO latency > and CPU overhead for IO meant applications would randomly miss SLAs. > The application would see IO suddenly lose all concurrency, go real > slow and/or burn lots more CPU when the inode switched to buffered > mode. > > I'm not sure that's a particularly viable model given the raw IO > throughput even cheap modern SSDs largely exceeds the capability of > buffered IO through the page cache. The differences in concurrency, > latency and throughput between buffered and DIO modes will be even > more stark itoday than they were 20 years ago.... The question is what's worse: random performance drops or random corruption. I suspect the former is less bad, especially if we have good tracepoints to pin it down.
On Tue 05-12-23 20:50:30, Baokun Li wrote: > On 2023/12/4 22:41, Jan Kara wrote: > > On Mon 04-12-23 21:50:18, Baokun Li wrote: > > > On 2023/12/4 20:11, Jan Kara wrote: > > > > On Sat 02-12-23 17:14:30, Baokun Li wrote: > > > > > Recently, while running some pressure tests on MYSQL, noticed that > > > > > occasionally a "corrupted data in log event" error would be reported. > > > > > After analyzing the error, I found that extending DIO write and buffered > > > > > read were competing, resulting in some zero-filled page end being read. > > > > > Since ext4 buffered read doesn't hold an inode lock, and there is no > > > > > field in the page to indicate the valid data size, it seems to me that > > > > > it is impossible to solve this problem perfectly without changing these > > > > > two things. > > > > Yes, combining buffered reads with direct IO writes is a recipe for > > > > problems and pretty much in the "don't do it" territory. So honestly I'd > > > > consider this a MYSQL bug. Were you able to identify why does MYSQL use > > > > buffered read in this case? It is just something specific to the test > > > > you're doing? > > > The problem is with a one-master-twoslave MYSQL database with three > > > physical machines, and using sysbench pressure testing on each of the > > > three machines, the problem occurs about once every two to three hours. > > > > > > The problem is with the relay log file, and when the problem occurs, the > > > middle dozens of bytes of the file are read as all zeros, while the data on > > > disk is not. This is a journal-like file where a write process gets the data > > > from > > > the master node and writes it locally, and another replay process reads the > > > file and performs the replay operation accordingly (some SQL statements). > > > The problem is that when replaying, it finds that the data read is > > > corrupted, > > > not valid SQL data, while the data on disk is normal. > > > > > > It's not confirmed that buffered reads vs direct IO writes is actually > > > causing this issue, but this is the only scenario that we can reproduce > > > with our local simplified scripts. Also, after merging in patch 1, the > > > MYSQL pressure test scenario has now been tested for 5 days and has not > > > been reproduced. > > > > > > I'll double-check the problem scenario, although buffered reads with > > > buffered writes doesn't seem to have this problem. > > Yeah, from what you write it seems that the replay code is using buffered > > reads on the journal file. I guess you could confirm that with a bit of > > kernel tracing but the symptoms look pretty convincing. Did you try talking > > to MYSQL guys about why they are doing this? > > The operations performed on the relay log file are buffered reads and > writes, which I confirmed with the following bpftrace script: > ``` > #include <linux/fs.h> > #include <linux/path.h> > #include <linux/dcache.h> > > kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb > *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { > printf("read path: %s\n", str(((struct kiocb > *)arg0)->ki_filp->f_path.dentry->d_name.name)); > } > > kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb > *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { > printf("write path: %s\n", str(((struct kiocb > *)arg0)->ki_filp->f_path.dentry->d_name.name)); > } > ``` > I suspect there are DIO writes causing the problem, but I haven't caught > any DIO writes to such files via bpftrace. Interesting. 
Not sure how your partially zeroed-out buffers could happen with fully buffered IO. > > > > > In this series, the first patch reads the inode size twice, and takes the > > > > > smaller of the two values as the copyout limit to avoid copying data that > > > > > was not actually read (0-padding) into the user buffer and causing data > > > > > corruption. This greatly reduces the probability of problems under 4k > > > > > page. However, the problem is still easily triggered under 64k page. > > > > > > > > > > The second patch waits for the existing dio write to complete and > > > > > invalidate the stale page cache before performing a new buffered read > > > > > in ext4, avoiding data corruption by copying the stale page cache to > > > > > the user buffer. This makes it much less likely that the problem will > > > > > be triggered in a 64k page. > > > > > > > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in > > > > > the page that indicates the size of the valid data in the page? Or does > > > > > anyone have a better idea? > > > > No, there are no plans to address this AFAIK. Because such locking will > > > > slow down all the well behaved applications to fix a corner case for > > > > application doing unsupported things. Sure we must not crash the kernel, > > > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > > > > combines buffered and direct IO but returning zeros instead of valid data > > > > is in my opinion fully within the range of acceptable behavior for such > > > > case. > > > > > > > I also feel that a scenario like buffered reads + DIO writes is strange. > > > But theoretically when read doesn't return an error, the data read > > > shouldn't be wrong. And I tested that xfs guarantees data consistency in > > > this scenario, which is why I thought it might be buggy. > > Yes, XFS has inherited stronger consistency guarantees from IRIX times than > > Linux filesystems traditionally had. We generally don't even guarantee > > buffered read vs buffered write atomicity (i.e., buffered read can see a > > torn buffered write). > > I'm a bit confused here, buffered read vs buffered write uses the same > page and appears to be protected by a memory barrier, how does the > inconsistency occur? Within the same page buffered reads and writes should be consistent because they are synchronized by the page lock. However once reads and writes involve multiple pages, there is no serialization so you can get contents of some pages before write and some pages after being written. However this doesn't seem to be your particular case here. I just wanted to point out that in general even buffered reads vs writes are not fully consistent. Honza
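A minimal userspace sketch of the multi-page case Jan describes (illustrative only; whether and how quickly it triggers depends on scheduling): one thread repeatedly rewrites a two-page file with a single buffered write while another thread reads it back, and a read that sees the two pages disagree is a torn read.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SZ (2 * 4096)			/* spans two 4k pages */

static int fd;

static void *writer(void *arg)
{
	char buf[SZ];

	for (unsigned long i = 0; ; i++) {
		memset(buf, 'A' + (i & 1), SZ);	/* whole file is 'A' or 'B' */
		pwrite(fd, buf, SZ, 0);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	char buf[SZ];

	fd = open("tornfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	memset(buf, 'A', SZ);
	pwrite(fd, buf, SZ, 0);
	pthread_create(&t, NULL, writer, NULL);

	for (;;) {
		pread(fd, buf, SZ, 0);
		if (buf[0] != buf[SZ - 1]) {	/* first and second page disagree */
			printf("torn read: %c vs %c\n", buf[0], buf[SZ - 1]);
			break;
		}
	}
	return 0;
}
```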
On Tue, Dec 05, 2023 at 09:19:03PM +0800, Baokun Li wrote: > Thank you very much for your detailed explanation! > But the downstream users do have buffered reads to read the relay log > file, as I confirmed with bpftrace. Here's an introduction to turning on > relay logging, but I'm not sure if you can access this link: > https://blog.csdn.net/javaanddonet/article/details/112596148 Well, if a mysql-supplied program is trying read the relay log using a buffered read, when it's being written using Direct I/O, that's a bug in mysql, and it should be reported as such to the mysql folks. It does look like there is a newer way of doing replication which doesn't rely on messign with log files. From: https://dev.mysql.com/doc/refman/8.0/en/replication.html MySQL 8.0 supports different methods of replication. The traditional method is based on replicating events from the source's binary log, and requires the log files and positions in them to be synchronized between source and replica. The newer method based on global transaction identifiers (GTIDs) is transactional and therefore does not require working with log files or positions within these files, which greatly simplifies many common replication tasks. Replication using GTIDs guarantees consistency between source and replica as long as all transactions committed on the source have also been applied on the replica. For more information about GTIDs and GTID-based replication in MySQL, see Section 17.1.3, “Replication with Global Transaction Identifiers”. For information on using binary log file position based replication, see Section 17.1, “Configuring Replication”. Perhaps you can try and see how mysql handles GTID-based replication using bpftrace? Cheers, - Ted
On 2023/12/7 3:37, Jan Kara wrote: > On Tue 05-12-23 20:50:30, Baokun Li wrote: >> On 2023/12/4 22:41, Jan Kara wrote: >>> On Mon 04-12-23 21:50:18, Baokun Li wrote: >>>> On 2023/12/4 20:11, Jan Kara wrote: >>>>> On Sat 02-12-23 17:14:30, Baokun Li wrote: >>>>>> Recently, while running some pressure tests on MYSQL, noticed that >>>>>> occasionally a "corrupted data in log event" error would be reported. >>>>>> After analyzing the error, I found that extending DIO write and buffered >>>>>> read were competing, resulting in some zero-filled page end being read. >>>>>> Since ext4 buffered read doesn't hold an inode lock, and there is no >>>>>> field in the page to indicate the valid data size, it seems to me that >>>>>> it is impossible to solve this problem perfectly without changing these >>>>>> two things. >>>>> Yes, combining buffered reads with direct IO writes is a recipe for >>>>> problems and pretty much in the "don't do it" territory. So honestly I'd >>>>> consider this a MYSQL bug. Were you able to identify why does MYSQL use >>>>> buffered read in this case? It is just something specific to the test >>>>> you're doing? >>>> The problem is with a one-master-twoslave MYSQL database with three >>>> physical machines, and using sysbench pressure testing on each of the >>>> three machines, the problem occurs about once every two to three hours. >>>> >>>> The problem is with the relay log file, and when the problem occurs, the >>>> middle dozens of bytes of the file are read as all zeros, while the data on >>>> disk is not. This is a journal-like file where a write process gets the data >>>> from >>>> the master node and writes it locally, and another replay process reads the >>>> file and performs the replay operation accordingly (some SQL statements). >>>> The problem is that when replaying, it finds that the data read is >>>> corrupted, >>>> not valid SQL data, while the data on disk is normal. >>>> >>>> It's not confirmed that buffered reads vs direct IO writes is actually >>>> causing this issue, but this is the only scenario that we can reproduce >>>> with our local simplified scripts. Also, after merging in patch 1, the >>>> MYSQL pressure test scenario has now been tested for 5 days and has not >>>> been reproduced. >>>> >>>> I'll double-check the problem scenario, although buffered reads with >>>> buffered writes doesn't seem to have this problem. >>> Yeah, from what you write it seems that the replay code is using buffered >>> reads on the journal file. I guess you could confirm that with a bit of >>> kernel tracing but the symptoms look pretty convincing. Did you try talking >>> to MYSQL guys about why they are doing this? >> The operations performed on the relay log file are buffered reads and >> writes, which I confirmed with the following bpftrace script: >> ``` >> #include <linux/fs.h> >> #include <linux/path.h> >> #include <linux/dcache.h> >> >> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb >> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { >> printf("read path: %s\n", str(((struct kiocb >> *)arg0)->ki_filp->f_path.dentry->d_name.name)); >> } >> >> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb >> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { >> printf("write path: %s\n", str(((struct kiocb >> *)arg0)->ki_filp->f_path.dentry->d_name.name)); >> } >> ``` >> I suspect there are DIO writes causing the problem, but I haven't caught >> any DIO writes to such files via bpftrace. > Interesting. 
Not sure how your partially zeroed-out buffers could happen > with fully buffered IO. I'm still in the process of locating this issue and will sync here if there is any progress. >>>>>> In this series, the first patch reads the inode size twice, and takes the >>>>>> smaller of the two values as the copyout limit to avoid copying data that >>>>>> was not actually read (0-padding) into the user buffer and causing data >>>>>> corruption. This greatly reduces the probability of problems under 4k >>>>>> page. However, the problem is still easily triggered under 64k page. >>>>>> >>>>>> The second patch waits for the existing dio write to complete and >>>>>> invalidate the stale page cache before performing a new buffered read >>>>>> in ext4, avoiding data corruption by copying the stale page cache to >>>>>> the user buffer. This makes it much less likely that the problem will >>>>>> be triggered in a 64k page. >>>>>> >>>>>> Do we have a plan to add a lock to the ext4 buffered read or a field in >>>>>> the page that indicates the size of the valid data in the page? Or does >>>>>> anyone have a better idea? >>>>> No, there are no plans to address this AFAIK. Because such locking will >>>>> slow down all the well behaved applications to fix a corner case for >>>>> application doing unsupported things. Sure we must not crash the kernel, >>>>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app >>>>> combines buffered and direct IO but returning zeros instead of valid data >>>>> is in my opinion fully within the range of acceptable behavior for such >>>>> case. >>>>> >>>> I also feel that a scenario like buffered reads + DIO writes is strange. >>>> But theoretically when read doesn't return an error, the data read >>>> shouldn't be wrong. And I tested that xfs guarantees data consistency in >>>> this scenario, which is why I thought it might be buggy. >>> Yes, XFS has inherited stronger consistency guarantees from IRIX times than >>> Linux filesystems traditionally had. We generally don't even guarantee >>> buffered read vs buffered write atomicity (i.e., buffered read can see a >>> torn buffered write). >> I'm a bit confused here, buffered read vs buffered write uses the same >> page and appears to be protected by a memory barrier, how does the >> inconsistency occur? > Within the same page buffered reads and writes should be consistent because > they are synchronized by the page lock. However once reads and writes > involve multiple pages, there is no serialization so you can get contents > of some pages before write and some pages after being written. However this > doesn't seem to be your particular case here. I just wanted to point out > that in general even buffered reads vs writes are not fully consistent. > > Honza > Yes, when writing multiple pages, it is possible to read a part of the pages that has been updated and a part of the pages that has not been updated. Thank you very much for answering my query! Thanks!
On 2023/12/7 5:55, Theodore Ts'o wrote: > On Tue, Dec 05, 2023 at 09:19:03PM +0800, Baokun Li wrote: >> Thank you very much for your detailed explanation! >> But the downstream users do have buffered reads to read the relay log >> file, as I confirmed with bpftrace. Here's an introduction to turning on >> relay logging, but I'm not sure if you can access this link: >> https://blog.csdn.net/javaanddonet/article/details/112596148 > Well, if a mysql-supplied program is trying read the relay log using a > buffered read, when it's being written using Direct I/O, that's a bug > in mysql, and it should be reported as such to the mysql folks. I was mistaken here, it now looks like reads and writes to the relay log are buffered IO, and I'm still trying to locate the issue. > > It does look like there is a newer way of doing replication which > doesn't rely on messign with log files. From: > > https://dev.mysql.com/doc/refman/8.0/en/replication.html > > MySQL 8.0 supports different methods of replication. The > traditional method is based on replicating events from the > source's binary log, and requires the log files and positions in > them to be synchronized between source and replica. The newer > method based on global transaction identifiers (GTIDs) is > transactional and therefore does not require working with log > files or positions within these files, which greatly simplifies > many common replication tasks. Replication using GTIDs guarantees > consistency between source and replica as long as all transactions > committed on the source have also been applied on the replica. For > more information about GTIDs and GTID-based replication in MySQL, > see Section 17.1.3, “Replication with Global Transaction > Identifiers”. For information on using binary log file position > based replication, see Section 17.1, “Configuring Replication”. > > Perhaps you can try and see how mysql handles GTID-based replication > using bpftrace? > > Cheers, > > - Ted Thank you very much for your solution! I'll try it. Thanks!
On 2023/12/7 3:37, Jan Kara wrote:
> On Tue 05-12-23 20:50:30, Baokun Li wrote:
>> On 2023/12/4 22:41, Jan Kara wrote:
>>> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>>>> On 2023/12/4 20:11, Jan Kara wrote:
>>>> The problem is with a one-master-twoslave MYSQL database with three
>>>> physical machines, and using sysbench pressure testing on each of the
>>>> three machines, the problem occurs about once every two to three hours.
>>>>
>>>> The problem is with the relay log file, and when the problem occurs, the
>>>> middle dozens of bytes of the file are read as all zeros, while the data on
>>>> disk is not. This is a journal-like file where a write process gets the data
>>>> from the master node and writes it locally, and another replay process reads
>>>> the file and performs the replay operation accordingly (some SQL statements).
>>>> The problem is that when replaying, it finds that the data read is corrupted,
>>>> not valid SQL data, while the data on disk is normal.
>>>>
>>>> It's not confirmed that buffered reads vs direct IO writes is actually
>>>> causing this issue, but this is the only scenario that we can reproduce
>>>> with our local simplified scripts. Also, after merging in patch 1, the
>>>> MYSQL pressure test scenario has now been tested for 5 days and has not
>>>> been reproduced.
>>>>
>>>> I'll double-check the problem scenario, although buffered reads with
>>>> buffered writes doesn't seem to have this problem.
>>> Yeah, from what you write it seems that the replay code is using buffered
>>> reads on the journal file. I guess you could confirm that with a bit of
>>> kernel tracing but the symptoms look pretty convincing. Did you try talking
>>> to MYSQL guys about why they are doing this?
>> The operations performed on the relay log file are buffered reads and
>> writes, which I confirmed with the following bpftrace script:
>> ```
>> #include <linux/fs.h>
>> #include <linux/path.h>
>> #include <linux/dcache.h>
>>
>> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>> printf("read path: %s\n", str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>>
>> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>> printf("write path: %s\n", str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>> ```
>> I suspect there are DIO writes causing the problem, but I haven't caught
>> any DIO writes to such files via bpftrace.
> Interesting. Not sure how your partially zeroed-out buffers could happen
> with fully buffered IO.

After looking at the code again and again, the following concurrency
seems to bypass the memory barrier:

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)      // write isize 2048
   unlock_page(page)

                  ext4_file_read_iter
                   generic_file_read_iter
                    filemap_read
                     filemap_get_pages
                      filemap_get_read_batch
                       folio_test_uptodate(folio)
                        ret = test_bit(PG_uptodate, folio_flags(folio, 0));
                        if (ret)
                                smp_rmb();
                        // The read barrier here ensures
                        // that data 0-2048 in the page is synchronized.

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)      // write isize 4096
   unlock_page(page)

                     // read isize 4096
                     isize = i_size_read(inode)
                     // But there is no read barrier here,
                     // so the data in the 2048-4096 range
                     // may not be synchronized yet !!!
                     copy_page_to_iter()   // copyout 4096

In the concurrency above, we read the updated i_size, but there is
no read barrier to ensure that the data in the page is the same as
the i_size at this point. Therefore, we may copy the unsynchronized
page out. Is it normal for us to read zero-filled data in this case?

Thanks!
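Stripped of the call chains, the ordering problem above can be modelled in
userspace with C11 atomics. The sketch below is purely illustrative (the
names and sizes are made up, and the fences stand in for the kernel's
smp_wmb()/smp_rmb()); it is not the kernel code itself:

```c
#include <stdatomic.h>
#include <string.h>

#define PAGE_SZ 4096

static char page[PAGE_SZ];      /* stands in for the page cache folio */
static _Atomic long isize;      /* stands in for inode->i_size        */

/* Writer: the second buffered write in the trace, extending 2048 -> 4096. */
void writer_extend(const char *src)
{
        memcpy(page + 2048, src, 2048);                 /* fill bytes 2048..4095 */
        atomic_thread_fence(memory_order_release);      /* kernel: smp_wmb()     */
        atomic_store_explicit(&isize, 4096, memory_order_relaxed);  /* i_size_write() */
}

/* Reader: the filemap_read() side of the trace. */
long reader_copyout(char *dst)
{
        /* folio_test_uptodate() already succeeded, including its smp_rmb(). */
        long size = atomic_load_explicit(&isize, memory_order_relaxed);  /* i_size_read() */
        /*
         * Nothing here orders the data loads below against the size load,
         * so a weakly ordered CPU may return stale (zeroed) bytes in the
         * 2048..4095 range even though size == 4096 was observed.
         * Adding atomic_thread_fence(memory_order_acquire) here -- the
         * userspace analogue of smp_rmb() -- closes the window.
         */
        memcpy(dst, page, size);                        /* copy_page_to_iter()   */
        return size;
}
```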
On Thu 07-12-23 22:15:55, Baokun Li wrote: > On 2023/12/7 3:37, Jan Kara wrote: > > On Tue 05-12-23 20:50:30, Baokun Li wrote: > > > On 2023/12/4 22:41, Jan Kara wrote: > > > > On Mon 04-12-23 21:50:18, Baokun Li wrote: > > > > > On 2023/12/4 20:11, Jan Kara wrote: > > > > > The problem is with a one-master-twoslave MYSQL database with three > > > > > physical machines, and using sysbench pressure testing on each of the > > > > > three machines, the problem occurs about once every two to three hours. > > > > > > > > > > The problem is with the relay log file, and when the problem occurs, the > > > > > middle dozens of bytes of the file are read as all zeros, while the data on > > > > > disk is not. This is a journal-like file where a write process gets the data > > > > > from > > > > > the master node and writes it locally, and another replay process reads the > > > > > file and performs the replay operation accordingly (some SQL statements). > > > > > The problem is that when replaying, it finds that the data read is > > > > > corrupted, > > > > > not valid SQL data, while the data on disk is normal. > > > > > > > > > > It's not confirmed that buffered reads vs direct IO writes is actually > > > > > causing this issue, but this is the only scenario that we can reproduce > > > > > with our local simplified scripts. Also, after merging in patch 1, the > > > > > MYSQL pressure test scenario has now been tested for 5 days and has not > > > > > been reproduced. > > > > > > > > > > I'll double-check the problem scenario, although buffered reads with > > > > > buffered writes doesn't seem to have this problem. > > > > Yeah, from what you write it seems that the replay code is using buffered > > > > reads on the journal file. I guess you could confirm that with a bit of > > > > kernel tracing but the symptoms look pretty convincing. Did you try talking > > > > to MYSQL guys about why they are doing this? > > > The operations performed on the relay log file are buffered reads and > > > writes, which I confirmed with the following bpftrace script: > > > ``` > > > #include <linux/fs.h> > > > #include <linux/path.h> > > > #include <linux/dcache.h> > > > > > > kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb > > > *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { > > > printf("read path: %s\n", str(((struct kiocb > > > *)arg0)->ki_filp->f_path.dentry->d_name.name)); > > > } > > > > > > kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb > > > *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { > > > printf("write path: %s\n", str(((struct kiocb > > > *)arg0)->ki_filp->f_path.dentry->d_name.name)); > > > } > > > ``` > > > I suspect there are DIO writes causing the problem, but I haven't caught > > > any DIO writes to such files via bpftrace. > > Interesting. Not sure how your partially zeroed-out buffers could happen > > with fully buffered IO. 
> >
> After looking at the code again and again, the following concurrency
> seems to bypass the memory barrier:
>
> ext4_buffered_write_iter
>  generic_perform_write
>   copy_page_from_iter_atomic
>   ext4_da_write_end
>    ext4_da_do_write_end
>     block_write_end
>      __block_commit_write
>       folio_mark_uptodate
>        smp_wmb()
>        set_bit(PG_uptodate, folio_flags(folio, 0))
>     i_size_write(inode, pos + copied)      // write isize 2048
>    unlock_page(page)
>
>                   ext4_file_read_iter
>                    generic_file_read_iter
>                     filemap_read
>                      filemap_get_pages
>                       filemap_get_read_batch
>                        folio_test_uptodate(folio)
>                         ret = test_bit(PG_uptodate, folio_flags(folio, 0));
>                         if (ret)
>                                 smp_rmb();
>                         // The read barrier here ensures
>                         // that data 0-2048 in the page is synchronized.
>
> ext4_buffered_write_iter
>  generic_perform_write
>   copy_page_from_iter_atomic
>   ext4_da_write_end
>    ext4_da_do_write_end
>     block_write_end
>      __block_commit_write
>       folio_mark_uptodate
>        smp_wmb()
>        set_bit(PG_uptodate, folio_flags(folio, 0))
>     i_size_write(inode, pos + copied)      // write isize 4096
>    unlock_page(page)
>
>                      // read isize 4096
>                      isize = i_size_read(inode)
>                      // But there is no read barrier here,
>                      // so the data in the 2048-4096 range
>                      // may not be synchronized yet !!!
>                      copy_page_to_iter()   // copyout 4096
>
> In the concurrency above, we read the updated i_size, but there is
> no read barrier to ensure that the data in the page is the same as
> the i_size at this point. Therefore, we may copy the unsynchronized
> page out. Is it normal for us to read zero-filled data in this case?

Indeed, I have checked, and filemap_read() (this dates back even to the old
do_generic_file_read() code) does copy data only after checking the uptodate
flag and then sampling i_size, so we may be copying state from the middle of
the racing write, and there is nothing which would prevent prefetching page
data before fetching the inode size. I agree this is kind of nasty, so I
think adding a read barrier between i_size_read() and copy_page_to_iter()
makes sense. Does it fix your issue with MYSQL?

								Honza
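For reference, a rough sketch of where such a barrier could sit in
filemap_read() (mm/filemap.c). The surrounding lines are paraphrased from
the reader and the comment wording is only a suggestion; the exact mainline
placement of the eventual fix may differ:

```c
	/* filemap_read(), sketch: after the batch of uptodate folios was collected */
	isize = i_size_read(inode);
	if (unlikely(iocb->ki_pos >= isize))
		goto put_folios;

	/*
	 * Pairs with the smp_wmb() issued on the write side before
	 * PG_uptodate and i_size are updated (e.g. via folio_mark_uptodate()):
	 * once the new i_size has been observed, make sure the folio contents
	 * published before that i_size update are visible before copying
	 * them out.
	 */
	smp_rmb();

	end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
```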
On 2023/12/12 1:49, Jan Kara wrote: > On Thu 07-12-23 22:15:55, Baokun Li wrote: >> On 2023/12/7 3:37, Jan Kara wrote: >>> On Tue 05-12-23 20:50:30, Baokun Li wrote: >>>> On 2023/12/4 22:41, Jan Kara wrote: >>>>> On Mon 04-12-23 21:50:18, Baokun Li wrote: >>>>>> On 2023/12/4 20:11, Jan Kara wrote: >>>>>> The problem is with a one-master-twoslave MYSQL database with three >>>>>> physical machines, and using sysbench pressure testing on each of the >>>>>> three machines, the problem occurs about once every two to three hours. >>>>>> >>>>>> The problem is with the relay log file, and when the problem occurs, the >>>>>> middle dozens of bytes of the file are read as all zeros, while the data on >>>>>> disk is not. This is a journal-like file where a write process gets the data >>>>>> from >>>>>> the master node and writes it locally, and another replay process reads the >>>>>> file and performs the replay operation accordingly (some SQL statements). >>>>>> The problem is that when replaying, it finds that the data read is >>>>>> corrupted, >>>>>> not valid SQL data, while the data on disk is normal. >>>>>> >>>>>> It's not confirmed that buffered reads vs direct IO writes is actually >>>>>> causing this issue, but this is the only scenario that we can reproduce >>>>>> with our local simplified scripts. Also, after merging in patch 1, the >>>>>> MYSQL pressure test scenario has now been tested for 5 days and has not >>>>>> been reproduced. >>>>>> >>>>>> I'll double-check the problem scenario, although buffered reads with >>>>>> buffered writes doesn't seem to have this problem. >>>>> Yeah, from what you write it seems that the replay code is using buffered >>>>> reads on the journal file. I guess you could confirm that with a bit of >>>>> kernel tracing but the symptoms look pretty convincing. Did you try talking >>>>> to MYSQL guys about why they are doing this? >>>> The operations performed on the relay log file are buffered reads and >>>> writes, which I confirmed with the following bpftrace script: >>>> ``` >>>> #include <linux/fs.h> >>>> #include <linux/path.h> >>>> #include <linux/dcache.h> >>>> >>>> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb >>>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { >>>> printf("read path: %s\n", str(((struct kiocb >>>> *)arg0)->ki_filp->f_path.dentry->d_name.name)); >>>> } >>>> >>>> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb >>>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { >>>> printf("write path: %s\n", str(((struct kiocb >>>> *)arg0)->ki_filp->f_path.dentry->d_name.name)); >>>> } >>>> ``` >>>> I suspect there are DIO writes causing the problem, but I haven't caught >>>> any DIO writes to such files via bpftrace. >>> Interesting. Not sure how your partially zeroed-out buffers could happen >>> with fully buffered IO. 
>>>
>> After looking at the code again and again, the following concurrency
>> seems to bypass the memory barrier:
>>
>> ext4_buffered_write_iter
>>  generic_perform_write
>>   copy_page_from_iter_atomic
>>   ext4_da_write_end
>>    ext4_da_do_write_end
>>     block_write_end
>>      __block_commit_write
>>       folio_mark_uptodate
>>        smp_wmb()
>>        set_bit(PG_uptodate, folio_flags(folio, 0))
>>     i_size_write(inode, pos + copied)      // write isize 2048
>>    unlock_page(page)
>>
>>                   ext4_file_read_iter
>>                    generic_file_read_iter
>>                     filemap_read
>>                      filemap_get_pages
>>                       filemap_get_read_batch
>>                        folio_test_uptodate(folio)
>>                         ret = test_bit(PG_uptodate, folio_flags(folio, 0));
>>                         if (ret)
>>                                 smp_rmb();
>>                         // The read barrier here ensures
>>                         // that data 0-2048 in the page is synchronized.
>>
>> ext4_buffered_write_iter
>>  generic_perform_write
>>   copy_page_from_iter_atomic
>>   ext4_da_write_end
>>    ext4_da_do_write_end
>>     block_write_end
>>      __block_commit_write
>>       folio_mark_uptodate
>>        smp_wmb()
>>        set_bit(PG_uptodate, folio_flags(folio, 0))
>>     i_size_write(inode, pos + copied)      // write isize 4096
>>    unlock_page(page)
>>
>>                      // read isize 4096
>>                      isize = i_size_read(inode)
>>                      // But there is no read barrier here,
>>                      // so the data in the 2048-4096 range
>>                      // may not be synchronized yet !!!
>>                      copy_page_to_iter()   // copyout 4096
>>
>> In the concurrency above, we read the updated i_size, but there is
>> no read barrier to ensure that the data in the page is the same as
>> the i_size at this point. Therefore, we may copy the unsynchronized
>> page out. Is it normal for us to read zero-filled data in this case?
> Indeed, I have checked, and filemap_read() (this dates back even to the old
> do_generic_file_read() code) does copy data only after checking the uptodate
> flag and then sampling i_size, so we may be copying state from the middle of
> the racing write, and there is nothing which would prevent prefetching page
> data before fetching the inode size. I agree this is kind of nasty, so I
> think adding a read barrier between i_size_read() and copy_page_to_iter()
> makes sense. Does it fix your issue with MYSQL?
>
> Honza

Thank you very much for confirming this issue!

Adding a read barrier between i_size_read() and copy_page_to_iter() does
seem to have solved the problem: the environment that previously reproduced
the corruption about once every 2-3 hours has now been running for more
than 60 hours without reproducing it after the read barrier was added.

This is a Load-Load reordering issue, which only occurs on some weakly
ordered architectures (e.g. ARM64, Alpha) but not on strongly ordered
architectures (e.g. x86). The issue was in fact discovered while migrating
MYSQL services from x86 to ARM64. Theoretically the problem is not limited
to ext4: filesystems that call filemap_read() without holding the inode
lock (e.g. btrfs, f2fs, ubifs, ...) can hit it, while filesystems that take
the inode lock for buffered reads (e.g. xfs, nfs) cannot.

I will send the patch that adds the memory barrier shortly.

Thanks!
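The Load-Load reordering described here is the classic message-passing (MP)
pattern, and it can be checked against the Linux kernel memory model with
the herd7 tooling under tools/memory-model. The litmus test below is a
hand-written sketch of the unfixed reader (the variable names are made up
stand-ins for the page data and i_size), not one of the stock tests:

```c
C MP-isize-data-sketch

{}

P0(int *data, int *isize)
{
	WRITE_ONCE(*data, 1);     /* write bytes 2048..4095    */
	smp_wmb();                /* folio_mark_uptodate()     */
	WRITE_ONCE(*isize, 1);    /* i_size_write(inode, 4096) */
}

P1(int *data, int *isize)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*isize);   /* i_size_read(inode)                */
	                          /* no smp_rmb() here in the old code */
	r1 = READ_ONCE(*data);    /* copy_page_to_iter()               */
}

exists (1:r0=1 /\ 1:r1=0)
```

Run through herd7 with the LKMM, the listed outcome (new size observed,
stale data read) should be reported as possible; inserting smp_rmb()
between the two reads in P1, as in the stock MP litmus tests shipped with
the model, makes it forbidden.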
On Wed, Dec 06, 2023 at 08:37:57PM +0100, Jan Kara wrote: > Within the same page buffered reads and writes should be consistent because > they are synchronized by the page lock. However once reads and writes > involve multiple pages, there is no serialization so you can get contents > of some pages before write and some pages after being written. However this > doesn't seem to be your particular case here. I just wanted to point out > that in general even buffered reads vs writes are not fully consistent. Buffered reads don't take the page/folio lock. We only use the folio lock to avoid reading stale data from the page cache while we're fetching the data from storage. Once the uptodate flag is set on the folio, we never take the folio lock for reads.
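The locking policy Willy describes can be sketched roughly as below. This is
simplified pseudo-C of the buffered read path, and copy_folio_to_user() /
fill_folio_from_storage() are made-up helper names standing in for the real
code:

```c
/* Simplified sketch of the buffered read path's use of the folio lock. */
folio = filemap_get_folio(mapping, index);

if (folio_test_uptodate(folio)) {
	/* Fast path: contents are already valid; no folio lock is taken. */
	copy_folio_to_user(folio, iter);
} else {
	/*
	 * Slow path: the lock only serializes filling the folio from
	 * storage, so nobody copies out a half-read (stale) folio.
	 */
	folio_lock(folio);
	if (!folio_test_uptodate(folio))        /* re-check under the lock */
		fill_folio_from_storage(folio); /* ->read_folio()          */
	folio_unlock(folio);
	copy_folio_to_user(folio, iter);
}
```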
On Tue 12-12-23 04:36:45, Matthew Wilcox wrote: > On Wed, Dec 06, 2023 at 08:37:57PM +0100, Jan Kara wrote: > > Within the same page buffered reads and writes should be consistent because > > they are synchronized by the page lock. However once reads and writes > > involve multiple pages, there is no serialization so you can get contents > > of some pages before write and some pages after being written. However this > > doesn't seem to be your particular case here. I just wanted to point out > > that in general even buffered reads vs writes are not fully consistent. > > Buffered reads don't take the page/folio lock. We only use the folio > lock to avoid reading stale data from the page cache while we're > fetching the data from storage. Once the uptodate flag is set on the > folio, we never take the folio lock for reads. Right, I've noticed once I've reread the code but thanks for correction anyway. :) Honza