Message ID: 20231202091432.8349-1-libaokun1@huawei.com (mailing list archive)
Series: mm/ext4: avoid data corruption when extending DIO write race with buffered read
Hello! On Sat 02-12-23 17:14:30, Baokun Li wrote: > Recently, while running some pressure tests on MYSQL, noticed that > occasionally a "corrupted data in log event" error would be reported. > After analyzing the error, I found that extending DIO write and buffered > read were competing, resulting in some zero-filled page end being read. > Since ext4 buffered read doesn't hold an inode lock, and there is no > field in the page to indicate the valid data size, it seems to me that > it is impossible to solve this problem perfectly without changing these > two things. Yes, combining buffered reads with direct IO writes is a recipe for problems and pretty much in the "don't do it" territory. So honestly I'd consider this a MYSQL bug. Were you able to identify why does MYSQL use buffered read in this case? It is just something specific to the test you're doing? > In this series, the first patch reads the inode size twice, and takes the > smaller of the two values as the copyout limit to avoid copying data that > was not actually read (0-padding) into the user buffer and causing data > corruption. This greatly reduces the probability of problems under 4k > page. However, the problem is still easily triggered under 64k page. > > The second patch waits for the existing dio write to complete and > invalidate the stale page cache before performing a new buffered read > in ext4, avoiding data corruption by copying the stale page cache to > the user buffer. This makes it much less likely that the problem will > be triggered in a 64k page. > > Do we have a plan to add a lock to the ext4 buffered read or a field in > the page that indicates the size of the valid data in the page? Or does > anyone have a better idea? No, there are no plans to address this AFAIK. Because such locking will slow down all the well behaved applications to fix a corner case for application doing unsupported things. Sure we must not crash the kernel, corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app combines buffered and direct IO but returning zeros instead of valid data is in my opinion fully within the range of acceptable behavior for such case. Honza
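For reference, the clamping idea described for the first patch can be sketched roughly as below. This is only an illustration of the approach in the cover letter, not the actual patch; the helper name and parameters are made up for the example.

```c
#include <linux/fs.h>

/*
 * Sketch of "read i_size twice and take the smaller value as the copyout
 * limit": isize_seen is the size sampled earlier, and the copy to
 * userspace is clamped so that bytes beyond the smaller of the two
 * samples are never copied out.
 */
static size_t clamp_copy_to_isize(struct inode *inode, loff_t pos,
				  size_t len, loff_t isize_seen)
{
	loff_t isize = min(isize_seen, i_size_read(inode));

	if (pos >= isize)
		return 0;
	return min_t(loff_t, len, isize - pos);
}
```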
On 2023/12/4 20:11, Jan Kara wrote: > Hello! Thank you for your reply! > > On Sat 02-12-23 17:14:30, Baokun Li wrote: >> Recently, while running some pressure tests on MYSQL, noticed that >> occasionally a "corrupted data in log event" error would be reported. >> After analyzing the error, I found that extending DIO write and buffered >> read were competing, resulting in some zero-filled page end being read. >> Since ext4 buffered read doesn't hold an inode lock, and there is no >> field in the page to indicate the valid data size, it seems to me that >> it is impossible to solve this problem perfectly without changing these >> two things. > Yes, combining buffered reads with direct IO writes is a recipe for > problems and pretty much in the "don't do it" territory. So honestly I'd > consider this a MYSQL bug. Were you able to identify why does MYSQL use > buffered read in this case? It is just something specific to the test > you're doing? The problem is with a one-master-twoslave MYSQL database with three physical machines, and using sysbench pressure testing on each of the three machines, the problem occurs about once every two to three hours. The problem is with the relay log file, and when the problem occurs, the middle dozens of bytes of the file are read as all zeros, while the data on disk is not. This is a journal-like file where a write process gets the data from the master node and writes it locally, and another replay process reads the file and performs the replay operation accordingly (some SQL statements). The problem is that when replaying, it finds that the data read is corrupted, not valid SQL data, while the data on disk is normal. It's not confirmed that buffered reads vs direct IO writes is actually causing this issue, but this is the only scenario that we can reproduce with our local simplified scripts. Also, after merging in patch 1, the MYSQL pressure test scenario has now been tested for 5 days and has not been reproduced. I'll double-check the problem scenario, although buffered reads with buffered writes doesn't seem to have this problem. >> In this series, the first patch reads the inode size twice, and takes the >> smaller of the two values as the copyout limit to avoid copying data that >> was not actually read (0-padding) into the user buffer and causing data >> corruption. This greatly reduces the probability of problems under 4k >> page. However, the problem is still easily triggered under 64k page. >> >> The second patch waits for the existing dio write to complete and >> invalidate the stale page cache before performing a new buffered read >> in ext4, avoiding data corruption by copying the stale page cache to >> the user buffer. This makes it much less likely that the problem will >> be triggered in a 64k page. >> >> Do we have a plan to add a lock to the ext4 buffered read or a field in >> the page that indicates the size of the valid data in the page? Or does >> anyone have a better idea? > No, there are no plans to address this AFAIK. Because such locking will > slow down all the well behaved applications to fix a corner case for > application doing unsupported things. Sure we must not crash the kernel, > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > combines buffered and direct IO but returning zeros instead of valid data > is in my opinion fully within the range of acceptable behavior for such > case. > > Honza I also feel that a scenario like buffered reads + DIO writes is strange. 
But theoretically, when a read doesn't return an error, the data it returns shouldn't be wrong. And I tested that xfs guarantees data consistency in this scenario, which is why I thought it might be a bug. Thanks!
On Mon 04-12-23 21:50:18, Baokun Li wrote: > On 2023/12/4 20:11, Jan Kara wrote: > > On Sat 02-12-23 17:14:30, Baokun Li wrote: > > > Recently, while running some pressure tests on MYSQL, noticed that > > > occasionally a "corrupted data in log event" error would be reported. > > > After analyzing the error, I found that extending DIO write and buffered > > > read were competing, resulting in some zero-filled page end being read. > > > Since ext4 buffered read doesn't hold an inode lock, and there is no > > > field in the page to indicate the valid data size, it seems to me that > > > it is impossible to solve this problem perfectly without changing these > > > two things. > > Yes, combining buffered reads with direct IO writes is a recipe for > > problems and pretty much in the "don't do it" territory. So honestly I'd > > consider this a MYSQL bug. Were you able to identify why does MYSQL use > > buffered read in this case? It is just something specific to the test > > you're doing? > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, the > middle dozens of bytes of the file are read as all zeros, while the data on > disk is not. This is a journal-like file where a write process gets the data > from > the master node and writes it locally, and another replay process reads the > file and performs the replay operation accordingly (some SQL statements). > The problem is that when replaying, it finds that the data read is > corrupted, > not valid SQL data, while the data on disk is normal. > > It's not confirmed that buffered reads vs direct IO writes is actually > causing this issue, but this is the only scenario that we can reproduce > with our local simplified scripts. Also, after merging in patch 1, the > MYSQL pressure test scenario has now been tested for 5 days and has not > been reproduced. > > I'll double-check the problem scenario, although buffered reads with > buffered writes doesn't seem to have this problem. Yeah, from what you write it seems that the replay code is using buffered reads on the journal file. I guess you could confirm that with a bit of kernel tracing but the symptoms look pretty convincing. Did you try talking to MYSQL guys about why they are doing this? > > > In this series, the first patch reads the inode size twice, and takes the > > > smaller of the two values as the copyout limit to avoid copying data that > > > was not actually read (0-padding) into the user buffer and causing data > > > corruption. This greatly reduces the probability of problems under 4k > > > page. However, the problem is still easily triggered under 64k page. > > > > > > The second patch waits for the existing dio write to complete and > > > invalidate the stale page cache before performing a new buffered read > > > in ext4, avoiding data corruption by copying the stale page cache to > > > the user buffer. This makes it much less likely that the problem will > > > be triggered in a 64k page. > > > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in > > > the page that indicates the size of the valid data in the page? Or does > > > anyone have a better idea? > > No, there are no plans to address this AFAIK. 
Because such locking will > > slow down all the well behaved applications to fix a corner case for > > application doing unsupported things. Sure we must not crash the kernel, > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > > combines buffered and direct IO but returning zeros instead of valid data > > is in my opinion fully within the range of acceptable behavior for such > > case. > > > I also feel that a scenario like buffered reads + DIO writes is strange. > But theoretically when read doesn't return an error, the data read > shouldn't be wrong. And I tested that xfs guarantees data consistency in > this scenario, which is why I thought it might be buggy. Yes, XFS has inherited stronger consistency guarantees from IRIX times than Linux filesystems traditionally had. We generally don't even guarantee buffered read vs buffered write atomicity (i.e., buffered read can see a torn buffered write). Honza
On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, > the middle dozens of bytes of the file are read as all zeros, while > the data on disk is not. This is a journal-like file where a write > process gets the data from the master node and writes it locally, > and another replay process reads the file and performs the replay > operation accordingly (some SQL statements). The problem is that > when replaying, it finds that the data read is corrupted, not valid > SQL data, while the data on disk is normal. You mentioned "scripts" --- are these locally developped scripts by any chance? The procedure suggested in a few places that I looked up don't involve needing to read the replay log. For example from[1]: On the master server: root@repl-master:~# mysql -uroot -p; mysql> CREATE USER ‘slave’@’12.34.56.789‘ IDENTIFIED BY ‘SLAVE_PASSWORD‘; mysql> GRANT REPLICATION SLAVE ON . TO ‘slave’@’12.34.56.222 ‘; mysql> FLUSH PRIVILEGES; mysql> FLUSH TABLES WITH READ LOCK; This will make the master server read-only, with all pending writes flushed out (so you don't need to worry about the replay log), and then you move the data from the master to slave: root@repl-master:~# mysqldump -u root -p –all-databases –master-data > data.sql root@repl-master:~# scp data.sql root@12.34.56.222 Then on the slave: root@repl-slave:~# mysql -uroot -p < data.sql root@repl-slave:~# mysql -uroot -p; mysql> STOP SLAVE; ... and then on the master: root@repl-master:~# mysql -uroot -p; mysql> UNLOCK TABLES; ... and back on the slave: root@repl-slave:~# mysql -uroot -p; mysql> START SLAVE; [1] https://hevodata.com/learn/mysql-master-slave-replication/ ... or you could buy the product advertised at [1] which is easier for the database administrators, but results in $$$ flowing to the Hevo company. :-) In any case, I'm pretty sure that the official documented way of setting up a failover replication setup doesn't involve buffered reads of the replay file. It is certainly the case that mysqldump uses buffered reads, but that's why you have to temporary make the database read-only using "FLUSH TABLES WITH READ LOCK" before taking a database snapshot, and then re-enable database updates the "UNLOCK TABLES" SQL commands. Cheers, - Ted
On 2023/12/4 22:41, Jan Kara wrote: > On Mon 04-12-23 21:50:18, Baokun Li wrote: >> On 2023/12/4 20:11, Jan Kara wrote: >>> On Sat 02-12-23 17:14:30, Baokun Li wrote: >>>> Recently, while running some pressure tests on MYSQL, noticed that >>>> occasionally a "corrupted data in log event" error would be reported. >>>> After analyzing the error, I found that extending DIO write and buffered >>>> read were competing, resulting in some zero-filled page end being read. >>>> Since ext4 buffered read doesn't hold an inode lock, and there is no >>>> field in the page to indicate the valid data size, it seems to me that >>>> it is impossible to solve this problem perfectly without changing these >>>> two things. >>> Yes, combining buffered reads with direct IO writes is a recipe for >>> problems and pretty much in the "don't do it" territory. So honestly I'd >>> consider this a MYSQL bug. Were you able to identify why does MYSQL use >>> buffered read in this case? It is just something specific to the test >>> you're doing? >> The problem is with a one-master-twoslave MYSQL database with three >> physical machines, and using sysbench pressure testing on each of the >> three machines, the problem occurs about once every two to three hours. >> >> The problem is with the relay log file, and when the problem occurs, the >> middle dozens of bytes of the file are read as all zeros, while the data on >> disk is not. This is a journal-like file where a write process gets the data >> from >> the master node and writes it locally, and another replay process reads the >> file and performs the replay operation accordingly (some SQL statements). >> The problem is that when replaying, it finds that the data read is >> corrupted, >> not valid SQL data, while the data on disk is normal. >> >> It's not confirmed that buffered reads vs direct IO writes is actually >> causing this issue, but this is the only scenario that we can reproduce >> with our local simplified scripts. Also, after merging in patch 1, the >> MYSQL pressure test scenario has now been tested for 5 days and has not >> been reproduced. >> >> I'll double-check the problem scenario, although buffered reads with >> buffered writes doesn't seem to have this problem. > Yeah, from what you write it seems that the replay code is using buffered > reads on the journal file. I guess you could confirm that with a bit of > kernel tracing but the symptoms look pretty convincing. Did you try talking > to MYSQL guys about why they are doing this? The operations performed on the relay log file are buffered reads and writes, which I confirmed with the following bpftrace script: ``` #include <linux/fs.h> #include <linux/path.h> #include <linux/dcache.h> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { printf("read path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name)); } kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { printf("write path: %s\n", str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name)); } ``` I suspect there are DIO writes causing the problem, but I haven't caught any DIO writes to such files via bpftrace. >>>> In this series, the first patch reads the inode size twice, and takes the >>>> smaller of the two values as the copyout limit to avoid copying data that >>>> was not actually read (0-padding) into the user buffer and causing data >>>> corruption. 
This greatly reduces the probability of problems under 4k >>>> page. However, the problem is still easily triggered under 64k page. >>>> >>>> The second patch waits for the existing dio write to complete and >>>> invalidate the stale page cache before performing a new buffered read >>>> in ext4, avoiding data corruption by copying the stale page cache to >>>> the user buffer. This makes it much less likely that the problem will >>>> be triggered in a 64k page. >>>> >>>> Do we have a plan to add a lock to the ext4 buffered read or a field in >>>> the page that indicates the size of the valid data in the page? Or does >>>> anyone have a better idea? >>> No, there are no plans to address this AFAIK. Because such locking will >>> slow down all the well behaved applications to fix a corner case for >>> application doing unsupported things. Sure we must not crash the kernel, >>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app >>> combines buffered and direct IO but returning zeros instead of valid data >>> is in my opinion fully within the range of acceptable behavior for such >>> case. >>> >> I also feel that a scenario like buffered reads + DIO writes is strange. >> But theoretically when read doesn't return an error, the data read >> shouldn't be wrong. And I tested that xfs guarantees data consistency in >> this scenario, which is why I thought it might be buggy. > Yes, XFS has inherited stronger consistency guarantees from IRIX times than > Linux filesystems traditionally had. We generally don't even guarantee > buffered read vs buffered write atomicity (i.e., buffered read can see a > torn buffered write). > > Honza I'm a bit confused here, buffered read vs buffered write uses the same page and appears to be protected by a memory barrier, how does the inconsistency occur? Thanks!
On 2023/12/5 12:17, Theodore Ts'o wrote: > On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: >> The problem is with a one-master-twoslave MYSQL database with three >> physical machines, and using sysbench pressure testing on each of the >> three machines, the problem occurs about once every two to three hours. >> >> The problem is with the relay log file, and when the problem occurs, >> the middle dozens of bytes of the file are read as all zeros, while >> the data on disk is not. This is a journal-like file where a write >> process gets the data from the master node and writes it locally, >> and another replay process reads the file and performs the replay >> operation accordingly (some SQL statements). The problem is that >> when replaying, it finds that the data read is corrupted, not valid >> SQL data, while the data on disk is normal. > You mentioned "scripts" --- are these locally developped scripts by > any chance? This refers to the sql commands to be replayed in the relay log file. I don't know much about this file, but you can read the official documentation. https://dev.mysql.com/doc/refman/8.0/en/replica-logs-relaylog.html > The procedure suggested in a few places that I looked up > don't involve needing to read the replay log. For example from[1]: > > On the master server: > > root@repl-master:~# mysql -uroot -p; > mysql> CREATE USER ‘slave’@’12.34.56.789‘ IDENTIFIED BY ‘SLAVE_PASSWORD‘; > mysql> GRANT REPLICATION SLAVE ON . TO ‘slave’@’12.34.56.222 ‘; > mysql> FLUSH PRIVILEGES; > mysql> FLUSH TABLES WITH READ LOCK; > > This will make the master server read-only, with all pending writes > flushed out (so you don't need to worry about the replay log), and > then you move the data from the master to slave: > > root@repl-master:~# mysqldump -u root -p –all-databases –master-data > data.sql > root@repl-master:~# scp data.sql root@12.34.56.222 > > Then on the slave: > > root@repl-slave:~# mysql -uroot -p < data.sql > root@repl-slave:~# mysql -uroot -p; > mysql> STOP SLAVE; > > ... and then on the master: > > root@repl-master:~# mysql -uroot -p; > mysql> UNLOCK TABLES; > > ... and back on the slave: > > root@repl-slave:~# mysql -uroot -p; > mysql> START SLAVE; > > [1] https://hevodata.com/learn/mysql-master-slave-replication/ > > ... or you could buy the product advertised at [1] which is easier for > the database administrators, but results in $$$ flowing to the Hevo > company. :-) > > In any case, I'm pretty sure that the official documented way of > setting up a failover replication setup doesn't involve buffered reads > of the replay file. > > It is certainly the case that mysqldump uses buffered reads, but > that's why you have to temporary make the database read-only using > "FLUSH TABLES WITH READ LOCK" before taking a database snapshot, and > then re-enable database updates the "UNLOCK TABLES" SQL commands. > > Cheers, > > - Ted Thank you very much for your detailed explanation! But the downstream users do have buffered reads to read the relay log file, as I confirmed with bpftrace. Here's an introduction to turning on relay logging, but I'm not sure if you can access this link: https://blog.csdn.net/javaanddonet/article/details/112596148 Thanks!
On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: > On 2023/12/4 20:11, Jan Kara wrote: > > Hello! > Thank you for your reply! > > > > On Sat 02-12-23 17:14:30, Baokun Li wrote: > > > Recently, while running some pressure tests on MYSQL, noticed that > > > occasionally a "corrupted data in log event" error would be reported. > > > After analyzing the error, I found that extending DIO write and buffered > > > read were competing, resulting in some zero-filled page end being read. > > > Since ext4 buffered read doesn't hold an inode lock, and there is no > > > field in the page to indicate the valid data size, it seems to me that > > > it is impossible to solve this problem perfectly without changing these > > > two things. > > Yes, combining buffered reads with direct IO writes is a recipe for > > problems and pretty much in the "don't do it" territory. So honestly I'd > > consider this a MYSQL bug. Were you able to identify why does MYSQL use > > buffered read in this case? It is just something specific to the test > > you're doing? > The problem is with a one-master-twoslave MYSQL database with three > physical machines, and using sysbench pressure testing on each of the > three machines, the problem occurs about once every two to three hours. > > The problem is with the relay log file, and when the problem occurs, the > middle dozens of bytes of the file are read as all zeros, while the data on > disk is not. This is a journal-like file where a write process gets the data > from > the master node and writes it locally, and another replay process reads the > file and performs the replay operation accordingly (some SQL statements). > The problem is that when replaying, it finds that the data read is > corrupted, > not valid SQL data, while the data on disk is normal. > > It's not confirmed that buffered reads vs direct IO writes is actually > causing > this issue, but this is the only scenario that we can reproduce with our > local > simplified scripts. Also, after merging in patch 1, the MYSQL pressure test > scenario has now been tested for 5 days and has not been reproduced. Mixing overlapping buffered read with direct writes - especially partial block extending DIO writes - is a recipe for data corruption. It's not a matter of if, it's a matter of when. Fundamentally, when you have overlapping write IO involving DIO, the result of the overlapping IOs is undefined. One cannot control submission order, the order that the overlapping IO hit the media, or completion ordering that might clear flags like unwritten extents. The only guarantee that we give in this case is that we won't expose stale data from the disk to the user read. As such, it is the responsibility of the application to avoid overlapping IOs when doing direct IO. The fact that you['ve observed data corruption due to overlapping IO ranges from the one application indicates that this is, indeed, an application level problem and can only be solved by fixing the application.... > I'll double-check the problem scenario, although buffered reads with > buffered > writes doesn't seem to have this problem. > > > In this series, the first patch reads the inode size twice, and takes the > > > smaller of the two values as the copyout limit to avoid copying data that > > > was not actually read (0-padding) into the user buffer and causing data > > > corruption. This greatly reduces the probability of problems under 4k > > > page. However, the problem is still easily triggered under 64k page. 
> > > > > > The second patch waits for the existing dio write to complete and > > > invalidate the stale page cache before performing a new buffered read > > > in ext4, avoiding data corruption by copying the stale page cache to > > > the user buffer. This makes it much less likely that the problem will > > > be triggered in a 64k page. > > > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in > > > the page that indicates the size of the valid data in the page? Or does > > > anyone have a better idea? > > No, there are no plans to address this AFAIK. Because such locking will > > slow down all the well behaved applications to fix a corner case for > > application doing unsupported things. Sure we must not crash the kernel, > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > > combines buffered and direct IO but returning zeros instead of valid data > > is in my opinion fully within the range of acceptable behavior for such > > case. > > > > Honza > I also feel that a scenario like buffered reads + DIO writes is strange. But > theoretically when read doesn't return an error, the data read shouldn't be > wrong. The data that was read isn't wrong - it just wasn't what the application expected. > And I tested that xfs guarantees data consistency in this scenario, which is > why I > thought it might be buggy. XFS certainly does not guarantee data consistency between buffered reads and DIO writes, especially for overlapping IO ranges using DIO (see above). IOWs, the fact that the data inconsistency doesn't reproduce on XFS doesn't mean that XFS is providing some guarantee of consistency for this IO pattern. All it means is that ext4 and XFS behave differently in an situation where the operational result is indeterminate and all we guarantee is that we won't expose stale data read from disk.... Cheers, Dave.
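As a purely illustrative example of "fixing the application" (nothing here is taken from MySQL; the shared published_len scheme and all names are assumptions, and error handling is omitted), a reader can be kept from ever touching a range that the O_DIRECT writer has not finished yet:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

static _Atomic uint64_t published_len;	/* bytes of the file known to be fully written */

static void writer_append(int dio_fd, const void *buf, size_t len, uint64_t pos)
{
	/* buf, len and pos must satisfy the usual O_DIRECT alignment rules */
	if (pwrite(dio_fd, buf, len, pos) == (ssize_t)len)
		/* publish the new valid length only after the DIO write completed */
		atomic_store_explicit(&published_len, pos + len, memory_order_release);
}

static ssize_t reader_read(int fd, void *buf, size_t len, uint64_t pos)
{
	uint64_t safe = atomic_load_explicit(&published_len, memory_order_acquire);

	if (pos >= safe)
		return 0;		/* nothing safe to read yet */
	if (pos + len > safe)
		len = safe - pos;	/* never overlap an in-flight DIO write */
	return pread(fd, buf, len, pos);
}
```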
On Wed, Dec 06, 2023 at 07:35:35PM +1100, Dave Chinner wrote: > Mixing overlapping buffered read with direct writes - especially partial block > extending DIO writes - is a recipe for data corruption. It's not a > matter of if, it's a matter of when. > > Fundamentally, when you have overlapping write IO involving DIO, the > result of the overlapping IOs is undefined. One cannot control > submission order, the order that the overlapping IO hit the > media, or completion ordering that might clear flags like unwritten > extents. The only guarantee that we give in this case is that we > won't expose stale data from the disk to the user read. Btw, one thing we could do to kill these races forever is to track if there are any buffered openers for an inode and just fall back to buffered I/O for that case. With that and inode_dio_wait for when opening for buffered I/O we'd avoid the races and various crazy workarounds entirely. nfs and ocfs2 do (or did, I haven't checked for a while) something like that.
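For illustration only, the scheme Christoph describes might look roughly like the sketch below. The structure, field and function names (foo_inode_info, buffered_openers, foo_file_open, foo_dio_must_fall_back) are invented for the example and are not an existing kernel interface; inode_dio_wait() and generic_file_open() are the only real helpers used, and the matching decrement in ->release is omitted.

```c
#include <linux/fs.h>
#include <linux/atomic.h>

struct foo_inode_info {
	atomic_t	buffered_openers;	/* hypothetical counter */
	struct inode	vfs_inode;
};

#define FOO_I(inode) container_of(inode, struct foo_inode_info, vfs_inode)

static int foo_file_open(struct inode *inode, struct file *filp)
{
	if (!(filp->f_flags & O_DIRECT)) {
		atomic_inc(&FOO_I(inode)->buffered_openers);
		/* don't let buffered I/O start while DIO is still in flight */
		inode_dio_wait(inode);
	}
	return generic_file_open(inode, filp);
}

static bool foo_dio_must_fall_back(struct inode *inode)
{
	/* any buffered opener forces O_DIRECT I/O through the page cache */
	return atomic_read(&FOO_I(inode)->buffered_openers) != 0;
}
```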
On Wed, Dec 06, 2023 at 01:02:43AM -0800, Christoph Hellwig wrote: > On Wed, Dec 06, 2023 at 07:35:35PM +1100, Dave Chinner wrote: > > Mixing overlapping buffered read with direct writes - especially partial block > > extending DIO writes - is a recipe for data corruption. It's not a > > matter of if, it's a matter of when. > > > > Fundamentally, when you have overlapping write IO involving DIO, the > > result of the overlapping IOs is undefined. One cannot control > > submission order, the order that the overlapping IO hit the > > media, or completion ordering that might clear flags like unwritten > > extents. The only guarantee that we give in this case is that we > > won't expose stale data from the disk to the user read. > > Btw, one thing we could do to kill these races forever is to track if > there are any buffered openers for an inode and just fall back to > buffered I/O for that case. With that and inode_dio_wait for > when opening for buffered I/O we'd avoid the races and various crazy > workarounds entirely. That's basically what Solaris did 20-25 years ago. The inode held a flag that indicated what IO was being done, and if the "buffered" flag was set (either through mmap() based access or buffered read/write syscalls) then direct IO would also do buffered IO until the flag was cleared and the cache cleaned and invalidated. That had .... problems. Largely they were performance problems - unpredictable IO latency and CPU overhead for IO meant applications would randomly miss SLAs. The application would see IO suddenly lose all concurrency, go real slow and/or burn lots more CPU when the inode switched to buffered mode. I'm not sure that's a particularly viable model given that the raw IO throughput of even cheap modern SSDs largely exceeds the capability of buffered IO through the page cache. The differences in concurrency, latency and throughput between buffered and DIO modes will be even more stark today than they were 20 years ago.... -Dave.
On 2023/12/6 16:35, Dave Chinner wrote: > On Mon, Dec 04, 2023 at 09:50:18PM +0800, Baokun Li wrote: >> On 2023/12/4 20:11, Jan Kara wrote: >>> Hello! >> Thank you for your reply! >>> On Sat 02-12-23 17:14:30, Baokun Li wrote: >>>> Recently, while running some pressure tests on MYSQL, noticed that >>>> occasionally a "corrupted data in log event" error would be reported. >>>> After analyzing the error, I found that extending DIO write and buffered >>>> read were competing, resulting in some zero-filled page end being read. >>>> Since ext4 buffered read doesn't hold an inode lock, and there is no >>>> field in the page to indicate the valid data size, it seems to me that >>>> it is impossible to solve this problem perfectly without changing these >>>> two things. >>> Yes, combining buffered reads with direct IO writes is a recipe for >>> problems and pretty much in the "don't do it" territory. So honestly I'd >>> consider this a MYSQL bug. Were you able to identify why does MYSQL use >>> buffered read in this case? It is just something specific to the test >>> you're doing? >> The problem is with a one-master-twoslave MYSQL database with three >> physical machines, and using sysbench pressure testing on each of the >> three machines, the problem occurs about once every two to three hours. >> >> The problem is with the relay log file, and when the problem occurs, the >> middle dozens of bytes of the file are read as all zeros, while the data on >> disk is not. This is a journal-like file where a write process gets the data >> from >> the master node and writes it locally, and another replay process reads the >> file and performs the replay operation accordingly (some SQL statements). >> The problem is that when replaying, it finds that the data read is >> corrupted, >> not valid SQL data, while the data on disk is normal. >> >> It's not confirmed that buffered reads vs direct IO writes is actually >> causing >> this issue, but this is the only scenario that we can reproduce with our >> local >> simplified scripts. Also, after merging in patch 1, the MYSQL pressure test >> scenario has now been tested for 5 days and has not been reproduced. > Mixing overlapping buffered read with direct writes - especially partial block > extending DIO writes - is a recipe for data corruption. It's not a > matter of if, it's a matter of when. > > Fundamentally, when you have overlapping write IO involving DIO, the > result of the overlapping IOs is undefined. One cannot control > submission order, the order that the overlapping IO hit the > media, or completion ordering that might clear flags like unwritten > extents. The only guarantee that we give in this case is that we > won't expose stale data from the disk to the user read. > > As such, it is the responsibility of the application to avoid > overlapping IOs when doing direct IO. The fact that you['ve observed > data corruption due to overlapping IO ranges from the one > application indicates that this is, indeed, an application level > problem and can only be solved by fixing the application.... OK, thank you very much for your detailed explanation! >> I'll double-check the problem scenario, although buffered reads with >> buffered >> writes doesn't seem to have this problem. >>>> In this series, the first patch reads the inode size twice, and takes the >>>> smaller of the two values as the copyout limit to avoid copying data that >>>> was not actually read (0-padding) into the user buffer and causing data >>>> corruption.
This greatly reduces the probability of problems under 4k >>>> page. However, the problem is still easily triggered under 64k page. >>>> >>>> The second patch waits for the existing dio write to complete and >>>> invalidate the stale page cache before performing a new buffered read >>>> in ext4, avoiding data corruption by copying the stale page cache to >>>> the user buffer. This makes it much less likely that the problem will >>>> be triggered in a 64k page. >>>> >>>> Do we have a plan to add a lock to the ext4 buffered read or a field in >>>> the page that indicates the size of the valid data in the page? Or does >>>> anyone have a better idea? >>> No, there are no plans to address this AFAIK. Because such locking will >>> slow down all the well behaved applications to fix a corner case for >>> application doing unsupported things. Sure we must not crash the kernel, >>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app >>> combines buffered and direct IO but returning zeros instead of valid data >>> is in my opinion fully within the range of acceptable behavior for such >>> case. >>> >>> Honza >> I also feel that a scenario like buffered reads + DIO writes is strange. But >> theoretically when read doesn't return an error, the data read shouldn't be >> wrong. > The data that was read isn't wrong - it just wasn't what the > application expected. > >> And I tested that xfs guarantees data consistency in this scenario, which is >> why I >> thought it might be buggy. > XFS certainly does not guarantee data consistency between buffered > reads and DIO writes, especially for overlapping IO ranges using DIO > (see above). > > IOWs, the fact that the data inconsistency doesn't reproduce on XFS > doesn't mean that XFS is providing some guarantee of consistency for > this IO pattern. All it means is that ext4 and XFS behave differently > in an situation where the operational result is indeterminate and all > we guarantee is that we won't expose stale data read from disk.... > > Cheers, > > Dave. Here's my test script, just put all three shell scripts in the same directory, change the disk and dir, then run `sh test-io.sh`, and the inconsistency is reproduced under ext4 in just one round, and after switching the filesystem to xfs, the problem is no longer reproduced. So it looks like xfs guarantees consistency in this scenario. But you're right, a test case can't trigger a problem doesn't mean the problem doesn't exist. Thank you very much for the correction. 
------ test-write.sh
disk="/dev/sda"
dir="/tmp/test"

write_t()
{
	pos=0
	touch $dir/writefile
	while [ $pos -lt 16777216 ]
	do
		xfs_io -c "open -d $dir/writefile" -c "pwrite -q -b 1024 -i $dir/testfile -s $pos $pos 1024" &> /dev/null
		pos=`ls -l $dir/writefile | awk '{print $5}'`
	done
}

write_t
------------------------------------------------------------

------ test-read.sh
disk="/dev/sda"
dir="/tmp/test"

read_t()
{
	pos=0
	touch $dir/readfile
	while [ $pos -lt 16777216 ]
	do
		xfs_io -c "open -d $dir/readfile" -c "pwrite -q -b 65536 -i $dir/writefile -s $pos $pos 65536" &> /dev/null
		pos=`ls -l $dir/readfile | awk '{print $5}'`
	done
}

read_t
------------------------------------------------------------

------ test-io.sh
disk="/dev/sda"
dir="/tmp/test"

mkdir -p $dir
umount $dir
mkfs.ext4 -F $disk
#mkfs.xfs -f -m crc=1,rmapbt=0,reflink=0 $disk
mount $disk $dir

xfs_io -f -c "pwrite -S 0xff 0 16777216" $dir/testfile
sync

round=0
while true
do
	while true
	do
		echo "******* round $round ********"
		touch $dir/writefile
		taskset 0x00000001 sh test-read.sh &
		pid_read=$!
		taskset 0x0000000f sh test-write.sh &
		pid_write=$!
		wait $pid_read
		wait $pid_write
		sync
		md_raw=`md5sum $dir/testfile | awk '{print $1}'`
		md_write=`md5sum $dir/writefile | awk '{print $1}'`
		md_read=`md5sum $dir/readfile | awk '{print $1}'`
		if [ $md_write != $md_raw ]
		then
			echo "write error csum"
			echo "raw csum: $md_raw"
			echo "write csum: $md_write"
			exit 1
		elif [ $md_read != $md_write ]
		then
			echo "read error csum"
			echo "write csum: $md_write"
			echo "read csum: $md_read"
			exit 1
		fi
		rm -f $dir/writefile
		rm -f $dir/readfile
		sync
		round=$(($round+1))
	done
done
------------------------------------------------------------

Thanks!
On Wed, Dec 06, 2023 at 09:34:49PM +1100, Dave Chinner wrote: > Largely they were performance problems - unpredictable IO latency > and CPU overhead for IO meant applications would randomly miss SLAs. > The application would see IO suddenly lose all concurrency, go real > slow and/or burn lots more CPU when the inode switched to buffered > mode. > > I'm not sure that's a particularly viable model given the raw IO > throughput even cheap modern SSDs largely exceeds the capability of > buffered IO through the page cache. The differences in concurrency, > latency and throughput between buffered and DIO modes will be even > more stark itoday than they were 20 years ago.... The question is what's worse: random performance drops or random corruption. I suspect the former is less bad, especially if we have good tracepoints to pin it down.
On Tue 05-12-23 20:50:30, Baokun Li wrote: > On 2023/12/4 22:41, Jan Kara wrote: > > On Mon 04-12-23 21:50:18, Baokun Li wrote: > > > On 2023/12/4 20:11, Jan Kara wrote: > > > > On Sat 02-12-23 17:14:30, Baokun Li wrote: > > > > > Recently, while running some pressure tests on MYSQL, noticed that > > > > > occasionally a "corrupted data in log event" error would be reported. > > > > > After analyzing the error, I found that extending DIO write and buffered > > > > > read were competing, resulting in some zero-filled page end being read. > > > > > Since ext4 buffered read doesn't hold an inode lock, and there is no > > > > > field in the page to indicate the valid data size, it seems to me that > > > > > it is impossible to solve this problem perfectly without changing these > > > > > two things. > > > > Yes, combining buffered reads with direct IO writes is a recipe for > > > > problems and pretty much in the "don't do it" territory. So honestly I'd > > > > consider this a MYSQL bug. Were you able to identify why does MYSQL use > > > > buffered read in this case? It is just something specific to the test > > > > you're doing? > > > The problem is with a one-master-twoslave MYSQL database with three > > > physical machines, and using sysbench pressure testing on each of the > > > three machines, the problem occurs about once every two to three hours. > > > > > > The problem is with the relay log file, and when the problem occurs, the > > > middle dozens of bytes of the file are read as all zeros, while the data on > > > disk is not. This is a journal-like file where a write process gets the data > > > from > > > the master node and writes it locally, and another replay process reads the > > > file and performs the replay operation accordingly (some SQL statements). > > > The problem is that when replaying, it finds that the data read is > > > corrupted, > > > not valid SQL data, while the data on disk is normal. > > > > > > It's not confirmed that buffered reads vs direct IO writes is actually > > > causing this issue, but this is the only scenario that we can reproduce > > > with our local simplified scripts. Also, after merging in patch 1, the > > > MYSQL pressure test scenario has now been tested for 5 days and has not > > > been reproduced. > > > > > > I'll double-check the problem scenario, although buffered reads with > > > buffered writes doesn't seem to have this problem. > > Yeah, from what you write it seems that the replay code is using buffered > > reads on the journal file. I guess you could confirm that with a bit of > > kernel tracing but the symptoms look pretty convincing. Did you try talking > > to MYSQL guys about why they are doing this? > > The operations performed on the relay log file are buffered reads and > writes, which I confirmed with the following bpftrace script: > ``` > #include <linux/fs.h> > #include <linux/path.h> > #include <linux/dcache.h> > > kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb > *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { > printf("read path: %s\n", str(((struct kiocb > *)arg0)->ki_filp->f_path.dentry->d_name.name)); > } > > kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb > *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { > printf("write path: %s\n", str(((struct kiocb > *)arg0)->ki_filp->f_path.dentry->d_name.name)); > } > ``` > I suspect there are DIO writes causing the problem, but I haven't caught > any DIO writes to such files via bpftrace. Interesting. 
Not sure how your partially zeroed-out buffers could happen with fully buffered IO. > > > > > In this series, the first patch reads the inode size twice, and takes the > > > > > smaller of the two values as the copyout limit to avoid copying data that > > > > > was not actually read (0-padding) into the user buffer and causing data > > > > > corruption. This greatly reduces the probability of problems under 4k > > > > > page. However, the problem is still easily triggered under 64k page. > > > > > > > > > > The second patch waits for the existing dio write to complete and > > > > > invalidate the stale page cache before performing a new buffered read > > > > > in ext4, avoiding data corruption by copying the stale page cache to > > > > > the user buffer. This makes it much less likely that the problem will > > > > > be triggered in a 64k page. > > > > > > > > > > Do we have a plan to add a lock to the ext4 buffered read or a field in > > > > > the page that indicates the size of the valid data in the page? Or does > > > > > anyone have a better idea? > > > > No, there are no plans to address this AFAIK. Because such locking will > > > > slow down all the well behaved applications to fix a corner case for > > > > application doing unsupported things. Sure we must not crash the kernel, > > > > corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app > > > > combines buffered and direct IO but returning zeros instead of valid data > > > > is in my opinion fully within the range of acceptable behavior for such > > > > case. > > > > > > > I also feel that a scenario like buffered reads + DIO writes is strange. > > > But theoretically when read doesn't return an error, the data read > > > shouldn't be wrong. And I tested that xfs guarantees data consistency in > > > this scenario, which is why I thought it might be buggy. > > Yes, XFS has inherited stronger consistency guarantees from IRIX times than > > Linux filesystems traditionally had. We generally don't even guarantee > > buffered read vs buffered write atomicity (i.e., buffered read can see a > > torn buffered write). > > I'm a bit confused here, buffered read vs buffered write uses the same > page and appears to be protected by a memory barrier, how does the > inconsistency occur? Within the same page buffered reads and writes should be consistent because they are synchronized by the page lock. However once reads and writes involve multiple pages, there is no serialization so you can get contents of some pages before write and some pages after being written. However this doesn't seem to be your particular case here. I just wanted to point out that in general even buffered reads vs writes are not fully consistent. Honza
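A minimal userspace sketch of the multi-page case Jan describes (illustrative only; whether and how quickly it triggers depends on scheduling): one thread repeatedly rewrites a two-page file with a single buffered write while another thread reads it back, and a read that sees the two pages disagree is a torn read.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SZ (2 * 4096)			/* spans two 4k pages */

static int fd;

static void *writer(void *arg)
{
	char buf[SZ];

	for (unsigned long i = 0; ; i++) {
		memset(buf, 'A' + (i & 1), SZ);	/* whole file is 'A' or 'B' */
		pwrite(fd, buf, SZ, 0);
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	char buf[SZ];

	fd = open("tornfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	memset(buf, 'A', SZ);
	pwrite(fd, buf, SZ, 0);
	pthread_create(&t, NULL, writer, NULL);

	for (;;) {
		pread(fd, buf, SZ, 0);
		if (buf[0] != buf[SZ - 1]) {	/* first and second page disagree */
			printf("torn read: %c vs %c\n", buf[0], buf[SZ - 1]);
			break;
		}
	}
	return 0;
}
```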
On Tue, Dec 05, 2023 at 09:19:03PM +0800, Baokun Li wrote: > Thank you very much for your detailed explanation! > But the downstream users do have buffered reads to read the relay log > file, as I confirmed with bpftrace. Here's an introduction to turning on > relay logging, but I'm not sure if you can access this link: > https://blog.csdn.net/javaanddonet/article/details/112596148 Well, if a mysql-supplied program is trying read the relay log using a buffered read, when it's being written using Direct I/O, that's a bug in mysql, and it should be reported as such to the mysql folks. It does look like there is a newer way of doing replication which doesn't rely on messign with log files. From: https://dev.mysql.com/doc/refman/8.0/en/replication.html MySQL 8.0 supports different methods of replication. The traditional method is based on replicating events from the source's binary log, and requires the log files and positions in them to be synchronized between source and replica. The newer method based on global transaction identifiers (GTIDs) is transactional and therefore does not require working with log files or positions within these files, which greatly simplifies many common replication tasks. Replication using GTIDs guarantees consistency between source and replica as long as all transactions committed on the source have also been applied on the replica. For more information about GTIDs and GTID-based replication in MySQL, see Section 17.1.3, “Replication with Global Transaction Identifiers”. For information on using binary log file position based replication, see Section 17.1, “Configuring Replication”. Perhaps you can try and see how mysql handles GTID-based replication using bpftrace? Cheers, - Ted
On 2023/12/7 3:37, Jan Kara wrote: > On Tue 05-12-23 20:50:30, Baokun Li wrote: >> On 2023/12/4 22:41, Jan Kara wrote: >>> On Mon 04-12-23 21:50:18, Baokun Li wrote: >>>> On 2023/12/4 20:11, Jan Kara wrote: >>>>> On Sat 02-12-23 17:14:30, Baokun Li wrote: >>>>>> Recently, while running some pressure tests on MYSQL, noticed that >>>>>> occasionally a "corrupted data in log event" error would be reported. >>>>>> After analyzing the error, I found that extending DIO write and buffered >>>>>> read were competing, resulting in some zero-filled page end being read. >>>>>> Since ext4 buffered read doesn't hold an inode lock, and there is no >>>>>> field in the page to indicate the valid data size, it seems to me that >>>>>> it is impossible to solve this problem perfectly without changing these >>>>>> two things. >>>>> Yes, combining buffered reads with direct IO writes is a recipe for >>>>> problems and pretty much in the "don't do it" territory. So honestly I'd >>>>> consider this a MYSQL bug. Were you able to identify why does MYSQL use >>>>> buffered read in this case? It is just something specific to the test >>>>> you're doing? >>>> The problem is with a one-master-twoslave MYSQL database with three >>>> physical machines, and using sysbench pressure testing on each of the >>>> three machines, the problem occurs about once every two to three hours. >>>> >>>> The problem is with the relay log file, and when the problem occurs, the >>>> middle dozens of bytes of the file are read as all zeros, while the data on >>>> disk is not. This is a journal-like file where a write process gets the data >>>> from >>>> the master node and writes it locally, and another replay process reads the >>>> file and performs the replay operation accordingly (some SQL statements). >>>> The problem is that when replaying, it finds that the data read is >>>> corrupted, >>>> not valid SQL data, while the data on disk is normal. >>>> >>>> It's not confirmed that buffered reads vs direct IO writes is actually >>>> causing this issue, but this is the only scenario that we can reproduce >>>> with our local simplified scripts. Also, after merging in patch 1, the >>>> MYSQL pressure test scenario has now been tested for 5 days and has not >>>> been reproduced. >>>> >>>> I'll double-check the problem scenario, although buffered reads with >>>> buffered writes doesn't seem to have this problem. >>> Yeah, from what you write it seems that the replay code is using buffered >>> reads on the journal file. I guess you could confirm that with a bit of >>> kernel tracing but the symptoms look pretty convincing. Did you try talking >>> to MYSQL guys about why they are doing this? >> The operations performed on the relay log file are buffered reads and >> writes, which I confirmed with the following bpftrace script: >> ``` >> #include <linux/fs.h> >> #include <linux/path.h> >> #include <linux/dcache.h> >> >> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb >> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { >> printf("read path: %s\n", str(((struct kiocb >> *)arg0)->ki_filp->f_path.dentry->d_name.name)); >> } >> >> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb >> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { >> printf("write path: %s\n", str(((struct kiocb >> *)arg0)->ki_filp->f_path.dentry->d_name.name)); >> } >> ``` >> I suspect there are DIO writes causing the problem, but I haven't caught >> any DIO writes to such files via bpftrace. > Interesting. 
Not sure how your partially zeroed-out buffers could happen > with fully buffered IO. I'm still in the process of locating this issue and will sync here if there is any progress. >>>>>> In this series, the first patch reads the inode size twice, and takes the >>>>>> smaller of the two values as the copyout limit to avoid copying data that >>>>>> was not actually read (0-padding) into the user buffer and causing data >>>>>> corruption. This greatly reduces the probability of problems under 4k >>>>>> page. However, the problem is still easily triggered under 64k page. >>>>>> >>>>>> The second patch waits for the existing dio write to complete and >>>>>> invalidate the stale page cache before performing a new buffered read >>>>>> in ext4, avoiding data corruption by copying the stale page cache to >>>>>> the user buffer. This makes it much less likely that the problem will >>>>>> be triggered in a 64k page. >>>>>> >>>>>> Do we have a plan to add a lock to the ext4 buffered read or a field in >>>>>> the page that indicates the size of the valid data in the page? Or does >>>>>> anyone have a better idea? >>>>> No, there are no plans to address this AFAIK. Because such locking will >>>>> slow down all the well behaved applications to fix a corner case for >>>>> application doing unsupported things. Sure we must not crash the kernel, >>>>> corrupt the filesystem or leak sensitive (e.g. uninitialized) data if app >>>>> combines buffered and direct IO but returning zeros instead of valid data >>>>> is in my opinion fully within the range of acceptable behavior for such >>>>> case. >>>>> >>>> I also feel that a scenario like buffered reads + DIO writes is strange. >>>> But theoretically when read doesn't return an error, the data read >>>> shouldn't be wrong. And I tested that xfs guarantees data consistency in >>>> this scenario, which is why I thought it might be buggy. >>> Yes, XFS has inherited stronger consistency guarantees from IRIX times than >>> Linux filesystems traditionally had. We generally don't even guarantee >>> buffered read vs buffered write atomicity (i.e., buffered read can see a >>> torn buffered write). >> I'm a bit confused here, buffered read vs buffered write uses the same >> page and appears to be protected by a memory barrier, how does the >> inconsistency occur? > Within the same page buffered reads and writes should be consistent because > they are synchronized by the page lock. However once reads and writes > involve multiple pages, there is no serialization so you can get contents > of some pages before write and some pages after being written. However this > doesn't seem to be your particular case here. I just wanted to point out > that in general even buffered reads vs writes are not fully consistent. > > Honza > Yes, when writing multiple pages, it is possible to read a part of the pages that has been updated and a part of the pages that has not been updated. Thank you very much for answering my query! Thanks!
On 2023/12/7 5:55, Theodore Ts'o wrote: > On Tue, Dec 05, 2023 at 09:19:03PM +0800, Baokun Li wrote: >> Thank you very much for your detailed explanation! >> But the downstream users do have buffered reads to read the relay log >> file, as I confirmed with bpftrace. Here's an introduction to turning on >> relay logging, but I'm not sure if you can access this link: >> https://blog.csdn.net/javaanddonet/article/details/112596148 > Well, if a mysql-supplied program is trying read the relay log using a > buffered read, when it's being written using Direct I/O, that's a bug > in mysql, and it should be reported as such to the mysql folks. I was mistaken here, it now looks like reads and writes to the relay log are buffered IO, and I'm still trying to locate the issue. > > It does look like there is a newer way of doing replication which > doesn't rely on messign with log files. From: > > https://dev.mysql.com/doc/refman/8.0/en/replication.html > > MySQL 8.0 supports different methods of replication. The > traditional method is based on replicating events from the > source's binary log, and requires the log files and positions in > them to be synchronized between source and replica. The newer > method based on global transaction identifiers (GTIDs) is > transactional and therefore does not require working with log > files or positions within these files, which greatly simplifies > many common replication tasks. Replication using GTIDs guarantees > consistency between source and replica as long as all transactions > committed on the source have also been applied on the replica. For > more information about GTIDs and GTID-based replication in MySQL, > see Section 17.1.3, “Replication with Global Transaction > Identifiers”. For information on using binary log file position > based replication, see Section 17.1, “Configuring Replication”. > > Perhaps you can try and see how mysql handles GTID-based replication > using bpftrace? > > Cheers, > > - Ted Thank you very much for your solution! I'll try it. Thanks!
On 2023/12/7 3:37, Jan Kara wrote:
> On Tue 05-12-23 20:50:30, Baokun Li wrote:
>> On 2023/12/4 22:41, Jan Kara wrote:
>>> On Mon 04-12-23 21:50:18, Baokun Li wrote:
>>>> On 2023/12/4 20:11, Jan Kara wrote:
>>>> The problem is with a one-master-twoslave MYSQL database with three
>>>> physical machines, and using sysbench pressure testing on each of the
>>>> three machines, the problem occurs about once every two to three hours.
>>>>
>>>> The problem is with the relay log file, and when the problem occurs, the
>>>> middle dozens of bytes of the file are read as all zeros, while the data on
>>>> disk is not. This is a journal-like file where a write process gets the data
>>>> from the master node and writes it locally, and another replay process reads
>>>> the file and performs the replay operation accordingly (some SQL statements).
>>>> The problem is that when replaying, it finds that the data read is corrupted,
>>>> not valid SQL data, while the data on disk is normal.
>>>>
>>>> It's not confirmed that buffered reads vs direct IO writes is actually
>>>> causing this issue, but this is the only scenario that we can reproduce
>>>> with our local simplified scripts. Also, after merging in patch 1, the
>>>> MYSQL pressure test scenario has now been tested for 5 days and has not
>>>> been reproduced.
>>>>
>>>> I'll double-check the problem scenario, although buffered reads with
>>>> buffered writes doesn't seem to have this problem.
>>> Yeah, from what you write it seems that the replay code is using buffered
>>> reads on the journal file. I guess you could confirm that with a bit of
>>> kernel tracing but the symptoms look pretty convincing. Did you try talking
>>> to MYSQL guys about why they are doing this?
>> The operations performed on the relay log file are buffered reads and
>> writes, which I confirmed with the following bpftrace script:
>> ```
>> #include <linux/fs.h>
>> #include <linux/path.h>
>> #include <linux/dcache.h>
>>
>> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>> printf("read path: %s\n", str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>>
>> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
>> printf("write path: %s\n", str(((struct kiocb
>> *)arg0)->ki_filp->f_path.dentry->d_name.name));
>> }
>> ```
>> I suspect there are DIO writes causing the problem, but I haven't caught
>> any DIO writes to such files via bpftrace.
> Interesting. Not sure how your partially zeroed-out buffers could happen
> with fully buffered IO.

After looking at the code again and again, the following concurrency
seems to bypass the memory barrier:

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)      // write isize 2048
   unlock_page(page)

                  ext4_file_read_iter
                   generic_file_read_iter
                    filemap_read
                     filemap_get_pages
                      filemap_get_read_batch
                       folio_test_uptodate(folio)
                        ret = test_bit(PG_uptodate, folio_flags(folio, 0));
                        if (ret)
                                smp_rmb();
                        // The read barrier here ensures
                        // that data 0-2048 in the page is synchronized.

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)      // write isize 4096
   unlock_page(page)

                     // read isize 4096
                     isize = i_size_read(inode)
                     // But there is no read barrier here,
                     // so the data in the 2048-4096 range
                     // may not be synchronized yet !!!
                     copy_page_to_iter()   // copyout 4096

In the concurrency above, we read the updated i_size, but there is
no read barrier to ensure that the data in the page is the same as
the i_size at this point. Therefore, we may copy the unsynchronized
page out. Is it normal for us to read zero-filled data in this case?

Thanks!
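Stripped of the call chains, the ordering problem above can be modelled in
userspace with C11 atomics. The sketch below is purely illustrative (the
names and sizes are made up, and the fences stand in for the kernel's
smp_wmb()/smp_rmb()); it is not the kernel code itself:

```c
#include <stdatomic.h>
#include <string.h>

#define PAGE_SZ 4096

static char page[PAGE_SZ];      /* stands in for the page cache folio */
static _Atomic long isize;      /* stands in for inode->i_size        */

/* Writer: the second buffered write in the trace, extending 2048 -> 4096. */
void writer_extend(const char *src)
{
        memcpy(page + 2048, src, 2048);                 /* fill bytes 2048..4095 */
        atomic_thread_fence(memory_order_release);      /* kernel: smp_wmb()     */
        atomic_store_explicit(&isize, 4096, memory_order_relaxed);  /* i_size_write() */
}

/* Reader: the filemap_read() side of the trace. */
long reader_copyout(char *dst)
{
        /* folio_test_uptodate() already succeeded, including its smp_rmb(). */
        long size = atomic_load_explicit(&isize, memory_order_relaxed);  /* i_size_read() */
        /*
         * Nothing here orders the data loads below against the size load,
         * so a weakly ordered CPU may return stale (zeroed) bytes in the
         * 2048..4095 range even though size == 4096 was observed.
         * Adding atomic_thread_fence(memory_order_acquire) here -- the
         * userspace analogue of smp_rmb() -- closes the window.
         */
        memcpy(dst, page, size);                        /* copy_page_to_iter()   */
        return size;
}
```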
On Thu 07-12-23 22:15:55, Baokun Li wrote: > On 2023/12/7 3:37, Jan Kara wrote: > > On Tue 05-12-23 20:50:30, Baokun Li wrote: > > > On 2023/12/4 22:41, Jan Kara wrote: > > > > On Mon 04-12-23 21:50:18, Baokun Li wrote: > > > > > On 2023/12/4 20:11, Jan Kara wrote: > > > > > The problem is with a one-master-twoslave MYSQL database with three > > > > > physical machines, and using sysbench pressure testing on each of the > > > > > three machines, the problem occurs about once every two to three hours. > > > > > > > > > > The problem is with the relay log file, and when the problem occurs, the > > > > > middle dozens of bytes of the file are read as all zeros, while the data on > > > > > disk is not. This is a journal-like file where a write process gets the data > > > > > from > > > > > the master node and writes it locally, and another replay process reads the > > > > > file and performs the replay operation accordingly (some SQL statements). > > > > > The problem is that when replaying, it finds that the data read is > > > > > corrupted, > > > > > not valid SQL data, while the data on disk is normal. > > > > > > > > > > It's not confirmed that buffered reads vs direct IO writes is actually > > > > > causing this issue, but this is the only scenario that we can reproduce > > > > > with our local simplified scripts. Also, after merging in patch 1, the > > > > > MYSQL pressure test scenario has now been tested for 5 days and has not > > > > > been reproduced. > > > > > > > > > > I'll double-check the problem scenario, although buffered reads with > > > > > buffered writes doesn't seem to have this problem. > > > > Yeah, from what you write it seems that the replay code is using buffered > > > > reads on the journal file. I guess you could confirm that with a bit of > > > > kernel tracing but the symptoms look pretty convincing. Did you try talking > > > > to MYSQL guys about why they are doing this? > > > The operations performed on the relay log file are buffered reads and > > > writes, which I confirmed with the following bpftrace script: > > > ``` > > > #include <linux/fs.h> > > > #include <linux/path.h> > > > #include <linux/dcache.h> > > > > > > kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb > > > *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { > > > printf("read path: %s\n", str(((struct kiocb > > > *)arg0)->ki_filp->f_path.dentry->d_name.name)); > > > } > > > > > > kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb > > > *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { > > > printf("write path: %s\n", str(((struct kiocb > > > *)arg0)->ki_filp->f_path.dentry->d_name.name)); > > > } > > > ``` > > > I suspect there are DIO writes causing the problem, but I haven't caught > > > any DIO writes to such files via bpftrace. > > Interesting. Not sure how your partially zeroed-out buffers could happen > > with fully buffered IO. 
> >
> After looking at the code again and again, the following concurrency
> seems to bypass the memory barrier:
>
> ext4_buffered_write_iter
>  generic_perform_write
>   copy_page_from_iter_atomic
>   ext4_da_write_end
>    ext4_da_do_write_end
>     block_write_end
>      __block_commit_write
>       folio_mark_uptodate
>        smp_wmb()
>        set_bit(PG_uptodate, folio_flags(folio, 0))
>     i_size_write(inode, pos + copied)      // write isize 2048
>    unlock_page(page)
>
>                   ext4_file_read_iter
>                    generic_file_read_iter
>                     filemap_read
>                      filemap_get_pages
>                       filemap_get_read_batch
>                        folio_test_uptodate(folio)
>                         ret = test_bit(PG_uptodate, folio_flags(folio, 0));
>                         if (ret)
>                                 smp_rmb();
>                         // The read barrier here ensures
>                         // that data 0-2048 in the page is synchronized.
>
> ext4_buffered_write_iter
>  generic_perform_write
>   copy_page_from_iter_atomic
>   ext4_da_write_end
>    ext4_da_do_write_end
>     block_write_end
>      __block_commit_write
>       folio_mark_uptodate
>        smp_wmb()
>        set_bit(PG_uptodate, folio_flags(folio, 0))
>     i_size_write(inode, pos + copied)      // write isize 4096
>    unlock_page(page)
>
>                      // read isize 4096
>                      isize = i_size_read(inode)
>                      // But there is no read barrier here,
>                      // so the data in the 2048-4096 range
>                      // may not be synchronized yet !!!
>                      copy_page_to_iter()   // copyout 4096
>
> In the concurrency above, we read the updated i_size, but there is
> no read barrier to ensure that the data in the page is the same as
> the i_size at this point. Therefore, we may copy the unsynchronized
> page out. Is it normal for us to read zero-filled data in this case?

Indeed, I have checked, and filemap_read() (this dates back even to the old
do_generic_file_read() code) does copy data only after checking the uptodate
flag and then sampling i_size, so we may be copying state from the middle of
the racing write, and there is nothing which would prevent prefetching page
data before fetching the inode size. I agree this is kind of nasty, so I
think adding a read barrier between i_size_read() and copy_page_to_iter()
makes sense. Does it fix your issue with MYSQL?

								Honza
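For reference, a rough sketch of where such a barrier could sit in
filemap_read() (mm/filemap.c). The surrounding lines are paraphrased from
the reader and the comment wording is only a suggestion; the exact mainline
placement of the eventual fix may differ:

```c
	/* filemap_read(), sketch: after the batch of uptodate folios was collected */
	isize = i_size_read(inode);
	if (unlikely(iocb->ki_pos >= isize))
		goto put_folios;

	/*
	 * Pairs with the smp_wmb() issued on the write side before
	 * PG_uptodate and i_size are updated (e.g. via folio_mark_uptodate()):
	 * once the new i_size has been observed, make sure the folio contents
	 * published before that i_size update are visible before copying
	 * them out.
	 */
	smp_rmb();

	end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);
```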
On 2023/12/12 1:49, Jan Kara wrote: > On Thu 07-12-23 22:15:55, Baokun Li wrote: >> On 2023/12/7 3:37, Jan Kara wrote: >>> On Tue 05-12-23 20:50:30, Baokun Li wrote: >>>> On 2023/12/4 22:41, Jan Kara wrote: >>>>> On Mon 04-12-23 21:50:18, Baokun Li wrote: >>>>>> On 2023/12/4 20:11, Jan Kara wrote: >>>>>> The problem is with a one-master-twoslave MYSQL database with three >>>>>> physical machines, and using sysbench pressure testing on each of the >>>>>> three machines, the problem occurs about once every two to three hours. >>>>>> >>>>>> The problem is with the relay log file, and when the problem occurs, the >>>>>> middle dozens of bytes of the file are read as all zeros, while the data on >>>>>> disk is not. This is a journal-like file where a write process gets the data >>>>>> from >>>>>> the master node and writes it locally, and another replay process reads the >>>>>> file and performs the replay operation accordingly (some SQL statements). >>>>>> The problem is that when replaying, it finds that the data read is >>>>>> corrupted, >>>>>> not valid SQL data, while the data on disk is normal. >>>>>> >>>>>> It's not confirmed that buffered reads vs direct IO writes is actually >>>>>> causing this issue, but this is the only scenario that we can reproduce >>>>>> with our local simplified scripts. Also, after merging in patch 1, the >>>>>> MYSQL pressure test scenario has now been tested for 5 days and has not >>>>>> been reproduced. >>>>>> >>>>>> I'll double-check the problem scenario, although buffered reads with >>>>>> buffered writes doesn't seem to have this problem. >>>>> Yeah, from what you write it seems that the replay code is using buffered >>>>> reads on the journal file. I guess you could confirm that with a bit of >>>>> kernel tracing but the symptoms look pretty convincing. Did you try talking >>>>> to MYSQL guys about why they are doing this? >>>> The operations performed on the relay log file are buffered reads and >>>> writes, which I confirmed with the following bpftrace script: >>>> ``` >>>> #include <linux/fs.h> >>>> #include <linux/path.h> >>>> #include <linux/dcache.h> >>>> >>>> kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb >>>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { >>>> printf("read path: %s\n", str(((struct kiocb >>>> *)arg0)->ki_filp->f_path.dentry->d_name.name)); >>>> } >>>> >>>> kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb >>>> *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ { >>>> printf("write path: %s\n", str(((struct kiocb >>>> *)arg0)->ki_filp->f_path.dentry->d_name.name)); >>>> } >>>> ``` >>>> I suspect there are DIO writes causing the problem, but I haven't caught >>>> any DIO writes to such files via bpftrace. >>> Interesting. Not sure how your partially zeroed-out buffers could happen >>> with fully buffered IO. 
>>>
>> After looking at the code again and again, the following concurrency
>> seems to bypass the memory barrier:
>>
>> ext4_buffered_write_iter
>>  generic_perform_write
>>   copy_page_from_iter_atomic
>>   ext4_da_write_end
>>    ext4_da_do_write_end
>>     block_write_end
>>      __block_commit_write
>>       folio_mark_uptodate
>>        smp_wmb()
>>        set_bit(PG_uptodate, folio_flags(folio, 0))
>>     i_size_write(inode, pos + copied)      // write isize 2048
>>    unlock_page(page)
>>
>>                   ext4_file_read_iter
>>                    generic_file_read_iter
>>                     filemap_read
>>                      filemap_get_pages
>>                       filemap_get_read_batch
>>                        folio_test_uptodate(folio)
>>                         ret = test_bit(PG_uptodate, folio_flags(folio, 0));
>>                         if (ret)
>>                                 smp_rmb();
>>                         // The read barrier here ensures
>>                         // that data 0-2048 in the page is synchronized.
>>
>> ext4_buffered_write_iter
>>  generic_perform_write
>>   copy_page_from_iter_atomic
>>   ext4_da_write_end
>>    ext4_da_do_write_end
>>     block_write_end
>>      __block_commit_write
>>       folio_mark_uptodate
>>        smp_wmb()
>>        set_bit(PG_uptodate, folio_flags(folio, 0))
>>     i_size_write(inode, pos + copied)      // write isize 4096
>>    unlock_page(page)
>>
>>                      // read isize 4096
>>                      isize = i_size_read(inode)
>>                      // But there is no read barrier here,
>>                      // so the data in the 2048-4096 range
>>                      // may not be synchronized yet !!!
>>                      copy_page_to_iter()   // copyout 4096
>>
>> In the concurrency above, we read the updated i_size, but there is
>> no read barrier to ensure that the data in the page is the same as
>> the i_size at this point. Therefore, we may copy the unsynchronized
>> page out. Is it normal for us to read zero-filled data in this case?
> Indeed, I have checked, and filemap_read() (this dates back even to the old
> do_generic_file_read() code) does copy data only after checking the uptodate
> flag and then sampling i_size, so we may be copying state from the middle of
> the racing write, and there is nothing which would prevent prefetching page
> data before fetching the inode size. I agree this is kind of nasty, so I
> think adding a read barrier between i_size_read() and copy_page_to_iter()
> makes sense. Does it fix your issue with MYSQL?
>
> Honza

Thank you very much for confirming this issue!

Adding a read barrier between i_size_read() and copy_page_to_iter() does
seem to have solved the problem: the environment that previously reproduced
the corruption about once every 2-3 hours has now been running for more
than 60 hours without reproducing it after the read barrier was added.

This is a Load-Load reordering issue, which only occurs on some weakly
ordered architectures (e.g. ARM64, Alpha) but not on strongly ordered
architectures (e.g. x86). The issue was in fact discovered while migrating
MYSQL services from x86 to ARM64. Theoretically the problem is not limited
to ext4: filesystems that call filemap_read() without holding the inode
lock (e.g. btrfs, f2fs, ubifs, ...) can hit it, while filesystems that take
the inode lock for buffered reads (e.g. xfs, nfs) cannot.

I will send the patch that adds the memory barrier shortly.

Thanks!
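The Load-Load reordering described here is the classic message-passing (MP)
pattern, and it can be checked against the Linux kernel memory model with
the herd7 tooling under tools/memory-model. The litmus test below is a
hand-written sketch of the unfixed reader (the variable names are made up
stand-ins for the page data and i_size), not one of the stock tests:

```c
C MP-isize-data-sketch

{}

P0(int *data, int *isize)
{
	WRITE_ONCE(*data, 1);     /* write bytes 2048..4095    */
	smp_wmb();                /* folio_mark_uptodate()     */
	WRITE_ONCE(*isize, 1);    /* i_size_write(inode, 4096) */
}

P1(int *data, int *isize)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*isize);   /* i_size_read(inode)                */
	                          /* no smp_rmb() here in the old code */
	r1 = READ_ONCE(*data);    /* copy_page_to_iter()               */
}

exists (1:r0=1 /\ 1:r1=0)
```

Run through herd7 with the LKMM, the listed outcome (new size observed,
stale data read) should be reported as possible; inserting smp_rmb()
between the two reads in P1, as in the stock MP litmus tests shipped with
the model, makes it forbidden.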
On Wed, Dec 06, 2023 at 08:37:57PM +0100, Jan Kara wrote: > Within the same page buffered reads and writes should be consistent because > they are synchronized by the page lock. However once reads and writes > involve multiple pages, there is no serialization so you can get contents > of some pages before write and some pages after being written. However this > doesn't seem to be your particular case here. I just wanted to point out > that in general even buffered reads vs writes are not fully consistent. Buffered reads don't take the page/folio lock. We only use the folio lock to avoid reading stale data from the page cache while we're fetching the data from storage. Once the uptodate flag is set on the folio, we never take the folio lock for reads.
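The locking policy Willy describes can be sketched roughly as below. This is
simplified pseudo-C of the buffered read path, and copy_folio_to_user() /
fill_folio_from_storage() are made-up helper names standing in for the real
code:

```c
/* Simplified sketch of the buffered read path's use of the folio lock. */
folio = filemap_get_folio(mapping, index);

if (folio_test_uptodate(folio)) {
	/* Fast path: contents are already valid; no folio lock is taken. */
	copy_folio_to_user(folio, iter);
} else {
	/*
	 * Slow path: the lock only serializes filling the folio from
	 * storage, so nobody copies out a half-read (stale) folio.
	 */
	folio_lock(folio);
	if (!folio_test_uptodate(folio))        /* re-check under the lock */
		fill_folio_from_storage(folio); /* ->read_folio()          */
	folio_unlock(folio);
	copy_folio_to_user(folio, iter);
}
```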
On Tue 12-12-23 04:36:45, Matthew Wilcox wrote: > On Wed, Dec 06, 2023 at 08:37:57PM +0100, Jan Kara wrote: > > Within the same page buffered reads and writes should be consistent because > > they are synchronized by the page lock. However once reads and writes > > involve multiple pages, there is no serialization so you can get contents > > of some pages before write and some pages after being written. However this > > doesn't seem to be your particular case here. I just wanted to point out > > that in general even buffered reads vs writes are not fully consistent. > > Buffered reads don't take the page/folio lock. We only use the folio > lock to avoid reading stale data from the page cache while we're > fetching the data from storage. Once the uptodate flag is set on the > folio, we never take the folio lock for reads. Right, I've noticed once I've reread the code but thanks for correction anyway. :) Honza