Message ID | 20201012140350.950064-2-bfoster@redhat.com
---|---
State | New, archived
Series | iomap: zero dirty pages over unwritten extents
On Mon, Oct 12, 2020 at 10:03:49AM -0400, Brian Foster wrote:
> iomap seek hole/data currently uses page Uptodate state to track
> data over unwritten extents. This is odd and unpredictable in that
> the existence of clean pages changes behavior. For example:
>
>   $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
>       -c "pread 16k 4k" -c "seek -d 0" /mnt/file
>   Whence  Result
>   DATA    EOF
>   ...
>   Whence  Result
>   DATA    16384
>
> Instead, use page dirty state to locate data over unwritten extents.
> This causes seek data to land on the first uptodate block of a dirty
> page since we don't have per-block dirty state in iomap. This is
> consistent with writeback, however, which converts all uptodate
> blocks of a dirty page for similar reasons.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---

JFYI, I hit a generic/285 failure with this patch. I suspect this
needs to check for Dirty || Writeback, otherwise if we see the latter
the range is incorrectly treated as a hole.

Brian

>  fs/iomap/seek.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
> index 107ee80c3568..981a74c8d60f 100644
> --- a/fs/iomap/seek.c
> +++ b/fs/iomap/seek.c
> @@ -40,7 +40,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
>  	 * Just check the page unless we can and should check block ranges:
>  	 */
>  	if (bsize == PAGE_SIZE || !ops->is_partially_uptodate)
> -		return PageUptodate(page) == seek_data;
> +		return PageDirty(page) == seek_data;
>
>  	lock_page(page);
>  	if (unlikely(page->mapping != inode->i_mapping))
> @@ -49,7 +49,8 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
>  	for (off = 0; off < PAGE_SIZE; off += bsize) {
>  		if (offset_in_page(*lastoff) >= off + bsize)
>  			continue;
> -		if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
> +		if ((ops->is_partially_uptodate(page, off, bsize) &&
> +		     PageDirty(page)) == seek_data) {
>  			unlock_page(page);
>  			return true;
>  		}
> --
> 2.25.4
>
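[Editorial note: a minimal sketch of the Dirty-or-Writeback check Brian
suggests above. The helper name is hypothetical; the idea is that the
fast path in page_seek_hole_data() would use it in place of the bare
PageDirty() test.]

        #include <linux/page-flags.h>

        /*
         * Hypothetical helper: a page still counts as data once writeback
         * starts, because PageDirty() is cleared when writeback begins but
         * the page contents have not yet reached disk.
         */
        static inline bool page_has_cached_data(struct page *page)
        {
                return PageDirty(page) || PageWriteback(page);
        }

        /* The fast path would then read:
         *	return page_has_cached_data(page) == seek_data;
         */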
On Mon, Oct 12, 2020 at 10:03:49AM -0400, Brian Foster wrote:
> iomap seek hole/data currently uses page Uptodate state to track
> data over unwritten extents. This is odd and unpredictable in that
> the existence of clean pages changes behavior. For example:
>
>   $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
>       -c "pread 16k 4k" -c "seek -d 0" /mnt/file
>   Whence  Result
>   DATA    EOF
>   ...
>   Whence  Result
>   DATA    16384

I don't think there is any way around this, because the page cache
lookup done by the seek hole/data code is an unlocked operation and
can race with other IO and operations. That is, seek does not take IO
serialisation locks at all, so read/write/page faults/fallocate/etc
all run concurrently with it...

i.e. we get an iomap that is current at the time the iomap_begin()
call is made, but we don't hold any locks to stabilise that extent
range while we do a page cache traversal looking for cached data.
That means any region of the unwritten iomap can change state while
we are running the page cache seek.

We cannot determine what the data contents are without major overhead,
and if we are seeking over a large unwritten extent covered by clean
pages that then gets partially written synchronously by another
concurrent write IO, then we might trip across clean uptodate pages
with real data in them by the time the page cache scan gets to it.

Hence the only thing we are looking at here is whether there is data
present in the cache or not. As such, I think assuming that only
dirty/writeback pages contain actual user data in a seek data/hole
operation is a fundamentally incorrect premise.

Cheers,

Dave.
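[Editorial note: to make the window Dave describes concrete, here is a
rough, from-memory outline of where the unlocked page cache scan sits
in the 5.9-era seek path (abridged from fs/iomap/seek.c; treat the
structure as approximate, not authoritative).]

        loff_t
        iomap_seek_data(struct inode *inode, loff_t offset,
                        const struct iomap_ops *ops)
        {
                loff_t size = i_size_read(inode);
                loff_t length = size - offset;
                loff_t ret;

                /* No inode/IO locks are taken anywhere in this loop. */
                while (length > 0) {
                        /*
                         * iomap_begin() (called via iomap_apply) returns an
                         * extent map that is current only at this instant;
                         * by the time the actor walks the page cache over an
                         * unwritten extent, the extent may already have
                         * changed state underneath us.
                         */
                        ret = iomap_apply(inode, offset, length, IOMAP_REPORT,
                                          ops, &offset, iomap_seek_data_actor);
                        if (ret < 0)
                                return ret;
                        if (ret == 0)
                                break;
                        offset += ret;
                        length -= ret;
                }

                return length > 0 ? offset : -ENXIO;
        }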
On Wed, Oct 14, 2020 at 09:53:44AM +1100, Dave Chinner wrote:
> On Mon, Oct 12, 2020 at 10:03:49AM -0400, Brian Foster wrote:
> > iomap seek hole/data currently uses page Uptodate state to track
> > data over unwritten extents. This is odd and unpredictable in that
> > the existence of clean pages changes behavior. For example:
> >
> >   $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
> >       -c "pread 16k 4k" -c "seek -d 0" /mnt/file
> >   Whence  Result
> >   DATA    EOF
> >   ...
> >   Whence  Result
> >   DATA    16384
>
> I don't think there is any way around this, because the page cache
> lookup done by the seek hole/data code is an unlocked operation and
> can race with other IO and operations. That is, seek does not take IO
> serialisation locks at all, so read/write/page faults/fallocate/etc
> all run concurrently with it...
>
> i.e. we get an iomap that is current at the time the iomap_begin()
> call is made, but we don't hold any locks to stabilise that extent
> range while we do a page cache traversal looking for cached data.
> That means any region of the unwritten iomap can change state while
> we are running the page cache seek.
>

Hm, Ok.. that makes sense..

> We cannot determine what the data contents are without major overhead,
> and if we are seeking over a large unwritten extent covered by clean
> pages that then gets partially written synchronously by another
> concurrent write IO, then we might trip across clean uptodate pages
> with real data in them by the time the page cache scan gets to it.
>
> Hence the only thing we are looking at here is whether there is data
> present in the cache or not. As such, I think assuming that only
> dirty/writeback pages contain actual user data in a seek data/hole
> operation is a fundamentally incorrect premise.
>

... but afaict this kind of thing is already possible because nothing
stops a subsequently cleaned page (i.e., dirtied and written back) from
also being dropped from cache before the scan finds it. IOW, I don't
really see how this justifies using one page state check over another
as opposed to pointing out the whole page scanning thing itself seems
to be racy. Perhaps the reasoning wrt seek is simply that we should
either see one state (hole) or the next (data) and we don't terribly
care much about seek being racy..?

My concern is more the issue described by patch 2. Note that patch 2
doesn't necessarily depend on this one. The tradeoff without patch 1 is
just that we'd explicitly zero and dirty any uptodate new EOF page as
opposed to a page that was already dirty (or writeback).

Truncate does hold iolock/mmaplock, but ISTM that is still not
sufficient because of the same page reclaim issue mentioned above.
E.g., a truncate down lands on a dirty page over an unwritten block,
iomap_truncate_page() receives the unwritten mapping, page is flushed
and reclaimed (changing block state), iomap_truncate_page() (still
using the unwritten mapping) has nothing to do without a page and thus
stale data is exposed.

ISTM that either the filesystem needs to be more involved with the
stabilization of unwritten mappings in general or iomap_truncate_page()
needs to do something along the lines of block_truncate_page() (which
we used pre-iomap) and just explicitly zero/dirty the new page if the
block is otherwise mapped. Thoughts? Other ideas?

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
On Wed, Oct 14, 2020 at 08:59:55AM -0400, Brian Foster wrote:
> On Wed, Oct 14, 2020 at 09:53:44AM +1100, Dave Chinner wrote:
> > On Mon, Oct 12, 2020 at 10:03:49AM -0400, Brian Foster wrote:
> > > iomap seek hole/data currently uses page Uptodate state to track
> > > data over unwritten extents. This is odd and unpredictable in that
> > > the existence of clean pages changes behavior. For example:
> > >
> > >   $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
> > >       -c "pread 16k 4k" -c "seek -d 0" /mnt/file
> > >   Whence  Result
> > >   DATA    EOF
> > >   ...
> > >   Whence  Result
> > >   DATA    16384
> >
> > I don't think there is any way around this, because the page cache
> > lookup done by the seek hole/data code is an unlocked operation and
> > can race with other IO and operations. That is, seek does not take
> > IO serialisation locks at all, so read/write/page faults/fallocate/etc
> > all run concurrently with it...
> >
> > i.e. we get an iomap that is current at the time the iomap_begin()
> > call is made, but we don't hold any locks to stabilise that extent
> > range while we do a page cache traversal looking for cached data.
> > That means any region of the unwritten iomap can change state while
> > we are running the page cache seek.
>
> Hm, Ok.. that makes sense..
>
> > We cannot determine what the data contents are without major overhead,
> > and if we are seeking over a large unwritten extent covered by clean
> > pages that then gets partially written synchronously by another
> > concurrent write IO, then we might trip across clean uptodate pages
> > with real data in them by the time the page cache scan gets to it.
> >
> > Hence the only thing we are looking at here is whether there is data
> > present in the cache or not. As such, I think assuming that only
> > dirty/writeback pages contain actual user data in a seek data/hole
> > operation is a fundamentally incorrect premise.
>
> ... but afaict this kind of thing is already possible because nothing
> stops a subsequently cleaned page (i.e., dirtied and written back)
> from also being dropped from cache before the scan finds it. IOW, I
> don't really see how this justifies using one page state check over
> another as opposed to pointing out the whole page scanning thing
> itself seems to be racy. Perhaps the reasoning wrt seek is simply
> that we should either see one state (hole) or the next (data) and we
> don't terribly care much about seek being racy..?

lseek() is inherently racy, and there's nothing we can do to prevent
that. i.e. even if we lock lseek up tight and guarantee the lookup we
do is absolutely correct, something can come in between dropping the
locks and returning that information to userspace that makes it
incorrect.

My point is that if we see a page that has uptodate data in it, then
userspace needs to be told that so it can determine if it is actual
data or just zeroed pages by reading it. If there's no page in the
cache, then seek hole/data simply doesn't know there was ever data
there, and there's nothing a page cache scan can do about that sort of
race. And there's nothing userspace can do about that sort of race,
either, because lseek() is inherently racy.

> My concern is more the issue described by patch 2. Note that patch 2
> doesn't necessarily depend on this one. The tradeoff without patch 1
> is just that we'd explicitly zero and dirty any uptodate new EOF page
> as opposed to a page that was already dirty (or writeback).

I haven't looked at patch 2 - changes to seek hole/data need to be
correct from a standalone perspective...

> Truncate does hold iolock/mmaplock, but ISTM that is still not
> sufficient because of the same page reclaim issue mentioned above.
> E.g., a truncate down lands on a dirty page over an unwritten block,
> iomap_truncate_page() receives the unwritten mapping, page is flushed
> and reclaimed (changing block state), iomap_truncate_page() (still
> using the unwritten mapping) has nothing to do without a page and
> thus stale data is exposed.

So the problem here is only the new partial EOF block after the
truncate down completes? i.e. iomap_zero_range_actor() is failing to
zero the partial page that covers the new EOF if the block is
unwritten?

So, truncate_setsize() zeros the new partial EOF page, but it does so
without dirtying the page at all. Hence if the page is dirty in memory
over an unwritten extent when we run iomap_truncate_page(), it's ready
to be written out with the tail already zeroed in the page cache. The
only problem is that iomap_truncate_page() didn't return "did_zeroing
= true", and hence the follow-up "did_zeroing" flush doesn't trigger.

And, FWIW, I suspect that follow-up flush is invalid for a truncate
down, too, because ip->i_d.di_size > new_isize, and both
write_cache_pages() and __filemap_fdatawait_range() do nothing if
start > end, as per the call in xfs_setattr_size().

> ISTM that either the filesystem needs to be more involved with the
> stabilization of unwritten mappings in general or
> iomap_truncate_page() needs to do something along the lines of
> block_truncate_page() (which we used pre-iomap) and just explicitly
> zero/dirty the new page if the block is otherwise mapped. Thoughts?
> Other ideas?

I suspect that iomap_truncate_page() needs to attempt to invalidate
the range being truncated first. i.e. call
invalidate_inode_pages2_range() and if it gets EBUSY returned we know
that there is a dirty cached page over the offset we are truncating,
and we now know that we must zero the page range regardless of whether
it sits over an unwritten extent or not.

Then we just have to make sure xfs_setattr_size() does the right thing
with did_zeroing on truncate down....

Cheers,

Dave.
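[Editorial note: a hedged sketch of what Dave proposes, based on the
5.9-era iomap_truncate_page(). The invalidate call and -EBUSY handling
are illustrative only; how the "force zeroing" decision is actually
passed down to iomap_zero_range() is the open question in this thread.]

        #include <linux/fs.h>
        #include <linux/iomap.h>

        int
        iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
                        const struct iomap_ops *ops)
        {
                unsigned int blocksize = i_blocksize(inode);
                unsigned int off = pos & (blocksize - 1);
                int ret;

                /* Block boundary? Nothing to do. */
                if (!off)
                        return 0;

                /*
                 * Attempt to invalidate the page over the range being
                 * truncated first, per Dave's suggestion above.
                 */
                ret = invalidate_inode_pages2_range(inode->i_mapping,
                                pos >> PAGE_SHIFT, pos >> PAGE_SHIFT);
                if (ret == -EBUSY) {
                        /*
                         * A dirty cached page covers the new EOF, so the
                         * tail must be zeroed in the page cache regardless
                         * of whether the extent is unwritten, and did_zero
                         * must be reported so the caller's follow-up flush
                         * triggers. Forcing iomap_zero_range() to do that
                         * is the unresolved part.
                         */
                }

                return iomap_zero_range(inode, pos, blocksize - off,
                                did_zero, ops);
        }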
I don't think we can solve this properly. Due to the raciness we can
always err on one side. The beauty of treating all the uptodate pages
as present data is that we err on the safe side, as applications
expect holes to never have data, while "data" could always be zeroed.
On Thu, Oct 15, 2020 at 10:47:00AM +0100, Christoph Hellwig wrote:
> I don't think we can solve this properly. Due to the raciness we can
> always err on one side. The beauty of treating all the uptodate pages
> as present data is that we err on the safe side, as applications
> expect holes to never have data, while "data" could always be zeroed.
>

I don't think that's quite accurate. Nothing prevents a dirty page
from being written back and reclaimed between acquiring the
(unwritten) mapping and doing the pagecache scan, so it's possible to
present valid data (written to the kernel prior to a seek) as a hole
with the current code.

Brian
On Mon, Oct 19, 2020 at 12:55:01PM -0400, Brian Foster wrote:
> On Thu, Oct 15, 2020 at 10:47:00AM +0100, Christoph Hellwig wrote:
> > I don't think we can solve this properly. Due to the raciness we
> > can always err on one side. The beauty of treating all the uptodate
> > pages as present data is that we err on the safe side, as
> > applications expect holes to never have data, while "data" could
> > always be zeroed.
>
> I don't think that's quite accurate. Nothing prevents a dirty page
> from being written back and reclaimed between acquiring the
> (unwritten) mapping and doing the pagecache scan, so it's possible to
> present valid data (written to the kernel prior to a seek) as a hole
> with the current code.

True. I guess we need to go back and do another lookup to fully solve
this problem. That doesn't change my opinion that this patch makes the
problem worse.
On Tue, Oct 27, 2020 at 06:07:31PM +0000, Christoph Hellwig wrote:
> On Mon, Oct 19, 2020 at 12:55:01PM -0400, Brian Foster wrote:
> > On Thu, Oct 15, 2020 at 10:47:00AM +0100, Christoph Hellwig wrote:
> > > I don't think we can solve this properly. Due to the raciness we
> > > can always err on one side. The beauty of treating all the
> > > uptodate pages as present data is that we err on the safe side,
> > > as applications expect holes to never have data, while "data"
> > > could always be zeroed.
> >
> > I don't think that's quite accurate. Nothing prevents a dirty page
> > from being written back and reclaimed between acquiring the
> > (unwritten) mapping and doing the pagecache scan, so it's possible
> > to present valid data (written to the kernel prior to a seek) as a
> > hole with the current code.
>
> True. I guess we need to go back and do another lookup to fully solve
> this problem. That doesn't change my opinion that this patch makes
> the problem worse.
>

Yeah. I think it's possible to at least have some internal consistency
(i.e. while we're under locks) if we check the page state first or
somehow or other jump back out of the iomap_apply() sequence to do so.
I hadn't thought about it a ton since the goal of these patches was to
address the post-eof zeroing problem vs. fix seek data/hole.

Brian
diff --git a/fs/iomap/seek.c b/fs/iomap/seek.c
index 107ee80c3568..981a74c8d60f 100644
--- a/fs/iomap/seek.c
+++ b/fs/iomap/seek.c
@@ -40,7 +40,7 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
 	 * Just check the page unless we can and should check block ranges:
 	 */
 	if (bsize == PAGE_SIZE || !ops->is_partially_uptodate)
-		return PageUptodate(page) == seek_data;
+		return PageDirty(page) == seek_data;
 
 	lock_page(page);
 	if (unlikely(page->mapping != inode->i_mapping))
@@ -49,7 +49,8 @@ page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
 	for (off = 0; off < PAGE_SIZE; off += bsize) {
 		if (offset_in_page(*lastoff) >= off + bsize)
 			continue;
-		if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
+		if ((ops->is_partially_uptodate(page, off, bsize) &&
+		     PageDirty(page)) == seek_data) {
 			unlock_page(page);
 			return true;
 		}
iomap seek hole/data currently uses page Uptodate state to track data
over unwritten extents. This is odd and unpredictable in that the
existence of clean pages changes behavior. For example:

  $ xfs_io -fc "falloc 0 32k" -c "seek -d 0" \
      -c "pread 16k 4k" -c "seek -d 0" /mnt/file
  Whence  Result
  DATA    EOF
  ...
  Whence  Result
  DATA    16384

Instead, use page dirty state to locate data over unwritten extents.
This causes seek data to land on the first uptodate block of a dirty
page since we don't have per-block dirty state in iomap. This is
consistent with writeback, however, which converts all uptodate blocks
of a dirty page for similar reasons.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/iomap/seek.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
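[Editorial note: for readers unfamiliar with the interface under
discussion, a minimal userspace example of the SEEK_DATA query that the
commit message exercises via xfs_io. This is standard lseek(2) usage,
nothing patch-specific.]

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                if (argc < 2) {
                        fprintf(stderr, "usage: %s <file>\n", argv[0]);
                        return 1;
                }

                int fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* Find the first byte of data at or after offset 0. */
                off_t data = lseek(fd, 0, SEEK_DATA);
                if (data < 0)
                        perror("lseek(SEEK_DATA)");  /* ENXIO: no data found */
                else
                        printf("first data at %lld\n", (long long)data);

                close(fd);
                return 0;
        }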