Message ID | 20210413112859.32249-2-jack@suse.cz (mailing list archive)
State | New, archived
Series | fs: Hole punch vs page cache filling races
> 	if (error == AOP_TRUNCATED_PAGE)
> 		put_page(page);
> +	up_read(&mapping->host->i_mapping_sem);
> 	return error;

Please add an unlock_mapping label above this up_read and consolidate
most of the other unlocks by jumping there (put_and_wait_on_page_locked
probably can't use it).

> truncated:
> 	unlock_page(page);
> @@ -2309,6 +2324,7 @@ static int filemap_update_page(struct kiocb *iocb,
> 	return AOP_TRUNCATED_PAGE;

The truncated case actually seems to miss the unlock.

Similarly I think filemap_fault would benefit from a common
unlock path.
On Tue 13-04-21 13:57:46, Christoph Hellwig wrote:
> > 	if (error == AOP_TRUNCATED_PAGE)
> > 		put_page(page);
> > +	up_read(&mapping->host->i_mapping_sem);
> > 	return error;
>
> Please add an unlock_mapping label above this up_read and consolidate
> most of the other unlocks by jumping there (put_and_wait_on_page_locked
> probably can't use it).

Yeah, I've actually simplified the labels even a bit more like:

	error = filemap_read_page(iocb->ki_filp, mapping, page);
	goto unlock_mapping;
unlock:
	unlock_page(page);
unlock_mapping:
	up_read(&mapping->host->i_mapping_sem);
	if (error == AOP_TRUNCATED_PAGE)
		put_page(page);
	return error;

and everything now jumps to either unlock or unlock_mapping (except for
the put_and_wait_on_page_locked() case).

> > truncated:
> > 	unlock_page(page);
> > @@ -2309,6 +2324,7 @@ static int filemap_update_page(struct kiocb *iocb,
> > 	return AOP_TRUNCATED_PAGE;
>
> The truncated case actually seems to miss the unlock.
>
> Similarly I think filemap_fault would benefit from a common
> unlock path.

Right, thanks for catching that!

								Honza
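The consolidated unlock flow Jan describes above is the classic C goto-cleanup pattern: every failure jumps to the deepest label whose resources it actually holds. Here is a hedged userspace sketch of that shape (not kernel code): a plain reader counter stands in for i_mapping_sem, and the two booleans stand in for the trylock_page() / filemap_read_page() outcomes. All names are made up for the example.

```c
#include <stdbool.h>

/* Models down_read()/up_read() balance on the hypothetical lock. */
static int mapping_sem_readers;

/* Returns 0 on success, -1 on a simulated error. */
static int update_page_sketch(bool page_locked, bool read_ok)
{
	int error = 0;

	mapping_sem_readers++;		/* down_read(i_mapping_sem) */
	if (!page_locked) {
		error = -1;
		goto unlock_mapping;	/* page never locked: skip its unlock */
	}
	if (!read_ok) {
		error = -1;
		goto unlock;		/* page is locked: release it first */
	}
unlock:
	;				/* unlock_page(page) would go here */
unlock_mapping:
	mapping_sem_readers--;		/* up_read(i_mapping_sem) */
	return error;
}
```

Whatever path is taken, the counter returns to zero, which is exactly the invariant the missing up_read in the truncated case would have broken.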
On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
>
> fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
>   truncate_inode_pages_range()
> 						  <create pages in page
> 						   cache here>
>   <update fs block mapping and free blocks>
>
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
>
> So create a new rw_semaphore in the inode - i_mapping_sem - that
> protects adding of pages to page cache for page faults / reads /
> readahead.
>
> Signed-off-by: Jan Kara <jack@suse.cz>
....
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bd7c50e060a9..bc82a7856d3e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -77,7 +77,8 @@
>  *  ->i_pages lock
>  *
>  *  ->i_rwsem
> - *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> + *    ->i_mapping_sem		(acquired by fs in truncate path)
> + *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)

This is now officially a confusing mess.

We have inode->i_mapping that points to an address space, which has an
internal i_mmap variable and an i_mmap_rwsem that serialises access to
the i_mmap tree.

Now we have an inode->i_mapping_sem (which is actually a rwsem, not a
sem) that sorta serialises additions to the mapping tree
(inode->i_mapping->i_pages) against truncate, but it does not serialise
accesses to the rest of the inode->i_mapping structure itself despite
the similarities in naming.
Then we have the inode_lock() API and the i_mmap_lock() API that wrap
around the i_rwsem and i_mmap_rwsem, but there's no API for
inode_mapping_lock()...

Then we have the mmap_lock in the page fault path as well, which is also
an rwsem despite the name, and that protects something completely
different to the i_mmap and the i_mapping.

IOWs, we have 4 layers of entwined structures and locking here that
pretty much all have the same name but protect different things and not
always the obvious thing the name suggests. This makes it really
difficult to actually read the code and understand that the correct lock
is being used in the correct place...

>  *
>  * ->mmap_lock
>  *   ->i_mmap_rwsem
> @@ -85,7 +86,8 @@
>  *    ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
>  *
>  * ->mmap_lock
> - *   ->lock_page		(access_process_vm)
> + *   ->i_mapping_sem		(filemap_fault)
> + *     ->lock_page		(filemap_fault, access_process_vm)
>  *
>  * ->i_rwsem			(generic_perform_write)
>  *   ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> @@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
>  {
>  	int error;
>  
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> +			return -EAGAIN;
> +	} else {
> +		down_read(&mapping->host->i_mapping_sem);
> +	}

We really need a lock primitive for this. The number of times this
exact lock pattern is being replicated all through the IO path is
getting out of hand.

	static inline bool
	down_read_try_or_lock(struct rw_semaphore *sem, bool try)
	{
		if (try) {
			if (!down_read_trylock(sem))
				return false;
		} else {
			down_read(sem);
		}
		return true;
	}

and the callers become:

	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
		return -EAGAIN;

We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc.
and we don't need to rely on cargo cult knowledge to propagate this
pattern anymore.
Because I'm betting relatively few people actually know why the code is
written this way, because the only place it is documented is in an XFS
commit message....

Doing this is a separate cleanup, though, and not something that needs
to be done in this patchset.

> index c5b0457415be..ac5bb50b3a4c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	 */
>  	unsigned int nofs = memalloc_nofs_save();
>  
> +	down_read(&mapping->host->i_mapping_sem);
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */

I can't say I'm a great fan of having the mapping reach back up to the
host to lock the host. This seems the wrong way around to me given that
most of the locking in the IO path is in "host locks mapping" and
"mapping locks internal mapping structures" order...

I also come back to the naming confusion here, in that when we look at
this in long hand from the inode perspective, this chain actually looks
like:

	lock(inode->i_mapping->inode->i_mapping_sem)

i.e. the mapping is reaching back up outside its scope to lock itself
against other inode->i_mapping operations. Smells of layering violations
to me.

So, next question: should this truncate semaphore actually be part of
the address space, not the inode? This patch is actually moving the page
fault serialisation from the inode into the address space operations
when page faults and page cache operations are done, so maybe the lock
should also make that move? That would help clear up the naming problem,
because now we can name it based around what it serialises in the
address space, not the address space as a whole...

Cheers, Dave.
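The "trylock if non-blocking, else block" primitive Dave sketches above can be modeled in userspace without kernel infrastructure. This is a hedged illustration only: `toy_rwsem`, `read_try_or_lock` and `read_path` are invented stand-ins, and a boolean writer flag replaces a real rwsem's state.

```c
#include <stdbool.h>

/* Toy stand-in for a kernel rw_semaphore: only the writer state matters
 * for demonstrating the try-or-lock decision. */
struct toy_rwsem {
	bool write_held;
};

/* One helper hides the "trylock or block" choice from every call site,
 * which is the whole point of Dave's down_read_try_or_lock() idea. */
static bool read_try_or_lock(struct toy_rwsem *sem, bool nonblock)
{
	if (sem->write_held) {
		if (nonblock)
			return false;	/* caller turns this into -EAGAIN */
		/* a real down_read() would sleep here until the writer left;
		 * this single-threaded model just proceeds */
	}
	return true;
}

/* Caller shape from the mail above: bail out early under IOCB_NOWAIT. */
static int read_path(struct toy_rwsem *sem, bool iocb_nowait)
{
	if (!read_try_or_lock(sem, iocb_nowait))
		return -11;	/* -EAGAIN */
	/* ... fill page cache here, then up_read() ... */
	return 0;
}
```

The payoff is at the call sites: every IO path gets one readable line instead of the five-line if/else that the patch replicates.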
On Wed 14-04-21 10:01:13, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> > +			return -EAGAIN;
> > +	} else {
> > +		down_read(&mapping->host->i_mapping_sem);
> > +	}
>
> We really need a lock primitive for this. The number of times this
> exact lock pattern is being replicated all through the IO path is
> getting out of hand.
>
> 	static inline bool
> 	down_read_try_or_lock(struct rw_semaphore *sem, bool try)
> 	{
> 		if (try) {
> 			if (!down_read_trylock(sem))
> 				return false;
> 		} else {
> 			down_read(sem);
> 		}
> 		return true;
> 	}
>
> and the callers become:
>
> 	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> 		return -EAGAIN;
>
> We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc.
> and we don't need to rely on cargo cult knowledge to propagate this
> pattern anymore. Because I'm betting relatively few people actually
> know why the code is written this way because the only place it is
> documented is in an XFS commit message....
>
> Doing this is a separate cleanup, though, and not something that
> needs to be done in this patchset.

Yep, good idea but let's do it in a separate patch set.
> > index c5b0457415be..ac5bb50b3a4c 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  	 */
> >  	unsigned int nofs = memalloc_nofs_save();
> >  
> > +	down_read(&mapping->host->i_mapping_sem);
> >  	/*
> >  	 * Preallocate as many pages as we will need.
> >  	 */
>
> I can't say I'm a great fan of having the mapping reach back up to
> the host to lock the host. This seems the wrong way around to me
> given that most of the locking in the IO path is in "host locks
> mapping" and "mapping locks internal mapping structures" order...
>
> I also come back to the naming confusion here, in that when we look
> at this in long hand from the inode perspective, this chain actually
> looks like:
>
> 	lock(inode->i_mapping->inode->i_mapping_sem)
>
> i.e. the mapping is reaching back up outside its scope to lock
> itself against other inode->i_mapping operations. Smells of layering
> violations to me.
>
> So, next question: should this truncate semaphore actually be part
> of the address space, not the inode? This patch is actually moving
> the page fault serialisation from the inode into the address space
> operations when page faults and page cache operations are done, so
> maybe the lock should also make that move? That would help clear up
> the naming problem, because now we can name it based around what it
> serialises in the address space, not the address space as a whole...

I think that moving the lock to address_space makes some sense although
the lock actually protects consistency of inode->i_mapping->i_pages with
whatever the filesystem has in its file_offset->disk_block mapping
structures (which are generally associated with the inode). So it is not
only about inode->i_mapping contents but I agree that struct
address_space is probably a bit more logical place than struct inode.

Regarding the name: How about i_pages_rwsem?
The lock is protecting invalidation of mapping->i_pages and needs to be held until insertion of pages into i_pages is safe again... Honza
On Wed, Apr 14, 2021 at 02:23:19PM +0200, Jan Kara wrote:
> On Wed 14-04-21 10:01:13, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> > > index c5b0457415be..ac5bb50b3a4c 100644
> > > --- a/mm/readahead.c
> > > +++ b/mm/readahead.c
> > > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> > >  	 */
> > >  	unsigned int nofs = memalloc_nofs_save();
> > >  
> > > +	down_read(&mapping->host->i_mapping_sem);
> > >  	/*
> > >  	 * Preallocate as many pages as we will need.
> > >  	 */
> >
> > I can't say I'm a great fan of having the mapping reach back up to
> > the host to lock the host. This seems the wrong way around to me
> > given that most of the locking in the IO path is in "host locks
> > mapping" and "mapping locks internal mapping structures" order...
> >
> > I also come back to the naming confusion here, in that when we look
> > at this in long hand from the inode perspective, this chain actually
> > looks like:
> >
> > 	lock(inode->i_mapping->inode->i_mapping_sem)
> >
> > i.e. the mapping is reaching back up outside its scope to lock
> > itself against other inode->i_mapping operations. Smells of layering
> > violations to me.
> >
> > So, next question: should this truncate semaphore actually be part
> > of the address space, not the inode? This patch is actually moving
> > the page fault serialisation from the inode into the address space
> > operations when page faults and page cache operations are done, so
> > maybe the lock should also make that move? That would help clear up
> > the naming problem, because now we can name it based around what it
> > serialises in the address space, not the address space as a whole...
>
> I think that moving the lock to address_space makes some sense although
> the lock actually protects consistency of inode->i_mapping->i_pages with
> whatever the filesystem has in its file_offset->disk_block mapping
> structures (which are generally associated with the inode).
Well, I look at it as a mechanism that the filesystem uses to ensure
coherency of the page cache accesses w.r.t. physical layout changes. The
layout is a property of the inode, but changes to the physical layout of
the inode are serialised by other inode based mechanisms.

The page cache isn't part of the inode - it's part of the address space
- but coherency with the inode is required. Hence inode operations need
to be able to ensure coherency of the address space content and accesses
w.r.t. physical layout changes of the inode, but the address space
really knows nothing about the physical layout of the inode or how it
gets changed...

Hence it's valid for the inode operations to lock the address space to
ensure coherency of the page cache when making physical layout changes,
but locking the address space, by itself, is not sufficient to safely
serialise against physical changes to the inode layout.

> So it is not
> only about inode->i_mapping contents but I agree that struct address_space
> is probably a bit more logical place than struct inode.

Yup. Remember that the XFS_MMAPLOCK arose at the inode level because
that was the only way the filesystem could achieve the necessary
serialisation of page cache accesses whilst doing physical layout
changes. So the lock became an "inode property" because of
implementation constraints, not because it was the best way to implement
the necessary coherency hooks.

> Regarding the name: How about i_pages_rwsem? The lock is protecting
> invalidation of mapping->i_pages and needs to be held until insertion of
> pages into i_pages is safe again...

I don't actually have a good name for this right now. :(

The i_pages structure has its own internal locking, so i_pages_rwsem
implies things that aren't necessarily true, and taking a read lock for
insertion for something that is named like a structure protection lock
creates cognitive dissonance...
I keep wanting to say "lock for invalidation" and "lock to exclude invalidation" because those are the two actions that we need for coherency of operations. But they are way too verbose for an actual API... So I want to call this an "invalidation lock" of some kind (no need to encode the type in the name!), but haven't worked out a good shorthand for "address space invalidation coherency mechanism"... Naming is hard. :/ Cheers, Dave.
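The coherency property this "invalidation lock" provides can be shown with a toy model. This is purely illustrative, not kernel code: `toy_inode`, the epoch counters, and the function names are all invented. The point is that hole punching holds the lock in write mode across both steps (empty the cache, update the block mapping), so a cache fill, which must take the lock in read mode, can never run inside the stale window between them.

```c
#include <stdbool.h>

/* Epoch counters stand in for the page cache contents and the fs
 * file_offset -> disk_block mapping; they agree exactly when the cache
 * is coherent with the mapping. */
struct toy_inode {
	int write_locked;	/* invalidation lock held in write mode */
	int cache_epoch;	/* state of mapping->i_pages */
	int map_epoch;		/* state of the fs block mapping */
};

/* fallocate(FALLOC_FL_PUNCH_HOLE) path: both steps under the write lock. */
static void punch_hole(struct toy_inode *inode)
{
	inode->write_locked = 1;	/* down_write(invalidation lock) */
	inode->cache_epoch++;		/* truncate_inode_pages_range() */
	inode->map_epoch++;		/* free blocks, update block mapping */
	inode->write_locked = 0;	/* up_write(invalidation lock) */
}

/* read/fault/readahead path: returns true only when a fill would be
 * coherent; a real down_read() would block instead of returning false. */
static bool fill_cache(struct toy_inode *inode)
{
	if (inode->write_locked)
		return false;	/* reader is excluded mid-invalidation */
	return inode->cache_epoch == inode->map_epoch;
}
```

Without the lock, a fill landing between the two increments would repopulate the cache from a mapping that still points at freed (and possibly reused) blocks, which is exactly the race from the patch description.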
On Wed, Apr 14, 2021 at 10:01:13AM +1000, Dave Chinner wrote:
> > +	if (iocb->ki_flags & IOCB_NOWAIT) {
> > +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> > +			return -EAGAIN;
> > +	} else {
> > +		down_read(&mapping->host->i_mapping_sem);
> > +	}
>
> We really need a lock primitive for this. The number of times this
> exact lock pattern is being replicated all through the IO path is
> getting out of hand.
>
> 	static inline bool
> 	down_read_try_or_lock(struct rw_semaphore *sem, bool try)
> 	{
> 		if (try) {
> 			if (!down_read_trylock(sem))
> 				return false;
> 		} else {
> 			down_read(sem);
> 		}
> 		return true;
> 	}
>
> and the callers become:
>
> 	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> 		return -EAGAIN;

I think that should be written:

	error = iocb_read_lock(iocb, &rwsem);
	if (error)
		return error;

and implemented as:

	static inline int iocb_read_lock(struct kiocb *iocb,
					 struct rw_semaphore *sem)
	{
		if (iocb->ki_flags & IOCB_NOWAIT)
			return down_read_trylock(sem) ? 0 : -EAGAIN;
		return down_read_killable(sem);
	}
On Wed, Apr 14, 2021 at 11:25:31PM +0100, Matthew Wilcox wrote:
> On Wed, Apr 14, 2021 at 10:01:13AM +1000, Dave Chinner wrote:
> > We really need a lock primitive for this. The number of times this
> > exact lock pattern is being replicated all through the IO path is
> > getting out of hand.
> >
> > 	static inline bool
> > 	down_read_try_or_lock(struct rw_semaphore *sem, bool try)
> > 	{
> > 		if (try) {
> > 			if (!down_read_trylock(sem))
> > 				return false;
> > 		} else {
> > 			down_read(sem);
> > 		}
> > 		return true;
> > 	}
> >
> > and the callers become:
> >
> > 	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
> > 		return -EAGAIN;
>
> I think that should be written:
>
> 	error = iocb_read_lock(iocb, &rwsem);
> 	if (error)
> 		return error;
>
> and implemented as:
>
> 	static inline int iocb_read_lock(struct kiocb *iocb,
> 					 struct rw_semaphore *sem)
> 	{
> 		if (iocb->ki_flags & IOCB_NOWAIT)
> 			return down_read_trylock(sem) ? 0 : -EAGAIN;
> 		return down_read_killable(sem);
> 	}

Yup, we already have done that with xfs_ilock_iocb(), but my point is
that this "non blocking try lock or lock" pattern is slowly being used
in more places than just IOCB_NOWAIT situations. e.g. We use it for
IOMAP_NOWAIT locking in XFS, too, and ISTR other places where
optimistic locking is used are replicating it, too.

Hence my suggestion that it is moved up into the locking primitives,
not merely have context specific wrappers added...

Cheers, Dave.
On Thu 15-04-21 07:57:39, Dave Chinner wrote:
> On Wed, Apr 14, 2021 at 02:23:19PM +0200, Jan Kara wrote:
> > Regarding the name: How about i_pages_rwsem? The lock is protecting
> > invalidation of mapping->i_pages and needs to be held until insertion of
> > pages into i_pages is safe again...
>
> I don't actually have a good name for this right now. :(
>
> The i_pages structure has its own internal locking, so
> i_pages_rwsem implies things that aren't necessarily true, and
> taking a read lock for insertion for something that is named like a
> structure protection lock creates cognitive dissonance...
>
> I keep wanting to say "lock for invalidation" and "lock to exclude
> invalidation" because those are the two actions that we need for
> coherency of operations. But they are way too verbose for an actual
> API...
>
> So I want to call this an "invalidation lock" of some kind (no need
> to encode the type in the name!), but haven't worked out a good
> shorthand for "address space invalidation coherency mechanism"...

So "invalidate_lock" was just next on my list of things to suggest so
I'm fine with that name. Or maybe block_invalidate_lock,
block_remove_lock, map_remove_lock, ... Dunno :).

								Honza
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index b7dcc86c92a4..67ba0a81301a 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -266,19 +266,19 @@ prototypes::
 locking rules:
 	All except set_page_dirty and freepage may block
 
-====================== ======================== =========
-ops			PageLocked(page)	i_rwsem
-====================== ======================== =========
+====================== ======================== ========= ===============
+ops			PageLocked(page)	i_rwsem   i_mapping_sem
+====================== ======================== ========= ===============
 writepage:		yes, unlocks (see below)
-readpage:		yes, unlocks
+readpage:		yes, unlocks		          shared
 writepages:
 set_page_dirty		no
-readahead:		yes, unlocks
-readpages:		no
+readahead:		yes, unlocks		          shared
+readpages:		no			          shared
 write_begin:		locks the page		exclusive
 write_end:		yes, unlocks		exclusive
 bmap:
-invalidatepage:	yes
+invalidatepage:	yes			          exclusive
 releasepage:		yes
 freepage:		yes
 direct_IO:
@@ -373,7 +373,10 @@ keep it that way and don't breed new callers.
 ->invalidatepage() is called when the filesystem must attempt to drop
 some or all of the buffers from the page when it is being truncated. It
 returns zero on success. If ->invalidatepage is zero, the kernel uses
-block_invalidatepage() instead.
+block_invalidatepage() instead. The filesystem should exclusively acquire
+i_mapping_sem before invalidating page cache in truncate / hole punch path (and
+thus calling into ->invalidatepage) to block races between page cache
+invalidation and page cache filling functions (fault, read, ...).
 
 ->releasepage() is called when the kernel is about to try to drop the
 buffers from the page in preparation for freeing it.  It returns zero to
@@ -567,6 +570,19 @@ in sys_read() and friends.
 the lease within the individual filesystem to record the result of the
 operation
 
+->fallocate implementation must be really careful to maintain page cache
+consistency when punching holes or performing other operations that invalidate
+page cache contents. Usually the filesystem needs to call
+truncate_inode_pages_range() to invalidate relevant range of the page cache.
+However the filesystem usually also needs to update its internal (and on disk)
+view of file offset -> disk block mapping. Until this update is finished, the
+filesystem needs to block page faults and reads from reloading now-stale page
+cache contents from the disk. VFS provides inode->i_mapping_sem for this and
+acquires it in shared mode in paths loading pages from disk (filemap_fault(),
+filemap_read(), readahead paths). The filesystem is responsible for taking this
+lock in its fallocate implementation and generally whenever the page cache
+contents needs to be invalidated because a block is moving from under a page.
+
 dquot_operations
 ================
@@ -628,9 +644,9 @@ access:		yes
 to be faulted in. The filesystem must find and return the page associated
 with the passed in "pgoff" in the vm_fault structure. If it is possible that
 the page may be truncated and/or invalidated, then the filesystem must lock
-the page, then ensure it is not already truncated (the page lock will block
-subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
-locked. The VM will unlock the page.
+i_mapping_sem, then ensure the page is not already truncated (i_mapping_sem
+will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
+page locked. The VM will unlock the page.
 
 ->map_pages() is called when VM asks to map easy accessible pages.
 Filesystem should find and map pages associated with offsets from "start_pgoff"
diff --git a/fs/inode.c b/fs/inode.c
index a047ab306f9a..e23e707a507d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -175,6 +175,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	init_rwsem(&inode->i_rwsem);
 	lockdep_set_class(&inode->i_rwsem, &sb->s_type->i_mutex_key);
+	init_rwsem(&inode->i_mapping_sem);
+	lockdep_set_class(&inode->i_mapping_sem,
+			  &sb->s_type->i_mapping_sem_key);
 
 	atomic_set(&inode->i_dio_count, 0);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec8f3ddf4a6a..c020c105d2d9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -660,6 +660,7 @@ struct inode {
 	/* Misc */
 	unsigned long		i_state;
 	struct rw_semaphore	i_rwsem;
+	struct rw_semaphore	i_mapping_sem;
 
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 	unsigned long		dirtied_time_when;
@@ -2351,6 +2352,7 @@ struct file_system_type {
 
 	struct lock_class_key i_lock_key;
 	struct lock_class_key i_mutex_key;
+	struct lock_class_key i_mapping_sem_key;
 	struct lock_class_key i_mutex_dir_key;
 };
diff --git a/mm/filemap.c b/mm/filemap.c
index bd7c50e060a9..bc82a7856d3e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -77,7 +77,8 @@
  *  ->i_pages lock
  *
  *  ->i_rwsem
- *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
+ *    ->i_mapping_sem		(acquired by fs in truncate path)
+ *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)
 *
 * ->mmap_lock
 *   ->i_mmap_rwsem
@@ -85,7 +86,8 @@
  *    ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
  *
  * ->mmap_lock
- *   ->lock_page		(access_process_vm)
+ *   ->i_mapping_sem		(filemap_fault)
+ *     ->lock_page		(filemap_fault, access_process_vm)
  *
  * ->i_rwsem			(generic_perform_write)
  *   ->mmap_lock		(fault_in_pages_readable->do_page_fault)
@@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
 {
 	int error;
 
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!down_read_trylock(&mapping->host->i_mapping_sem))
+			return -EAGAIN;
+	} else {
+		down_read(&mapping->host->i_mapping_sem);
+	}
+
 	if (!trylock_page(page)) {
-		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
+		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
+			up_read(&mapping->host->i_mapping_sem);
 			return -EAGAIN;
+		}
 		if (!(iocb->ki_flags & IOCB_WAITQ)) {
+			up_read(&mapping->host->i_mapping_sem);
 			put_and_wait_on_page_locked(page, TASK_KILLABLE);
 			return AOP_TRUNCATED_PAGE;
 		}
 		error = __lock_page_async(page, iocb->ki_waitq);
-		if (error)
+		if (error) {
+			up_read(&mapping->host->i_mapping_sem);
 			return error;
+		}
 	}
 
 	if (!page->mapping)
@@ -2302,6 +2316,7 @@ static int filemap_update_page(struct kiocb *iocb,
 	error = filemap_read_page(iocb->ki_filp, mapping, page);
 	if (error == AOP_TRUNCATED_PAGE)
 		put_page(page);
+	up_read(&mapping->host->i_mapping_sem);
 	return error;
 truncated:
 	unlock_page(page);
@@ -2309,6 +2324,7 @@ static int filemap_update_page(struct kiocb *iocb,
 	return AOP_TRUNCATED_PAGE;
 unlock:
 	unlock_page(page);
+	up_read(&mapping->host->i_mapping_sem);
 	return error;
 }
 
@@ -2323,6 +2339,19 @@ static int filemap_create_page(struct file *file,
 	if (!page)
 		return -ENOMEM;
 
+	/*
+	 * Protect against truncate / hole punch. Grabbing i_mapping_sem here
+	 * assures we cannot instantiate and bring uptodate new pagecache pages
+	 * after evicting page cache during truncate and before actually
+	 * freeing blocks. Note that we could release i_mapping_sem after
+	 * inserting the page into page cache as the locked page would then be
+	 * enough to synchronize with hole punching. But there are code paths
+	 * such as filemap_update_page() filling in partially uptodate pages or
+	 * ->readpages() that need to hold i_mapping_sem while mapping blocks
+	 * for IO so let's hold the lock here as well to keep locking rules
+	 * simple.
+	 */
+	down_read(&mapping->host->i_mapping_sem);
 	error = add_to_page_cache_lru(page, mapping, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2334,9 +2363,11 @@ static int filemap_create_page(struct file *file,
 	if (error)
 		goto error;
 
+	up_read(&mapping->host->i_mapping_sem);
 	pagevec_add(pvec, page);
 	return 0;
 error:
+	up_read(&mapping->host->i_mapping_sem);
 	put_page(page);
 	return error;
 }
@@ -2896,6 +2927,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		ret = VM_FAULT_MAJOR;
 		fpin = do_sync_mmap_readahead(vmf);
+	}
+
+	/*
+	 * See comment in filemap_create_page() why we need i_mapping_sem
+	 */
+	down_read(&inode->i_mapping_sem);
+	if (!page) {
 retry_find:
 		page = pagecache_get_page(mapping, offset,
 					  FGP_CREAT|FGP_FOR_MMAP,
@@ -2903,6 +2941,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 		if (!page) {
 			if (fpin)
 				goto out_retry;
+			up_read(&inode->i_mapping_sem);
 			return VM_FAULT_OOM;
 		}
 	}
@@ -2943,9 +2982,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (unlikely(offset >= max_off)) {
 		unlock_page(page);
 		put_page(page);
+		up_read(&inode->i_mapping_sem);
 		return VM_FAULT_SIGBUS;
 	}
 
+	up_read(&inode->i_mapping_sem);
 	vmf->page = page;
 	return ret | VM_FAULT_LOCKED;
@@ -2971,6 +3012,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	if (!error || error == AOP_TRUNCATED_PAGE)
 		goto retry_find;
 
+	up_read(&inode->i_mapping_sem);
 	shrink_readahead_size_eio(ra);
 	return VM_FAULT_SIGBUS;
@@ -2982,6 +3024,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 */
 	if (page)
 		put_page(page);
+	up_read(&inode->i_mapping_sem);
 	if (fpin)
 		fput(fpin);
 	return ret | VM_FAULT_RETRY;
diff --git a/mm/readahead.c b/mm/readahead.c
index c5b0457415be..ac5bb50b3a4c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	 */
 	unsigned int nofs = memalloc_nofs_save();
 
+	down_read(&mapping->host->i_mapping_sem);
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	 * will then handle the error.
 	 */
 	read_pages(ractl, &page_pool, false);
+	up_read(&mapping->host->i_mapping_sem);
 	memalloc_nofs_restore(nofs);
 }
 EXPORT_SYMBOL_GPL(page_cache_ra_unbounded);
diff --git a/mm/rmap.c b/mm/rmap.c
index dba8cb8a5578..37e5dceb4351 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -22,24 +22,25 @@
 *
 * inode->i_rwsem	(while writing or truncating, not reading or faulting)
 *   mm->mmap_lock
- *     page->flags PG_locked (lock_page)   * (see hugetlbfs below)
- *       hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
- *         mapping->i_mmap_rwsem
- *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *           anon_vma->rwsem
- *             mm->page_table_lock or pte_lock
- *               swap_lock (in swap_duplicate, swap_info_get)
- *                 mmlist_lock (in mmput, drain_mmlist and others)
- *                 mapping->private_lock (in __set_page_dirty_buffers)
- *                   lock_page_memcg move_lock (in __set_page_dirty_buffers)
- *                     i_pages lock (widely used)
- *                       lruvec->lru_lock (in lock_page_lruvec_irq)
- *                 inode->i_lock (in set_page_dirty's __mark_inode_dirty)
- *                 bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
- *                   sb_lock (within inode_lock in fs/fs-writeback.c)
- *                   i_pages lock (widely used, in set_page_dirty,
- *                             in arch-dependent flush_dcache_mmap_lock,
- *                             within bdi.wb->list_lock in __sync_single_inode)
+ *     inode->i_mapping_sem (in filemap_fault)
+ *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
+ *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
+ *           mapping->i_mmap_rwsem
+ *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
+ *             anon_vma->rwsem
+ *               mm->page_table_lock or pte_lock
+ *                 swap_lock (in swap_duplicate, swap_info_get)
+ *                   mmlist_lock (in mmput, drain_mmlist and others)
+ *                   mapping->private_lock (in __set_page_dirty_buffers)
+ *                     lock_page_memcg move_lock (in __set_page_dirty_buffers)
+ *                       i_pages lock (widely used)
+ *                         lruvec->lru_lock (in lock_page_lruvec_irq)
+ *                   inode->i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                   bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
+ *                     sb_lock (within inode_lock in fs/fs-writeback.c)
+ *                     i_pages lock (widely used, in set_page_dirty,
+ *                               in arch-dependent flush_dcache_mmap_lock,
+ *                               within bdi.wb->list_lock in __sync_single_inode)
 *
 * anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon)
 *   ->tasklist_lock
diff --git a/mm/truncate.c b/mm/truncate.c
index 2cf71d8c3c62..464ad70a081f 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -416,7 +416,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range);
 * @mapping: mapping to truncate
 * @lstart: offset from which to truncate
 *
- * Called under (and serialised by) inode->i_rwsem.
+ * Called under (and serialised by) inode->i_rwsem and inode->i_mapping_sem.
 *
 * Note: When this function returns, there can be a page in the process of
 * deletion (inside __delete_from_page_cache()) in the specified range. Thus
Currently, serializing operations such as page fault, read, or readahead
against hole punching is rather difficult. The basic race scheme is
like:

fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
  truncate_inode_pages_range()
						  <create pages in page
						   cache here>
  <update fs block mapping and free blocks>

Now the problem is in this way read / page fault / readahead can
instantiate pages in page cache with potentially stale data (if blocks
get quickly reused). Avoiding this race is not simple - page locks do
not work because we want to make sure there are *no* pages in given
range. inode->i_rwsem does not work because page fault happens under
mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
the performance for mixed read-write workloads suffer.

So create a new rw_semaphore in the inode - i_mapping_sem - that
protects adding of pages to page cache for page faults / reads /
readahead.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 Documentation/filesystems/locking.rst | 38 ++++++++++++++------
 fs/inode.c                            |  3 ++
 include/linux/fs.h                    |  2 ++
 mm/filemap.c                          | 51 ++++++++++++++++++++++++---
 mm/readahead.c                        |  2 ++
 mm/rmap.c                             | 37 +++++++++----------
 mm/truncate.c                         |  2 +-
 7 files changed, 101 insertions(+), 34 deletions(-)