Message ID: 20180518074918.13816-3-kent.overstreet@gmail.com (mailing list archive)
State: New, archived
On Fri, May 18, 2018 at 03:49:00AM -0400, Kent Overstreet wrote:
> Add a per address space lock around adding pages to the pagecache - making it
> possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
> hopefully making truncate and dio a bit saner.

(moving this section here from the overall description so I can reply to it
in one place)

> * pagecache add lock
>
> This is the only one that touches existing code in nontrivial ways.
> The problem it's solving is that there is no existing general mechanism
> for shooting down pages in the page cache and keeping them removed, which
> is a real problem if you're doing anything that modifies file data and
> isn't buffered writes.
>
> Historically, the only problematic case has been direct IO, and people
> have been willing to say "well, if you mix buffered and direct IO you
> get what you deserve", and that's probably not unreasonable. But now we
> have fallocate insert range and collapse range, and those are broken in
> ways I frankly don't want to think about if they can't ensure consistency
> with the page cache.

ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
You may get pushback on the grounds that this ought to be a
filesystem-specific lock rather than one embedded in the generic inode.

> Also, the mechanism truncate uses (i_size and sacrificing a goat) has
> historically been rather fragile; IMO it might be a good thing if we
> switched it to a more general, rigorous mechanism.
>
> I need this solved for bcachefs because without this mechanism, the page
> cache inconsistencies lead to various assertions popping (primarily when
> we didn't think we needed to get a disk reservation, going by page cache
> state, but then do the actual write and disk space accounting says oops,
> we did need one). And having to reason about what can happen without
> a locking mechanism for this is not something I care to spend brain
> cycles on.
> That said, my patch is kind of ugly, and it requires filesystem changes
> for other filesystems to take advantage of it. And unfortunately, since
> one of the code paths that needs locking is readahead, I don't see any
> realistic way of implementing the locking within just bcachefs code.
>
> So I'm hoping someone has an idea for something cleaner (I think I recall
> Matthew Wilcox saying he had an idea for how to use the XArray to solve
> this), but if not I'll polish up my pagecache add lock patch and see what
> I can do to make it less ugly, and hopefully other people find it
> palatable or at least useful.

My idea with the XArray is that we have a number of reserved entries
which we can use as blocking entries. I was originally planning on
making this an XArray feature, but I now believe it's a
page-cache-special feature. We can always revisit that decision if it
turns out to be useful to another user.

API:

int filemap_block_range(struct address_space *mapping, loff_t start,
		loff_t end);
void filemap_remove_block(struct address_space *mapping, loff_t start,
		loff_t end);

 - After removing a block, the pagecache is empty between [start, end].
 - You have to treat the block as a single entity; don't unblock only
   a subrange of the range you originally blocked.
 - Lookups of a page within a blocked range return NULL.
 - Attempts to add a page to a blocked range sleep on one of the
   page_wait_table queues.
 - Attempts to block a blocked range will also sleep on one of the
   page_wait_table queues.

Is this restriction acceptable for your use case? It's clearly not a
problem for fallocate insert/collapse. It would only be a problem for
Direct I/O if people are doing sub-page direct I/O from within the same
page. I think that's rare enough not to be a problem (but please tell me
if I'm wrong!)
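The proposed API above is only a sketch at this point; its intended semantics can be illustrated with a small userspace model. All names and the fixed-size bookkeeping here are invented for illustration - the real implementation would store blocking entries in the XArray and sleep on the page_wait_table queues instead of returning errors:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed blocked-range semantics: a mapping keeps a
 * small table of blocked byte ranges; lookups in a blocked range miss,
 * and adds/blocks of an overlapping range are refused (where the real
 * proposal would sleep on a page_wait_table queue). */
#define MAX_BLOCKS	8
#define PAGE_SZ		4096LL

struct block_range { long long start, end; bool in_use; };

struct mapping_model {
	struct block_range	blocks[MAX_BLOCKS];
	bool			pages[64];	/* is page i present? */
};

static bool range_blocked(struct mapping_model *m, long long start, long long end)
{
	for (int i = 0; i < MAX_BLOCKS; i++)
		if (m->blocks[i].in_use &&
		    start <= m->blocks[i].end && end >= m->blocks[i].start)
			return true;
	return false;
}

/* filemap_block_range(): an overlapping block already exists -> would sleep */
static int block_range(struct mapping_model *m, long long start, long long end)
{
	if (range_blocked(m, start, end))
		return -1;
	for (int i = 0; i < MAX_BLOCKS; i++)
		if (!m->blocks[i].in_use) {
			m->blocks[i] = (struct block_range){ start, end, true };
			return 0;
		}
	return -1;
}

/* filemap_remove_block(): must be the exact range originally blocked;
 * afterwards the pagecache is empty in [start, end] */
static void remove_block(struct mapping_model *m, long long start, long long end)
{
	for (int i = 0; i < MAX_BLOCKS; i++)
		if (m->blocks[i].in_use &&
		    m->blocks[i].start == start && m->blocks[i].end == end) {
			m->blocks[i].in_use = false;
			for (long long p = start / PAGE_SZ; p <= end / PAGE_SZ; p++)
				m->pages[p] = false;
			return;
		}
}

/* add_to_page_cache(): refused while the page's range is blocked */
static int add_page(struct mapping_model *m, long long index)
{
	if (range_blocked(m, index * PAGE_SZ, index * PAGE_SZ + PAGE_SZ - 1))
		return -1;	/* real version sleeps on page_wait_table */
	m->pages[index] = true;
	return 0;
}

/* find_get_page(): lookups in a blocked range return NULL */
static bool find_page(struct mapping_model *m, long long index)
{
	if (range_blocked(m, index * PAGE_SZ, index * PAGE_SZ + PAGE_SZ - 1))
		return false;
	return m->pages[index];
}
```

The point of the model is the invariant set Willy lists: while a range is blocked, lookups miss, adds wait, and overlapping blocks wait; after removal the range is empty.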
On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > Historically, the only problematic case has been direct IO, and people
> > have been willing to say "well, if you mix buffered and direct IO you
> > get what you deserve", and that's probably not unreasonable. But now we
> > have fallocate insert range and collapse range, and those are broken in
> > ways I frankly don't want to think about if they can't ensure consistency
> > with the page cache.
>
> ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> You may get pushback on the grounds that this ought to be a
> filesystem-specific lock rather than one embedded in the generic inode.

Honestly I think this probably should be in the core. But IFF we move it
to the core the existing users of per-fs locks need to be moved over
first. E.g. XFS as the very first one, and at least ext4 and f2fs that
copied the approach, and probably more if you audit deep enough.
On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > Historically, the only problematic case has been direct IO, and people
> > > have been willing to say "well, if you mix buffered and direct IO you
> > > get what you deserve", and that's probably not unreasonable. But now we
> > > have fallocate insert range and collapse range, and those are broken in
> > > ways I frankly don't want to think about if they can't ensure consistency
> > > with the page cache.
> >
> > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > You may get pushback on the grounds that this ought to be a
> > filesystem-specific lock rather than one embedded in the generic inode.
>
> Honestly I think this probably should be in the core. But IFF we move
> it to the core the existing users of per-fs locks need to be moved
> over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> that copied the approach, and probably more if you audit deep enough.

I didn't know about i_mmap_sem, thanks.

But using a rw semaphore wouldn't work for dio writes, and I do need dio
writes to block pagecache adds in bcachefs, since the dio write could
overwrite uncompressed data or a reservation with compressed data, which
means new writes need a disk reservation.

Also, I'm guessing ext4 takes the lock at the top of the read path? That
sucks for reads that are already cached; the generic buffered read path
can do cached reads without taking any per-inode locks. That's why I
pushed the lock down to only be taken when we have to add a page to the
pagecache.

Definitely less ugly doing it that way, though...
On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > Historically, the only problematic case has been direct IO, and people
> > > have been willing to say "well, if you mix buffered and direct IO you
> > > get what you deserve", and that's probably not unreasonable. But now we
> > > have fallocate insert range and collapse range, and those are broken in
> > > ways I frankly don't want to think about if they can't ensure consistency
> > > with the page cache.
> >
> > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > You may get pushback on the grounds that this ought to be a
> > filesystem-specific lock rather than one embedded in the generic inode.
>
> Honestly I think this probably should be in the core. But IFF we move
> it to the core the existing users of per-fs locks need to be moved
> over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> that copied the approach, and probably more if you audit deep enough.

I'm not going to go and redo locking in XFS and ext4 as a prerequisite
to merging bcachefs. Sorry, but that's a bit crazy.

I am more than happy to work on the locking itself if we can agree on
what semantics we want out of it. We have two possible approaches, and
we're going to have to pick one first: the locking can be done at the
top of the IO stack (like ext4, and I'm guessing XFS), but then we're
adding locking overhead to buffered reads and writes that don't need it,
because they're only touching pages that are already in cache. Or we can
go with my approach, pushing the locking down to only when we need to
add pages to the page cache.

I think if we started out by merging my approach, it would be pretty
easy to have it make use of Matthew's fancy XArray-based range locking
when that goes in; the semantics should be similar enough.
If people are OK with my approach and willing to use it, I can polish it
up - add lockdep support and whatever else I can think of, and attempt
to get rid of the stupid recursive part. But that has to be decided
first: where in the call stack the locking should be done.
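The "add or block" lock at the heart of this approach (pagecache_add_get()/pagecache_block_get() in the patch at the end of this thread) boils down to a signed counter: positive values count shared "add" holders, negative values count shared "block" holders, and the two signs exclude each other. A userspace sketch of just the tryget logic, using C11 atomics - names are invented, and the kernel version additionally sleeps on a wait queue instead of returning false:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Two-state shared lock: v > 0 means v "add" holders, v < 0 means -v
 * "block" holders. Either state is shared within itself, like the read
 * side of a rwsem, but the two states exclude each other. */
struct pagecache_lock_model {
	atomic_long v;
};

static bool lock_tryget(struct pagecache_lock_model *lock, long i)
{
	long v = atomic_load(&lock->v);

	do {
		/* adders (i > 0) must not see blockers (v < 0), and vice versa */
		if (i > 0 ? v < 0 : v > 0)
			return false;
	} while (!atomic_compare_exchange_weak(&lock->v, &v, v + i));
	return true;
}

static void lock_put(struct pagecache_lock_model *lock, long i)
{
	atomic_fetch_sub(&lock->v, i);
	/* the kernel version does wake_up_all() when the count hits zero */
}

static bool add_tryget(struct pagecache_lock_model *l)   { return lock_tryget(l, 1); }
static void add_put(struct pagecache_lock_model *l)      { lock_put(l, 1); }
static bool block_tryget(struct pagecache_lock_model *l) { return lock_tryget(l, -1); }
static void block_put(struct pagecache_lock_model *l)    { lock_put(l, -1); }
```

Because both states are shared, many buffered-IO paths can hold the add side concurrently, and many dio writes can hold the block side concurrently; contention only arises between the two classes.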
On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > Honestly I think this probably should be in the core. But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
>
> I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> merging bcachefs. Sorry, but that's a bit crazy.

It isn't crazy at all. In general we expect people to do their fair
share of core work to get their pet feature in. How much is acceptable
is a difficult question and not black and white.

But if you want to grow a critical core structure you better take a stab
at converting existing users. Without that the tradeoff of growing that
core structure is simply not given.

Or to put it in words for this exact feature: unless your new field is
also used by mainstream file systems it is not going to be added.
On Wed, May 23, 2018 at 08:22:39AM -0700, Christoph Hellwig wrote:
> On Sun, May 20, 2018 at 06:45:24PM -0400, Kent Overstreet wrote:
> > > Honestly I think this probably should be in the core. But IFF we move
> > > it to the core the existing users of per-fs locks need to be moved
> > > over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> > > that copied the approach, and probably more if you audit deep enough.
> >
> > I'm not going to go and redo locking in XFS and ext4 as a prerequisite to
> > merging bcachefs. Sorry, but that's a bit crazy.
>
> It isn't crazy at all. In general we expect people to do their fair
> share of core work to get their pet feature in. How much is acceptable
> is a difficult question and not black and white.
>
> But if you want to grow a critical core structure you better take a stab
> at converting existing users. Without that the tradeoff of growing
> that core structure is simply not given.
>
> Or to put it in words for this exact feature: unless your new field
> is also used by mainstream file systems it is not going to be added.

Christoph, I'm really not someone you can accuse of avoiding my share of
core work and refactoring, and you know it.

When are you going to get around to converting existing users of
fs/direct-io.c to iomap so it can be deleted? The kernel is carrying
around two dio implementations right now thanks to you. Not a good
situation, is it?
On Fri, May 18, 2018 at 01:45:49PM -0400, Kent Overstreet wrote:
> On Fri, May 18, 2018 at 08:53:30AM -0700, Christoph Hellwig wrote:
> > On Fri, May 18, 2018 at 06:13:06AM -0700, Matthew Wilcox wrote:
> > > > Historically, the only problematic case has been direct IO, and people
> > > > have been willing to say "well, if you mix buffered and direct IO you
> > > > get what you deserve", and that's probably not unreasonable. But now we
> > > > have fallocate insert range and collapse range, and those are broken in
> > > > ways I frankly don't want to think about if they can't ensure consistency
> > > > with the page cache.
> > >
> > > ext4 manages collapse-vs-pagefault with the ext4-specific i_mmap_sem.
> > > You may get pushback on the grounds that this ought to be a
> > > filesystem-specific lock rather than one embedded in the generic inode.
> >
> > Honestly I think this probably should be in the core. But IFF we move
> > it to the core the existing users of per-fs locks need to be moved
> > over first. E.g. XFS as the very first one, and at least ext4 and f2fs
> > that copied the approach, and probably more if you audit deep enough.
>
> I didn't know about i_mmap_sem, thanks
>
> But, using a rw semaphore wouldn't work for dio writes, and I do need dio writes
> to block pagecache adds in bcachefs since the dio write could overwrite
> uncompressed data or a reservation with compressed data, which means new writes
> need a disk reservation.
>
> Also I'm guessing ext4 takes the lock at the top of the read path? That sucks
> for reads that are already cached, the generic buffered read path can do cached
> reads without taking any per inode locks - that's why I pushed the lock down to
> only be taken when we have to add a page to the pagecache.
>
> Definitely less ugly doing it that way though...

More notes on locking design: Matthew, this is very relevant to any sort
of range locking, too.

There are some really tricky issues related to dio writes + page faults.
If your particular filesystem doesn't care about minor page cache
consistency issues caused by dio writes, most of this may not be
relevant to you, but I honestly would find it a little hard to believe
this isn't an issue for _any_ other filesystem.

Current situation today, for most filesystems: the top of the dio write
path shoots down the region of the pagecache for the file it's writing
to, with filemap_write_and_wait_range() then
invalidate_inode_pages2_range().

This is racy, though, and does _not_ guarantee page cache consistency;
i.e. we can end up in a situation where the write completes, but we have
stale data - and worse, potentially stale metadata, in the buffer heads
or whatever your filesystem uses.

Ok, solving that is easy enough: we need a lock dio writes can take that
prevents pages from being added to a mapping for their duration (or
alternately, range locks, but range locks can be viewed as just an
optimization).

This explains my locking design: if you have a lock that can be taken
for "add" or "block" - where multiple threads can take it for add or
block, but it can't be in both states simultaneously - then it does what
we want. You can have multiple outstanding dio writes or multiple
buffered IO operations bringing pages in, and it doesn't add much
overhead.

This isn't enough, though:

 - Calling filemap_write_and_wait_range() then
   invalidate_inode_pages2_range() can race - the call to
   invalidate_inode_pages2_range() will fail if any of the pages have
   been redirtied, and we'll have stale pages in the page cache.

   The ideal solution for this would be a function that does both:
   removing pages from the page cache atomically with clearing PageDirty
   and kicking off writeback. Alternately, you can have .page_mkwrite
   and the buffered write path take the pagecache add lock when they
   have to dirty a page, but that kind of sucks.
 - Page faults, via gup():

   This is the really annoying one. If userspace does a dio write where
   the buf they're writing is mmapped and overlaps with the part of the
   file they're writing to - yay, fun - we call gup() after shooting
   down the part of the pagecache we're writing to, so gup() just faults
   it back in again, and we're left with stale data in the page cache
   again.

   The generic dio code tries to deal with this these days by calling
   invalidate_inode_pages2_range() again after the dio write completes.
   Again though, invalidate_inode_pages2_range() will fail if the pages
   are dirty, and we're left with stale data in the page cache.

   I _think_ (haven't tried this yet) it ought to be possible to move
   the second call to invalidate_inode_pages2_range() to immediately
   after the call to gup() - this change wouldn't make sense in the
   current code without locking, but it would make sense if the dio
   write path is holding a pagecache add lock to prevent anything but
   its own faults via gup() from bringing pages back in. We also need to
   prevent the pages gup() faulted in from being dirtied until they can
   be removed again...

   Also, one of the things my patch did was change filemap_fault() to
   not kick off readahead and only read in single pages if we're being
   called with the pagecache block lock held (i.e. from a dio write to
   the same file). I'm pretty sure this is unnecessary though, if I add
   the second invalidate_inode_pages2_range() call to my own dio code
   after gup() (which I believe is the correct solution anyways).

 - Lock recursion:

   My locking approach pushes the locking down to only when we're adding
   pages to the radix tree, unlike, I believe, the ext4/XFS approach.
   This means that dio write takes the pagecache block lock, and page
   faults take the pagecache add lock, so dio write -> gup() -> fault is
   a recursive locking deadlock unless we do something about it.

   My solution was to add a pointer to a pagecache_lock to task_struct,
   so we can detect the recursion when we go to take the pagecache add
   lock.
It's ugly, but I haven't thought of a better way to do it.

I'm pretty sure that the XArray-based range locking Matthew wants to do
will hit the same issue; it's just inherent to pushing the locking down
to where we manipulate the radix tree - which IMO we want to do, so that
buffered reads/writes to cached data aren't taking these locks
unnecessarily; those are the fast paths.

If anyone has any better ideas than what I've come up with, or sees any
gaping holes in my design, please speak up...

Even if you don't care about dio write consistency, having this locking
in the core VFS could mean converting truncate to use it, which I think
would be a major benefit too.
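The recursion-detection trick described above - remembering the held lock in task_struct so that faults from the dio path's own gup() calls don't deadlock against the block state already held - can be sketched in userspace with a thread-local pointer. All names are invented; a plain counter stands in for the real lock, and the model is single-threaded:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the two-state lock: negative = "block" holders,
 * positive = "add" holders (single-threaded here, so a plain long). */
struct pc_lock { long v; };

/* Stand-in for current->pagecache_lock */
static _Thread_local struct pc_lock *held_pagecache_lock;

static void pagecache_block_get(struct pc_lock *lock)
{
	assert(held_pagecache_lock == NULL);
	lock->v--;			/* take in "block" state */
	held_pagecache_lock = lock;	/* remember it for recursion detection */
}

static void pagecache_block_put(struct pc_lock *lock)
{
	assert(held_pagecache_lock == lock);
	held_pagecache_lock = NULL;
	lock->v++;
}

/* dio write -> gup() -> fault -> add page: this would self-deadlock if
 * the fault path blindly took the add side of a lock we already hold in
 * block state, so it skips the lock when it detects the recursion. */
static int add_to_page_cache_model(struct pc_lock *add_lock)
{
	if (held_pagecache_lock != add_lock) {
		if (add_lock->v < 0)
			return -1;	/* real version sleeps until blockers leave */
		add_lock->v++;		/* take in "add" state */
		add_lock->v--;		/* ...and drop it after the insert */
	}
	/* insert the page into the radix tree here */
	return 0;
}
```

The same shape should apply to range locks: whatever entity the fault path would sleep on, the dio path has to advertise that it already owns it.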
diff --git a/fs/inode.c b/fs/inode.c
index ef362364d3..e7aaa39adb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -350,6 +350,7 @@ void address_space_init_once(struct address_space *mapping)
 {
 	memset(mapping, 0, sizeof(*mapping));
 	INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC | __GFP_ACCOUNT);
+	pagecache_lock_init(&mapping->add_lock);
 	spin_lock_init(&mapping->tree_lock);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6baf76761..18d2886a44 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -388,9 +388,32 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
 
+/*
+ * Two-state lock - can be taken for add or block - both states are shared,
+ * like read side of rwsem, but conflict with other state:
+ */
+struct pagecache_lock {
+	atomic_long_t		v;
+	wait_queue_head_t	wait;
+};
+
+static inline void pagecache_lock_init(struct pagecache_lock *lock)
+{
+	atomic_long_set(&lock->v, 0);
+	init_waitqueue_head(&lock->wait);
+}
+
+void pagecache_add_put(struct pagecache_lock *);
+void pagecache_add_get(struct pagecache_lock *);
+void __pagecache_block_put(struct pagecache_lock *);
+void __pagecache_block_get(struct pagecache_lock *);
+void pagecache_block_put(struct pagecache_lock *);
+void pagecache_block_get(struct pagecache_lock *);
+
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
+	struct pagecache_lock	add_lock;	/* protects adding new pages */
 	spinlock_t		tree_lock;	/* and lock protecting it */
 	atomic_t		i_mmap_writable;/* count VM_SHARED mappings */
 	struct rb_root_cached	i_mmap;		/* tree of private and shared mappings */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a90..e58465f61a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -40,6 +40,7 @@ struct io_context;
 struct mempolicy;
 struct nameidata;
 struct nsproxy;
+struct pagecache_lock;
 struct perf_event_context;
 struct pid_namespace;
 struct pipe_inode_info;
@@ -865,6 +866,9 @@ struct task_struct {
 	unsigned int			in_ubsan;
 #endif
 
+	/* currently held lock, for avoiding recursing in fault path: */
+	struct pagecache_lock		*pagecache_lock;
+
 	/* Journalling filesystem info: */
 	void				*journal_info;
 
diff --git a/init/init_task.c b/init/init_task.c
index 3ac6e754cf..308d46eef9 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -106,6 +106,7 @@ struct task_struct init_task
 	},
 	.blocked	= {{0}},
 	.alloc_lock	= __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
+	.pagecache_lock = NULL,
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
diff --git a/mm/filemap.c b/mm/filemap.c
index 693f62212a..31dd888785 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -111,6 +111,73 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
+{
+	BUG_ON(atomic_long_read(&lock->v) == 0);
+
+	if (atomic_long_sub_return_release(i, &lock->v) == 0)
+		wake_up_all(&lock->wait);
+}
+
+static bool __pagecache_lock_tryget(struct pagecache_lock *lock, long i)
+{
+	long v = atomic_long_read(&lock->v), old;
+
+	do {
+		old = v;
+
+		if (i > 0 ? v < 0 : v > 0)
+			return false;
+	} while ((v = atomic_long_cmpxchg_acquire(&lock->v,
+					old, old + i)) != old);
+	return true;
+}
+
+static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
+{
+	wait_event(lock->wait, __pagecache_lock_tryget(lock, i));
+}
+
+void pagecache_add_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_put);
+
+void pagecache_add_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_get);
+
+void __pagecache_block_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_put);
+
+void __pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_get);
+
+void pagecache_block_put(struct pagecache_lock *lock)
+{
+	BUG_ON(current->pagecache_lock != lock);
+	current->pagecache_lock = NULL;
+	__pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(pagecache_block_put);
+
+void pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+	BUG_ON(current->pagecache_lock);
+	current->pagecache_lock = lock;
+}
+EXPORT_SYMBOL(pagecache_block_get);
+
 static int page_cache_tree_insert(struct address_space *mapping,
 				  struct page *page, void **shadowp)
 {
@@ -834,18 +901,21 @@ static int __add_to_page_cache_locked(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_get(&mapping->add_lock);
+
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
 					      gfp_mask, &memcg, false);
 		if (error)
-			return error;
+			goto err;
 	}
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		if (!huge)
 			mem_cgroup_cancel_charge(page, memcg, false);
-		return error;
+		goto err;
 	}
 
 	get_page(page);
@@ -865,7 +935,11 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
-	return 0;
+err:
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_put(&mapping->add_lock);
+
+	return error;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
@@ -873,7 +947,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
-	return error;
+	goto err;
 }
 
 /**
@@ -2511,7 +2585,14 @@ int filemap_fault(struct vm_fault *vmf)
 	 * Do we have something in the page cache already?
 	 */
 	page = find_get_page(mapping, offset);
-	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
+	if (unlikely(current->pagecache_lock == &mapping->add_lock)) {
+		/*
+		 * fault from e.g. dio -> get_user_pages() - _don't_ want to do
+		 * readahead, only read in page we need:
+		 */
+		if (!page)
+			goto no_cached_page;
+	} else if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
 		 * waiting for the lock.
Add a per address space lock around adding pages to the pagecache - making it
possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
hopefully making truncate and dio a bit saner.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/inode.c            |  1 +
 include/linux/fs.h    | 23 +++++++++++
 include/linux/sched.h |  4 ++
 init/init_task.c      |  1 +
 mm/filemap.c          | 91 ++++++++++++++++++++++++++++++++++++++++---
 5 files changed, 115 insertions(+), 5 deletions(-)