Message ID: 1434051673-13838-7-git-send-email-jbacik@fb.com (mailing list archive)
State: New, archived
On Thu 11-06-15 15:41:11, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> wait_sb_inodes() currently does a walk of all inodes in the filesystem
> to find dirty ones to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> To avoid needing all the related locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
>
> This means that a rare sync(2) call will have some work to do
> skipping clean inodes. However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
>
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>

I don't think you noticed the comments I made when you last posted this
series:

>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * bd_inode's will have already had the bdev free'd so don't bother
> +	 * doing the bdi_clear_inode_writeback.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
>  }

Why do we bother with not putting the bdev inode back?

...

>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +		list_move(&inode->i_wb_list, &bdi->wb.b_writeback);

Indenting is broken here...

>  			spin_unlock(&inode->i_lock);
> +			cond_resched_lock(&bdi->wb.list_lock);
>  			continue;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);

								Honza
On 06/15/2015 07:12 AM, Jan Kara wrote:
> On Thu 11-06-15 15:41:11, Josef Bacik wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> wait_sb_inodes() currently does a walk of all inodes in the filesystem
>> to find dirty ones to wait on during sync. This is highly
>> inefficient and wastes a lot of CPU when there are lots of clean
>> cached inodes that we don't need to wait on.
>>
>> To avoid this "all inode" walk, we need to track inodes that are
>> currently under writeback that we need to wait for. We do this by
>> adding inodes to a writeback list on the bdi when the mapping is
>> first tagged as having pages under writeback. wait_sb_inodes() can
>> then walk this list of "inodes under IO" and wait specifically just
>> for the inodes that the current sync(2) needs to wait for.
>>
>> To avoid needing all the related locking to be safe against
>> interrupts, Jan Kara suggested that we be lazy about removal from
>> the writeback list. That is, we don't remove inodes from the
>> writeback list on IO completion, but do it directly during a
>> wait_sb_inodes() walk.
>>
>> This means that a rare sync(2) call will have some work to do
>> skipping clean inodes. However, in the current problem case of
>> concurrent sync workloads, concurrent wait_sb_inodes() calls only
>> walk the very recently dispatched inodes and hence should have very
>> little work to do.
>>
>> This also means that we have to remove the inodes from the writeback
>> list during eviction. Do this in inode_wait_for_writeback() once
>> all writeback on the inode is complete.
>>
>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>
> I don't think you noticed the comments I made when you last posted this
> series:

You're right, I missed them completely, sorry.

>
>>  /*
>> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
>> + * Wait for writeback on an inode to complete during eviction. Caller must have
>> + * inode pinned.
>>   */
>>  void inode_wait_for_writeback(struct inode *inode)
>>  {
>> +	BUG_ON(!(inode->i_state & I_FREEING));
>> +
>>  	spin_lock(&inode->i_lock);
>>  	__inode_wait_for_writeback(inode);
>>  	spin_unlock(&inode->i_lock);
>> +
>> +	/*
>> +	 * bd_inode's will have already had the bdev free'd so don't bother
>> +	 * doing the bdi_clear_inode_writeback.
>> +	 */
>> +	if (!sb_is_blkdev_sb(inode->i_sb))
>> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
>>  }
>
> Why do we bother with not putting the bdev inode back?
>

My memory is rusty on this, but if the inode is the inode for a bdev, we
will have already freed the bdev at this point and we get a NULL pointer
deref, so this just skips that bit. I'll fix up the indenting.

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue 16-06-15 08:42:27, Josef Bacik wrote:
> >>  /*
> >> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> >> + * Wait for writeback on an inode to complete during eviction. Caller must have
> >> + * inode pinned.
> >>   */
> >>  void inode_wait_for_writeback(struct inode *inode)
> >>  {
> >> +	BUG_ON(!(inode->i_state & I_FREEING));
> >> +
> >>  	spin_lock(&inode->i_lock);
> >>  	__inode_wait_for_writeback(inode);
> >>  	spin_unlock(&inode->i_lock);
> >> +
> >> +	/*
> >> +	 * bd_inode's will have already had the bdev free'd so don't bother
> >> +	 * doing the bdi_clear_inode_writeback.
> >> +	 */
> >> +	if (!sb_is_blkdev_sb(inode->i_sb))
> >> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> >>  }
> >
> > Why do we bother with not putting the bdev inode back?
> >
>
> My memory is rusty on this, but if the inode is the inode for a bdev
> we will have already freed the bdev at this point and we get a NULL
> pointer deref, so this just skips that bit.

Ah, the reason likely is that bdev->bd_disk is NULL (already cleaned up in
__blkdev_put()) at this moment, and thus bdev_get_queue() called from
inode_to_bdi() will oops. Can you please add these details to the comment?
It's a bit subtle...

Also, we shouldn't have any pages in the block device mapping anymore
because of the work done in __blkdev_put() (and thus the inode shouldn't be
on the writeback list), but I'd be calmer if we asserted
list_empty(&inode->i_wb_list). Can you please add that? Thanks!

								Honza
On 06/17/2015 03:34 AM, Jan Kara wrote:
> On Tue 16-06-15 08:42:27, Josef Bacik wrote:
>>>>  /*
>>>> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
>>>> + * Wait for writeback on an inode to complete during eviction. Caller must have
>>>> + * inode pinned.
>>>>   */
>>>>  void inode_wait_for_writeback(struct inode *inode)
>>>>  {
>>>> +	BUG_ON(!(inode->i_state & I_FREEING));
>>>> +
>>>>  	spin_lock(&inode->i_lock);
>>>>  	__inode_wait_for_writeback(inode);
>>>>  	spin_unlock(&inode->i_lock);
>>>> +
>>>> +	/*
>>>> +	 * bd_inode's will have already had the bdev free'd so don't bother
>>>> +	 * doing the bdi_clear_inode_writeback.
>>>> +	 */
>>>> +	if (!sb_is_blkdev_sb(inode->i_sb))
>>>> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
>>>>  }
>>>
>>> Why do we bother with not putting the bdev inode back?
>>>
>>
>> My memory is rusty on this, but if the inode is the inode for a bdev
>> we will have already freed the bdev at this point and we get a NULL
>> pointer deref, so this just skips that bit.
>
> Ah, the reason likely is that bdev->bd_disk is NULL (already cleaned up in
> __blkdev_put()) at this moment, and thus bdev_get_queue() called from
> inode_to_bdi() will oops. Can you please add these details to the comment?
> It's a bit subtle...
>
> Also, we shouldn't have any pages in the block device mapping anymore
> because of the work done in __blkdev_put() (and thus the inode shouldn't be
> on the writeback list), but I'd be calmer if we asserted
> list_empty(&inode->i_wb_list). Can you please add that? Thanks!

Won't it trip if we never sync before we drop the device, though? So we
write some stuff to the block device, it gets written out, we then drop
the device for whatever reason and boom, we hit
BUG_ON(!list_empty(&inode->i_wb_list)) because we're still on the
writeback list, even though it doesn't matter because this disk is going
away. Just an untested theory, what do you think?

If that is possible, I suppose I could just move the clearing bit for the
bdev inode to before we drop bd_disk, if that would make you happy.

Thanks,

Josef
On Wed 17-06-15 10:55:58, Josef Bacik wrote:
> On 06/17/2015 03:34 AM, Jan Kara wrote:
> > On Tue 16-06-15 08:42:27, Josef Bacik wrote:
> >>>>  /*
> >>>> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> >>>> + * Wait for writeback on an inode to complete during eviction. Caller must have
> >>>> + * inode pinned.
> >>>>   */
> >>>>  void inode_wait_for_writeback(struct inode *inode)
> >>>>  {
> >>>> +	BUG_ON(!(inode->i_state & I_FREEING));
> >>>> +
> >>>>  	spin_lock(&inode->i_lock);
> >>>>  	__inode_wait_for_writeback(inode);
> >>>>  	spin_unlock(&inode->i_lock);
> >>>> +
> >>>> +	/*
> >>>> +	 * bd_inode's will have already had the bdev free'd so don't bother
> >>>> +	 * doing the bdi_clear_inode_writeback.
> >>>> +	 */
> >>>> +	if (!sb_is_blkdev_sb(inode->i_sb))
> >>>> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> >>>>  }
> >>>
> >>> Why do we bother with not putting the bdev inode back?
> >>>
> >>
> >> My memory is rusty on this, but if the inode is the inode for a bdev
> >> we will have already freed the bdev at this point and we get a NULL
> >> pointer deref, so this just skips that bit.
> >
> > Ah, the reason likely is that bdev->bd_disk is NULL (already cleaned up in
> > __blkdev_put()) at this moment, and thus bdev_get_queue() called from
> > inode_to_bdi() will oops. Can you please add these details to the comment?
> > It's a bit subtle...
> >
> > Also, we shouldn't have any pages in the block device mapping anymore
> > because of the work done in __blkdev_put() (and thus the inode shouldn't be
> > on the writeback list), but I'd be calmer if we asserted
> > list_empty(&inode->i_wb_list). Can you please add that? Thanks!
>
> Won't it trip if we never sync before we drop the device, though? So we
> write some stuff to the block device, it gets written out, we then drop
> the device for whatever reason and boom, we hit
> BUG_ON(!list_empty(&inode->i_wb_list)) because we're still on the
> writeback list, even though it doesn't matter because this disk is going
> away. Just an untested theory, what do you think?

Well, we are going to free the inode. If it is still linked into the
writeback list, we are in trouble, as the list will then contain a freed
element. So it is better to BUG_ON earlier than to chase memory corruption
later. Calling bdi_clear_inode_writeback() in kill_bdev() is probably a
good idea, and we can then just have the assertion in evict() to make sure
nothing went wrong. OK?

								Honza
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index aa72536..3f5b2ff 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+				      struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -383,13 +411,23 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * bd_inode's will have already had the bdev free'd so don't bother
+	 * doing the bdi_clear_inode_writeback.
+	 */
+	if (!sb_is_blkdev_sb(inode->i_sb))
+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
 }
 
 /*
@@ -1372,7 +1410,8 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	LIST_HEAD(sync_list);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1380,48 +1419,82 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
 	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out. In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages put under IO
+	 * after we started our wait, so we need to make sure the next sync or
+	 * IO completion treats them correctly. Move them back to the primary
+	 * list and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong on
+	 * any list, so simply remove them from the sync list and move onwards.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&bdi->wb.list_lock);
+	list_splice_init(&bdi->wb.b_writeback, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		/*
+		 * We are lazy on IO completion and don't remove inodes from the
+		 * list when they are clean. Detect that immediately and skip
+		 * inodes we don't have to wait on.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			list_del_init(&inode->i_wb_list);
+			cond_resched_lock(&bdi->wb.list_lock);
+			continue;
+		}
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
 			spin_unlock(&inode->i_lock);
+			cond_resched_lock(&bdi->wb.list_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
+		spin_unlock(&bdi->wb.list_lock);
 
 		filemap_fdatawait(mapping);
-		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		/*
+		 * The inode has been written back now, so check whether we
+		 * still have pages under IO and move it back to the primary
+		 * list if necessary. We really need the mapping->tree_lock
+		 * here because bdi_mark_inode_writeback may have not done
+		 * anything because we were on the spliced list and we need to
+		 * check TAG_WRITEBACK.
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		spin_lock(&bdi->wb.list_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			WARN_ON(list_empty(&inode->i_wb_list));
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
+		} else
+			list_del_init(&inode->i_wb_list);
+		spin_unlock_irq(&mapping->tree_lock);
+		iput(inode);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+	spin_unlock(&bdi->wb.list_lock);
 	mutex_unlock(&sb->s_sync_lock);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index c76a575..d1e6598 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -356,6 +356,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aff923a..12d8224 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -56,6 +56,7 @@ struct bdi_writeback {
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
 	struct list_head b_dirty_time;	/* time stamps are dirty */
+	struct list_head b_writeback;	/* inodes under writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -124,6 +125,8 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
+			      struct inode *inode);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f4d56cc..f47792f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -637,6 +637,7 @@ struct inode {
 	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6e7a644..df31958 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -386,6 +386,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
 	INIT_LIST_HEAD(&wb->b_dirty_time);
+	INIT_LIST_HEAD(&wb->b_writeback);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index eb59f7e..f3751d1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	spin_lock_irqsave(&mapping->tree_lock, flags);
 	ret = TestSetPageWriteback(page);
 	if (!ret) {
+		bool on_wblist;
+
+		on_wblist = mapping_tagged(mapping,
+					   PAGECACHE_TAG_WRITEBACK);
+
 		radix_tree_tag_set(&mapping->page_tree,
 				   page_index(page),
 				   PAGECACHE_TAG_WRITEBACK);
 		if (bdi_cap_account_writeback(bdi))
 			__inc_bdi_stat(bdi, BDI_WRITEBACK);
+
+		/*
+		 * we can come through here when swapping anonymous
+		 * pages, so we don't necessarily have an inode to
+		 * track for sync. Inodes are removed lazily, so there is
+		 * no equivalent in test_clear_page_writeback().
+		 */
+		if (!on_wblist && mapping->host)
+			bdi_mark_inode_writeback(bdi, mapping->host);
 	}
 	if (!PageDirty(page))
 		radix_tree_tag_clear(&mapping->page_tree,