[2/3] jbd2: introduce jbd2_inode dirty range scoping

Message ID 20190619172156.105508-3-zwisler@google.com (mailing list archive)
State New, archived
Series Add dirty range scoping to jbd2

Commit Message

Ross Zwisler June 19, 2019, 5:21 p.m. UTC
Currently both journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() operate on the entire address space
of each of the inodes associated with a given journal entry.  The
consequence of this is that if we have an inode where we are constantly
appending dirty pages we can end up waiting for an indefinite amount of
time in journal_finish_inode_data_buffers() while we wait for all the
pages under writeback to be written out.

The easiest way to cause this type of workload is to just dd from
/dev/zero to a file until it fills the entire filesystem.  This can
cause journal_finish_inode_data_buffers() to wait for the duration of
the entire dd operation.

We can improve this situation by scoping each of the inode dirty ranges
associated with a given transaction.  We do this via the jbd2_inode
structure so that the scoping is contained within jbd2 and so that it
follows the lifetime and locking rules for that structure.

This allows us to limit the writeback & wait in
journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() respectively to the dirty range for
a given struct jbd2_inode, keeping us from waiting forever if the inode
in question is still being appended to.
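
The range bookkeeping described above can be sketched in userspace roughly as follows. This is only a model, not the kernel code: `byte_off_t` stands in for `loff_t`, and `struct dirty_range`, `dirty_range_add()` and `dirty_range_add_len()` are hypothetical names chosen for illustration. It assumes, as the transaction.c hunk of this patch does, that an end offset of 0 means no range has been recorded yet.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's loff_t (assumption for this sketch). */
typedef long long byte_off_t;

/* Simplified model of the two fields this patch adds to struct jbd2_inode. */
struct dirty_range {
	byte_off_t start;	/* first dirty byte */
	byte_off_t end;		/* last dirty byte, inclusive; 0 = nothing recorded */
};

/* Widen the tracked range so that it also covers [start_byte, end_byte]. */
static void dirty_range_add(struct dirty_range *r, byte_off_t start_byte,
			    byte_off_t end_byte)
{
	assert(start_byte <= end_byte);

	if (r->end) {
		/* A range already exists: take the union of old and new. */
		if (start_byte < r->start)
			r->start = start_byte;
		if (end_byte > r->end)
			r->end = end_byte;
	} else {
		/* First range recorded since the last reset. */
		r->start = start_byte;
		r->end = end_byte;
	}
}

/* Convert (offset, length) into an inclusive end before merging. */
static void dirty_range_add_len(struct dirty_range *r, byte_off_t start_byte,
				byte_off_t length)
{
	assert(length > 0);
	dirty_range_add(r, start_byte, start_byte + length - 1);
}
```

Under these assumptions, recording 4096 bytes at offset 8192 and then 4096 bytes at offset 4096 leaves the range at [4096, 12287], so writeback and the wait would cover only those bytes rather than the whole address space.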

Signed-off-by: Ross Zwisler <zwisler@google.com>
---
 fs/jbd2/commit.c      | 26 +++++++++++++++++------
 fs/jbd2/journal.c     |  2 ++
 fs/jbd2/transaction.c | 49 ++++++++++++++++++++++++-------------------
 include/linux/jbd2.h  | 22 +++++++++++++++++++
 4 files changed, 72 insertions(+), 27 deletions(-)

Comments

Jan Kara June 20, 2019, 11:04 a.m. UTC | #1
On Wed 19-06-19 11:21:55, Ross Zwisler wrote:
> Currently both journal_submit_inode_data_buffers() and
> journal_finish_inode_data_buffers() operate on the entire address space
> of each of the inodes associated with a given journal entry.  The
> consequence of this is that if we have an inode where we are constantly
> appending dirty pages we can end up waiting for an indefinite amount of
> time in journal_finish_inode_data_buffers() while we wait for all the
> pages under writeback to be written out.
> 
> The easiest way to cause this type of workload is to just dd from
> /dev/zero to a file until it fills the entire filesystem.  This can
> cause journal_finish_inode_data_buffers() to wait for the duration of
> the entire dd operation.
> 
> We can improve this situation by scoping each of the inode dirty ranges
> associated with a given transaction.  We do this via the jbd2_inode
> structure so that the scoping is contained within jbd2 and so that it
> follows the lifetime and locking rules for that structure.
> 
> This allows us to limit the writeback & wait in
> journal_submit_inode_data_buffers() and
> journal_finish_inode_data_buffers() respectively to the dirty range for
> a given struct jbd2_inode, keeping us from waiting forever if the inode
> in question is still being appended to.
> 
> Signed-off-by: Ross Zwisler <zwisler@google.com>

The patch looks good to me. I was thinking whether we should not have
separate ranges for current and the next transaction but I guess it is not
worth it at least for now. So just one nit below. With that applied feel free
to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> @@ -257,15 +262,24 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
>  	/* For locking, see the comment in journal_submit_data_buffers() */
>  	spin_lock(&journal->j_list_lock);
>  	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> +		loff_t dirty_start = jinode->i_dirty_start;
> +		loff_t dirty_end = jinode->i_dirty_end;
> +
>  		if (!(jinode->i_flags & JI_WAIT_DATA))
>  			continue;
>  		jinode->i_flags |= JI_COMMIT_RUNNING;
>  		spin_unlock(&journal->j_list_lock);
> -		err = filemap_fdatawait_keep_errors(
> -				jinode->i_vfs_inode->i_mapping);
> +		err = filemap_fdatawait_range_keep_errors(
> +				jinode->i_vfs_inode->i_mapping, dirty_start,
> +				dirty_end);
>  		if (!ret)
>  			ret = err;
>  		spin_lock(&journal->j_list_lock);
> +
> +		if (!jinode->i_next_transaction) {
> +			jinode->i_dirty_start = 0;
> +			jinode->i_dirty_end = 0;
> +		}

This would be more logical in the next loop that moves jinode into the next
transaction.

>  		jinode->i_flags &= ~JI_COMMIT_RUNNING;
>  		smp_mb();
>  		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);

								Honza
Ross Zwisler June 20, 2019, 3:09 p.m. UTC | #2
On Thu, Jun 20, 2019 at 01:04:54PM +0200, Jan Kara wrote:
> On Wed 19-06-19 11:21:55, Ross Zwisler wrote:
> > Currently both journal_submit_inode_data_buffers() and
> > journal_finish_inode_data_buffers() operate on the entire address space
> > of each of the inodes associated with a given journal entry.  The
> > consequence of this is that if we have an inode where we are constantly
> > appending dirty pages we can end up waiting for an indefinite amount of
> > time in journal_finish_inode_data_buffers() while we wait for all the
> > pages under writeback to be written out.
> > 
> > The easiest way to cause this type of workload is to just dd from
> > /dev/zero to a file until it fills the entire filesystem.  This can
> > cause journal_finish_inode_data_buffers() to wait for the duration of
> > the entire dd operation.
> > 
> > We can improve this situation by scoping each of the inode dirty ranges
> > associated with a given transaction.  We do this via the jbd2_inode
> > structure so that the scoping is contained within jbd2 and so that it
> > follows the lifetime and locking rules for that structure.
> > 
> > This allows us to limit the writeback & wait in
> > journal_submit_inode_data_buffers() and
> > journal_finish_inode_data_buffers() respectively to the dirty range for
> > a given struct jbd2_inode, keeping us from waiting forever if the inode
> > in question is still being appended to.
> > 
> > Signed-off-by: Ross Zwisler <zwisler@google.com>
> 
> The patch looks good to me. I was thinking whether we should not have
> separate ranges for current and the next transaction but I guess it is not
> worth it at least for now. So just one nit below. With that applied feel free
> to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

We could definitely keep separate dirty ranges for each of the current and
next transaction.  I think the case where you would see a difference would be
if you had multiple transactions in a row which grew the dirty range for a
given jbd2_inode, and then had a random I/O workload which kept dirtying pages
inside that enlarged dirty range.

I'm not sure how often this type of workload would be a problem.  For the
workloads I've been testing which purely append to the inode, having a single
dirty range per jbd2_inode is sufficient.

I guess for now this single range seems simpler, but if later we find that
someone would benefit from separate tracking for each of the current and next
transactions, I'll take a shot at adding it.
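
For illustration only, separate tracking might look something like the userspace sketch below. Nothing here is part of this patch or of jbd2: `byte_off_t` stands in for `loff_t`, and `struct range`, `struct two_range_jinode`, `range_add()` and `two_range_commit_done()` are hypothetical names. The sketch also uses an explicit `valid` flag instead of treating an end offset of 0 as empty.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's loff_t (assumption for this sketch). */
typedef long long byte_off_t;

struct range {
	byte_off_t start;	/* first dirty byte */
	byte_off_t end;		/* last dirty byte, inclusive */
	int valid;		/* 0 until something has been recorded */
};

/* Hypothetical jinode that keeps one range per transaction. */
struct two_range_jinode {
	struct range cur;	/* dirtied under the current transaction */
	struct range next;	/* dirtied under the next transaction */
};

/* Widen r so that it also covers [start_byte, end_byte]. */
static void range_add(struct range *r, byte_off_t start_byte,
		      byte_off_t end_byte)
{
	assert(start_byte <= end_byte);

	if (!r->valid) {
		r->start = start_byte;
		r->end = end_byte;
		r->valid = 1;
		return;
	}
	if (start_byte < r->start)
		r->start = start_byte;
	if (end_byte > r->end)
		r->end = end_byte;
}

/* After a commit, the next transaction's range becomes the current one. */
static void two_range_commit_done(struct two_range_jinode *j)
{
	j->cur = j->next;
	j->next.start = 0;
	j->next.end = 0;
	j->next.valid = 0;
}
```

In this model, if the committing transaction covered [0, 1048575] and the next one only dirtied [4096, 8191], the following commit would wait on 4 KiB rather than on the whole enlarged range. That is the difference described above; with a single range per jbd2_inode the wider range would carry over.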

Thank you for the review!

> > @@ -257,15 +262,24 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> >  	/* For locking, see the comment in journal_submit_data_buffers() */
> >  	spin_lock(&journal->j_list_lock);
> >  	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> > +		loff_t dirty_start = jinode->i_dirty_start;
> > +		loff_t dirty_end = jinode->i_dirty_end;
> > +
> >  		if (!(jinode->i_flags & JI_WAIT_DATA))
> >  			continue;
> >  		jinode->i_flags |= JI_COMMIT_RUNNING;
> >  		spin_unlock(&journal->j_list_lock);
> > -		err = filemap_fdatawait_keep_errors(
> > -				jinode->i_vfs_inode->i_mapping);
> > +		err = filemap_fdatawait_range_keep_errors(
> > +				jinode->i_vfs_inode->i_mapping, dirty_start,
> > +				dirty_end);
> >  		if (!ret)
> >  			ret = err;
> >  		spin_lock(&journal->j_list_lock);
> > +
> > +		if (!jinode->i_next_transaction) {
> > +			jinode->i_dirty_start = 0;
> > +			jinode->i_dirty_end = 0;
> > +		}
> 
> This would be more logical in the next loop that moves jinode into the next
> transaction.

Yep, agreed, this is much better.  Fixed in v2.
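
A rough userspace model of the suggested placement is sketched below. It assumes the later loop in the commit path either moves a jinode onto its i_next_transaction or detaches it, and that the reset belongs in the detach branch. `byte_off_t`, `struct txn`, `struct model_jinode` and `model_refile_after_commit()` are hypothetical stand-ins, and list handling and locking are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's loff_t (assumption for this sketch). */
typedef long long byte_off_t;

/* Minimal stand-in for transaction_t. */
struct txn {
	int tid;
};

/* Only the jbd2_inode fields relevant to this discussion are modelled. */
struct model_jinode {
	struct txn *i_transaction;
	struct txn *i_next_transaction;
	byte_off_t i_dirty_start;
	byte_off_t i_dirty_end;
};

/* Per-inode step once the committing transaction is done with the inode. */
static void model_refile_after_commit(struct model_jinode *jinode)
{
	assert(jinode != NULL);

	if (jinode->i_next_transaction) {
		/* Still needed by the next transaction: keep the dirty range. */
		jinode->i_transaction = jinode->i_next_transaction;
		jinode->i_next_transaction = NULL;
	} else {
		/* No transaction references the inode: drop the range too. */
		jinode->i_transaction = NULL;
		jinode->i_dirty_start = 0;
		jinode->i_dirty_end = 0;
	}
}
```

Keeping the reset next to the code that clears i_transaction means the range is dropped in the same place, and under the same condition, as the inode's association with the transaction.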
Theodore Ts'o June 20, 2019, 5:22 p.m. UTC | #3
On Thu, Jun 20, 2019 at 09:09:11AM -0600, Ross Zwisler wrote:
> We could definitely keep separate dirty ranges for each of the current and
> next transaction.  I think the case where you would see a difference would be
> if you had multiple transactions in a row which grew the dirty range for a
> given jbd2_inode, and then had a random I/O workload which kept dirtying pages
> inside that enlarged dirty range.
> 
> I'm not sure how often this type of workload would be a problem.  For the
> workloads I've been testing which purely append to the inode, having a single
> dirty range per jbd2_inode is sufficient.

My inclination would be to keep things simple for now, unless we have
a real workload that tickles this.  In the long run I'm hoping to
remove the need to do writebacks from the journal thread altogether,
by always updating the metadata blocks *after* the I/O completes,
instead of before we submit the I/O.

					- Ted
Patch

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index efd0ce9489ae9..b4b99ea6e8700 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -187,14 +187,15 @@  static int journal_wait_on_commit_record(journal_t *journal,
  * use writepages() because with dealyed allocation we may be doing
  * block allocation in writepages().
  */
-static int journal_submit_inode_data_buffers(struct address_space *mapping)
+static int journal_submit_inode_data_buffers(struct address_space *mapping,
+		loff_t dirty_start, loff_t dirty_end)
 {
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode =  WB_SYNC_ALL,
 		.nr_to_write = mapping->nrpages * 2,
-		.range_start = 0,
-		.range_end = i_size_read(mapping->host),
+		.range_start = dirty_start,
+		.range_end = dirty_end,
 	};
 
 	ret = generic_writepages(mapping, &wbc);
@@ -218,6 +219,9 @@  static int journal_submit_data_buffers(journal_t *journal,
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		loff_t dirty_start = jinode->i_dirty_start;
+		loff_t dirty_end = jinode->i_dirty_end;
+
 		if (!(jinode->i_flags & JI_WRITE_DATA))
 			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
@@ -230,7 +234,8 @@  static int journal_submit_data_buffers(journal_t *journal,
 		 * only allocated blocks here.
 		 */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = journal_submit_inode_data_buffers(mapping);
+		err = journal_submit_inode_data_buffers(mapping, dirty_start,
+				dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -257,15 +262,24 @@  static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		loff_t dirty_start = jinode->i_dirty_start;
+		loff_t dirty_end = jinode->i_dirty_end;
+
 		if (!(jinode->i_flags & JI_WAIT_DATA))
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_keep_errors(
-				jinode->i_vfs_inode->i_mapping);
+		err = filemap_fdatawait_range_keep_errors(
+				jinode->i_vfs_inode->i_mapping, dirty_start,
+				dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
+
+		if (!jinode->i_next_transaction) {
+			jinode->i_dirty_start = 0;
+			jinode->i_dirty_end = 0;
+		}
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
 		smp_mb();
 		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 43df0c943229c..288b8e7cf21c7 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2574,6 +2574,8 @@  void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode)
 	jinode->i_next_transaction = NULL;
 	jinode->i_vfs_inode = inode;
 	jinode->i_flags = 0;
+	jinode->i_dirty_start = 0;
+	jinode->i_dirty_end = 0;
 	INIT_LIST_HEAD(&jinode->i_list);
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 8ca4fddc705fe..990e7b5062e74 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2565,7 +2565,7 @@  void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
  * File inode in the inode list of the handle's transaction
  */
 static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
-				   unsigned long flags)
+		unsigned long flags, loff_t start_byte, loff_t end_byte)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -2577,26 +2577,17 @@  static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
 	jbd_debug(4, "Adding inode %lu, tid:%d\n", jinode->i_vfs_inode->i_ino,
 			transaction->t_tid);
 
-	/*
-	 * First check whether inode isn't already on the transaction's
-	 * lists without taking the lock. Note that this check is safe
-	 * without the lock as we cannot race with somebody removing inode
-	 * from the transaction. The reason is that we remove inode from the
-	 * transaction only in journal_release_jbd_inode() and when we commit
-	 * the transaction. We are guarded from the first case by holding
-	 * a reference to the inode. We are safe against the second case
-	 * because if jinode->i_transaction == transaction, commit code
-	 * cannot touch the transaction because we hold reference to it,
-	 * and if jinode->i_next_transaction == transaction, commit code
-	 * will only file the inode where we want it.
-	 */
-	if ((jinode->i_transaction == transaction ||
-	    jinode->i_next_transaction == transaction) &&
-	    (jinode->i_flags & flags) == flags)
-		return 0;
-
 	spin_lock(&journal->j_list_lock);
 	jinode->i_flags |= flags;
+
+	if (jinode->i_dirty_end) {
+		jinode->i_dirty_start = min(jinode->i_dirty_start, start_byte);
+		jinode->i_dirty_end = max(jinode->i_dirty_end, end_byte);
+	} else {
+		jinode->i_dirty_start = start_byte;
+		jinode->i_dirty_end = end_byte;
+	}
+
 	/* Is inode already attached where we need it? */
 	if (jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction)
@@ -2631,12 +2622,28 @@  static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
 int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *jinode)
 {
 	return jbd2_journal_file_inode(handle, jinode,
-				       JI_WRITE_DATA | JI_WAIT_DATA);
+			JI_WRITE_DATA | JI_WAIT_DATA, 0, LLONG_MAX);
 }
 
 int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *jinode)
 {
-	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA);
+	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA, 0,
+			LLONG_MAX);
+}
+
+int jbd2_journal_inode_ranged_write(handle_t *handle,
+		struct jbd2_inode *jinode, loff_t start_byte, loff_t length)
+{
+	return jbd2_journal_file_inode(handle, jinode,
+			JI_WRITE_DATA | JI_WAIT_DATA, start_byte,
+			start_byte + length - 1);
+}
+
+int jbd2_journal_inode_ranged_wait(handle_t *handle, struct jbd2_inode *jinode,
+		loff_t start_byte, loff_t length)
+{
+	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA,
+			start_byte, start_byte + length - 1);
 }
 
 /*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 5c04181b7c6d8..0e0393e7f41a4 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -451,6 +451,22 @@  struct jbd2_inode {
 	 * @i_flags: Flags of inode [j_list_lock]
 	 */
 	unsigned long i_flags;
+
+	/**
+	 * @i_dirty_start:
+	 *
+	 * Offset in bytes where the dirty range for this inode starts.
+	 * [j_list_lock]
+	 */
+	loff_t i_dirty_start;
+
+	/**
+	 * @i_dirty_end:
+	 *
+	 * Inclusive offset in bytes where the dirty range for this inode
+	 * ends. [j_list_lock]
+	 */
+	loff_t i_dirty_end;
 };
 
 struct jbd2_revoke_table_s;
@@ -1397,6 +1413,12 @@  extern int	   jbd2_journal_force_commit(journal_t *);
 extern int	   jbd2_journal_force_commit_nested(journal_t *);
 extern int	   jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *inode);
 extern int	   jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_ranged_write(handle_t *handle,
+			struct jbd2_inode *inode, loff_t start_byte,
+			loff_t length);
+extern int	   jbd2_journal_inode_ranged_wait(handle_t *handle,
+			struct jbd2_inode *inode, loff_t start_byte,
+			loff_t length);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);