diff mbox

[2/2] ext4: implement cgroup writeback support

Message ID 1434495193-31182-3-git-send-email-tj@kernel.org (mailing list archive)
State New, archived
Headers show

Commit Message

Tejun Heo June 16, 2015, 10:53 p.m. UTC
For ordered and writeback data modes, all data IOs go through
ext4_io_submit.  This patch adds cgroup writeback support by invoking
wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in
io_submit_add_bh().  Journal data which is written by jbd2 worker is
left alone by this patch and will always be written out from the root
cgroup.

ext4_fill_super() is updated to set MS_CGROUPWB when data mode is
either ordered or writeback.  In journaled data mode, most IOs become
synchronous through the journal and enabling cgroup writeback support
doesn't make much sense or difference.  Journaled data mode is left
alone.

Lightly tested with sequential data write workload.  Behaves as
expected.

v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
---
 fs/ext4/page-io.c | 2 ++
 fs/ext4/super.c   | 2 ++
 2 files changed, 4 insertions(+)

Comments

Theodore Ts'o July 22, 2015, 3:56 a.m. UTC | #1
On Tue, Jun 16, 2015 at 06:53:13PM -0400, Tejun Heo wrote:
> For ordered and writeback data modes, all data IOs go through
> ext4_io_submit.  This patch adds cgroup writeback support by invoking
> wbc_init_bio() from io_submit_init_bio() and wbc_account_io() in
> io_submit_add_bh().  Journal data which is written by jbd2 worker is
> left alone by this patch and will always be written out from the root
> cgroup.
> 
> ext4_fill_super() is updated to set MS_CGROUPWB when data mode is
> either ordered or writeback.  In journaled data mode, most IOs become
> synchronous through the journal and enabling cgroup writeback support
> doesn't make much sense or difference.  Journaled data mode is left
> alone.
> 
> Lightly tested with sequential data write workload.  Behaves as
> expected.
> 
> v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: "Theodore Ts'o" <tytso@mit.edu>
> Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> Cc: linux-ext4@vger.kernel.org

Thanks, applied.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Artem Bityutskiy Sept. 23, 2015, 12:49 p.m. UTC | #2
On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote:
> > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB.
> > 
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> > Cc: linux-ext4@vger.kernel.org
> 
> Thanks, applied.

Hi, this patch introduces a regression - a major one, I'd say.

Symptoms: copy a bunch of file, run sync, then run 'reboot', and after
you boot up the copied files are corrupted. So basically the user
-visible symptom is that 'sync' does not work.

I quite an effort to bisect it, but it led me to this patch.

If I take the latest upstream (v4.3-rc2+), and revert this patch:

001e4a8 ext4: imlpement cgroup writeback support

then the problem goes away - files are not corrupted after reboot.

I use ext4 on top of a "bare" partition, no LVM or dm layers involved.

I use Fedora 22 with all the latest package updates, and I only change
the kernel there.

The corruption seems to be that the start with a bunch of zeroes
instead of the real data, but I did not check carefully, looked only at
one file briefly.

Artem.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Artem Bityutskiy Sept. 23, 2015, 1:50 p.m. UTC | #3
On Wed, 2015-09-23 at 15:49 +0300, Artem Bityutskiy wrote:
> On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote:
> > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB.
> > > 
> > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> > > Cc: linux-ext4@vger.kernel.org
> > 
> > Thanks, applied.
> 
> Hi, this patch introduces a regression - a major one, I'd say.
> 
> Symptoms: copy a bunch of file, run sync, then run 'reboot', and
> after
> you boot up the copied files are corrupted. So basically the user
> -visible symptom is that 'sync' does not work.

Just FYI, this is the issue I briefly reported last Fri:

https://lkml.org/lkml/2015/9/18/640

Artem.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Sept. 23, 2015, 5:02 p.m. UTC | #4
On Wed, Sep 23, 2015 at 04:50:53PM +0300, Artem Bityutskiy wrote:
> On Wed, 2015-09-23 at 15:49 +0300, Artem Bityutskiy wrote:
> > On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote:
> > > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB.
> > > > 
> > > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > > Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> > > > Cc: linux-ext4@vger.kernel.org
> > > 
> > > Thanks, applied.
> > 
> > Hi, this patch introduces a regression - a major one, I'd say.
> > 
> > Symptoms: copy a bunch of file, run sync, then run 'reboot', and
> > after
> > you boot up the copied files are corrupted. So basically the user
> > -visible symptom is that 'sync' does not work.
> 
> Just FYI, this is the issue I briefly reported last Fri:
> 
> https://lkml.org/lkml/2015/9/18/640

Also note this performance regression reported by Dexuan Cui

	https://lkml.org/lkml/2015/9/23/333

I didn't notice these problems since I my userspace doesn't enable the
writeback cgroup.  (In fact I don't know how to do it using Debian
Jessie.)

Tejun, can you please take a look at this and give me a
recommendation?  I'm willing to wait a day or two while we try to fix
the problem, but past that point I'd like to have a fix or revert this
commit before Linus releases the next -rc release.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Sept. 23, 2015, 5:25 p.m. UTC | #5
On Wed, Sep 23, 2015 at 03:49:12PM +0300, Artem Bityutskiy wrote:
> On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote:
> > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB.
> > > 
> > > Signed-off-by: Tejun Heo <tj@kernel.org>
> > > Cc: "Theodore Ts'o" <tytso@mit.edu>
> > > Cc: Andreas Dilger <adilger.kernel@dilger.ca>
> > > Cc: linux-ext4@vger.kernel.org
> > 
> > Thanks, applied.
> 
> Hi, this patch introduces a regression - a major one, I'd say.
> 
> Symptoms: copy a bunch of file, run sync, then run 'reboot', and after
> you boot up the copied files are corrupted. So basically the user
> -visible symptom is that 'sync' does not work.

Hi Artem,

Are you doing a hard shutdown (reboot -nf)?  If you're doing a friendly
shutdown, is the FS unmounting cleanly?

> 
> I quite an effort to bisect it, but it led me to this patch.

I bet it was a long bisect.  Trying to see if the same patch to btrfs
has similar impacts.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Sept. 23, 2015, 5:53 p.m. UTC | #6
On Wed, Sep 23, 2015 at 08:41:25PM +0300, Artem Bityutskiy wrote:
>    Hi
> 
>    $ sync
>    $ reboot

If this is case, it should be possible to reproduce with:

cp a bunch of stuff to /ext4
unmount /ext4
mount ext4
compare data

If you're not getting a clean unmount of the test FS during the reboot,
its a different test.  Trying to reproduce here, so far its clean.
Could you please double check for failed unmounts?

Thanks,
Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Sept. 23, 2015, 5:57 p.m. UTC | #7
Hello, Ted.

On Wed, Sep 23, 2015 at 01:02:38PM -0400, Theodore Ts'o wrote:
> Also note this performance regression reported by Dexuan Cui
> 
> 	https://lkml.org/lkml/2015/9/23/333
> 
> I didn't notice these problems since I my userspace doesn't enable the
> writeback cgroup.  (In fact I don't know how to do it using Debian
> Jessie.)

The thing is I don't think they're either.

> Tejun, can you please take a look at this and give me a
> recommendation?  I'm willing to wait a day or two while we try to fix
> the problem, but past that point I'd like to have a fix or revert this
> commit before Linus releases the next -rc release.

Yeap, looking at them both.  Will update as soon as I know more.

Thanks.
Tejun Heo Sept. 23, 2015, 6:09 p.m. UTC | #8
Hello, Artem.

On Wed, Sep 23, 2015 at 03:49:12PM +0300, Artem Bityutskiy wrote:
> I use Fedora 22 with all the latest package updates, and I only change
> the kernel there.

What's your cgroup setup like?  Are you trying out the unified
hierarchy?  Given that you didn't mention that, I'm guessing you
aren't which makes me pretty head-scratchy because I don't see how the
actual behavior would change depending on that flag.  I'll dig more.

Thanks.
Theodore Ts'o Sept. 23, 2015, 6:24 p.m. UTC | #9
Artem,

Can you (or someone on the cgroups list, perhaps) give more details
about how Fedora 22 sets up groups?

Unfortunately apparently no one has gotten an official Fedora image
for Google Compute Engine so it's a bit of a pain for me to reproduce
the problem.  (I suppose I could use AWS, but all of my test
infrastructure uses GCE, and I'd really rather not have to install a
Java Runtime on my laptop. :-)

Or can you reproduce this problem on Debian Jessie?

Thanks,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tejun Heo Sept. 23, 2015, 6:51 p.m. UTC | #10
On Wed, Sep 23, 2015 at 02:09:34PM -0400, Tejun Heo wrote:
> Hello, Artem.
> 
> On Wed, Sep 23, 2015 at 03:49:12PM +0300, Artem Bityutskiy wrote:
> > I use Fedora 22 with all the latest package updates, and I only change
> > the kernel there.
> 
> What's your cgroup setup like?  Are you trying out the unified
> hierarchy?  Given that you didn't mention that, I'm guessing you
> aren't which makes me pretty head-scratchy because I don't see how the
> actual behavior would change depending on that flag.  I'll dig more.

Ugh... never mind.  I spotted what I did wrong.  The sync thing still
needs figuring out (it still should have worked) but fixing the
regression should be easy.  Will update soon.

Thanks.
Tejun Heo Sept. 23, 2015, 7:03 p.m. UTC | #11
Hello, Artem.

On Wed, Sep 23, 2015 at 08:41:25PM +0300, Artem Bityutskiy wrote:
> $ sync
> $ reboot
> 
> is exactly the sequence.

Can you please test with 4.3-rc2 and see whether the issue is
reproducible?  The multi-wb wait logic rewrite was merged during rc1
and I'm wondering whether this is a bug in the old implementation.

Thanks.
Artem Bityutskiy Sept. 23, 2015, 7:47 p.m. UTC | #12
On Wed, Sep 23, 2015 at 9:24 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> Artem,
>
> Can you (or someone on the cgroups list, perhaps) give more details
> about how Fedora 22 sets up groups?
>
> Unfortunately apparently no one has gotten an official Fedora image
> for Google Compute Engine so it's a bit of a pain for me to reproduce
> the problem.  (I suppose I could use AWS, but all of my test
> infrastructure uses GCE, and I'd really rather not have to install a
> Java Runtime on my laptop. :-)

[ My apologies for top posting and for sending HTML e-mails which do
not get through vger.
I am using gmail web interface, and just learned how to send plain
text from here. So re-sending
my longer answer. ]

Hi Ted, Chris, Tejun, all,

quick and probably messy reply before I go to sleep...

I can give more information tomorrow.  But one note - It would be helpful to get
questions like "send us the output of this command" rather than "what are the
cgroups you are in", because I am not fluent with cgroups. IOW, more specific
questions are welcome.

Some more about my setup. I have an number of testboxes, which are 1/2/4-socket
servers. I compile the kernel for them on a separate worker box. Then I copy the
kernel binary to /boot, and the modules to /lib/modules, then run
'sync' and then
reboot to reboot to the new kernel. And vrey often many module files
are corrupted.
They won't load because of majic/crc mismatches.

I copy stuff over scp. Well, this is not exactly scp, but rather a
Python 'scp' module,
which is based on the 'paramiko' module. But I think this should not matter.

Anyway, may be there are some cgroups related with scp/ssh sessions or
/lib/modules
in Fedora 22?

Also note, I tried to be careful during bisecting, I used 4 servers in
parallel, and
did 5 reboot tests on each of them. With this patch reverted all 4
boxes survive 5
reboots just fine. Without this patch reverted, each fail 1-3 reboots.

And, by the way, I forgot this detail - I cut AC power off at the end,
then put it back
on after a 20 seconds delay. I mean, this is a clean reboot, but with
power cut at the
end. So the process is this:

1. I run 'sync' on the box remotely over ssh
2. I run 'reboot'  on the box remotely over ssh, the ssh connection
gets closed at this point
3. I ping the box, and keep doing this until it is stops echoing back
4. I wait several seconds, and then just cut the AC power off. The
wall socket power is off.

So if there was something in, say, SSD cache which was not synced, it
is gone too.

May be this patch reveals an existing issue. My setup has been stable
with 4.2 and many
previous kernels, and it only fails with 4.3-rcX, and my bisecting
lead to this patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Sept. 23, 2015, 8:48 p.m. UTC | #13
On Wed, Sep 23, 2015 at 10:47:16PM +0300, Artem Bityutskiy wrote:
> And, by the way, I forgot this detail - I cut AC power off at the
> end, then put it back on after a 20 seconds delay. I mean, this is a
> clean reboot, but with power cut at the end.

Is this reproducible without the power cut?  And what model SSD are
you using, and are you sure that it has Power Loss Protection (many
SSD vendors use power loss protection as the price discrimination
feature between consumer-grade SSD and enterprise-grade SSD's that
cost $$$).  If this problem wasn't showing up with 4.2, and is only
failing with 4.3-rcX, then it might not be a hardware issue --- but
it's also possible that there was a timing issue which was hiding a
hardware problem.  So for the purposes of debugging, removing the
power cut from the set of variables is a useful thing to do.

      	       	       	  	       	 	- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Artem Bityutskiy Sept. 24, 2015, 8:13 a.m. UTC | #14
On Wed, 2015-09-23 at 16:48 -0400, Theodore Ts'o wrote:
> On Wed, Sep 23, 2015 at 10:47:16PM +0300, Artem Bityutskiy wrote:
> > And, by the way, I forgot this detail - I cut AC power off at the
> > end, then put it back on after a 20 seconds delay. I mean, this is
> > a
> > clean reboot, but with power cut at the end.
> 
> Is this reproducible without the power cut?  And what model SSD are
> you using, and are you sure that it has Power Loss Protection (many
> SSD vendors use power loss protection as the price discrimination
> feature between consumer-grade SSD and enterprise-grade SSD's that
> cost $$$).  If this problem wasn't showing up with 4.2, and is only
> failing with 4.3-rcX, then it might not be a hardware issue --- but
> it's also possible that there was a timing issue which was hiding a
> hardware problem.  So for the purposes of debugging, removing the
> power cut from the set of variables is a useful thing to do.

Ted, you are right that the problem may be anywhere, but since Tejun's
patch fixes it, I decided to not spend time on testing without AC power
cuts.

But I checked and on of my boxes uses an HDD, and it shows the same
problem, which makes the theory of imperfect SSD firmware being
involved less likely.

Thanks,
Artem.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 3f80cb2..c56ba7b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -383,6 +383,7 @@  static int io_submit_init_bio(struct ext4_io_submit *io,
 	bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES));
 	if (!bio)
 		return -ENOMEM;
+	wbc_init_bio(io->io_wbc, bio);
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
 	bio->bi_end_io = ext4_end_bio;
@@ -411,6 +412,7 @@  static int io_submit_add_bh(struct ext4_io_submit *io,
 	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
 		goto submit_and_retry;
+	wbc_account_io(io->io_wbc, page, bh->b_size);
 	io->io_next_block++;
 	return 0;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 56b8bb7..3ad1eb4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3623,6 +3623,8 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		}
 		if (test_opt(sb, DELALLOC))
 			clear_opt(sb, DELALLOC);
+	} else {
+		sb->s_iflags |= SB_I_CGROUPWB;
 	}
 
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |