[RFC,1/3] zram: charge the compressed RAM to the page's memcgroup

Message ID 20230615034830.1361853-1-hezhongkun.hzk@bytedance.com (mailing list archive)
State New
Series [RFC,1/3] zram: charge the compressed RAM to the page's memcgroup

Commit Message

Zhongkun He June 15, 2023, 3:48 a.m. UTC
The compressed RAM is currently charged to the kernel, not to
any memory cgroup, which does not fit our usage scenario. If the
memory of a task is limited by a memcg, the task will swap memory
out to the zram swap device when memory is insufficient. In that
case the memory limit has no effect, because the compressed copies
are not accounted to the memcg.

So it makes sense to charge the compressed RAM to the page's
memory cgroup.

Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 drivers/block/zram/zram_drv.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

Comments

Yu Zhao June 15, 2023, 4:59 a.m. UTC | #1
On Wed, Jun 14, 2023 at 9:48 PM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> The compressed RAM is currently charged to kernel, not to
> any memory cgroup, which is not satisfy our usage scenario.
> if the memory of a task is limited by memcgroup, it will
> swap out the memory to zram swap device when the memory
> is insufficient. In that case, the memory limit will have
> no effect.
>
> So, it should makes sense to charge the compressed RAM to
> the page's memory cgroup.

We used to do this a long time ago, but we had per-memcg swapfiles [1]
to prevent compressed pages from different memcgs from sharing the
same zspage.

Does this patchset alone suffer from the same problem, i.e., memcgs
sharing zspages?

[1] https://lwn.net/Articles/592923/
Fabian Deutsch June 15, 2023, 8:57 a.m. UTC | #2
On Thu, Jun 15, 2023 at 6:59 AM Yu Zhao <yuzhao@google.com> wrote:

> On Wed, Jun 14, 2023 at 9:48 PM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
> >
> > The compressed RAM is currently charged to kernel, not to
> > any memory cgroup, which is not satisfy our usage scenario.
> > if the memory of a task is limited by memcgroup, it will
> > swap out the memory to zram swap device when the memory
> > is insufficient. In that case, the memory limit will have
> > no effect.
> >
> > So, it should makes sense to charge the compressed RAM to
> > the page's memory cgroup.
>

While looking at this in the past weeks, I believe that there are two
distinct problems:
1. Direct zram usage by a process within a cg, i.e. a process writing to a zram
device
2. Indirect zram usage by a process within a cg via swap (described above)

Both of them probably require different solutions.
In order to fix #1, a zram device's memory should be accounted towards a
cgroup. IMHO this is something that should be fixed.

Yu Zhao and Yosry are probably much more familiar with the solution to #2.
WRT per-cgroup swapfiles, to me this is addressing #2, but I agree with Yu
Zhao that there are probably better solutions to this.

Lastly, while this patchset will possibly not address the swap issue
(#2) completely, does it satisfy the needs of #1?

- fabian


> We used to do this a long time ago, but we had per-memcg swapfiles [1[
> to prevent compressed pages from different memcgs from sharing the
> same zspage.
>
> Does this patchset alone suffer from the same problem, i.e., memcgs
> sharing zspages?
>
> [1] https://lwn.net/Articles/592923/
>
>
David Hildenbrand June 15, 2023, 9:27 a.m. UTC | #3
On 15.06.23 05:48, Zhongkun He wrote:
> The compressed RAM is currently charged to kernel, not to
> any memory cgroup, which is not satisfy our usage scenario.
> if the memory of a task is limited by memcgroup, it will
> swap out the memory to zram swap device when the memory
> is insufficient. In that case, the memory limit will have
> no effect.
> 
> So, it should makes sense to charge the compressed RAM to
> the page's memory cgroup.

Interesting. When looking at possible ways to achieve that in a clean
way, my understanding was that the page might not always be accounted to
a memcg (e.g., simply some buffer?). Besides the swap->zram case I was
thinking about the fs->zram case, or just a straight read/write to the
zram device.

The easiest way to see where that goes wrong I think is 
zram_bvec_write_partial(), where we simply allocate a temporary page.

Either I am missing something important, or this only works in some 
special cases.

I came to the conclusion that the only reasonable way is to assign a
complete zram device to a single cgroup and have all memory accounted to
that cgroup. That also does not solve all use cases (especially the
swap->zram case, which might be better off just using zswap?) but at least
the memory gets accounted in a more predictable way.

Happy to learn more.
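
For context, the partial-write path mentioned above looks roughly like
this (a paraphrased sketch of the zram driver, not the verbatim upstream
code); the temporary page is a fresh kernel allocation, so page_memcg()
on it is NULL and there is no obvious memcg to attribute the compressed
result to:

	static int zram_bvec_write_partial(struct zram *zram,
					   struct bio_vec *bvec, u32 index,
					   int offset, struct bio *bio)
	{
		/* Fresh kernel page: page_memcg() == NULL. */
		struct page *page = alloc_page(GFP_NOIO);
		int ret;

		if (!page)
			return -ENOMEM;

		/* Read-modify-write of one PAGE_SIZE sector. */
		ret = zram_read_page(zram, page, index, bio);
		if (!ret) {
			memcpy_from_bvec(page_address(page) + offset, bvec);
			ret = zram_write_page(zram, page, index);
		}
		__free_page(page);
		return ret;
	}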
Michal Hocko June 15, 2023, 9:35 a.m. UTC | #5
On Thu 15-06-23 11:48:30, Zhongkun He wrote:
[...]
> @@ -1419,6 +1420,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  	struct zcomp_strm *zstrm;
>  	unsigned long element = 0;
>  	enum zram_pageflags flags = 0;
> +	struct mem_cgroup *memcg, *old_memcg;
> +
> +	memcg = page_memcg(page);
> +	old_memcg = set_active_memcg(memcg);
>  
>  	mem = kmap_atomic(page);
>  	if (page_same_filled(mem, &element)) {
[...]
> @@ -1470,8 +1475,10 @@ static int zram_write_page(struct zram *zram, struct page *page, u32 index)
>  		handle = zs_malloc(zram->mem_pool, comp_len,
>  				GFP_NOIO | __GFP_HIGHMEM |
>  				__GFP_MOVABLE);
> -		if (IS_ERR_VALUE(handle))
> -			return PTR_ERR((void *)handle);
> +		if (IS_ERR_VALUE(handle)) {
> +			ret = PTR_ERR((void *)handle);
> +			goto out;
> +		}
>  
>  		if (comp_len != PAGE_SIZE)
>  			goto compress_again;

I am not really deeply familiar with the zram implementation or its usage,
but how is the above allocation going to be charged without __GFP_ACCOUNT
in the gfp mask?
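
For readers following along: __GFP_ACCOUNT is the usual way a kernel
allocation gets charged to the allocating task's (or the
set_active_memcg()-selected) memcg. A minimal illustration, not part of
the patch under discussion; buf/len stand in for an arbitrary allocation,
the zs_malloc() call is copied from the hunk above:

	/* Illustration only: a slab allocation charged to the current
	 * (or active) memcg via __GFP_ACCOUNT. */
	buf = kmalloc(len, GFP_KERNEL | __GFP_ACCOUNT);

	/* The hunk above allocates the zspage without that flag: */
	handle = zs_malloc(zram->mem_pool, comp_len,
			   GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE);

Whether simply adding the flag would give the intended per-object
semantics is exactly what the rest of the thread discusses.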

Also what exactly is going to happen for the swap backed by the zram
device? Your memcg might be hitting the hard limit and therefore
swapping out. Wouldn't zs_malloc fail very likely under that condition
making the swap effectively unusable?
Zhongkun He June 15, 2023, 9:41 a.m. UTC | #6
On Thu, Jun 15, 2023 at 1:00 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Jun 14, 2023 at 9:48 PM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
> >
> > The compressed RAM is currently charged to kernel, not to
> > any memory cgroup, which is not satisfy our usage scenario.
> > if the memory of a task is limited by memcgroup, it will
> > swap out the memory to zram swap device when the memory
> > is insufficient. In that case, the memory limit will have
> > no effect.
> >
> > So, it should makes sense to charge the compressed RAM to
> > the page's memory cgroup.
>
> We used to do this a long time ago, but we had per-memcg swapfiles [1[
> to prevent compressed pages from different memcgs from sharing the
> same zspage.
>
> Does this patchset alone suffer from the same problem, i.e., memcgs
> sharing zspages?
>
> [1] https://lwn.net/Articles/592923/

Thanks for your reply. Yes, different memcgs may share the same zspages.
Zhongkun He June 15, 2023, 10 a.m. UTC | #7
> While looking at this in the past weeks, I believe that there are two distinct problems:
> 1. Direct zram usage by process within a cg ie. a process writing to a zram device
> 2. Indirect zram usage by a process within a cg via swap (described above)
>
> Both of them probably require different solutions.
> In order to fix #1, accounting a zram device should be accounted towards a cgroup. IMHO this is something that should be fixed.
>
> Yu Zhao and Yosry are probably much more familiar with the solution to #2.
> WRT per-cgrou-swapfile, to me this is addressing #2, but I agree with Yu Zhao, that there are probably better solutions to this.
>
> Lastly, this patchset, while it will possibly not address the swap issue (#2) completely, is it satisfying the needs of #1?
>
> - fabian

Thanks for your reply, Fabian. According to the previous design and
test results, this patchset can meet the needs of both the direct and
indirect usage scenarios: it charges the compressed memory to the
memory cgroup.
Zhongkun He June 15, 2023, 11:15 a.m. UTC | #8
On Thu, Jun 15, 2023 at 5:27 PM David Hildenbrand <david@redhat.com> wrote:
>
> Interesting. When looking at possible ways to achieve that in a clean
> way, my understanding was that the page might not always be accounted to
> a memcg (e.g., simply some buffer?). Besides the swap->zram case I was
> thinking about other fs->zram case, or just a straight read/write to the
> zram device.
>
> The easiest way to see where that goes wrong I think is
> zram_bvec_write_partial(), where we simply allocate a temporary page.
>
> Either I am missing something important, or this only works in some
> special cases.
>
> I came to the conclusion that the only reasonable way is to assign a
> complete zram device to a single cgroup and have all memory accounted to
> that cgroup. Does also not solve all use cases (especially the
> swap->zram case that might be better offjust using zswap?) but at least
> the memory gets accounted in a more predictable way.
>
> Happy to learn more.
>
> --
> Cheers,
>
> David / dhildenb
>

Hi David, thanks for your reply. This is my mistake, I didn't describe
the problem clearly.

The reason for the problem is not the temporary page allocated at the beginning
of zram_bvec_write_partial() and released at the end, but the compressed memory
allocated by zs_malloc() in zram_write_page().

So assigning a complete zram device to a single cgroup may not meet the need.
We need to charge the compressed memory to the original page's memory cgroup,
which is not charged so far.
David Hildenbrand June 15, 2023, 11:19 a.m. UTC | #9
On 15.06.23 13:15, 贺中坤 wrote:
> On Thu, Jun 15, 2023 at 5:27 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> Interesting. When looking at possible ways to achieve that in a clean
>> way, my understanding was that the page might not always be accounted to
>> a memcg (e.g., simply some buffer?). Besides the swap->zram case I was
>> thinking about other fs->zram case, or just a straight read/write to the
>> zram device.
>>
>> The easiest way to see where that goes wrong I think is
>> zram_bvec_write_partial(), where we simply allocate a temporary page.
>>
>> Either I am missing something important, or this only works in some
>> special cases.
>>
>> I came to the conclusion that the only reasonable way is to assign a
>> complete zram device to a single cgroup and have all memory accounted to
>> that cgroup. Does also not solve all use cases (especially the
>> swap->zram case that might be better offjust using zswap?) but at least
>> the memory gets accounted in a more predictable way.
>>
>> Happy to learn more.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
> 
> Hi david, thanks for your reply. This should be my mistake, I didn't
> describe the
> problem clearly.
> 
> The reason for the problem is not the temporary page allocated at the beginning
> of zram_bvec_write_partial() and released at the end,but the compressed memory
> allocated by zs_malloc() in zram_write_page().
> 
> So, it may not meet the need to  assign a complete zram device to a
> single cgroup.
> we need to charge the compressed memory to the original page's memory cgroup,
> which is not charged so far.
> 

Yes, but my point is that there are cases where the pages you get are 
not charged. zram_bvec_write_partial() is just one such example that 
highlights the issue.
Zhongkun He June 15, 2023, 11:58 a.m. UTC | #10
Hi Michal, glad to hear from you.

> I am not really deeply familiar with zram implementation nor usage but
> how is the above allocation going to be charged without __GFP_ACCOUNT in
> the gfp mask?

Yes, zs_malloc() does not charge the compressed memory even if we add this
gfp flag, so we need to implement that charging in this patchset. But this
flag should be used to enable the feature.
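
A purely illustrative sketch of what such per-object charging could look
like -- patch 3 itself is not quoted in this thread, so this is an
assumption about its shape, borrowing the obj_cgroup pattern that slab
and zswap use; zs_charge_object() and its arguments are made-up names:

	#include <linux/memcontrol.h>

	/*
	 * Sketch only, not the actual patch 3: charge the compressed
	 * object's size, not the whole zspage, to the memcg that
	 * zram_write_page() made active via set_active_memcg().
	 */
	static int zs_charge_object(size_t obj_size, gfp_t gfp)
	{
		/* Honors the memcg installed by set_active_memcg(). */
		struct obj_cgroup *objcg = get_obj_cgroup_from_current();
		int ret;

		if (!objcg)
			return 0;	/* root cgroup: nothing to charge */

		ret = obj_cgroup_charge(objcg, gfp, obj_size);
		/* A real implementation would stash objcg so that
		 * zs_free() can uncharge later. */
		obj_cgroup_put(objcg);
		return ret;
	}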

> Also what exactly is going to happen for the swap backed by the zram
> device? Your memcg might be hitting the hard limit and therefore
> swapping out. Wouldn't zs_malloc fail very likely under that condition
> making the swap effectively unusable?

This is the key point. As I said above, zs_malloc() does not charge the
compressed memory, so zs_malloc() will not fail under that condition. If
the zram swap device is large enough, zs_malloc() never fails unless the
system OOMs, so memory.max is effectively invalidated.


> --
> Michal Hocko
> SUSE Labs
Fabian Deutsch June 15, 2023, 12:14 p.m. UTC | #11
On Thu, Jun 15, 2023 at 12:00 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
>
> > While looking at this in the past weeks, I believe that there are two distinct problems:
> > 1. Direct zram usage by process within a cg ie. a process writing to a zram device
> > 2. Indirect zram usage by a process within a cg via swap (described above)
> >
> > Both of them probably require different solutions.
> > In order to fix #1, accounting a zram device should be accounted towards a cgroup. IMHO this is something that should be fixed.
> >
> > Yu Zhao and Yosry are probably much more familiar with the solution to #2.
> > WRT per-cgrou-swapfile, to me this is addressing #2, but I agree with Yu Zhao, that there are probably better solutions to this.
> >
> > Lastly, this patchset, while it will possibly not address the swap issue (#2) completely, is it satisfying the needs of #1?
> >
> > - fabian
>
> Thanks for your reply, fabian. According to the previous design and
> test results, this patchset can meet the need of  direct and
> indirect usage scenarios,charge the compressed memory to memory cgroup.

As said, I cannot speak to the indirect swap case, but if it is
addressing the direct case and putting the memory accounting on the
cgroup, then this would address one of the use cases I have in mind.
Thanks!

- fabian
Michal Hocko June 15, 2023, 12:16 p.m. UTC | #12
On Thu 15-06-23 19:58:37, 贺中坤 wrote:
> Hi michal,  glad to hear from you.
> 
> > I am not really deeply familiar with zram implementation nor usage but
> > how is the above allocation going to be charged without __GFP_ACCOUNT in
> > the gfp mask?
> 
> Yes,zs_malloc() did not charge compressed memory, even if we add this gfp.
> so we need to implement this function in this patchset. But this flag should be
> used to enable this feature.

Let me check I understand. This patch on its own doesn't really do
anything. You need the zs_malloc support implemented in patch 3 for this
to have any effect. Even with that in place the zs_malloc doesn't follow
the __GFP_ACCOUNT scheme we use for allocation tracking. Correct?

> > Also what exactly is going to happen for the swap backed by the zram
> > device? Your memcg might be hitting the hard limit and therefore
> > swapping out. Wouldn't zs_malloc fail very likely under that condition
> > making the swap effectively unusable?
> 
> This is the key point, as i said above, zs_malloc() did not charge
> compressed memory,
> so zs_malloc will not fail under that condition. if the zram swap is
> large enough, zs_malloc
> never fails unless system OOM.   so memory.max will be invalidated.

I do not think this is answering my question. Or maybe I just
misunderstand. Let me try again. Say you have a memcg under hard limit
pressure so any further charge is going to fail. How can you reasonably
implement zram-backed swapout if the memory is charged?
Zhongkun He June 15, 2023, 12:19 p.m. UTC | #13
On Thu, Jun 15, 2023 at 7:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> Yes, but my point is that there are cases where the pages you get are
> not charged. zram_bvec_write_partial() is just one such example that
> highlights the issue.

Sorry, I got it.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand June 15, 2023, 12:56 p.m. UTC | #14
On 15.06.23 14:19, 贺中坤 wrote:
> On Thu, Jun 15, 2023 at 7:19 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> Yes, but my point is that there are cases where the pages you get are
>> not charged. zram_bvec_write_partial() is just one such example that
>> highlights the issue.
> 
> Sorry ,I got it.

I suspect for the swap->zram we should always get charged pages, because 
we're effectively writing out charged anon/shmem pages only -- without 
any buffer in between.

For the fs->zram or direct zram access device case I'm not so sure. It 
highly depends on what gets mapped into the bio (e.g., a kernel buffer, 
zeropage, ...). If it's a pagecache page, that should be charged and 
we're good. Not so sure about fs metadata or some other fs cases (e.g., 
write() to a file that bypasses the pagecache).
Zhongkun He June 15, 2023, 1:09 p.m. UTC | #15
> Let me check I understand. This patch on its own doesn't really do
> anything. You need the zs_malloc support implemented in patch 3 for this
> to have any effect. Even with that in place the zs_malloc doesn't follow
> the __GFP_ACCOUNT scheme we use for allocation tracking. Correct?
>

Yes, I will use it in the next version.

> I do not think this is answering my question. Or maybe I just
> misunderstand. Let me try again. Say you have a memcg under hard limit
> pressure so any further charge is going to fail. How can you reasonably
> implement zram back swapout if the memory is charged?
>

Sorry, let me try to explain again. Say I have a memcg under hard limit
pressure. Any further charge will try to free memory and swap it out to the
zram backend, where the data is compressed and stored in memory, so the
further charge is not going to fail. The charged memory is swapped out to
compressed memory step by step, but the compressed memory is not charged to
the original memcg. So the actual memory usage can already be greater than
the hard limit in some cases. This patchset will charge the compressed
memory to the original memcg, limited by memory.max.
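
To make that concrete with made-up numbers: a memcg with memory.max = 1G
running a 2G workload ends up with ~1G of charged resident pages plus the
compressed copies of the other ~1G sitting uncharged in zsmalloc --
roughly 340M at a 3:1 compression ratio -- so the cgroup's real footprint
is about 1.3G, clearly above the limit.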

> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 15, 2023, 1:27 p.m. UTC | #16
On Thu 15-06-23 21:09:16, 贺中坤 wrote:
> > Let me check I understand. This patch on its own doesn't really do
> > anything. You need the zs_malloc support implemented in patch 3 for this
> > to have any effect. Even with that in place the zs_malloc doesn't follow
> > the __GFP_ACCOUNT scheme we use for allocation tracking. Correct?
> >
> 
> Yes, I will use it on next version.

OK, also make sure that the zsmalloc support is implemented before zram
depends on it.

> > I do not think this is answering my question. Or maybe I just
> > misunderstand. Let me try again. Say you have a memcg under hard limit
> > pressure so any further charge is going to fail. How can you reasonably
> > implement zram back swapout if the memory is charged?
> >
> 
> Sorry, let me try to explain again. I have a memcg under hard limit pressure.
> Any  further charge will  try to free memory and swapout to zram back which
> is compressed and stored data in memory.so any further charge is not going
> to fail. The charged memory is swapout to compressed memory step by
> step, but the compressed memory is not charged to the original memcgroup.
> So, Actual memory usage is already greater than the hard limit in some cases.
> This pachset will charge the compressed memory to the original memcg,
> limited by memory.max

This is not really answering my question though. A memcg under the hard
limit is not really my concern; this is the simpler case. I am not saying
it doesn't need to get addressed, but it is the memcg hard-limited case
that is much more interesting, because your charges are going to fail very
likely and that would mean that swapout would fail AFAIU. If my
understanding is wrong then it would really help to describe that case
in much more detail in the changelog.
Zhongkun He June 15, 2023, 1:40 p.m. UTC | #17
> I suspect for the swap->zram we should always get charged pages, because
> we're effectively writing out charged anon/shmem pages only -- without
> any buffer in between.

Hi David, the charged memory will be released in the swap->zram path. New
pages are allocated by alloc_zspage(), and we do not charge the page directly
but the objects (like slab does), because the zspage can be shared by any memcg.

>
> For the fs->zram or direct zram access device case I'm not so sure. It
> highly depends on what gets mapped into the bio (e.g., a kernel buffer,
> zeropage, ...). If it's a pagecache page, that should be charged and
> we're good. No so sure about fs metadata or some other fs cases (e.g.,
> write() to a file that bypass the pagecache).
>

Yes, the pagecache pages are charged in the fs->zram case, but they will be
released if we drop the cache. The compressed objects are not charged.

> --
> Cheers,
>
> David / dhildenb
>
Zhongkun He June 15, 2023, 2:13 p.m. UTC | #18
>
> OK, also make sure that the zsmalloc support is implemented before zram
> depends on it.
>

OK. I got it.

>
> This is not really answering my question though. memcg under hard limit
> is not really my concern. This is a simpler case. I am not saying it
> doesn't need to get addresses but it is the memcg hard limited case that
> is much more interested. Because your charges are going to fail very
> likely and that would mean that swapout would fail AFAIU. If my
> understanding is wrong then it would really help to describe that case
> much more in the changelog.
>

OK, got it. In the many cases I have tested, it will not fail, because we do
not charge the page directly but the objects (like slab, i.e. the compressed
objects), since the zspage may be shared by any memcg.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 15, 2023, 2:20 p.m. UTC | #19
On Thu 15-06-23 22:13:09, 贺中坤 wrote:
> > This is not really answering my question though. memcg under hard limit
> > is not really my concern. This is a simpler case. I am not saying it
> > doesn't need to get addresses but it is the memcg hard limited case that
> > is much more interested. Because your charges are going to fail very
> > likely and that would mean that swapout would fail AFAIU. If my
> > understanding is wrong then it would really help to describe that case
> > much more in the changelog.
> >
> 
> OK, Got it. In many cases I have tested, it  will not fail because we did
> not charge the page directly,but the objects(like  slab,compressed page),
> for the zspage may be shared by any memcg.

That sounds like a broken design to me.
David Hildenbrand June 15, 2023, 2:46 p.m. UTC | #20
On 15.06.23 15:40, 贺中坤 wrote:
>> I suspect for the swap->zram we should always get charged pages, because
>> we're effectively writing out charged anon/shmem pages only -- without
>> any buffer in between.
> 
> Hi David,the charged memory will be released in swap->zram. New pages
> are allocated by alloc_zspage(), and we did not charge the page directly,but
> the objects(like  slab), because the zspage are shared by any memcg.
> 
>>
>> For the fs->zram or direct zram access device case I'm not so sure. It
>> highly depends on what gets mapped into the bio (e.g., a kernel buffer,
>> zeropage, ...). If it's a pagecache page, that should be charged and
>> we're good. No so sure about fs metadata or some other fs cases (e.g.,
>> write() to a file that bypass the pagecache).
>>
> 
> Yes, the pagecaches are charged in fs->zram, but  will be released if
> we drop the cache. the compressed objects are not charged.

Yes. But just to stress again, one issue I see is that if there is a 
page in the BIO that is not charged, you cannot charge the compressed page.

Assume you have some FS on that zram block device and you want to make 
sure it gets properly charged to whoever is reading/writing a file on 
that filesystem (imagine something like a compressed shmem).

If a user (or the filesystem?) can trigger a BIO that has an uncharged 
page in it, it would not get charged accordingly.

The "easy" reproducer would have been O_DIRECT write() using the shared 
zeropage, but zram_write_page() is smart enough to optimize for that 
case (page_same_filled()). :)
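
A rough userspace sketch of that idea, purely illustrative (and, as
noted, zram detects the same-filled page long before zs_malloc() is
reached):

	/* Illustration only: try to get an uncharged page (the shared
	 * zeropage) into a BIO by O_DIRECT-writing a never-written
	 * anonymous mapping to a zram device. */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 4096;
		/* Never written, so reads map the shared zeropage. */
		void *buf = mmap(NULL, len, PROT_READ,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		int fd = open("/dev/zram0", O_WRONLY | O_DIRECT);

		if (buf == MAP_FAILED || fd < 0) {
			perror("setup");
			return 1;
		}
		if (pwrite(fd, buf, len, 0) != (ssize_t)len)
			perror("pwrite");
		close(fd);
		return 0;
	}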

Maybe I'm over-thinking this (well, we do have partial I/O support, 
so something seems to be able to trigger such cases), and it would be 
great if someone with more FS->BIO experience could comment.

I'll note that this is fundamentally different to zswap, because with 
zswap you don't get arbitrary BIOs, you get an anon or shmem page (that 
should be charged).
Yosry Ahmed June 16, 2023, 1:39 a.m. UTC | #21
On Thu, Jun 15, 2023 at 1:57 AM Fabian Deutsch <fdeutsch@redhat.com> wrote:
>
>
> On Thu, Jun 15, 2023 at 6:59 AM Yu Zhao <yuzhao@google.com> wrote:
>>
>> On Wed, Jun 14, 2023 at 9:48 PM Zhongkun He
>> <hezhongkun.hzk@bytedance.com> wrote:
>> >
>> > The compressed RAM is currently charged to kernel, not to
>> > any memory cgroup, which is not satisfy our usage scenario.
>> > if the memory of a task is limited by memcgroup, it will
>> > swap out the memory to zram swap device when the memory
>> > is insufficient. In that case, the memory limit will have
>> > no effect.
>> >
>> > So, it should makes sense to charge the compressed RAM to
>> > the page's memory cgroup.
>
>
> While looking at this in the past weeks, I believe that there are two distinct problems:
> 1. Direct zram usage by process within a cg ie. a process writing to a zram device
> 2. Indirect zram usage by a process within a cg via swap (described above)
>
> Both of them probably require different solutions.
> In order to fix #1, accounting a zram device should be accounted towards a cgroup. IMHO this is something that should be fixed.
>
> Yu Zhao and Yosry are probably much more familiar with the solution to #2.
> WRT per-cgrou-swapfile, to me this is addressing #2, but I agree with Yu Zhao, that there are probably better solutions to this.

Thanks Fabian for tagging me.

I am not familiar with #1, so I will speak to #2. Zhongkun, there are
a few parts that I do not understand -- hopefully you can help me out
here:

(1) If I understand correctly in this patch we set the active memcg
trying to charge any pages allocated in a zspage to the current memcg,
yet that zspage will contain multiple compressed object slots, not
just the one used by this memcg. Aren't we overcharging the memcg?
Basically the first memcg that happens to allocate the zspage will pay
for all the objects in this zspage, even after it stops using the
zspage completely?

(2) Patch 3 seems to be charging the compressed slots to the memcgs,
yet this patch is trying to charge the entire zspage. Aren't we double
charging the zspage? I am guessing this isn't happening because (as
Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so
this patch may be NOP, and the actual charging is coming from patch 3
only.

(3) Zswap recently implemented per-memcg charging of compressed
objects in a much simpler way. If your main interest is #2 (which is
what I understand from the commit log), it seems like zswap might be
providing this already? Why can't you use zswap? Is it the fact that
zswap requires a backing swapfile?

Thanks!

>
> Lastly, this patchset, while it will possibly not address the swap issue (#2) completely, is it satisfying the needs of #1?
>
> - fabian
>
>>
>> We used to do this a long time ago, but we had per-memcg swapfiles [1[
>> to prevent compressed pages from different memcgs from sharing the
>> same zspage.
>>
>> Does this patchset alone suffer from the same problem, i.e., memcgs
>> sharing zspages?
>>
>> [1] https://lwn.net/Articles/592923/
>>
Zhongkun He June 16, 2023, 3:31 a.m. UTC | #22
On Thu, Jun 15, 2023 at 10:21 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 15-06-23 22:13:09, 贺中坤 wrote:
> > > This is not really answering my question though. memcg under hard limit
> > > is not really my concern. This is a simpler case. I am not saying it
> > > doesn't need to get addresses but it is the memcg hard limited case that
> > > is much more interested. Because your charges are going to fail very
> > > likely and that would mean that swapout would fail AFAIU. If my
> > > understanding is wrong then it would really help to describe that case
> > > much more in the changelog.
> > >
> >
> > OK, Got it. In many cases I have tested, it  will not fail because we did
> > not charge the page directly,but the objects(like  slab,compressed page),
> > for the zspage may be shared by any memcg.
>
> That sounds like a broken design to me.
> --
> Michal Hocko
> SUSE Labs

I will try more cases with different compression ratios and make sure
that swapout will not fail.
Zhongkun He June 16, 2023, 3:44 a.m. UTC | #23
>
> Yes. But just to stress again, one issue I see is that if there is a
> page in the BIO that is not charged, you cannot charge the compressed page.

I got it. Thanks.

>
> Assume you have some FS on that zram block device and  you want to make
> sure it gets properly charged to whoever is reading/writing a file on
> that filesystem. (imagine something like a compress shmem)
>
> If a user (or the filesystem?) can trigger a BIO that has an uncharged
> page in it, it would not get charged accordingly.
>
> The "easy" reproducer would have been O_DIRECT write() using the shared
> zeropage, but zram_write_page() is smart enough to optimize for that
> case (page_same_filled()). :)

Ok, I will try it.

>
> Maybe I'm over-thinking this (well, the we do have partial I/O support,
> so something seems to be able to trigger such cases), and it would be
> great if someone with more FS->BIO experience could comment.
>
> I'll note that this is fundamentally different to zswap, because with
> zswap you don't get arbitrary BIOs, you get an anon or shmem page (that
> should be charged).
>

Hi David, I understand your concern and I will try to find the uncharged case.

> --
> Cheers,
>
> David / dhildenb
>
Zhongkun He June 16, 2023, 4:40 a.m. UTC | #24
> Thanks Fabian for tagging me.
>
> I am not familiar with #1, so I will speak to #2. Zhongkun, There are
> a few parts that I do not understand -- hopefully you can help me out
> here:
>
> (1) If I understand correctly in this patch we set the active memcg
> trying to charge any pages allocated in a zspage to the current memcg,
> yet that zspage will contain multiple compressed object slots, not
> just the one used by this memcg. Aren't we overcharging the memcg?
> Basically the first memcg that happens to allocate the zspage will pay
> for all the objects in this zspage, even after it stops using the
> zspage completely?

It will not overcharge. As you said below, we are not using
__GFP_ACCOUNT here, and we charge the compressed slots to the memcgs.

>
> (2) Patch 3 seems to be charging the compressed slots to the memcgs,
> yet this patch is trying to charge the entire zspage. Aren't we double
> charging the zspage? I am guessing this isn't happening because (as
> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so
> this patch may be NOP, and the actual charging is coming from patch 3
> only.

Yes, the actual charging comes from patch 3. This patch just
delivers the BIO page's memcg to the current task, which is not the
consumer.

>
> (3) Zswap recently implemented per-memcg charging of compressed
> objects in a much simpler way. If your main interest is #2 (which is
> what I understand from the commit log), it seems like zswap might be
> providing this already? Why can't you use zswap? Is it the fact that
> zswap requires a backing swapfile?

Thanks for your reply and review. Yes, zswap requires a backing
swapfile. That I/O path is very complex, and sometimes it will throttle
the whole system if some resources are short, so we hope to use zram.

>
> Thanks!
>
Michal Hocko June 16, 2023, 6:40 a.m. UTC | #25
On Fri 16-06-23 11:31:18, 贺中坤 wrote:
> On Thu, Jun 15, 2023 at 10:21 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 15-06-23 22:13:09, 贺中坤 wrote:
> > > > This is not really answering my question though. memcg under hard limit
> > > > is not really my concern. This is a simpler case. I am not saying it
> > > > doesn't need to get addresses but it is the memcg hard limited case that
> > > > is much more interested. Because your charges are going to fail very
> > > > likely and that would mean that swapout would fail AFAIU. If my
> > > > understanding is wrong then it would really help to describe that case
> > > > much more in the changelog.
> > > >
> > >
> > > OK, Got it. In many cases I have tested, it  will not fail because we did
> > > not charge the page directly,but the objects(like  slab,compressed page),
> > > for the zspage may be shared by any memcg.
> >
> > That sounds like a broken design to me.
> > --
> > Michal Hocko
> > SUSE Labs
> 
> I will try more cases in different compression ratios and make sure
> that swapout will not fail.

I do not think different compression ratios will cut it. You
essentially need some form of memory reserves - in the memcg world, a
pre-charged pool of memory readily available for the swapout. Another
way would be to allow the charge to bypass the limit with a guarantee
that this will be a temporary breach.
Yosry Ahmed June 16, 2023, 7:37 a.m. UTC | #26
On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
>
> > Thanks Fabian for tagging me.
> >
> > I am not familiar with #1, so I will speak to #2. Zhongkun, There are
> > a few parts that I do not understand -- hopefully you can help me out
> > here:
> >
> > (1) If I understand correctly in this patch we set the active memcg
> > trying to charge any pages allocated in a zspage to the current memcg,
> > yet that zspage will contain multiple compressed object slots, not
> > just the one used by this memcg. Aren't we overcharging the memcg?
> > Basically the first memcg that happens to allocate the zspage will pay
> > for all the objects in this zspage, even after it stops using the
> > zspage completely?
>
> It will not overcharge.  As you said below, we are not using
> __GFP_ACCOUNT and charging the compressed slots to the memcgs.
>
> >
> > (2) Patch 3 seems to be charging the compressed slots to the memcgs,
> > yet this patch is trying to charge the entire zspage. Aren't we double
> > charging the zspage? I am guessing this isn't happening because (as
> > Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so
> > this patch may be NOP, and the actual charging is coming from patch 3
> > only.
>
> YES, the actual charging is coming from patch 3. This patch just
> delivers the BIO page's  memcg to the current task which is not the
> consumer.
>
> >
> > (3) Zswap recently implemented per-memcg charging of compressed
> > objects in a much simpler way. If your main interest is #2 (which is
> > what I understand from the commit log), it seems like zswap might be
> > providing this already? Why can't you use zswap? Is it the fact that
> > zswap requires a backing swapfile?
>
> Thanks for your reply and review. Yes, the zswap requires a backing
> swapfile. The I/O path is very complex, sometimes it will throttle the
> whole system if some resources are short , so we hope to use zram.

Is the only problem with zswap for you the requirement of a backing swapfile?

If yes, I am in the early stages of developing a solution to make
zswap work without a backing swapfile. This was discussed in LSF/MM
[1]. Would this make zswap usable for your use case?

[1]https://lore.kernel.org/linux-mm/CAJD7tkYb_sGN8mfGVjr2JxdB8Pz8Td=yj9_sBCMrmsKQo56vTg@mail.gmail.com/

>
> >
> > Thanks!
> >
David Hildenbrand June 16, 2023, 7:57 a.m. UTC | #27
On 16.06.23 09:37, Yosry Ahmed wrote:
> On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
>>
>>> Thanks Fabian for tagging me.
>>>
>>> I am not familiar with #1, so I will speak to #2. Zhongkun, There are
>>> a few parts that I do not understand -- hopefully you can help me out
>>> here:
>>>
>>> (1) If I understand correctly in this patch we set the active memcg
>>> trying to charge any pages allocated in a zspage to the current memcg,
>>> yet that zspage will contain multiple compressed object slots, not
>>> just the one used by this memcg. Aren't we overcharging the memcg?
>>> Basically the first memcg that happens to allocate the zspage will pay
>>> for all the objects in this zspage, even after it stops using the
>>> zspage completely?
>>
>> It will not overcharge.  As you said below, we are not using
>> __GFP_ACCOUNT and charging the compressed slots to the memcgs.
>>
>>>
>>> (2) Patch 3 seems to be charging the compressed slots to the memcgs,
>>> yet this patch is trying to charge the entire zspage. Aren't we double
>>> charging the zspage? I am guessing this isn't happening because (as
>>> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so
>>> this patch may be NOP, and the actual charging is coming from patch 3
>>> only.
>>
>> YES, the actual charging is coming from patch 3. This patch just
>> delivers the BIO page's  memcg to the current task which is not the
>> consumer.
>>
>>>
>>> (3) Zswap recently implemented per-memcg charging of compressed
>>> objects in a much simpler way. If your main interest is #2 (which is
>>> what I understand from the commit log), it seems like zswap might be
>>> providing this already? Why can't you use zswap? Is it the fact that
>>> zswap requires a backing swapfile?
>>
>> Thanks for your reply and review. Yes, the zswap requires a backing
>> swapfile. The I/O path is very complex, sometimes it will throttle the
>> whole system if some resources are short , so we hope to use zram.
> 
> Is the only problem with zswap for you the requirement of a backing swapfile?
> 
> If yes, I am in the early stages of developing a solution to make
> zswap work without a backing swapfile. This was discussed in LSF/MM
> [1]. Would this make zswap usable in for your use case?

Out of curiosity, are there any other known pros/cons when using 
zswap-without-swap instead of zram?

I know that zram requires sizing (size of the virtual block device) and 
consumes metadata, zswap doesn't.
Yosry Ahmed June 16, 2023, 8:04 a.m. UTC | #28
On Fri, Jun 16, 2023 at 12:57 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.06.23 09:37, Yosry Ahmed wrote:
> > On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
> >>
> >>> Thanks Fabian for tagging me.
> >>>
> >>> I am not familiar with #1, so I will speak to #2. Zhongkun, There are
> >>> a few parts that I do not understand -- hopefully you can help me out
> >>> here:
> >>>
> >>> (1) If I understand correctly in this patch we set the active memcg
> >>> trying to charge any pages allocated in a zspage to the current memcg,
> >>> yet that zspage will contain multiple compressed object slots, not
> >>> just the one used by this memcg. Aren't we overcharging the memcg?
> >>> Basically the first memcg that happens to allocate the zspage will pay
> >>> for all the objects in this zspage, even after it stops using the
> >>> zspage completely?
> >>
> >> It will not overcharge.  As you said below, we are not using
> >> __GFP_ACCOUNT and charging the compressed slots to the memcgs.
> >>
> >>>
> >>> (2) Patch 3 seems to be charging the compressed slots to the memcgs,
> >>> yet this patch is trying to charge the entire zspage. Aren't we double
> >>> charging the zspage? I am guessing this isn't happening because (as
> >>> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so
> >>> this patch may be NOP, and the actual charging is coming from patch 3
> >>> only.
> >>
> >> YES, the actual charging is coming from patch 3. This patch just
> >> delivers the BIO page's  memcg to the current task which is not the
> >> consumer.
> >>
> >>>
> >>> (3) Zswap recently implemented per-memcg charging of compressed
> >>> objects in a much simpler way. If your main interest is #2 (which is
> >>> what I understand from the commit log), it seems like zswap might be
> >>> providing this already? Why can't you use zswap? Is it the fact that
> >>> zswap requires a backing swapfile?
> >>
> >> Thanks for your reply and review. Yes, the zswap requires a backing
> >> swapfile. The I/O path is very complex, sometimes it will throttle the
> >> whole system if some resources are short , so we hope to use zram.
> >
> > Is the only problem with zswap for you the requirement of a backing swapfile?
> >
> > If yes, I am in the early stages of developing a solution to make
> > zswap work without a backing swapfile. This was discussed in LSF/MM
> > [1]. Would this make zswap usable in for your use case?
>
> Out of curiosity, are there any other known pros/cons when using
> zswap-without-swap instead of zram?
>
> I know that zram requires sizing (size of the virtual block device) and
> consumes metadata, zswap doesn't.

We don't use zram in our data centers so I am not an expert about
zram, but off the top of my head there are a few more advantages to
zswap:
(1) Better memcg support (which this series is attempting to address
in zram, although in a much more complicated way).

(2) We internally have incompressible memory handling on top of zswap,
which is something that we would like to upstream when
zswap-without-swap is supported. Basically if a page does not compress
well enough to save memory we reject it from zswap and make it
unevictable (if there is no backing swapfile). The existence of zswap
in the MM layer helps with this. Since zram is a block device from the
MM perspective, it's more difficult to do something like this.
Incompressible pages just sit in zram AFAICT.

(3) Writeback support. If you're running out of memory to store
compressed pages you can add a swapfile at runtime and zswap will
start writing to it, freeing up space to compress more pages. This
wouldn't be possible in the same way in zram. Zram supports writing to
a backing device, but in a more manual way (userspace has to write to
an interface to tell zram to write some pages).

The disadvantage is that zswap doesn't currently support being used
without a swapfile, but once this support exists, I am not sure what
other disadvantages zswap would have compared to zram.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand June 16, 2023, 8:37 a.m. UTC | #29
On 16.06.23 10:04, Yosry Ahmed wrote:
> On Fri, Jun 16, 2023 at 12:57 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 16.06.23 09:37, Yosry Ahmed wrote:
>>> On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
>>>>
>>>>> Thanks Fabian for tagging me.
>>>>>
>>>>> I am not familiar with #1, so I will speak to #2. Zhongkun, There are
>>>>> a few parts that I do not understand -- hopefully you can help me out
>>>>> here:
>>>>>
>>>>> (1) If I understand correctly in this patch we set the active memcg
>>>>> trying to charge any pages allocated in a zspage to the current memcg,
>>>>> yet that zspage will contain multiple compressed object slots, not
>>>>> just the one used by this memcg. Aren't we overcharging the memcg?
>>>>> Basically the first memcg that happens to allocate the zspage will pay
>>>>> for all the objects in this zspage, even after it stops using the
>>>>> zspage completely?
>>>>
>>>> It will not overcharge.  As you said below, we are not using
>>>> __GFP_ACCOUNT and charging the compressed slots to the memcgs.
>>>>
>>>>>
>>>>> (2) Patch 3 seems to be charging the compressed slots to the memcgs,
>>>>> yet this patch is trying to charge the entire zspage. Aren't we double
>>>>> charging the zspage? I am guessing this isn't happening because (as
>>>>> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so
>>>>> this patch may be NOP, and the actual charging is coming from patch 3
>>>>> only.
>>>>
>>>> YES, the actual charging is coming from patch 3. This patch just
>>>> delivers the BIO page's  memcg to the current task which is not the
>>>> consumer.
>>>>
>>>>>
>>>>> (3) Zswap recently implemented per-memcg charging of compressed
>>>>> objects in a much simpler way. If your main interest is #2 (which is
>>>>> what I understand from the commit log), it seems like zswap might be
>>>>> providing this already? Why can't you use zswap? Is it the fact that
>>>>> zswap requires a backing swapfile?
>>>>
>>>> Thanks for your reply and review. Yes, the zswap requires a backing
>>>> swapfile. The I/O path is very complex, sometimes it will throttle the
>>>> whole system if some resources are short , so we hope to use zram.
>>>
>>> Is the only problem with zswap for you the requirement of a backing swapfile?
>>>
>>> If yes, I am in the early stages of developing a solution to make
>>> zswap work without a backing swapfile. This was discussed in LSF/MM
>>> [1]. Would this make zswap usable in for your use case?
>>
>> Out of curiosity, are there any other known pros/cons when using
>> zswap-without-swap instead of zram?
>>
>> I know that zram requires sizing (size of the virtual block device) and
>> consumes metadata, zswap doesn't.
> 
> We don't use zram in our data centers so I am not an expert about
> zram, but off the top of my head there are a few more advantages to
> zswap:

Thanks!

> (1) Better memcg support (which this series is attempting to address
> in zram, although in a much more complicated way).

Right. I think this patch also misses applying the charging in the recompress
case (only triggered by user space IIUC).

> 
> (2) We internally have incompressible memory handling on top of zswap,
> which is something that we would like to upstream when
> zswap-without-swap is supported. Basically if a page does not compress
> well enough to save memory we reject it from zswap and make it
> unevictable (if there is no backing swapfile). The existence of zswap
> in the MM layer helps with this. Since zram is a block device from the
> MM perspective, it's more difficult to do something like this.
> Incompressible pages just sit in zram AFAICT.

I see. With ZRAM_HUGE we still have to store the uncompressed page
(because it's a block device and has to hold that data).

> 
> (3) Writeback support. If you're running out of memory to store
> compressed pages you can add a swapfile in runtime and zswap will
> start writing to it freeing up space to compress more pages. This
> wouldn't be possible in the same way in zram. Zram supports writing to
> a backing device but in a more manual way (userspace has to write to
> an interface to tell zram to write some pages).

Right, that zram backing device stuff is really sub-optimal and only useful
in corner cases (most probably not datacenters).

What one can do with zram is to add a second swap device with lower priority.
Looking at my Fedora machine:

  $ cat /proc/swaps
Filename                                Type            Size            Used            Priority
/dev/dm-2                               partition       16588796        0               -2
/dev/zram0                              partition       8388604         0               100


Guess the difference here is that you won't be writing out the compressed
data to the disk, but anything that gets swapped out afterwards will
end up on the disk. I can see how the zswap behavior might be better in that case
(instead of swapping out some additional pages you relocate the
already-swapped-out-to-zswap pages to the disk).
Yosry Ahmed June 16, 2023, 8:39 a.m. UTC | #30
On Fri, Jun 16, 2023 at 1:37 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.06.23 10:04, Yosry Ahmed wrote:
> > On Fri, Jun 16, 2023 at 12:57 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 16.06.23 09:37, Yosry Ahmed wrote:
> >>> On Thu, Jun 15, 2023 at 9:41 PM 贺中坤 <hezhongkun.hzk@bytedance.com> wrote:
> >>>>
> >>>>> Thanks Fabian for tagging me.
> >>>>>
> >>>>> I am not familiar with #1, so I will speak to #2. Zhongkun, There are
> >>>>> a few parts that I do not understand -- hopefully you can help me out
> >>>>> here:
> >>>>>
> >>>>> (1) If I understand correctly in this patch we set the active memcg
> >>>>> trying to charge any pages allocated in a zspage to the current memcg,
> >>>>> yet that zspage will contain multiple compressed object slots, not
> >>>>> just the one used by this memcg. Aren't we overcharging the memcg?
> >>>>> Basically the first memcg that happens to allocate the zspage will pay
> >>>>> for all the objects in this zspage, even after it stops using the
> >>>>> zspage completely?
> >>>>
> >>>> It will not overcharge.  As you said below, we are not using
> >>>> __GFP_ACCOUNT and charging the compressed slots to the memcgs.
> >>>>
> >>>>>
> >>>>> (2) Patch 3 seems to be charging the compressed slots to the memcgs,
> >>>>> yet this patch is trying to charge the entire zspage. Aren't we double
> >>>>> charging the zspage? I am guessing this isn't happening because (as
> >>>>> Michal pointed out) we are not using __GFP_ACCOUNT here anyway, so
> >>>>> this patch may be NOP, and the actual charging is coming from patch 3
> >>>>> only.
> >>>>
> >>>> YES, the actual charging is coming from patch 3. This patch just
> >>>> delivers the BIO page's  memcg to the current task which is not the
> >>>> consumer.
> >>>>
> >>>>>
> >>>>> (3) Zswap recently implemented per-memcg charging of compressed
> >>>>> objects in a much simpler way. If your main interest is #2 (which is
> >>>>> what I understand from the commit log), it seems like zswap might be
> >>>>> providing this already? Why can't you use zswap? Is it the fact that
> >>>>> zswap requires a backing swapfile?
> >>>>
> >>>> Thanks for your reply and review. Yes, the zswap requires a backing
> >>>> swapfile. The I/O path is very complex, sometimes it will throttle the
> >>>> whole system if some resources are short , so we hope to use zram.
> >>>
> >>> Is the only problem with zswap for you the requirement of a backing swapfile?
> >>>
> >>> If yes, I am in the early stages of developing a solution to make
> >>> zswap work without a backing swapfile. This was discussed in LSF/MM
> >>> [1]. Would this make zswap usable in for your use case?
> >>
> >> Out of curiosity, are there any other known pros/cons when using
> >> zswap-without-swap instead of zram?
> >>
> >> I know that zram requires sizing (size of the virtual block device) and
> >> consumes metadata, zswap doesn't.
> >
> > We don't use zram in our data centers so I am not an expert about
> > zram, but off the top of my head there are a few more advantages to
> > zswap:
>
> Thanks!
>
> > (1) Better memcg support (which this series is attempting to address
> > in zram, although in a much more complicated way).
>
> Right. I think this patch also misses to update apply the charging in the recompress
> case. (only triggered by user space IIUC)
>
> >
> > (2) We internally have incompressible memory handling on top of zswap,
> > which is something that we would like to upstream when
> > zswap-without-swap is supported. Basically if a page does not compress
> > well enough to save memory we reject it from zswap and make it
> > unevictable (if there is no backing swapfile). The existence of zswap
> > in the MM layer helps with this. Since zram is a block device from the
> > MM perspective, it's more difficult to do something like this.
> > Incompressible pages just sit in zram AFAICT.
>
> I see. With ZRAM_HUGE we still have to store the uncompressed page
> (because, it's a block device and has to hold that data).

Right.

>
> >
> > (3) Writeback support. If you're running out of memory to store
> > compressed pages you can add a swapfile in runtime and zswap will
> > start writing to it freeing up space to compress more pages. This
> > wouldn't be possible in the same way in zram. Zram supports writing to
> > a backing device but in a more manual way (userspace has to write to
> > an interface to tell zram to write some pages).
>
> Right, that zram backing device stuff is really sub-optimal and only useful
> in corner cases (most probably not datacenters).
>
> What one can do with zram is to add a second swap device with lower priority.
> Looking at my Fedora machine:
>
>   $ cat /proc/swaps
> Filename                                Type            Size            Used            Priority
> /dev/dm-2                               partition       16588796        0               -2
> /dev/zram0                              partition       8388604         0               100
>
>
> Guess the difference here is that you won't be writing out the compressed
> data to the disk, but anything the gets swapped out afterwards will
> end up on the disk. I can see how the zswap behavior might be better in that case
> (instead of swapping out some additional pages you relocate the
> already-swapped-out-to-zswap pages to the disk).

Yeah I am hoping we can enable the use of zswap without a backing
swapfile, and I keep seeing use cases that would benefit from that.

>
> --
> Cheers,
>
> David / dhildenb
>

Patch

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index f6d90f1ba5cf..03b508447473 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -33,6 +33,7 @@ 
 #include <linux/debugfs.h>
 #include <linux/cpuhotplug.h>
 #include <linux/part_stat.h>
+#include <linux/memcontrol.h>
 
 #include "zram_drv.h"
 
@@ -1419,6 +1420,10 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	struct zcomp_strm *zstrm;
 	unsigned long element = 0;
 	enum zram_pageflags flags = 0;
+	struct mem_cgroup *memcg, *old_memcg;
+
+	memcg = page_memcg(page);
+	old_memcg = set_active_memcg(memcg);
 
 	mem = kmap_atomic(page);
 	if (page_same_filled(mem, &element)) {
@@ -1426,7 +1431,7 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		/* Free memory associated with this sector now. */
 		flags = ZRAM_SAME;
 		atomic64_inc(&zram->stats.same_pages);
-		goto out;
+		goto out_free;
 	}
 	kunmap_atomic(mem);
 
@@ -1440,7 +1445,7 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
 		pr_err("Compression failed! err=%d\n", ret);
 		zs_free(zram->mem_pool, handle);
-		return ret;
+		goto out;
 	}
 
 	if (comp_len >= huge_class_size)
@@ -1470,8 +1475,10 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 		handle = zs_malloc(zram->mem_pool, comp_len,
 				GFP_NOIO | __GFP_HIGHMEM |
 				__GFP_MOVABLE);
-		if (IS_ERR_VALUE(handle))
-			return PTR_ERR((void *)handle);
+		if (IS_ERR_VALUE(handle)) {
+			ret = PTR_ERR((void *)handle);
+			goto out;
+		}
 
 		if (comp_len != PAGE_SIZE)
 			goto compress_again;
@@ -1491,7 +1498,8 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
 		zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
 		zs_free(zram->mem_pool, handle);
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	dst = zs_map_object(zram->mem_pool, handle, ZS_MM_WO);
@@ -1506,7 +1514,7 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 	zcomp_stream_put(zram->comps[ZRAM_PRIMARY_COMP]);
 	zs_unmap_object(zram->mem_pool, handle);
 	atomic64_add(comp_len, &zram->stats.compr_data_size);
-out:
+out_free:
 	/*
 	 * Free memory associated with this sector
 	 * before overwriting unused sectors.
@@ -1531,6 +1539,8 @@  static int zram_write_page(struct zram *zram, struct page *page, u32 index)
 
 	/* Update stats */
 	atomic64_inc(&zram->stats.pages_stored);
+out:
+	set_active_memcg(old_memcg);
 	return ret;
 }