
zram corruption due to uninitialized do_swap_page fault

Message ID CABWYdi2a=Tc3dRfQ+037PG0GHKvZd5SEXJxBBbNspsrHK1zNpQ@mail.gmail.com (mailing list archive)
State New
Series zram corruption due to uninitialized do_swap_page fault

Commit Message

Ivan Babrou March 11, 2022, 7:51 p.m. UTC
Hello,

We're looking into using zram, but unfortunately we ran into some
corruption issues. We've seen rocksdb complaining about "Corruption:
bad entry in block", and we've also seen some coredumps that point at
memory being zeroed out. One of our Rust processes' coredumps contains
a non-null pointer pointing at zero, among other things:

* core::ptr::non_null::NonNull<u8> {pointer: 0x0}

In fact, a whole bunch of memory around this pointer was all zeros.

Disabling zram resolves all issues, and we can't reproduce any of
these issues with other swap setups. I've tried adding crc32
checksumming for pages that are compressed, but it didn't catch the
issue either, even though userspace-facing symptoms were present. My
crc32 code doesn't touch ZRAM_SAME pages, though.

Unfortunately, this isn't trivial to replicate, and I believe it
depends on zram being used for swap specifically, not on zram as a
plain block device. In particular, swap_slot_free_notify looks
suspicious.

Here's a patch that I have to catch the issue in the act:

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 438ce34ee760..fea46a70a3c9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1265,6 +1265,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
 		unsigned long value;
 		void *mem;
 
+		if (WARN_ON(!handle && !zram_test_flag(zram, index, ZRAM_SAME)))
+			pr_warn("Page %u read from zram without previous write\n", index);
+
 		value = handle ? zram_get_element(zram, index) : 0;
 		mem = kmap_atomic(page);
 		zram_fill_page(mem, PAGE_SIZE, value);

In essence, it warns whenever a page is read from zram that was not
previously written to. To make this work, one needs to zero out zram
prior to running mkswap on it.

I have prepared a GitHub repo with my observations and a reproduction:

* https://github.com/bobrik/zram-corruptor

I'm able to trigger the following in an aarch64 VM with two threads
reading the same memory out of swap:

[ 512.651752][ T7285] ------------[ cut here ]------------
[ 512.652279][ T7285] WARNING: CPU: 0 PID: 7285 at
drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram]
[ 512.653923][ T7285] Modules linked in: zram zsmalloc kheaders nfsv3
nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
nft_counter xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc
overlay xfs libcrc32c zstd zstd_compress af_packet aes_ce_blk
aes_ce_cipher ghash_ce gf128mul virtio_net sha3_ce net_failover
sha3_generic failover sha512_ce sha512_arm64 sha2_ce sha256_arm64
virtio_mmio virtio_ring qemu_fw_cfg rtc_pl031 virtio fuse ip_tables
x_tables ext4 mbcache crc16 jbd2 nvme nvme_core pci_host_generic
pci_host_common unix [last unloaded: zsmalloc]
[ 512.659238][ T7285] CPU: 0 PID: 7285 Comm: zram-corruptor Tainted: G
W 5.16.0-ivan #1 0877d306c6dc0716835d43cafe4399473d09e406
[ 512.660413][ T7285] Hardware name: linux,dummy-virt (DT)
[ 512.661077][ T7285] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT
-SSBS BTYPE=--)
[ 512.661788][ T7285] pc : __zram_bvec_read+0x28c/0x2e8 [zram]
[ 512.662099][ T7285] lr : zram_bvec_rw+0x70/0x204 [zram]
[ 512.662422][ T7285] sp : ffffffc01018bac0
[ 512.662720][ T7285] x29: ffffffc01018bae0 x28: ffffff9e4e725280 x27:
ffffff9e4e725280
[ 512.663122][ T7285] x26: ffffff9e4e725280 x25: 00000000000001f6 x24:
0000000100033e6c
[ 512.663601][ T7285] x23: 00000000000001f6 x22: 0000000000000000 x21:
fffffffe7a36d840
[ 512.664252][ T7285] x20: 00000000000001f6 x19: ffffff9e69423c00 x18:
ffffffc010711068
[ 512.664812][ T7285] x17: 0000000000000008 x16: ffffffd34aed51bc x15:
0000000000000000
[ 512.665507][ T7285] x14: 0000000000000a88 x13: 0000000000000000 x12:
0000000000000000
[ 512.666183][ T7285] x11: 0000000100033e6c x10: ffffffc01091d000 x9 :
0000000001000000
[ 512.666627][ T7285] x8 : 0000000000002f10 x7 : 80b75f8fb90b52c4 x6 :
051609fe50833de3
[ 512.667276][ T7285] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
0000000000000000
[ 512.667875][ T7285] x2 : 00000000000001f6 x1 : 00000000000001f6 x0 :
ffffffd305b746af
[ 512.668483][ T7285] Call trace:
[ 512.668682][ T7285] __zram_bvec_read+0x28c/0x2e8 [zram
745969ed35ea0fb382bfd518d6f70e13966e9b52]
[ 512.669405][ T7285] zram_bvec_rw+0x70/0x204 [zram
745969ed35ea0fb382bfd518d6f70e13966e9b52]
[ 512.670066][ T7285] zram_rw_page+0xb4/0x16c [zram
745969ed35ea0fb382bfd518d6f70e13966e9b52]
[ 512.670584][ T7285] bdev_read_page+0x74/0xac
[ 512.670843][ T7285] swap_readpage+0x5c/0x2e4
[ 512.671243][ T7285] do_swap_page+0x2f4/0x988
[ 512.671560][ T7285] handle_pte_fault+0xcc/0x1fc
[ 512.671935][ T7285] handle_mm_fault+0x284/0x4a8
[ 512.672412][ T7285] do_page_fault+0x274/0x428
[ 512.672704][ T7285] do_translation_fault+0x5c/0xf8
[ 512.673083][ T7285] do_mem_abort+0x50/0xc8
[ 512.673293][ T7285] el0_da+0x3c/0x74
[ 512.673549][ T7285] el0t_64_sync_handler+0xc4/0xec
[ 512.673972][ T7285] el0t_64_sync+0x1a4/0x1a8
[ 512.674495][ T7285] ---[ end trace cf983b7507c20343 ]---
[ 512.675359][ T7285] zram: Page 502 read from zram without previous write

I can also trace accesses to zram to catch the unfortunate sequence:

zram_bvec_write index = 502 [cpu = 3, tid = 7286]
zram_free_page index = 502 [cpu = 3, tid = 7286]
zram_bvec_read index = 502 [cpu = 3, tid = 7286]
zram_free_page index = 502 [cpu = 3, tid = 7286] <-- problematic free
zram_bvec_read index = 502 [cpu = 0, tid = 7285] <-- problematic read

With stacks for zram_free_page:

zram_bvec_write index = 502 [cpu = 3, tid = 7286]
zram_free_page  index = 502 [cpu = 3, tid = 7286]

        zram_free_page+0
        $x.97+32
        zram_rw_page+180
        bdev_write_page+124
        __swap_writepage+116
        swap_writepage+160
        pageout+284
        shrink_page_list+2892
        shrink_inactive_list+688
        shrink_lruvec+360
        shrink_node_memcgs+148
        shrink_node+860
        shrink_zones+368
        do_try_to_free_pages+232
        try_to_free_mem_cgroup_pages+292
        try_charge_memcg+608

zram_bvec_read  index = 502 [cpu = 3, tid = 7286]
zram_free_page  index = 502 [cpu = 3, tid = 7286] <-- problematic free

        zram_free_page+0
        swap_range_free+220
        swap_entry_free+244
        swapcache_free_entries+152
        free_swap_slot+288
        __swap_entry_free+216
        swap_free+108
        do_swap_page+1776
        handle_pte_fault+204
        handle_mm_fault+644
        do_page_fault+628
        do_translation_fault+92
        do_mem_abort+80
        el0_da+60
        el0t_64_sync_handler+196
        el0t_64_sync+420

zram_bvec_read  index = 502 [cpu = 0, tid = 7285] <-- problematic read

The very last read is the same one that triggered the warning from my
patch in dmesg. You can see that the slot is freed by
swapcache_free_entries before that read happens. As far as I can see,
only zram implements swap_slot_free_notify. Swapping in an
uninitialized zram page results in all zeroes being copied, which
matches the symptoms.

The issue doesn't reproduce if I pin both threads to the same CPU. It
also doesn't reproduce with a single thread. All of this seems to
point at some sort of race condition.

I was able to reproduce this on an x86_64 bare metal server as well.

I'm happy to try out mitigation approaches for this. If my
understanding here is incorrect, I'm also happy to try out patches
that could help me catch the issue in the wild.
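
To give an idea of the shape of the reproduction without digging
through the repo, here is a minimal sketch (this is not the actual
zram-corruptor code; the real reproducer relies on memcg pressure, as
the try_charge_memcg stack above shows, while this sketch assumes
MADV_PAGEOUT is enough to push the buffer out to zram-backed swap):

/*
 * Minimal sketch of the reproduction shape: fill a buffer with a
 * non-zero pattern, push it out to swap, then fault it back in from
 * two threads (which share one mm) at the same time.
 * Build with: cc -O2 -pthread repro-sketch.c
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

#define LEN (64UL << 20)

static unsigned char *buf;

static void *reader(void *arg)
{
	volatile unsigned long sum = 0;

	/* Fault every page back in; concurrent readers race on swap-in. */
	for (unsigned long i = 0; i < LEN; i += 4096)
		sum += buf[i];
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 0xaa, LEN);          /* non-zero pattern */
	madvise(buf, LEN, MADV_PAGEOUT); /* ask for the range to be swapped out */

	pthread_create(&t1, NULL, reader, NULL);
	pthread_create(&t2, NULL, reader, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* On an affected kernel some pages can come back zero-filled. */
	for (unsigned long i = 0; i < LEN; i += 4096)
		if (buf[i] != 0xaa)
			printf("zeroed page at offset %lu\n", i);
	return 0;
}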

Comments

Yu Zhao March 11, 2022, 10:30 p.m. UTC | #1
On Fri, Mar 11, 2022 at 12:52 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> Hello,
>
> We're looking into using zram, but unfortunately we ran into some
> corruption issues.

Kernel version(s) please?
Ivan Babrou March 11, 2022, 10:36 p.m. UTC | #2
On Fri, Mar 11, 2022 at 2:30 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:52 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> >
> > Hello,
> >
> > We're looking into using zram, but unfortunately we ran into some
> > corruption issues.
>
> Kernel version(s) please?

You can see 5.16 in the dmesg output. I've also replicated this on
5.17-rc7 and 5.15.26. The latter is the LTS version we use in
production, where the coredumps and rocksdb corruptions were observed.
Minchan Kim March 14, 2022, 10:40 p.m. UTC | #3
On Fri, Mar 11, 2022 at 11:51:47AM -0800, Ivan Babrou wrote:
> Hello,
> 
> We're looking into using zram, but unfortunately we ran into some
> corruption issues. We've seen rocksdb complaining about "Corruption:
> bad entry in block", and we've also seen some coredumps that point at
> memory being zeroed out. One of our Rust processes coredumps contains
> a non-null pointer pointing at zero, among other things:
> 
> * core::ptr::non_null::NonNull<u8> {pointer: 0x0}
> 
> In fact, a whole bunch of memory around this pointer was all zeros.
> 
> Disabling zram resolves all issues, and we can't reproduce any of
> these issues with other swap setups. I've tried adding crc32
> checksumming for pages that are compressed, but it didn't catch the
> issue either, even though userspace facing symptoms were present. My
> crc32 code doesn't touch ZRAM_SAME pages, though.
> 
> Unfortunately, this isn't trivial to replicate, and I believe that it
> depends on zram used for swap specifically, not for zram as a block
> device. Specifically, swap_slot_free_notify looks suspicious.
> 
> Here's a patch that I have to catch the issue in the act:
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 438ce34ee760..fea46a70a3c9 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1265,6 +1265,9 @@ static int __zram_bvec_read(struct zram *zram,
> struct page *page, u32 index,
>   unsigned long value;
>   void *mem;
> 
> + if (WARN_ON(!handle && !zram_test_flag(zram, index, ZRAM_SAME)))
> + pr_warn("Page %u read from zram without previous write\n", index);
> +
>   value = handle ? zram_get_element(zram, index) : 0;
>   mem = kmap_atomic(page);
>   zram_fill_page(mem, PAGE_SIZE, value);
> 
> In essence, it warns whenever a page is read from zram that was not
> previously written to. To make this work, one needs to zero out zram
> prior to running mkswap on it.
> 
> I have prepared a GitHub repo with my observations and a reproduction:
> 
> * https://github.com/bobrik/zram-corruptor
> 
> I'm able to trigger the following in an aarch64 VM with two threads
> reading the same memory out of swap:
> 
> [ 512.651752][ T7285] ------------[ cut here ]------------
> [ 512.652279][ T7285] WARNING: CPU: 0 PID: 7285 at
> drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram]
> [ 512.653923][ T7285] Modules linked in: zram zsmalloc kheaders nfsv3
> nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
> nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> nft_counter xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc
> overlay xfs libcrc32c zstd zstd_compress af_packet aes_ce_blk
> aes_ce_cipher ghash_ce gf128mul virtio_net sha3_ce net_failover
> sha3_generic failover sha512_ce sha512_arm64 sha2_ce sha256_arm64
> virtio_mmio virtio_ring qemu_fw_cfg rtc_pl031 virtio fuse ip_tables
> x_tables ext4 mbcache crc16 jbd2 nvme nvme_core pci_host_generic
> pci_host_common unix [last unloaded: zsmalloc]
> [ 512.659238][ T7285] CPU: 0 PID: 7285 Comm: zram-corruptor Tainted: G
> W 5.16.0-ivan #1 0877d306c6dc0716835d43cafe4399473d09e406
> [ 512.660413][ T7285] Hardware name: linux,dummy-virt (DT)
> [ 512.661077][ T7285] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT
> -SSBS BTYPE=--)
> [ 512.661788][ T7285] pc : __zram_bvec_read+0x28c/0x2e8 [zram]
> [ 512.662099][ T7285] lr : zram_bvec_rw+0x70/0x204 [zram]
> [ 512.662422][ T7285] sp : ffffffc01018bac0
> [ 512.662720][ T7285] x29: ffffffc01018bae0 x28: ffffff9e4e725280 x27:
> ffffff9e4e725280
> [ 512.663122][ T7285] x26: ffffff9e4e725280 x25: 00000000000001f6 x24:
> 0000000100033e6c
> [ 512.663601][ T7285] x23: 00000000000001f6 x22: 0000000000000000 x21:
> fffffffe7a36d840
> [ 512.664252][ T7285] x20: 00000000000001f6 x19: ffffff9e69423c00 x18:
> ffffffc010711068
> [ 512.664812][ T7285] x17: 0000000000000008 x16: ffffffd34aed51bc x15:
> 0000000000000000
> [ 512.665507][ T7285] x14: 0000000000000a88 x13: 0000000000000000 x12:
> 0000000000000000
> [ 512.666183][ T7285] x11: 0000000100033e6c x10: ffffffc01091d000 x9 :
> 0000000001000000
> [ 512.666627][ T7285] x8 : 0000000000002f10 x7 : 80b75f8fb90b52c4 x6 :
> 051609fe50833de3
> [ 512.667276][ T7285] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
> 0000000000000000
> [ 512.667875][ T7285] x2 : 00000000000001f6 x1 : 00000000000001f6 x0 :
> ffffffd305b746af
> [ 512.668483][ T7285] Call trace:
> [ 512.668682][ T7285] __zram_bvec_read+0x28c/0x2e8 [zram
> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> [ 512.669405][ T7285] zram_bvec_rw+0x70/0x204 [zram
> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> [ 512.670066][ T7285] zram_rw_page+0xb4/0x16c [zram
> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> [ 512.670584][ T7285] bdev_read_page+0x74/0xac
> [ 512.670843][ T7285] swap_readpage+0x5c/0x2e4
> [ 512.671243][ T7285] do_swap_page+0x2f4/0x988
> [ 512.671560][ T7285] handle_pte_fault+0xcc/0x1fc
> [ 512.671935][ T7285] handle_mm_fault+0x284/0x4a8
> [ 512.672412][ T7285] do_page_fault+0x274/0x428
> [ 512.672704][ T7285] do_translation_fault+0x5c/0xf8
> [ 512.673083][ T7285] do_mem_abort+0x50/0xc8
> [ 512.673293][ T7285] el0_da+0x3c/0x74
> [ 512.673549][ T7285] el0t_64_sync_handler+0xc4/0xec
> [ 512.673972][ T7285] el0t_64_sync+0x1a4/0x1a8
> [ 512.674495][ T7285] ---[ end trace cf983b7507c20343 ]---
> [ 512.675359][ T7285] zram: Page 502 read from zram without previous write
> 
> I can also trace accesses to zram to catch the unfortunate sequence:
> 
> zram_bvec_write index = 502 [cpu = 3, tid = 7286]
> zram_free_page index = 502 [cpu = 3, tid = 7286]
> zram_bvec_read index = 502 [cpu = 3, tid = 7286]
> zram_free_page index = 502 [cpu = 3, tid = 7286] <-- problematic free
> zram_bvec_read index = 502 [cpu = 0, tid = 7285] <-- problematic read
> 
> With stacks for zram_free_page:
> 
> zram_bvec_write index = 502 [cpu = 3, tid = 7286]
> zram_free_page  index = 502 [cpu = 3, tid = 7286]
> 
>         zram_free_page+0
>         $x.97+32
>         zram_rw_page+180
>         bdev_write_page+124
>         __swap_writepage+116
>         swap_writepage+160
>         pageout+284
>         shrink_page_list+2892
>         shrink_inactive_list+688
>         shrink_lruvec+360
>         shrink_node_memcgs+148
>         shrink_node+860
>         shrink_zones+368
>         do_try_to_free_pages+232
>         try_to_free_mem_cgroup_pages+292
>         try_charge_memcg+608
> 
> zram_bvec_read  index = 502 [cpu = 3, tid = 7286]
> zram_free_page  index = 502 [cpu = 3, tid = 7286] <-- problematic free
> 
>         zram_free_page+0
>         swap_range_free+220
>         swap_entry_free+244
>         swapcache_free_entries+152
>         free_swap_slot+288
>         __swap_entry_free+216
>         swap_free+108
>         do_swap_page+1776
>         handle_pte_fault+204
>         handle_mm_fault+644
>         do_page_fault+628
>         do_translation_fault+92
>         do_mem_abort+80
>         el0_da+60
>         el0t_64_sync_handler+196
>         el0t_64_sync+420
> 
> zram_bvec_read  index = 502 [cpu = 0, tid = 7285] <-- problematic read
> 
> The very last read is the same one that triggered the warning from my
> patch in dmesg. You can see that the slot is freed before reading by
> swapcache_free_entries. As far as I can see, only zram implements
> swap_slot_free_notify. Swapping in an uninitialized zram page results
> in all zeroes copied, which matches the symptoms.
> 
> The issue doesn't reproduce if I pin both threads to the same CPU. It
> also doesn't reproduce with a single thread. All of this seems to
> point at some sort of race condition.
> 
> I was able to reproduce this on x86_64 bare metal server as well.
> 
> I'm happy to try out mitigation approaches for this. If my
> understanding here is incorrect, I'm also happy to try out patches
> that could help me catch the issue in the wild.

Thanks for the report and detailed analysis!

Could you reproduce the problem with this workaround?

diff --git a/mm/page_io.c b/mm/page_io.c
index 0bf8e40f4e57..f2438a5101a7 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -114,7 +114,7 @@ static void end_swap_bio_read(struct bio *bio)
        }

        SetPageUptodate(page);
-       swap_slot_free_notify(page);
+       // swap_slot_free_notify(page);
 out:
        unlock_page(page);
        WRITE_ONCE(bio->bi_private, NULL);
Ivan Babrou March 15, 2022, 12:24 a.m. UTC | #4
On Mon, Mar 14, 2022 at 3:40 PM Minchan Kim <minchan@kernel.org> wrote:
> Could you reproduce the problem with this workaround?
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 0bf8e40f4e57..f2438a5101a7 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -114,7 +114,7 @@ static void end_swap_bio_read(struct bio *bio)
>         }
>
>         SetPageUptodate(page);
> -       swap_slot_free_notify(page);
> +       // swap_slot_free_notify(page);
>  out:
>         unlock_page(page);
>         WRITE_ONCE(bio->bi_private, NULL);
>

Yes:

zram_bvec_write index = 346 [cpu = 5, tid = 7003, nsecs = 956579989]
zram_free_page  index = 346 [cpu = 5, tid = 7003, nsecs = 956620031]
zram_bvec_read  index = 346 [cpu = 2, tid = 7005, nsecs = 967421676]
zram_free_page  index = 346 [cpu = 2, tid = 7005, nsecs = 967502177]

        zram_free_page+0
        swap_range_free+220
        swap_entry_free+244
        swapcache_free_entries+152
        free_swap_slot+288
        __swap_entry_free+216
        swap_free+108
        do_swap_page+1624
        handle_pte_fault+204
        handle_mm_fault+644
        do_page_fault+628
        do_translation_fault+92
        do_mem_abort+80
        el0_da+60
        el0t_64_sync_handler+196
        el0t_64_sync+420

zram_bvec_read  index = 346 [cpu = 6, tid = 7004, nsecs = 974478898]

[ 1298.139588][ T7004] ------------[ cut here ]------------
[ 1298.140574][ T7004] WARNING: CPU: 6 PID: 7004 at
drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram]
[ 1298.142199][ T7004] Modules linked in: zram zsmalloc kheaders nfsv3
nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc overlay xfs
libcrc32c zstd zstd_compress aes_ce_blk aes_ce_cipher af_packet
ghash_ce gf128mul sha3_ce virtio_net sha3_generic net_failover
sha512_ce failover sha512_arm64 sha2_ce sha256_arm64 virtio_mmio
virtio_ring rtc_pl031 qemu_fw_cfg virtio fuse ip_tables x_tables ext4
mbcache crc16 jbd2 nvme nvme_core pci_host_generic pci_host_common
unix [last unloaded: zsmalloc]
[ 1298.149549][ T7004] CPU: 6 PID: 7004 Comm: zram-corruptor Tainted:
G        W         5.17.0-rc7-ivan #1
0b29b2552ab8a22d5ba4845528328b4b4bb50082
[ 1298.151293][ T7004] Hardware name: linux,dummy-virt (DT)
[ 1298.151826][ T7004] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT
-SSBS BTYPE=--)
[ 1298.152722][ T7004] pc : __zram_bvec_read+0x28c/0x2e8 [zram]
[ 1298.153501][ T7004] lr : zram_bvec_rw+0x70/0x214 [zram]
[ 1298.154597][ T7004] sp : ffffffc00aefbac0
[ 1298.155417][ T7004] x29: ffffffc00aefbae0 x28: ffffff9bc5770400
x27: ffffff9bc20da100
[ 1298.155838][ T7004] x26: ffffff9bc20da100 x25: 000000000000015a
x24: 00000001000f3aa7
[ 1298.156991][ T7004] x23: 000000000000015a x22: 0000000000000000
x21: fffffffe70cf5580
[ 1298.157718][ T7004] x20: 000000000000015a x19: ffffff9bcbbf3000
x18: ffffffc008755068
[ 1298.158907][ T7004] x17: 0000000000000008 x16: ffffffd7d92df0cc
x15: 0000000000000000
[ 1298.159705][ T7004] x14: 0000000000000000 x13: 00000000000000a0
x12: 0000000000000000
[ 1298.161758][ T7004] x11: 0000000000000000 x10: ffffffc0086fb000 x9
: 0000000001000000
[ 1298.162334][ T7004] x8 : 0000000000002070 x7 : 04f2046406c06e04 x6
: 4929b081d81d4b1b
[ 1298.163165][ T7004] x5 : 0000000000000000 x4 : 0000000000000000 x3
: 0000000000000000
[ 1298.163895][ T7004] x2 : 000000000000015a x1 : 000000000000015a x0
: ffffffd77fd576af
[ 1298.164830][ T7004] Call trace:
[ 1298.165153][ T7004]  __zram_bvec_read+0x28c/0x2e8 [zram
d8b02cd0fd062c13f68799e6d1953bf67d996403]
[ 1298.165900][ T7004]  zram_bvec_rw+0x70/0x214 [zram
d8b02cd0fd062c13f68799e6d1953bf67d996403]
[ 1298.166592][ T7004]  zram_rw_page+0xb4/0x170 [zram
d8b02cd0fd062c13f68799e6d1953bf67d996403]
[ 1298.167529][ T7004]  bdev_read_page+0x74/0xac
[ 1298.167823][ T7004]  swap_readpage+0x60/0x328
[ 1298.168317][ T7004]  do_swap_page+0x438/0x904
[ 1298.168841][ T7004]  handle_pte_fault+0xcc/0x1fc
[ 1298.169250][ T7004]  handle_mm_fault+0x284/0x4a8
[ 1298.169802][ T7004]  do_page_fault+0x274/0x428
[ 1298.170217][ T7004]  do_translation_fault+0x5c/0xf8
[ 1298.170650][ T7004]  do_mem_abort+0x50/0x100
[ 1298.171070][ T7004]  el0_da+0x3c/0x74
[ 1298.171422][ T7004]  el0t_64_sync_handler+0xc4/0xec
[ 1298.172102][ T7004]  el0t_64_sync+0x1a4/0x1a8
[ 1298.172773][ T7004] ---[ end trace 0000000000000000 ]---
[ 1298.173596][ T7004] zram: Page 346 read from zram without previous write
Ivan Babrou March 15, 2022, 4:18 a.m. UTC | #5
On Fri, Mar 11, 2022 at 11:51 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> Hello,
>
> We're looking into using zram, but unfortunately we ran into some
> corruption issues. We've seen rocksdb complaining about "Corruption:
> bad entry in block", and we've also seen some coredumps that point at
> memory being zeroed out. One of our Rust processes coredumps contains
> a non-null pointer pointing at zero, among other things:
>
> * core::ptr::non_null::NonNull<u8> {pointer: 0x0}
>
> In fact, a whole bunch of memory around this pointer was all zeros.
>
> Disabling zram resolves all issues, and we can't reproduce any of
> these issues with other swap setups. I've tried adding crc32
> checksumming for pages that are compressed, but it didn't catch the
> issue either, even though userspace facing symptoms were present. My
> crc32 code doesn't touch ZRAM_SAME pages, though.
>
> Unfortunately, this isn't trivial to replicate, and I believe that it
> depends on zram used for swap specifically, not for zram as a block
> device. Specifically, swap_slot_free_notify looks suspicious.
>
> Here's a patch that I have to catch the issue in the act:
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 438ce34ee760..fea46a70a3c9 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -1265,6 +1265,9 @@ static int __zram_bvec_read(struct zram *zram,
> struct page *page, u32 index,
>   unsigned long value;
>   void *mem;
>
> + if (WARN_ON(!handle && !zram_test_flag(zram, index, ZRAM_SAME)))
> + pr_warn("Page %u read from zram without previous write\n", index);
> +
>   value = handle ? zram_get_element(zram, index) : 0;
>   mem = kmap_atomic(page);
>   zram_fill_page(mem, PAGE_SIZE, value);
>
> In essence, it warns whenever a page is read from zram that was not
> previously written to. To make this work, one needs to zero out zram
> prior to running mkswap on it.
>
> I have prepared a GitHub repo with my observations and a reproduction:
>
> * https://github.com/bobrik/zram-corruptor
>
> I'm able to trigger the following in an aarch64 VM with two threads
> reading the same memory out of swap:
>
> [ 512.651752][ T7285] ------------[ cut here ]------------
> [ 512.652279][ T7285] WARNING: CPU: 0 PID: 7285 at
> drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram]
> [ 512.653923][ T7285] Modules linked in: zram zsmalloc kheaders nfsv3
> nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
> nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> nft_counter xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc
> overlay xfs libcrc32c zstd zstd_compress af_packet aes_ce_blk
> aes_ce_cipher ghash_ce gf128mul virtio_net sha3_ce net_failover
> sha3_generic failover sha512_ce sha512_arm64 sha2_ce sha256_arm64
> virtio_mmio virtio_ring qemu_fw_cfg rtc_pl031 virtio fuse ip_tables
> x_tables ext4 mbcache crc16 jbd2 nvme nvme_core pci_host_generic
> pci_host_common unix [last unloaded: zsmalloc]
> [ 512.659238][ T7285] CPU: 0 PID: 7285 Comm: zram-corruptor Tainted: G
> W 5.16.0-ivan #1 0877d306c6dc0716835d43cafe4399473d09e406
> [ 512.660413][ T7285] Hardware name: linux,dummy-virt (DT)
> [ 512.661077][ T7285] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT
> -SSBS BTYPE=--)
> [ 512.661788][ T7285] pc : __zram_bvec_read+0x28c/0x2e8 [zram]
> [ 512.662099][ T7285] lr : zram_bvec_rw+0x70/0x204 [zram]
> [ 512.662422][ T7285] sp : ffffffc01018bac0
> [ 512.662720][ T7285] x29: ffffffc01018bae0 x28: ffffff9e4e725280 x27:
> ffffff9e4e725280
> [ 512.663122][ T7285] x26: ffffff9e4e725280 x25: 00000000000001f6 x24:
> 0000000100033e6c
> [ 512.663601][ T7285] x23: 00000000000001f6 x22: 0000000000000000 x21:
> fffffffe7a36d840
> [ 512.664252][ T7285] x20: 00000000000001f6 x19: ffffff9e69423c00 x18:
> ffffffc010711068
> [ 512.664812][ T7285] x17: 0000000000000008 x16: ffffffd34aed51bc x15:
> 0000000000000000
> [ 512.665507][ T7285] x14: 0000000000000a88 x13: 0000000000000000 x12:
> 0000000000000000
> [ 512.666183][ T7285] x11: 0000000100033e6c x10: ffffffc01091d000 x9 :
> 0000000001000000
> [ 512.666627][ T7285] x8 : 0000000000002f10 x7 : 80b75f8fb90b52c4 x6 :
> 051609fe50833de3
> [ 512.667276][ T7285] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
> 0000000000000000
> [ 512.667875][ T7285] x2 : 00000000000001f6 x1 : 00000000000001f6 x0 :
> ffffffd305b746af
> [ 512.668483][ T7285] Call trace:
> [ 512.668682][ T7285] __zram_bvec_read+0x28c/0x2e8 [zram
> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> [ 512.669405][ T7285] zram_bvec_rw+0x70/0x204 [zram
> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> [ 512.670066][ T7285] zram_rw_page+0xb4/0x16c [zram
> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> [ 512.670584][ T7285] bdev_read_page+0x74/0xac
> [ 512.670843][ T7285] swap_readpage+0x5c/0x2e4
> [ 512.671243][ T7285] do_swap_page+0x2f4/0x988
> [ 512.671560][ T7285] handle_pte_fault+0xcc/0x1fc
> [ 512.671935][ T7285] handle_mm_fault+0x284/0x4a8
> [ 512.672412][ T7285] do_page_fault+0x274/0x428
> [ 512.672704][ T7285] do_translation_fault+0x5c/0xf8
> [ 512.673083][ T7285] do_mem_abort+0x50/0xc8
> [ 512.673293][ T7285] el0_da+0x3c/0x74
> [ 512.673549][ T7285] el0t_64_sync_handler+0xc4/0xec
> [ 512.673972][ T7285] el0t_64_sync+0x1a4/0x1a8
> [ 512.674495][ T7285] ---[ end trace cf983b7507c20343 ]---
> [ 512.675359][ T7285] zram: Page 502 read from zram without previous write
>
> I can also trace accesses to zram to catch the unfortunate sequence:
>
> zram_bvec_write index = 502 [cpu = 3, tid = 7286]
> zram_free_page index = 502 [cpu = 3, tid = 7286]
> zram_bvec_read index = 502 [cpu = 3, tid = 7286]
> zram_free_page index = 502 [cpu = 3, tid = 7286] <-- problematic free
> zram_bvec_read index = 502 [cpu = 0, tid = 7285] <-- problematic read
>
> With stacks for zram_free_page:
>
> zram_bvec_write index = 502 [cpu = 3, tid = 7286]
> zram_free_page  index = 502 [cpu = 3, tid = 7286]
>
>         zram_free_page+0
>         $x.97+32
>         zram_rw_page+180
>         bdev_write_page+124
>         __swap_writepage+116
>         swap_writepage+160
>         pageout+284
>         shrink_page_list+2892
>         shrink_inactive_list+688
>         shrink_lruvec+360
>         shrink_node_memcgs+148
>         shrink_node+860
>         shrink_zones+368
>         do_try_to_free_pages+232
>         try_to_free_mem_cgroup_pages+292
>         try_charge_memcg+608
>
> zram_bvec_read  index = 502 [cpu = 3, tid = 7286]
> zram_free_page  index = 502 [cpu = 3, tid = 7286] <-- problematic free
>
>         zram_free_page+0
>         swap_range_free+220
>         swap_entry_free+244
>         swapcache_free_entries+152
>         free_swap_slot+288
>         __swap_entry_free+216
>         swap_free+108
>         do_swap_page+1776
>         handle_pte_fault+204
>         handle_mm_fault+644
>         do_page_fault+628
>         do_translation_fault+92
>         do_mem_abort+80
>         el0_da+60
>         el0t_64_sync_handler+196
>         el0t_64_sync+420
>
> zram_bvec_read  index = 502 [cpu = 0, tid = 7285] <-- problematic read
>
> The very last read is the same one that triggered the warning from my
> patch in dmesg. You can see that the slot is freed before reading by
> swapcache_free_entries. As far as I can see, only zram implements
> swap_slot_free_notify. Swapping in an uninitialized zram page results
> in all zeroes copied, which matches the symptoms.
>
> The issue doesn't reproduce if I pin both threads to the same CPU. It
> also doesn't reproduce with a single thread. All of this seems to
> point at some sort of race condition.
>
> I was able to reproduce this on x86_64 bare metal server as well.
>
> I'm happy to try out mitigation approaches for this. If my
> understanding here is incorrect, I'm also happy to try out patches
> that could help me catch the issue in the wild.

I poked around the swapping code a bit. In the failing read stack:

[ 1298.167823][ T7004]  swap_readpage+0x60/0x328
[ 1298.168317][ T7004]  do_swap_page+0x438/0x904

You can see that swap_readpage is only called from do_swap_page for
synchronous IO:

if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
    __swap_count(entry) == 1) {
    // ...
    if (page) {
        // ...
        swap_readpage(page, true);

See: https://elixir.bootlin.com/linux/v5.15.28/source/mm/memory.c#L3548

I looked around some more and found 0bcac06f27d7:

* mm, swap: skip swapcache for swapin of synchronous device

Zram is considered fast synchronous storage. Reverting that notion
makes my reproduction not complain anymore:

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 22d10f713848..d125555a0836 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3221,8 +3221,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (p->bdev && blk_queue_stable_writes(p->bdev->bd_disk->queue))
 		p->flags |= SWP_STABLE_WRITES;
 
+	/*
 	if (p->bdev && p->bdev->bd_disk->fops->rw_page)
 		p->flags |= SWP_SYNCHRONOUS_IO;
+	*/
 
 	if (p->bdev && blk_queue_nonrot(bdev_get_queue(p->bdev))) {
 		int cpu;

Putting it back in makes my code complain once again.

Hopefully this provides a clue.
Hillf Danton March 15, 2022, 4:34 a.m. UTC | #6
On Mon, 14 Mar 2022 17:24:13 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:
> On Mon, Mar 14, 2022 at 3:40 PM Minchan Kim <minchan@kernel.org> wrote:
> > Could you reproduce the problem with this workaround?
> >
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index 0bf8e40f4e57..f2438a5101a7 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -114,7 +114,7 @@ static void end_swap_bio_read(struct bio *bio)
> >         }
> >
> >         SetPageUptodate(page);
> > -       swap_slot_free_notify(page);
> > +       // swap_slot_free_notify(page);
> >  out:
> >         unlock_page(page);
> >         WRITE_ONCE(bio->bi_private, NULL);
> >
> 
> Yes:
> 
> zram_bvec_write index = 346 [cpu = 5, tid = 7003, nsecs = 956579989]
> zram_free_page  index = 346 [cpu = 5, tid = 7003, nsecs = 956620031]
> zram_bvec_read  index = 346 [cpu = 2, tid = 7005, nsecs = 967421676]
> zram_free_page  index = 346 [cpu = 2, tid = 7005, nsecs = 967502177]
> 
>         zram_free_page+0
>         swap_range_free+220
>         swap_entry_free+244
>         swapcache_free_entries+152
>         free_swap_slot+288
>         __swap_entry_free+216
>         swap_free+108
>         do_swap_page+1624
>         handle_pte_fault+204
>         handle_mm_fault+644
>         do_page_fault+628
>         do_translation_fault+92
>         do_mem_abort+80
>         el0_da+60
>         el0t_64_sync_handler+196
>         el0t_64_sync+420
> 
> zram_bvec_read  index = 346 [cpu = 6, tid = 7004, nsecs = 974478898]
> 
> [ 1298.139588][ T7004] ------------[ cut here ]------------
> [ 1298.140574][ T7004] WARNING: CPU: 6 PID: 7004 at
> drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram]
> [ 1298.142199][ T7004] Modules linked in: zram zsmalloc kheaders nfsv3
> nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
> nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc overlay xfs
> libcrc32c zstd zstd_compress aes_ce_blk aes_ce_cipher af_packet
> ghash_ce gf128mul sha3_ce virtio_net sha3_generic net_failover
> sha512_ce failover sha512_arm64 sha2_ce sha256_arm64 virtio_mmio
> virtio_ring rtc_pl031 qemu_fw_cfg virtio fuse ip_tables x_tables ext4
> mbcache crc16 jbd2 nvme nvme_core pci_host_generic pci_host_common
> unix [last unloaded: zsmalloc]
> [ 1298.149549][ T7004] CPU: 6 PID: 7004 Comm: zram-corruptor Tainted:
> G        W         5.17.0-rc7-ivan #1
> 0b29b2552ab8a22d5ba4845528328b4b4bb50082
> [ 1298.151293][ T7004] Hardware name: linux,dummy-virt (DT)
> [ 1298.151826][ T7004] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT
> -SSBS BTYPE=--)
> [ 1298.152722][ T7004] pc : __zram_bvec_read+0x28c/0x2e8 [zram]
> [ 1298.153501][ T7004] lr : zram_bvec_rw+0x70/0x214 [zram]
> [ 1298.154597][ T7004] sp : ffffffc00aefbac0
> [ 1298.155417][ T7004] x29: ffffffc00aefbae0 x28: ffffff9bc5770400
> x27: ffffff9bc20da100
> [ 1298.155838][ T7004] x26: ffffff9bc20da100 x25: 000000000000015a
> x24: 00000001000f3aa7
> [ 1298.156991][ T7004] x23: 000000000000015a x22: 0000000000000000
> x21: fffffffe70cf5580
> [ 1298.157718][ T7004] x20: 000000000000015a x19: ffffff9bcbbf3000
> x18: ffffffc008755068
> [ 1298.158907][ T7004] x17: 0000000000000008 x16: ffffffd7d92df0cc
> x15: 0000000000000000
> [ 1298.159705][ T7004] x14: 0000000000000000 x13: 00000000000000a0
> x12: 0000000000000000
> [ 1298.161758][ T7004] x11: 0000000000000000 x10: ffffffc0086fb000 x9
> : 0000000001000000
> [ 1298.162334][ T7004] x8 : 0000000000002070 x7 : 04f2046406c06e04 x6
> : 4929b081d81d4b1b
> [ 1298.163165][ T7004] x5 : 0000000000000000 x4 : 0000000000000000 x3
> : 0000000000000000
> [ 1298.163895][ T7004] x2 : 000000000000015a x1 : 000000000000015a x0
> : ffffffd77fd576af
> [ 1298.164830][ T7004] Call trace:
> [ 1298.165153][ T7004]  __zram_bvec_read+0x28c/0x2e8 [zram
> d8b02cd0fd062c13f68799e6d1953bf67d996403]
> [ 1298.165900][ T7004]  zram_bvec_rw+0x70/0x214 [zram
> d8b02cd0fd062c13f68799e6d1953bf67d996403]
> [ 1298.166592][ T7004]  zram_rw_page+0xb4/0x170 [zram
> d8b02cd0fd062c13f68799e6d1953bf67d996403]
> [ 1298.167529][ T7004]  bdev_read_page+0x74/0xac
> [ 1298.167823][ T7004]  swap_readpage+0x60/0x328
> [ 1298.168317][ T7004]  do_swap_page+0x438/0x904
> [ 1298.168841][ T7004]  handle_pte_fault+0xcc/0x1fc
> [ 1298.169250][ T7004]  handle_mm_fault+0x284/0x4a8
> [ 1298.169802][ T7004]  do_page_fault+0x274/0x428
> [ 1298.170217][ T7004]  do_translation_fault+0x5c/0xf8
> [ 1298.170650][ T7004]  do_mem_abort+0x50/0x100
> [ 1298.171070][ T7004]  el0_da+0x3c/0x74
> [ 1298.171422][ T7004]  el0t_64_sync_handler+0xc4/0xec
> [ 1298.172102][ T7004]  el0t_64_sync+0x1a4/0x1a8
> [ 1298.172773][ T7004] ---[ end trace 0000000000000000 ]---
> [ 1298.173596][ T7004] zram: Page 346 read from zram without previous write

Can you test if the race comes from the diff below wrt zram_free_page?

Hillf

--- upstream/mm/page_io.c
+++ b/mm/page_io.c
@@ -392,11 +392,6 @@ int swap_readpage(struct page *page, boo
 	if (sis->flags & SWP_SYNCHRONOUS_IO) {
 		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
 		if (!ret) {
-			if (trylock_page(page)) {
-				swap_slot_free_notify(page);
-				unlock_page(page);
-			}
-
 			count_vm_event(PSWPIN);
 			goto out;
 		}
Ivan Babrou March 15, 2022, 4:54 a.m. UTC | #7
On Mon, Mar 14, 2022 at 9:34 PM Hillf Danton <hdanton@sina.com> wrote:
> Can you test if the race comes from the diff below wrt zram_free_page?
>
> Hillf
>
> --- upstream/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -392,11 +392,6 @@ int swap_readpage(struct page *page, boo
>         if (sis->flags & SWP_SYNCHRONOUS_IO) {
>                 ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
>                 if (!ret) {
> -                       if (trylock_page(page)) {
> -                               swap_slot_free_notify(page);
> -                               unlock_page(page);
> -                       }
> -
>                         count_vm_event(PSWPIN);
>                         goto out;
>                 }

I tried it without the previous patch (the one commenting out
swap_slot_free_notify in end_swap_bio_read) and it still fails.
Minchan Kim March 15, 2022, 6:57 a.m. UTC | #8
On Mon, Mar 14, 2022 at 09:18:43PM -0700, Ivan Babrou wrote:
> On Fri, Mar 11, 2022 at 11:51 AM Ivan Babrou <ivan@cloudflare.com> wrote:
> >
> > Hello,
> >
> > We're looking into using zram, but unfortunately we ran into some
> > corruption issues. We've seen rocksdb complaining about "Corruption:
> > bad entry in block", and we've also seen some coredumps that point at
> > memory being zeroed out. One of our Rust processes coredumps contains
> > a non-null pointer pointing at zero, among other things:
> >
> > * core::ptr::non_null::NonNull<u8> {pointer: 0x0}
> >
> > In fact, a whole bunch of memory around this pointer was all zeros.
> >
> > Disabling zram resolves all issues, and we can't reproduce any of
> > these issues with other swap setups. I've tried adding crc32
> > checksumming for pages that are compressed, but it didn't catch the
> > issue either, even though userspace facing symptoms were present. My
> > crc32 code doesn't touch ZRAM_SAME pages, though.
> >
> > Unfortunately, this isn't trivial to replicate, and I believe that it
> > depends on zram used for swap specifically, not for zram as a block
> > device. Specifically, swap_slot_free_notify looks suspicious.
> >
> > Here's a patch that I have to catch the issue in the act:
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index 438ce34ee760..fea46a70a3c9 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -1265,6 +1265,9 @@ static int __zram_bvec_read(struct zram *zram,
> > struct page *page, u32 index,
> >   unsigned long value;
> >   void *mem;
> >
> > + if (WARN_ON(!handle && !zram_test_flag(zram, index, ZRAM_SAME)))
> > + pr_warn("Page %u read from zram without previous write\n", index);
> > +
> >   value = handle ? zram_get_element(zram, index) : 0;
> >   mem = kmap_atomic(page);
> >   zram_fill_page(mem, PAGE_SIZE, value);
> >
> > In essence, it warns whenever a page is read from zram that was not
> > previously written to. To make this work, one needs to zero out zram
> > prior to running mkswap on it.
> >
> > I have prepared a GitHub repo with my observations and a reproduction:
> >
> > * https://github.com/bobrik/zram-corruptor
> >
> > I'm able to trigger the following in an aarch64 VM with two threads
> > reading the same memory out of swap:
> >
> > [ 512.651752][ T7285] ------------[ cut here ]------------
> > [ 512.652279][ T7285] WARNING: CPU: 0 PID: 7285 at
> > drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram]
> > [ 512.653923][ T7285] Modules linked in: zram zsmalloc kheaders nfsv3
> > nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
> > nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> > nft_counter xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc
> > overlay xfs libcrc32c zstd zstd_compress af_packet aes_ce_blk
> > aes_ce_cipher ghash_ce gf128mul virtio_net sha3_ce net_failover
> > sha3_generic failover sha512_ce sha512_arm64 sha2_ce sha256_arm64
> > virtio_mmio virtio_ring qemu_fw_cfg rtc_pl031 virtio fuse ip_tables
> > x_tables ext4 mbcache crc16 jbd2 nvme nvme_core pci_host_generic
> > pci_host_common unix [last unloaded: zsmalloc]
> > [ 512.659238][ T7285] CPU: 0 PID: 7285 Comm: zram-corruptor Tainted: G
> > W 5.16.0-ivan #1 0877d306c6dc0716835d43cafe4399473d09e406
> > [ 512.660413][ T7285] Hardware name: linux,dummy-virt (DT)
> > [ 512.661077][ T7285] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT
> > -SSBS BTYPE=--)
> > [ 512.661788][ T7285] pc : __zram_bvec_read+0x28c/0x2e8 [zram]
> > [ 512.662099][ T7285] lr : zram_bvec_rw+0x70/0x204 [zram]
> > [ 512.662422][ T7285] sp : ffffffc01018bac0
> > [ 512.662720][ T7285] x29: ffffffc01018bae0 x28: ffffff9e4e725280 x27:
> > ffffff9e4e725280
> > [ 512.663122][ T7285] x26: ffffff9e4e725280 x25: 00000000000001f6 x24:
> > 0000000100033e6c
> > [ 512.663601][ T7285] x23: 00000000000001f6 x22: 0000000000000000 x21:
> > fffffffe7a36d840
> > [ 512.664252][ T7285] x20: 00000000000001f6 x19: ffffff9e69423c00 x18:
> > ffffffc010711068
> > [ 512.664812][ T7285] x17: 0000000000000008 x16: ffffffd34aed51bc x15:
> > 0000000000000000
> > [ 512.665507][ T7285] x14: 0000000000000a88 x13: 0000000000000000 x12:
> > 0000000000000000
> > [ 512.666183][ T7285] x11: 0000000100033e6c x10: ffffffc01091d000 x9 :
> > 0000000001000000
> > [ 512.666627][ T7285] x8 : 0000000000002f10 x7 : 80b75f8fb90b52c4 x6 :
> > 051609fe50833de3
> > [ 512.667276][ T7285] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
> > 0000000000000000
> > [ 512.667875][ T7285] x2 : 00000000000001f6 x1 : 00000000000001f6 x0 :
> > ffffffd305b746af
> > [ 512.668483][ T7285] Call trace:
> > [ 512.668682][ T7285] __zram_bvec_read+0x28c/0x2e8 [zram
> > 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> > [ 512.669405][ T7285] zram_bvec_rw+0x70/0x204 [zram
> > 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> > [ 512.670066][ T7285] zram_rw_page+0xb4/0x16c [zram
> > 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> > [ 512.670584][ T7285] bdev_read_page+0x74/0xac
> > [ 512.670843][ T7285] swap_readpage+0x5c/0x2e4
> > [ 512.671243][ T7285] do_swap_page+0x2f4/0x988
> > [ 512.671560][ T7285] handle_pte_fault+0xcc/0x1fc
> > [ 512.671935][ T7285] handle_mm_fault+0x284/0x4a8
> > [ 512.672412][ T7285] do_page_fault+0x274/0x428
> > [ 512.672704][ T7285] do_translation_fault+0x5c/0xf8
> > [ 512.673083][ T7285] do_mem_abort+0x50/0xc8
> > [ 512.673293][ T7285] el0_da+0x3c/0x74
> > [ 512.673549][ T7285] el0t_64_sync_handler+0xc4/0xec
> > [ 512.673972][ T7285] el0t_64_sync+0x1a4/0x1a8
> > [ 512.674495][ T7285] ---[ end trace cf983b7507c20343 ]---
> > [ 512.675359][ T7285] zram: Page 502 read from zram without previous write
> >
> > I can also trace accesses to zram to catch the unfortunate sequence:
> >
> > zram_bvec_write index = 502 [cpu = 3, tid = 7286]
> > zram_free_page index = 502 [cpu = 3, tid = 7286]
> > zram_bvec_read index = 502 [cpu = 3, tid = 7286]
> > zram_free_page index = 502 [cpu = 3, tid = 7286] <-- problematic free
> > zram_bvec_read index = 502 [cpu = 0, tid = 7285] <-- problematic read
> >
> > With stacks for zram_free_page:
> >
> > zram_bvec_write index = 502 [cpu = 3, tid = 7286]
> > zram_free_page  index = 502 [cpu = 3, tid = 7286]
> >
> >         zram_free_page+0
> >         $x.97+32
> >         zram_rw_page+180
> >         bdev_write_page+124
> >         __swap_writepage+116
> >         swap_writepage+160
> >         pageout+284
> >         shrink_page_list+2892
> >         shrink_inactive_list+688
> >         shrink_lruvec+360
> >         shrink_node_memcgs+148
> >         shrink_node+860
> >         shrink_zones+368
> >         do_try_to_free_pages+232
> >         try_to_free_mem_cgroup_pages+292
> >         try_charge_memcg+608
> >
> > zram_bvec_read  index = 502 [cpu = 3, tid = 7286]
> > zram_free_page  index = 502 [cpu = 3, tid = 7286] <-- problematic free
> >
> >         zram_free_page+0
> >         swap_range_free+220
> >         swap_entry_free+244
> >         swapcache_free_entries+152
> >         free_swap_slot+288
> >         __swap_entry_free+216
> >         swap_free+108
> >         do_swap_page+1776
> >         handle_pte_fault+204
> >         handle_mm_fault+644
> >         do_page_fault+628
> >         do_translation_fault+92
> >         do_mem_abort+80
> >         el0_da+60
> >         el0t_64_sync_handler+196
> >         el0t_64_sync+420
> >
> > zram_bvec_read  index = 502 [cpu = 0, tid = 7285] <-- problematic read
> >
> > The very last read is the same one that triggered the warning from my
> > patch in dmesg. You can see that the slot is freed before reading by
> > swapcache_free_entries. As far as I can see, only zram implements
> > swap_slot_free_notify. Swapping in an uninitialized zram page results
> > in all zeroes copied, which matches the symptoms.
> >
> > The issue doesn't reproduce if I pin both threads to the same CPU. It
> > also doesn't reproduce with a single thread. All of this seems to
> > point at some sort of race condition.
> >
> > I was able to reproduce this on x86_64 bare metal server as well.
> >
> > I'm happy to try out mitigation approaches for this. If my
> > understanding here is incorrect, I'm also happy to try out patches
> > that could help me catch the issue in the wild.
> 
> I poked around the swapping code a bit. In the failing read stack:
> 
> [ 1298.167823][ T7004]  swap_readpage+0x60/0x328
> [ 1298.168317][ T7004]  do_swap_page+0x438/0x904
> 
> You can see that swap_readpage is only called from do_swap_page for
> synchronous IO:
> 
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>     __swap_count(entry) == 1) {
>     // ...
>     if (page) {
>         // ...
>         swap_readpage(page, true);
> 
> See: https://elixir.bootlin.com/linux/v5.15.28/source/mm/memory.c#L3548
> 
> I looked around some more and found 0bcac06f27d7:
> 
> * mm, swap: skip swapcache for swapin of synchronous device
> 
> Zram is considered fast synchronous storage. Reverting that notion
> makes my reproduction not complain anymore:


Yeah, that was the part I was chasing, since we had a problem there
before:

5df373e95689b, mm/page_io.c: do not free shared swap slots

Initially, I suspected a __swap_count race (I still believe there is
one between swap_slot_free_notify and do_swap_page) and fixed that
race with a workaround, but the problem still happened.

Looks like your test program clones the child with CLONE_VM, which
never calls swap_duplicate to increase the swap_map count. It means
0bcac06f27d7 and 5df373e95689b can't work with CLONE_VM.

I think reverting them is best at this moment unless someone
has a better idea.
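
To illustrate the CLONE_VM point, here is a hedged userspace sketch
(not kernel code, purely illustrative): a CLONE_VM child shares the
parent's mm, so a swapped-out anonymous page has a single PTE,
swap_duplicate is never called for its entry, and __swap_count stays
1, which is what lets both tasks take the skip-swapcache path for the
same entry. A plain fork would copy the page tables, call
swap_duplicate, bump the count to 2, and force the swapcache path
instead.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SZ (1024 * 1024)

static volatile unsigned char *shared;

/* Child entry point: touches the same shared-mm page the parent touches. */
static int child_fn(void *arg)
{
	return shared[0];
}

int main(void)
{
	char *stack = malloc(STACK_SZ);
	pid_t pid;

	shared = malloc(4096);
	shared[0] = 0xaa;

	/* CLONE_VM: the child shares our mm, exactly like a thread does. */
	pid = clone(child_fn, stack + STACK_SZ, CLONE_VM | SIGCHLD, NULL);

	(void)shared[0];        /* parent faults the same page           */
	waitpid(pid, NULL, 0);  /* fork() here would bump swap_map to 2  */
	return 0;
}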
David Hildenbrand March 15, 2022, 8:30 a.m. UTC | #9
On 15.03.22 07:57, Minchan Kim wrote:
> On Mon, Mar 14, 2022 at 09:18:43PM -0700, Ivan Babrou wrote:
>> On Fri, Mar 11, 2022 at 11:51 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>>>
>>> Hello,
>>>
>>> We're looking into using zram, but unfortunately we ran into some
>>> corruption issues. We've seen rocksdb complaining about "Corruption:
>>> bad entry in block", and we've also seen some coredumps that point at
>>> memory being zeroed out. One of our Rust processes coredumps contains
>>> a non-null pointer pointing at zero, among other things:
>>>
>>> * core::ptr::non_null::NonNull<u8> {pointer: 0x0}
>>>
>>> In fact, a whole bunch of memory around this pointer was all zeros.
>>>
>>> Disabling zram resolves all issues, and we can't reproduce any of
>>> these issues with other swap setups. I've tried adding crc32
>>> checksumming for pages that are compressed, but it didn't catch the
>>> issue either, even though userspace facing symptoms were present. My
>>> crc32 code doesn't touch ZRAM_SAME pages, though.
>>>
>>> Unfortunately, this isn't trivial to replicate, and I believe that it
>>> depends on zram used for swap specifically, not for zram as a block
>>> device. Specifically, swap_slot_free_notify looks suspicious.
>>>
>>> Here's a patch that I have to catch the issue in the act:
>>>
>>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>>> index 438ce34ee760..fea46a70a3c9 100644
>>> --- a/drivers/block/zram/zram_drv.c
>>> +++ b/drivers/block/zram/zram_drv.c
>>> @@ -1265,6 +1265,9 @@ static int __zram_bvec_read(struct zram *zram,
>>> struct page *page, u32 index,
>>>   unsigned long value;
>>>   void *mem;
>>>
>>> + if (WARN_ON(!handle && !zram_test_flag(zram, index, ZRAM_SAME)))
>>> + pr_warn("Page %u read from zram without previous write\n", index);
>>> +
>>>   value = handle ? zram_get_element(zram, index) : 0;
>>>   mem = kmap_atomic(page);
>>>   zram_fill_page(mem, PAGE_SIZE, value);
>>>
>>> In essence, it warns whenever a page is read from zram that was not
>>> previously written to. To make this work, one needs to zero out zram
>>> prior to running mkswap on it.
>>>
>>> I have prepared a GitHub repo with my observations and a reproduction:
>>>
>>> * https://github.com/bobrik/zram-corruptor
>>>
>>> I'm able to trigger the following in an aarch64 VM with two threads
>>> reading the same memory out of swap:
>>>
>>> [ 512.651752][ T7285] ------------[ cut here ]------------
>>> [ 512.652279][ T7285] WARNING: CPU: 0 PID: 7285 at
>>> drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram]
>>> [ 512.653923][ T7285] Modules linked in: zram zsmalloc kheaders nfsv3
>>> nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
>>> nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
>>> nft_counter xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc
>>> overlay xfs libcrc32c zstd zstd_compress af_packet aes_ce_blk
>>> aes_ce_cipher ghash_ce gf128mul virtio_net sha3_ce net_failover
>>> sha3_generic failover sha512_ce sha512_arm64 sha2_ce sha256_arm64
>>> virtio_mmio virtio_ring qemu_fw_cfg rtc_pl031 virtio fuse ip_tables
>>> x_tables ext4 mbcache crc16 jbd2 nvme nvme_core pci_host_generic
>>> pci_host_common unix [last unloaded: zsmalloc]
>>> [ 512.659238][ T7285] CPU: 0 PID: 7285 Comm: zram-corruptor Tainted: G
>>> W 5.16.0-ivan #1 0877d306c6dc0716835d43cafe4399473d09e406
>>> [ 512.660413][ T7285] Hardware name: linux,dummy-virt (DT)
>>> [ 512.661077][ T7285] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT
>>> -SSBS BTYPE=--)
>>> [ 512.661788][ T7285] pc : __zram_bvec_read+0x28c/0x2e8 [zram]
>>> [ 512.662099][ T7285] lr : zram_bvec_rw+0x70/0x204 [zram]
>>> [ 512.662422][ T7285] sp : ffffffc01018bac0
>>> [ 512.662720][ T7285] x29: ffffffc01018bae0 x28: ffffff9e4e725280 x27:
>>> ffffff9e4e725280
>>> [ 512.663122][ T7285] x26: ffffff9e4e725280 x25: 00000000000001f6 x24:
>>> 0000000100033e6c
>>> [ 512.663601][ T7285] x23: 00000000000001f6 x22: 0000000000000000 x21:
>>> fffffffe7a36d840
>>> [ 512.664252][ T7285] x20: 00000000000001f6 x19: ffffff9e69423c00 x18:
>>> ffffffc010711068
>>> [ 512.664812][ T7285] x17: 0000000000000008 x16: ffffffd34aed51bc x15:
>>> 0000000000000000
>>> [ 512.665507][ T7285] x14: 0000000000000a88 x13: 0000000000000000 x12:
>>> 0000000000000000
>>> [ 512.666183][ T7285] x11: 0000000100033e6c x10: ffffffc01091d000 x9 :
>>> 0000000001000000
>>> [ 512.666627][ T7285] x8 : 0000000000002f10 x7 : 80b75f8fb90b52c4 x6 :
>>> 051609fe50833de3
>>> [ 512.667276][ T7285] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
>>> 0000000000000000
>>> [ 512.667875][ T7285] x2 : 00000000000001f6 x1 : 00000000000001f6 x0 :
>>> ffffffd305b746af
>>> [ 512.668483][ T7285] Call trace:
>>> [ 512.668682][ T7285] __zram_bvec_read+0x28c/0x2e8 [zram
>>> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
>>> [ 512.669405][ T7285] zram_bvec_rw+0x70/0x204 [zram
>>> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
>>> [ 512.670066][ T7285] zram_rw_page+0xb4/0x16c [zram
>>> 745969ed35ea0fb382bfd518d6f70e13966e9b52]
>>> [ 512.670584][ T7285] bdev_read_page+0x74/0xac
>>> [ 512.670843][ T7285] swap_readpage+0x5c/0x2e4
>>> [ 512.671243][ T7285] do_swap_page+0x2f4/0x988
>>> [ 512.671560][ T7285] handle_pte_fault+0xcc/0x1fc
>>> [ 512.671935][ T7285] handle_mm_fault+0x284/0x4a8
>>> [ 512.672412][ T7285] do_page_fault+0x274/0x428
>>> [ 512.672704][ T7285] do_translation_fault+0x5c/0xf8
>>> [ 512.673083][ T7285] do_mem_abort+0x50/0xc8
>>> [ 512.673293][ T7285] el0_da+0x3c/0x74
>>> [ 512.673549][ T7285] el0t_64_sync_handler+0xc4/0xec
>>> [ 512.673972][ T7285] el0t_64_sync+0x1a4/0x1a8
>>> [ 512.674495][ T7285] ---[ end trace cf983b7507c20343 ]---
>>> [ 512.675359][ T7285] zram: Page 502 read from zram without previous write
>>>
>>> I can also trace accesses to zram to catch the unfortunate sequence:
>>>
>>> zram_bvec_write index = 502 [cpu = 3, tid = 7286]
>>> zram_free_page index = 502 [cpu = 3, tid = 7286]
>>> zram_bvec_read index = 502 [cpu = 3, tid = 7286]
>>> zram_free_page index = 502 [cpu = 3, tid = 7286] <-- problematic free
>>> zram_bvec_read index = 502 [cpu = 0, tid = 7285] <-- problematic read
>>>
>>> With stacks for zram_free_page:
>>>
>>> zram_bvec_write index = 502 [cpu = 3, tid = 7286]
>>> zram_free_page  index = 502 [cpu = 3, tid = 7286]
>>>
>>>         zram_free_page+0
>>>         $x.97+32
>>>         zram_rw_page+180
>>>         bdev_write_page+124
>>>         __swap_writepage+116
>>>         swap_writepage+160
>>>         pageout+284
>>>         shrink_page_list+2892
>>>         shrink_inactive_list+688
>>>         shrink_lruvec+360
>>>         shrink_node_memcgs+148
>>>         shrink_node+860
>>>         shrink_zones+368
>>>         do_try_to_free_pages+232
>>>         try_to_free_mem_cgroup_pages+292
>>>         try_charge_memcg+608
>>>
>>> zram_bvec_read  index = 502 [cpu = 3, tid = 7286]
>>> zram_free_page  index = 502 [cpu = 3, tid = 7286] <-- problematic free
>>>
>>>         zram_free_page+0
>>>         swap_range_free+220
>>>         swap_entry_free+244
>>>         swapcache_free_entries+152
>>>         free_swap_slot+288
>>>         __swap_entry_free+216
>>>         swap_free+108
>>>         do_swap_page+1776
>>>         handle_pte_fault+204
>>>         handle_mm_fault+644
>>>         do_page_fault+628
>>>         do_translation_fault+92
>>>         do_mem_abort+80
>>>         el0_da+60
>>>         el0t_64_sync_handler+196
>>>         el0t_64_sync+420
>>>
>>> zram_bvec_read  index = 502 [cpu = 0, tid = 7285] <-- problematic read
>>>
>>> The very last read is the same one that triggered the warning from my
>>> patch in dmesg. You can see that the slot is freed before reading by
>>> swapcache_free_entries. As far as I can see, only zram implements
>>> swap_slot_free_notify. Swapping in an uninitialized zram page results
>>> in all zeroes copied, which matches the symptoms.
>>>
>>> The issue doesn't reproduce if I pin both threads to the same CPU. It
>>> also doesn't reproduce with a single thread. All of this seems to
>>> point at some sort of race condition.
>>>
>>> I was able to reproduce this on x86_64 bare metal server as well.
>>>
>>> I'm happy to try out mitigation approaches for this. If my
>>> understanding here is incorrect, I'm also happy to try out patches
>>> that could help me catch the issue in the wild.
>>
>> I poked around the swapping code a bit. In the failing read stack:
>>
>> [ 1298.167823][ T7004]  swap_readpage+0x60/0x328
>> [ 1298.168317][ T7004]  do_swap_page+0x438/0x904
>>
>> You can see that swap_readpage is only called from do_swap_page for
>> synchronous IO:
>>
>> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>>     __swap_count(entry) == 1) {
>>     // ...
>>     if (page) {
>>         // ...
>>         swap_readpage(page, true);
>>
>> See: https://elixir.bootlin.com/linux/v5.15.28/source/mm/memory.c#L3548
>>
>> I looked around some more and found 0bcac06f27d7:
>>
>> * mm, swap: skip swapcache for swapin of synchronous device
>>
>> Zram is considered fast synchronous storage. Reverting that notion
>> makes my reproduction not complain anymore:
> 
> 
> Yeah, that was the part I was chasing since we had problem there
> 
> 5df373e95689b, mm/page_io.c: do not free shared swap slots
> 
> Initially, I suspected __swap_count race(I still believe it has
> swap_slot_free_notify and do_swap_page) and fixed the race
> with workaround but the problem still happened. 
> 
> Looks like your test program clone the child with CLONE_VM
> which never call swap_duplicate to increase swap_map count.
> It means the 0bcac06f27d7 and 5df373e95689b couldn't work
> with CLONE_VM.
> 
> I think reverting them is best at this moment unless someone
> has an idea.
> 

Is it just zram that's broken? Can we special-case that to not bypass
the swapcache?
Minchan Kim March 15, 2022, 10:09 p.m. UTC | #10
On Mon, Mar 14, 2022 at 11:57:35PM -0700, Minchan Kim wrote:
> On Mon, Mar 14, 2022 at 09:18:43PM -0700, Ivan Babrou wrote:
> > On Fri, Mar 11, 2022 at 11:51 AM Ivan Babrou <ivan@cloudflare.com> wrote:
> > >
> > > Hello,
> > >
> > > We're looking into using zram, but unfortunately we ran into some
> > > corruption issues. We've seen rocksdb complaining about "Corruption:
> > > bad entry in block", and we've also seen some coredumps that point at
> > > memory being zeroed out. One of our Rust processes coredumps contains
> > > a non-null pointer pointing at zero, among other things:
> > >
> > > * core::ptr::non_null::NonNull<u8> {pointer: 0x0}
> > >
> > > In fact, a whole bunch of memory around this pointer was all zeros.
> > >
> > > Disabling zram resolves all issues, and we can't reproduce any of
> > > these issues with other swap setups. I've tried adding crc32
> > > checksumming for pages that are compressed, but it didn't catch the
> > > issue either, even though userspace facing symptoms were present. My
> > > crc32 code doesn't touch ZRAM_SAME pages, though.
> > >
> > > Unfortunately, this isn't trivial to replicate, and I believe that it
> > > depends on zram used for swap specifically, not for zram as a block
> > > device. Specifically, swap_slot_free_notify looks suspicious.
> > >
> > > Here's a patch that I have to catch the issue in the act:
> > >
> > > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > > index 438ce34ee760..fea46a70a3c9 100644
> > > --- a/drivers/block/zram/zram_drv.c
> > > +++ b/drivers/block/zram/zram_drv.c
> > > @@ -1265,6 +1265,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
> > >   unsigned long value;
> > >   void *mem;
> > >
> > > + if (WARN_ON(!handle && !zram_test_flag(zram, index, ZRAM_SAME)))
> > > + pr_warn("Page %u read from zram without previous write\n", index);
> > > +
> > >   value = handle ? zram_get_element(zram, index) : 0;
> > >   mem = kmap_atomic(page);
> > >   zram_fill_page(mem, PAGE_SIZE, value);
> > >
> > > In essence, it warns whenever a page is read from zram that was not
> > > previously written to. To make this work, one needs to zero out zram
> > > prior to running mkswap on it.
> > >
> > > I have prepared a GitHub repo with my observations and a reproduction:
> > >
> > > * https://github.com/bobrik/zram-corruptor
> > >
> > > I'm able to trigger the following in an aarch64 VM with two threads
> > > reading the same memory out of swap:
> > >
> > > [ 512.651752][ T7285] ------------[ cut here ]------------
> > > [ 512.652279][ T7285] WARNING: CPU: 0 PID: 7285 at
> > > drivers/block/zram/zram_drv.c:1285 __zram_bvec_read+0x28c/0x2e8 [zram]
> > > [ 512.653923][ T7285] Modules linked in: zram zsmalloc kheaders nfsv3
> > > nfs lockd grace sunrpc xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat
> > > nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> > > nft_counter xt_addrtype nft_compat nf_tables nfnetlink bridge stp llc
> > > overlay xfs libcrc32c zstd zstd_compress af_packet aes_ce_blk
> > > aes_ce_cipher ghash_ce gf128mul virtio_net sha3_ce net_failover
> > > sha3_generic failover sha512_ce sha512_arm64 sha2_ce sha256_arm64
> > > virtio_mmio virtio_ring qemu_fw_cfg rtc_pl031 virtio fuse ip_tables
> > > x_tables ext4 mbcache crc16 jbd2 nvme nvme_core pci_host_generic
> > > pci_host_common unix [last unloaded: zsmalloc]
> > > [ 512.659238][ T7285] CPU: 0 PID: 7285 Comm: zram-corruptor Tainted: G
> > > W 5.16.0-ivan #1 0877d306c6dc0716835d43cafe4399473d09e406
> > > [ 512.660413][ T7285] Hardware name: linux,dummy-virt (DT)
> > > [ 512.661077][ T7285] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT
> > > -SSBS BTYPE=--)
> > > [ 512.661788][ T7285] pc : __zram_bvec_read+0x28c/0x2e8 [zram]
> > > [ 512.662099][ T7285] lr : zram_bvec_rw+0x70/0x204 [zram]
> > > [ 512.662422][ T7285] sp : ffffffc01018bac0
> > > [ 512.662720][ T7285] x29: ffffffc01018bae0 x28: ffffff9e4e725280 x27:
> > > ffffff9e4e725280
> > > [ 512.663122][ T7285] x26: ffffff9e4e725280 x25: 00000000000001f6 x24:
> > > 0000000100033e6c
> > > [ 512.663601][ T7285] x23: 00000000000001f6 x22: 0000000000000000 x21:
> > > fffffffe7a36d840
> > > [ 512.664252][ T7285] x20: 00000000000001f6 x19: ffffff9e69423c00 x18:
> > > ffffffc010711068
> > > [ 512.664812][ T7285] x17: 0000000000000008 x16: ffffffd34aed51bc x15:
> > > 0000000000000000
> > > [ 512.665507][ T7285] x14: 0000000000000a88 x13: 0000000000000000 x12:
> > > 0000000000000000
> > > [ 512.666183][ T7285] x11: 0000000100033e6c x10: ffffffc01091d000 x9 :
> > > 0000000001000000
> > > [ 512.666627][ T7285] x8 : 0000000000002f10 x7 : 80b75f8fb90b52c4 x6 :
> > > 051609fe50833de3
> > > [ 512.667276][ T7285] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
> > > 0000000000000000
> > > [ 512.667875][ T7285] x2 : 00000000000001f6 x1 : 00000000000001f6 x0 :
> > > ffffffd305b746af
> > > [ 512.668483][ T7285] Call trace:
> > > [ 512.668682][ T7285] __zram_bvec_read+0x28c/0x2e8 [zram
> > > 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> > > [ 512.669405][ T7285] zram_bvec_rw+0x70/0x204 [zram
> > > 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> > > [ 512.670066][ T7285] zram_rw_page+0xb4/0x16c [zram
> > > 745969ed35ea0fb382bfd518d6f70e13966e9b52]
> > > [ 512.670584][ T7285] bdev_read_page+0x74/0xac
> > > [ 512.670843][ T7285] swap_readpage+0x5c/0x2e4
> > > [ 512.671243][ T7285] do_swap_page+0x2f4/0x988
> > > [ 512.671560][ T7285] handle_pte_fault+0xcc/0x1fc
> > > [ 512.671935][ T7285] handle_mm_fault+0x284/0x4a8
> > > [ 512.672412][ T7285] do_page_fault+0x274/0x428
> > > [ 512.672704][ T7285] do_translation_fault+0x5c/0xf8
> > > [ 512.673083][ T7285] do_mem_abort+0x50/0xc8
> > > [ 512.673293][ T7285] el0_da+0x3c/0x74
> > > [ 512.673549][ T7285] el0t_64_sync_handler+0xc4/0xec
> > > [ 512.673972][ T7285] el0t_64_sync+0x1a4/0x1a8
> > > [ 512.674495][ T7285] ---[ end trace cf983b7507c20343 ]---
> > > [ 512.675359][ T7285] zram: Page 502 read from zram without previous write
> > >
> > > I can also trace accesses to zram to catch the unfortunate sequence:
> > >
> > > zram_bvec_write index = 502 [cpu = 3, tid = 7286]
> > > zram_free_page index = 502 [cpu = 3, tid = 7286]
> > > zram_bvec_read index = 502 [cpu = 3, tid = 7286]
> > > zram_free_page index = 502 [cpu = 3, tid = 7286] <-- problematic free
> > > zram_bvec_read index = 502 [cpu = 0, tid = 7285] <-- problematic read
> > >
> > > With stacks for zram_free_page:
> > >
> > > zram_bvec_write index = 502 [cpu = 3, tid = 7286]
> > > zram_free_page  index = 502 [cpu = 3, tid = 7286]
> > >
> > >         zram_free_page+0
> > >         $x.97+32
> > >         zram_rw_page+180
> > >         bdev_write_page+124
> > >         __swap_writepage+116
> > >         swap_writepage+160
> > >         pageout+284
> > >         shrink_page_list+2892
> > >         shrink_inactive_list+688
> > >         shrink_lruvec+360
> > >         shrink_node_memcgs+148
> > >         shrink_node+860
> > >         shrink_zones+368
> > >         do_try_to_free_pages+232
> > >         try_to_free_mem_cgroup_pages+292
> > >         try_charge_memcg+608
> > >
> > > zram_bvec_read  index = 502 [cpu = 3, tid = 7286]
> > > zram_free_page  index = 502 [cpu = 3, tid = 7286] <-- problematic free
> > >
> > >         zram_free_page+0
> > >         swap_range_free+220
> > >         swap_entry_free+244
> > >         swapcache_free_entries+152
> > >         free_swap_slot+288
> > >         __swap_entry_free+216
> > >         swap_free+108
> > >         do_swap_page+1776
> > >         handle_pte_fault+204
> > >         handle_mm_fault+644
> > >         do_page_fault+628
> > >         do_translation_fault+92
> > >         do_mem_abort+80
> > >         el0_da+60
> > >         el0t_64_sync_handler+196
> > >         el0t_64_sync+420
> > >
> > > zram_bvec_read  index = 502 [cpu = 0, tid = 7285] <-- problematic read
> > >
> > > The very last read is the same one that triggered the warning from my
> > > patch in dmesg. You can see that the slot is freed before reading by
> > > swapcache_free_entries. As far as I can see, only zram implements
> > > swap_slot_free_notify. Swapping in an uninitialized zram page results
> > > in all zeroes copied, which matches the symptoms.
> > >
> > > The issue doesn't reproduce if I pin both threads to the same CPU. It
> > > also doesn't reproduce with a single thread. All of this seems to
> > > point at some sort of race condition.
> > >
> > > I was able to reproduce this on x86_64 bare metal server as well.
> > >
> > > I'm happy to try out mitigation approaches for this. If my
> > > understanding here is incorrect, I'm also happy to try out patches
> > > that could help me catch the issue in the wild.
> > 
> > I poked around the swapping code a bit. In the failing read stack:
> > 
> > [ 1298.167823][ T7004]  swap_readpage+0x60/0x328
> > [ 1298.168317][ T7004]  do_swap_page+0x438/0x904
> > 
> > You can see that swap_readpage is only called from do_swap_page for
> > synchronous IO:
> > 
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> >     __swap_count(entry) == 1) {
> >     // ...
> >     if (page) {
> >         // ...
> >         swap_readpage(page, true);
> > 
> > See: https://elixir.bootlin.com/linux/v5.15.28/source/mm/memory.c#L3548
> > 
> > I looked around some more and found 0bcac06f27d7:
> > 
> > * mm, swap: skip swapcache for swapin of synchronous device
> > 
> > Zram is considered fast synchronous storage. Reverting that notion
> > makes my reproduction not complain anymore:
> 
> 
> Yeah, that was the part I was chasing, since we had a problem there:
> 
> 5df373e95689b, mm/page_io.c: do not free shared swap slots
> 
> Initially, I suspected a __swap_count race (I still believe one exists
> between swap_slot_free_notify and do_swap_page) and fixed that race
> with a workaround, but the problem still happened.
> 
> Looks like your test program clones the child with CLONE_VM, which
> never calls swap_duplicate to increase the swap_map count. It means
> 0bcac06f27d7 and 5df373e95689b don't work correctly with CLONE_VM.
> 
> I think reverting them is best at this moment unless someone
> has a better idea.

I think the problem with CLONE_VM is the following race:

CPU A                        CPU B

do_swap_page                do_swap_page
SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
swap_readpage original data
  swap_slot_free_notify
    delete zram entry
                            swap_readpage zero data
                            pte_lock
                            map the *zero data* to userspace
                            pte_unlock
pte_lock
if (!pte_same)
  goto out_nomap;
pte_unlock
return and next refault will
read zero data

So, CPU A and B both see zero data. With the patch below, the sequence changes to:


CPU A                        CPU B

do_swap_page                do_swap_page
SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
                            swap_readpage original data
                            pte_lock
                            map the original data
                            swap_free
                              swap_range_free
                                bd_disk->fops->swap_slot_free_notify
swap_readpage read zero data
                            pte_unlock
pte_lock
if (!pte_same)
  goto out_nomap;
pte_unlock
return and next refault will
read correct data again

Here, CPU A could still read zero data from zram, but that's not a bug
(IOW, the injected warning firing doesn't by itself mean there is a bug).
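
For reference, below is a minimal userspace sketch of the access pattern
being discussed: two threads sharing one mm (plain pthreads behave like
CLONE_VM here) concurrently fault the same swapped-out region, so both can
take the SWP_SYNCHRONOUS_IO swapin path for the same swap entry with
__swap_count == 1. This is only an illustration, not the zram-corruptor
reproducer itself; the region size is arbitrary, and the sketch assumes
something external (e.g. a memcg limit) pushes the region out to
zram-backed swap first.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define REGION_SIZE (64UL << 20)	/* 64 MiB, arbitrary */

static unsigned char *region;

/*
 * Touch every page of the shared region; concurrent faults on pages that
 * were swapped out to zram are what expose the race.
 */
static void *reader(void *arg)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < REGION_SIZE; i += 4096)
		sum += region[i];

	return (void *)sum;
}

int main(void)
{
	pthread_t t1, t2;

	region = malloc(REGION_SIZE);
	if (!region)
		return 1;
	memset(region, 0xaa, REGION_SIZE);	/* known non-zero pattern */

	/*
	 * At this point the region must get swapped out to the zram device,
	 * e.g. via memory pressure applied from outside; the sketch simply
	 * waits to keep things short.
	 */
	sleep(60);

	pthread_create(&t1, NULL, reader, NULL);
	pthread_create(&t2, NULL, reader, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* Any page that reads back as zeroes indicates the corruption. */
	return 0;
}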

The concern with the patch is increased memory consumption, since the
same data could temporarily sit in compressed form in zram and in
uncompressed form in the address space. However, most zram setups use
no readahead, and do_swap_page is followed by swap_free, so the
compressed copy in zram is freed quickly.

Ivan, with this patch you may still see the warning you added in zram,
but it shouldn't trigger the userspace corruption, as explained above,
if I understand correctly.

Could you test whether the patch prevents the userspace breakage?

From 71d8b7dce1f5e11fbae4a282ef7d45253ec2087d Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 15 Mar 2022 14:14:23 -0700
Subject: [PATCH] fix

Not-Yet-Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/page_io.c | 54 ----------------------------------------------------
 1 file changed, 54 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 0bf8e40f4e57..d3eea0a3f1af 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -51,54 +51,6 @@ void end_swap_bio_write(struct bio *bio)
 	bio_put(bio);
 }
 
-static void swap_slot_free_notify(struct page *page)
-{
-	struct swap_info_struct *sis;
-	struct gendisk *disk;
-	swp_entry_t entry;
-
-	/*
-	 * There is no guarantee that the page is in swap cache - the software
-	 * suspend code (at least) uses end_swap_bio_read() against a non-
-	 * swapcache page.  So we must check PG_swapcache before proceeding with
-	 * this optimization.
-	 */
-	if (unlikely(!PageSwapCache(page)))
-		return;
-
-	sis = page_swap_info(page);
-	if (data_race(!(sis->flags & SWP_BLKDEV)))
-		return;
-
-	/*
-	 * The swap subsystem performs lazy swap slot freeing,
-	 * expecting that the page will be swapped out again.
-	 * So we can avoid an unnecessary write if the page
-	 * isn't redirtied.
-	 * This is good for real swap storage because we can
-	 * reduce unnecessary I/O and enhance wear-leveling
-	 * if an SSD is used as the as swap device.
-	 * But if in-memory swap device (eg zram) is used,
-	 * this causes a duplicated copy between uncompressed
-	 * data in VM-owned memory and compressed data in
-	 * zram-owned memory.  So let's free zram-owned memory
-	 * and make the VM-owned decompressed page *dirty*,
-	 * so the page should be swapped out somewhere again if
-	 * we again wish to reclaim it.
-	 */
-	disk = sis->bdev->bd_disk;
-	entry.val = page_private(page);
-	if (disk->fops->swap_slot_free_notify && __swap_count(entry) == 1) {
-		unsigned long offset;
-
-		offset = swp_offset(entry);
-
-		SetPageDirty(page);
-		disk->fops->swap_slot_free_notify(sis->bdev,
-				offset);
-	}
-}
-
 static void end_swap_bio_read(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
@@ -114,7 +66,6 @@ static void end_swap_bio_read(struct bio *bio)
 	}
 
 	SetPageUptodate(page);
-	swap_slot_free_notify(page);
 out:
 	unlock_page(page);
 	WRITE_ONCE(bio->bi_private, NULL);
@@ -392,11 +343,6 @@ int swap_readpage(struct page *page, bool synchronous)
 	if (sis->flags & SWP_SYNCHRONOUS_IO) {
 		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
 		if (!ret) {
-			if (trylock_page(page)) {
-				swap_slot_free_notify(page);
-				unlock_page(page);
-			}
-
 			count_vm_event(PSWPIN);
 			goto out;
 		}
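
For context on what remains after this removal: the only
swap_slot_free_notify implementation is zram's, and with this patch it is
reached only via swap_free() -> swap_range_free(), as shown in the second
diagram above. Below is a from-memory sketch of the v5.15-era callback in
drivers/block/zram/zram_drv.c; the exact code may differ slightly.

static void zram_slot_free_notify(struct block_device *bdev,
				  unsigned long index)
{
	struct zram *zram;

	zram = bdev->bd_disk->private_data;

	atomic64_inc(&zram->stats.notify_free);
	if (!zram_slot_trylock(zram, index)) {
		atomic64_inc(&zram->stats.miss_free);
		return;
	}

	/* This is the "delete zram entry" step from the diagrams above. */
	zram_free_page(zram, index);
	zram_slot_unlock(zram, index);
}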
Ivan Babrou March 16, 2022, 6:26 p.m. UTC | #11
On Tue, Mar 15, 2022 at 3:09 PM Minchan Kim <minchan@kernel.org> wrote:
> I think the problem with CLONE_VM is the following race:
>
> CPU A                        CPU B
>
> do_swap_page                do_swap_page
> SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
> swap_readpage original data
>   swap_slot_free_notify
>     delete zram entry
>                             swap_readpage zero data
>                             pte_lock
>                             map the *zero data* to userspace
>                             pte_unlock
> pte_lock
> if (!pte_same)
>   goto out_nomap;
> pte_unlock
> return and next refault will
> read zero data
>
> So, CPU A and B both see zero data. With the patch below, the sequence changes to:
>
>
> CPU A                        CPU B
>
> do_swap_page                do_swap_page
> SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
>                             swap_readpage original data
>                             pte_lock
>                             map the original data
>                             swap_free
>                               swap_range_free
>                                 bd_disk->fops->swap_slot_free_notify
> swap_readpage read zero data
>                             pte_unlock
> pte_lock
> if (!pte_same)
>   goto out_nomap;
> pte_unlock
> return and next refault will
> read correct data again
>
> Here, CPU A could still read zero data from zram, but that's not a bug
> (IOW, the injected warning firing doesn't by itself mean there is a bug).
>
> The concern with the patch is increased memory consumption, since the
> same data could temporarily sit in compressed form in zram and in
> uncompressed form in the address space. However, most zram setups use
> no readahead, and do_swap_page is followed by swap_free, so the
> compressed copy in zram is freed quickly.
>
> Ivan, with this patch you may still see the warning you added in zram,
> but it shouldn't trigger the userspace corruption, as explained above,
> if I understand correctly.
> 
> Could you test whether the patch prevents the userspace breakage?

I'm making an internal build and will push it to some location to see
how it behaves, but it might take a few days to get any sort of
confidence in the results (unless it breaks immediately).

I've also pushed my patch that disables SWP_SYNCHRONOUS_IO to a few
locations yesterday to see how it fares.
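
(For context, that SWP_SYNCHRONOUS_IO-disabling patch isn't shown in this
thread. Assuming the v5.15-era zram driver advertises itself as synchronous
via QUEUE_FLAG_SYNCHRONOUS in zram_add() -- which, if I recall correctly, is
what swapon checks before setting SWP_SYNCHRONOUS_IO on the swap device --
one way to do it would be a one-line removal along these lines; treat the
exact location and surrounding context as assumptions:)

--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ ... @@ static int zram_add(void)
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, zram->disk->queue);
-	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, zram->disk->queue);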
Ivan Babrou March 18, 2022, 4:30 p.m. UTC | #12
On Wed, Mar 16, 2022 at 11:26 AM Ivan Babrou <ivan@cloudflare.com> wrote:
> I'm making an internal build and will push it to some location to see
> how it behaves, but it might take a few days to get any sort of
> confidence in the results (unless it breaks immediately).
>
> I've also pushed my patch that disables SWP_SYNCHRONOUS_IO to a few
> locations yesterday to see how it fares.

I have some updates before the weekend. There are two experimental groups:

* My patch that removes the SWP_SYNCHRONOUS_IO flag. There are 704
machines in this group across 5 datacenters with cumulative uptime of
916 days.
* Minchan's patch to remove swap_slot_free_notify. There are 376
machines in this group across 3 datacenters with cumulative uptime of
240 days.

Our machines take a couple of hours to start swapping anything after
boot, and I discounted these two hours from the cumulative uptime.

Neither of these two groups experienced unexpected coredumps or
rocksdb corruptions.

I think at this point it's reasonable to proceed with Minchan's patch
(including a backport).
Minchan Kim March 18, 2022, 5:32 p.m. UTC | #13
On Fri, Mar 18, 2022 at 09:30:09AM -0700, Ivan Babrou wrote:
> On Wed, Mar 16, 2022 at 11:26 AM Ivan Babrou <ivan@cloudflare.com> wrote:
> > I'm making an internal build and will push it to some location to see
> > how it behaves, but it might take a few days to get any sort of
> > confidence in the results (unless it breaks immediately).
> >
> > I've also pushed my patch that disables SWP_SYNCHRONOUS_IO to a few
> > locations yesterday to see how it fares.
> 
> I have some updates before the weekend. There are two experimental groups:
> 
> * My patch that removes the SWP_SYNCHRONOUS_IO flag. There are 704
> machines in this group across 5 datacenters with cumulative uptime of
> 916 days.
> * Minchan's patch to remove swap_slot_free_notify. There are 376
> machines in this group across 3 datacenters with cumulative uptime of
> 240 days.
> 
> Our machines take a couple of hours to start swapping anything after
> boot, and I discounted these two hours from the cumulative uptime.
> 
> Neither of these two groups experienced unexpected coredumps or
> rocksdb corruptions.
> 
> I think at this point it's reasonable to proceed with Minchan's patch
> (including a backport).

Let me cook the patch and then will post it.

Thanks for the testing as well as reporting, Ivan!
Minchan Kim March 18, 2022, 6:54 p.m. UTC | #14
On Fri, Mar 18, 2022 at 10:32:07AM -0700, Minchan Kim wrote:
> On Fri, Mar 18, 2022 at 09:30:09AM -0700, Ivan Babrou wrote:
> > On Wed, Mar 16, 2022 at 11:26 AM Ivan Babrou <ivan@cloudflare.com> wrote:
> > > I'm making an internal build and will push it to some location to see
> > > how it behaves, but it might take a few days to get any sort of
> > > confidence in the results (unless it breaks immediately).
> > >
> > > I've also pushed my patch that disables SWP_SYNCHRONOUS_IO to a few
> > > locations yesterday to see how it fares.
> > 
> > I have some updates before the weekend. There are two experimental groups:
> > 
> > * My patch that removes the SWP_SYNCHRONOUS_IO flag. There are 704
> > machines in this group across 5 datacenters with cumulative uptime of
> > 916 days.
> > * Minchan's patch to remove swap_slot_free_notify. There are 376
> > machines in this group across 3 datacenters with cumulative uptime of
> > 240 days.
> > 
> > Our machines take a couple of hours to start swapping anything after
> > boot, and I discounted these two hours from the cumulative uptime.
> > 
> > Neither of these two groups experienced unexpected coredumps or
> > rocksdb corruptions.
> > 
> > I think at this point it's reasonable to proceed with Minchan's patch
> > (including a backport).
> 
> Let me cook the patch and then will post it.
> 
> Thanks for the testing as well as reporting, Ivan!

From 1ede54d46f0b1958bfc624f17fe709637ef8f12a Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 15 Mar 2022 14:14:23 -0700
Subject: [PATCH] mm: fix unexpected zeroed page mapping with zram swap

With two processes sharing an mm via CLONE_VM, a user process can be
corrupted by unexpectedly seeing a zeroed page.

    CPU A                        CPU B

do_swap_page                do_swap_page
SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
swap_readpage valid data
  swap_slot_free_notify
    delete zram entry
                            swap_readpage zeroed(invalid) data
                            pte_lock
                            map the *zero data* to userspace
                            pte_unlock
pte_lock
if (!pte_same)
  goto out_nomap;
pte_unlock
return and next refault will
read zeroed data

The swap_slot_free_notify path is bogus for the CLONE_VM case, since
cloning with CLONE_VM doesn't increase the swap slot's refcount at
copy_mm time, so swap_slot_free_notify cannot tell whether it is safe
to discard the data from the backing device. In that case, the only
lock it could rely on to synchronize swap slot freeing is the page
table lock. Thus, this patch gets rid of the swap_slot_free_notify
function. With this patch, CPU A will see correct data.

    CPU A                        CPU B

do_swap_page                do_swap_page
SWP_SYNCHRONOUS_IO path     SWP_SYNCHRONOUS_IO path
                            swap_readpage original data
                            pte_lock
                            map the original data
                            swap_free
                              swap_range_free
                                bd_disk->fops->swap_slot_free_notify
swap_readpage read zeroed data
                            pte_unlock
pte_lock
if (!pte_same)
  goto out_nomap;
pte_unlock
return
on next refault will see mapped data by CPU B

The concern with the patch is increased memory consumption, since it
could keep the same data in compressed form in zram as well as in
uncompressed form in the address space. However, most zram setups use
no readahead and do_swap_page is followed by swap_free, so the
compressed copy in zram is freed quickly.

Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Cc: <stable@vger.kernel.org> # 4.14+
Reported-by: Ivan Babrou <ivan@cloudflare.com>
Tested-by: Ivan Babrou <ivan@cloudflare.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/page_io.c | 54 ----------------------------------------------------
 1 file changed, 54 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index b417f000b49e..89fbf3cae30f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -51,54 +51,6 @@ void end_swap_bio_write(struct bio *bio)
 	bio_put(bio);
 }
 
-static void swap_slot_free_notify(struct page *page)
-{
-	struct swap_info_struct *sis;
-	struct gendisk *disk;
-	swp_entry_t entry;
-
-	/*
-	 * There is no guarantee that the page is in swap cache - the software
-	 * suspend code (at least) uses end_swap_bio_read() against a non-
-	 * swapcache page.  So we must check PG_swapcache before proceeding with
-	 * this optimization.
-	 */
-	if (unlikely(!PageSwapCache(page)))
-		return;
-
-	sis = page_swap_info(page);
-	if (data_race(!(sis->flags & SWP_BLKDEV)))
-		return;
-
-	/*
-	 * The swap subsystem performs lazy swap slot freeing,
-	 * expecting that the page will be swapped out again.
-	 * So we can avoid an unnecessary write if the page
-	 * isn't redirtied.
-	 * This is good for real swap storage because we can
-	 * reduce unnecessary I/O and enhance wear-leveling
-	 * if an SSD is used as the as swap device.
-	 * But if in-memory swap device (eg zram) is used,
-	 * this causes a duplicated copy between uncompressed
-	 * data in VM-owned memory and compressed data in
-	 * zram-owned memory.  So let's free zram-owned memory
-	 * and make the VM-owned decompressed page *dirty*,
-	 * so the page should be swapped out somewhere again if
-	 * we again wish to reclaim it.
-	 */
-	disk = sis->bdev->bd_disk;
-	entry.val = page_private(page);
-	if (disk->fops->swap_slot_free_notify && __swap_count(entry) == 1) {
-		unsigned long offset;
-
-		offset = swp_offset(entry);
-
-		SetPageDirty(page);
-		disk->fops->swap_slot_free_notify(sis->bdev,
-				offset);
-	}
-}
-
 static void end_swap_bio_read(struct bio *bio)
 {
 	struct page *page = bio_first_page_all(bio);
@@ -114,7 +66,6 @@ static void end_swap_bio_read(struct bio *bio)
 	}
 
 	SetPageUptodate(page);
-	swap_slot_free_notify(page);
 out:
 	unlock_page(page);
 	WRITE_ONCE(bio->bi_private, NULL);
@@ -394,11 +345,6 @@ int swap_readpage(struct page *page, bool synchronous)
 	if (sis->flags & SWP_SYNCHRONOUS_IO) {
 		ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
 		if (!ret) {
-			if (trylock_page(page)) {
-				swap_slot_free_notify(page);
-				unlock_page(page);
-			}
-
 			count_vm_event(PSWPIN);
 			goto out;
 		}
Ivan Babrou March 25, 2022, 2:10 a.m. UTC | #15
On Fri, Mar 18, 2022 at 11:54 AM Minchan Kim <minchan@kernel.org> wrote:
>
> On Fri, Mar 18, 2022 at 10:32:07AM -0700, Minchan Kim wrote:
> > On Fri, Mar 18, 2022 at 09:30:09AM -0700, Ivan Babrou wrote:
> > > On Wed, Mar 16, 2022 at 11:26 AM Ivan Babrou <ivan@cloudflare.com> wrote:
> > > > I'm making an internal build and will push it to some location to see
> > > > how it behaves, but it might take a few days to get any sort of
> > > > confidence in the results (unless it breaks immediately).
> > > >
> > > > I've also pushed my patch that disables SWP_SYNCHRONOUS_IO to a few
> > > > locations yesterday to see how it fares.
> > >
> > > I have some updates before the weekend. There are two experimental groups:
> > >
> > > * My patch that removes the SWP_SYNCHRONOUS_IO flag. There are 704
> > > machines in this group across 5 datacenters with cumulative uptime of
> > > 916 days.
> > > * Minchan's patch to remove swap_slot_free_notify. There are 376
> > > machines in this group across 3 datacenters with cumulative uptime of
> > > 240 days.
> > >
> > > Our machines take a couple of hours to start swapping anything after
> > > boot, and I discounted these two hours from the cumulative uptime.
> > >
> > > Neither of these two groups experienced unexpected coredumps or
> > > rocksdb corruptions.
> > >
> > > I think at this point it's reasonable to proceed with Minchan's patch
> > > (including a backport).
> >
> > Let me cook the patch and then will post it.
> >
> > Thanks for the testing as well as reporting, Ivan!
>
> From 1ede54d46f0b1958bfc624f17fe709637ef8f12a Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 15 Mar 2022 14:14:23 -0700
> Subject: [PATCH] mm: fix unexpected zeroed page mapping with zram swap

Is there any action needed from me to make sure that this lands into
the mm tree and eventually into stable releases?
David Hildenbrand March 25, 2022, 8:33 a.m. UTC | #16
On 25.03.22 03:10, Ivan Babrou wrote:
> On Fri, Mar 18, 2022 at 11:54 AM Minchan Kim <minchan@kernel.org> wrote:
>>
>> On Fri, Mar 18, 2022 at 10:32:07AM -0700, Minchan Kim wrote:
>>> On Fri, Mar 18, 2022 at 09:30:09AM -0700, Ivan Babrou wrote:
>>>> On Wed, Mar 16, 2022 at 11:26 AM Ivan Babrou <ivan@cloudflare.com> wrote:
>>>>> I'm making an internal build and will push it to some location to see
>>>>> how it behaves, but it might take a few days to get any sort of
>>>>> confidence in the results (unless it breaks immediately).
>>>>>
>>>>> I've also pushed my patch that disables SWP_SYNCHRONOUS_IO to a few
>>>>> locations yesterday to see how it fares.
>>>>
>>>> I have some updates before the weekend. There are two experimental groups:
>>>>
>>>> * My patch that removes the SWP_SYNCHRONOUS_IO flag. There are 704
>>>> machines in this group across 5 datacenters with cumulative uptime of
>>>> 916 days.
>>>> * Minchan's patch to remove swap_slot_free_notify. There are 376
>>>> machines in this group across 3 datacenters with cumulative uptime of
>>>> 240 days.
>>>>
>>>> Our machines take a couple of hours to start swapping anything after
>>>> boot, and I discounted these two hours from the cumulative uptime.
>>>>
>>>> Neither of these two groups experienced unexpected coredumps or
>>>> rocksdb corruptions.
>>>>
>>>> I think at this point it's reasonable to proceed with Minchan's patch
>>>> (including a backport).
>>>
>>> Let me cook the patch and then will post it.
>>>
>>> Thanks for the testing as well as reporting, Ivan!
>>
>> From 1ede54d46f0b1958bfc624f17fe709637ef8f12a Mon Sep 17 00:00:00 2001
>> From: Minchan Kim <minchan@kernel.org>
>> Date: Tue, 15 Mar 2022 14:14:23 -0700
>> Subject: [PATCH] mm: fix unexpected zeroed page mapping with zram swap
> 
> Is there any action needed from me to make sure that this lands into
> the mm tree and eventually into stable releases?
> 

Sending it as a proper patch with a proper subject, not buried deep down
in a conversation might help ;)

Patch

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 438ce34ee760..fea46a70a3c9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1265,6 +1265,9 @@ static int __zram_bvec_read(struct zram *zram, struct page *page, u32 index,
 	unsigned long value;
 	void *mem;
 
+	if (WARN_ON(!handle && !zram_test_flag(zram, index, ZRAM_SAME)))
+		pr_warn("Page %u read from zram without previous write\n", index);
+
 	value = handle ? zram_get_element(zram, index) : 0;
 	mem = kmap_atomic(page);