mbox series

[v1,00/15] Keep track of GUPed pages in fs and block

Message ID 20190411210834.4105-1-jglisse@redhat.com (mailing list archive)
Headers show
Series Keep track of GUPed pages in fs and block | expand

Message

Jerome Glisse April 11, 2019, 9:08 p.m. UTC
From: Jérôme Glisse <jglisse@redhat.com>

This patchset depends on various small fixes [1] and also on patchset
which introduce put_user_page*() [2] and thus is 5.3 material as those
pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
so that it can get review and comments on how and what should be done
to test things.

For various reasons [2] [3] we want to track page reference through GUP
differently than "regular" page reference. Thus we need to keep track
of how we got a page within the block and fs layer. To do so this patch-
set change the bio_bvec struct to store a pfn and flags instead of a
direct pointer to a page. This way we can flag page that are coming from
GUP.

This patchset is divided as follow:
    - First part of the patchset is just small cleanup i believe they
      can go in as his assuming people are ok with them.
    - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
      done in multi-step, first we replace all direct dereference of
      the field by call to inline helper, then we introduce macro for
      bio_bvec that are initialized on the stack. Finaly we change the
      bv_page field to bv_pfn.
    - Third part replace put_page(bv_page(bio_vec)) with a new helper
      which will use put_user_page() when the page in the bio_vec is
      coming from GUP.
    - Fourth part update BIO to use bv_set_user_page() for page that
      are coming from GUP this means updating bio_add_page*() to pass
      down the origin of the page (GUP or not).
    - Fith part convert few more places that directly use bvec_io or
      BIO.

Note that after this patchset they are still places in the kernel where
we should use put_user_page*(). The intention is to separate that task
in chewable chunk (driver by driver, sub-system by sub-system).


I have only lightly tested this patchset (branch [4]) on my desktop and
have not seen anything obviously wrong but i might have miss something.
What kind of test suite should i run to stress test the vfs/block layer
around DIO and BIO ?


Note that you coccinelle [5] recent enough for the semantic patch to work
properly ([5] with git commit >= eac73d191e4f03d759957fc5620062428fadada8).

Cheers,
Jérôme Glisse

[1] https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup-fs-block&id=5f67db69fd9f95d12987d2a030a82bc390e05a71
    https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup-fs-block&id=b070348d0e1fd9397eb8d0e97b4c89f1d04d5a0a
    https://cgit.freedesktop.org/~glisse/linux/commit/?h=gup-fs-block&id=83691c86a6c8f560b5b78f3f57fcd62c0f3f1c7a
[2] https://lkml.org/lkml/2019/3/26/1395
[3] https://lwn.net/Articles/753027/
[4] https://cgit.freedesktop.org/~glisse/linux/log/?h=gup-fs-block
[5] https://github.com/coccinelle/coccinelle

Cc: linux-fsdevel@vger.kernel.org
Cc: linux-block@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Steve French <sfrench@samba.org>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: Yan Zheng <zyan@redhat.com>
Cc: Sage Weil <sage@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Alex Elder <elder@kernel.org>
Cc: ceph-devel@vger.kernel.org
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Martin Brandenburg <martin@omnibond.com>
Cc: devel@lists.orangefs.org
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: v9fs-developer@lists.sourceforge.net
Cc: Coly Li <colyli@suse.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: linux-bcache@vger.kernel.org
Cc: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>

Jérôme Glisse (15):
  fs/direct-io: fix trailing whitespace issues
  iov_iter: add helper to test if an iter would use GUP
  block: introduce bvec_page()/bvec_set_page() to get/set
    bio_vec.bv_page
  block: introduce BIO_VEC_INIT() macro to initialize bio_vec structure
  block: replace all bio_vec->bv_page by bvec_page()/bvec_set_page()
  block: convert bio_vec.bv_page to bv_pfn to store pfn and not page
  block: add bvec_put_page_dirty*() to replace put_page(bvec_page())
  block: use bvec_put_page() instead of put_page(bvec_page())
  block: bvec_put_page_dirty* instead of set_page_dirty* and
    bvec_put_page
  block: add gup flag to
    bio_add_page()/bio_add_pc_page()/__bio_add_page()
  block: make sure bio_add_page*() knows page that are coming from GUP
  fs/direct-io: keep track of wether a page is coming from GUP or not
  fs/splice: use put_user_page() when appropriate
  fs: use bvec_set_gup_page() where appropriate
  ceph: use put_user_pages() instead of ceph_put_page_vector()

 Documentation/block/biodoc.txt      |  7 +-
 arch/m68k/emu/nfblock.c             |  2 +-
 arch/um/drivers/ubd_kern.c          |  2 +-
 arch/xtensa/platforms/iss/simdisk.c |  2 +-
 block/bio-integrity.c               |  8 +--
 block/bio.c                         | 92 ++++++++++++++++-----------
 block/blk-core.c                    |  2 +-
 block/blk-integrity.c               |  7 +-
 block/blk-lib.c                     |  5 +-
 block/blk-merge.c                   |  9 +--
 block/blk.h                         |  4 +-
 block/bounce.c                      | 26 ++++----
 block/t10-pi.c                      |  4 +-
 drivers/block/aoe/aoecmd.c          |  4 +-
 drivers/block/brd.c                 |  2 +-
 drivers/block/drbd/drbd_actlog.c    |  2 +-
 drivers/block/drbd/drbd_bitmap.c    |  4 +-
 drivers/block/drbd/drbd_main.c      |  4 +-
 drivers/block/drbd/drbd_receiver.c  |  6 +-
 drivers/block/drbd/drbd_worker.c    |  2 +-
 drivers/block/floppy.c              |  6 +-
 drivers/block/loop.c                | 16 ++---
 drivers/block/null_blk_main.c       |  6 +-
 drivers/block/pktcdvd.c             |  4 +-
 drivers/block/ps3disk.c             |  2 +-
 drivers/block/ps3vram.c             |  2 +-
 drivers/block/rbd.c                 | 12 ++--
 drivers/block/rsxx/dma.c            |  3 +-
 drivers/block/umem.c                |  2 +-
 drivers/block/virtio_blk.c          |  4 +-
 drivers/block/xen-blkback/blkback.c |  2 +-
 drivers/block/zram/zram_drv.c       | 24 +++----
 drivers/lightnvm/core.c             |  2 +-
 drivers/lightnvm/pblk-core.c        | 12 ++--
 drivers/lightnvm/pblk-rb.c          |  2 +-
 drivers/lightnvm/pblk-read.c        |  6 +-
 drivers/md/bcache/btree.c           |  2 +-
 drivers/md/bcache/debug.c           |  4 +-
 drivers/md/bcache/request.c         |  4 +-
 drivers/md/bcache/super.c           |  6 +-
 drivers/md/bcache/util.c            | 11 ++--
 drivers/md/dm-bufio.c               |  2 +-
 drivers/md/dm-crypt.c               | 18 ++++--
 drivers/md/dm-integrity.c           | 18 +++---
 drivers/md/dm-io.c                  |  7 +-
 drivers/md/dm-log-writes.c          | 20 +++---
 drivers/md/dm-verity-target.c       |  4 +-
 drivers/md/dm-writecache.c          |  3 +-
 drivers/md/dm-zoned-metadata.c      |  6 +-
 drivers/md/md.c                     |  4 +-
 drivers/md/raid1-10.c               |  2 +-
 drivers/md/raid1.c                  |  4 +-
 drivers/md/raid10.c                 |  4 +-
 drivers/md/raid5-cache.c            |  7 +-
 drivers/md/raid5-ppl.c              |  6 +-
 drivers/md/raid5.c                  | 10 +--
 drivers/nvdimm/blk.c                |  6 +-
 drivers/nvdimm/btt.c                |  5 +-
 drivers/nvdimm/pmem.c               |  4 +-
 drivers/nvme/host/core.c            |  4 +-
 drivers/nvme/host/tcp.c             |  2 +-
 drivers/nvme/target/io-cmd-bdev.c   |  2 +-
 drivers/nvme/target/io-cmd-file.c   |  2 +-
 drivers/s390/block/dasd_diag.c      |  2 +-
 drivers/s390/block/dasd_eckd.c      | 14 ++--
 drivers/s390/block/dasd_fba.c       |  6 +-
 drivers/s390/block/dcssblk.c        |  2 +-
 drivers/s390/block/scm_blk.c        |  2 +-
 drivers/s390/block/xpram.c          |  2 +-
 drivers/scsi/sd.c                   | 25 ++++----
 drivers/staging/erofs/data.c        |  6 +-
 drivers/staging/erofs/unzip_vle.c   |  4 +-
 drivers/target/target_core_file.c   |  6 +-
 drivers/target/target_core_iblock.c |  4 +-
 drivers/target/target_core_pscsi.c  |  2 +-
 drivers/xen/biomerge.c              |  4 +-
 fs/9p/vfs_addr.c                    |  4 +-
 fs/afs/fsclient.c                   |  2 +-
 fs/afs/rxrpc.c                      |  4 +-
 fs/afs/yfsclient.c                  |  2 +-
 fs/block_dev.c                      | 10 ++-
 fs/btrfs/check-integrity.c          |  6 +-
 fs/btrfs/compression.c              | 22 +++----
 fs/btrfs/disk-io.c                  |  4 +-
 fs/btrfs/extent_io.c                | 16 ++---
 fs/btrfs/file-item.c                |  8 +--
 fs/btrfs/inode.c                    | 20 +++---
 fs/btrfs/raid56.c                   |  8 +--
 fs/btrfs/scrub.c                    | 10 +--
 fs/buffer.c                         |  4 +-
 fs/ceph/file.c                      | 20 +++---
 fs/cifs/connect.c                   |  4 +-
 fs/cifs/misc.c                      | 14 ++--
 fs/cifs/smb2ops.c                   |  2 +-
 fs/cifs/smbdirect.c                 |  2 +-
 fs/cifs/transport.c                 |  2 +-
 fs/crypto/bio.c                     |  4 +-
 fs/direct-io.c                      | 94 +++++++++++++++++++--------
 fs/ext4/page-io.c                   |  4 +-
 fs/ext4/readpage.c                  |  4 +-
 fs/f2fs/data.c                      | 20 +++---
 fs/gfs2/lops.c                      |  8 +--
 fs/gfs2/meta_io.c                   |  4 +-
 fs/gfs2/ops_fstype.c                |  2 +-
 fs/hfsplus/wrapper.c                |  3 +-
 fs/io_uring.c                       |  4 +-
 fs/iomap.c                          | 10 +--
 fs/jfs/jfs_logmgr.c                 |  4 +-
 fs/jfs/jfs_metapage.c               |  6 +-
 fs/mpage.c                          |  6 +-
 fs/nfs/blocklayout/blocklayout.c    |  2 +-
 fs/nilfs2/segbuf.c                  |  3 +-
 fs/ocfs2/cluster/heartbeat.c        |  2 +-
 fs/orangefs/inode.c                 |  2 +-
 fs/splice.c                         | 13 ++--
 fs/xfs/xfs_aops.c                   |  8 +--
 fs/xfs/xfs_buf.c                    |  2 +-
 include/linux/bio.h                 | 13 ++--
 include/linux/bvec.h                | 99 +++++++++++++++++++++++++----
 include/linux/uio.h                 | 11 ++++
 kernel/power/swap.c                 |  2 +-
 lib/iov_iter.c                      | 32 +++++-----
 mm/page_io.c                        |  8 +--
 net/ceph/messenger.c                | 10 +--
 net/sunrpc/xdr.c                    |  2 +-
 net/sunrpc/xprtsock.c               |  4 +-
 126 files changed, 628 insertions(+), 467 deletions(-)

Comments

Dave Chinner April 16, 2019, midnight UTC | #1
On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This patchset depends on various small fixes [1] and also on patchset
> which introduce put_user_page*() [2] and thus is 5.3 material as those
> pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> so that it can get review and comments on how and what should be done
> to test things.
> 
> For various reasons [2] [3] we want to track page reference through GUP
> differently than "regular" page reference. Thus we need to keep track
> of how we got a page within the block and fs layer. To do so this patch-
> set change the bio_bvec struct to store a pfn and flags instead of a
> direct pointer to a page. This way we can flag page that are coming from
> GUP.
> 
> This patchset is divided as follow:
>     - First part of the patchset is just small cleanup i believe they
>       can go in as his assuming people are ok with them.
>     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
>       done in multi-step, first we replace all direct dereference of
>       the field by call to inline helper, then we introduce macro for
>       bio_bvec that are initialized on the stack. Finaly we change the
>       bv_page field to bv_pfn.
>     - Third part replace put_page(bv_page(bio_vec)) with a new helper
>       which will use put_user_page() when the page in the bio_vec is
>       coming from GUP.
>     - Fourth part update BIO to use bv_set_user_page() for page that
>       are coming from GUP this means updating bio_add_page*() to pass
>       down the origin of the page (GUP or not).
>     - Fith part convert few more places that directly use bvec_io or
>       BIO.
> 
> Note that after this patchset they are still places in the kernel where
> we should use put_user_page*(). The intention is to separate that task
> in chewable chunk (driver by driver, sub-system by sub-system).
> 
> 
> I have only lightly tested this patchset (branch [4]) on my desktop and
> have not seen anything obviously wrong but i might have miss something.
> What kind of test suite should i run to stress test the vfs/block layer
> around DIO and BIO ?

Such widespread changes need full correctness tests run on them. I'd
suggest fstests (auto group) be run on all the filesystems it
supports that are affected by the changes in the patchset. Given you
touched bio_add_page() here, that's probably all of them....

Cheers,

Dave.
Boaz Harrosh April 16, 2019, 6:35 p.m. UTC | #2
On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> This patchset depends on various small fixes [1] and also on patchset
> which introduce put_user_page*() [2] and thus is 5.3 material as those
> pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> so that it can get review and comments on how and what should be done
> to test things.
> 
> For various reasons [2] [3] we want to track page reference through GUP
> differently than "regular" page reference. Thus we need to keep track
> of how we got a page within the block and fs layer. To do so this patch-
> set change the bio_bvec struct to store a pfn and flags instead of a
> direct pointer to a page. This way we can flag page that are coming from
> GUP.
> 
> This patchset is divided as follow:
>     - First part of the patchset is just small cleanup i believe they
>       can go in as his assuming people are ok with them.


>     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
>       done in multi-step, first we replace all direct dereference of
>       the field by call to inline helper, then we introduce macro for
>       bio_bvec that are initialized on the stack. Finaly we change the
>       bv_page field to bv_pfn.

Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
as a flag (pointer always aligned to 64 bytes in our case).

So yes we need an inline helper for reference of the page but is it not clearer
that we assume a page* and not any kind of pfn ?
It will not be the first place using low bits of a pointer for flags.

That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
at all that a user has a GUP and none-GUP pages to IO at the same request he/she
can just submit them as two separate BIOs (chained at the block layer).

Many users just submit one page bios and let elevator merge them any way.

Cheers
Boaz

>     - Third part replace put_page(bv_page(bio_vec)) with a new helper
>       which will use put_user_page() when the page in the bio_vec is
>       coming from GUP.
>     - Fourth part update BIO to use bv_set_user_page() for page that
>       are coming from GUP this means updating bio_add_page*() to pass
>       down the origin of the page (GUP or not).
>     - Fith part convert few more places that directly use bvec_io or
>       BIO.
>
Jerome Glisse April 16, 2019, 6:47 p.m. UTC | #3
On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > This patchset depends on various small fixes [1] and also on patchset
> > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > so that it can get review and comments on how and what should be done
> > to test things.
> > 
> > For various reasons [2] [3] we want to track page reference through GUP
> > differently than "regular" page reference. Thus we need to keep track
> > of how we got a page within the block and fs layer. To do so this patch-
> > set change the bio_bvec struct to store a pfn and flags instead of a
> > direct pointer to a page. This way we can flag page that are coming from
> > GUP.
> > 
> > This patchset is divided as follow:
> >     - First part of the patchset is just small cleanup i believe they
> >       can go in as his assuming people are ok with them.
> 
> 
> >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> >       done in multi-step, first we replace all direct dereference of
> >       the field by call to inline helper, then we introduce macro for
> >       bio_bvec that are initialized on the stack. Finaly we change the
> >       bv_page field to bv_pfn.
> 
> Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> as a flag (pointer always aligned to 64 bytes in our case).
> 
> So yes we need an inline helper for reference of the page but is it not clearer
> that we assume a page* and not any kind of pfn ?
> It will not be the first place using low bits of a pointer for flags.

Yes i can use the lower bit of struct page * pointer it should be safe on
all architecture. I wanted to change the bv_page field name to make sure
that we catch anyone doing any direct dereference. Do you prefer keeping a
page pointer there ?

> 
> That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> can just submit them as two separate BIOs (chained at the block layer).
> 
> Many users just submit one page bios and let elevator merge them any way.

The issue is that bio_vec is use, on its own, outside of bios and for
those use cases i need to track the GUP status within the bio_vec. Thus
it is easier to use the same mechanisms for bio too as adding a flag to
bio would mean that i also have to audit all code path that could merge
bios. While i believe it should be restrictred to block/blk-merge.c it
seems some block and some fs have spawn some custom bio manipulation
(md comes to mind). So using same mechanism for bio_vec and bio seems
like a safer and easier course of action.

Cheers,
Jérôme
Kent Overstreet April 16, 2019, 6:59 p.m. UTC | #4
On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > This patchset depends on various small fixes [1] and also on patchset
> > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > so that it can get review and comments on how and what should be done
> > to test things.
> > 
> > For various reasons [2] [3] we want to track page reference through GUP
> > differently than "regular" page reference. Thus we need to keep track
> > of how we got a page within the block and fs layer. To do so this patch-
> > set change the bio_bvec struct to store a pfn and flags instead of a
> > direct pointer to a page. This way we can flag page that are coming from
> > GUP.
> > 
> > This patchset is divided as follow:
> >     - First part of the patchset is just small cleanup i believe they
> >       can go in as his assuming people are ok with them.
> 
> 
> >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> >       done in multi-step, first we replace all direct dereference of
> >       the field by call to inline helper, then we introduce macro for
> >       bio_bvec that are initialized on the stack. Finaly we change the
> >       bv_page field to bv_pfn.
> 
> Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> as a flag (pointer always aligned to 64 bytes in our case).
> 
> So yes we need an inline helper for reference of the page but is it not clearer
> that we assume a page* and not any kind of pfn ?
> It will not be the first place using low bits of a pointer for flags.
> 
> That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> can just submit them as two separate BIOs (chained at the block layer).
> 
> Many users just submit one page bios and let elevator merge them any way.

Let's please not add additional flags and weirdness to struct bio - "if this
flag is set interpret one way, if not interpret another" - or eventually bios
will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.

Question though - why do we need a flag for whether a page is a GUP page or not?
Couldn't the needed information just be determined by what range the pfn is not
(i.e. whether or not it has a struct page associated with it)?
Dan Williams April 16, 2019, 7:12 p.m. UTC | #5
On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
> On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > >
> > > This patchset depends on various small fixes [1] and also on patchset
> > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > so that it can get review and comments on how and what should be done
> > > to test things.
> > >
> > > For various reasons [2] [3] we want to track page reference through GUP
> > > differently than "regular" page reference. Thus we need to keep track
> > > of how we got a page within the block and fs layer. To do so this patch-
> > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > direct pointer to a page. This way we can flag page that are coming from
> > > GUP.
> > >
> > > This patchset is divided as follow:
> > >     - First part of the patchset is just small cleanup i believe they
> > >       can go in as his assuming people are ok with them.
> >
> >
> > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > >       done in multi-step, first we replace all direct dereference of
> > >       the field by call to inline helper, then we introduce macro for
> > >       bio_bvec that are initialized on the stack. Finaly we change the
> > >       bv_page field to bv_pfn.
> >
> > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > as a flag (pointer always aligned to 64 bytes in our case).
> >
> > So yes we need an inline helper for reference of the page but is it not clearer
> > that we assume a page* and not any kind of pfn ?
> > It will not be the first place using low bits of a pointer for flags.
> >
> > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > can just submit them as two separate BIOs (chained at the block layer).
> >
> > Many users just submit one page bios and let elevator merge them any way.
>
> Let's please not add additional flags and weirdness to struct bio - "if this
> flag is set interpret one way, if not interpret another" - or eventually bios
> will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.

This all reminds of the failed attempt to teach the block layer to
operate without pages:

https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/

>
> Question though - why do we need a flag for whether a page is a GUP page or not?
> Couldn't the needed information just be determined by what range the pfn is not
> (i.e. whether or not it has a struct page associated with it)?

That amounts to a pfn_valid() check which is a bit heavier than if we
can store a flag in the bv_pfn entry directly.

I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
an 'unsigned long'.

That said, I'm still in favor of Jan's proposal to just make the
bv_page semantics uniform. Otherwise we're complicating this core
infrastructure for some yet to be implemented GPU memory management
capabilities with yet to be determined value. Circle back when that
value is clear, but in the meantime fix the GUP bug.
Boaz Harrosh April 16, 2019, 7:14 p.m. UTC | #6
On 16/04/19 21:47, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
>> On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
>>> From: Jérôme Glisse <jglisse@redhat.com>
>>>
<>
>> Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
>> as a flag (pointer always aligned to 64 bytes in our case).
>>
>> So yes we need an inline helper for reference of the page but is it not clearer
>> that we assume a page* and not any kind of pfn ?
>> It will not be the first place using low bits of a pointer for flags.
> 
> Yes i can use the lower bit of struct page * pointer it should be safe on
> all architecture. I wanted to change the bv_page field name to make sure
> that we catch anyone doing any direct dereference. Do you prefer keeping a
> page pointer there ?
> 

Yes I would prefer that personally.
Changing the name (And type to ulong) is a good idea, let the compiler check us.
But lets make sure we all understand this is a page pointer. And not any kind
of pfn.

>>
>> That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
>> at all that a user has a GUP and none-GUP pages to IO at the same request he/she
>> can just submit them as two separate BIOs (chained at the block layer).
>>
>> Many users just submit one page bios and let elevator merge them any way.
> 
> The issue is that bio_vec is use, on its own, outside of bios and for
> those use cases i need to track the GUP status within the bio_vec. Thus
> it is easier to use the same mechanisms for bio too as adding a flag to
> bio would mean that i also have to audit all code path that could merge
> bios. While i believe it should be restrictred to block/blk-merge.c it
> seems some block and some fs have spawn some custom bio manipulation
> (md comes to mind). 

I would imagine they use mechanics as bio-split and bio-clone so it need
only be handled there. but ...

> So using same mechanism for bio_vec and bio seems
> like a safer and easier course of action.
> 

OK I get it thanks. I would imagine the opposite but I have not audited all
call sighs, if you say there are fewer bvec call sites then it makes sense.

> Cheers,
> Jérôme
> 

Thanks
Boaz
Boaz Harrosh April 16, 2019, 7:28 p.m. UTC | #7
On 16/04/19 22:12, Dan Williams wrote:
> On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
<>
> This all reminds of the failed attempt to teach the block layer to
> operate without pages:
> 
> https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> 

Exactly why I want to make sure it is just a [pointer | flag] and not any kind of pfn
type. Let us please not go there again?

>>
>> Question though - why do we need a flag for whether a page is a GUP page or not?
>> Couldn't the needed information just be determined by what range the pfn is not
>> (i.e. whether or not it has a struct page associated with it)?
> 
> That amounts to a pfn_valid() check which is a bit heavier than if we
> can store a flag in the bv_pfn entry directly.
> 
> I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> an 'unsigned long'.
> 

No, please please not. This is not a pfn and not a pfn_t. It is a page-ptr
and a flag that says where/how to put_page it. IE I did a GUP on this page
please do a PUP on this page instead of regular put_page. So no where do I mean
pfn or pfn_t in this code. Then why?

> That said, I'm still in favor of Jan's proposal to just make the
> bv_page semantics uniform. Otherwise we're complicating this core
> infrastructure for some yet to be implemented GPU memory management
> capabilities with yet to be determined value. Circle back when that
> value is clear, but in the meantime fix the GUP bug.
> 

I agree there are simpler ways to solve the bugs at hand then
to system wide separate get_user_page from get_page and force all put_user
callers to remember what to do. Is there some Document explaining the
all design of where this is going?

Thanks
Boaz
Jerome Glisse April 16, 2019, 7:49 p.m. UTC | #8
On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote:
> On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> >
> > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > >
> > > > This patchset depends on various small fixes [1] and also on patchset
> > > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > > so that it can get review and comments on how and what should be done
> > > > to test things.
> > > >
> > > > For various reasons [2] [3] we want to track page reference through GUP
> > > > differently than "regular" page reference. Thus we need to keep track
> > > > of how we got a page within the block and fs layer. To do so this patch-
> > > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > > direct pointer to a page. This way we can flag page that are coming from
> > > > GUP.
> > > >
> > > > This patchset is divided as follow:
> > > >     - First part of the patchset is just small cleanup i believe they
> > > >       can go in as his assuming people are ok with them.
> > >
> > >
> > > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > > >       done in multi-step, first we replace all direct dereference of
> > > >       the field by call to inline helper, then we introduce macro for
> > > >       bio_bvec that are initialized on the stack. Finaly we change the
> > > >       bv_page field to bv_pfn.
> > >
> > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > > as a flag (pointer always aligned to 64 bytes in our case).
> > >
> > > So yes we need an inline helper for reference of the page but is it not clearer
> > > that we assume a page* and not any kind of pfn ?
> > > It will not be the first place using low bits of a pointer for flags.
> > >
> > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > > can just submit them as two separate BIOs (chained at the block layer).
> > >
> > > Many users just submit one page bios and let elevator merge them any way.
> >
> > Let's please not add additional flags and weirdness to struct bio - "if this
> > flag is set interpret one way, if not interpret another" - or eventually bios
> > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.
> 
> This all reminds of the failed attempt to teach the block layer to
> operate without pages:
> 
> https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> 
> >
> > Question though - why do we need a flag for whether a page is a GUP page or not?
> > Couldn't the needed information just be determined by what range the pfn is not
> > (i.e. whether or not it has a struct page associated with it)?
> 
> That amounts to a pfn_valid() check which is a bit heavier than if we
> can store a flag in the bv_pfn entry directly.
> 
> I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> an 'unsigned long'.
> 
> That said, I'm still in favor of Jan's proposal to just make the
> bv_page semantics uniform. Otherwise we're complicating this core
> infrastructure for some yet to be implemented GPU memory management
> capabilities with yet to be determined value. Circle back when that
> value is clear, but in the meantime fix the GUP bug.

This has nothing to do with GPU, what make you think so ? Here i am
trying to solve GUP and to keep the value of knowing wether a page
has been GUP or not. I argue that if we bias every page in every bio
then we loose that information and thus the value.

I gave the page protection mechanisms as an example that would be
impacted but it is not the only one. Knowing if a page has been GUP
can be useful for memory reclaimation, compaction, NUMA balancing,
...

Also page that are going through a bio in one thread might be under
some other fs specific operation in another thread which would be
block by GUP but do not need to be block by I/O (ie fs can either
wait on the I/O or knows that it is safe to proceed even if the page
is under I/O).

Hence i believe that by making every page look the same we do loose
valuable information. More over the complexity of making all the
page in bio have a reference count bias is much bigger than the
changes needed to keep track of wether the page did came from GUP
or not.

Cheers,
Jérôme
Jerome Glisse April 16, 2019, 7:57 p.m. UTC | #9
On Tue, Apr 16, 2019 at 10:28:40PM +0300, Boaz Harrosh wrote:
> On 16/04/19 22:12, Dan Williams wrote:
> > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > <kent.overstreet@gmail.com> wrote:
> <>
> > This all reminds of the failed attempt to teach the block layer to
> > operate without pages:
> > 
> > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> > 
> 
> Exactly why I want to make sure it is just a [pointer | flag] and not any kind of pfn
> type. Let us please not go there again?
> 
> >>
> >> Question though - why do we need a flag for whether a page is a GUP page or not?
> >> Couldn't the needed information just be determined by what range the pfn is not
> >> (i.e. whether or not it has a struct page associated with it)?
> > 
> > That amounts to a pfn_valid() check which is a bit heavier than if we
> > can store a flag in the bv_pfn entry directly.
> > 
> > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > an 'unsigned long'.
> > 
> 
> No, please please not. This is not a pfn and not a pfn_t. It is a page-ptr
> and a flag that says where/how to put_page it. IE I did a GUP on this page
> please do a PUP on this page instead of regular put_page. So no where do I mean
> pfn or pfn_t in this code. Then why?
> 
> > That said, I'm still in favor of Jan's proposal to just make the
> > bv_page semantics uniform. Otherwise we're complicating this core
> > infrastructure for some yet to be implemented GPU memory management
> > capabilities with yet to be determined value. Circle back when that
> > value is clear, but in the meantime fix the GUP bug.
> > 
> 
> I agree there are simpler ways to solve the bugs at hand then
> to system wide separate get_user_page from get_page and force all put_user
> callers to remember what to do. Is there some Document explaining the
> all design of where this is going?
> 

A very long thread on this:

https://lkml.org/lkml/2018/12/3/1128

especialy all the reply to this first one

There is also:

https://lkml.org/lkml/2019/3/26/1395
https://lwn.net/Articles/753027/

Cheers,
Jérôme
Boaz Harrosh April 16, 2019, 10:09 p.m. UTC | #10
On 16/04/19 22:57, Jerome Glisse wrote:
<>
> 
> A very long thread on this:
> 
> https://lkml.org/lkml/2018/12/3/1128
> 
> especialy all the reply to this first one
> 
> There is also:
> 
> https://lkml.org/lkml/2019/3/26/1395
> https://lwn.net/Articles/753027/
> 

OK I have re-read this patchset and a little bit of the threads above (not all)

As I understand the long term plan is to keep two separate ref-counts one
for GUP-ref and one for the regular page-state/ownership ref.
Currently looking at page-ref we do not know if we have a GUP currently held.
With the new plan we can (Still not sure what's the full plan with this new info)

But if you make it such as the first GUP-ref also takes a page_ref and the
last GUp-dec also does put_page. Then the all of these becomes a matter of
matching every call to get_user_pages or iov_iter_get_pages() with a new
put_user_pages or iov_iter_put_pages().

Then if much below us an LLD takes a get_page() say an skb below the iscsi
driver, and so on. We do not care and we keep doing a put_page because we know
the GUP-ref holds the page for us.

The current block layer is transparent to any page-ref it does not take any
nor put_page any. It is only the higher users that have done GUP that take care of that.

The patterns I see are:

  iov_iter_get_pages()

	IO(sync)

  for(numpages)
	put_page()

Or

  iov_iter_get_pages()

	IO (async)
		->	foo_end_io()
				put_page

(Same with get_user_pages)
(IO need not be block layer. It can be networking and so on like in NFS or CIFS
 and so on)

The first pattern is easy just add the proper new api for
it, so for every iov_iter_get_pages() you have an iov_iter_put_pages() and remove
lots of cooked up for loops. Also the all iov_iter_get_pages_use_gup() just drops.
(Same at get_user_pages sites use put_user_pages)

The second pattern is a bit harder because it is possible that the foo_end_io()
is currently used for GUP as well as none-GUP cases. this is easy to fix. But the
even harder case is if the same foo_end_io() call has some pages GUPed and some not
in the same call.

staring at this patchset and the call sites I did not see any such places. Do you know
of any?
(We can always force such mixed-case users to always GUP-ref the pages and code
 foo_end_io() to GUP-dec)

So with a very careful coding I think you need not touch the block / scatter-list layers
nor any LLD drivers. The only code affected is the code around the get_user_pages and friends.
Changing the API will surface all those.
(IE. introduce a new API, convert one by one, Remove old API)

Am I smoking?

BTW: Are you aware of the users of iov_iter_get_pages_alloc() Do they need fixing too?

> Cheers,
> Jérôme
> 

Thanks
Boaz
Jerome Glisse April 16, 2019, 11:16 p.m. UTC | #11
On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
> On 16/04/19 22:57, Jerome Glisse wrote:
> <>
> > 
> > A very long thread on this:
> > 
> > https://lkml.org/lkml/2018/12/3/1128
> > 
> > especialy all the reply to this first one
> > 
> > There is also:
> > 
> > https://lkml.org/lkml/2019/3/26/1395
> > https://lwn.net/Articles/753027/
> > 
> 
> OK I have re-read this patchset and a little bit of the threads above (not all)
> 
> As I understand the long term plan is to keep two separate ref-counts one
> for GUP-ref and one for the regular page-state/ownership ref.
> Currently looking at page-ref we do not know if we have a GUP currently held.
> With the new plan we can (Still not sure what's the full plan with this new info)
> 
> But if you make it such as the first GUP-ref also takes a page_ref and the
> last GUp-dec also does put_page. Then the all of these becomes a matter of
> matching every call to get_user_pages or iov_iter_get_pages() with a new
> put_user_pages or iov_iter_put_pages().
> 
> Then if much below us an LLD takes a get_page() say an skb below the iscsi
> driver, and so on. We do not care and we keep doing a put_page because we know
> the GUP-ref holds the page for us.
> 
> The current block layer is transparent to any page-ref it does not take any
> nor put_page any. It is only the higher users that have done GUP that take care of that.
> 
> The patterns I see are:
> 
>   iov_iter_get_pages()
> 
> 	IO(sync)
> 
>   for(numpages)
> 	put_page()
> 
> Or
> 
>   iov_iter_get_pages()
> 
> 	IO (async)
> 		->	foo_end_io()
> 				put_page
> 
> (Same with get_user_pages)
> (IO need not be block layer. It can be networking and so on like in NFS or CIFS
>  and so on)

They are also other code that pass around bio_vec and the code that
fill it is disconnected from the code that release the page and they
can mix and match GUP and non GUP AFAICT.

On fs side they are also code that fill either bio or bio_vec and
use some extra mechanism other than bio_end to submit io through
workqueue and then release pages (cifs for instance). Again i believe
they can mix and match GUP and non GUP (i have not spotted something
obvious indicating otherwise).

> 
> The first pattern is easy just add the proper new api for
> it, so for every iov_iter_get_pages() you have an iov_iter_put_pages() and remove
> lots of cooked up for loops. Also the all iov_iter_get_pages_use_gup() just drops.
> (Same at get_user_pages sites use put_user_pages)

Yes this patchset already convert some of this first pattern.

> The second pattern is a bit harder because it is possible that the foo_end_io()
> is currently used for GUP as well as none-GUP cases. this is easy to fix. But the
> even harder case is if the same foo_end_io() call has some pages GUPed and some not
> in the same call.
> 
> staring at this patchset and the call sites I did not see any such places. Do you know
> of any?
> (We can always force such mixed-case users to always GUP-ref the pages and code
>  foo_end_io() to GUP-dec)

I believe direct-io.c is such example thought in that case i believe it
can only be the ZERO_PAGE so this might easily detectable. They are also
lot of fs functions taking an iterator and then using iov_iter_get_pages*()
to fill a bio. AFAICT those functions can be call with pipe iterator or
iovec iterator and probably also with other iterator type. But it is all
common code afterward (the bi_end_io function is the same no matter the
iterator).

Thought that can probably be solve that way:

From:
    foo_bi_end_io(struct bio *bio) {
        ...
        for (i = 0; i < npages; ++i) {
            put_page(pages[i]);
        }
    }

To:
    foo_bi_end_io_common(struct bio *bio) {
        ...
    }

    foo_bi_end_io_normal(struct bio *bio)
        foo_bi_end_io_common(bio);
        for (i = 0; i < npages; ++i) {
            put_page(pages[i]);
        }
    }

    foo_bi_end_io_gup(struct bio *bio)
        foo_bi_end_io_common(bio);
        for (i = 0; i < npages; ++i) {
            put_user_page(pages[i]);
        }
    }

Then when filling in the bio i either pick foo_bi_end_io_normal() or
foo_bi_end_io_gup(). I am assuming that bio with different bi_end_io
function never get merge.

The issue is that some bio_add_page*() call site are disconnected
from where the bio is allocated and initialized (and also where the
bi_end_io function is set). This make it quite hard to ascertain
that GUPed page and non GUP page can not co-exist in same bio.

Also in some cases it is not clear that the same iter is use to
fill the same bio ie it might be possible that some code path fill
the same bio from different iterator (and thus some pages might
be coming from GUP and other not).

It would certainly seems to require more careful review from the
maintainers of such fs. I tend to believe that putting the burden
on the reviewer is a harder sell :)

From quick glance:
   - nilfs segment thing
   - direct-io same bio accumulate pages over multiple call but
     it should always be from same iterator and thus either always
     be from GUP or non GUP. Also the ZERO_PAGE case should be easy
     to catch.
   - fs/nfs/blocklayout/blocklayout.c
   - gfs2 log buffer, that should never be page from GUP but i could
     not ascertain that easily from quick review

This is not extensive, i was just grepping for bio_add_page() and
they are 2 other variant to check and i tended to discard places
where bio is allocated in same function as bio_add_page() but this
might not be a valid assumption either. Some bio might be allocated
and only if there is no default bio already and then set as default
bio which might be use latter on with different iterator.

> 
> So with a very careful coding I think you need not touch the block / scatter-list layers
> nor any LLD drivers. The only code affected is the code around the get_user_pages and friends.
> Changing the API will surface all those.
> (IE. introduce a new API, convert one by one, Remove old API)
> 
> Am I smoking?

No, i thought about it seemed more dangerous and harder to get right
because some code add page in one place and setup bio in another. I
can dig some more on that front but this still leave the non-bio user
of bio_vec and those IIRC also suffer from same disconnect issue.

> 
> BTW: Are you aware of the users of iov_iter_get_pages_alloc() Do they need fixing too?

Yeah and that patchset should address those already, i do not think
i missed any.

Cheers,
Jérôme
Jerome Glisse April 16, 2019, 11:34 p.m. UTC | #12
On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
> On 16/04/19 22:57, Jerome Glisse wrote:
> <>
> > 
> > A very long thread on this:
> > 
> > https://lkml.org/lkml/2018/12/3/1128
> > 
> > especialy all the reply to this first one
> > 
> > There is also:
> > 
> > https://lkml.org/lkml/2019/3/26/1395
> > https://lwn.net/Articles/753027/
> > 
> 
> OK I have re-read this patchset and a little bit of the threads above (not all)
> 
> As I understand the long term plan is to keep two separate ref-counts one
> for GUP-ref and one for the regular page-state/ownership ref.
> Currently looking at page-ref we do not know if we have a GUP currently held.
> With the new plan we can (Still not sure what's the full plan with this new info)
> 
> But if you make it such as the first GUP-ref also takes a page_ref and the
> last GUp-dec also does put_page. Then the all of these becomes a matter of
> matching every call to get_user_pages or iov_iter_get_pages() with a new
> put_user_pages or iov_iter_put_pages().

So sorry forgot to answer that part. So idea is to do:
    GUP() {
        ...
-       page_ref_inc(page);
+       page_ref_add(page, GUP_BIAS);
        ...
    }

with GUP_BIAS = 1024 or something big but not too big to avoid risk of
overflow by GUP. Then put_user_page() just ref_sub instead of ref_dec
the same amount.

We can have false GUP positive if a page is map so many time or reference
so many time that its refcount reach the GUP_BIAS value but considering
such page as GUPed should not be too harmful (not more harmful than what
we do with GUPed page).

So we want to call put_user_page() for GUPed page and only GUPed page so
that we keep the reference count properly balance.

Cheers,
Jérôme
Boaz Harrosh April 17, 2019, 1:11 a.m. UTC | #13
On 17/04/19 02:16, Jerome Glisse wrote:
> On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
>> On 16/04/19 22:57, Jerome Glisse wrote:
>> <>
>>>
>>> A very long thread on this:
>>>
>>> https://lkml.org/lkml/2018/12/3/1128
>>>
>>> especialy all the reply to this first one
>>>
>>> There is also:
>>>
>>> https://lkml.org/lkml/2019/3/26/1395
>>> https://lwn.net/Articles/753027/
>>>
>>
>> OK I have re-read this patchset and a little bit of the threads above (not all)
>>
>> As I understand the long term plan is to keep two separate ref-counts one
>> for GUP-ref and one for the regular page-state/ownership ref.
>> Currently looking at page-ref we do not know if we have a GUP currently held.
>> With the new plan we can (Still not sure what's the full plan with this new info)
>>
>> But if you make it such as the first GUP-ref also takes a page_ref and the
>> last GUp-dec also does put_page. Then the all of these becomes a matter of
>> matching every call to get_user_pages or iov_iter_get_pages() with a new
>> put_user_pages or iov_iter_put_pages().
>>
>> Then if much below us an LLD takes a get_page() say an skb below the iscsi
>> driver, and so on. We do not care and we keep doing a put_page because we know
>> the GUP-ref holds the page for us.
>>
>> The current block layer is transparent to any page-ref it does not take any
>> nor put_page any. It is only the higher users that have done GUP that take care of that.
>>
>> The patterns I see are:
>>
>>   iov_iter_get_pages()
>>
>> 	IO(sync)
>>
>>   for(numpages)
>> 	put_page()
>>
>> Or
>>
>>   iov_iter_get_pages()
>>
>> 	IO (async)
>> 		->	foo_end_io()
>> 				put_page
>>
>> (Same with get_user_pages)
>> (IO need not be block layer. It can be networking and so on like in NFS or CIFS
>>  and so on)
> 
> They are also other code that pass around bio_vec and the code that
> fill it is disconnected from the code that release the page and they
> can mix and match GUP and non GUP AFAICT.
> 
> On fs side they are also code that fill either bio or bio_vec and
> use some extra mechanism other than bio_end to submit io through
> workqueue and then release pages (cifs for instance). Again i believe
> they can mix and match GUP and non GUP (i have not spotted something
> obvious indicating otherwise).
> 

But what I meant is why do we care at all? block layer does not inc page nor put
page in any of bio or bio_vec. It is agnostic to the page-refs.

Users register an end_io and know if pages are getted or not.
So the balanced put is up to the user.

>>
>> The first pattern is easy just add the proper new api for
>> it, so for every iov_iter_get_pages() you have an iov_iter_put_pages() and remove
>> lots of cooked up for loops. Also the all iov_iter_get_pages_use_gup() just drops.
>> (Same at get_user_pages sites use put_user_pages)
> 
> Yes this patchset already convert some of this first pattern.
> 

Right!

>> The second pattern is a bit harder because it is possible that the foo_end_io()
>> is currently used for GUP as well as none-GUP cases. this is easy to fix. But the
>> even harder case is if the same foo_end_io() call has some pages GUPed and some not
>> in the same call.
>>
>> staring at this patchset and the call sites I did not see any such places. Do you know
>> of any?
>> (We can always force such mixed-case users to always GUP-ref the pages and code
>>  foo_end_io() to GUP-dec)
> 
> I believe direct-io.c is such example thought in that case i believe it
> can only be the ZERO_PAGE so this might easily detectable. They are also
> lot of fs functions taking an iterator and then using iov_iter_get_pages*()
> to fill a bio. AFAICT those functions can be call with pipe iterator or
> iovec iterator and probably also with other iterator type. But it is all
> common code afterward (the bi_end_io function is the same no matter the
> iterator).
> 
> Thought that can probably be solve that way:
> 
> From:
>     foo_bi_end_io(struct bio *bio) {
>         ...
>         for (i = 0; i < npages; ++i) {
>             put_page(pages[i]);
>         }
>     }
> 
> To:
>     foo_bi_end_io_common(struct bio *bio) {
>         ...
>     }
> 
>     foo_bi_end_io_normal(struct bio *bio)
>         foo_bi_end_io_common(bio);
>         for (i = 0; i < npages; ++i) {
>             put_page(pages[i]);
>         }
>     }
> 
>     foo_bi_end_io_gup(struct bio *bio)
>         foo_bi_end_io_common(bio);
>         for (i = 0; i < npages; ++i) {
>             put_user_page(pages[i]);
>         }
>     }
> 

Yes or when foo_bi_end_io_common is more complicated, then just make it

     foo_bi_end_io_common(struct bio *bio, bool gup) {
         ...
     }

     foo_bi_end_io_normal(struct bio *bio)
	foo_bi_end_io_common(bio, false);
     }
 
     foo_bi_end_io_gup(struct bio *bio)
	foo_bi_end_io_common(bio, true);
     }

Less risky coding of code we do not know?

> Then when filling in the bio i either pick foo_bi_end_io_normal() or
> foo_bi_end_io_gup(). I am assuming that bio with different bi_end_io
> function never get merge.
> 

Exactly

> The issue is that some bio_add_page*() call site are disconnected
> from where the bio is allocated and initialized (and also where the
> bi_end_io function is set). This make it quite hard to ascertain
> that GUPed page and non GUP page can not co-exist in same bio.
> 

Two questions if they always do a put_page at end IO. Who takes a page_ref
in the not GUP case? page-cache? VFS? a mechanical get_page?

> Also in some cases it is not clear that the same iter is use to
> fill the same bio ie it might be possible that some code path fill
> the same bio from different iterator (and thus some pages might
> be coming from GUP and other not).
> 

This one is hard to believe for me. 
one iter may produce multiple iter_get_pages() and many more bios.
But the opposite?

I know, never say never. Do you know of a specific example?
I would like to stare at it.

> It would certainly seems to require more careful review from the
> maintainers of such fs. I tend to believe that putting the burden
> on the reviewer is a harder sell :)
> 

I think a couple carefully put WARN_ONs in the PUT path can
detect any leakage of refs. And help debug these cases.

> From quick glance:
>    - nilfs segment thing
>    - direct-io same bio accumulate pages over multiple call but
>      it should always be from same iterator and thus either always
>      be from GUP or non GUP. Also the ZERO_PAGE case should be easy
>      to catch.

Yes. Or we can always take a GUP-ref on the ZERO_PAGE as well

>    - fs/nfs/blocklayout/blocklayout.c

This one is an example of "please do not touch" if you look at the code
it currently does not do any put page at all. Though yes it does bio_add_page.

The pages are GETed and PUTed in nfs/direct.c and reference are balanced there.

this is the trivial case of for every iov_iter_get_pages[_alloc]() there is
a new defined iov_iter_put_pages[_alloc]()

So this is an example of extra not needed code changes in your approach

>    - gfs2 log buffer, that should never be page from GUP but i could
>      not ascertain that easily from quick review

	Same as NFS maybe? didn't look.

> 
> This is not extensive, i was just grepping for bio_add_page() and
> they are 2 other variant to check and i tended to discard places
> where bio is allocated in same function as bio_add_page() but this
> might not be a valid assumption either. Some bio might be allocated
> and only if there is no default bio already and then set as default
> bio which might be use latter on with different iterator.
> 

I think we do not care at all about any of the bio_add_page() or bio_alloc
places. All we care about is the call to iov_iter_get_pages* and where in the
code these puts are balanced.

If we need to split the endio case at those sights then we can do as above.
Or in the worse case when pages are really mixed. Always take a GUP  ref also
on the not GUP case. 
(I would like to see where this happens)

>>
>> So with a very careful coding I think you need not touch the block / scatter-list layers
>> nor any LLD drivers. The only code affected is the code around the get_user_pages and friends.
>> Changing the API will surface all those.
>> (IE. introduce a new API, convert one by one, Remove old API)
>>
>> Am I smoking?
> 
> No, i thought about it seemed more dangerous and harder to get right
> because some code add page in one place and setup bio in another. I
> can dig some more on that front but this still leave the non-bio user
> of bio_vec and those IIRC also suffer from same disconnect issue.
> 

Again I should not care about bio_vec. I only need to trace the balancing of the
ref taken in GUP call sight. Let me help you in those places it is not obvious to
you.

>>
>> BTW: Are you aware of the users of iov_iter_get_pages_alloc() Do they need fixing too?
> 
> Yeah and that patchset should address those already, i do not think
> i missed any.
> 

I could not find a patch for nfs/direct.c where a put_page is called
to balance the iov_iter_get_pages_alloc(). Which takes care of for example of
the blocklayout.c pages state

So I think the deep Audit needs to be for iov_iter_get_pages and get_user_pages
and the balancing of that. And the all of bio_alloc and bio_add_page should stay
agnostic to any pege-refs taking/putting

> Cheers,
> Jérôme
> 

Lets talk in LSF, see things hands on

Thanks
Boaz
Jerome Glisse April 17, 2019, 2:03 a.m. UTC | #14
On Wed, Apr 17, 2019 at 04:11:03AM +0300, Boaz Harrosh wrote:
> On 17/04/19 02:16, Jerome Glisse wrote:
> > On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
> >> On 16/04/19 22:57, Jerome Glisse wrote:
> >> <>
> >>>
> >>> A very long thread on this:
> >>>
> >>> https://lkml.org/lkml/2018/12/3/1128
> >>>
> >>> especialy all the reply to this first one
> >>>
> >>> There is also:
> >>>
> >>> https://lkml.org/lkml/2019/3/26/1395
> >>> https://lwn.net/Articles/753027/
> >>>
> >>
> >> OK I have re-read this patchset and a little bit of the threads above (not all)
> >>
> >> As I understand the long term plan is to keep two separate ref-counts one
> >> for GUP-ref and one for the regular page-state/ownership ref.
> >> Currently looking at page-ref we do not know if we have a GUP currently held.
> >> With the new plan we can (Still not sure what's the full plan with this new info)
> >>
> >> But if you make it such as the first GUP-ref also takes a page_ref and the
> >> last GUp-dec also does put_page. Then the all of these becomes a matter of
> >> matching every call to get_user_pages or iov_iter_get_pages() with a new
> >> put_user_pages or iov_iter_put_pages().
> >>
> >> Then if much below us an LLD takes a get_page() say an skb below the iscsi
> >> driver, and so on. We do not care and we keep doing a put_page because we know
> >> the GUP-ref holds the page for us.
> >>
> >> The current block layer is transparent to any page-ref it does not take any
> >> nor put_page any. It is only the higher users that have done GUP that take care of that.
> >>
> >> The patterns I see are:
> >>
> >>   iov_iter_get_pages()
> >>
> >> 	IO(sync)
> >>
> >>   for(numpages)
> >> 	put_page()
> >>
> >> Or
> >>
> >>   iov_iter_get_pages()
> >>
> >> 	IO (async)
> >> 		->	foo_end_io()
> >> 				put_page
> >>
> >> (Same with get_user_pages)
> >> (IO need not be block layer. It can be networking and so on like in NFS or CIFS
> >>  and so on)
> > 
> > They are also other code that pass around bio_vec and the code that
> > fill it is disconnected from the code that release the page and they
> > can mix and match GUP and non GUP AFAICT.
> > 
> > On fs side they are also code that fill either bio or bio_vec and
> > use some extra mechanism other than bio_end to submit io through
> > workqueue and then release pages (cifs for instance). Again i believe
> > they can mix and match GUP and non GUP (i have not spotted something
> > obvious indicating otherwise).
> > 
> 
> But what I meant is why do we care at all? block layer does not inc page nor put
> page in any of bio or bio_vec. It is agnostic to the page-refs.
> 
> Users register an end_io and know if pages are getted or not.
> So the balanced put is up to the user.
> 
> >>
> >> The first pattern is easy just add the proper new api for
> >> it, so for every iov_iter_get_pages() you have an iov_iter_put_pages() and remove
> >> lots of cooked up for loops. Also the all iov_iter_get_pages_use_gup() just drops.
> >> (Same at get_user_pages sites use put_user_pages)
> > 
> > Yes this patchset already convert some of this first pattern.
> > 
> 
> Right!
> 
> >> The second pattern is a bit harder because it is possible that the foo_end_io()
> >> is currently used for GUP as well as none-GUP cases. this is easy to fix. But the
> >> even harder case is if the same foo_end_io() call has some pages GUPed and some not
> >> in the same call.
> >>
> >> staring at this patchset and the call sites I did not see any such places. Do you know
> >> of any?
> >> (We can always force such mixed-case users to always GUP-ref the pages and code
> >>  foo_end_io() to GUP-dec)
> > 
> > I believe direct-io.c is such example thought in that case i believe it
> > can only be the ZERO_PAGE so this might easily detectable. They are also
> > lot of fs functions taking an iterator and then using iov_iter_get_pages*()
> > to fill a bio. AFAICT those functions can be call with pipe iterator or
> > iovec iterator and probably also with other iterator type. But it is all
> > common code afterward (the bi_end_io function is the same no matter the
> > iterator).
> > 
> > Thought that can probably be solve that way:
> > 
> > From:
> >     foo_bi_end_io(struct bio *bio) {
> >         ...
> >         for (i = 0; i < npages; ++i) {
> >             put_page(pages[i]);
> >         }
> >     }
> > 
> > To:
> >     foo_bi_end_io_common(struct bio *bio) {
> >         ...
> >     }
> > 
> >     foo_bi_end_io_normal(struct bio *bio)
> >         foo_bi_end_io_common(bio);
> >         for (i = 0; i < npages; ++i) {
> >             put_page(pages[i]);
> >         }
> >     }
> > 
> >     foo_bi_end_io_gup(struct bio *bio)
> >         foo_bi_end_io_common(bio);
> >         for (i = 0; i < npages; ++i) {
> >             put_user_page(pages[i]);
> >         }
> >     }
> > 
> 
> Yes or when foo_bi_end_io_common is more complicated, then just make it
> 
>      foo_bi_end_io_common(struct bio *bio, bool gup) {
>          ...
>      }
> 
>      foo_bi_end_io_normal(struct bio *bio)
> 	foo_bi_end_io_common(bio, false);
>      }
>  
>      foo_bi_end_io_gup(struct bio *bio)
> 	foo_bi_end_io_common(bio, true);
>      }
> 
> Less risky coding of code we do not know?

Yes whatever is more appropriate for eahc end_io func.

> 
> > Then when filling in the bio i either pick foo_bi_end_io_normal() or
> > foo_bi_end_io_gup(). I am assuming that bio with different bi_end_io
> > function never get merge.
> > 
> 
> Exactly
> 
> > The issue is that some bio_add_page*() call site are disconnected
> > from where the bio is allocated and initialized (and also where the
> > bi_end_io function is set). This make it quite hard to ascertain
> > that GUPed page and non GUP page can not co-exist in same bio.
> > 
> 
> Two questions if they always do a put_page at end IO. Who takes a page_ref
> in the not GUP case? page-cache? VFS? a mechanical get_page?

It depends, some get page to page-cache (find_get_page), other just
get_page on a page that is coming from unknown origin (ie it is not
clear from within the function where the page is coming from), other
allocate a page and transfer the page reference to the bio ie the
page will get free once the bio end function is call through as it
call call put_page on page within the bio.

So there is not one simple story here. Hence why it looks harder to
me (still likely do-able).

> > Also in some cases it is not clear that the same iter is use to
> > fill the same bio ie it might be possible that some code path fill
> > the same bio from different iterator (and thus some pages might
> > be coming from GUP and other not).
> > 
> 
> This one is hard to believe for me. 
> one iter may produce multiple iter_get_pages() and many more bios.
> But the opposite?
> 
> I know, never say never. Do you know of a specific example?
> I would like to stare at it.

No i do not know of any such example but reading through a lot of code
i could not convince myself of that fact. I agree with your feeling on
the matter i just don't have proof. I can spend more time digging all
code path and ascertaining thing but this will inevitably be a snapshot
in time and something that people might break in the future.

> 
> > It would certainly seems to require more careful review from the
> > maintainers of such fs. I tend to believe that putting the burden
> > on the reviewer is a harder sell :)
> > 
> 
> I think a couple carefully put WARN_ONs in the PUT path can
> detect any leakage of refs. And help debug these cases.

Yes we do plan on catching things like underflow (easy one to catch)
or put_page that bring the refcount just below the bias value ... and
in anycase the only harm that can come of this should be memory leak.

So i believe that even if we miss some weird corner case we should be
able to catch it eventualy and nothing harmful should ever happen,
famous last word i will regret when i get burnt at the stake for missing
that one case ;)

> 
> > From quick glance:
> >    - nilfs segment thing
> >    - direct-io same bio accumulate pages over multiple call but
> >      it should always be from same iterator and thus either always
> >      be from GUP or non GUP. Also the ZERO_PAGE case should be easy
> >      to catch.
> 
> Yes. Or we can always take a GUP-ref on the ZERO_PAGE as well
> 
> >    - fs/nfs/blocklayout/blocklayout.c
> 
> This one is an example of "please do not touch" if you look at the code
> it currently does not do any put page at all. Though yes it does bio_add_page.
> 
> The pages are GETed and PUTed in nfs/direct.c and reference are balanced there.
> 
> this is the trivial case of for every iov_iter_get_pages[_alloc]() there is
> a new defined iov_iter_put_pages[_alloc]()
> 
> So this is an example of extra not needed code changes in your approach
> 
> >    - gfs2 log buffer, that should never be page from GUP but i could
> >      not ascertain that easily from quick review
> 
> 	Same as NFS maybe? didn't look.
> 
> > 
> > This is not extensive, i was just grepping for bio_add_page() and
> > they are 2 other variant to check and i tended to discard places
> > where bio is allocated in same function as bio_add_page() but this
> > might not be a valid assumption either. Some bio might be allocated
> > and only if there is no default bio already and then set as default
> > bio which might be use latter on with different iterator.
> > 
> 
> I think we do not care at all about any of the bio_add_page() or bio_alloc
> places. All we care about is the call to iov_iter_get_pages* and where in the
> code these puts are balanced.
> 
> If we need to split the endio case at those sights then we can do as above.
> Or in the worse case when pages are really mixed. Always take a GUP  ref also
> on the not GUP case. 
> (I would like to see where this happens)
> 
> >>
> >> So with a very careful coding I think you need not touch the block / scatter-list layers
> >> nor any LLD drivers. The only code affected is the code around the get_user_pages and friends.
> >> Changing the API will surface all those.
> >> (IE. introduce a new API, convert one by one, Remove old API)
> >>
> >> Am I smoking?
> > 
> > No, i thought about it seemed more dangerous and harder to get right
> > because some code add page in one place and setup bio in another. I
> > can dig some more on that front but this still leave the non-bio user
> > of bio_vec and those IIRC also suffer from same disconnect issue.
> > 
> 
> Again I should not care about bio_vec. I only need to trace the balancing of the
> ref taken in GUP call sight. Let me help you in those places it is not obvious to
> you.
> 
> >>
> >> BTW: Are you aware of the users of iov_iter_get_pages_alloc() Do they need fixing too?
> > 
> > Yeah and that patchset should address those already, i do not think
> > i missed any.
> > 
> 
> I could not find a patch for nfs/direct.c where a put_page is called
> to balance the iov_iter_get_pages_alloc(). Which takes care of for example of
> the blocklayout.c pages state
> 
> So I think the deep Audit needs to be for iov_iter_get_pages and get_user_pages
> and the balancing of that. And the all of bio_alloc and bio_add_page should stay
> agnostic to any pege-refs taking/putting

Will try to get started on that and see if i hit any roadblock. I will
report once i get my feet wet, or at least before i drown ;)

Cheers,
Jérôme
Jerome Glisse April 17, 2019, 9:19 p.m. UTC | #15
On Tue, Apr 16, 2019 at 10:03:45PM -0400, Jerome Glisse wrote:
> On Wed, Apr 17, 2019 at 04:11:03AM +0300, Boaz Harrosh wrote:
> > On 17/04/19 02:16, Jerome Glisse wrote:
> > > On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
> > >> On 16/04/19 22:57, Jerome Glisse wrote:

[...]

> > >>
> > >> BTW: Are you aware of the users of iov_iter_get_pages_alloc() Do they need fixing too?
> > > 
> > > Yeah and that patchset should address those already, i do not think
> > > i missed any.
> > > 
> > 
> > I could not find a patch for nfs/direct.c where a put_page is called
> > to balance the iov_iter_get_pages_alloc(). Which takes care of for example of
> > the blocklayout.c pages state
> > 
> > So I think the deep Audit needs to be for iov_iter_get_pages and get_user_pages
> > and the balancing of that. And the all of bio_alloc and bio_add_page should stay
> > agnostic to any pege-refs taking/putting
> 
> Will try to get started on that and see if i hit any roadblock. I will
> report once i get my feet wet, or at least before i drown ;)

So far it does not look too bad:

https://cgit.freedesktop.org/~glisse/linux/log/?h=gup-bio-v2

they are few things that will be harder to fit in like splice
and pipe that are populated from GUP.

Cheers,
Jérôme
Dan Williams April 17, 2019, 9:53 p.m. UTC | #16
On Tue, Apr 16, 2019 at 12:50 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote:
> > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > <kent.overstreet@gmail.com> wrote:
> > >
> > > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > >
> > > > > This patchset depends on various small fixes [1] and also on patchset
> > > > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > > > so that it can get review and comments on how and what should be done
> > > > > to test things.
> > > > >
> > > > > For various reasons [2] [3] we want to track page reference through GUP
> > > > > differently than "regular" page reference. Thus we need to keep track
> > > > > of how we got a page within the block and fs layer. To do so this patch-
> > > > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > > > direct pointer to a page. This way we can flag page that are coming from
> > > > > GUP.
> > > > >
> > > > > This patchset is divided as follow:
> > > > >     - First part of the patchset is just small cleanup i believe they
> > > > >       can go in as his assuming people are ok with them.
> > > >
> > > >
> > > > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > > > >       done in multi-step, first we replace all direct dereference of
> > > > >       the field by call to inline helper, then we introduce macro for
> > > > >       bio_bvec that are initialized on the stack. Finaly we change the
> > > > >       bv_page field to bv_pfn.
> > > >
> > > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > > > as a flag (pointer always aligned to 64 bytes in our case).
> > > >
> > > > So yes we need an inline helper for reference of the page but is it not clearer
> > > > that we assume a page* and not any kind of pfn ?
> > > > It will not be the first place using low bits of a pointer for flags.
> > > >
> > > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > > > can just submit them as two separate BIOs (chained at the block layer).
> > > >
> > > > Many users just submit one page bios and let elevator merge them any way.
> > >
> > > Let's please not add additional flags and weirdness to struct bio - "if this
> > > flag is set interpret one way, if not interpret another" - or eventually bios
> > > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.
> >
> > This all reminds of the failed attempt to teach the block layer to
> > operate without pages:
> >
> > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> >
> > >
> > > Question though - why do we need a flag for whether a page is a GUP page or not?
> > > Couldn't the needed information just be determined by what range the pfn is not
> > > (i.e. whether or not it has a struct page associated with it)?
> >
> > That amounts to a pfn_valid() check which is a bit heavier than if we
> > can store a flag in the bv_pfn entry directly.
> >
> > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > an 'unsigned long'.
> >
> > That said, I'm still in favor of Jan's proposal to just make the
> > bv_page semantics uniform. Otherwise we're complicating this core
> > infrastructure for some yet to be implemented GPU memory management
> > capabilities with yet to be determined value. Circle back when that
> > value is clear, but in the meantime fix the GUP bug.
>
> This has nothing to do with GPU, what make you think so ? Here i am
> trying to solve GUP and to keep the value of knowing wether a page
> has been GUP or not. I argue that if we bias every page in every bio
> then we loose that information and thus the value.
>
> I gave the page protection mechanisms as an example that would be
> impacted but it is not the only one. Knowing if a page has been GUP
> can be useful for memory reclaimation, compaction, NUMA balancing,

Right, this is what I was reacting to in your pushback to Jan's
proposal. You're claiming value for not doing the simple thing for
some future "may be useful in these contexts". To my knowledge those
things are not broken today. You're asking for the complexity to be
carried today for some future benefit, and I'm asking for the
simplicity to be maintained as much as possible today and let the
value of future changes stand on their own to push for more complexity
later.

Effectively don't use this bug fix to push complexity for a future
agenda where the value has yet to be quantified.
Dan Williams April 17, 2019, 9:54 p.m. UTC | #17
On Tue, Apr 16, 2019 at 12:28 PM Boaz Harrosh <boaz@plexistor.com> wrote:
>
> On 16/04/19 22:12, Dan Williams wrote:
> > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > <kent.overstreet@gmail.com> wrote:
> <>
> > This all reminds of the failed attempt to teach the block layer to
> > operate without pages:
> >
> > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> >
>
> Exactly why I want to make sure it is just a [pointer | flag] and not any kind of pfn
> type. Let us please not go there again?
>
> >>
> >> Question though - why do we need a flag for whether a page is a GUP page or not?
> >> Couldn't the needed information just be determined by what range the pfn is not
> >> (i.e. whether or not it has a struct page associated with it)?
> >
> > That amounts to a pfn_valid() check which is a bit heavier than if we
> > can store a flag in the bv_pfn entry directly.
> >
> > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > an 'unsigned long'.
> >
>
> No, please please not. This is not a pfn and not a pfn_t. It is a page-ptr
> and a flag that says where/how to put_page it. IE I did a GUP on this page
> please do a PUP on this page instead of regular put_page. So no where do I mean
> pfn or pfn_t in this code. Then why?

If it's not a pfn then it shouldn't be an unsigned long named "bv_pfn".
Jerome Glisse April 17, 2019, 10:28 p.m. UTC | #18
On Wed, Apr 17, 2019 at 02:53:28PM -0700, Dan Williams wrote:
> On Tue, Apr 16, 2019 at 12:50 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote:
> > > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > > <kent.overstreet@gmail.com> wrote:
> > > >
> > > > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > > > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > >
> > > > > > This patchset depends on various small fixes [1] and also on patchset
> > > > > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > > > > so that it can get review and comments on how and what should be done
> > > > > > to test things.
> > > > > >
> > > > > > For various reasons [2] [3] we want to track page reference through GUP
> > > > > > differently than "regular" page reference. Thus we need to keep track
> > > > > > of how we got a page within the block and fs layer. To do so this patch-
> > > > > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > > > > direct pointer to a page. This way we can flag page that are coming from
> > > > > > GUP.
> > > > > >
> > > > > > This patchset is divided as follow:
> > > > > >     - First part of the patchset is just small cleanup i believe they
> > > > > >       can go in as his assuming people are ok with them.
> > > > >
> > > > >
> > > > > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > > > > >       done in multi-step, first we replace all direct dereference of
> > > > > >       the field by call to inline helper, then we introduce macro for
> > > > > >       bio_bvec that are initialized on the stack. Finaly we change the
> > > > > >       bv_page field to bv_pfn.
> > > > >
> > > > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > > > > as a flag (pointer always aligned to 64 bytes in our case).
> > > > >
> > > > > So yes we need an inline helper for reference of the page but is it not clearer
> > > > > that we assume a page* and not any kind of pfn ?
> > > > > It will not be the first place using low bits of a pointer for flags.
> > > > >
> > > > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > > > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > > > > can just submit them as two separate BIOs (chained at the block layer).
> > > > >
> > > > > Many users just submit one page bios and let elevator merge them any way.
> > > >
> > > > Let's please not add additional flags and weirdness to struct bio - "if this
> > > > flag is set interpret one way, if not interpret another" - or eventually bios
> > > > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.
> > >
> > > This all reminds of the failed attempt to teach the block layer to
> > > operate without pages:
> > >
> > > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> > >
> > > >
> > > > Question though - why do we need a flag for whether a page is a GUP page or not?
> > > > Couldn't the needed information just be determined by what range the pfn is not
> > > > (i.e. whether or not it has a struct page associated with it)?
> > >
> > > That amounts to a pfn_valid() check which is a bit heavier than if we
> > > can store a flag in the bv_pfn entry directly.
> > >
> > > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > > an 'unsigned long'.
> > >
> > > That said, I'm still in favor of Jan's proposal to just make the
> > > bv_page semantics uniform. Otherwise we're complicating this core
> > > infrastructure for some yet to be implemented GPU memory management
> > > capabilities with yet to be determined value. Circle back when that
> > > value is clear, but in the meantime fix the GUP bug.
> >
> > This has nothing to do with GPU, what make you think so ? Here i am
> > trying to solve GUP and to keep the value of knowing wether a page
> > has been GUP or not. I argue that if we bias every page in every bio
> > then we loose that information and thus the value.
> >
> > I gave the page protection mechanisms as an example that would be
> > impacted but it is not the only one. Knowing if a page has been GUP
> > can be useful for memory reclaimation, compaction, NUMA balancing,
> 
> Right, this is what I was reacting to in your pushback to Jan's
> proposal. You're claiming value for not doing the simple thing for
> some future "may be useful in these contexts". To my knowledge those
> things are not broken today. You're asking for the complexity to be
> carried today for some future benefit, and I'm asking for the
> simplicity to be maintained as much as possible today and let the
> value of future changes stand on their own to push for more complexity
> later.
> 
> Effectively don't use this bug fix to push complexity for a future
> agenda where the value has yet to be quantified.

Except that this solution (biasing everyone in bio) would _more complex_
it is only conceptualy appealing. The changes are on the other hand much
deeper and much riskier but you decided to ignore that and focus on some-
thing i was just giving as an example.

Jérôme
Dan Williams April 17, 2019, 11:32 p.m. UTC | #19
On Wed, Apr 17, 2019 at 3:29 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Apr 17, 2019 at 02:53:28PM -0700, Dan Williams wrote:
> > On Tue, Apr 16, 2019 at 12:50 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote:
> > > > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > > > <kent.overstreet@gmail.com> wrote:
> > > > >
> > > > > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > > > > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > >
> > > > > > > This patchset depends on various small fixes [1] and also on patchset
> > > > > > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > > > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > > > > > so that it can get review and comments on how and what should be done
> > > > > > > to test things.
> > > > > > >
> > > > > > > For various reasons [2] [3] we want to track page reference through GUP
> > > > > > > differently than "regular" page reference. Thus we need to keep track
> > > > > > > of how we got a page within the block and fs layer. To do so this patch-
> > > > > > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > > > > > direct pointer to a page. This way we can flag page that are coming from
> > > > > > > GUP.
> > > > > > >
> > > > > > > This patchset is divided as follow:
> > > > > > >     - First part of the patchset is just small cleanup i believe they
> > > > > > >       can go in as his assuming people are ok with them.
> > > > > >
> > > > > >
> > > > > > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > > > > > >       done in multi-step, first we replace all direct dereference of
> > > > > > >       the field by call to inline helper, then we introduce macro for
> > > > > > >       bio_bvec that are initialized on the stack. Finaly we change the
> > > > > > >       bv_page field to bv_pfn.
> > > > > >
> > > > > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > > > > > as a flag (pointer always aligned to 64 bytes in our case).
> > > > > >
> > > > > > So yes we need an inline helper for reference of the page but is it not clearer
> > > > > > that we assume a page* and not any kind of pfn ?
> > > > > > It will not be the first place using low bits of a pointer for flags.
> > > > > >
> > > > > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > > > > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > > > > > can just submit them as two separate BIOs (chained at the block layer).
> > > > > >
> > > > > > Many users just submit one page bios and let elevator merge them any way.
> > > > >
> > > > > Let's please not add additional flags and weirdness to struct bio - "if this
> > > > > flag is set interpret one way, if not interpret another" - or eventually bios
> > > > > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.
> > > >
> > > > This all reminds of the failed attempt to teach the block layer to
> > > > operate without pages:
> > > >
> > > > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > >
> > > > >
> > > > > Question though - why do we need a flag for whether a page is a GUP page or not?
> > > > > Couldn't the needed information just be determined by what range the pfn is not
> > > > > (i.e. whether or not it has a struct page associated with it)?
> > > >
> > > > That amounts to a pfn_valid() check which is a bit heavier than if we
> > > > can store a flag in the bv_pfn entry directly.
> > > >
> > > > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > > > an 'unsigned long'.
> > > >
> > > > That said, I'm still in favor of Jan's proposal to just make the
> > > > bv_page semantics uniform. Otherwise we're complicating this core
> > > > infrastructure for some yet to be implemented GPU memory management
> > > > capabilities with yet to be determined value. Circle back when that
> > > > value is clear, but in the meantime fix the GUP bug.
> > >
> > > This has nothing to do with GPU, what make you think so ? Here i am
> > > trying to solve GUP and to keep the value of knowing wether a page
> > > has been GUP or not. I argue that if we bias every page in every bio
> > > then we loose that information and thus the value.
> > >
> > > I gave the page protection mechanisms as an example that would be
> > > impacted but it is not the only one. Knowing if a page has been GUP
> > > can be useful for memory reclaimation, compaction, NUMA balancing,
> >
> > Right, this is what I was reacting to in your pushback to Jan's
> > proposal. You're claiming value for not doing the simple thing for
> > some future "may be useful in these contexts". To my knowledge those
> > things are not broken today. You're asking for the complexity to be
> > carried today for some future benefit, and I'm asking for the
> > simplicity to be maintained as much as possible today and let the
> > value of future changes stand on their own to push for more complexity
> > later.
> >
> > Effectively don't use this bug fix to push complexity for a future
> > agenda where the value has yet to be quantified.
>
> Except that this solution (biasing everyone in bio) would _more complex_
> it is only conceptualy appealing. The changes are on the other hand much
> deeper and much riskier but you decided to ignore that and focus on some-
> thing i was just giving as an example.

Not ignoring, asking for more clarification on the complexity it
introduces independent of potential future uses.
Jan Kara April 18, 2019, 10:42 a.m. UTC | #20
On Wed 17-04-19 18:28:58, Jerome Glisse wrote:
> On Wed, Apr 17, 2019 at 02:53:28PM -0700, Dan Williams wrote:
> > On Tue, Apr 16, 2019 at 12:50 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote:
> > > > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > > > <kent.overstreet@gmail.com> wrote:
> > > > >
> > > > > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > > > > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > >
> > > > > > > This patchset depends on various small fixes [1] and also on patchset
> > > > > > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > > > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > > > > > so that it can get review and comments on how and what should be done
> > > > > > > to test things.
> > > > > > >
> > > > > > > For various reasons [2] [3] we want to track page reference through GUP
> > > > > > > differently than "regular" page reference. Thus we need to keep track
> > > > > > > of how we got a page within the block and fs layer. To do so this patch-
> > > > > > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > > > > > direct pointer to a page. This way we can flag page that are coming from
> > > > > > > GUP.
> > > > > > >
> > > > > > > This patchset is divided as follow:
> > > > > > >     - First part of the patchset is just small cleanup i believe they
> > > > > > >       can go in as his assuming people are ok with them.
> > > > > >
> > > > > >
> > > > > > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > > > > > >       done in multi-step, first we replace all direct dereference of
> > > > > > >       the field by call to inline helper, then we introduce macro for
> > > > > > >       bio_bvec that are initialized on the stack. Finaly we change the
> > > > > > >       bv_page field to bv_pfn.
> > > > > >
> > > > > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > > > > > as a flag (pointer always aligned to 64 bytes in our case).
> > > > > >
> > > > > > So yes we need an inline helper for reference of the page but is it not clearer
> > > > > > that we assume a page* and not any kind of pfn ?
> > > > > > It will not be the first place using low bits of a pointer for flags.
> > > > > >
> > > > > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > > > > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > > > > > can just submit them as two separate BIOs (chained at the block layer).
> > > > > >
> > > > > > Many users just submit one page bios and let elevator merge them any way.
> > > > >
> > > > > Let's please not add additional flags and weirdness to struct bio - "if this
> > > > > flag is set interpret one way, if not interpret another" - or eventually bios
> > > > > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.
> > > >
> > > > This all reminds of the failed attempt to teach the block layer to
> > > > operate without pages:
> > > >
> > > > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > >
> > > > >
> > > > > Question though - why do we need a flag for whether a page is a GUP page or not?
> > > > > Couldn't the needed information just be determined by what range the pfn is not
> > > > > (i.e. whether or not it has a struct page associated with it)?
> > > >
> > > > That amounts to a pfn_valid() check which is a bit heavier than if we
> > > > can store a flag in the bv_pfn entry directly.
> > > >
> > > > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > > > an 'unsigned long'.
> > > >
> > > > That said, I'm still in favor of Jan's proposal to just make the
> > > > bv_page semantics uniform. Otherwise we're complicating this core
> > > > infrastructure for some yet to be implemented GPU memory management
> > > > capabilities with yet to be determined value. Circle back when that
> > > > value is clear, but in the meantime fix the GUP bug.
> > >
> > > This has nothing to do with GPU, what make you think so ? Here i am
> > > trying to solve GUP and to keep the value of knowing wether a page
> > > has been GUP or not. I argue that if we bias every page in every bio
> > > then we loose that information and thus the value.
> > >
> > > I gave the page protection mechanisms as an example that would be
> > > impacted but it is not the only one. Knowing if a page has been GUP
> > > can be useful for memory reclaimation, compaction, NUMA balancing,
> > 
> > Right, this is what I was reacting to in your pushback to Jan's
> > proposal. You're claiming value for not doing the simple thing for
> > some future "may be useful in these contexts". To my knowledge those
> > things are not broken today. You're asking for the complexity to be
> > carried today for some future benefit, and I'm asking for the
> > simplicity to be maintained as much as possible today and let the
> > value of future changes stand on their own to push for more complexity
> > later.
> > 
> > Effectively don't use this bug fix to push complexity for a future
> > agenda where the value has yet to be quantified.
> 
> Except that this solution (biasing everyone in bio) would _more complex_
> it is only conceptualy appealing. The changes are on the other hand much
> deeper and much riskier but you decided to ignore that and focus on some-
> thing i was just giving as an example.

Yeah, after going and reading several places like fs/iomap.c, fs/mpage.c,
drivers/md/dm-io.c I agree with you. The places that are not doing direct
IO usually just don't hold any page reference that could be directly
attributed to the bio (and they don't drop it when bio finishes). They
rather use other means (like PageLocked, PageWriteback) to make sure the
page stays alive so mandating gup-pin reference for all pages attached to a
bio would require a lot of reworking of places that are not related to our
problem and currently work just fine. So I withdraw my suggestion. Nice in
theory, too much work in practice ;).

								Honza
Jerome Glisse April 18, 2019, 2:27 p.m. UTC | #21
On Thu, Apr 18, 2019 at 12:42:05PM +0200, Jan Kara wrote:
> On Wed 17-04-19 18:28:58, Jerome Glisse wrote:
> > On Wed, Apr 17, 2019 at 02:53:28PM -0700, Dan Williams wrote:
> > > On Tue, Apr 16, 2019 at 12:50 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote:
> > > > > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > > > > <kent.overstreet@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > > > > > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > > >
> > > > > > > > This patchset depends on various small fixes [1] and also on patchset
> > > > > > > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > > > > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > > > > > > so that it can get review and comments on how and what should be done
> > > > > > > > to test things.
> > > > > > > >
> > > > > > > > For various reasons [2] [3] we want to track page reference through GUP
> > > > > > > > differently than "regular" page reference. Thus we need to keep track
> > > > > > > > of how we got a page within the block and fs layer. To do so this patch-
> > > > > > > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > > > > > > direct pointer to a page. This way we can flag page that are coming from
> > > > > > > > GUP.
> > > > > > > >
> > > > > > > > This patchset is divided as follow:
> > > > > > > >     - First part of the patchset is just small cleanup i believe they
> > > > > > > >       can go in as his assuming people are ok with them.
> > > > > > >
> > > > > > >
> > > > > > > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > > > > > > >       done in multi-step, first we replace all direct dereference of
> > > > > > > >       the field by call to inline helper, then we introduce macro for
> > > > > > > >       bio_bvec that are initialized on the stack. Finaly we change the
> > > > > > > >       bv_page field to bv_pfn.
> > > > > > >
> > > > > > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > > > > > > as a flag (pointer always aligned to 64 bytes in our case).
> > > > > > >
> > > > > > > So yes we need an inline helper for reference of the page but is it not clearer
> > > > > > > that we assume a page* and not any kind of pfn ?
> > > > > > > It will not be the first place using low bits of a pointer for flags.
> > > > > > >
> > > > > > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > > > > > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > > > > > > can just submit them as two separate BIOs (chained at the block layer).
> > > > > > >
> > > > > > > Many users just submit one page bios and let elevator merge them any way.
> > > > > >
> > > > > > Let's please not add additional flags and weirdness to struct bio - "if this
> > > > > > flag is set interpret one way, if not interpret another" - or eventually bios
> > > > > > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.
> > > > >
> > > > > This all reminds of the failed attempt to teach the block layer to
> > > > > operate without pages:
> > > > >
> > > > > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > > >
> > > > > >
> > > > > > Question though - why do we need a flag for whether a page is a GUP page or not?
> > > > > > Couldn't the needed information just be determined by what range the pfn is not
> > > > > > (i.e. whether or not it has a struct page associated with it)?
> > > > >
> > > > > That amounts to a pfn_valid() check which is a bit heavier than if we
> > > > > can store a flag in the bv_pfn entry directly.
> > > > >
> > > > > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > > > > an 'unsigned long'.
> > > > >
> > > > > That said, I'm still in favor of Jan's proposal to just make the
> > > > > bv_page semantics uniform. Otherwise we're complicating this core
> > > > > infrastructure for some yet to be implemented GPU memory management
> > > > > capabilities with yet to be determined value. Circle back when that
> > > > > value is clear, but in the meantime fix the GUP bug.
> > > >
> > > > This has nothing to do with GPU, what make you think so ? Here i am
> > > > trying to solve GUP and to keep the value of knowing wether a page
> > > > has been GUP or not. I argue that if we bias every page in every bio
> > > > then we loose that information and thus the value.
> > > >
> > > > I gave the page protection mechanisms as an example that would be
> > > > impacted but it is not the only one. Knowing if a page has been GUP
> > > > can be useful for memory reclaimation, compaction, NUMA balancing,
> > > 
> > > Right, this is what I was reacting to in your pushback to Jan's
> > > proposal. You're claiming value for not doing the simple thing for
> > > some future "may be useful in these contexts". To my knowledge those
> > > things are not broken today. You're asking for the complexity to be
> > > carried today for some future benefit, and I'm asking for the
> > > simplicity to be maintained as much as possible today and let the
> > > value of future changes stand on their own to push for more complexity
> > > later.
> > > 
> > > Effectively don't use this bug fix to push complexity for a future
> > > agenda where the value has yet to be quantified.
> > 
> > Except that this solution (biasing everyone in bio) would _more complex_
> > it is only conceptualy appealing. The changes are on the other hand much
> > deeper and much riskier but you decided to ignore that and focus on some-
> > thing i was just giving as an example.
> 
> Yeah, after going and reading several places like fs/iomap.c, fs/mpage.c,
> drivers/md/dm-io.c I agree with you. The places that are not doing direct
> IO usually just don't hold any page reference that could be directly
> attributed to the bio (and they don't drop it when bio finishes). They
> rather use other means (like PageLocked, PageWriteback) to make sure the
> page stays alive so mandating gup-pin reference for all pages attached to a
> bio would require a lot of reworking of places that are not related to our
> problem and currently work just fine. So I withdraw my suggestion. Nice in
> theory, too much work in practice ;).

Have you seem Boaz proposal ? I have started it and it does not look to
bad (but you knwo taste and color :)) You can take a peek:

https://cgit.freedesktop.org/~glisse/linux/log/?h=gup-bio-v2

I need to finish that and run fstests on bunch of different fs before
posting. Dunno if i will have enough time to do that before LSF/MM.

Cheers,
Jérôme
Jan Kara April 18, 2019, 3:30 p.m. UTC | #22
On Thu 18-04-19 10:27:29, Jerome Glisse wrote:
> On Thu, Apr 18, 2019 at 12:42:05PM +0200, Jan Kara wrote:
> > On Wed 17-04-19 18:28:58, Jerome Glisse wrote:
> > > On Wed, Apr 17, 2019 at 02:53:28PM -0700, Dan Williams wrote:
> > > > On Tue, Apr 16, 2019 at 12:50 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > >
> > > > > On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote:
> > > > > > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > > > > > <kent.overstreet@gmail.com> wrote:
> > > > > > >
> > > > > > > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > > > > > > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > > > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > > > >
> > > > > > > > > This patchset depends on various small fixes [1] and also on patchset
> > > > > > > > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > > > > > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > > > > > > > so that it can get review and comments on how and what should be done
> > > > > > > > > to test things.
> > > > > > > > >
> > > > > > > > > For various reasons [2] [3] we want to track page reference through GUP
> > > > > > > > > differently than "regular" page reference. Thus we need to keep track
> > > > > > > > > of how we got a page within the block and fs layer. To do so this patch-
> > > > > > > > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > > > > > > > direct pointer to a page. This way we can flag page that are coming from
> > > > > > > > > GUP.
> > > > > > > > >
> > > > > > > > > This patchset is divided as follow:
> > > > > > > > >     - First part of the patchset is just small cleanup i believe they
> > > > > > > > >       can go in as his assuming people are ok with them.
> > > > > > > >
> > > > > > > >
> > > > > > > > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > > > > > > > >       done in multi-step, first we replace all direct dereference of
> > > > > > > > >       the field by call to inline helper, then we introduce macro for
> > > > > > > > >       bio_bvec that are initialized on the stack. Finaly we change the
> > > > > > > > >       bv_page field to bv_pfn.
> > > > > > > >
> > > > > > > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > > > > > > > as a flag (pointer always aligned to 64 bytes in our case).
> > > > > > > >
> > > > > > > > So yes we need an inline helper for reference of the page but is it not clearer
> > > > > > > > that we assume a page* and not any kind of pfn ?
> > > > > > > > It will not be the first place using low bits of a pointer for flags.
> > > > > > > >
> > > > > > > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > > > > > > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > > > > > > > can just submit them as two separate BIOs (chained at the block layer).
> > > > > > > >
> > > > > > > > Many users just submit one page bios and let elevator merge them any way.
> > > > > > >
> > > > > > > Let's please not add additional flags and weirdness to struct bio - "if this
> > > > > > > flag is set interpret one way, if not interpret another" - or eventually bios
> > > > > > > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.
> > > > > >
> > > > > > This all reminds of the failed attempt to teach the block layer to
> > > > > > operate without pages:
> > > > > >
> > > > > > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > > > >
> > > > > > >
> > > > > > > Question though - why do we need a flag for whether a page is a GUP page or not?
> > > > > > > Couldn't the needed information just be determined by what range the pfn is not
> > > > > > > (i.e. whether or not it has a struct page associated with it)?
> > > > > >
> > > > > > That amounts to a pfn_valid() check which is a bit heavier than if we
> > > > > > can store a flag in the bv_pfn entry directly.
> > > > > >
> > > > > > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > > > > > an 'unsigned long'.
> > > > > >
> > > > > > That said, I'm still in favor of Jan's proposal to just make the
> > > > > > bv_page semantics uniform. Otherwise we're complicating this core
> > > > > > infrastructure for some yet to be implemented GPU memory management
> > > > > > capabilities with yet to be determined value. Circle back when that
> > > > > > value is clear, but in the meantime fix the GUP bug.
> > > > >
> > > > > This has nothing to do with GPU, what make you think so ? Here i am
> > > > > trying to solve GUP and to keep the value of knowing wether a page
> > > > > has been GUP or not. I argue that if we bias every page in every bio
> > > > > then we loose that information and thus the value.
> > > > >
> > > > > I gave the page protection mechanisms as an example that would be
> > > > > impacted but it is not the only one. Knowing if a page has been GUP
> > > > > can be useful for memory reclaimation, compaction, NUMA balancing,
> > > > 
> > > > Right, this is what I was reacting to in your pushback to Jan's
> > > > proposal. You're claiming value for not doing the simple thing for
> > > > some future "may be useful in these contexts". To my knowledge those
> > > > things are not broken today. You're asking for the complexity to be
> > > > carried today for some future benefit, and I'm asking for the
> > > > simplicity to be maintained as much as possible today and let the
> > > > value of future changes stand on their own to push for more complexity
> > > > later.
> > > > 
> > > > Effectively don't use this bug fix to push complexity for a future
> > > > agenda where the value has yet to be quantified.
> > > 
> > > Except that this solution (biasing everyone in bio) would _more complex_
> > > it is only conceptualy appealing. The changes are on the other hand much
> > > deeper and much riskier but you decided to ignore that and focus on some-
> > > thing i was just giving as an example.
> > 
> > Yeah, after going and reading several places like fs/iomap.c, fs/mpage.c,
> > drivers/md/dm-io.c I agree with you. The places that are not doing direct
> > IO usually just don't hold any page reference that could be directly
> > attributed to the bio (and they don't drop it when bio finishes). They
> > rather use other means (like PageLocked, PageWriteback) to make sure the
> > page stays alive so mandating gup-pin reference for all pages attached to a
> > bio would require a lot of reworking of places that are not related to our
> > problem and currently work just fine. So I withdraw my suggestion. Nice in
> > theory, too much work in practice ;).
> 
> Have you seem Boaz proposal ? I have started it and it does not look to
> bad (but you knwo taste and color :)) You can take a peek:
> 
> https://cgit.freedesktop.org/~glisse/linux/log/?h=gup-bio-v2
> 
> I need to finish that and run fstests on bunch of different fs before
> posting. Dunno if i will have enough time to do that before LSF/MM.

Yes, I've seen it. I just wasn't sure how the result will look like. What
you have in your tree looks pretty clean so far. BTW (I know I'm repeating
myself ;) what if we made iov_iter_get_pages() & iov_iter_get_pages_alloc()
always return gup-pin reference? That would get rid of the need for two
ioend handlers for each call site...

								Honza
Jerome Glisse April 18, 2019, 3:36 p.m. UTC | #23
On Thu, Apr 18, 2019 at 05:30:47PM +0200, Jan Kara wrote:
> On Thu 18-04-19 10:27:29, Jerome Glisse wrote:
> > On Thu, Apr 18, 2019 at 12:42:05PM +0200, Jan Kara wrote:
> > > On Wed 17-04-19 18:28:58, Jerome Glisse wrote:
> > > > On Wed, Apr 17, 2019 at 02:53:28PM -0700, Dan Williams wrote:
> > > > > On Tue, Apr 16, 2019 at 12:50 PM Jerome Glisse <jglisse@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Apr 16, 2019 at 12:12:27PM -0700, Dan Williams wrote:
> > > > > > > On Tue, Apr 16, 2019 at 11:59 AM Kent Overstreet
> > > > > > > <kent.overstreet@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Apr 16, 2019 at 09:35:04PM +0300, Boaz Harrosh wrote:
> > > > > > > > > On Thu, Apr 11, 2019 at 05:08:19PM -0400, jglisse@redhat.com wrote:
> > > > > > > > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > > > > > > > >
> > > > > > > > > > This patchset depends on various small fixes [1] and also on patchset
> > > > > > > > > > which introduce put_user_page*() [2] and thus is 5.3 material as those
> > > > > > > > > > pre-requisite will get in 5.2 at best. Nonetheless i am posting it now
> > > > > > > > > > so that it can get review and comments on how and what should be done
> > > > > > > > > > to test things.
> > > > > > > > > >
> > > > > > > > > > For various reasons [2] [3] we want to track page reference through GUP
> > > > > > > > > > differently than "regular" page reference. Thus we need to keep track
> > > > > > > > > > of how we got a page within the block and fs layer. To do so this patch-
> > > > > > > > > > set change the bio_bvec struct to store a pfn and flags instead of a
> > > > > > > > > > direct pointer to a page. This way we can flag page that are coming from
> > > > > > > > > > GUP.
> > > > > > > > > >
> > > > > > > > > > This patchset is divided as follow:
> > > > > > > > > >     - First part of the patchset is just small cleanup i believe they
> > > > > > > > > >       can go in as his assuming people are ok with them.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >     - Second part convert bio_vec->bv_page to bio_vec->bv_pfn this is
> > > > > > > > > >       done in multi-step, first we replace all direct dereference of
> > > > > > > > > >       the field by call to inline helper, then we introduce macro for
> > > > > > > > > >       bio_bvec that are initialized on the stack. Finaly we change the
> > > > > > > > > >       bv_page field to bv_pfn.
> > > > > > > > >
> > > > > > > > > Why do we need a bv_pfn. Why not just use the lowest bit of the page-ptr
> > > > > > > > > as a flag (pointer always aligned to 64 bytes in our case).
> > > > > > > > >
> > > > > > > > > So yes we need an inline helper for reference of the page but is it not clearer
> > > > > > > > > that we assume a page* and not any kind of pfn ?
> > > > > > > > > It will not be the first place using low bits of a pointer for flags.
> > > > > > > > >
> > > > > > > > > That said. Why we need it at all? I mean why not have it as a bio flag. If it exist
> > > > > > > > > at all that a user has a GUP and none-GUP pages to IO at the same request he/she
> > > > > > > > > can just submit them as two separate BIOs (chained at the block layer).
> > > > > > > > >
> > > > > > > > > Many users just submit one page bios and let elevator merge them any way.
> > > > > > > >
> > > > > > > > Let's please not add additional flags and weirdness to struct bio - "if this
> > > > > > > > flag is set interpret one way, if not interpret another" - or eventually bios
> > > > > > > > will be as bad as skbuffs. I would much prefer just changing bv_page to bv_pfn.
> > > > > > >
> > > > > > > This all reminds of the failed attempt to teach the block layer to
> > > > > > > operate without pages:
> > > > > > >
> > > > > > > https://lore.kernel.org/lkml/20150316201640.33102.33761.stgit@dwillia2-desk3.amr.corp.intel.com/
> > > > > > >
> > > > > > > >
> > > > > > > > Question though - why do we need a flag for whether a page is a GUP page or not?
> > > > > > > > Couldn't the needed information just be determined by what range the pfn is not
> > > > > > > > (i.e. whether or not it has a struct page associated with it)?
> > > > > > >
> > > > > > > That amounts to a pfn_valid() check which is a bit heavier than if we
> > > > > > > can store a flag in the bv_pfn entry directly.
> > > > > > >
> > > > > > > I'd say create a new PFN_* flag, and make bv_pfn a 'pfn_t' rather than
> > > > > > > an 'unsigned long'.
> > > > > > >
> > > > > > > That said, I'm still in favor of Jan's proposal to just make the
> > > > > > > bv_page semantics uniform. Otherwise we're complicating this core
> > > > > > > infrastructure for some yet to be implemented GPU memory management
> > > > > > > capabilities with yet to be determined value. Circle back when that
> > > > > > > value is clear, but in the meantime fix the GUP bug.
> > > > > >
> > > > > > This has nothing to do with GPU, what make you think so ? Here i am
> > > > > > trying to solve GUP and to keep the value of knowing wether a page
> > > > > > has been GUP or not. I argue that if we bias every page in every bio
> > > > > > then we loose that information and thus the value.
> > > > > >
> > > > > > I gave the page protection mechanisms as an example that would be
> > > > > > impacted but it is not the only one. Knowing if a page has been GUP
> > > > > > can be useful for memory reclaimation, compaction, NUMA balancing,
> > > > > 
> > > > > Right, this is what I was reacting to in your pushback to Jan's
> > > > > proposal. You're claiming value for not doing the simple thing for
> > > > > some future "may be useful in these contexts". To my knowledge those
> > > > > things are not broken today. You're asking for the complexity to be
> > > > > carried today for some future benefit, and I'm asking for the
> > > > > simplicity to be maintained as much as possible today and let the
> > > > > value of future changes stand on their own to push for more complexity
> > > > > later.
> > > > > 
> > > > > Effectively don't use this bug fix to push complexity for a future
> > > > > agenda where the value has yet to be quantified.
> > > > 
> > > > Except that this solution (biasing everyone in bio) would _more complex_
> > > > it is only conceptualy appealing. The changes are on the other hand much
> > > > deeper and much riskier but you decided to ignore that and focus on some-
> > > > thing i was just giving as an example.
> > > 
> > > Yeah, after going and reading several places like fs/iomap.c, fs/mpage.c,
> > > drivers/md/dm-io.c I agree with you. The places that are not doing direct
> > > IO usually just don't hold any page reference that could be directly
> > > attributed to the bio (and they don't drop it when bio finishes). They
> > > rather use other means (like PageLocked, PageWriteback) to make sure the
> > > page stays alive so mandating gup-pin reference for all pages attached to a
> > > bio would require a lot of reworking of places that are not related to our
> > > problem and currently work just fine. So I withdraw my suggestion. Nice in
> > > theory, too much work in practice ;).
> > 
> > Have you seem Boaz proposal ? I have started it and it does not look to
> > bad (but you knwo taste and color :)) You can take a peek:
> > 
> > https://cgit.freedesktop.org/~glisse/linux/log/?h=gup-bio-v2
> > 
> > I need to finish that and run fstests on bunch of different fs before
> > posting. Dunno if i will have enough time to do that before LSF/MM.
> 
> Yes, I've seen it. I just wasn't sure how the result will look like. What
> you have in your tree looks pretty clean so far. BTW (I know I'm repeating
> myself ;) what if we made iov_iter_get_pages() & iov_iter_get_pages_alloc()
> always return gup-pin reference? That would get rid of the need for two
> ioend handlers for each call site...

No it would not everywhere, some of the callsite have more way to fill
the bio then just iov_iter_get_pages*() so i expect a good chunk of
the patches i have would stay the same because of that. So in the end
it might now simplify much maybe couple place.

Cheers,
Jérôme
Boaz Harrosh April 18, 2019, 3:56 p.m. UTC | #24
On 18/04/19 00:54, Dan Williams wrote:
<>
> 
> If it's not a pfn then it shouldn't be an unsigned long named "bv_pfn".
> 

Off course not:
	ulong bv_page_gup;


But I hope it is not needed at all

Thanks
Boaz
Dan Williams April 18, 2019, 6:03 p.m. UTC | #25
On Thu, Apr 18, 2019 at 3:42 AM Jan Kara <jack@suse.cz> wrote:
> > Except that this solution (biasing everyone in bio) would _more complex_
> > it is only conceptualy appealing. The changes are on the other hand much
> > deeper and much riskier but you decided to ignore that and focus on some-
> > thing i was just giving as an example.
>
> Yeah, after going and reading several places like fs/iomap.c, fs/mpage.c,
> drivers/md/dm-io.c I agree with you. The places that are not doing direct
> IO usually just don't hold any page reference that could be directly
> attributed to the bio (and they don't drop it when bio finishes). They
> rather use other means (like PageLocked, PageWriteback) to make sure the
> page stays alive so mandating gup-pin reference for all pages attached to a
> bio would require a lot of reworking of places that are not related to our
> problem and currently work just fine. So I withdraw my suggestion. Nice in
> theory, too much work in practice ;).

Is it though? We already have BIO_NO_PAGE_REF, so it seems it would be
a useful cleanup to have all locations that don't participate in page
references use that existing flag and then teach all other locations
to use gup-pinned pages.