Message ID | 20160823220419.11717-3-ross.zwisler@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Ross, can you take at my (fully working, but not fully cleaned up) version of the iomap based DAX code here: http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/iomap-dax By using iomap we don't even have the size hole problem and totally get out of the reverse-engineer what buffer_heads are trying to tell us business. It also gets rid of the other warts of the DAX path due to pretending to be like direct I/O, so this might be a better way forward also for ext2/4. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 25, 2016 at 12:57:28AM -0700, Christoph Hellwig wrote: > Hi Ross, > > can you take at my (fully working, but not fully cleaned up) version > of the iomap based DAX code here: > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/iomap-dax > > By using iomap we don't even have the size hole problem and totally > get out of the reverse-engineer what buffer_heads are trying to tell > us business. It also gets rid of the other warts of the DAX path > due to pretending to be like direct I/O, so this might be a better > way forward also for ext2/4. Sure, I'll take a look. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 25, 2016 at 12:57:28AM -0700, Christoph Hellwig wrote: > Hi Ross, > > can you take at my (fully working, but not fully cleaned up) version > of the iomap based DAX code here: > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/iomap-dax > > By using iomap we don't even have the size hole problem and totally > get out of the reverse-engineer what buffer_heads are trying to tell > us business. It also gets rid of the other warts of the DAX path > due to pretending to be like direct I/O, so this might be a better > way forward also for ext2/4. In general I agree that the usage of struct iomap seems more straightforward than the old way of using struct buffer_head + get_block_t. I really don't think we want to have two competing DAX I/O and fault paths, though, which I assume everyone else agrees with as well. These changes don't remove the things in XFS needed by the old I/O and fault paths (e.g. xfs_get_blocks_direct() is still there an unchanged). Is the correct way forward to get buy-in from ext2/ext4 so that they also move to supporting an iomap based I/O path (xfs_file_iomap_begin(), xfs_iomap_write_direct(), etc?). That would allow us to have parallel I/O and fault paths for a while, then remove the old buffer_head based versions when the three supported filesystems have moved to iomap. If ext2 and ext4 don't choose to move to iomap, though, I don't think we want to have a separate I/O & fault path for iomap/XFS. That seems too painful, and the old buffer_head version should continue to work, ugly as it may be. Assuming we can get buy-in from ext4/ext2, I can work on a PMD version of the iomap based fault path that is equivalent to the buffer_head based one I sent out in my series, and we can all eventually move to that. A few comments/questions on the implementation: 1) In your mail above you say "It also gets rid of the other warts of the DAX path due to pretending to be like direct I/O". I assume by this you mean the code in dax_do_io() around DIO_LOCKING, inode_dio_begin(), etc? Perhaps there are other things as well in XFS, but this is what I see in the DAX code. If so, yep, this seems like a win. I don't understand how DIO_LOCKING is relevant to the DAX I/O path, as we never mix buffered and direct access. The comment in dax_do_io() for the inode_dio_begin() call says that it prevents the I/O from races with truncate. Am I correct that we now get this protection via the xfs_rw_ilock()/xfs_rw_iunlock() calls in xfs_file_dax_write()? 2) Just a nit, I noticed that you used "~(PAGE_SIZE - 1)" in several places in iomap_dax_actor() and iomap_dax_fault() instead of PAGE_MASK. Was this intentional? 3) It's kind of weird having iomap_dax_fault() in fs/dax.c but having iomap_dax_actor() and iomap_dax_rw() in fs/iomap.c? I'm guessing the latter is placed where it is because it uses iomap_apply(), which is local to fs/iomap.c? Anyway, it would be nice if we could keep them together, if possible. 4) In iomap_dax_actor() you do this check: WARN_ON_ONCE(iomap->type != IOMAP_MAPPED); If we hit this we should bail with -EIO, yea? Otherwise we could write to unmapped space or something horrible. 5) In iomap_dax_fault, I think the "I/O beyond the end of the file" check might have been broken. Take for example an I/O to the second page of a file, where the file has size one page. So: vmf->pgoff = 1 i_size_read(inode) = 4096 Here's the old code in dax_fault(): size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; if (vmf->pgoff >= size) return VM_FAULT_SIGBUS; size = (4096 + 4096 - 1) >> PAGE_SHIFT = 1 vmf->pgoff is 1 and size is 1, so we return SIGBUS Here's the new code: if (pos >= i_size_read(inode) + PAGE_SIZE - 1) return VM_FAULT_SIGBUS; pos = vmf->pgoff << PAGE_SHIFT = 4096 i_size_read(inode) + PAGE_SIZE - 1 = 8193 so, 'pos' isn't >= where we calculate the end of the file to be, so we do I/O Basically the old check did the "+ PAGE_SIZE - 1" so that the >> PAGE_SHIFT was sure to round up to the next full page. You don't need this with your current logic, so I think the test should just be: if (pos >= i_size_read(inode)) return VM_FAULT_SIGBUS; Right? 6) Regarding the "we don't even have the size hole problem" comment in your mail, the current PMD logic requires us to know the size of the hole. This is important so that we can fault in a huge zero page if we have a 2 MiB hole. It's fine if that 2 MiB page then gets fragmented into 4k DAX allocations when we start to do writes, but the path the other way doesn't work. If we don't know the size of holes then we can't fault in a 2 MiB zero page, so we'll use 4k zero pages to satisfy reads. This means that if later we want to fault in a 2MiB DAX allocation, we don't have a single entry that we can use to lock the entire 2MiB range while we clean the radix tree an unmap the range from all the user processes. With the current PMD logic this will mean that if someone does a 4k read that faults in a 4k zero page, we will only use 4k faults for that range and won't use PMDs. The current XFS code in the v4.8 tree tells me the size of the hole, and I think we need to keep this functionality. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Aug 26, 2016 at 03:29:34PM -0600, Ross Zwisler wrote: > On Thu, Aug 25, 2016 at 12:57:28AM -0700, Christoph Hellwig wrote: > > Hi Ross, > > > > can you take at my (fully working, but not fully cleaned up) version > > of the iomap based DAX code here: > > > > http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/iomap-dax > > > > By using iomap we don't even have the size hole problem and totally > > get out of the reverse-engineer what buffer_heads are trying to tell > > us business. It also gets rid of the other warts of the DAX path > > due to pretending to be like direct I/O, so this might be a better > > way forward also for ext2/4. > > In general I agree that the usage of struct iomap seems more straightforward > than the old way of using struct buffer_head + get_block_t. I really don't > think we want to have two competing DAX I/O and fault paths, though, which I > assume everyone else agrees with as well. We'll be moving XFS this way, regardless of whether the generic DAX code goes that way or not. iomap is a much cleaner, more efficient interface than get_blocks via bufferheads. We are slowly removing bufferheads from XFS so anything that uses them or depends on them that Xfs requires is going to have an iomap-based variant written for it. Christoph is doing the hard yards to make iomap a VFS level interface because that's a) the most efficient way to implement it, and b) it's the right place for IO path extent mapping abstractions. So there will be a iomap path for DAX, just like there will be a iomap path for direct IO, regardless of what other filesystems implement. i.e. other filesystems can move to the more efficient iomap infrastructure if they want, but we can't force them to do so. As such, the generic DAX path can either remain as it is, or we can move to iomap and use wrappers for converting get_block() + bufferehead to iomaps on non-iomap filesystems. (i.e. similar to the existing iomap_to_bh() for allowing iomap lookups to be used to replace bufferheads returned by get_block().) I'd much prefer we move DAX to iomaps before there is wider uptake of it in other filesystems - I've been saying we should use iomaps for DAX right from the start. Now we have the iomap infrastructure in place we should jump straight to it. If we have to drag ext4 kicking and screaming into the 1990s to get there then so be it - it won't be the first time... > These changes don't remove the things in XFS needed by the old I/O and fault > paths (e.g. xfs_get_blocks_direct() is still there an unchanged). Is the Yes, they'll remain until their functionality has been replaced by iomap functions. e.g. xfs_get_blocks_direct() can't be removed until the direct IO path has an iomap interface. .... > 6) Regarding the "we don't even have the size hole problem" comment in your > mail, the current PMD logic requires us to know the size of the hole. This .... > The current XFS code in the v4.8 tree tells me the size of the hole, and I > think we need to keep this functionality. IOMAP_HOLE extents. It's a requirement of the iomap infrastructure that the filesystem reports hole extents in full for the range being mapped. Cheers, Dave.
On Fri, Aug 26, 2016 at 03:29:34PM -0600, Ross Zwisler wrote: > These changes don't remove the things in XFS needed by the old I/O and fault > paths (e.g. xfs_get_blocks_direct() is still there an unchanged). Is the > correct way forward to get buy-in from ext2/ext4 so that they also move to > supporting an iomap based I/O path (xfs_file_iomap_begin(), > xfs_iomap_write_direct(), etc?). That would allow us to have parallel I/O and > fault paths for a while, then remove the old buffer_head based versions when > the three supported filesystems have moved to iomap. > > If ext2 and ext4 don't choose to move to iomap, though, I don't think we want > to have a separate I/O & fault path for iomap/XFS. That seems too painful, > and the old buffer_head version should continue to work, ugly as it may be. We're going to move forward killing buffer_heads in XFS. I think ext4 would dramatically benefit from this a well, as would ext2 (although I think all that DAX work in ext2 is a horrible idea to start with). If I don't get buy-in for the iomap DAX work in the dax code we'll just have to keep it separate. That buffer_head mess just isn't maintainable the long run. > 1) In your mail above you say "It also gets rid of the other warts of the DAX > path due to pretending to be like direct I/O". I assume by this you mean > the code in dax_do_io() around DIO_LOCKING, inode_dio_begin(), etc? Yes. > Perhaps there are other things as well in XFS, but this is what I see in > the DAX code. If so, yep, this seems like a win. I don't understand how > DIO_LOCKING is relevant to the DAX I/O path, as we never mix buffered and > direct access. It's related to doing stupid copy and paste from direct I/O in the DAX code. > The comment in dax_do_io() for the inode_dio_begin() call says that it > prevents the I/O from races with truncate. Am I correct that we now get > this protection via the xfs_rw_ilock()/xfs_rw_iunlock() calls in > xfs_file_dax_write()? Yes, XFS always has a lock over reads that serializes with truncate. Currenrly it's the XFS i_iolock, but I'll remove that soon and use the VFS i_rwsem instead. For ext2/4 we could go straight to i_rwsem in shared mode. > 2) Just a nit, I noticed that you used "~(PAGE_SIZE - 1)" in several places in > iomap_dax_actor() and iomap_dax_fault() instead of PAGE_MASK. Was this > intentional? Mostly because that's how I think. I'm fine using PAGE_MASK, though. > 3) It's kind of weird having iomap_dax_fault() in fs/dax.c but having > iomap_dax_actor() and iomap_dax_rw() in fs/iomap.c? I'm guessing the > latter is placed where it is because it uses iomap_apply(), which is local > to fs/iomap.c? Anyway, it would be nice if we could keep them together, if > possible. It's still work in progress and could use a few cleanups. > > 4) In iomap_dax_actor() you do this check: > > WARN_ON_ONCE(iomap->type != IOMAP_MAPPED); > > If we hit this we should bail with -EIO, yea? Otherwise we could write to > unmapped space or something horrible. Fine with me. > 5) In iomap_dax_fault, I think the "I/O beyond the end of the file" check > might have been broken. Take for example an I/O to the second page of a > file, where the file has size one page. So: sure, I can fix this up. > 6) Regarding the "we don't even have the size hole problem" comment in your > mail, the current PMD logic requires us to know the size of the hole. And a big part of the iomap interface is proper reporting of holes. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 29, 2016 at 12:41:16AM -0700, Christoph Hellwig wrote: > > We're going to move forward killing buffer_heads in XFS. I think ext4 > would dramatically benefit from this a well, as would ext2 (although I > think all that DAX work in ext2 is a horrible idea to start with). It's been on my todo list. The only reason why I haven't done it yet is because I knew you were working on a solution, and I didn't want to do things one way for buffered I/O, and a different way for Direct I/O, and disentangling the DIO code and the different assumptions of how different file systems interact with the DIO code is a *mess*. It may have gotten better more recently, but a few years ago I took a look at it and backed slowly away..... - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 29, 2016 at 08:57:41AM -0400, Theodore Ts'o wrote: > It's been on my todo list. The only reason why I haven't done it yet > is because I knew you were working on a solution, and I didn't want to > do things one way for buffered I/O, and a different way for Direct > I/O, and disentangling the DIO code and the different assumptions of > how different file systems interact with the DIO code is a *mess*. It is. I have an almost working iomap direct I/O implementation, which simplifies a lot of this. Partially because it expects sane locking (a shared lock should be held over read, fortunately we now have i_rwsem for that) and allows less opt-in/out behavior, and partially just because we have so much better infrastructure now (iomap, bio chaining, iov_iters, ...). The iomap version is still work in progress, but I'm about to post a cut down version for block devices which reduces I/O latency by 20%. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Aug 29, 2016 at 08:57:41AM -0400, Theodore Ts'o wrote: > On Mon, Aug 29, 2016 at 12:41:16AM -0700, Christoph Hellwig wrote: > > > > We're going to move forward killing buffer_heads in XFS. I think ext4 > > would dramatically benefit from this a well, as would ext2 (although I > > think all that DAX work in ext2 is a horrible idea to start with). > > It's been on my todo list. The only reason why I haven't done it yet > is because I knew you were working on a solution, and I didn't want to > do things one way for buffered I/O, and a different way for Direct > I/O, and disentangling the DIO code and the different assumptions of > how different file systems interact with the DIO code is a *mess*. > > It may have gotten better more recently, but a few years ago I took a > look at it and backed slowly away..... Ted, what do you think of the idea of moving to struct iomap in ext2? If ext2 stays with the current struct buffer_head + get_block_t interface, then it looks like DAX basically has three options: 1) Support two I/O paths and two versions of each of the fault paths (PTE, PMD, etc). One of each of these would be based on struct iomap and would be used by xfs and potentially ext4, and the other would be based on struct buffer_head + get_block_t and would be used by ext2. 2) Only have a single struct iomap based I/O path and fault path, and add shim/support code so that ext2 can use it, leaving the rest of ext2 to be struct buffer_head + get_block_t based. 3) Only have a single struct buffer_head + get_block_t based DAX I/O and fault path, and have XFS and potentially ext4 do the translation from their native struct iomap interface. It seems ideal for ext2 to switch along with everyone else, if getting rid of struct buffer_head is a global goal. If not, I guess barring technical issues #2 above seems cleanest - move DAX to the new structure, and provide backwards compatibility to ext2. Thoughts? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I feel like we're not only building on shifting sands, but we haven't decided whether we're building a Pyramid or a Sphinx. I thought after Storage Summit, we had broad agreement that we were moving to a primary DAX API that was not BH (nor indeed iomap) based. We would still have DAX helpers for block based filesystems (because duplicating all that code between filesystems is pointless), but I now know of three filesystems which are not block based that are interested in using DAX. Jared Hulbert's AXFS is a nice public example. I posted a prototype of this here: https://groups.google.com/d/msg/linux.kernel/xFFHVCQM7Go/ZQeDVYTnFgAJ It is, of course, woefully out of date, but some of the principles in it are still good (and I'm working to split it into digestible chunks). The essence: 1. VFS or VM calls filesystem (eg ->fault()) 2. Filesystem calls DAX (eg dax_fault()) 3. DAX looks in radix tree, finds no information. 4. DAX calls (NEW!) mapping->a_ops->populate_pfns 5a. Filesystem (if not block based) does its own thing to find out the PFNs corresponding to the requested range, then inserts them into the radix tree (possible helper in DAX code) 5b. Filesystem (if block based) looks up its internal data structure (eg extent tree) and calls dax_create_pfns() (see giant patch from yesterday, only instead of passing a get_block_t, the filesystem has already filled in a bh which describes the entire extent that this access happens to land in). 6b. DAX takes care of calling bdev_direct_access() from dax_create_pfns(). Now, notice that there's no interaction with the rest of the filesystem here. We can swap out BHs and iomaps relatively trivially; there's no call for making grand changes, like converting ext2 over to iomap. The BH or iomap is only used for communicating the extent from the filesystem to DAX. Do we have agreement that this is the right way to go? -----Original Message----- From: Ross Zwisler [mailto:ross.zwisler@linux.intel.com] Sent: Friday, September 9, 2016 12:48 PM To: Theodore Ts'o <tytso@mit.edu>; Christoph Hellwig <hch@infradead.org>; Ross Zwisler <ross.zwisler@linux.intel.com>; linux-kernel@vger.kernel.org; Andrew Morton <akpm@linux-foundation.org>; linux-nvdimm@ml01.01.org; Matthew Wilcox <mawilcox@microsoft.com>; Dave Chinner <david@fromorbit.com>; linux-mm@kvack.org; Andreas Dilger <adilger.kernel@dilger.ca>; Alexander Viro <viro@zeniv.linux.org.uk>; Jan Kara <jack@suse.com>; linux-fsdevel@vger.kernel.org; linux-ext4@vger.kernel.org Subject: Re: [PATCH v2 2/9] ext2: tell DAX the size of allocation holes On Mon, Aug 29, 2016 at 08:57:41AM -0400, Theodore Ts'o wrote: > On Mon, Aug 29, 2016 at 12:41:16AM -0700, Christoph Hellwig wrote: > > > > We're going to move forward killing buffer_heads in XFS. I think ext4 > > would dramatically benefit from this a well, as would ext2 (although I > > think all that DAX work in ext2 is a horrible idea to start with). > > It's been on my todo list. The only reason why I haven't done it yet > is because I knew you were working on a solution, and I didn't want to > do things one way for buffered I/O, and a different way for Direct > I/O, and disentangling the DIO code and the different assumptions of > how different file systems interact with the DIO code is a *mess*. > > It may have gotten better more recently, but a few years ago I took a > look at it and backed slowly away..... Ted, what do you think of the idea of moving to struct iomap in ext2? If ext2 stays with the current struct buffer_head + get_block_t interface, then it looks like DAX basically has three options: 1) Support two I/O paths and two versions of each of the fault paths (PTE, PMD, etc). One of each of these would be based on struct iomap and would be used by xfs and potentially ext4, and the other would be based on struct buffer_head + get_block_t and would be used by ext2. 2) Only have a single struct iomap based I/O path and fault path, and add shim/support code so that ext2 can use it, leaving the rest of ext2 to be struct buffer_head + get_block_t based. 3) Only have a single struct buffer_head + get_block_t based DAX I/O and fault path, and have XFS and potentially ext4 do the translation from their native struct iomap interface. It seems ideal for ext2 to switch along with everyone else, if getting rid of struct buffer_head is a global goal. If not, I guess barring technical issues #2 above seems cleanest - move DAX to the new structure, and provide backwards compatibility to ext2. Thoughts? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
/me grumbles about top-posting... On Fri, Sep 9, 2016 at 1:35 PM, Matthew Wilcox <mawilcox@microsoft.com> wrote: > I feel like we're not only building on shifting sands, but we haven't decided whether we're building a Pyramid or a Sphinx. > > I thought after Storage Summit, we had broad agreement that we were moving to a primary DAX API that was not BH (nor indeed iomap) based. We would still have DAX helpers for block based filesystems (because duplicating all that code between filesystems is pointless), but I now know of three filesystems which are not block based that are interested in using DAX. Jared Hulbert's AXFS is a nice public example. > > I posted a prototype of this here: > > https://groups.google.com/d/msg/linux.kernel/xFFHVCQM7Go/ZQeDVYTnFgAJ > > It is, of course, woefully out of date, but some of the principles in it are still good (and I'm working to split it into digestible chunks). > > The essence: > > 1. VFS or VM calls filesystem (eg ->fault()) > 2. Filesystem calls DAX (eg dax_fault()) > 3. DAX looks in radix tree, finds no information. > 4. DAX calls (NEW!) mapping->a_ops->populate_pfns > 5a. Filesystem (if not block based) does its own thing to find out the PFNs corresponding to the requested range, then inserts them into the radix tree (possible helper in DAX code) > 5b. Filesystem (if block based) looks up its internal data structure (eg extent tree) and > calls dax_create_pfns() (see giant patch from yesterday, only instead of > passing a get_block_t, the filesystem has already filled in a bh which > describes the entire extent that this access happens to land in). > 6b. DAX takes care of calling bdev_direct_access() from dax_create_pfns(). > > Now, notice that there's no interaction with the rest of the filesystem here. We can swap out BHs and iomaps relatively trivially; there's no call for making grand changes, like converting ext2 over to iomap. The BH or iomap is only used for communicating the extent from the filesystem to DAX. > > Do we have agreement that this is the right way to go? My $0.02... So the current dax implementation is still struggling to get right (pmd faulting, dirty entry cleaning, etc) and this seems like a rewrite that sets us up for future features without addressing the current bugs and todo items. In comparison the iomap conversion work seems incremental and conserving of current development momentum. I agree with you that continuing to touch ext2 is not a good idea, but I'm not yet convinced that now is the time to go do dax-2.0 when we haven't finished shipping dax-1.0. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
The mail is basically unparsable (hint: you can use a sane mailer even with exchange servers :)). Either way we need to get rid of buffer_heads, and another aop that is entirely caller specific is unaceptable. That being said your idea doesn't sounds unreasonable, but will require a bit more work and has no real short-term need. So let's revisit the idea once you have patches to post and move forward with the more urgent needs for now. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Sep 09, 2016 at 03:34:43PM -0700, Dan Williams wrote: > I agree with you that continuing to touch ext2 is not a good idea, but > I'm not yet convinced that now is the time to go do dax-2.0 when we > haven't finished shipping dax-1.0. I've mentioned this before, but I'd like to repeat it. With all the work reqwuired in the file system I would prefer to drop DAX support in ext2 (and if people really cry for it reinstate the trivial old xip support). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Christoph Hellwig [mailto:hch@infradead.org] > The mail is basically unparsable (hint: you can use a sane mailer even with > exchange servers :)). That rather depends on how the Exchange servers are configured ... this isn't the appropriate place to discuss IT issues though. > Either way we need to get rid of buffer_heads, and another aop that is entirely > caller specific is unaceptable. That being said your idea doesn't sounds > unreasonable, but will require a bit more work and has no real short-term > need. So your proposal is to remove buffer_heads from ext2? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Sep 10, 2016 at 07:33:18AM +0000, Matthew Wilcox wrote: > > caller specific is unaceptable. That being said your idea doesn't sounds > > unreasonable, but will require a bit more work and has no real short-term > > need. > > So your proposal is to remove buffer_heads from ext2? No, the proposal is to remove buffer_heads from XFS first, then GFS2 and then maybe others like ext4. I'd like to remove buffer_heads from the DAX path for ext2 and ext4 entitrely for sure (and direct I/O next). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Christoph Hellwig [mailto:hch@infradead.org] > On Fri, Sep 09, 2016 at 03:34:43PM -0700, Dan Williams wrote: > > I agree with you that continuing to touch ext2 is not a good idea, but > > I'm not yet convinced that now is the time to go do dax-2.0 when we > > haven't finished shipping dax-1.0. > > I've mentioned this before, but I'd like to repeat it. With all the work reqwuired > in the file system I would prefer to drop DAX support in ext2 (and if people > really cry for it reinstate the trivial old xip support). That allegedly trivial old xip support was horrendously broken. And, er, it used an aop which you seem implacably opposed to in your earlier email. And that was truly a disgusting one from a layering point of view. Let me remind you: - int (*get_xip_mem)(struct address_space *, pgoff_t, int, - void **, unsigned long *); That void ** was an 'out' parameter to store a kernel address for the memory. The unsigned long * was also an 'out' parameter to store the PFN for the memory. The 'int' was actually a Boolean for whether to create or not, but you'd actually have to go look at the implementation to find that out; the documentation never said it. A real dog's breakfast of an API. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Christoph Hellwig [mailto:hch@infradead.org] > On Sat, Sep 10, 2016 at 07:33:18AM +0000, Matthew Wilcox wrote: > > > caller specific is unaceptable. That being said your idea doesn't > > > sounds unreasonable, but will require a bit more work and has no > > > real short-term need. > > > > So your proposal is to remove buffer_heads from ext2? > > No, the proposal is to remove buffer_heads from XFS first, then GFS2 and then > maybe others like ext4. I'd like to remove buffer_heads from the DAX path for > ext2 and ext4 entitrely for sure (and direct I/O next). That's ... what I propose. The only use of buffer_head in my proposal is to communicate a single extent from the filesystem to the DAX core, and that can equally well use an iomap. Ross seems to think that converting the current DAX code over to using iomap requires converting all of ext2 away from buffer_head; are you saying he's wrong? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RnJvbTogRGFuIFdpbGxpYW1zIFttYWlsdG86ZGFuLmoud2lsbGlhbXNAaW50ZWwuY29tXQ0KPiAv bWUgZ3J1bWJsZXMgYWJvdXQgdG9wLXBvc3RpbmcuLi4NCg0KTGV0J3Mgc2VlIGlmIHRoaXMgZG9l cyBhbnkgYmV0dGVyIC4uIHRoZXJlJ3MgbG90cyBvZiBuZXcgZmVhdHVyZXMsIGJ1dCBJIGRvbid0 IHNlZSBhICd3cmFwIGxpbmVzIGF0IDgwIGNvbHVtbnMnIG9wdGlvbi4gIFVuZm9ydHVuYXRlbHku DQoNCj4gT24gRnJpLCBTZXAgOSwgMjAxNiBhdCAxOjM1IFBNLCBNYXR0aGV3IFdpbGNveCA8bWF3 aWxjb3hAbWljcm9zb2Z0LmNvbT4NCj4gd3JvdGU6DQo+ID4gSSB0aG91Z2h0IGFmdGVyIFN0b3Jh Z2UgU3VtbWl0LCB3ZSBoYWQgYnJvYWQgYWdyZWVtZW50IHRoYXQgd2Ugd2VyZQ0KPiA+IG1vdmlu ZyB0byBhIHByaW1hcnkgREFYIEFQSSB0aGF0IHdhcyBub3QgQkggKG5vciBpbmRlZWQgaW9tYXAp IGJhc2VkLiAgV2UNCj4gPiB3b3VsZCBzdGlsbCBoYXZlIERBWCBoZWxwZXJzIGZvciBibG9jayBi YXNlZCBmaWxlc3lzdGVtcyAoYmVjYXVzZSBkdXBsaWNhdGluZw0KPiA+IGFsbCB0aGF0IGNvZGUg YmV0d2VlbiBmaWxlc3lzdGVtcyBpcyBwb2ludGxlc3MpLCBidXQgSSBub3cga25vdyBvZiB0aHJl ZQ0KPiA+IGZpbGVzeXN0ZW1zIHdoaWNoIGFyZSBub3QgYmxvY2sgYmFzZWQgdGhhdCBhcmUgaW50 ZXJlc3RlZCBpbiB1c2luZyBEQVguICBKYXJlZA0KPiA+IEh1bGJlcnQncyBBWEZTIGlzIGEgbmlj ZSBwdWJsaWMgZXhhbXBsZS4NCj4gPg0KPiA+IEkgcG9zdGVkIGEgcHJvdG90eXBlIG9mIHRoaXMg aGVyZToNCj4gPg0KPiA+DQo+IGh0dHBzOi8vZ3JvdXBzLmdvb2dsZS5jb20vZC9tc2cvbGludXgu a2VybmVsL3hGRkhWQ1FNN0dvL1pRZURWWVRuRmdBSg0KPiA+DQo+ID4gSXQgaXMsIG9mIGNvdXJz ZSwgd29lZnVsbHkgb3V0IG9mIGRhdGUsIGJ1dCBzb21lIG9mIHRoZSBwcmluY2lwbGVzIGluIGl0 IGFyZSBzdGlsbA0KPiBnb29kIChhbmQgSSdtIHdvcmtpbmcgdG8gc3BsaXQgaXQgaW50byBkaWdl c3RpYmxlIGNodW5rcykuDQo+ID4NCj4gPiBUaGUgZXNzZW5jZToNCj4gPg0KPiA+IDEuIFZGUyBv ciBWTSBjYWxscyBmaWxlc3lzdGVtIChlZyAtPmZhdWx0KCkpIDIuIEZpbGVzeXN0ZW0gY2FsbHMg REFYDQo+ID4gKGVnIGRheF9mYXVsdCgpKSAzLiBEQVggbG9va3MgaW4gcmFkaXggdHJlZSwgZmlu ZHMgbm8gaW5mb3JtYXRpb24uDQo+ID4gNC4gREFYIGNhbGxzIChORVchKSBtYXBwaW5nLT5hX29w cy0+cG9wdWxhdGVfcGZucyA1YS4gRmlsZXN5c3RlbSAoaWYNCj4gPiBub3QgYmxvY2sgYmFzZWQp IGRvZXMgaXRzIG93biB0aGluZyB0byBmaW5kIG91dCB0aGUgUEZOcyBjb3JyZXNwb25kaW5nDQo+ ID4gdG8gdGhlIHJlcXVlc3RlZCByYW5nZSwgdGhlbiBpbnNlcnRzIHRoZW0gaW50byB0aGUgcmFk aXggdHJlZSAocG9zc2libGUgaGVscGVyDQo+IGluIERBWCBjb2RlKSA1Yi4gRmlsZXN5c3RlbSAo aWYgYmxvY2sgYmFzZWQpIGxvb2tzIHVwIGl0cyBpbnRlcm5hbCBkYXRhIHN0cnVjdHVyZQ0KPiAo ZWcgZXh0ZW50IHRyZWUpIGFuZA0KPiA+ICAgIGNhbGxzIGRheF9jcmVhdGVfcGZucygpIChzZWUg Z2lhbnQgcGF0Y2ggZnJvbSB5ZXN0ZXJkYXksIG9ubHkgaW5zdGVhZCBvZg0KPiA+ICAgIHBhc3Np bmcgYSBnZXRfYmxvY2tfdCwgdGhlIGZpbGVzeXN0ZW0gaGFzIGFscmVhZHkgZmlsbGVkIGluIGEg Ymggd2hpY2gNCj4gPiAgICBkZXNjcmliZXMgdGhlIGVudGlyZSBleHRlbnQgdGhhdCB0aGlzIGFj Y2VzcyBoYXBwZW5zIHRvIGxhbmQgaW4pLg0KPiA+IDZiLiBEQVggdGFrZXMgY2FyZSBvZiBjYWxs aW5nIGJkZXZfZGlyZWN0X2FjY2VzcygpIGZyb20gZGF4X2NyZWF0ZV9wZm5zKCkuDQo+ID4NCj4g PiBOb3csIG5vdGljZSB0aGF0IHRoZXJlJ3Mgbm8gaW50ZXJhY3Rpb24gd2l0aCB0aGUgcmVzdCBv ZiB0aGUgZmlsZXN5c3RlbSBoZXJlLg0KPiBXZSBjYW4gc3dhcCBvdXQgQkhzIGFuZCBpb21hcHMg cmVsYXRpdmVseSB0cml2aWFsbHk7IHRoZXJlJ3Mgbm8gY2FsbCBmb3IgbWFraW5nDQo+IGdyYW5k IGNoYW5nZXMsIGxpa2UgY29udmVydGluZyBleHQyIG92ZXIgdG8gaW9tYXAuICBUaGUgQkggb3Ig aW9tYXAgaXMgb25seQ0KPiB1c2VkIGZvciBjb21tdW5pY2F0aW5nIHRoZSBleHRlbnQgZnJvbSB0 aGUgZmlsZXN5c3RlbSB0byBEQVguDQo+ID4NCj4gPiBEbyB3ZSBoYXZlIGFncmVlbWVudCB0aGF0 IHRoaXMgaXMgdGhlIHJpZ2h0IHdheSB0byBnbz8NCj4gDQo+IE15ICQwLjAyLi4uDQo+IA0KPiBT byB0aGUgY3VycmVudCBkYXggaW1wbGVtZW50YXRpb24gaXMgc3RpbGwgc3RydWdnbGluZyB0byBn ZXQgcmlnaHQgKHBtZCBmYXVsdGluZywNCj4gZGlydHkgZW50cnkgY2xlYW5pbmcsIGV0YykgYW5k IHRoaXMgc2VlbXMgbGlrZSBhIHJld3JpdGUgdGhhdCBzZXRzIHVzIHVwIGZvciBmdXR1cmUNCj4g ZmVhdHVyZXMgd2l0aG91dCBhZGRyZXNzaW5nIHRoZSBjdXJyZW50IGJ1Z3MgYW5kIHRvZG8gaXRl bXMuICBJbiBjb21wYXJpc29uDQo+IHRoZSBpb21hcCBjb252ZXJzaW9uIHdvcmsgc2VlbXMgaW5j cmVtZW50YWwgYW5kIGNvbnNlcnZpbmcgb2YgY3VycmVudA0KPiBkZXZlbG9wbWVudCBtb21lbnR1 bS4NCg0KSSBiZWxpZXZlIHlvdXIgYXNzZXNzbWVudCBpcyBpbmNvcnJlY3QuICBJZiBjb252ZXJ0 aW5nIHRoZSBjdXJyZW50IERBWCBjb2RlIHRvDQp1c2UgaW9tYXAgZm9yY2VzIGNvbnZlcnRpbmcg ZXh0MiwgdGhlbiBpdCdzIHRpbWUgdG8gZ2V0IHJpZCBvZiBhbGwgdGhlIGhhbGYtbWVhc3VyZXMN CmN1cnJlbnRseSBpbiBwbGFjZS4gIFlvdSBsZWZ0IG9mZiBvbmUgdG9kbyBpdGVtIHRoYXQgdGhp cyBkb2VzIGdldCB1cyBhIHN0ZXAgY2xvc2VyIHRvDQpmaXhpbmcgLS0gc3VwcG9ydCBmb3IgRE1B IHRvIG1tYXBlZCBEQVggZmlsZXMuICBJIHRoaW5rIGl0IGFsc28gcHV0cyB1cyBpbiBhIGJldHRl cg0KcG9zaXRpb24gdG8gZml4IHRoZSAyTUIgc3VwcG9ydCwgbG9ja2luZywgYW5kIGRpcnRpbmVz cyB0cmFja2luZy4gIE9oLCBhbmQgaXQgZG9lcw0KZml4IHRoZSBtdWx0aXZvbHVtZSBzdXBwb3J0 IChiZWNhdXNlIHRoZSBzZWN0b3JzIGluIHRoZSByYWRpeCB0cmVlIGNvdWxkIGJlDQppbnRlcnBy ZXRlZCBhcyBiZWluZyBmcm9tIHRoZSB3cm9uZyB2b2x1bWUpLg0KDQo+IEkgYWdyZWUgd2l0aCB5 b3UgdGhhdCBjb250aW51aW5nIHRvIHRvdWNoIGV4dDIgaXMgbm90IGEgZ29vZCBpZGVhLCBidXQg SSdtIG5vdA0KPiB5ZXQgY29udmluY2VkIHRoYXQgbm93IGlzIHRoZSB0aW1lIHRvIGdvIGRvIGRh eC0yLjAgd2hlbiB3ZSBoYXZlbid0IGZpbmlzaGVkDQo+IHNoaXBwaW5nIGRheC0xLjAuDQoNCmRh eC0xLjAgZGllZCBsb25nIGFnbyAuLi4gSSB0aGluayB3ZSdyZSB1cCB0byBhdCBsZWFzdCBEQVgg dmVyc2lvbiA0IGJ5IG5vdy4NCg0K -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Sep 10, 2016 at 1:15 AM, Matthew Wilcox <mawilcox@microsoft.com> wrote: > From: Dan Williams [mailto:dan.j.williams@intel.com] >> /me grumbles about top-posting... > > Let's see if this does any better .. there's lots of new features, but I don't see a 'wrap lines at 80 columns' option. Unfortunately. Much appreciated. [..] >> So the current dax implementation is still struggling to get right (pmd faulting, >> dirty entry cleaning, etc) and this seems like a rewrite that sets us up for future >> features without addressing the current bugs and todo items. In comparison >> the iomap conversion work seems incremental and conserving of current >> development momentum. > > I believe your assessment is incorrect. If converting the current DAX code to > use iomap forces converting ext2, then it's time to get rid of all the half-measures > currently in place. You left off one todo item that this does get us a step closer to > fixing -- support for DMA to mmaped DAX files. I didn't leave that off, DMA is solved with devm_memremap_pages(). Now, DMA without the ~1.6% capacity tax for the memmap array is interesting, but that's a new feature. > I think it also puts us in a better > position to fix the 2MB support, locking, and dirtiness tracking. Oh, and it does > fix the multivolume support (because the sectors in the radix tree could be > interpreted as being from the wrong volume). > >> I agree with you that continuing to touch ext2 is not a good idea, but I'm not >> yet convinced that now is the time to go do dax-2.0 when we haven't finished >> shipping dax-1.0. > > dax-1.0 died long ago ... I think we're up to at least DAX version 4 by now. My point is that I want to address the current slate of problems before solving new questions like "how do we support non-block based filesystems?". We just happened to land DAX in the middle of the in-progress buffer_head removal effort, so DAX should not stand in the way of where filesystems were already going. I'm arguing to complete all the false starts and half measures that are presently in DAX and then look to incrementally evolve the interfaces to something new without regressing any of it. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Christoph Hellwig [mailto:hch@infradead.org] > Either way we need to get rid of buffer_heads, and another aop that is entirely > caller specific is unaceptable. I finally figured out what you actually meant by this. You mean that instead of having an aop->populate_pfn, you want to define a populate_pfn_t callback and pass it in. Something like this: int ext2_populate_pfn(struct address_space *mapping, pgoff_t pgoff) { struct iomap iomap; ... return dax_populate_pfn(mapping, pgoff, &iomap); } int ext2_dax_fault(vma, vmf) { ... ret = dax_fault(vma, vmf, ext2_populate_pfn); ... } I don't have a problem with that. I'll work up something along those lines next week. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Sep 10, 2016 at 12:31:51AM -0700, Christoph Hellwig wrote: > I've mentioned this before, but I'd like to repeat it. With all the > work reqwuired in the file system I would prefer to drop DAX support > in ext2 (and if people really cry for it reinstate the trivial old xip > support). Why is so much work required to support the new DAX interfaces in ext2? Is that unique to ext2, or is adding DAX support just going to be painful for all file systems? Hopefully it's not the latter, right? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Theodore Ts'o [mailto:tytso@mit.edu] > On Sat, Sep 10, 2016 at 12:31:51AM -0700, Christoph Hellwig wrote: > > I've mentioned this before, but I'd like to repeat it. With all the > > work reqwuired in the file system I would prefer to drop DAX support > > in ext2 (and if people really cry for it reinstate the trivial old xip > > support). > > Why is so much work required to support the new DAX interfaces in > ext2? Is that unique to ext2, or is adding DAX support just going to > be painful for all file systems? Hopefully it's not the latter, > right? It's always been my goal to make supporting DAX as easy as possible for the filesystem. Hence the sharing of the DIO locking, and the (it turned out) premature reliance on helper functions. It's more complex than I wanted it to be right now, and I hope we get to simplify it again. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Sep 10, 2016 at 07:52:53AM +0000, Matthew Wilcox wrote: > DAX code over to using iomap requires converting all of ext2 away from > buffer_head; are you saying he's wrong? Not sure if he's really saying that, but it's wrong for sure. Just to prove that I came up with a working ext2 iomap DAX implementation in a few hours today. I'll take a stab at ext4 and the block device as well and will post the updated series early next week - I'll need to take care of a few high priority todo list items first. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Sep 11, 2016 at 05:47:41AM -0700, Christoph Hellwig wrote: > On Sat, Sep 10, 2016 at 07:52:53AM +0000, Matthew Wilcox wrote: > > DAX code over to using iomap requires converting all of ext2 away from > > buffer_head; are you saying he's wrong? > > Not sure if he's really saying that, but it's wrong for sure. Just > to prove that I came up with a working ext2 iomap DAX implementation > in a few hours today. I'll take a stab at ext4 and the block device > as well and will post the updated series early next week - I'll need > to take care of a few high priority todo list items first. Yay! Sorry if I was unclear, I wasn't trying to say that we had to change all of ext2 over to using struct iomap. If we can (and apparently we can) just switch over the DAX interfaces, that's good enough to me. I understand that this will mean that we may have overlapping DAX paths for a while (an iomap version and a buffer_head version). I just wanted to figure out whether this overlap would need to be permanent - sounds like not, which is ideal. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Aug 23, 2016 at 04:04:12PM -0600, Ross Zwisler wrote: > When DAX calls ext2_get_block() and the file offset points to a hole we > currently don't set bh_result->b_size. When we re-enable PMD faults DAX > will need bh_result->b_size to tell it the size of the hole so it can > decide whether to fault in a 4 KiB zero page or a 2 MiB zero page. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > fs/ext2/inode.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c > index d5c7d09..dd55d74 100644 > --- a/fs/ext2/inode.c > +++ b/fs/ext2/inode.c > @@ -773,6 +773,9 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_ > if (ret > 0) { > bh_result->b_size = (ret << inode->i_blkbits); > ret = 0; > + } else if (ret == 0) { > + /* hole case, need to fill in bh_result->b_size */ > + bh_result->b_size = 1 << inode->i_blkbits; > } > return ret; > > -- > 2.9.0 > Jan, is it possible for ext2 to return 2 MiB of contiguous space to us via ext2_get_block()? I ask because we have all the infrastructure in place for ext2 to handle PMD faults (ext2_dax_pmd_fault(), etc.), but I don't think in my testing I've ever seen this actually happen. ext2 can obviously return multiple blocks from ext2_get_block(), but can it actually satisfy a whole PMD's worth (512 contiguous blocks)? If so, what steps do I need to take to get this to work in my testing? If it can't happen, we should probably rip out ext2_dax_pmd_fault() so that we don't have to keep falling back to PTEs via the PMD path. Thanks, - Ross -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index d5c7d09..dd55d74 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -773,6 +773,9 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_ if (ret > 0) { bh_result->b_size = (ret << inode->i_blkbits); ret = 0; + } else if (ret == 0) { + /* hole case, need to fill in bh_result->b_size */ + bh_result->b_size = 1 << inode->i_blkbits; } return ret;
When DAX calls ext2_get_block() and the file offset points to a hole we currently don't set bh_result->b_size. When we re-enable PMD faults DAX will need bh_result->b_size to tell it the size of the hole so it can decide whether to fault in a 4 KiB zero page or a 2 MiB zero page. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- fs/ext2/inode.c | 3 +++ 1 file changed, 3 insertions(+)