Message ID | 20200430143825.3534128-1-imbrenda@linux.ibm.com (mailing list archive)
---|---
State | New, archived
Series | [v2,1/1] fs/splice: add missing callback for inaccessible pages
On 30.04.20 16:38, Claudio Imbrenda wrote:
> Inaccessible pages are pages that should not be accessed by any device,
> and belong to a protected VM. If any such pages are passed to a
> device, there will be I/O errors, which will not always be recoverable,
> depending on the architecture and on the specific device.
>
> CPU accesses to inaccessible pages are less problematic, since they are
> always recoverable.
>
> Page cache and direct I/O were fixed in a previous patch, in which an
> architecture hook is provided to make the page accessible by I/O
> devices.
>
> One possible remaining path to sneak a protected page directly to a
> device is sendfile and similar syscalls. Those syscalls take pages
> directly from the page cache and give them to the device with
> zero-copy. This bypasses both existing hooks, in gup and in writeback.
>
> This patch fixes the issue by adding a call to arch_make_page_accessible
> in page_cache_pipe_buf_confirm.
>
> Notice that we only need to make sure the source is accessible, since
> zero-copy only works in one direction, and CPU accesses to inaccessible
> pages are not a problem. Pagecache-to-pagecache is also not a problem,
> since that is done by the CPU.
>
> The hook has no overhead for architectures that do not need to deal
> with inaccessible pages.
>
> Fixes: f28d43636d6f ("mm/gup/writeback: add callbacks for inaccessible pages")

You should add a Reported-by for Dave Hansen, I guess.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>

> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> ---
>  fs/splice.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 4735defc46ee..f026e0ce9acd 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -106,6 +106,9 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
>  	struct page *page = buf->page;
>  	int err;
>
> +	if (arch_make_page_accessible(page))
> +		return -EIO;
> +
>  	if (!PageUptodate(page)) {
>  		lock_page(page);
I was also wondering if Claudio was right about the debug patch having
races. I went to go look at how the s390 code avoids races when pages go
from accessible->inaccessible.

Because, even if all of the traps are in place to transform pages from
inaccessible->accessible, the code *after* those traps is still
vulnerable. What *keeps* pages accessible?

The race avoidance is this, basically:

	down_read(&gmap->mm->mmap_sem);
	lock_page(page);
	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
	...
>	expected = expected_page_refs(page);
>	if (!page_ref_freeze(page, expected))
>		return -EBUSY;
>	set_bit(PG_arch_1, &page->flags);
>	rc = uv_call(0, (u64)uvcb);
>	page_ref_unfreeze(page, expected);
	... up_read(mmap_sem) / unlock_page() / unlock pte

I'm assuming that after the uv_call(), the page is inaccessible and I/O
devices will go boom if they touch the page.

The page_ref_freeze() ensures that references taken between the
freeze/unfreeze are noticed, but it doesn't actually *stop* new ones for
users that hold references already. For the page cache, especially,
someone could do:

	page = find_get_page();
	arch_make_page_accessible();
	lock_page();
	... make_secure_pte();
	unlock_page();
	get_page();
	// ^ OK because I have a ref
	// do DMA on inaccessible page

Because the make_secure_pte() code isn't looking for a *specific*
'expected' value, it has no way of noticing that the extra ref snuck in
there. I _think_ 'expected' actually needs to be checked for having a
specific (low) value so that if there's a *possibility* of a reference
holder acquiring additional references, the page is known to be
off-limits. mm/migrate.c has a few examples of this, but I'm not quite
sure how bulletproof they are. Some of it appears to just be
optimizations.

One other thing... The gup code will not take references on ref-frozen
pages:

> static inline __must_check bool try_get_page(struct page *page)
> {
>	page = compound_head(page);
>	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>		return false;
>	page_ref_inc(page);
>	return true;
> }

*But*, notice that the path that skips taking a ref is also a
WARN_ON_ONCE(). Basically, if you get to try_get_page() on a ref-frozen
page, it's considered buggy. This makes sense because you fundamentally
can't freeze refs on a page that might have more refs taken on it.

I think all the other users do this by ensuring that any PTE that could
be gup'd is set non-present before the refs are frozen, and remote TLBs
are flushed, which also ensures no GUPs are running. I don't know if the
s390 code has some other way of preventing GUPs, but leaving Present=1
PTEs while you freeze refs would be quite troublesome on x86.
On 01.05.20 00:06, Dave Hansen wrote:
> I was also wondering if Claudio was right about the debug patch having
> races. I went to go look at how the s390 code avoids races when pages go
> from accessible->inaccessible.
>
> Because, even if all of the traps are in place to transform pages from
> inaccessible->accessible, the code *after* those traps is still
> vulnerable. What *keeps* pages accessible?
>
> The race avoidance is this, basically:
>
> 	down_read(&gmap->mm->mmap_sem);
> 	lock_page(page);
> 	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> 	...
>> 	expected = expected_page_refs(page);
>> 	if (!page_ref_freeze(page, expected))
>> 		return -EBUSY;
>> 	set_bit(PG_arch_1, &page->flags);
>> 	rc = uv_call(0, (u64)uvcb);
>> 	page_ref_unfreeze(page, expected);
>
> 	... up_read(mmap_sem) / unlock_page() / unlock pte
>
> I'm assuming that after the uv_call(), the page is inaccessible and I/O
> devices will go boom if they touch the page.
>
> The page_ref_freeze() ensures that references taken between the
> freeze/unfreeze are noticed, but it doesn't actually *stop* new ones for
> users that hold references already. For the page cache, especially,
> someone could do:
>
> 	page = find_get_page();
> 	arch_make_page_accessible();
> 	lock_page();
> 	... make_secure_pte();

Not sure if I got your point here, but this make_secure_pte should bail
out because we actually do check for a calculated refcount value and
return -EBUSY. The find_get_page should have raised this refcount to a
value that would go beyond the expected value, no?

> 	unlock_page();
> 	get_page();
> 	// ^ OK because I have a ref
> 	// do DMA on inaccessible page
>
> Because the make_secure_pte() code isn't looking for a *specific*
> 'expected' value, it has no way of noticing that the extra ref snuck in
> there.

I think the expected calculation is actually doing that, giving back the
minimum value when no one else has any references that are valid for
I/O.

But I might not have understood what you are trying to tell me?

> I _think_ 'expected' actually needs to be checked for having a specific
> (low) value so that if there's a *possibility* of a reference holder
> acquiring additional references, the page is known to be off-limits.
> mm/migrate.c has a few examples of this, but I'm not quite sure how
> bulletproof they are. Some of it appears to just be optimizations.
On 5/1/20 12:18 AM, Christian Borntraeger wrote:
>> 	unlock_page();
>> 	get_page();
>> 	// ^ OK because I have a ref
>> 	// do DMA on inaccessible page
>>
>> Because the make_secure_pte() code isn't looking for a *specific*
>> 'expected' value, it has no way of noticing that the extra ref snuck in
>> there.
> I think the expected calculation is actually doing that, giving back
> the minimum value when no one else has any references that are valid
> for I/O.
>
> But I might not have understood what you are trying to tell me?

I was wrong. I was looking at migrate_page_move_mapping():

> 	int expected_count = expected_page_refs(mapping, page) + extra_count;
...
> 	xas_lock_irq(&xas);
> 	if (page_count(page) != expected_count || xas_load(&xas) != page) {
> 		xas_unlock_irq(&xas);
> 		return -EAGAIN;
> 	}
>
> 	if (!page_ref_freeze(page, expected_count)) {
> 		xas_unlock_irq(&xas);
> 		return -EAGAIN;
> 	}

I saw the check for page_count(page) *and* the page_ref_freeze() call.
My assumption was that both were needed. My assumption was wrong. (I
think the migrate_page_move_mapping() code may actually be doing a
superfluous check.)

The larger point, though, is that the s390 code ensures no extra
references exist upon entering make_secure_pte(), but it still has no
mechanism to prevent future, new references to page cache pages from
being created.

The one existing user of expected_page_refs() freezes the refs, then
*removes* the page from the page cache (that's what the xas_lock_irq()
is for). That stops *new* refs from being acquired. The s390 code is
missing an equivalent mechanism.

One example:

	page_freeze_refs();
	// page->_count==0 now
	find_get_page();
	// ^ sees a "freed" page
	page_unfreeze_refs();

find_get_page() will either fail to *find* the page, because it will see
page->_refcount==0 and think the page is freed (not great), or it will
VM_BUG_ON_PAGE() in __page_cache_add_speculative().

My bigger point is that this patch doesn't systematically stop finding
page cache pages that are arch-inaccessible. This patch hits *one* of
those sites.
On Fri, May 01, 2020 at 09:32:45AM -0700, Dave Hansen wrote:
> The larger point, though, is that the s390 code ensures no extra
> references exist upon entering make_secure_pte(), but it still has no
> mechanism to prevent future, new references to page cache pages from
> being created.

Hi Dave, I worked with Claudio and Christian on the initial design of
our approach, so let me chime in here as well.

You're right that there is no mechanism to prevent new references, but
that's really never been the goal either. We're simply trying to ensure
that no I/O is ever done on a page that is in the "secure" (or
inaccessible) state. To do so, we rely on the assumption that all code
that starts I/O on a page cache page will *first*:
- mark the page as pending I/O by either taking an extra page count, or
  by setting the Writeback flag; then:
- call arch_make_page_accessible(); then:
- start I/O; and only after I/O has finished:
- remove the "pending I/O" marker (Writeback and/or extra ref)

We thought we had identified all places where we needed to place
arch_make_page_accessible so that the above assumption is satisfied.
You've found at least two instances where this wasn't true (thanks!);
but I still think that this can be fixed by just adding those calls.

Now, if the above assumption holds, then I believe we're safe:
- before we make any page secure, we verify that it is not "pending I/O"
  as defined above (neither the Writeback flag, nor an extra page count)
- *during* the process of making the page secure, we're protected
  against any potential races due to changes in that status, since we
  hold the page lock (and therefore the Writeback flag cannot change),
  and we've frozen page references (so those cannot change).

This implies that before I/O has started, the page was made accessible;
and as long as the page is marked "pending I/O" it will not be made
inaccessible again.

> The one existing user of expected_page_refs() freezes the refs then
> *removes* the page from the page cache (that's what the xas_lock_irq()
> is for). That stops *new* refs from being acquired.
>
> The s390 code is missing an equivalent mechanism.
>
> One example:
>
> 	page_freeze_refs();
> 	// page->_count==0 now
> 	find_get_page();
> 	// ^ sees a "freed" page
> 	page_unfreeze_refs();
>
> find_get_page() will either fail to *find* the page, because it will
> see page->_refcount==0 and think the page is freed (not great), or it
> will VM_BUG_ON_PAGE() in __page_cache_add_speculative().

I don't really see how that could happen; my understanding is that
page_freeze_refs simply causes potential users to spin and wait until it
is no longer frozen. For example, find_get_page will in the end call
down to find_get_entry, which does:

	if (!page_cache_get_speculative(page))
		goto repeat;

Am I misunderstanding anything here?

> My bigger point is that this patch doesn't systematically stop finding
> page cache pages that are arch-inaccessible. This patch hits *one* of
> those sites.

As I said above, that wasn't really the goal for our approach.

In particular, note that we *must* have secure pages present in the page
table of the secure guest (that is a requirement of the architecture;
note that the "secure" status doesn't just apply to the physical page,
but to a triple of "*this* host physical page is the secure backing
store of *this* guest physical page in *this* secure guest", which the
HW/FW tracks based on the specific page table entry).

As a consequence, the page really also has to remain present in the page
cache (I don't think Linux mm code would be able to handle the case
where a file-backed page is in the page table but not the page cache).

I'm not sure what exactly the requirements for your use case are; if
those are significantly different, maybe we can work together to find an
approach that works for both?

Bye,
Ulrich
On 5/4/20 6:41 AM, Ulrich Weigand wrote:
> On Fri, May 01, 2020 at 09:32:45AM -0700, Dave Hansen wrote:
>> The larger point, though, is that the s390 code ensures no extra
>> references exist upon entering make_secure_pte(), but it still has no
>> mechanism to prevent future, new references to page cache pages from
>> being created.
>
> Hi Dave, I worked with Claudio and Christian on the initial design
> of our approach, so let me chime in here as well.

Hi Ulrich!

> You're right that there is no mechanism to prevent new references,
> but that's really never been the goal either. We're simply trying
> to ensure that no I/O is ever done on a page that is in the "secure"
> (or inaccessible) state. To do so, we rely on the assumption that
> all code that starts I/O on a page cache page will *first*:
> - mark the page as pending I/O by either taking an extra page
>   count, or by setting the Writeback flag; then:
> - call arch_make_page_accessible(); then:
> - start I/O; and only after I/O has finished:
> - remove the "pending I/O" marker (Writeback and/or extra ref)

Let's ignore writeback for a moment because get_page() is the more
general case. The locking sequence is:

1. get_page() (or equivalent) "locks out" a page from converting to
   inaccessible,
2. followed by a make_page_accessible(), which guarantees that the page
   *stays* accessible until
3. I/O is safe in this region
4. put_page() removes the "lock out", I/O now unsafe

The key is, though, that the get_page() must happen before
make_page_accessible(), and *every* place that acquires a new reference
needs a make_page_accessible().

try_get_page() is obviously one of those "new reference sites" and it
only has one call site outside of the gup code: generic_pipe_buf_get(),
which is effectively patched by the patch that started this thread. The
fact that this one oddball site _and_ gup are patched now is a good
sign.

*But*, I still don't know how that could work nicely:

> static inline __must_check bool try_get_page(struct page *page)
> {
> 	page = compound_head(page);
> 	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> 		return false;
> 	page_ref_inc(page);
> 	return true;
> }

If try_get_page() collides with a freeze_page_refs(), it'll hit the
WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
that warning is _actually_ valid since freeze_page_refs() isn't truly a
0 refcount. But, the fact that this hasn't been encountered means that
the testing here is potentially lacking.

> We thought we had identified all places where we needed to place
> arch_make_page_accessible so that the above assumption is satisfied.
> You've found at least two instances where this wasn't true (thanks!);
> but I still think that this can be fixed by just adding those calls.

Why do you think that's the extent of the problem? Because the crashes
stopped?

I'd feel a lot more comfortable if you explained the audits that you've
performed or _why_ you think that. What I've heard thus far is basically
that you've been able to boot a guest and you're ready to ship this
code.

>> The one existing user of expected_page_refs() freezes the refs then
>> *removes* the page from the page cache (that's what the xas_lock_irq()
>> is for). That stops *new* refs from being acquired.
>>
>> The s390 code is missing an equivalent mechanism.
>>
>> One example:
>>
>> 	page_freeze_refs();
>> 	// page->_count==0 now
>> 	find_get_page();
>> 	// ^ sees a "freed" page
>> 	page_unfreeze_refs();
>>
>> find_get_page() will either fail to *find* the page, because it will
>> see page->_refcount==0 and think the page is freed (not great), or it
>> will VM_BUG_ON_PAGE() in __page_cache_add_speculative().
>
> I don't really see how that could happen; my understanding is that
> page_freeze_refs simply causes potential users to spin and wait
> until it is no longer frozen. For example, find_get_page will
> in the end call down to find_get_entry, which does:
>
> 	if (!page_cache_get_speculative(page))
> 		goto repeat;
>
> Am I misunderstanding anything here?

Dang, I think I was looking at the TINY_RCU code, which is unfortunately
first in page_cache_get_speculative(). It doesn't support PREEMPT or
SMP, so it can take some shortcuts.

But, with regular RCU, you're right, it _does_ appear that it would hit
that retry loop, but then it would *succeed* in getting a reference. In
the end, this just supports the sequence I wrote above:
arch_make_page_accessible() is only valid when called with an elevated
refcount, and the refcount must be held to lock out make_secure_pte().

>> My bigger point is that this patch doesn't systematically stop finding
>> page cache pages that are arch-inaccessible. This patch hits *one* of
>> those sites.
>
> As I said above, that wasn't really the goal for our approach.
>
> In particular, note that we *must* have secure pages present in the
> page table of the secure guest (that is a requirement of the
> architecture; note that the "secure" status doesn't just apply to the
> physical page, but to a triple of "*this* host physical page is the
> secure backing store of *this* guest physical page in *this* secure
> guest", which the HW/FW tracks based on the specific page table entry).
>
> As a consequence, the page really also has to remain present in the
> page cache (I don't think Linux mm code would be able to handle the
> case where a file-backed page is in the page table but not page cache).

It actually happens transiently, at least. I believe inode truncation
removes from the page cache before it zaps the PTEs.

> I'm not sure what exactly the requirements for your use case are; if
> those are significantly different, maybe we can work together to find
> an approach that works for both?

I'm actually trying to figure out what to do with AMD's SEV. The current
state isn't great and, for instance, allows userspace to read guest
ciphertext. But, the pages come and go out of the encrypted state at the
behest of the guest, and the kernel needs *some* mapping for the pages
to do things like instruction emulation.

I started looking at s390 because someone said there was a similar
problem there and suggested the hooks might work. I couldn't figure out
how they worked comprehensively on s390, and that's how we got here.
On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
> On 5/4/20 6:41 AM, Ulrich Weigand wrote:
>> You're right that there is no mechanism to prevent new references,
>> but that's really never been the goal either. [...]
>
> Let's ignore writeback for a moment because get_page() is the more
> general case. The locking sequence is:
>
> 1. get_page() (or equivalent) "locks out" a page from converting to
>    inaccessible,
> 2. followed by a make_page_accessible(), which guarantees that the page
>    *stays* accessible until
> 3. I/O is safe in this region
> 4. put_page() removes the "lock out", I/O now unsafe

Yes, exactly.

> The key is, though, that the get_page() must happen before
> make_page_accessible(), and *every* place that acquires a new reference
> needs a make_page_accessible().

Well, sort of: every place that acquires a new reference *and then
performs I/O* needs a make_page_accessible(). There seem to be a lot of
plain get_page() calls that aren't related to I/O.

> try_get_page() is obviously one of those "new reference sites" and it
> only has one call site outside of the gup code: generic_pipe_buf_get(),
> which is effectively patched by the patch that started this thread. The
> fact that this one oddball site _and_ gup are patched now is a good
> sign.
>
> *But*, I still don't know how that could work nicely:
>
>> static inline __must_check bool try_get_page(struct page *page)
>> {
>> 	page = compound_head(page);
>> 	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>> 		return false;
>> 	page_ref_inc(page);
>> 	return true;
>> }
>
> If try_get_page() collides with a freeze_page_refs(), it'll hit the
> WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
> that warning is _actually_ valid since freeze_page_refs() isn't truly
> a 0 refcount. But, the fact that this hasn't been encountered means
> that the testing here is potentially lacking.

This is indeed interesting. In particular if you compare try_get_page
with try_get_compound_head in gup.c, which does instead:

	if (WARN_ON_ONCE(page_ref_count(head) < 0))
		return NULL;

which seems more reasonable to me, given the presence of the
page_ref_freeze method. So I'm not sure why try_get_page has <= 0.

I think I understand why we haven't seen this in testing: all the places
in gup.c where try_get_page is called hold the pte lock; and in the
usual case, the pte lock itself already suffices to lock out
make_secure_pte before it even tries to use page_ref_freeze. (The intent
of holding the pte lock there was really to ensure that the PTE entry is
and remains valid throughout the execution of the ultravisor call, which
will look at the PTE entry.)

However, I guess we could construct cases where the pte lock doesn't
suffice to prevent the try_get_page warning: if we create a shared
mapping of the secure guest backing store file into a second process.
That doesn't ever happen in normal qemu operation, so that's likely why
we haven't seen that case.

>> We thought we had identified all places where we needed to place
>> arch_make_page_accessible so that the above assumption is satisfied.
>> You've found at least two instances where this wasn't true (thanks!);
>> but I still think that this can be fixed by just adding those calls.
>
> Why do you think that's the extent of the problem? Because the crashes
> stopped?
>
> I'd feel a lot more comfortable if you explained the audits that you've
> performed or _why_ you think that. What I've heard thus far is
> basically that you've been able to boot a guest and you're ready to
> ship this code.

Not sure if you can really call this an "audit", but we were really
coming from identifying places where I/O can happen on a page cache
page, and everything we found (except writeback) went through gup. We
obviously missed the sendfile case here; not sure what the best way
would be to verify nothing else was missed.

> But, with regular RCU, you're right, it _does_ appear that it would hit
> that retry loop, but then it would *succeed* in getting a reference. In
> the end, this just supports the sequence I wrote above:
> arch_make_page_accessible() is only valid when called with an elevated
> refcount and the refcount must be held to lock out make_secure_pte().

Yes, exactly. That's what the comment ahead of our
arch_make_page_accessible says: to be called with the page locked or
with an extra reference! (Either is enough to lock out make_secure_pte.)

Bye,
Ulrich
On 05.05.20 14:34, Dave Hansen wrote:
[...]
>> I'm not sure what exactly the requirements for your use case are; if
>> those are significantly different, maybe we can work together to find
>> an approach that works for both?
>
> I'm actually trying to figure out what to do with AMD's SEV. The
> current state isn't great and, for instance, allows userspace to read
> guest ciphertext. But, the pages come and go out of the encrypted
> state at the behest of the guest, and the kernel needs *some* mapping
> for the pages to do things like instruction emulation.
>
> I started looking at s390 because someone said there was a similar
> problem there and suggested the hooks might work. I couldn't figure
> out how they worked comprehensively on s390, and that's how we got
> here.

We are certainly not married to our approach. I would happily
extend/change this to anything that works for your case and the s390
case. So can you outline your requirements a bit more?
On 05.05.20 15:55, Ulrich Weigand wrote:
> On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
[...]
>> *But*, I still don't know how that could work nicely:
>>
>>> static inline __must_check bool try_get_page(struct page *page)
>>> {
>>> 	page = compound_head(page);
>>> 	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>>> 		return false;
>>> 	page_ref_inc(page);
>>> 	return true;
>>> }
>>
>> If try_get_page() collides with a freeze_page_refs(), it'll hit the
>> WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
>> that warning is _actually_ valid since freeze_page_refs() isn't truly
>> a 0 refcount. But, the fact that this hasn't been encountered means
>> that the testing here is potentially lacking.
>
> This is indeed interesting. In particular if you compare try_get_page
> with try_get_compound_head in gup.c, which does instead:
>
> 	if (WARN_ON_ONCE(page_ref_count(head) < 0))
> 		return NULL;
>
> which seems more reasonable to me, given the presence of the
> page_ref_freeze method. So I'm not sure why try_get_page has <= 0.

Just looked at commit 88b1a17dfc3ed7728316478fae0f5ad508f50397
("mm: add 'try_get_page()' helper function"), which says:

    Also like 'get_page()', you can't use this function unless you
    already had a reference to the page. The intent is that you can use
    this exactly like get_page(), but in situations where you want to
    limit the maximum reference count.

    The code currently does an unconditional WARN_ON_ONCE() if we ever
    hit the reference count issues (either zero or negative), as a
    notification that the conditional non-increment actually happened.

If try_get_page must not be called with an existing reference, that
means that when we call it the page reference is already higher and our
freeze will never succeed. That would imply that we cannot trigger this.
No?
On 05.05.20 16:01, Christian Borntraeger wrote:
[...]
> If try_get_page must not be called with an existing reference, that
> means

s/not//

> that when we call it the page reference is already higher and our
> freeze will never succeed. That would imply that we cannot trigger
> this. No?
On 5/5/20 7:00 AM, Christian Borntraeger wrote:
> We are certainly not married to our approach. I would happily extend/change
> this to anything that works for your case and the s390 case. So can you outline
> your requirements a bit more?

For SEV, the guest defines which pages are encrypted or not.  You could
theoretically do DMA to them or have the CPU access their contents, but
you'd either get ciphertext for reads, or data corruption and loss of
cache coherency for writes.  That's not so cool.

Ideally, we would stop the CPU from ever accessing those pages by
unmapping them.  But, the pages go in and out of the encrypted state and
the host really needs to be *sure* about what's going on before it
restores its mapping and messes with the page.  That includes situations
where someone does a gup, starts an I/O to an unencrypted page, then the
guest tries to convert that page over to being encrypted.

So, the requirements are:

1. Allow host-side DMA and CPU access to shared pages
2. Stop host-side DMA and CPU access to encrypted pages
3. Allow pages to be converted between the states at the request of the
   guest

Stopping the DMA is pretty easy, even across the gazillions of drivers
in the tree because even random ethernet drivers do stuff like:

	txdr->buffer_info[i].dma =
		dma_map_single(&pdev->dev, skb->data, skb->len,
			       DMA_TO_DEVICE);

So the DMA can be stopped at the mapping layer.  It's a *LOT* easier to
catch there since the IOMMUs already provide isolation between the I/O
and CPU address spaces.
On 05.05.20 16:24, Dave Hansen wrote:
> On 5/5/20 7:00 AM, Christian Borntraeger wrote:
>> We are certainly not married to our approach. I would happily extend/change
>> this to anything that works for your case and the s390 case. So can you outline
>> your requirements a bit more?
> 
> For SEV, the guest defines which pages are encrypted or not.  You could
> theoretically do DMA to them or have the CPU access their contents, but
> you'd either get ciphertext for reads, or data corruption and loss of
> cache coherency for writes.  That's not so cool.
> 
> Ideally, we would stop the CPU from ever accessing those pages by
> unmapping them.  But, the pages go in and out of the encrypted state and
> the host really needs to be *sure* about what's going on before it
> restores its mapping and messes with the page.  That includes situations
> where someone does a gup, starts an I/O to an unencrypted page, then the
> guest tries to convert that page over to being encrypted.
> 
> So, the requirements are:
> 
> 1. Allow host-side DMA and CPU access to shared pages
> 2. Stop host-side DMA and CPU access to encrypted pages
> 3. Allow pages to be converted between the states at the request of the
>    guest
> 
> Stopping the DMA is pretty easy, even across the gazillions of drivers
> in the tree because even random ethernet drivers do stuff like:
> 
> 	txdr->buffer_info[i].dma =
> 		dma_map_single(&pdev->dev, skb->data, skb->len,
> 			       DMA_TO_DEVICE);
> 
> So the DMA can be stopped at the mapping layer.  It's a *LOT* easier to
> catch there since the IOMMUs already provide isolation between the I/O
> and CPU address spaces.

And your problem is that the guest could convert this after the dma_map?
So you looked into our code if this would help?
On Tue, May 05, 2020 at 04:03:00PM +0200, Christian Borntraeger wrote:
> > Just looked at
> > commit 88b1a17dfc3ed7728316478fae0f5ad508f50397
> >     mm: add 'try_get_page()' helper function
> >
> > which says:
> >     Also like 'get_page()', you can't use this function unless you already
> >     had a reference to the page.  The intent is that you can use this
> >     exactly like get_page(), but in situations where you want to limit the
> >     maximum reference count.
> >
> >     The code currently does an unconditional WARN_ON_ONCE() if we ever hit
> >     the reference count issues (either zero or negative), as a notification
> >     that the conditional non-increment actually happened.
> >
> > If try_get_page must not be called with an existing reference, that means
> 
> s/not//
> 
> > that when we call it the page reference is already higher and our freeze
> > will never succeed.  That would imply that we cannot trigger this.  No?

Well, my understanding is that the "existing" reference may be one of the
references that is expected by our freeze code; in particular, in gup the
existing reference is simply the one from the pte.  So in this case our
freeze *would* succeed.

Bye,
Ulrich
On 5/5/20 7:31 AM, Christian Borntraeger wrote:
>> So, the requirements are:
>>
>> 1. Allow host-side DMA and CPU access to shared pages
>> 2. Stop host-side DMA and CPU access to encrypted pages
>> 3. Allow pages to be converted between the states at the request of the
>>    guest
>>
>> Stopping the DMA is pretty easy, even across the gazillions of drivers
>> in the tree because even random ethernet drivers do stuff like:
>>
>> 	txdr->buffer_info[i].dma =
>> 		dma_map_single(&pdev->dev, skb->data, skb->len,
>> 			       DMA_TO_DEVICE);
>>
>> So the DMA can be stopped at the mapping layer.  It's a *LOT* easier to
>> catch there since the IOMMUs already provide isolation between the I/O
>> and CPU address spaces.
> And your problem is that the guest could convert this after the dma_map?
> So you looked into our code if this would help?

Yep, it seemed like a close-enough problem.
On 05.05.20 16:34, Dave Hansen wrote:
> On 5/5/20 7:31 AM, Christian Borntraeger wrote:
>>> So, the requirements are:
>>>
>>> 1. Allow host-side DMA and CPU access to shared pages
>>> 2. Stop host-side DMA and CPU access to encrypted pages
>>> 3. Allow pages to be converted between the states at the request of the
>>>    guest
>>>
>>> Stopping the DMA is pretty easy, even across the gazillions of drivers
>>> in the tree because even random ethernet drivers do stuff like:
>>>
>>> 	txdr->buffer_info[i].dma =
>>> 		dma_map_single(&pdev->dev, skb->data, skb->len,
>>> 			       DMA_TO_DEVICE);
>>>
>>> So the DMA can be stopped at the mapping layer.  It's a *LOT* easier to
>>> catch there since the IOMMUs already provide isolation between the I/O
>>> and CPU address spaces.
>> And your problem is that the guest could convert this after the dma_map?
>> So you looked into our code if this would help?
> 
> Yep, it seemed like a close-enough problem.

Is there a way to prevent the guest from switching the state?

We also have 2 variants of secure pages:
1. those that are secure and when the host accesses them will be
   encrypted
2. those that are marked by the guest as shared

When you look at arch_make_page_accessible we first try to "pin" the
shared state.  So the guest trying to "unshare" such a page would
trigger an exit that we can handle.
On 05.05.20 16:33, Ulrich Weigand wrote:
> On Tue, May 05, 2020 at 04:03:00PM +0200, Christian Borntraeger wrote:
>>> Just looked at
>>> commit 88b1a17dfc3ed7728316478fae0f5ad508f50397
>>>     mm: add 'try_get_page()' helper function
>>>
>>> which says:
>>>     Also like 'get_page()', you can't use this function unless you already
>>>     had a reference to the page.  The intent is that you can use this
>>>     exactly like get_page(), but in situations where you want to limit the
>>>     maximum reference count.
>>>
>>>     The code currently does an unconditional WARN_ON_ONCE() if we ever hit
>>>     the reference count issues (either zero or negative), as a notification
>>>     that the conditional non-increment actually happened.
>>>
>>> If try_get_page must not be called with an existing reference, that means
>> s/not//
>>> that when we call it the page reference is already higher and our freeze
>>> will never succeed.  That would imply that we cannot trigger this.  No?
> 
> Well, my understanding is that the "existing" reference may be one of the
> references that is expected by our freeze code; in particular, in gup the
> existing reference is simply the one from the pte.  So in this case our
> freeze *would* succeed.

If that's the case then "< 0" looks indeed better than "<= 0" for the
check.  I think try_get_page was never meant to exclude a parallel
page_ref_freeze/unfreeze.
On 5/5/20 7:01 AM, Christian Borntraeger wrote:
> On 05.05.20 15:55, Ulrich Weigand wrote:
>> On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
>>>> static inline __must_check bool try_get_page(struct page *page)
>>>> {
>>>> 	page = compound_head(page);
>>>> 	if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>>>> 		return false;
>>>> 	page_ref_inc(page);
>>>> 	return true;
>>>> }
>>>
>>> If try_get_page() collides with a freeze_page_refs(), it'll hit the
>>> WARN_ON_ONCE(), which is surely there for a good reason.  I'm not sure
>>> that warning is _actually_ valid since freeze_page_refs() isn't truly a
>>> 0 refcount.  But, the fact that this hasn't been encountered means that
>>> the testing here is potentially lacking.
>>
>> This is indeed interesting.  In particular if you compare try_get_page
>> with try_get_compound_head in gup.c, which does instead:
>>
>> 	if (WARN_ON_ONCE(page_ref_count(head) < 0))
>> 		return NULL;
>>
>> which seems more reasonable to me, given the presence of the
>> page_ref_freeze method.  So I'm not sure why try_get_page has <= 0.
> 
> Just looked at
> commit 88b1a17dfc3ed7728316478fae0f5ad508f50397
>     mm: add 'try_get_page()' helper function
> 
> which says:
>     Also like 'get_page()', you can't use this function unless you already
>     had a reference to the page.  The intent is that you can use this
>     exactly like get_page(), but in situations where you want to limit the
>     maximum reference count.
> 
>     The code currently does an unconditional WARN_ON_ONCE() if we ever hit
>     the reference count issues (either zero or negative), as a notification
>     that the conditional non-increment actually happened.
> 
> If try_get_page must be called with an existing reference, that means
> that when we call it the page reference is already higher and our freeze
> will never succeed.  That would imply that we cannot trigger this.  No?

For gup, we hold the page table lock over the try_grab_page().  That
ensures that nobody can drop the reference while try_grab_page() is in
progress.  The migration page_ref_freeze() code also never races with
this because it first shoots down the PTEs before freezing refs.

My worry with the s390 code is that it leaves the PTEs in place while
freezing refs.  This seems new; otherwise we would have been tripping
the gup warning.

For the page cache, there's a reference taken because of the page's
presence in the page cache xarray.  But, the page cache uses
page_cache_get_speculative(), not try_grab_page().  It doesn't have the
warning on the <= 0 refcount.

Either way, I agree that the try_get_page()
WARN_ON_ONCE(page_ref_count(page) <= 0) is looking fishy.
diff --git a/fs/splice.c b/fs/splice.c
index 4735defc46ee..f026e0ce9acd 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -106,6 +106,9 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
 	struct page *page = buf->page;
 	int err;
 
+	if (arch_make_page_accessible(page))
+		return -EIO;
+
 	if (!PageUptodate(page)) {
 		lock_page(page);
 
Inaccessible pages are pages that should not be accessed by any device,
and belong to a protected VM.  If any such pages are passed to a device,
there will be I/O errors, which will not always be recoverable,
depending on the architecture and on the specific device.

CPU accesses to inaccessible pages are less problematic, since they are
always recoverable.

Page cache and direct I/O were fixed in a previous patch, in which an
architecture hook was provided to make the page accessible by I/O
devices.

One possible remaining path to sneak a protected page directly to a
device is sendfile and similar syscalls.  Those syscalls take a page
directly from the page cache and give it directly to the device with
zero-copy.  This bypasses both existing hooks in gup and in writeback.

Fix the issue by adding a call to arch_make_page_accessible in
page_cache_pipe_buf_confirm.

Notice that we only need to make sure the source is accessible, since
zero-copy only works in one direction, and CPU accesses to inaccessible
pages are not a problem.  Pagecache-to-pagecache is also not a problem,
since that is done by the CPU.

The hook has no overhead for architectures that do not need to deal
with inaccessible pages.

Fixes: f28d43636d6f ("mm/gup/writeback: add callbacks for inaccessible pages")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 fs/splice.c | 3 +++
 1 file changed, 3 insertions(+)