
[1/4] vfio/iommu_type1: Multi-IOMMU domain support

Message ID 20140217202354.22775.58794.stgit@bling.home (mailing list archive)
State New, archived

Commit Message

Alex Williamson Feb. 17, 2014, 8:23 p.m. UTC
We currently have a problem that we cannot support advanced features
of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
that those features will be supported by all of the hardware units
involved with the domain over its lifetime.  For instance, the Intel
VT-d architecture does not require that all DRHDs support snoop
control.  If we create a domain based on a device behind a DRHD that
does support snoop control and enable SNP support via the IOMMU_CACHE
mapping option, we cannot then add a device behind a DRHD which does
not support snoop control or we'll get reserved bit faults from the
SNP bit in the pagetables.  To add to the complexity, we can't know
the properties of a domain until a device is attached.

We could pass this problem off to userspace and require that a
separate vfio container be used, but we don't know how to handle page
accounting in that case.  How do we know that a page pinned in one
container is the same page as one pinned in a different container, and
how do we avoid double billing the user for that page?

The solution is therefore to support multiple IOMMU domains per
container.  In the majority of cases, only one domain will be required
since hardware is typically consistent within a system.  However, this
provides us the ability to validate compatibility of domains and
support mixed environments where page table flags can be different
between domains.
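
For reference, the reworked tracking structures are condensed below from the
patch itself (field comments added here): the container-level vfio_iommu now
carries a domain_list instead of a single iommu_domain pointer, and each
vfio_domain records the mapping protection flags it supports.

struct vfio_iommu {
	struct list_head	domain_list;	/* one entry per vfio_domain */
	struct mutex		lock;
	struct rb_root		dma_list;	/* user mappings (vfio_dma) */
	bool			v2;		/* stricter unmap semantics */
};

struct vfio_domain {
	struct iommu_domain	*domain;	/* backing IOMMU API domain */
	struct list_head	next;		/* link in iommu->domain_list */
	struct list_head	group_list;	/* groups attached to this domain */
	int			prot;		/* extra map flags, e.g. IOMMU_CACHE */
};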

To do this, our DMA tracking needs to change.  We currently try to
coalesce user mappings into as few tracking entries as possible.  The
problem then becomes that we lose granularity of user mappings.  We've
never guaranteed that a user is able to unmap at a finer granularity
than the original mapping, but we must honor the granularity of the
original mapping.  This coalescing code is therefore removed, allowing
only unmaps covering complete maps.  The change in accounting is
fairly small here: a typical QEMU VM will start out with roughly a
dozen entries, so it's arguable whether this coalescing was ever needed.
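
To make the stricter (v2) unmap semantics concrete, here is a minimal sketch
condensed from the unmap path added by this patch.  The helper name is
invented for illustration, but the two bounds checks mirror what
vfio_dma_do_unmap() does when iommu->v2 is set: a request may span several
mappings, but it must not bisect any of them.

/* Sketch only: condensed from vfio_dma_do_unmap() in this patch. */
static int vfio_v2_check_unmap_bounds(struct vfio_iommu *iommu,
				      struct vfio_iommu_type1_dma_unmap *unmap)
{
	struct vfio_dma *dma;

	dma = vfio_find_dma(iommu, unmap->iova, 0);
	if (dma && dma->iova != unmap->iova)
		return -EINVAL;		/* range starts inside a mapping */

	dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
	if (dma && dma->iova + dma->size != unmap->iova + unmap->size)
		return -EINVAL;		/* range ends inside a mapping */

	return 0;			/* only whole mappings are covered */
}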

We also move IOMMU domain creation to the point where a group is
attached to the container.  An interesting side-effect of this is that
we now have access to the device at the time of domain creation and
can probe the devices within the group to determine the bus_type.
This finally makes vfio_iommu_type1 completely device/bus agnostic.
In fact, each IOMMU domain can host devices on different buses managed
by different physical IOMMUs, and present a single DMA mapping
interface to the user.  When a new domain is created, mappings are
replayed to bring the IOMMU pagetables up to the state of the current
container.  And of course, DMA mapping and unmapping automatically
traverse all of the configured IOMMU domains.
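
In outline, the new attach path looks roughly like the sketch below, heavily
simplified from vfio_iommu_type1_attach_group() in the patch: the wrapper
name is invented, and locking, the interrupt-remapping check, and error
unwinding are omitted.  It discovers the group's bus_type, allocates a
domain, folds the group into an existing compatible domain when possible,
and otherwise replays the container's mappings into the new domain.

/* Simplified sketch; assumes group and domain were allocated and zeroed by
 * the caller and that domain->group_list is already initialised.
 */
static int vfio_attach_group_sketch(struct vfio_iommu *iommu,
				    struct iommu_group *iommu_group,
				    struct vfio_group *group,
				    struct vfio_domain *domain)
{
	struct bus_type *bus = NULL;
	struct vfio_domain *d;
	int ret;

	/* Probe the group's devices to learn which bus (and IOMMU) backs it */
	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
	if (ret)
		return ret;

	domain->domain = iommu_domain_alloc(bus);
	if (!domain->domain)
		return -EIO;

	ret = iommu_attach_group(domain->domain, iommu_group);
	if (ret)
		return ret;

	if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
		domain->prot |= IOMMU_CACHE;

	/* Reuse an existing compatible domain (same iommu_ops, same prot) */
	list_for_each_entry(d, &iommu->domain_list, next) {
		if (d->domain->ops == domain->domain->ops &&
		    d->prot == domain->prot) {
			iommu_detach_group(domain->domain, iommu_group);
			if (!iommu_attach_group(d->domain, iommu_group)) {
				list_add(&group->next, &d->group_list);
				iommu_domain_free(domain->domain);
				return 0;
			}
			/* fall back to the new domain if re-attach fails */
			iommu_attach_group(domain->domain, iommu_group);
		}
	}

	/* New domain: replay existing mappings, then start using it */
	list_add(&group->next, &domain->group_list);
	ret = vfio_iommu_replay(iommu, domain);
	if (ret)
		return ret;
	list_add(&domain->next, &iommu->domain_list);
	return 0;
}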

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>
---
 drivers/vfio/vfio_iommu_type1.c |  637 +++++++++++++++++++++------------------
 include/uapi/linux/vfio.h       |    1 
 2 files changed, 336 insertions(+), 302 deletions(-)



Comments

Varun Sethi March 18, 2014, 10:24 a.m. UTC | #1
Hi Alex,
Would it make sense, to link the iommu group to its corresponding hardware
iommu block's capabilities? This could be done if we can determine the iommu
device corresponding to the iommu group during bus probe. With this we won't
have to wait till device attach to determine the domain capabilities (or
underlying iommu capabilities). In vfio we can simply attach the iommu groups
with similar iommu capabilities to the same domain.

Regards
Varun

Alex Williamson March 18, 2014, 2:16 p.m. UTC | #2
On Tue, 2014-03-18 at 10:24 +0000, Varun Sethi wrote:
> Hi Alex,
> Would it make sense, to link the iommu group to its corresponding
> hardware iommu block's capabilities? This could be done if we can
> determine the iommu device corresponding to the iommu group during bus
> probe. With this we won't have to wait till device attach to determine
> the domain capabilities (or underlying iommu capabilities). In vfio we
> can simply attach the iommu groups with similar iommu capabilities to
> the same domain.

The IOMMU API doesn't provide us with a way to get the capabilities
without creating a domain and attaching all the devices in the group.
That means that during device scan, for each device added to a
group, we'd need to create a domain, attach the group, and record the IOMMU
capabilities.  That sounds rather intrusive for a startup operation.
There's also the problem of hotplug: a device may be hot-added to the
system and join an existing IOMMU group whose other devices are in use.
We'd have no way to attach the devices necessary to probe the
capabilities without disrupting the system.
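
For illustration only (the helper name is invented), with the current IOMMU
API the only way to learn a group's capabilities is roughly the following,
and the attach is exactly the disruptive step that can't be done while the
group's devices are already in use:

/* Hedged sketch: querying capabilities today means taking ownership of the
 * group's DMA by attaching it to a freshly allocated domain.
 */
static int probe_group_caps(struct bus_type *bus, struct iommu_group *group,
			    bool *coherent)
{
	struct iommu_domain *domain;
	int ret;

	domain = iommu_domain_alloc(bus);
	if (!domain)
		return -EIO;

	ret = iommu_attach_group(domain, group);	/* the disruptive step */
	if (ret)
		goto out_free;

	*coherent = iommu_domain_has_cap(domain, IOMMU_CAP_CACHE_COHERENCY);

	iommu_detach_group(domain, group);
out_free:
	iommu_domain_free(domain);
	return ret;
}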

Perhaps extending the IOMMU API to ask about the domain capabilities of
a device without actually instantiating the domain would be an
interesting addition, but that doesn't exist today.  Thanks,

Alex

> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, February 18, 2014 1:54 AM
> > To: alex.williamson@redhat.com; kvm@vger.kernel.org
> > Cc: Sethi Varun-B16395; linux-kernel@vger.kernel.org
> > Subject: [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
> > 
> > We currently have a problem that we cannot support advanced features of
> > an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee that
> > those features will be supported by all of the hardware units involved
> > with the domain over its lifetime.  For instance, the Intel VT-d
> > architecture does not require that all DRHDs support snoop control.  If
> > we create a domain based on a device behind a DRHD that does support
> > snoop control and enable SNP support via the IOMMU_CACHE mapping option,
> > we cannot then add a device behind a DRHD which does not support snoop
> > control or we'll get reserved bit faults from the SNP bit in the
> > pagetables.  To add to the complexity, we can't know the properties of a
> > domain until a device is attached.
> > 
> > We could pass this problem off to userspace and require that a separate
> > vfio container be used, but we don't know how to handle page accounting
> > in that case.  How do we know that a page pinned in one container is the
> > same page as a different container and avoid double billing the user for
> > the page.
> > 
> > The solution is therefore to support multiple IOMMU domains per
> > container.  In the majority of cases, only one domain will be required
> > since hardware is typically consistent within a system.  However, this
> > provides us the ability to validate compatibility of domains and support
> > mixed environments where page table flags can be different between
> > domains.
> > 
> > To do this, our DMA tracking needs to change.  We currently try to
> > coalesce user mappings into as few tracking entries as possible.  The
> > problem then becomes that we lose granularity of user mappings.  We've
> > never guaranteed that a user is able to unmap at a finer granularity than
> > the original mapping, but we must honor the granularity of the original
> > mapping.  This coalescing code is therefore removed, allowing only unmaps
> > covering complete maps.  The change in accounting is fairly small here, a
> > typical QEMU VM will start out with roughly a dozen entries, so it's
> > arguable if this coalescing was ever needed.
> > 
> > We also move IOMMU domain creation to the point where a group is attached
> > to the container.  An interesting side-effect of this is that we now have
> > access to the device at the time of domain creation and can probe the
> > devices within the group to determine the bus_type.
> > This finally makes vfio_iommu_type1 completely device/bus agnostic.
> > In fact, each IOMMU domain can host devices on different buses managed by
> > different physical IOMMUs, and present a single DMA mapping interface to
> > the user.  When a new domain is created, mappings are replayed to bring
> > the IOMMU pagetables up to the state of the current container.  And of
> > course, DMA mapping and unmapping automatically traverse all of the
> > configured IOMMU domains.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Varun Sethi <Varun.Sethi@freescale.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c |  637 +++++++++++++++++++++------------
> > ------
> >  include/uapi/linux/vfio.h       |    1
> >  2 files changed, 336 insertions(+), 302 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 4fb7a8f..8c7bb9b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -30,7 +30,6 @@
> >  #include <linux/iommu.h>
> >  #include <linux/module.h>
> >  #include <linux/mm.h>
> > -#include <linux/pci.h>		/* pci_bus_type */
> >  #include <linux/rbtree.h>
> >  #include <linux/sched.h>
> >  #include <linux/slab.h>
> > @@ -55,11 +54,17 @@ MODULE_PARM_DESC(disable_hugepages,
> >  		 "Disable VFIO IOMMU support for IOMMU hugepages.");
> > 
> >  struct vfio_iommu {
> > -	struct iommu_domain	*domain;
> > +	struct list_head	domain_list;
> >  	struct mutex		lock;
> >  	struct rb_root		dma_list;
> > +	bool v2;
> > +};
> > +
> > +struct vfio_domain {
> > +	struct iommu_domain	*domain;
> > +	struct list_head	next;
> >  	struct list_head	group_list;
> > -	bool			cache;
> > +	int			prot;		/* IOMMU_CACHE */
> >  };
> > 
> >  struct vfio_dma {
> > @@ -99,7 +104,7 @@ static struct vfio_dma *vfio_find_dma(struct
> > vfio_iommu *iommu,
> >  	return NULL;
> >  }
> > 
> > -static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *new)
> > +static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma
> > +*new)
> >  {
> >  	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> >  	struct vfio_dma *dma;
> > @@ -118,7 +123,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu,
> > struct vfio_dma *new)
> >  	rb_insert_color(&new->node, &iommu->dma_list);  }
> > 
> > -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *old)
> > +static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma
> > +*old)
> >  {
> >  	rb_erase(&old->node, &iommu->dma_list);  } @@ -322,32 +327,39 @@
> > static long vfio_unpin_pages(unsigned long pfn, long npage,
> >  	return unlocked;
> >  }
> > 
> > -static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> > *dma,
> > -			    dma_addr_t iova, size_t *size)
> > +static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> > +*dma)
> >  {
> > -	dma_addr_t start = iova, end = iova + *size;
> > +	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> > +	struct vfio_domain *domain, *d;
> >  	long unlocked = 0;
> > 
> > +	if (!dma->size)
> > +		return;
> > +	/*
> > +	 * We use the IOMMU to track the physical addresses, otherwise we'd
> > +	 * need a much more complicated tracking system.  Unfortunately
> > that
> > +	 * means we need to use one of the iommu domains to figure out the
> > +	 * pfns to unpin.  The rest need to be unmapped in advance so we
> > have
> > +	 * no iommu translations remaining when the pages are unpinned.
> > +	 */
> > +	domain = d = list_first_entry(&iommu->domain_list,
> > +				      struct vfio_domain, next);
> > +
> > +	list_for_each_entry_continue(d, &iommu->domain_list, next)
> > +		iommu_unmap(d->domain, dma->iova, dma->size);
> > +
> >  	while (iova < end) {
> >  		size_t unmapped;
> >  		phys_addr_t phys;
> > 
> > -		/*
> > -		 * We use the IOMMU to track the physical address.  This
> > -		 * saves us from having a lot more entries in our mapping
> > -		 * tree.  The downside is that we don't track the size
> > -		 * used to do the mapping.  We request unmap of a single
> > -		 * page, but expect IOMMUs that support large pages to
> > -		 * unmap a larger chunk.
> > -		 */
> > -		phys = iommu_iova_to_phys(iommu->domain, iova);
> > +		phys = iommu_iova_to_phys(domain->domain, iova);
> >  		if (WARN_ON(!phys)) {
> >  			iova += PAGE_SIZE;
> >  			continue;
> >  		}
> > 
> > -		unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> > -		if (!unmapped)
> > +		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
> > +		if (WARN_ON(!unmapped))
> >  			break;
> > 
> >  		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT, @@ -357,119
> > +369,26 @@ static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct
> > vfio_dma *dma,
> >  	}
> > 
> >  	vfio_lock_acct(-unlocked);
> > -
> > -	*size = iova - start;
> > -
> > -	return 0;
> >  }
> > 
> > -static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > start,
> > -				   size_t *size, struct vfio_dma *dma)
> > +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > +*dma)
> >  {
> > -	size_t offset, overlap, tmp;
> > -	struct vfio_dma *split;
> > -	int ret;
> > -
> > -	if (!*size)
> > -		return 0;
> > -
> > -	/*
> > -	 * Existing dma region is completely covered, unmap all.  This is
> > -	 * the likely case since userspace tends to map and unmap buffers
> > -	 * in one shot rather than multiple mappings within a buffer.
> > -	 */
> > -	if (likely(start <= dma->iova &&
> > -		   start + *size >= dma->iova + dma->size)) {
> > -		*size = dma->size;
> > -		ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
> > -		if (ret)
> > -			return ret;
> > -
> > -		/*
> > -		 * Did we remove more than we have?  Should never happen
> > -		 * since a vfio_dma is contiguous in iova and vaddr.
> > -		 */
> > -		WARN_ON(*size != dma->size);
> > -
> > -		vfio_remove_dma(iommu, dma);
> > -		kfree(dma);
> > -		return 0;
> > -	}
> > -
> > -	/* Overlap low address of existing range */
> > -	if (start <= dma->iova) {
> > -		overlap = start + *size - dma->iova;
> > -		ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
> > -		if (ret)
> > -			return ret;
> > -
> > -		vfio_remove_dma(iommu, dma);
> > -
> > -		/*
> > -		 * Check, we may have removed to whole vfio_dma.  If not
> > -		 * fixup and re-insert.
> > -		 */
> > -		if (overlap < dma->size) {
> > -			dma->iova += overlap;
> > -			dma->vaddr += overlap;
> > -			dma->size -= overlap;
> > -			vfio_insert_dma(iommu, dma);
> > -		} else
> > -			kfree(dma);
> > -
> > -		*size = overlap;
> > -		return 0;
> > -	}
> > -
> > -	/* Overlap high address of existing range */
> > -	if (start + *size >= dma->iova + dma->size) {
> > -		offset = start - dma->iova;
> > -		overlap = dma->size - offset;
> > -
> > -		ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
> > -		if (ret)
> > -			return ret;
> > -
> > -		dma->size -= overlap;
> > -		*size = overlap;
> > -		return 0;
> > -	}
> > -
> > -	/* Split existing */
> > -
> > -	/*
> > -	 * Allocate our tracking structure early even though it may not
> > -	 * be used.  An Allocation failure later loses track of pages and
> > -	 * is more difficult to unwind.
> > -	 */
> > -	split = kzalloc(sizeof(*split), GFP_KERNEL);
> > -	if (!split)
> > -		return -ENOMEM;
> > -
> > -	offset = start - dma->iova;
> > -
> > -	ret = vfio_unmap_unpin(iommu, dma, start, size);
> > -	if (ret || !*size) {
> > -		kfree(split);
> > -		return ret;
> > -	}
> > -
> > -	tmp = dma->size;
> > +	vfio_unmap_unpin(iommu, dma);
> > +	vfio_unlink_dma(iommu, dma);
> > +	kfree(dma);
> > +}
> > 
> > -	/* Resize the lower vfio_dma in place, before the below insert */
> > -	dma->size = offset;
> > +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > +{
> > +	struct vfio_domain *domain;
> > +	unsigned long bitmap = PAGE_MASK;
> > 
> > -	/* Insert new for remainder, assuming it didn't all get unmapped */
> > -	if (likely(offset + *size < tmp)) {
> > -		split->size = tmp - offset - *size;
> > -		split->iova = dma->iova + offset + *size;
> > -		split->vaddr = dma->vaddr + offset + *size;
> > -		split->prot = dma->prot;
> > -		vfio_insert_dma(iommu, split);
> > -	} else
> > -		kfree(split);
> > +	mutex_lock(&iommu->lock);
> > +	list_for_each_entry(domain, &iommu->domain_list, next)
> > +		bitmap &= domain->domain->ops->pgsize_bitmap;
> > +	mutex_unlock(&iommu->lock);
> > 
> > -	return 0;
> > +	return bitmap;
> >  }
> > 
> >  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > @@ -477,10 +396,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >  {
> >  	uint64_t mask;
> >  	struct vfio_dma *dma;
> > -	size_t unmapped = 0, size;
> > +	size_t unmapped = 0;
> >  	int ret = 0;
> > 
> > -	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> > 1;
> > +	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > 
> >  	if (unmap->iova & mask)
> >  		return -EINVAL;
> > @@ -491,20 +410,61 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> > 
> >  	mutex_lock(&iommu->lock);
> > 
> > +	/*
> > +	 * vfio-iommu-type1 (v1) - User mappings were coalesced together to
> > +	 * avoid tracking individual mappings.  This means that the
> > granularity
> > +	 * of the original mapping was lost and the user was allowed to
> > attempt
> > +	 * to unmap any range.  Depending on the contiguousness of physical
> > +	 * memory and page sizes supported by the IOMMU, arbitrary unmaps
> > may
> > +	 * or may not have worked.  We only guaranteed unmap granularity
> > +	 * matching the original mapping; even though it was untracked
> > here,
> > +	 * the original mappings are reflected in IOMMU mappings.  This
> > +	 * resulted in a couple unusual behaviors.  First, if a range is
> > not
> > +	 * able to be unmapped, ex. a set of 4k pages that was mapped as a
> > +	 * 2M hugepage into the IOMMU, the unmap ioctl returns success but
> > with
> > +	 * a zero sized unmap.  Also, if an unmap request overlaps the
> > first
> > +	 * address of a hugepage, the IOMMU will unmap the entire hugepage.
> > +	 * This also returns success and the returned unmap size reflects
> > the
> > +	 * actual size unmapped.
> > +	 *
> > +	 * We attempt to maintain compatibility with this "v1" interface,
> > but
> > +	 * we take control out of the hands of the IOMMU.  Therefore, an
> > unmap
> > +	 * request offset from the beginning of the original mapping will
> > +	 * return success with zero sized unmap.  And an unmap request
> > covering
> > +	 * the first iova of mapping will unmap the entire range.
> > +	 *
> > +	 * The v2 version of this interface intends to be more
> > deterministic.
> > +	 * Unmap requests must fully cover previous mappings.  Multiple
> > +	 * mappings may still be unmaped by specifying large ranges, but
> > there
> > +	 * must not be any previous mappings bisected by the range.  An
> > error
> > +	 * will be returned if these conditions are not met.  The v2
> > interface
> > +	 * will only return success and a size of zero if there were no
> > +	 * mappings within the range.
> > +	 */
> > +	if (iommu->v2) {
> > +		dma = vfio_find_dma(iommu, unmap->iova, 0);
> > +		if (dma && dma->iova != unmap->iova) {
> > +			ret = -EINVAL;
> > +			goto unlock;
> > +		}
> > +		dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
> > +		if (dma && dma->iova + dma->size != unmap->iova + unmap->size) {
> > +			ret = -EINVAL;
> > +			goto unlock;
> > +		}
> > +	}
> > +
> >  	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> > -		size = unmap->size;
> > -		ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size,
> > dma);
> > -		if (ret || !size)
> > +		if (!iommu->v2 && unmap->iova > dma->iova)
> >  			break;
> > -		unmapped += size;
> > +		unmapped += dma->size;
> > +		vfio_remove_dma(iommu, dma);
> >  	}
> > 
> > +unlock:
> >  	mutex_unlock(&iommu->lock);
> > 
> > -	/*
> > -	 * We may unmap more than requested, update the unmap struct so
> > -	 * userspace can know.
> > -	 */
> > +	/* Report how much was unmapped */
> >  	unmap->size = unmapped;
> > 
> >  	return ret;
> > @@ -516,22 +476,47 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> >   * soon, so this is just a temporary workaround to break mappings down
> > into
> >   * PAGE_SIZE.  Better to map smaller pages than nothing.
> >   */
> > -static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
> > +static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
> >  			  unsigned long pfn, long npage, int prot)
> >  {
> >  	long i;
> >  	int ret;
> > 
> >  	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
> > -		ret = iommu_map(iommu->domain, iova,
> > +		ret = iommu_map(domain->domain, iova,
> >  				(phys_addr_t)pfn << PAGE_SHIFT,
> > -				PAGE_SIZE, prot);
> > +				PAGE_SIZE, prot | domain->prot);
> >  		if (ret)
> >  			break;
> >  	}
> > 
> >  	for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
> > -		iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> > +		iommu_unmap(domain->domain, iova, PAGE_SIZE);
> > +
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
> > +			  unsigned long pfn, long npage, int prot)
> > +{
> > +	struct vfio_domain *d;
> > +	int ret;
> > +
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn <<
> > PAGE_SHIFT,
> > +				npage << PAGE_SHIFT, prot | d->prot);
> > +		if (ret) {
> > +			if (ret != -EBUSY ||
> > +			    map_try_harder(d, iova, pfn, npage, prot))
> > +				goto unwind;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +
> > +unwind:
> > +	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
> > +		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
> > 
> >  	return ret;
> >  }
> > @@ -545,12 +530,12 @@ static int vfio_dma_do_map(struct vfio_iommu
> > *iommu,
> >  	long npage;
> >  	int ret = 0, prot = 0;
> >  	uint64_t mask;
> > -	struct vfio_dma *dma = NULL;
> > +	struct vfio_dma *dma;
> >  	unsigned long pfn;
> > 
> >  	end = map->iova + map->size;
> > 
> > -	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> > 1;
> > +	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > 
> >  	/* READ/WRITE from device perspective */
> >  	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> > @@ -561,9 +546,6 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >  	if (!prot)
> >  		return -EINVAL; /* No READ/WRITE? */
> > 
> > -	if (iommu->cache)
> > -		prot |= IOMMU_CACHE;
> > -
> >  	if (vaddr & mask)
> >  		return -EINVAL;
> >  	if (map->iova & mask)
> > @@ -588,180 +570,257 @@ static int vfio_dma_do_map(struct vfio_iommu
> > *iommu,
> >  		return -EEXIST;
> >  	}
> > 
> > -	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> > -		long i;
> > +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > +	if (!dma) {
> > +		mutex_unlock(&iommu->lock);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	dma->iova = map->iova;
> > +	dma->vaddr = map->vaddr;
> > +	dma->prot = prot;
> > 
> > +	/* Insert zero-sized and grow as we map chunks of it */
> > +	vfio_link_dma(iommu, dma);
> > +
> > +	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> >  		/* Pin a contiguous chunk of memory */
> >  		npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
> >  				       prot, &pfn);
> >  		if (npage <= 0) {
> >  			WARN_ON(!npage);
> >  			ret = (int)npage;
> > -			goto out;
> > -		}
> > -
> > -		/* Verify pages are not already mapped */
> > -		for (i = 0; i < npage; i++) {
> > -			if (iommu_iova_to_phys(iommu->domain,
> > -					       iova + (i << PAGE_SHIFT))) {
> > -				ret = -EBUSY;
> > -				goto out_unpin;
> > -			}
> > +			break;
> >  		}
> > 
> > -		ret = iommu_map(iommu->domain, iova,
> > -				(phys_addr_t)pfn << PAGE_SHIFT,
> > -				npage << PAGE_SHIFT, prot);
> > +		/* Map it! */
> > +		ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
> >  		if (ret) {
> > -			if (ret != -EBUSY ||
> > -			    map_try_harder(iommu, iova, pfn, npage, prot)) {
> > -				goto out_unpin;
> > -			}
> > +			vfio_unpin_pages(pfn, npage, prot, true);
> > +			break;
> >  		}
> > 
> >  		size = npage << PAGE_SHIFT;
> > +		dma->size += size;
> > +	}
> > 
> > -		/*
> > -		 * Check if we abut a region below - nothing below 0.
> > -		 * This is the most likely case when mapping chunks of
> > -		 * physically contiguous regions within a virtual address
> > -		 * range.  Update the abutting entry in place since iova
> > -		 * doesn't change.
> > -		 */
> > -		if (likely(iova)) {
> > -			struct vfio_dma *tmp;
> > -			tmp = vfio_find_dma(iommu, iova - 1, 1);
> > -			if (tmp && tmp->prot == prot &&
> > -			    tmp->vaddr + tmp->size == vaddr) {
> > -				tmp->size += size;
> > -				iova = tmp->iova;
> > -				size = tmp->size;
> > -				vaddr = tmp->vaddr;
> > -				dma = tmp;
> > -			}
> > -		}
> > +	if (ret)
> > +		vfio_remove_dma(iommu, dma);
> > 
> > -		/*
> > -		 * Check if we abut a region above - nothing above ~0 + 1.
> > -		 * If we abut above and below, remove and free.  If only
> > -		 * abut above, remove, modify, reinsert.
> > -		 */
> > -		if (likely(iova + size)) {
> > -			struct vfio_dma *tmp;
> > -			tmp = vfio_find_dma(iommu, iova + size, 1);
> > -			if (tmp && tmp->prot == prot &&
> > -			    tmp->vaddr == vaddr + size) {
> > -				vfio_remove_dma(iommu, tmp);
> > -				if (dma) {
> > -					dma->size += tmp->size;
> > -					kfree(tmp);
> > -				} else {
> > -					size += tmp->size;
> > -					tmp->size = size;
> > -					tmp->iova = iova;
> > -					tmp->vaddr = vaddr;
> > -					vfio_insert_dma(iommu, tmp);
> > -					dma = tmp;
> > -				}
> > -			}
> > -		}
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static int vfio_bus_type(struct device *dev, void *data)
> > +{
> > +	struct bus_type **bus = data;
> > +
> > +	if (*bus && *bus != dev->bus)
> > +		return -EINVAL;
> > +
> > +	*bus = dev->bus;
> > +
> > +	return 0;
> > +}
> > +
> > +static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > +			     struct vfio_domain *domain)
> > +{
> > +	struct vfio_domain *d;
> > +	struct rb_node *n;
> > +	int ret;
> > +
> > +	/* Arbitrarily pick the first domain in the list for lookups */
> > +	d = list_first_entry(&iommu->domain_list, struct vfio_domain,
> > next);
> > +	n = rb_first(&iommu->dma_list);
> > +
> > +	/* If there's not a domain, there better not be any mappings */
> > +	if (WARN_ON(n && !d))
> > +		return -EINVAL;
> > +
> > +	for (; n; n = rb_next(n)) {
> > +		struct vfio_dma *dma;
> > +		dma_addr_t iova;
> > +
> > +		dma = rb_entry(n, struct vfio_dma, node);
> > +		iova = dma->iova;
> > +
> > +		while (iova < dma->iova + dma->size) {
> > +			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> > +			size_t size;
> > 
> > -		if (!dma) {
> > -			dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > -			if (!dma) {
> > -				iommu_unmap(iommu->domain, iova, size);
> > -				ret = -ENOMEM;
> > -				goto out_unpin;
> > +			if (WARN_ON(!phys)) {
> > +				iova += PAGE_SIZE;
> > +				continue;
> >  			}
> > 
> > -			dma->size = size;
> > -			dma->iova = iova;
> > -			dma->vaddr = vaddr;
> > -			dma->prot = prot;
> > -			vfio_insert_dma(iommu, dma);
> > -		}
> > -	}
> > +			size = PAGE_SIZE;
> > 
> > -	WARN_ON(ret);
> > -	mutex_unlock(&iommu->lock);
> > -	return ret;
> > +			while (iova + size < dma->iova + dma->size &&
> > +			       phys + size == iommu_iova_to_phys(d->domain,
> > +								 iova + size))
> > +				size += PAGE_SIZE;
> > 
> > -out_unpin:
> > -	vfio_unpin_pages(pfn, npage, prot, true);
> > +			ret = iommu_map(domain->domain, iova, phys,
> > +					size, dma->prot | domain->prot);
> > +			if (ret)
> > +				return ret;
> > 
> > -out:
> > -	iova = map->iova;
> > -	size = map->size;
> > -	while ((dma = vfio_find_dma(iommu, iova, size))) {
> > -		int r = vfio_remove_dma_overlap(iommu, iova,
> > -						&size, dma);
> > -		if (WARN_ON(r || !size))
> > -			break;
> > +			iova += size;
> > +		}
> >  	}
> > 
> > -	mutex_unlock(&iommu->lock);
> > -	return ret;
> > +	return 0;
> >  }
> > 
> >  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  					 struct iommu_group *iommu_group)
> >  {
> >  	struct vfio_iommu *iommu = iommu_data;
> > -	struct vfio_group *group, *tmp;
> > +	struct vfio_group *group, *g;
> > +	struct vfio_domain *domain, *d;
> > +	struct bus_type *bus = NULL;
> >  	int ret;
> > 
> > -	group = kzalloc(sizeof(*group), GFP_KERNEL);
> > -	if (!group)
> > -		return -ENOMEM;
> > -
> >  	mutex_lock(&iommu->lock);
> > 
> > -	list_for_each_entry(tmp, &iommu->group_list, next) {
> > -		if (tmp->iommu_group == iommu_group) {
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		list_for_each_entry(g, &d->group_list, next) {
> > +			if (g->iommu_group != iommu_group)
> > +				continue;
> > +
> >  			mutex_unlock(&iommu->lock);
> > -			kfree(group);
> >  			return -EINVAL;
> >  		}
> >  	}
> > 
> > +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > +	if (!group || !domain) {
> > +		ret = -ENOMEM;
> > +		goto out_free;
> > +	}
> > +
> > +	group->iommu_group = iommu_group;
> > +
> > +	/* Determine bus_type in order to allocate a domain */
> > +	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
> > +	if (ret)
> > +		goto out_free;
> > +
> > +	domain->domain = iommu_domain_alloc(bus);
> > +	if (!domain->domain) {
> > +		ret = -EIO;
> > +		goto out_free;
> > +	}
> > +
> > +	ret = iommu_attach_group(domain->domain, iommu_group);
> > +	if (ret)
> > +		goto out_domain;
> > +
> > +	INIT_LIST_HEAD(&domain->group_list);
> > +	list_add(&group->next, &domain->group_list);
> > +
> > +	if (!allow_unsafe_interrupts &&
> > +	    !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
> > +		pr_warn("%s: No interrupt remapping support.  Use the module
> > param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this
> > platform\n",
> > +		       __func__);
> > +		ret = -EPERM;
> > +		goto out_detach;
> > +	}
> > +
> > +	if (iommu_domain_has_cap(domain->domain,
> > IOMMU_CAP_CACHE_COHERENCY))
> > +		domain->prot |= IOMMU_CACHE;
> > +
> >  	/*
> > -	 * TODO: Domain have capabilities that might change as we add
> > -	 * groups (see iommu->cache, currently never set).  Check for
> > -	 * them and potentially disallow groups to be attached when it
> > -	 * would change capabilities (ugh).
> > +	 * Try to match an existing compatible domain.  We don't want to
> > +	 * preclude an IOMMU driver supporting multiple bus_types and being
> > +	 * able to include different bus_types in the same IOMMU domain, so
> > +	 * we test whether the domains use the same iommu_ops rather than
> > +	 * testing if they're on the same bus_type.
> >  	 */
> > -	ret = iommu_attach_group(iommu->domain, iommu_group);
> > -	if (ret) {
> > -		mutex_unlock(&iommu->lock);
> > -		kfree(group);
> > -		return ret;
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		if (d->domain->ops == domain->domain->ops &&
> > +		    d->prot == domain->prot) {
> > +			iommu_detach_group(domain->domain, iommu_group);
> > +			if (!iommu_attach_group(d->domain, iommu_group)) {
> > +				list_add(&group->next, &d->group_list);
> > +				iommu_domain_free(domain->domain);
> > +				kfree(domain);
> > +				mutex_unlock(&iommu->lock);
> > +				return 0;
> > +			}
> > +
> > +			ret = iommu_attach_group(domain->domain, iommu_group);
> > +			if (ret)
> > +				goto out_domain;
> > +		}
> >  	}
> > 
> > -	group->iommu_group = iommu_group;
> > -	list_add(&group->next, &iommu->group_list);
> > +	/* replay mappings on new domains */
> > +	ret = vfio_iommu_replay(iommu, domain);
> > +	if (ret)
> > +		goto out_detach;
> > +
> > +	list_add(&domain->next, &iommu->domain_list);
> > 
> >  	mutex_unlock(&iommu->lock);
> > 
> >  	return 0;
> > +
> > +out_detach:
> > +	iommu_detach_group(domain->domain, iommu_group);
> > +out_domain:
> > +	iommu_domain_free(domain->domain);
> > +out_free:
> > +	kfree(domain);
> > +	kfree(group);
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
> > +{
> > +	struct rb_node *node;
> > +
> > +	while ((node = rb_first(&iommu->dma_list)))
> > +		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma,
> > node));
> >  }
> > 
> >  static void vfio_iommu_type1_detach_group(void *iommu_data,
> >  					  struct iommu_group *iommu_group)
> >  {
> >  	struct vfio_iommu *iommu = iommu_data;
> > +	struct vfio_domain *domain;
> >  	struct vfio_group *group;
> > 
> >  	mutex_lock(&iommu->lock);
> > 
> > -	list_for_each_entry(group, &iommu->group_list, next) {
> > -		if (group->iommu_group == iommu_group) {
> > -			iommu_detach_group(iommu->domain, iommu_group);
> > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > +		list_for_each_entry(group, &domain->group_list, next) {
> > +			if (group->iommu_group != iommu_group)
> > +				continue;
> > +
> > +			iommu_detach_group(domain->domain, iommu_group);
> >  			list_del(&group->next);
> >  			kfree(group);
> > -			break;
> > +			/*
> > +			 * Group ownership provides privilege, if the group
> > +			 * list is empty, the domain goes away.  If it's the
> > +			 * last domain, then all the mappings go away too.
> > +			 */
> > +			if (list_empty(&domain->group_list)) {
> > +				if (list_is_singular(&iommu->domain_list))
> > +					vfio_iommu_unmap_unpin_all(iommu);
> > +				iommu_domain_free(domain->domain);
> > +				list_del(&domain->next);
> > +				kfree(domain);
> > +			}
> > +			goto done;
> >  		}
> >  	}
> > 
> > +done:
> >  	mutex_unlock(&iommu->lock);
> >  }
> > 
> > @@ -769,40 +828,17 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >  {
> >  	struct vfio_iommu *iommu;
> > 
> > -	if (arg != VFIO_TYPE1_IOMMU)
> > +	if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
> >  		return ERR_PTR(-EINVAL);
> > 
> >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> >  	if (!iommu)
> >  		return ERR_PTR(-ENOMEM);
> > 
> > -	INIT_LIST_HEAD(&iommu->group_list);
> > +	INIT_LIST_HEAD(&iommu->domain_list);
> >  	iommu->dma_list = RB_ROOT;
> >  	mutex_init(&iommu->lock);
> > -
> > -	/*
> > -	 * Wish we didn't have to know about bus_type here.
> > -	 */
> > -	iommu->domain = iommu_domain_alloc(&pci_bus_type);
> > -	if (!iommu->domain) {
> > -		kfree(iommu);
> > -		return ERR_PTR(-EIO);
> > -	}
> > -
> > -	/*
> > -	 * Wish we could specify required capabilities rather than create
> > -	 * a domain, see what comes out and hope it doesn't change along
> > -	 * the way.  Fortunately we know interrupt remapping is global for
> > -	 * our iommus.
> > -	 */
> > -	if (!allow_unsafe_interrupts &&
> > -	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> > -		pr_warn("%s: No interrupt remapping support.  Use the module
> > param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this
> > platform\n",
> > -		       __func__);
> > -		iommu_domain_free(iommu->domain);
> > -		kfree(iommu);
> > -		return ERR_PTR(-EPERM);
> > -	}
> > +	iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);
> > 
> >  	return iommu;
> >  }
> > @@ -810,25 +846,24 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >  static void vfio_iommu_type1_release(void *iommu_data)
> >  {
> >  	struct vfio_iommu *iommu = iommu_data;
> > +	struct vfio_domain *domain, *domain_tmp;
> >  	struct vfio_group *group, *group_tmp;
> > -	struct rb_node *node;
> > 
> > -	list_for_each_entry_safe(group, group_tmp, &iommu->group_list,
> > next) {
> > -		iommu_detach_group(iommu->domain, group->iommu_group);
> > -		list_del(&group->next);
> > -		kfree(group);
> > -	}
> > +	vfio_iommu_unmap_unpin_all(iommu);
> > 
> > -	while ((node = rb_first(&iommu->dma_list))) {
> > -		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> > -		size_t size = dma->size;
> > -		vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
> > -		if (WARN_ON(!size))
> > -			break;
> > +	list_for_each_entry_safe(domain, domain_tmp,
> > +				 &iommu->domain_list, next) {
> > +		list_for_each_entry_safe(group, group_tmp,
> > +					 &domain->group_list, next) {
> > +			iommu_detach_group(domain->domain, group->iommu_group);
> > +			list_del(&group->next);
> > +			kfree(group);
> > +		}
> > +		iommu_domain_free(domain->domain);
> > +		list_del(&domain->next);
> > +		kfree(domain);
> >  	}
> > 
> > -	iommu_domain_free(iommu->domain);
> > -	iommu->domain = NULL;
> >  	kfree(iommu);
> >  }
> > 
> > @@ -841,6 +876,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  	if (cmd == VFIO_CHECK_EXTENSION) {
> >  		switch (arg) {
> >  		case VFIO_TYPE1_IOMMU:
> > +		case VFIO_TYPE1v2_IOMMU:
> >  			return 1;
> >  		default:
> >  			return 0;
> > @@ -858,7 +894,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > 
> >  		info.flags = 0;
> > 
> > -		info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
> > +		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
> > 
> >  		return copy_to_user((void __user *)arg, &info, minsz);
> > 
> > @@ -911,9 +947,6 @@ static const struct vfio_iommu_driver_ops
> > vfio_iommu_driver_ops_type1 = {
> > 
> >  static int __init vfio_iommu_type1_init(void)
> >  {
> > -	if (!iommu_present(&pci_bus_type))
> > -		return -ENODEV;
> > -
> >  	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
> >  }
> > 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 0fd47f5..460fdf2 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -23,6 +23,7 @@
> > 
> >  #define VFIO_TYPE1_IOMMU		1
> >  #define VFIO_SPAPR_TCE_IOMMU		2
> > +#define VFIO_TYPE1v2_IOMMU		3
> > 
> >  /*
> >   * The IOCTL interface is designed for extensibility by embedding the
> > 
> > 
> 



Alex Williamson March 18, 2014, 2:30 p.m. UTC | #3
On Tue, 2014-03-18 at 14:02 +0300, Dennis Mungai wrote:
> Also, on the same note, do Intel processors from Lynnfield (2009 Core i7s)
> and Arrandale (2010 Mobile Intel Core processors) that advertise VT-d
> support handle these advanced features?

You would need to check the capability registers of the physical
hardware or look for it in the datasheets.  You can decode it from dmesg
by looking for these lines:

dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c0000020e60262 ecap f0101a
dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c9008020660262 ecap f0105a

On Intel, IOMMU_CACHE is available with Snoop Control:

#define ecap_sc_support(e)      ((e >> 7) & 0x1) /* Snooping Control */

So neither of the IOMMUs in the example above supports it (an Ivy Bridge
mobile system).
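
If you'd rather not decode the hex by hand, the same check fits in a few
lines of throwaway userspace C (just a sketch, not kernel code; pass in the
ecap value that dmesg printed):

/* ecap_sc.c - report whether an Intel IOMMU ecap value advertises
 * Snoop Control (bit 7), e.g. the "ecap f0101a" field from dmesg. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	unsigned long long ecap = strtoull(argc > 1 ? argv[1] : "0", NULL, 16);

	printf("Snoop Control: %s\n", ((ecap >> 7) & 0x1) ? "yes" : "no");
	return 0;
}

Running "./ecap_sc f0101a" and "./ecap_sc f0105a" prints "Snoop Control: no"
for both of the units above.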

> Apparently, even in cases where both the processor and the chipset handle
> VT-d for the IOMMU, BIOS vendors mess it up so badly, and at times, even
> so-called workarounds don't help much.

The change here presupposes working IOMMU support; it's not going to fix
your BIOS.  Much of the intel-iommu code tries to avoid trusting the
BIOS, but basic DMAR support is required.  Thanks,

Alex

> On 17 Feb 2014 23:25, "Alex Williamson" <alex.williamson@redhat.com> wrote:
> 
> > We currently have a problem that we cannot support advanced features
> > of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
> > that those features will be supported by all of the hardware units
> > involved with the domain over its lifetime.  For instance, the Intel
> > VT-d architecture does not require that all DRHDs support snoop
> > control.  If we create a domain based on a device behind a DRHD that
> > does support snoop control and enable SNP support via the IOMMU_CACHE
> > mapping option, we cannot then add a device behind a DRHD which does
> > not support snoop control or we'll get reserved bit faults from the
> > SNP bit in the pagetables.  To add to the complexity, we can't know
> > the properties of a domain until a device is attached.
> >
> > We could pass this problem off to userspace and require that a
> > separate vfio container be used, but we don't know how to handle page
> > accounting in that case.  How do we know that a page pinned in one
> > container is the same page as a different container and avoid double
> > billing the user for the page.
> >
> > The solution is therefore to support multiple IOMMU domains per
> > container.  In the majority of cases, only one domain will be required
> > since hardware is typically consistent within a system.  However, this
> > provides us the ability to validate compatibility of domains and
> > support mixed environments where page table flags can be different
> > between domains.
> >
> > To do this, our DMA tracking needs to change.  We currently try to
> > coalesce user mappings into as few tracking entries as possible.  The
> > problem then becomes that we lose granularity of user mappings.  We've
> > never guaranteed that a user is able to unmap at a finer granularity
> > than the original mapping, but we must honor the granularity of the
> > original mapping.  This coalescing code is therefore removed, allowing
> > only unmaps covering complete maps.  The change in accounting is
> > fairly small here, a typical QEMU VM will start out with roughly a
> > dozen entries, so it's arguable if this coalescing was ever needed.
> >
> > We also move IOMMU domain creation to the point where a group is
> > attached to the container.  An interesting side-effect of this is that
> > we now have access to the device at the time of domain creation and
> > can probe the devices within the group to determine the bus_type.
> > This finally makes vfio_iommu_type1 completely device/bus agnostic.
> > In fact, each IOMMU domain can host devices on different buses managed
> > by different physical IOMMUs, and present a single DMA mapping
> > interface to the user.  When a new domain is created, mappings are
> > replayed to bring the IOMMU pagetables up to the state of the current
> > container.  And of course, DMA mapping and unmapping automatically
> > traverse all of the configured IOMMU domains.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Varun Sethi <Varun.Sethi@freescale.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c |  637
> > +++++++++++++++++++++------------------
> >  include/uapi/linux/vfio.h       |    1
> >  2 files changed, 336 insertions(+), 302 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 4fb7a8f..8c7bb9b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -30,7 +30,6 @@
> >  #include <linux/iommu.h>
> >  #include <linux/module.h>
> >  #include <linux/mm.h>
> > -#include <linux/pci.h>         /* pci_bus_type */
> >  #include <linux/rbtree.h>
> >  #include <linux/sched.h>
> >  #include <linux/slab.h>
> > @@ -55,11 +54,17 @@ MODULE_PARM_DESC(disable_hugepages,
> >                  "Disable VFIO IOMMU support for IOMMU hugepages.");
> >
> >  struct vfio_iommu {
> > -       struct iommu_domain     *domain;
> > +       struct list_head        domain_list;
> >         struct mutex            lock;
> >         struct rb_root          dma_list;
> > +       bool v2;
> > +};
> > +
> > +struct vfio_domain {
> > +       struct iommu_domain     *domain;
> > +       struct list_head        next;
> >         struct list_head        group_list;
> > -       bool                    cache;
> > +       int                     prot;           /* IOMMU_CACHE */
> >  };
> >
> >  struct vfio_dma {
> > @@ -99,7 +104,7 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu
> > *iommu,
> >         return NULL;
> >  }
> >
> > -static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *new)
> > +static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
> >  {
> >         struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> >         struct vfio_dma *dma;
> > @@ -118,7 +123,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu,
> > struct vfio_dma *new)
> >         rb_insert_color(&new->node, &iommu->dma_list);
> >  }
> >
> > -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *old)
> > +static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *old)
> >  {
> >         rb_erase(&old->node, &iommu->dma_list);
> >  }
> > @@ -322,32 +327,39 @@ static long vfio_unpin_pages(unsigned long pfn, long
> > npage,
> >         return unlocked;
> >  }
> >
> > -static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> > *dma,
> > -                           dma_addr_t iova, size_t *size)
> > +static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> > *dma)
> >  {
> > -       dma_addr_t start = iova, end = iova + *size;
> > +       dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> > +       struct vfio_domain *domain, *d;
> >         long unlocked = 0;
> >
> > +       if (!dma->size)
> > +               return;
> > +       /*
> > +        * We use the IOMMU to track the physical addresses, otherwise we'd
> > +        * need a much more complicated tracking system.  Unfortunately
> > that
> > +        * means we need to use one of the iommu domains to figure out the
> > +        * pfns to unpin.  The rest need to be unmapped in advance so we
> > have
> > +        * no iommu translations remaining when the pages are unpinned.
> > +        */
> > +       domain = d = list_first_entry(&iommu->domain_list,
> > +                                     struct vfio_domain, next);
> > +
> > +       list_for_each_entry_continue(d, &iommu->domain_list, next)
> > +               iommu_unmap(d->domain, dma->iova, dma->size);
> > +
> >         while (iova < end) {
> >                 size_t unmapped;
> >                 phys_addr_t phys;
> >
> > -               /*
> > -                * We use the IOMMU to track the physical address.  This
> > -                * saves us from having a lot more entries in our mapping
> > -                * tree.  The downside is that we don't track the size
> > -                * used to do the mapping.  We request unmap of a single
> > -                * page, but expect IOMMUs that support large pages to
> > -                * unmap a larger chunk.
> > -                */
> > -               phys = iommu_iova_to_phys(iommu->domain, iova);
> > +               phys = iommu_iova_to_phys(domain->domain, iova);
> >                 if (WARN_ON(!phys)) {
> >                         iova += PAGE_SIZE;
> >                         continue;
> >                 }
> >
> > -               unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> > -               if (!unmapped)
> > +               unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
> > +               if (WARN_ON(!unmapped))
> >                         break;
> >
> >                 unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> > @@ -357,119 +369,26 @@ static int vfio_unmap_unpin(struct vfio_iommu
> > *iommu, struct vfio_dma *dma,
> >         }
> >
> >         vfio_lock_acct(-unlocked);
> > -
> > -       *size = iova - start;
> > -
> > -       return 0;
> >  }
> >
> > -static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > start,
> > -                                  size_t *size, struct vfio_dma *dma)
> > +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *dma)
> >  {
> > -       size_t offset, overlap, tmp;
> > -       struct vfio_dma *split;
> > -       int ret;
> > -
> > -       if (!*size)
> > -               return 0;
> > -
> > -       /*
> > -        * Existing dma region is completely covered, unmap all.  This is
> > -        * the likely case since userspace tends to map and unmap buffers
> > -        * in one shot rather than multiple mappings within a buffer.
> > -        */
> > -       if (likely(start <= dma->iova &&
> > -                  start + *size >= dma->iova + dma->size)) {
> > -               *size = dma->size;
> > -               ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
> > -               if (ret)
> > -                       return ret;
> > -
> > -               /*
> > -                * Did we remove more than we have?  Should never happen
> > -                * since a vfio_dma is contiguous in iova and vaddr.
> > -                */
> > -               WARN_ON(*size != dma->size);
> > -
> > -               vfio_remove_dma(iommu, dma);
> > -               kfree(dma);
> > -               return 0;
> > -       }
> > -
> > -       /* Overlap low address of existing range */
> > -       if (start <= dma->iova) {
> > -               overlap = start + *size - dma->iova;
> > -               ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
> > -               if (ret)
> > -                       return ret;
> > -
> > -               vfio_remove_dma(iommu, dma);
> > -
> > -               /*
> > -                * Check, we may have removed to whole vfio_dma.  If not
> > -                * fixup and re-insert.
> > -                */
> > -               if (overlap < dma->size) {
> > -                       dma->iova += overlap;
> > -                       dma->vaddr += overlap;
> > -                       dma->size -= overlap;
> > -                       vfio_insert_dma(iommu, dma);
> > -               } else
> > -                       kfree(dma);
> > -
> > -               *size = overlap;
> > -               return 0;
> > -       }
> > -
> > -       /* Overlap high address of existing range */
> > -       if (start + *size >= dma->iova + dma->size) {
> > -               offset = start - dma->iova;
> > -               overlap = dma->size - offset;
> > -
> > -               ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
> > -               if (ret)
> > -                       return ret;
> > -
> > -               dma->size -= overlap;
> > -               *size = overlap;
> > -               return 0;
> > -       }
> > -
> > -       /* Split existing */
> > -
> > -       /*
> > -        * Allocate our tracking structure early even though it may not
> > -        * be used.  An Allocation failure later loses track of pages and
> > -        * is more difficult to unwind.
> > -        */
> > -       split = kzalloc(sizeof(*split), GFP_KERNEL);
> > -       if (!split)
> > -               return -ENOMEM;
> > -
> > -       offset = start - dma->iova;
> > -
> > -       ret = vfio_unmap_unpin(iommu, dma, start, size);
> > -       if (ret || !*size) {
> > -               kfree(split);
> > -               return ret;
> > -       }
> > -
> > -       tmp = dma->size;
> > +       vfio_unmap_unpin(iommu, dma);
> > +       vfio_unlink_dma(iommu, dma);
> > +       kfree(dma);
> > +}
> >
> > -       /* Resize the lower vfio_dma in place, before the below insert */
> > -       dma->size = offset;
> > +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > +{
> > +       struct vfio_domain *domain;
> > +       unsigned long bitmap = PAGE_MASK;
> >
> > -       /* Insert new for remainder, assuming it didn't all get unmapped */
> > -       if (likely(offset + *size < tmp)) {
> > -               split->size = tmp - offset - *size;
> > -               split->iova = dma->iova + offset + *size;
> > -               split->vaddr = dma->vaddr + offset + *size;
> > -               split->prot = dma->prot;
> > -               vfio_insert_dma(iommu, split);
> > -       } else
> > -               kfree(split);
> > +       mutex_lock(&iommu->lock);
> > +       list_for_each_entry(domain, &iommu->domain_list, next)
> > +               bitmap &= domain->domain->ops->pgsize_bitmap;
> > +       mutex_unlock(&iommu->lock);
> >
> > -       return 0;
> > +       return bitmap;
> >  }
> >
> >  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > @@ -477,10 +396,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> >  {
> >         uint64_t mask;
> >         struct vfio_dma *dma;
> > -       size_t unmapped = 0, size;
> > +       size_t unmapped = 0;
> >         int ret = 0;
> >
> > -       mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> > 1;
> > +       mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >
> >         if (unmap->iova & mask)
> >                 return -EINVAL;
> > @@ -491,20 +410,61 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> >
> >         mutex_lock(&iommu->lock);
> >
> > +       /*
> > +        * vfio-iommu-type1 (v1) - User mappings were coalesced together to
> > +        * avoid tracking individual mappings.  This means that the
> > granularity
> > +        * of the original mapping was lost and the user was allowed to
> > attempt
> > +        * to unmap any range.  Depending on the contiguousness of physical
> > +        * memory and page sizes supported by the IOMMU, arbitrary unmaps
> > may
> > +        * or may not have worked.  We only guaranteed unmap granularity
> > +        * matching the original mapping; even though it was untracked
> > here,
> > +        * the original mappings are reflected in IOMMU mappings.  This
> > +        * resulted in a couple unusual behaviors.  First, if a range is
> > not
> > +        * able to be unmapped, ex. a set of 4k pages that was mapped as a
> > +        * 2M hugepage into the IOMMU, the unmap ioctl returns success but
> > with
> > +        * a zero sized unmap.  Also, if an unmap request overlaps the
> > first
> > +        * address of a hugepage, the IOMMU will unmap the entire hugepage.
> > +        * This also returns success and the returned unmap size reflects
> > the
> > +        * actual size unmapped.
> > +        *
> > +        * We attempt to maintain compatibility with this "v1" interface,
> > but
> > +        * we take control out of the hands of the IOMMU.  Therefore, an
> > unmap
> > +        * request offset from the beginning of the original mapping will
> > +        * return success with zero sized unmap.  And an unmap request
> > covering
> > +        * the first iova of mapping will unmap the entire range.
> > +        *
> > +        * The v2 version of this interface intends to be more
> > deterministic.
> > +        * Unmap requests must fully cover previous mappings.  Multiple
> > +        * mappings may still be unmaped by specifying large ranges, but
> > there
> > +        * must not be any previous mappings bisected by the range.  An
> > error
> > +        * will be returned if these conditions are not met.  The v2
> > interface
> > +        * will only return success and a size of zero if there were no
> > +        * mappings within the range.
> > +        */
> > +       if (iommu->v2) {
> > +               dma = vfio_find_dma(iommu, unmap->iova, 0);
> > +               if (dma && dma->iova != unmap->iova) {
> > +                       ret = -EINVAL;
> > +                       goto unlock;
> > +               }
> > +               dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1,
> > 0);
> > +               if (dma && dma->iova + dma->size != unmap->iova +
> > unmap->size) {
> > +                       ret = -EINVAL;
> > +                       goto unlock;
> > +               }
> > +       }
> > +
> >         while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> > -               size = unmap->size;
> > -               ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size,
> > dma);
> > -               if (ret || !size)
> > +               if (!iommu->v2 && unmap->iova > dma->iova)
> >                         break;
> > -               unmapped += size;
> > +               unmapped += dma->size;
> > +               vfio_remove_dma(iommu, dma);
> >         }
> >
> > +unlock:
> >         mutex_unlock(&iommu->lock);
> >
> > -       /*
> > -        * We may unmap more than requested, update the unmap struct so
> > -        * userspace can know.
> > -        */
> > +       /* Report how much was unmapped */
> >         unmap->size = unmapped;
> >
> >         return ret;
> > @@ -516,22 +476,47 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> >   * soon, so this is just a temporary workaround to break mappings down
> > into
> >   * PAGE_SIZE.  Better to map smaller pages than nothing.
> >   */
> > -static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
> > +static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
> >                           unsigned long pfn, long npage, int prot)
> >  {
> >         long i;
> >         int ret;
> >
> >         for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
> > -               ret = iommu_map(iommu->domain, iova,
> > +               ret = iommu_map(domain->domain, iova,
> >                                 (phys_addr_t)pfn << PAGE_SHIFT,
> > -                               PAGE_SIZE, prot);
> > +                               PAGE_SIZE, prot | domain->prot);
> >                 if (ret)
> >                         break;
> >         }
> >
> >         for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
> > -               iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> > +               iommu_unmap(domain->domain, iova, PAGE_SIZE);
> > +
> > +       return ret;
> > +}
> > +
> > +static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
> > +                         unsigned long pfn, long npage, int prot)
> > +{
> > +       struct vfio_domain *d;
> > +       int ret;
> > +
> > +       list_for_each_entry(d, &iommu->domain_list, next) {
> > +               ret = iommu_map(d->domain, iova, (phys_addr_t)pfn <<
> > PAGE_SHIFT,
> > +                               npage << PAGE_SHIFT, prot | d->prot);
> > +               if (ret) {
> > +                       if (ret != -EBUSY ||
> > +                           map_try_harder(d, iova, pfn, npage, prot))
> > +                               goto unwind;
> > +               }
> > +       }
> > +
> > +       return 0;
> > +
> > +unwind:
> > +       list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
> > +               iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
> >
> >         return ret;
> >  }
> > @@ -545,12 +530,12 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >         long npage;
> >         int ret = 0, prot = 0;
> >         uint64_t mask;
> > -       struct vfio_dma *dma = NULL;
> > +       struct vfio_dma *dma;
> >         unsigned long pfn;
> >
> >         end = map->iova + map->size;
> >
> > -       mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> > 1;
> > +       mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >
> >         /* READ/WRITE from device perspective */
> >         if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> > @@ -561,9 +546,6 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >         if (!prot)
> >                 return -EINVAL; /* No READ/WRITE? */
> >
> > -       if (iommu->cache)
> > -               prot |= IOMMU_CACHE;
> > -
> >         if (vaddr & mask)
> >                 return -EINVAL;
> >         if (map->iova & mask)
> > @@ -588,180 +570,257 @@ static int vfio_dma_do_map(struct vfio_iommu
> > *iommu,
> >                 return -EEXIST;
> >         }
> >
> > -       for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> > -               long i;
> > +       dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > +       if (!dma) {
> > +               mutex_unlock(&iommu->lock);
> > +               return -ENOMEM;
> > +       }
> > +
> > +       dma->iova = map->iova;
> > +       dma->vaddr = map->vaddr;
> > +       dma->prot = prot;
> >
> > +       /* Insert zero-sized and grow as we map chunks of it */
> > +       vfio_link_dma(iommu, dma);
> > +
> > +       for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> >                 /* Pin a contiguous chunk of memory */
> >                 npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
> >                                        prot, &pfn);
> >                 if (npage <= 0) {
> >                         WARN_ON(!npage);
> >                         ret = (int)npage;
> > -                       goto out;
> > -               }
> > -
> > -               /* Verify pages are not already mapped */
> > -               for (i = 0; i < npage; i++) {
> > -                       if (iommu_iova_to_phys(iommu->domain,
> > -                                              iova + (i << PAGE_SHIFT))) {
> > -                               ret = -EBUSY;
> > -                               goto out_unpin;
> > -                       }
> > +                       break;
> >                 }
> >
> > -               ret = iommu_map(iommu->domain, iova,
> > -                               (phys_addr_t)pfn << PAGE_SHIFT,
> > -                               npage << PAGE_SHIFT, prot);
> > +               /* Map it! */
> > +               ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
> >                 if (ret) {
> > -                       if (ret != -EBUSY ||
> > -                           map_try_harder(iommu, iova, pfn, npage, prot))
> > {
> > -                               goto out_unpin;
> > -                       }
> > +                       vfio_unpin_pages(pfn, npage, prot, true);
> > +                       break;
> >                 }
> >
> >                 size = npage << PAGE_SHIFT;
> > +               dma->size += size;
> > +       }
> >
> > -               /*
> > -                * Check if we abut a region below - nothing below 0.
> > -                * This is the most likely case when mapping chunks of
> > -                * physically contiguous regions within a virtual address
> > -                * range.  Update the abutting entry in place since iova
> > -                * doesn't change.
> > -                */
> > -               if (likely(iova)) {
> > -                       struct vfio_dma *tmp;
> > -                       tmp = vfio_find_dma(iommu, iova - 1, 1);
> > -                       if (tmp && tmp->prot == prot &&
> > -                           tmp->vaddr + tmp->size == vaddr) {
> > -                               tmp->size += size;
> > -                               iova = tmp->iova;
> > -                               size = tmp->size;
> > -                               vaddr = tmp->vaddr;
> > -                               dma = tmp;
> > -                       }
> > -               }
> > +       if (ret)
> > +               vfio_remove_dma(iommu, dma);
> >
> > -               /*
> > -                * Check if we abut a region above - nothing above ~0 + 1.
> > -                * If we abut above and below, remove and free.  If only
> > -                * abut above, remove, modify, reinsert.
> > -                */
> > -               if (likely(iova + size)) {
> > -                       struct vfio_dma *tmp;
> > -                       tmp = vfio_find_dma(iommu, iova + size, 1);
> > -                       if (tmp && tmp->prot == prot &&
> > -                           tmp->vaddr == vaddr + size) {
> > -                               vfio_remove_dma(iommu, tmp);
> > -                               if (dma) {
> > -                                       dma->size += tmp->size;
> > -                                       kfree(tmp);
> > -                               } else {
> > -                                       size += tmp->size;
> > -                                       tmp->size = size;
> > -                                       tmp->iova = iova;
> > -                                       tmp->vaddr = vaddr;
> > -                                       vfio_insert_dma(iommu, tmp);
> > -                                       dma = tmp;
> > -                               }
> > -                       }
> > -               }
> > +       mutex_unlock(&iommu->lock);
> > +       return ret;
> > +}
> > +
> > +static int vfio_bus_type(struct device *dev, void *data)
> > +{
> > +       struct bus_type **bus = data;
> > +
> > +       if (*bus && *bus != dev->bus)
> > +               return -EINVAL;
> > +
> > +       *bus = dev->bus;
> > +
> > +       return 0;
> > +}
> > +
> > +static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > +                            struct vfio_domain *domain)
> > +{
> > +       struct vfio_domain *d;
> > +       struct rb_node *n;
> > +       int ret;
> > +
> > +       /* Arbitrarily pick the first domain in the list for lookups */
> > +       d = list_first_entry(&iommu->domain_list, struct vfio_domain,
> > next);
> > +       n = rb_first(&iommu->dma_list);
> > +
> > +       /* If there's not a domain, there better not be any mappings */
> > +       if (WARN_ON(n && !d))
> > +               return -EINVAL;
> > +
> > +       for (; n; n = rb_next(n)) {
> > +               struct vfio_dma *dma;
> > +               dma_addr_t iova;
> > +
> > +               dma = rb_entry(n, struct vfio_dma, node);
> > +               iova = dma->iova;
> > +
> > +               while (iova < dma->iova + dma->size) {
> > +                       phys_addr_t phys = iommu_iova_to_phys(d->domain,
> > iova);
> > +                       size_t size;
> >
> > -               if (!dma) {
> > -                       dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > -                       if (!dma) {
> > -                               iommu_unmap(iommu->domain, iova, size);
> > -                               ret = -ENOMEM;
> > -                               goto out_unpin;
> > +                       if (WARN_ON(!phys)) {
> > +                               iova += PAGE_SIZE;
> > +                               continue;
> >                         }
> >
> > -                       dma->size = size;
> > -                       dma->iova = iova;
> > -                       dma->vaddr = vaddr;
> > -                       dma->prot = prot;
> > -                       vfio_insert_dma(iommu, dma);
> > -               }
> > -       }
> > +                       size = PAGE_SIZE;
> >
> > -       WARN_ON(ret);
> > -       mutex_unlock(&iommu->lock);
> > -       return ret;
> > +                       while (iova + size < dma->iova + dma->size &&
> > +                              phys + size == iommu_iova_to_phys(d->domain,
> > +                                                                iova +
> > size))
> > +                               size += PAGE_SIZE;
> >
> > -out_unpin:
> > -       vfio_unpin_pages(pfn, npage, prot, true);
> > +                       ret = iommu_map(domain->domain, iova, phys,
> > +                                       size, dma->prot | domain->prot);
> > +                       if (ret)
> > +                               return ret;
> >
> > -out:
> > -       iova = map->iova;
> > -       size = map->size;
> > -       while ((dma = vfio_find_dma(iommu, iova, size))) {
> > -               int r = vfio_remove_dma_overlap(iommu, iova,
> > -                                               &size, dma);
> > -               if (WARN_ON(r || !size))
> > -                       break;
> > +                       iova += size;
> > +               }
> >         }
> >
> > -       mutex_unlock(&iommu->lock);
> > -       return ret;
> > +       return 0;
> >  }
> >
> >  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >                                          struct iommu_group *iommu_group)
> >  {
> >         struct vfio_iommu *iommu = iommu_data;
> > -       struct vfio_group *group, *tmp;
> > +       struct vfio_group *group, *g;
> > +       struct vfio_domain *domain, *d;
> > +       struct bus_type *bus = NULL;
> >         int ret;
> >
> > -       group = kzalloc(sizeof(*group), GFP_KERNEL);
> > -       if (!group)
> > -               return -ENOMEM;
> > -
> >         mutex_lock(&iommu->lock);
> >
> > -       list_for_each_entry(tmp, &iommu->group_list, next) {
> > -               if (tmp->iommu_group == iommu_group) {
> > +       list_for_each_entry(d, &iommu->domain_list, next) {
> > +               list_for_each_entry(g, &d->group_list, next) {
> > +                       if (g->iommu_group != iommu_group)
> > +                               continue;
> > +
> >                         mutex_unlock(&iommu->lock);
> > -                       kfree(group);
> >                         return -EINVAL;
> >                 }
> >         }
> >
> > +       group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +       domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > +       if (!group || !domain) {
> > +               ret = -ENOMEM;
> > +               goto out_free;
> > +       }
> > +
> > +       group->iommu_group = iommu_group;
> > +
> > +       /* Determine bus_type in order to allocate a domain */
> > +       ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
> > +       if (ret)
> > +               goto out_free;
> > +
> > +       domain->domain = iommu_domain_alloc(bus);
> > +       if (!domain->domain) {
> > +               ret = -EIO;
> > +               goto out_free;
> > +       }
> > +
> > +       ret = iommu_attach_group(domain->domain, iommu_group);
> > +       if (ret)
> > +               goto out_domain;
> > +
> > +       INIT_LIST_HEAD(&domain->group_list);
> > +       list_add(&group->next, &domain->group_list);
> > +
> > +       if (!allow_unsafe_interrupts &&
> > +           !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
> > +               pr_warn("%s: No interrupt remapping support.  Use the
> > module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on
> > this platform\n",
> > +                      __func__);
> > +               ret = -EPERM;
> > +               goto out_detach;
> > +       }
> > +
> > +       if (iommu_domain_has_cap(domain->domain,
> > IOMMU_CAP_CACHE_COHERENCY))
> > +               domain->prot |= IOMMU_CACHE;
> > +
> >         /*
> > -        * TODO: Domain have capabilities that might change as we add
> > -        * groups (see iommu->cache, currently never set).  Check for
> > -        * them and potentially disallow groups to be attached when it
> > -        * would change capabilities (ugh).
> > +        * Try to match an existing compatible domain.  We don't want to
> > +        * preclude an IOMMU driver supporting multiple bus_types and being
> > +        * able to include different bus_types in the same IOMMU domain, so
> > +        * we test whether the domains use the same iommu_ops rather than
> > +        * testing if they're on the same bus_type.
> >          */
> > -       ret = iommu_attach_group(iommu->domain, iommu_group);
> > -       if (ret) {
> > -               mutex_unlock(&iommu->lock);
> > -               kfree(group);
> > -               return ret;
> > +       list_for_each_entry(d, &iommu->domain_list, next) {
> > +               if (d->domain->ops == domain->domain->ops &&
> > +                   d->prot == domain->prot) {
> > +                       iommu_detach_group(domain->domain, iommu_group);
> > +                       if (!iommu_attach_group(d->domain, iommu_group)) {
> > +                               list_add(&group->next, &d->group_list);
> > +                               iommu_domain_free(domain->domain);
> > +                               kfree(domain);
> > +                               mutex_unlock(&iommu->lock);
> > +                               return 0;
> > +                       }
> > +
> > +                       ret = iommu_attach_group(domain->domain, iommu_group);
> > +                       if (ret)
> > +                               goto out_domain;
> > +               }
> >         }
> >
> > -       group->iommu_group = iommu_group;
> > -       list_add(&group->next, &iommu->group_list);
> > +       /* replay mappings on new domains */
> > +       ret = vfio_iommu_replay(iommu, domain);
> > +       if (ret)
> > +               goto out_detach;
> > +
> > +       list_add(&domain->next, &iommu->domain_list);
> >
> >         mutex_unlock(&iommu->lock);
> >
> >         return 0;
> > +
> > +out_detach:
> > +       iommu_detach_group(domain->domain, iommu_group);
> > +out_domain:
> > +       iommu_domain_free(domain->domain);
> > +out_free:
> > +       kfree(domain);
> > +       kfree(group);
> > +       mutex_unlock(&iommu->lock);
> > +       return ret;
> > +}
> > +
> > +static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
> > +{
> > +       struct rb_node *node;
> > +
> > +       while ((node = rb_first(&iommu->dma_list)))
> > +               vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> >  }
> >
> >  static void vfio_iommu_type1_detach_group(void *iommu_data,
> >                                           struct iommu_group *iommu_group)
> >  {
> >         struct vfio_iommu *iommu = iommu_data;
> > +       struct vfio_domain *domain;
> >         struct vfio_group *group;
> >
> >         mutex_lock(&iommu->lock);
> >
> > -       list_for_each_entry(group, &iommu->group_list, next) {
> > -               if (group->iommu_group == iommu_group) {
> > -                       iommu_detach_group(iommu->domain, iommu_group);
> > +       list_for_each_entry(domain, &iommu->domain_list, next) {
> > +               list_for_each_entry(group, &domain->group_list, next) {
> > +                       if (group->iommu_group != iommu_group)
> > +                               continue;
> > +
> > +                       iommu_detach_group(domain->domain, iommu_group);
> >                         list_del(&group->next);
> >                         kfree(group);
> > -                       break;
> > +                       /*
> > +                        * Group ownership provides privilege, if the group
> > +                        * list is empty, the domain goes away.  If it's the
> > +                        * last domain, then all the mappings go away too.
> > +                        */
> > +                       if (list_empty(&domain->group_list)) {
> > +                               if (list_is_singular(&iommu->domain_list))
> > +                                       vfio_iommu_unmap_unpin_all(iommu);
> > +                               iommu_domain_free(domain->domain);
> > +                               list_del(&domain->next);
> > +                               kfree(domain);
> > +                       }
> > +                       goto done;
> >                 }
> >         }
> >
> > +done:
> >         mutex_unlock(&iommu->lock);
> >  }
> >
> > @@ -769,40 +828,17 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >  {
> >         struct vfio_iommu *iommu;
> >
> > -       if (arg != VFIO_TYPE1_IOMMU)
> > +       if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
> >                 return ERR_PTR(-EINVAL);
> >
> >         iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> >         if (!iommu)
> >                 return ERR_PTR(-ENOMEM);
> >
> > -       INIT_LIST_HEAD(&iommu->group_list);
> > +       INIT_LIST_HEAD(&iommu->domain_list);
> >         iommu->dma_list = RB_ROOT;
> >         mutex_init(&iommu->lock);
> > -
> > -       /*
> > -        * Wish we didn't have to know about bus_type here.
> > -        */
> > -       iommu->domain = iommu_domain_alloc(&pci_bus_type);
> > -       if (!iommu->domain) {
> > -               kfree(iommu);
> > -               return ERR_PTR(-EIO);
> > -       }
> > -
> > -       /*
> > -        * Wish we could specify required capabilities rather than create
> > -        * a domain, see what comes out and hope it doesn't change along
> > -        * the way.  Fortunately we know interrupt remapping is global for
> > -        * our iommus.
> > -        */
> > -       if (!allow_unsafe_interrupts &&
> > -           !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> > -       pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
> > -                      __func__);
> > -               iommu_domain_free(iommu->domain);
> > -               kfree(iommu);
> > -               return ERR_PTR(-EPERM);
> > -       }
> > +       iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);
> >
> >         return iommu;
> >  }
> > @@ -810,25 +846,24 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >  static void vfio_iommu_type1_release(void *iommu_data)
> >  {
> >         struct vfio_iommu *iommu = iommu_data;
> > +       struct vfio_domain *domain, *domain_tmp;
> >         struct vfio_group *group, *group_tmp;
> > -       struct rb_node *node;
> >
> > -       list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
> > -               iommu_detach_group(iommu->domain, group->iommu_group);
> > -               list_del(&group->next);
> > -               kfree(group);
> > -       }
> > +       vfio_iommu_unmap_unpin_all(iommu);
> >
> > -       while ((node = rb_first(&iommu->dma_list))) {
> > -               struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> > -               size_t size = dma->size;
> > -               vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
> > -               if (WARN_ON(!size))
> > -                       break;
> > +       list_for_each_entry_safe(domain, domain_tmp,
> > +                                &iommu->domain_list, next) {
> > +               list_for_each_entry_safe(group, group_tmp,
> > +                                        &domain->group_list, next) {
> > +                       iommu_detach_group(domain->domain, group->iommu_group);
> > +                       list_del(&group->next);
> > +                       kfree(group);
> > +               }
> > +               iommu_domain_free(domain->domain);
> > +               list_del(&domain->next);
> > +               kfree(domain);
> >         }
> >
> > -       iommu_domain_free(iommu->domain);
> > -       iommu->domain = NULL;
> >         kfree(iommu);
> >  }
> >
> > @@ -841,6 +876,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >         if (cmd == VFIO_CHECK_EXTENSION) {
> >                 switch (arg) {
> >                 case VFIO_TYPE1_IOMMU:
> > +               case VFIO_TYPE1v2_IOMMU:
> >                         return 1;
> >                 default:
> >                         return 0;
> > @@ -858,7 +894,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >
> >                 info.flags = 0;
> >
> > -               info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
> > +               info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
> >
> >                 return copy_to_user((void __user *)arg, &info, minsz);
> >
> > @@ -911,9 +947,6 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
> >
> >  static int __init vfio_iommu_type1_init(void)
> >  {
> > -       if (!iommu_present(&pci_bus_type))
> > -               return -ENODEV;
> > -
> >         return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
> >  }
> >
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 0fd47f5..460fdf2 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -23,6 +23,7 @@
> >
> >  #define VFIO_TYPE1_IOMMU               1
> >  #define VFIO_SPAPR_TCE_IOMMU           2
> > +#define VFIO_TYPE1v2_IOMMU             3
> >
> >  /*
> >   * The IOCTL interface is designed for extensibility by embedding the
> >


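For context on the stricter unmap semantics described in the vfio_dma_do_unmap() comment in the patch below, here is a minimal userspace sketch of how a container opts into the new VFIO_TYPE1v2_IOMMU type and why an unmap must cover whole prior mappings.  This is illustrative only, not part of the patch: the group open/attach step and all error handling are omitted, and the IOVA/size values are made up.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);

	/* Prefer the v2 type when the kernel offers it */
	int type = ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU) ?
		   VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;

	/* A group must already be opened and added to the container with
	 * VFIO_GROUP_SET_CONTAINER before this succeeds; omitted here. */
	ioctl(container, VFIO_SET_IOMMU, type);

	void *buf = mmap(NULL, 2 << 20, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(unsigned long)buf,
		.iova  = 1 << 20,	/* arbitrary example IOVA */
		.size  = 2 << 20,
	};
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	/* With v2, an unmap may not bisect a prior mapping: asking for only
	 * the first 1MB of this 2MB mapping returns -EINVAL.  The request
	 * below covers the whole original mapping, so it succeeds and
	 * reports the full 2MB as unmapped. */
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = 1 << 20,
		.size  = 2 << 20,
	};
	ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);

	return 0;
}

Under the existing VFIO_TYPE1_IOMMU type, a request starting at the mapping's first iova removes the whole mapping regardless of the requested size, and a request starting in the middle returns success with size 0 -- the v1 behavior the long comment in vfio_dma_do_unmap() preserves for compatibility.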

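Similarly, a small standalone sketch of the page-size arithmetic behind the new vfio_pgsize_bitmap() helper: with multiple domains, the advertised IOVA page sizes are the bitwise AND of each domain's pgsize_bitmap, and the minimum map/unmap alignment follows from the lowest common bit.  The per-domain bitmaps here are made-up examples, and starting from ~0UL is a simplification of the kernel's PAGE_MASK starting value.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical per-domain pgsize_bitmap values: bit N set means a
	 * 2^N byte IOMMU page size is supported. */
	unsigned long with_large_pages = (1UL << 12) | (1UL << 21) | (1UL << 30);
	unsigned long only_4k          = (1UL << 12);

	/* vfio_pgsize_bitmap(): intersect across all domains in the container */
	unsigned long bitmap = ~0UL;
	bitmap &= with_large_pages;
	bitmap &= only_4k;

	/* vfio_dma_do_map()/do_unmap(): the smallest common page size defines
	 * the required iova/size alignment (__ffs equivalent). */
	uint64_t mask = ((uint64_t)1 << __builtin_ctzl(bitmap)) - 1;

	printf("common page sizes: 0x%lx, min page: %lu, align mask: 0x%llx\n",
	       bitmap, bitmap & -bitmap, (unsigned long long)mask);
	return 0;
}

This intersection is also what VFIO_IOMMU_GET_INFO now reports in iova_pgsizes once multiple IOMMU domains can back a single container.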
diff mbox

Patch

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4fb7a8f..8c7bb9b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -30,7 +30,6 @@ 
 #include <linux/iommu.h>
 #include <linux/module.h>
 #include <linux/mm.h>
-#include <linux/pci.h>		/* pci_bus_type */
 #include <linux/rbtree.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -55,11 +54,17 @@  MODULE_PARM_DESC(disable_hugepages,
 		 "Disable VFIO IOMMU support for IOMMU hugepages.");
 
 struct vfio_iommu {
-	struct iommu_domain	*domain;
+	struct list_head	domain_list;
 	struct mutex		lock;
 	struct rb_root		dma_list;
+	bool v2;
+};
+
+struct vfio_domain {
+	struct iommu_domain	*domain;
+	struct list_head	next;
 	struct list_head	group_list;
-	bool			cache;
+	int			prot;		/* IOMMU_CACHE */
 };
 
 struct vfio_dma {
@@ -99,7 +104,7 @@  static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
 	return NULL;
 }
 
-static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
+static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
 {
 	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
 	struct vfio_dma *dma;
@@ -118,7 +123,7 @@  static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
 	rb_insert_color(&new->node, &iommu->dma_list);
 }
 
-static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
+static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 {
 	rb_erase(&old->node, &iommu->dma_list);
 }
@@ -322,32 +327,39 @@  static long vfio_unpin_pages(unsigned long pfn, long npage,
 	return unlocked;
 }
 
-static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
-			    dma_addr_t iova, size_t *size)
+static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
-	dma_addr_t start = iova, end = iova + *size;
+	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
+	struct vfio_domain *domain, *d;
 	long unlocked = 0;
 
+	if (!dma->size)
+		return;
+	/*
+	 * We use the IOMMU to track the physical addresses, otherwise we'd
+	 * need a much more complicated tracking system.  Unfortunately that
+	 * means we need to use one of the iommu domains to figure out the
+	 * pfns to unpin.  The rest need to be unmapped in advance so we have
+	 * no iommu translations remaining when the pages are unpinned.
+	 */
+	domain = d = list_first_entry(&iommu->domain_list,
+				      struct vfio_domain, next);
+
+	list_for_each_entry_continue(d, &iommu->domain_list, next)
+		iommu_unmap(d->domain, dma->iova, dma->size);
+
 	while (iova < end) {
 		size_t unmapped;
 		phys_addr_t phys;
 
-		/*
-		 * We use the IOMMU to track the physical address.  This
-		 * saves us from having a lot more entries in our mapping
-		 * tree.  The downside is that we don't track the size
-		 * used to do the mapping.  We request unmap of a single
-		 * page, but expect IOMMUs that support large pages to
-		 * unmap a larger chunk.
-		 */
-		phys = iommu_iova_to_phys(iommu->domain, iova);
+		phys = iommu_iova_to_phys(domain->domain, iova);
 		if (WARN_ON(!phys)) {
 			iova += PAGE_SIZE;
 			continue;
 		}
 
-		unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
-		if (!unmapped)
+		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
+		if (WARN_ON(!unmapped))
 			break;
 
 		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
@@ -357,119 +369,26 @@  static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	}
 
 	vfio_lock_acct(-unlocked);
-
-	*size = iova - start;
-
-	return 0;
 }
 
-static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
-				   size_t *size, struct vfio_dma *dma)
+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
-	size_t offset, overlap, tmp;
-	struct vfio_dma *split;
-	int ret;
-
-	if (!*size)
-		return 0;
-
-	/*
-	 * Existing dma region is completely covered, unmap all.  This is
-	 * the likely case since userspace tends to map and unmap buffers
-	 * in one shot rather than multiple mappings within a buffer.
-	 */
-	if (likely(start <= dma->iova &&
-		   start + *size >= dma->iova + dma->size)) {
-		*size = dma->size;
-		ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
-		if (ret)
-			return ret;
-
-		/*
-		 * Did we remove more than we have?  Should never happen
-		 * since a vfio_dma is contiguous in iova and vaddr.
-		 */
-		WARN_ON(*size != dma->size);
-
-		vfio_remove_dma(iommu, dma);
-		kfree(dma);
-		return 0;
-	}
-
-	/* Overlap low address of existing range */
-	if (start <= dma->iova) {
-		overlap = start + *size - dma->iova;
-		ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
-		if (ret)
-			return ret;
-
-		vfio_remove_dma(iommu, dma);
-
-		/*
-		 * Check, we may have removed to whole vfio_dma.  If not
-		 * fixup and re-insert.
-		 */
-		if (overlap < dma->size) {
-			dma->iova += overlap;
-			dma->vaddr += overlap;
-			dma->size -= overlap;
-			vfio_insert_dma(iommu, dma);
-		} else
-			kfree(dma);
-
-		*size = overlap;
-		return 0;
-	}
-
-	/* Overlap high address of existing range */
-	if (start + *size >= dma->iova + dma->size) {
-		offset = start - dma->iova;
-		overlap = dma->size - offset;
-
-		ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
-		if (ret)
-			return ret;
-
-		dma->size -= overlap;
-		*size = overlap;
-		return 0;
-	}
-
-	/* Split existing */
-
-	/*
-	 * Allocate our tracking structure early even though it may not
-	 * be used.  An Allocation failure later loses track of pages and
-	 * is more difficult to unwind.
-	 */
-	split = kzalloc(sizeof(*split), GFP_KERNEL);
-	if (!split)
-		return -ENOMEM;
-
-	offset = start - dma->iova;
-
-	ret = vfio_unmap_unpin(iommu, dma, start, size);
-	if (ret || !*size) {
-		kfree(split);
-		return ret;
-	}
-
-	tmp = dma->size;
+	vfio_unmap_unpin(iommu, dma);
+	vfio_unlink_dma(iommu, dma);
+	kfree(dma);
+}
 
-	/* Resize the lower vfio_dma in place, before the below insert */
-	dma->size = offset;
+static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	unsigned long bitmap = PAGE_MASK;
 
-	/* Insert new for remainder, assuming it didn't all get unmapped */
-	if (likely(offset + *size < tmp)) {
-		split->size = tmp - offset - *size;
-		split->iova = dma->iova + offset + *size;
-		split->vaddr = dma->vaddr + offset + *size;
-		split->prot = dma->prot;
-		vfio_insert_dma(iommu, split);
-	} else
-		kfree(split);
+	mutex_lock(&iommu->lock);
+	list_for_each_entry(domain, &iommu->domain_list, next)
+		bitmap &= domain->domain->ops->pgsize_bitmap;
+	mutex_unlock(&iommu->lock);
 
-	return 0;
+	return bitmap;
 }
 
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
@@ -477,10 +396,10 @@  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 {
 	uint64_t mask;
 	struct vfio_dma *dma;
-	size_t unmapped = 0, size;
+	size_t unmapped = 0;
 	int ret = 0;
 
-	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
 
 	if (unmap->iova & mask)
 		return -EINVAL;
@@ -491,20 +410,61 @@  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 
 	mutex_lock(&iommu->lock);
 
+	/*
+	 * vfio-iommu-type1 (v1) - User mappings were coalesced together to
+	 * avoid tracking individual mappings.  This means that the granularity
+	 * of the original mapping was lost and the user was allowed to attempt
+	 * to unmap any range.  Depending on the contiguousness of physical
+	 * memory and page sizes supported by the IOMMU, arbitrary unmaps may
+	 * or may not have worked.  We only guaranteed unmap granularity
+	 * matching the original mapping; even though it was untracked here,
+	 * the original mappings are reflected in IOMMU mappings.  This
+	 * resulted in a couple unusual behaviors.  First, if a range is not
+	 * able to be unmapped, ex. a set of 4k pages that was mapped as a
+	 * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
+	 * a zero sized unmap.  Also, if an unmap request overlaps the first
+	 * address of a hugepage, the IOMMU will unmap the entire hugepage.
+	 * This also returns success and the returned unmap size reflects the
+	 * actual size unmapped.
+	 *
+	 * We attempt to maintain compatibility with this "v1" interface, but
+	 * we take control out of the hands of the IOMMU.  Therefore, an unmap
+	 * request offset from the beginning of the original mapping will
+	 * return success with zero sized unmap.  And an unmap request covering
+	 * the first iova of mapping will unmap the entire range.
+	 *
+	 * The v2 version of this interface intends to be more deterministic.
+	 * Unmap requests must fully cover previous mappings.  Multiple
+	 * mappings may still be unmapped by specifying large ranges, but there
+	 * must not be any previous mappings bisected by the range.  An error
+	 * will be returned if these conditions are not met.  The v2 interface
+	 * will only return success and a size of zero if there were no
+	 * mappings within the range.
+	 */
+	if (iommu->v2) {
+		dma = vfio_find_dma(iommu, unmap->iova, 0);
+		if (dma && dma->iova != unmap->iova) {
+			ret = -EINVAL;
+			goto unlock;
+		}
+		dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
+		if (dma && dma->iova + dma->size != unmap->iova + unmap->size) {
+			ret = -EINVAL;
+			goto unlock;
+		}
+	}
+
 	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
-		size = unmap->size;
-		ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size, dma);
-		if (ret || !size)
+		if (!iommu->v2 && unmap->iova > dma->iova)
 			break;
-		unmapped += size;
+		unmapped += dma->size;
+		vfio_remove_dma(iommu, dma);
 	}
 
+unlock:
 	mutex_unlock(&iommu->lock);
 
-	/*
-	 * We may unmap more than requested, update the unmap struct so
-	 * userspace can know.
-	 */
+	/* Report how much was unmapped */
 	unmap->size = unmapped;
 
 	return ret;
@@ -516,22 +476,47 @@  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
  * soon, so this is just a temporary workaround to break mappings down into
  * PAGE_SIZE.  Better to map smaller pages than nothing.
  */
-static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
+static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
 			  unsigned long pfn, long npage, int prot)
 {
 	long i;
 	int ret;
 
 	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
-		ret = iommu_map(iommu->domain, iova,
+		ret = iommu_map(domain->domain, iova,
 				(phys_addr_t)pfn << PAGE_SHIFT,
-				PAGE_SIZE, prot);
+				PAGE_SIZE, prot | domain->prot);
 		if (ret)
 			break;
 	}
 
 	for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
-		iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+		iommu_unmap(domain->domain, iova, PAGE_SIZE);
+
+	return ret;
+}
+
+static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
+			  unsigned long pfn, long npage, int prot)
+{
+	struct vfio_domain *d;
+	int ret;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
+				npage << PAGE_SHIFT, prot | d->prot);
+		if (ret) {
+			if (ret != -EBUSY ||
+			    map_try_harder(d, iova, pfn, npage, prot))
+				goto unwind;
+		}
+	}
+
+	return 0;
+
+unwind:
+	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
+		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
 
 	return ret;
 }
@@ -545,12 +530,12 @@  static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	long npage;
 	int ret = 0, prot = 0;
 	uint64_t mask;
-	struct vfio_dma *dma = NULL;
+	struct vfio_dma *dma;
 	unsigned long pfn;
 
 	end = map->iova + map->size;
 
-	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
 
 	/* READ/WRITE from device perspective */
 	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
@@ -561,9 +546,6 @@  static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (!prot)
 		return -EINVAL; /* No READ/WRITE? */
 
-	if (iommu->cache)
-		prot |= IOMMU_CACHE;
-
 	if (vaddr & mask)
 		return -EINVAL;
 	if (map->iova & mask)
@@ -588,180 +570,257 @@  static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		return -EEXIST;
 	}
 
-	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
-		long i;
+	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+	if (!dma) {
+		mutex_unlock(&iommu->lock);
+		return -ENOMEM;
+	}
+
+	dma->iova = map->iova;
+	dma->vaddr = map->vaddr;
+	dma->prot = prot;
 
+	/* Insert zero-sized and grow as we map chunks of it */
+	vfio_link_dma(iommu, dma);
+
+	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
 		/* Pin a contiguous chunk of memory */
 		npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
 				       prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
-			goto out;
-		}
-
-		/* Verify pages are not already mapped */
-		for (i = 0; i < npage; i++) {
-			if (iommu_iova_to_phys(iommu->domain,
-					       iova + (i << PAGE_SHIFT))) {
-				ret = -EBUSY;
-				goto out_unpin;
-			}
+			break;
 		}
 
-		ret = iommu_map(iommu->domain, iova,
-				(phys_addr_t)pfn << PAGE_SHIFT,
-				npage << PAGE_SHIFT, prot);
+		/* Map it! */
+		ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
 		if (ret) {
-			if (ret != -EBUSY ||
-			    map_try_harder(iommu, iova, pfn, npage, prot)) {
-				goto out_unpin;
-			}
+			vfio_unpin_pages(pfn, npage, prot, true);
+			break;
 		}
 
 		size = npage << PAGE_SHIFT;
+		dma->size += size;
+	}
 
-		/*
-		 * Check if we abut a region below - nothing below 0.
-		 * This is the most likely case when mapping chunks of
-		 * physically contiguous regions within a virtual address
-		 * range.  Update the abutting entry in place since iova
-		 * doesn't change.
-		 */
-		if (likely(iova)) {
-			struct vfio_dma *tmp;
-			tmp = vfio_find_dma(iommu, iova - 1, 1);
-			if (tmp && tmp->prot == prot &&
-			    tmp->vaddr + tmp->size == vaddr) {
-				tmp->size += size;
-				iova = tmp->iova;
-				size = tmp->size;
-				vaddr = tmp->vaddr;
-				dma = tmp;
-			}
-		}
+	if (ret)
+		vfio_remove_dma(iommu, dma);
 
-		/*
-		 * Check if we abut a region above - nothing above ~0 + 1.
-		 * If we abut above and below, remove and free.  If only
-		 * abut above, remove, modify, reinsert.
-		 */
-		if (likely(iova + size)) {
-			struct vfio_dma *tmp;
-			tmp = vfio_find_dma(iommu, iova + size, 1);
-			if (tmp && tmp->prot == prot &&
-			    tmp->vaddr == vaddr + size) {
-				vfio_remove_dma(iommu, tmp);
-				if (dma) {
-					dma->size += tmp->size;
-					kfree(tmp);
-				} else {
-					size += tmp->size;
-					tmp->size = size;
-					tmp->iova = iova;
-					tmp->vaddr = vaddr;
-					vfio_insert_dma(iommu, tmp);
-					dma = tmp;
-				}
-			}
-		}
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static int vfio_bus_type(struct device *dev, void *data)
+{
+	struct bus_type **bus = data;
+
+	if (*bus && *bus != dev->bus)
+		return -EINVAL;
+
+	*bus = dev->bus;
+
+	return 0;
+}
+
+static int vfio_iommu_replay(struct vfio_iommu *iommu,
+			     struct vfio_domain *domain)
+{
+	struct vfio_domain *d;
+	struct rb_node *n;
+	int ret;
+
+	/* Arbitrarily pick the first domain in the list for lookups */
+	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
+	n = rb_first(&iommu->dma_list);
+
+	/* If there's not a domain, there better not be any mappings */
+	if (WARN_ON(n && !d))
+		return -EINVAL;
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_dma *dma;
+		dma_addr_t iova;
+
+		dma = rb_entry(n, struct vfio_dma, node);
+		iova = dma->iova;
+
+		while (iova < dma->iova + dma->size) {
+			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
+			size_t size;
 
-		if (!dma) {
-			dma = kzalloc(sizeof(*dma), GFP_KERNEL);
-			if (!dma) {
-				iommu_unmap(iommu->domain, iova, size);
-				ret = -ENOMEM;
-				goto out_unpin;
+			if (WARN_ON(!phys)) {
+				iova += PAGE_SIZE;
+				continue;
 			}
 
-			dma->size = size;
-			dma->iova = iova;
-			dma->vaddr = vaddr;
-			dma->prot = prot;
-			vfio_insert_dma(iommu, dma);
-		}
-	}
+			size = PAGE_SIZE;
 
-	WARN_ON(ret);
-	mutex_unlock(&iommu->lock);
-	return ret;
+			while (iova + size < dma->iova + dma->size &&
+			       phys + size == iommu_iova_to_phys(d->domain,
+								 iova + size))
+				size += PAGE_SIZE;
 
-out_unpin:
-	vfio_unpin_pages(pfn, npage, prot, true);
+			ret = iommu_map(domain->domain, iova, phys,
+					size, dma->prot | domain->prot);
+			if (ret)
+				return ret;
 
-out:
-	iova = map->iova;
-	size = map->size;
-	while ((dma = vfio_find_dma(iommu, iova, size))) {
-		int r = vfio_remove_dma_overlap(iommu, iova,
-						&size, dma);
-		if (WARN_ON(r || !size))
-			break;
+			iova += size;
+		}
 	}
 
-	mutex_unlock(&iommu->lock);
-	return ret;
+	return 0;
 }
 
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *tmp;
+	struct vfio_group *group, *g;
+	struct vfio_domain *domain, *d;
+	struct bus_type *bus = NULL;
 	int ret;
 
-	group = kzalloc(sizeof(*group), GFP_KERNEL);
-	if (!group)
-		return -ENOMEM;
-
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(tmp, &iommu->group_list, next) {
-		if (tmp->iommu_group == iommu_group) {
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			if (g->iommu_group != iommu_group)
+				continue;
+
 			mutex_unlock(&iommu->lock);
-			kfree(group);
 			return -EINVAL;
 		}
 	}
 
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!group || !domain) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	group->iommu_group = iommu_group;
+
+	/* Determine bus_type in order to allocate a domain */
+	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
+	if (ret)
+		goto out_free;
+
+	domain->domain = iommu_domain_alloc(bus);
+	if (!domain->domain) {
+		ret = -EIO;
+		goto out_free;
+	}
+
+	ret = iommu_attach_group(domain->domain, iommu_group);
+	if (ret)
+		goto out_domain;
+
+	INIT_LIST_HEAD(&domain->group_list);
+	list_add(&group->next, &domain->group_list);
+
+	if (!allow_unsafe_interrupts &&
+	    !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
+		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+		       __func__);
+		ret = -EPERM;
+		goto out_detach;
+	}
+
+	if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
+		domain->prot |= IOMMU_CACHE;
+
 	/*
-	 * TODO: Domain have capabilities that might change as we add
-	 * groups (see iommu->cache, currently never set).  Check for
-	 * them and potentially disallow groups to be attached when it
-	 * would change capabilities (ugh).
+	 * Try to match an existing compatible domain.  We don't want to
+	 * preclude an IOMMU driver supporting multiple bus_types and being
+	 * able to include different bus_types in the same IOMMU domain, so
+	 * we test whether the domains use the same iommu_ops rather than
+	 * testing if they're on the same bus_type.
 	 */
-	ret = iommu_attach_group(iommu->domain, iommu_group);
-	if (ret) {
-		mutex_unlock(&iommu->lock);
-		kfree(group);
-		return ret;
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		if (d->domain->ops == domain->domain->ops &&
+		    d->prot == domain->prot) {
+			iommu_detach_group(domain->domain, iommu_group);
+			if (!iommu_attach_group(d->domain, iommu_group)) {
+				list_add(&group->next, &d->group_list);
+				iommu_domain_free(domain->domain);
+				kfree(domain);
+				mutex_unlock(&iommu->lock);
+				return 0;
+			}
+
+			ret = iommu_attach_group(domain->domain, iommu_group);
+			if (ret)
+				goto out_domain;
+		}
 	}
 
-	group->iommu_group = iommu_group;
-	list_add(&group->next, &iommu->group_list);
+	/* replay mappings on new domains */
+	ret = vfio_iommu_replay(iommu, domain);
+	if (ret)
+		goto out_detach;
+
+	list_add(&domain->next, &iommu->domain_list);
 
 	mutex_unlock(&iommu->lock);
 
 	return 0;
+
+out_detach:
+	iommu_detach_group(domain->domain, iommu_group);
+out_domain:
+	iommu_domain_free(domain->domain);
+out_free:
+	kfree(domain);
+	kfree(group);
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
+{
+	struct rb_node *node;
+
+	while ((node = rb_first(&iommu->dma_list)))
+		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain;
 	struct vfio_group *group;
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(group, &iommu->group_list, next) {
-		if (group->iommu_group == iommu_group) {
-			iommu_detach_group(iommu->domain, iommu_group);
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		list_for_each_entry(group, &domain->group_list, next) {
+			if (group->iommu_group != iommu_group)
+				continue;
+
+			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
-			break;
+			/*
+			 * Group ownership provides privilege, if the group
+			 * list is empty, the domain goes away.  If it's the
+			 * last domain, then all the mappings go away too.
+			 */
+			if (list_empty(&domain->group_list)) {
+				if (list_is_singular(&iommu->domain_list))
+					vfio_iommu_unmap_unpin_all(iommu);
+				iommu_domain_free(domain->domain);
+				list_del(&domain->next);
+				kfree(domain);
+			}
+			goto done;
 		}
 	}
 
+done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -769,40 +828,17 @@  static void *vfio_iommu_type1_open(unsigned long arg)
 {
 	struct vfio_iommu *iommu;
 
-	if (arg != VFIO_TYPE1_IOMMU)
+	if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
 		return ERR_PTR(-EINVAL);
 
 	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
 	if (!iommu)
 		return ERR_PTR(-ENOMEM);
 
-	INIT_LIST_HEAD(&iommu->group_list);
+	INIT_LIST_HEAD(&iommu->domain_list);
 	iommu->dma_list = RB_ROOT;
 	mutex_init(&iommu->lock);
-
-	/*
-	 * Wish we didn't have to know about bus_type here.
-	 */
-	iommu->domain = iommu_domain_alloc(&pci_bus_type);
-	if (!iommu->domain) {
-		kfree(iommu);
-		return ERR_PTR(-EIO);
-	}
-
-	/*
-	 * Wish we could specify required capabilities rather than create
-	 * a domain, see what comes out and hope it doesn't change along
-	 * the way.  Fortunately we know interrupt remapping is global for
-	 * our iommus.
-	 */
-	if (!allow_unsafe_interrupts &&
-	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
-		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
-		       __func__);
-		iommu_domain_free(iommu->domain);
-		kfree(iommu);
-		return ERR_PTR(-EPERM);
-	}
+	iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);
 
 	return iommu;
 }
@@ -810,25 +846,24 @@  static void *vfio_iommu_type1_open(unsigned long arg)
 static void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain, *domain_tmp;
 	struct vfio_group *group, *group_tmp;
-	struct rb_node *node;
 
-	list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
-		iommu_detach_group(iommu->domain, group->iommu_group);
-		list_del(&group->next);
-		kfree(group);
-	}
+	vfio_iommu_unmap_unpin_all(iommu);
 
-	while ((node = rb_first(&iommu->dma_list))) {
-		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
-		size_t size = dma->size;
-		vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
-		if (WARN_ON(!size))
-			break;
+	list_for_each_entry_safe(domain, domain_tmp,
+				 &iommu->domain_list, next) {
+		list_for_each_entry_safe(group, group_tmp,
+					 &domain->group_list, next) {
+			iommu_detach_group(domain->domain, group->iommu_group);
+			list_del(&group->next);
+			kfree(group);
+		}
+		iommu_domain_free(domain->domain);
+		list_del(&domain->next);
+		kfree(domain);
 	}
 
-	iommu_domain_free(iommu->domain);
-	iommu->domain = NULL;
 	kfree(iommu);
 }
 
@@ -841,6 +876,7 @@  static long vfio_iommu_type1_ioctl(void *iommu_data,
 	if (cmd == VFIO_CHECK_EXTENSION) {
 		switch (arg) {
 		case VFIO_TYPE1_IOMMU:
+		case VFIO_TYPE1v2_IOMMU:
 			return 1;
 		default:
 			return 0;
@@ -858,7 +894,7 @@  static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		info.flags = 0;
 
-		info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
+		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
 
 		return copy_to_user((void __user *)arg, &info, minsz);
 
@@ -911,9 +947,6 @@  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 
 static int __init vfio_iommu_type1_init(void)
 {
-	if (!iommu_present(&pci_bus_type))
-		return -ENODEV;
-
 	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
 }
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0fd47f5..460fdf2 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -23,6 +23,7 @@ 
 
 #define VFIO_TYPE1_IOMMU		1
 #define VFIO_SPAPR_TCE_IOMMU		2
+#define VFIO_TYPE1v2_IOMMU		3
 
 /*
  * The IOCTL interface is designed for extensibility by embedding the