From patchwork Tue Apr 9 16:52:10 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wei Liu X-Patchwork-Id: 10891765 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0C711139A for ; Tue, 9 Apr 2019 16:54:42 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E1D45252D5 for ; Tue, 9 Apr 2019 16:54:41 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D5AA42621E; Tue, 9 Apr 2019 16:54:41 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120]) (using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 392EB252D5 for ; Tue, 9 Apr 2019 16:54:39 +0000 (UTC) Received: from localhost ([127.0.0.1] helo=lists.xenproject.org) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1hDtyv-0001TM-TJ; Tue, 09 Apr 2019 16:52:21 +0000 Received: from all-amaz-eas1.inumbo.com ([34.197.232.57] helo=us1-amaz-eas2.inumbo.com) by lists.xenproject.org with esmtp (Exim 4.89) (envelope-from ) id 1hDtyu-0001SP-TV for xen-devel@lists.xenproject.org; Tue, 09 Apr 2019 16:52:20 +0000 X-Inumbo-ID: d40bee9a-5ae7-11e9-8b81-b314d09c256e Received: from SMTP03.CITRIX.COM (unknown [162.221.156.55]) by us1-amaz-eas2.inumbo.com (Halon) with ESMTPS id d40bee9a-5ae7-11e9-8b81-b314d09c256e; Tue, 09 Apr 2019 16:52:16 +0000 (UTC) X-IronPort-AV: E=Sophos;i="5.60,330,1549929600"; d="scan'208";a="83138381" From: Wei Liu To: Date: Tue, 9 Apr 2019 17:52:10 +0100 Message-ID: <20190409165210.7624-4-wei.liu2@citrix.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190409165210.7624-1-wei.liu2@citrix.com> References: <20190409165210.7624-1-wei.liu2@citrix.com> MIME-Version: 1.0 Subject: [Xen-devel] [PATCH v3 3/3] docs: remove tmem related text X-BeenThere: xen-devel@lists.xenproject.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Xen developer discussion List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Cc: Stefano Stabellini , Wei Liu , Konrad Rzeszutek Wilk , George Dunlap , Andrew Cooper , Ian Jackson , Tim Deegan , Julien Grall , Jan Beulich , Daniel De Graaf Errors-To: xen-devel-bounces@lists.xenproject.org Sender: "Xen-devel" X-Virus-Scanned: ClamAV using ClamSMTP Signed-off-by: Wei Liu Acked-by: Daniel De Graaf Acked-by: Konrad Rzeszutek Wilk --- v3: Rebase and fix up conflicts --- docs/man/xl.1.pod.in | 68 --- docs/man/xl.conf.5.pod | 9 +- docs/misc/tmem-internals.html | 789 ------------------------------ docs/misc/xen-command-line.pandoc | 12 - docs/misc/xsm-flask.txt | 36 -- 5 files changed, 2 insertions(+), 912 deletions(-) delete mode 100644 docs/misc/tmem-internals.html diff --git a/docs/man/xl.1.pod.in b/docs/man/xl.1.pod.in index 4310fcd818..a8dae005b1 100644 --- a/docs/man/xl.1.pod.in +++ b/docs/man/xl.1.pod.in @@ -1677,74 +1677,6 @@ Obtain information of USB devices connected as such via the device model =back -=head1 TRANSCENDENT MEMORY (TMEM) - -=over 4 - -=item B I<[OPTIONS]> I - -List tmem pools. 
- -B - -=over 4 - -=item B<-l> - -If this parameter is specified, also list tmem stats. - -=back - -=item B I - -Freeze tmem pools. - -=item B I - -Thaw tmem pools. - -=item B I [I] - -Change tmem settings. - -B - -=over 4 - -=item B<-w> I - -Weight (int) - -=item B<-p> I - -Compress (int) - -=back - -=item B I [I] - -De/authenticate shared tmem pool. - -B - -=over 4 - -=item B<-u> I - -Specify uuid (abcdef01-2345-6789-1234-567890abcdef) - -=item B<-a> I - -0=auth,1=deauth - -=back - -=item B - -Get information about how much freeable memory (MB) is in-use by tmem. - -=back - =head1 FLASK B is a security framework that defines a mandatory access control policy diff --git a/docs/man/xl.conf.5.pod b/docs/man/xl.conf.5.pod index 37262a7ef8..b1bde7d657 100644 --- a/docs/man/xl.conf.5.pod +++ b/docs/man/xl.conf.5.pod @@ -148,10 +148,8 @@ The default choice is "xvda". =item B If this option is enabled then when a guest is created there will be an -guarantee that there is memory available for the guest. This is an -particularly acute problem on hosts with memory over-provisioned guests -that use tmem and have self-balloon enabled (which is the default -option). The self-balloon mechanism can deflate/inflate the balloon +guarantee that there is memory available for the guest. +The self-balloon mechanism can deflate/inflate the balloon quickly and the amount of free memory (which C can show) is stale the moment it is printed. When claim is enabled a reservation for the amount of memory (see 'memory' in xl.conf(5)) is set, which is then @@ -163,9 +161,6 @@ If the reservation cannot be meet the guest creation fails immediately instead of taking seconds/minutes (depending on the size of the guest) while the guest is populated. -Note that to enable tmem type guests, one needs to provide C on the -Xen hypervisor argument and as well on the Linux kernel command line. - Default: C<1> =over 4 diff --git a/docs/misc/tmem-internals.html b/docs/misc/tmem-internals.html deleted file mode 100644 index 9b7e70e650..0000000000 --- a/docs/misc/tmem-internals.html +++ /dev/null @@ -1,789 +0,0 @@ -

Transcendent Memory Internals in Xen

-

-by Dan Magenheimer, Oracle Corp.

-

-Draft 0.1 -- Updated: 20100324 -

Overview

-

-This document focuses on the internal implementation of Transcendent Memory (tmem) on Xen. It assumes that the reader has a basic knowledge of the terminology, objectives, and functionality of tmem and also has access to the Xen source code. It corresponds to the Xen 4.0 release, with a patch added to support page deduplication (V2).

-The primary responsibilities of the tmem implementation are to: -

    -
  • manage a potentially huge and extremely dynamic -number of memory pages from a potentially large number of clients (domains) -with low memory overhead and proper isolation -
  • provide quick and efficient access to these -pages with as much concurrency as possible -
  • enable efficient reclamation and eviction of pages (e.g. when -memory is fully utilized) -
  • optionally, increase page density through compression and/or -deduplication -
  • where necessary, properly assign and account for -memory belonging to guests to avoid malicious and/or accidental unfairness -and/or denial-of-service -
  • record utilization statistics and make them available to management tools -
-

Source Code Organization

- -

-The source code in Xen that provides the tmem functionality -is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h. -The files tmem.c and tmem.h are intended to -be implementation- (and hypervisor-) independent and the other two files -provide the Xen-specific code. This -division is intended to make it easier to port tmem functionality to other -hypervisors, though at this time porting to other hypervisors has not been -attempted. Together, these four files -total less than 4000 lines of C code. -

-Even ignoring the implementation-specific functionality, the -implementation-independent part of tmem has several dependencies on -library functionality (Xen source filenames in parentheses): -

    -
  • -a good fast general-purpose dynamic memory -allocator with bounded response time and efficient use of memory for a very -large number of sub-page allocations. To -achieve this in Xen, the bad old memory allocator was replaced with a -slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by -Nitin Gupta for compcache. -
  • good tree data structure libraries, specifically red-black trees (rbtree.c) and radix trees (radix-tree.c). Code for these was borrowed from Linux and adapted for tmem and Xen.
  • -good locking and list code. Both of these existed in Xen and required -little or no change. -
  • -optionally, a good fast lossless compression -library. The Xen implementation added to -support tmem uses LZO1X (lzo.c), also ported for Linux by Nitin Gupta. -
-

-More information about the specific functionality of these -libraries can easily be found through a search engine, via wikipedia, or in the -Xen or Linux source logs so we will not elaborate further here. - -

Prefixes/Abbreviations/Glossary

- -

-The tmem code uses several prefixes and abbreviations. -Knowledge of these will improve code readability: -

    -
  • -tmh == -transcendent memory host. Functions or -data structures that are defined by the implementation-specific code, i.e. the -Xen host code -
  • -tmemc -== transcendent memory control. -Functions or data structures that provide management tool functionality, -rather than core tmem operations. -
  • -cli or -client == client. -The tmem generic term for a domain or a guest OS. -
-

-When used in prose, common tmem operations are indicated -with a different font, such as put -and get. - -

Key Data Structures

- -

-To manage a huge number of pages, efficient data structures -must be carefully selected. -

-Recall that a tmem-enabled guest OS may create one or more -pools with different attributes. It then -puts and gets -pages to/from this pool, identifying the page -with a handle that consists of a pool_id, an -object_id, and a page_id (sometimes -called an index). -This suggests a few obvious core data -structures: -

    -
  • -When a guest OS first calls tmem, a client_t is created to contain -and track all uses of tmem by that guest OS. Among -other things, a client_t keeps pointers -to a fixed number of pools (16 in the current Xen implementation). -
  • -When a guest OS requests a new pool, a pool_t is created. -Some pools are shared and are kept in a -sharelist (sharelist_t) which points -to all the clients that are sharing the pool. -Since an object_id is 64-bits, -a pool_t must be able to keep track -of a potentially very large number of objects. -To do so, it maintains a number of parallel trees (256 in the current -Xen implementation) and a hash algorithm is applied to the object_id -to select the correct tree. -Each tree element points to an object. -Because an object_id usually represents an inode -(a unique file number identifier), and inode numbers -are fairly random, though often "clumpy", a red-black tree -is used. -
  • -When a guest first -puts a page to a pool with an as-yet-unused object_id, an -obj_t is created. Since a page_id is usually an index into a file, -it is often a small number, but may sometimes be very large (up to -32-bits). A radix tree is a good data structure to contain items -with this kind of index distribution. -
  • -When a page is -put, a page descriptor, or pgp_t, is created, which -among other things will point to the storage location where the data is kept. -In the normal case the pointer is to a pfp_t, which is an -implementation-specific datatype representing a physical pageframe in memory -(which in Xen is a "struct page_info"). -When deduplication is enabled, it points to -yet another data structure, a pcd_t -(see below). When compression is enabled -(and deduplication is not), the pointer points directly to the compressed data. -For reasons we will see shortly, each pgp_t that represents -an ephemeral page (that is, a page placed -in an ephemeral pool) is also placed -into two doubly-linked linked lists, one containing all ephemeral pages -put by the same client and one -containing all ephemeral pages across all clients ("global"). -
  • -When deduplication is enabled, multiple pgp_t's may need to point to -the same data, so another data structure (and level of indirection) is used -called a page content descriptor, or pcd_t. -Multiple page descriptors (pgp_t's) may point to the same pcd_t. -The pcd_t, in turn, points to either a pfp_t -(if a full page of data), directly to a -location in memory (if the page has been compressed or trailing zeroes have -been eliminated), or even a NULL pointer (if the page contained all zeroes and -trailing zero elimination is enabled). -
-

-The most apparent usage of this multi-layer web of data structures is "top-down" because, in normal operation, the vast majority of tmem operations invoked by a client are puts and gets, which require the various data structures to be walked starting with the client_t, then a pool_t, then an obj_t, then a pgp_t. However, there is another highly frequent tmem operation that is not visible from a client: memory reclamation. Since tmem attempts to use all spare memory in the system, it must frequently free up, or evict, pages. The eviction algorithm will be explained in more detail later but, in brief, to free memory, ephemeral pages are removed from the tail of one of the doubly-linked lists, which means that all of the data structures associated with that page-to-be-removed must be updated or eliminated and freed. As a result, each data structure also contains a back-pointer to its parent; for example, every obj_t contains a pointer to its containing pool_t.
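
The relationships just described can be summarized in a rough C sketch. Everything below is illustrative only: the field names, the fixed sizes, and the container types are simplified stand-ins, not the actual declarations in tmem.c/tmem.h.

    #include <stdint.h>

    /* Illustrative only; the real definitions live in tmem.c/tmem.h. */
    #define POOLS_PER_CLIENT  16    /* fixed number of pools per client_t  */
    #define OBJ_HASH_BUCKETS  256   /* parallel red-black trees per pool_t */

    struct tmem_handle {            /* names a single page */
        uint32_t pool_id;
        uint64_t object_id;         /* usually an inode number              */
        uint32_t page_id;           /* usually a page index within a file   */
    };

    struct pcd;                     /* page content descriptor (dedup only) */
    struct pfp;                     /* pageframe, i.e. Xen's struct page_info */

    struct pgp {                    /* one per put page */
        struct obj *obj;            /* back-pointer, needed for eviction    */
        union {
            struct pfp *pfp;        /* normal case: a whole pageframe       */
            struct pcd *pcd;        /* deduplication enabled                */
            void *cdata;            /* compressed data                      */
        } u;
        struct pgp *client_eph_next, *global_eph_next;  /* ephemeral LRU lists */
    };

    struct obj {                    /* radix tree of pgp_t, keyed by page_id */
        struct pool *pool;          /* back-pointer                          */
        uint64_t object_id;
        void *pgp_radix_root;
    };

    struct pool {                   /* 256 rb-trees of obj_t, hashed by object_id */
        struct client *client;      /* back-pointer                          */
        void *obj_rb_root[OBJ_HASH_BUCKETS];
        unsigned shared:1, persistent:1;
    };

    struct client {                 /* one per domain (guest OS) using tmem  */
        int domid;
        struct pool *pools[POOLS_PER_CLIENT];
    };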

-This complex web of interconnected data structures is updated constantly and is thus extremely sensitive to careless code changes which, for example, may result in unexpected hypervisor crashes or non-obvious memory leaks. On the other hand, the code is fairly well modularized so, once understood, it is possible to relatively easily switch out one kind of data structure for another. To catch problems as quickly as possible when debug is enabled, most of the data structures are equipped with sentinels, and many inter-function assumptions are documented and tested dynamically with assertions. While these clutter and lengthen the tmem code substantially, their presence has proven invaluable on many occasions.

-For completeness, we should also describe a key data structure in the Xen -implementation-dependent code: the tmh_page_list. For security and -performance reasons, pages that are freed due to tmem operations (such -as get) are not immediately put back into Xen's pool -of free memory (aka the Xen heap). -Tmem pages may contain guest-private data that must be scrubbed before -those memory pages are released for the use of other guests. -But if a page is immediately re-used inside of tmem itself, the entire -page is overwritten with new data, so need not be scrubbed. -Since tmem is usually the most frequent -customer of the Xen heap allocation code, it would be a waste of time to scrub -a page, release it to the Xen heap, and then immediately re-allocate it -again. So, instead, tmem maintains -currently-unused pages of memory on its own free list, tmh_page_list, -and returns the pages to Xen only when non-tmem Xen -heap allocation requests would otherwise fail. - -
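
A minimal sketch of that recycling policy, with hypothetical helper names standing in for the real Xen and tmem routines:

    struct page_info;                                  /* Xen pageframe type */
    extern struct page_info *tmh_page_list_pop(void);  /* hypothetical helpers */
    extern void tmh_page_list_push(struct page_info *pg);
    extern struct page_info *xen_heap_alloc_page(void);

    struct page_info *tmem_alloc_page_sketch(void)
    {
        /* Prefer a page tmem already owns: no scrub is needed because the
         * entire page is about to be overwritten with new data. */
        struct page_info *pg = tmh_page_list_pop();
        return pg ? pg : xen_heap_alloc_page();
    }

    void tmem_free_page_sketch(struct page_info *pg)
    {
        /* Keep the page on tmh_page_list rather than returning it to the
         * Xen heap: it may still hold guest-private data that would first
         * need scrubbing.  Pages are handed back to Xen only when a
         * non-tmem heap allocation would otherwise fail. */
        tmh_page_list_push(pg);
    }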

Scalability/Concurrency

- -

Tmem has been designed to be highly scalable. -Since tmem access is invoked similarly in -many ways to asynchronous disk access, a "big SMP" tmem-aware guest -OS can, and often will, invoke tmem hypercalls simultaneously on many different -physical CPUs. And, of course, multiple -tmem-aware guests may independently and simultaneously invoke tmem -hypercalls. While the normal frequency -of tmem invocations is rarely extremely high, some tmem operations such as data -compression or lookups in a very large tree may take tens of thousands of -cycles or more to complete. Measurements -have shown that normal workloads spend no more than about 0.2% (2% with -compression enabled) of CPU time executing tmem operations. -But those familiar with OS scalability issues -recognize that even this limited execution time can create concurrency problems -in large systems and result in poorly-scalable performance. -

-A good locking strategy is critical to concurrency, but it must also be designed carefully to avoid deadlock and livelock problems. For debugging purposes, tmem supports a "big kernel lock" which disables concurrency altogether (enabled in Xen with "tmem_lock", but note that this functionality is rarely tested and likely has bit-rotted). Infrequent but invasive tmem hypercalls, such as pool creation or the control operations, are serialized on a single read-write lock, called tmem_rwlock, which must be held for writing. All other tmem operations must hold this lock for reading, so frequent operations such as put, get, and flush can execute simultaneously as long as no invasive operations are occurring.

-Once a pool has been selected, there is a per-pool -read-write lock (pool_rwlock) which -must be held for writing if any transformative operations might occur within -that pool, such as when an obj_t is -created or destroyed. For the highly -frequent operation of finding an obj_t -within a pool, pool_rwlock must be held for reading. -

-Once an object has been selected, there is a per-object spinlock (obj_spinlock). This is a spinlock rather than a read-write lock because nearly all of the most frequent tmem operations (e.g. put, get, and flush) are transformative, in that they add or remove a page within the object. This lock is generally taken whenever an object lookup occurs and released when the tmem operation is complete.

-Next, the per-client and global ephemeral lists are -protected by a single global spinlock (eph_lists_spinlock) -and the per-client persistent lists are also protected by a single global -spinlock (pers_list_spinlock). -And to complete the description of -implementation-independent locks, if page deduplication is enabled, all pages -for which the first byte match are contained in one of 256 trees that are -protected by one of 256 corresponding read-write locks -(pcd_tree_rwlocks). -

-In the Xen-specific code (tmem_xen.c), page frames (e.g. struct page_info) -that have been released are kept in a list (tmh_page_list) that -is protected by a spinlock (tmh_page_list_lock). -There is also an "implied" lock -associated with compression, which is likely the most time-consuming operation -in all of tmem (of course, only when compression is enabled): A compression -buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to -this buffer is returned to implementation-independent code and used without a -lock. -

-The proper method to avoid deadlocks is to take and release locks in a very specific predetermined order. Unfortunately, since tmem data structures must simultaneously be accessed "top-down" (put and get) and "bottom-up" (memory reclamation), more complex methods must be employed: a trylock mechanism is used (cf. tmem_try_to_evict_pgp()), which takes the lock if it is available but returns immediately (rather than spinning and waiting) if the lock is not available. When walking the ephemeral list to identify pages to free, any page that belongs to an object that is locked is simply skipped. Further, if the page is the last page belonging to an object, and the pool read-write lock for the pool the object belongs to is not available (for writing), that object is skipped. These constraints modify the LRU algorithm somewhat, but avoid the potential for deadlock.
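
A simplified sketch of that trylock discipline, loosely modeled on tmem_try_to_evict_pgp(); every helper name here is invented for illustration:

    struct obj;
    struct pgp;
    extern int  obj_trylock(struct obj *o);            /* hypothetical helpers */
    extern void obj_unlock(struct obj *o);
    extern int  pool_trywrite_lock(struct obj *o);
    extern void pool_write_unlock(struct obj *o);
    extern struct obj *pgp_to_obj(struct pgp *p);
    extern int  obj_last_page(struct obj *o);
    extern struct pgp *lru_prev(struct pgp *p);         /* next-younger LRU entry */
    extern void evict(struct pgp *p);

    /* Walk the ephemeral LRU from the tail; never spin on a lock. */
    int evict_one_page_sketch(struct pgp *lru_tail)
    {
        struct pgp *p;
        for (p = lru_tail; p != NULL; p = lru_prev(p)) {
            struct obj *o = pgp_to_obj(p);
            int pool_locked = 0;

            if (!obj_trylock(o))
                continue;                      /* object busy: skip it */
            if (obj_last_page(o)) {            /* eviction would empty the object */
                if (!pool_trywrite_lock(o)) {
                    obj_unlock(o);             /* pool busy: skip this one too */
                    continue;
                }
                pool_locked = 1;
            }
            evict(p);                          /* all required locks are held */
            if (pool_locked)
                pool_write_unlock(o);
            obj_unlock(o);
            return 1;
        }
        return 0;                              /* nothing evictable right now */
    }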

-Unfortunately, a livelock was still discovered in this approach: -When memory is scarce and each client is -putting a large number of pages -for exactly one object (and thus holding the object spinlock for that object), -memory reclamation takes a very long time to determine that it is unable to -free any pages, and so the time to do a -put (which eventually fails) becomes linear to the -number of pages in the object! To avoid -this situation, a workaround was added to always ensure a minimum amount of -memory (1MB) is available before any object lock is taken for the client -invoking tmem (see tmem_ensure_avail_pages()). -Other such livelocks (and perhaps deadlocks) -may be lurking. -

-A last issue related to concurrency is atomicity of counters. -Tmem gathers a large number of -statistics. Some of these counters are -informational only, while some are critical to tmem operation and must be -incremented and decremented atomically to ensure, for example, that the number -of pages in a tree never goes negative if two concurrent tmem operations access -the counter exactly simultaneously. Some -of the atomic counters are used for debugging (in assertions) and perhaps need -not be atomic; fixing these may increase performance slightly by reducing -cache-coherency traffic. Similarly, some -of the non-atomic counters may yield strange results to management tools, such -as showing the total number of successful -puts as being higher than the number of -puts attempted. -These are left as exercises for future tmem implementors. - -

Control and Manageability

- -

-Tmem has a control interface to, for example, set various -parameters and obtain statistics. All -tmem control operations funnel through do_tmem_control() -and other functions supporting tmem control operations are prefixed -with tmemc_. - -

-During normal operation, even if only one tmem-aware guest is running, tmem may absorb nearly all free memory in the system for its own use. Then if a management tool wishes to create a new guest (or migrate a guest from another system to this one), it may notice that there is insufficient "free" memory and fail the creation (or migration). For this reason, tmem introduces a new tool-visible class of memory -- freeable memory -- and provides a control interface to access it. All ephemeral memory and all pages on the tmh_page_list are freeable. To properly access freeable memory, a management tool must follow a sequence of steps (a sketch combining them appears after this list):

    -
  • freeze tmem: When tmem is frozen, all puts fail, which ensures that no additional memory may be absorbed by tmem. (See tmemc_freeze_pools(), and note that individual clients may be frozen, though this functionality may be used only rarely.)
  • -query freeable MB: If all freeable memory were released to the Xen -heap, this is the amount of memory (in MB) that would be freed. -See tmh_freeable_pages(). -
  • -flush: -Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g. -back to the Xen heap. This amount is -specified in KB. See tmemc_flush_mem() and tmem_relinquish_npages(). -
  • -At this point the management tool may allocate -the memory, e.g. using Xen's published interfaces. -
  • -thaw -tmem: This terminates the freeze, allowing tmem to accept -puts again. -
-
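
Putting the steps above together, a hedged sketch of the tool-side sequence; the wrapper functions below merely stand for the control operations named in the list and are not actual toolstack entry points:

    extern int  tmem_control_freeze_all(void);     /* wraps tmemc_freeze_pools()      */
    extern long tmem_control_freeable_mb(void);    /* wraps tmh_freeable_pages()      */
    extern int  tmem_control_flush_kb(long kb);    /* wraps tmemc_flush_mem()         */
    extern int  tmem_control_thaw_all(void);
    extern int  create_guest(long need_mb);        /* placeholder for the real work   */

    int create_guest_with_tmem_sketch(long need_mb)
    {
        int rc = -1;
        tmem_control_freeze_all();                 /* 1. stop tmem absorbing memory   */
        if (tmem_control_freeable_mb() >= need_mb) {   /* 2. enough freeable memory?  */
            tmem_control_flush_kb(need_mb * 1024); /* 3. ask tmem to relinquish it    */
            rc = create_guest(need_mb);            /* 4. allocate via normal Xen APIs */
        }
        tmem_control_thaw_all();                   /* 5. let tmem accept puts again   */
        return rc;
    }

In practice, the xl/xm tmem subcommands removed elsewhere in this patch (tmem-freeze, tmem-thaw, tmem-list, tmem-freeable) exposed exactly these operations to administrators.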

-Extensive tmem statistics are available through tmem's -control interface (see tmemc_list and -the separate source for the "xm tmem-list" command and the -xen-tmem-list-parse tool). To maximize -forward/backward compatibility with future tmem and tools versions, statistical -information is passed via an ASCII interface where each individual counter is -identified by an easily parseable two-letter ASCII sequence. - -

Save/Restore/Migrate

- -

-Another piece of functionality that has a major impact on -the tmem code is support for save/restore of a tmem client and, highly related, -live migration of a tmem client. -Ephemeral pages, by definition, do not need to be saved or -live-migrated, but persistent pages are part of the state of a running VM and -so must be properly preserved. -

-When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step -is for the tmem client to be frozen (see the manageability section). -Next, tmem API version information is -recorded (to avoid possible incompatibility issues as the tmem spec evolves in -the future). Then, certain high-level -tmem structural information specific to the client is recorded, including -information about the existing pools. -Finally, the contents of all persistent pages are recorded. -

-For live-migration, the process is somewhat more complicated. -Ignoring tmem for a moment, recall that in -live migration, the vast majority of the VM's memory is transferred while the -VM is still fully operational. During -each phase, memory pages belonging to the VM that are changed are marked and -then retransmitted during a later phase. -Eventually only a small amount of memory remains, the VM is paused, the -remaining memory is transmitted, and the VM is unpaused on the target machine. -

-The number of persistent tmem pages may be quite large, -possibly even larger than all the other memory used by the VM; so it is -unacceptable to transmit persistent tmem pages during the "paused" -phase of live migration. But if the VM -is still operational, it may be making calls to tmem: -A frozen tmem client will reject any -put operations, but tmem must -still correctly process flushes -(page and object), including implicit flushes due to duplicate -puts. -Fortunately, these operations can only -invalidate tmem pages, not overwrite tmem pages or create new pages. -So, when a live-migrate has been initiated, -the client is frozen. Then during the -"live" phase, tmem transmits all persistent pages, but also records -the handle of all persistent pages that are invalidated. -Then, during the "paused" phase, -only the handles of invalidated persistent pages are transmitted, resulting in -the invalidation on the target machine of any matching pages that were -previously transmitted during the "live" phase. -

-For restore (and on the target machine of a live migration), -tmem must be capable of reconstructing the internal state of the client from -the saved/migrated data. However, it is -not the client itself that is put'ing -the pages but the management tools conducting the restore/migration. -This slightly complicates tmem by requiring -new API calls and new functions in the implementation, but the code is -structured so that duplication is minimized. -Once all tmem data structures for the client are reconstructed, all -persistent pages are recreated and, in the case of live-migration, all -invalidations have been processed and the client has been thawed, the restored -client can be resumed. -

-Finally, tmem's data structures must be cluttered a bit to -support save/restore/migration. Notably, -a per-pool list of persistent pages must be maintained and, during live -migration, a per-client list of invalidated pages must be logged. -A reader of the code will note that these -lists are overlaid into space-sensitive data structures as a union, which may -be more error-prone but eliminates significant space waste. - -

Miscellaneous Tmem Topics

- -

-Duplicate puts. One interesting corner case that significantly complicates the tmem source code is the possibility of a duplicate put, which occurs when two puts are requested with the same handle but with possibly different data. The tmem API addresses put-put-get coherence explicitly: when a duplicate put occurs, tmem may react in one of two ways: (1) the put may succeed, with the old data overwritten by the new data, or (2) the put may be failed, with the original data flushed and neither the old nor the new data accessible. Tmem may not fail the put and leave the old data accessible.

-When tmem has been actively working for an extended period, system memory may be in short supply and it is possible for a memory allocation for a page (or even a data structure such as a pgp_t) to fail. Thus, for a duplicate put, it may be impossible for tmem to simultaneously maintain, even temporarily, data structures and data for both the original put and the duplicate put. When the space required for the data is identical, tmem may be able to overwrite the old data in place with the new data (option 1). But in some circumstances, such as when data is being compressed, overwriting is not always possible and option 2 must be performed.
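
In code terms, the duplicate-put rule might look roughly like this; it is a sketch of the policy, not the actual put path in tmem.c, and the helpers are hypothetical:

    struct pgp;
    extern int  can_overwrite_in_place(struct pgp *old, int newlen);  /* same space needed?   */
    extern void overwrite_data(struct pgp *old, const void *newdata, int newlen);
    extern void flush_pgp(struct pgp *old);                           /* hypothetical helpers */

    /* Returns 1 if the duplicate put succeeded (old data replaced), 0 if it
     * was failed; in the failure case the old data has already been flushed,
     * so neither the old nor the new data remains accessible. */
    int duplicate_put_sketch(struct pgp *old, const void *newdata, int newlen)
    {
        if (can_overwrite_in_place(old, newlen)) {
            overwrite_data(old, newdata, newlen);   /* option (1) */
            return 1;
        }
        /* e.g. compressed data changed size: cannot hold both copies at once */
        flush_pgp(old);                             /* option (2) */
        return 0;
    }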

-Page deduplication and trailing-zero elimination. -When page deduplication is enabled -("tmem_dedup" option to Xen), ephemeral pages for which the contents -are identical -- whether the pages belong -to the same client or different clients -- utilize the same pageframe of -memory. In Xen environments where -multiple domains have a highly similar workload, this can save a substantial -amount of memory, allowing a much larger number of ephemeral pages to be -used. Tmem page deduplication uses -methods similar to the KSM implementation in Linux [ref], but differences between -the two are sufficiently great that tmem does not directly leverage the -code. In particular, ephemeral pages in -tmem are never dirtied, so need never be copied-on-write. -Like KSM, however, tmem avoids hashing, -instead employing red-black trees -that use the entire page contents as the lookup -key. There may be better ways to implement this. -
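
The content lookup can be pictured as a red-black tree whose comparison function is an ordinary memcmp() over the whole page, with the first byte of the page selecting one of the 256 trees mentioned earlier; the sketch below (a plain binary search, rebalancing omitted) is illustrative only:

    #include <string.h>

    #define PAGE_SIZE 4096

    struct pcd_node {                       /* one node per unique page content    */
        struct pcd_node *left, *right;      /* tree links (colour handling omitted) */
        unsigned refcount;                  /* number of pgp_t sharing this data   */
        unsigned char data[PAGE_SIZE];
    };

    /* Find an existing node with identical contents, or NULL.  The real code
     * first indexes one of the 256 trees by the first byte of the page. */
    struct pcd_node *pcd_lookup_sketch(struct pcd_node *root, const unsigned char *page)
    {
        while (root != NULL) {
            int cmp = memcmp(page, root->data, PAGE_SIZE);
            if (cmp == 0)
                return root;                /* match: share this pcd, bump refcount */
            root = (cmp < 0) ? root->left : root->right;
        }
        return NULL;                        /* no match: caller inserts a new pcd   */
    }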

-Dedup'ed pages may optionally be compressed -("tmem_compress" and "tmem_dedup" Xen options specified), -to save even more space, at the cost of more time. -Additionally, trailing zero elimination (tze) may be applied to dedup'ed -pages. With tze, pages that contain a -significant number of zeroes at the end of the page are saved without the trailing -zeroes; an all-zero page requires no data to be saved at all. -In certain workloads that utilize a large number -of small files (and for which the last partial page of a file is padded with -zeroes), a significant space savings can be realized without the high cost of -compression/decompression. -
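
The tze length computation itself is simple to picture; a small illustrative sketch:

    #include <stddef.h>

    #define PAGE_SIZE 4096

    /* Number of bytes that must actually be stored once trailing zeroes are
     * dropped; 0 means the page was entirely zero and needs no data at all. */
    size_t tze_length_sketch(const unsigned char page[PAGE_SIZE])
    {
        size_t len = PAGE_SIZE;
        while (len > 0 && page[len - 1] == 0)
            len--;
        return len;   /* on get, the missing tail is reconstructed as zeroes */
    }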

-Both compression and tze significantly complicate memory -allocation. This will be discussed more below. -

-Memory accounting. Accounting is boring, but poor accounting may result in some interesting problems. In the implementation-independent code of tmem, most data structures, page frames, and partial pages (e.g. for compression) are billed to a pool, and thus to a client. Some infrastructure data structures, such as pools and clients, are allocated with tmh_alloc_infra(), which does not require a pool to be specified. Two other exceptions are page content descriptors (pcd_t) and sharelists (sharelist_t), which are explicitly not associated with a pool/client by specifying NULL instead of a pool_t. (Note to self: these should probably just use the tmh_alloc_infra() interface too.) As we shall see, persistent pool pages and data structures may need to be handled a bit differently, so the implementation-independent layer calls a different allocation/free routine for persistent pages (e.g. tmh_alloc_page_thispool()) than for ephemeral pages (e.g. tmh_alloc_page()).

-In the Xen-specific layer, we disregard the pool_t for ephemeral pages, as we use the generic Xen heap for all ephemeral pages and data structures. (Denial-of-service attacks can be handled in the implementation-independent layer because ephemeral pages are kept in per-client queues, each with a counted length. See the discussion on weights and caps below.) However, we explicitly bill persistent pages and data structures against the client/domain that is using them. (See the calls to the Xen routine alloc_domheap_pages() in tmem_xen.h; if the first argument is a domain, the pages allocated are billed by Xen to that domain.) This means that a Xen domain cannot allocate even a single tmem persistent page when it is currently utilizing its maximum assigned memory allocation! This is reasonable for persistent pages because, even though the data is not directly accessible by the domain, the data is permanently saved until either the domain flushes it or the domain dies.
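
A sketch of how that billing split might look; the wrapper below is hypothetical, and the alloc_domheap_pages() prototype is reproduced here only to show the domain argument that determines billing (its exact signature should be checked against the Xen tree in use):

    struct domain;
    struct page_info;
    /* Assumed prototype: pages are billed to d when d is non-NULL. */
    extern struct page_info *alloc_domheap_pages(struct domain *d,
                                                 unsigned int order,
                                                 unsigned int memflags);

    struct page_info *tmem_alloc_page_for_pool_sketch(struct domain *d, int persistent)
    {
        /* Persistent pages are billed to the owning domain, so the put fails
         * if the domain is already at its maximum allocation; ephemeral
         * pages come anonymously from the general Xen heap. */
        return alloc_domheap_pages(persistent ? d : NULL, 0, 0);
    }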

-Note that proper accounting requires (even for ephemeral pools) that the same -pool is referenced when memory is freed as when it was allocated, even if the -ownership of a pool has been moved from one client to another (c.f. shared_pool_reassign()). -The underlying Xen-specific information may -not always enforce this for ephemeral pools, but incorrect alloc/free matching -can cause some difficult-to-find memory leaks and bent pointers. -

-Page deduplication is not possible for persistent pools for accounting reasons: imagine a page that is created by persistent pool A, which belongs to a domain that is currently well under its maximum allocation. Then the pcd_t is matched by persistent pool B, which is currently at its maximum. Then the domain owning pool A is destroyed. Is B beyond its maximum? (There may be a clever way around this problem. Exercise for the reader!)

-Memory allocation. The implementation-independent layer assumes -there is a good fast general-purpose dynamic memory allocator with bounded -response time and efficient use of memory for a very large number of sub-page -allocations. The old xmalloc memory -allocator in Xen was not a good match for this purpose, so was replaced by the -TLSF allocator. Note that the TLSF -allocator is used only for allocations smaller than a page (and, more -precisely, no larger than tmem_subpage_maxsize()); -full pages are allocated by Xen's normal heap allocator. -
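
In other words, the allocation path is split by size; a sketch with hypothetical helper names:

    #include <stddef.h>

    extern void *tlsf_xmalloc(size_t bytes);          /* TLSF sub-page allocator   */
    extern void *xen_heap_alloc_page(void);           /* whole-page heap allocator */
    extern size_t tmem_subpage_maxsize(void);         /* threshold used by tmem    */

    void *tmem_malloc_sketch(size_t bytes)
    {
        if (bytes <= tmem_subpage_maxsize())
            return tlsf_xmalloc(bytes);     /* sub-page: compressed/tze data, structs */
        return xen_heap_alloc_page();       /* otherwise a full pageframe             */
    }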

-After the TLSF allocator was integrated into Xen, more work was required so that each client could allocate memory from a separate independent pool. (See the call to xmem_pool_create() in tmh_client_init().) This allows the data structures allocated for the purpose of supporting persistent pages to be billed to the same client as the pages themselves. It also allows partial (e.g. compressed) pages to be properly billed. Further, when partial page allocations cause internal fragmentation, this fragmentation can be isolated per-client. And, when a domain dies, full pages can be freed, rather than only partial pages. One other change was required in the TLSF allocator: in the original version, when a TLSF memory pool was allocated, the first page of memory was also allocated. Since, for a persistent pool, this page would be billed to the client, the allocation of the first page failed if the domain was started at its maximum memory, and this resulted in a failure to create the memory pool. To avoid this, the code was changed to delay the allocation of the first page until first use of the memory pool.

-Memory allocation interdependency. -As previously described, -pages of memory must be moveable back and forth between the Xen heap and the -tmem ephemeral lists (and page lists). -When tmem needs a page but doesn't have one, it requests one from the -Xen heap (either indirectly via xmalloc, or directly via Xen's alloc_domheap_pages()). -And when Xen needs a page but doesn't have -one, it requests one from tmem (via a call to tmem_relinquish_pages() in Xen's alloc_heap_pages() in page_alloc.c). -This leads to a potential infinite loop! -To break this loop, a new memory flag (MEMF_tmem) was added to Xen -to flag and disallow the loop. -See tmh_called_from_tmem() -in tmem_relinquish_pages(). -Note that the tmem_relinquish_pages() interface allows for memory requests of -order > 0 (multiple contiguous pages), but the tmem implementation disallows -any requests larger than a single page. -
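
The loop-breaking can be sketched as follows (simplified, with invented helper names and an arbitrary flag value; see tmh_called_from_tmem() and alloc_heap_pages() for the real logic):

    struct page_info;
    #define MEMF_tmem (1u << 7)   /* illustrative flag value, not the real one */

    extern struct page_info *heap_pop_free(unsigned int order);   /* hypothetical          */
    extern struct page_info *tmem_relinquish_page_sketch(void);   /* ask tmem for one page */

    struct page_info *alloc_heap_pages_sketch(unsigned int order, unsigned int memflags)
    {
        struct page_info *pg = heap_pop_free(order);
        if (pg != NULL)
            return pg;
        /* Heap exhausted: ask tmem to give a page back -- but only if this
         * request did not originate inside tmem itself (MEMF_tmem), which
         * would recurse, and only for single pages, since tmem refuses
         * order > 0 requests. */
        if (!(memflags & MEMF_tmem) && order == 0)
            return tmem_relinquish_page_sketch();
        return NULL;
    }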

-LRU page reclamation. Ephemeral pages generally age in a queue, and the space associated with the oldest -- or least-recently-used -- page is reclaimed when tmem needs more memory. But there are a few exceptions to strict LRU queuing. First is when removal from a queue is constrained by locks, as described above. Second, in a shared ephemeral pool, unlike a private ephemeral pool, a get does not imply a flush; instead, in a shared pool, a get results in the page being promoted to the front of the queue. Third, when a page that is deduplicated (i.e. is referenced by more than one pgp_t) reaches the end of the LRU queue, it is marked as "eviction attempted" and promoted to the front of the queue; if it reaches the end of the queue a second time, eviction occurs. Note that only the pgp_t is evicted; the actual data is reclaimed only if there is no other pgp_t pointing to the data.
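
The second-chance rule for deduplicated pages might be sketched like this; field and helper names are illustrative:

    struct pgp_lru {
        unsigned refs_to_pcd;        /* >1 means the data is deduplicated      */
        unsigned eviction_attempted; /* set on the first trip to the tail      */
    };

    extern void promote_to_front(struct pgp_lru *p);
    extern void evict_now(struct pgp_lru *p);   /* frees data only if last ref */

    void lru_tail_reached_sketch(struct pgp_lru *p)
    {
        if (p->refs_to_pcd > 1 && !p->eviction_attempted) {
            p->eviction_attempted = 1;   /* give shared data a second chance   */
            promote_to_front(p);
            return;
        }
        evict_now(p);   /* the pgp goes away; the pcd data survives while
                         * any other pgp still references it */
    }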

-All of these modified-LRU algorithms deserve to be studied carefully against a broad range of workloads.

-Internal fragmentation. When compression or tze is enabled, allocations between a half-page and a full-page in size are very common, and this places a great deal of pressure on even the best memory allocator. Additionally, problems may be caused for memory reclamation: when one tmem ephemeral page is evicted, only a fragment of a physical page of memory might be reclaimed. As a result, when compression or tze is enabled, it may take a very large number of eviction attempts to free up a full contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction must be assumed to be able to fail. While all memory allocation paths in tmem are resilient to failure, very complex corner cases may eventually occur. As a result, compression and tze are disabled by default and should be used with caution until they have been tested with a much broader set of workloads. (Note to self: the code needs work.)

-Weights and caps. Because of the just-discussed LRU-based eviction algorithms, a client that uses tmem at a very high frequency can quickly swamp tmem so that it provides little benefit to a client that uses it less frequently. To reduce the possibility of this denial-of-service, limits can be specified via management tools and are enforced internally by tmem. On Xen, the "xm tmem-set" command can specify "weight=<weight>" or "cap=<cap>" for any client. If weight is non-zero for a client and the current percentage of ephemeral pages in use by the client exceeds its share (as measured by the sum of weights of all clients), the next page chosen for eviction is selected from the requesting client's ephemeral queue, instead of the global ephemeral queue that contains pages from all clients. (See client_over_quota().) Setting a cap for a client is currently a no-op.
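
The weight check can be pictured roughly as follows (cf. client_over_quota()); this sketch simplifies the bookkeeping and avoids floating point by cross-multiplying:

    struct client_quota {
        unsigned long weight;      /* 0 means "no weight-based limit"          */
        unsigned long eph_count;   /* ephemeral pages currently used by client */
    };

    /* Returns 1 if the client holds more than its weighted share of the
     * ephemeral pages, in which case eviction targets its own LRU queue. */
    int client_over_quota_sketch(const struct client_quota *c,
                                 unsigned long total_eph_pages,
                                 unsigned long total_weight)
    {
        if (c->weight == 0 || total_weight == 0 || total_eph_pages == 0)
            return 0;
        /* over quota when eph_count/total_eph_pages > weight/total_weight */
        return c->eph_count * total_weight > total_eph_pages * c->weight;
    }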

-Shared pools and authentication. -When tmem was first proposed to the linux kernel mailing list -(LKML), there was concern expressed about security of shared ephemeral -pools. The initial tmem implementation only -required a client to provide a 128-bit UUID to identify a shared pool, and the -linux-side tmem implementation obtained this UUID from the superblock of the -shared filesystem (in ocfs2). It was -pointed out on LKML that the UUID was essentially a security key and any -malicious domain that guessed it would have access to any data from the shared -filesystem that found its way into tmem. -Ocfs2 has only very limited security; it is assumed that anyone who can -access the filesystem bits on the shared disk can mount the filesystem and use -it. But in a virtualized data center, -higher isolation requirements may apply. -As a result, management tools must explicitly authenticate (or may -explicitly deny) shared pool access to any client. -On Xen, this is done with the "xl -tmem-shared-auth" command. -

-32-bit implementation. -There was some effort put into getting tmem working on a 32-bit Xen. -However, the Xen heap is limited in size on -32-bit Xen so tmem did not work very well. -There are still 32-bit ifdefs in some places in the code, but things may -have bit-rotted so using tmem on a 32-bit Xen is not recommended. - -

Known Issues

- -

Fragmentation. When tmem is active, all physical memory becomes fragmented into individual pages. However, the Xen memory allocator allows memory to be requested in multi-page contiguous quantities, called order>0 allocations (2^order pages, so order==4 is sixteen contiguous pages). In some cases, a request for a larger order will fail gracefully if no matching contiguous allocation is available from Xen. As of Xen 4.0, however, there are several critical order>0 allocation requests that do not fail gracefully. Notably, when a domain is created, an order==4 structure is required or the domain creation will fail. And shadow paging requires many order==2 allocations; if these fail, a PV live-migration may fail. There are likely other such issues.

-But, fragmentation can occur even without tmem if any domU does -any extensive ballooning; tmem just accelerates the fragmentation. -So the fragmentation problem must be solved -anyway. The best solution is to disallow -order>0 allocations altogether in Xen -- or at least ensure that any attempt -to allocate order>0 can fail gracefully, e.g. by falling back to a sequence -of single page allocations. However this restriction may require a major rewrite -in some of Xen's most sensitive code. -(Note that order>0 allocations during Xen boot and early in domain0 -launch are safe and, if dom0 does not enable tmem, any order>0 allocation by -dom0 is safe, until the first domU is created.) -
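
The suggested fallback is straightforward to sketch with hypothetical helpers; the point is only that an order>0 request degrades to a sequence of single-page allocations instead of failing outright, which of course requires callers that can tolerate non-contiguous pages:

    struct page_info;
    extern struct page_info *alloc_one_page(void);
    extern void free_one_page(struct page_info *pg);

    /* Try to gather (1 << order) pages one at a time; callers must be able
     * to cope with the pages not being physically contiguous. */
    int alloc_pages_with_fallback_sketch(struct page_info **out, unsigned int order)
    {
        unsigned long need = 1ul << order, got;
        for (got = 0; got < need; got++) {
            out[got] = alloc_one_page();
            if (out[got] == NULL) {
                while (got--)
                    free_one_page(out[got]);   /* undo on failure */
                return -1;
            }
        }
        return 0;
    }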

-Until Xen can be rewritten to be fragmentation-safe, a small hack was added in the Xen page allocator. (See the comment "memory is scarce" in alloc_heap_pages().) Briefly, a portion of memory is pre-reserved for allocations where order>0 and order<9. (Domain creation uses 2MB pages, but fails gracefully, and there are no other known order==9 or order>9 allocations currently in Xen.)

-NUMA. Tmem assumes that all memory pages are equal and any RAM page can store a page of data for any client. This has potential performance consequences in any NUMA machine where access to far memory is significantly slower than access to near memory. On nearly all of today's servers, however, access times to far memory are still much faster than access to disk or network-based storage, and tmem's primary performance advantage comes from the fact that paging and swapping are reduced. So, the current tmem implementation ignores NUMA-ness; future tmem design for NUMA machines is an exercise left for the reader.

Bibliography

- -

-(needs work)