[2/4] design: incorporate the delayed logging design doc
diff mbox

Message ID 152623083653.10242.6271621221466813278.stgit@magnolia
State New

Commit Message

Darrick J. Wong May 13, 2018, 5 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

Move the delayed logging design documentation into the design book.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 .../delayed_logging.asciidoc                       |  808 ++++++++++++++++++++
 design/XFS_Filesystem_Structure/docinfo.xml        |   14 
 .../xfs_filesystem_structure.asciidoc              |    2 
 design/xfs-delayed-logging-design.asciidoc         |  810 --------------------
 4 files changed, 824 insertions(+), 810 deletions(-)
 create mode 100644 design/XFS_Filesystem_Structure/delayed_logging.asciidoc
 delete mode 100644 design/xfs-delayed-logging-design.asciidoc




Comments

Allison Collins May 14, 2018, 2:29 a.m. UTC | #1
Looks good.  Thx!

Reviewed by: Allison Henderson <allison.henderson@oracle.com>

On 05/13/2018 10:00 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> Move the delayed logging design documentation into the design book.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>   .../delayed_logging.asciidoc                       |  808 ++++++++++++++++++++
>   design/XFS_Filesystem_Structure/docinfo.xml        |   14
>   .../xfs_filesystem_structure.asciidoc              |    2
>   design/xfs-delayed-logging-design.asciidoc         |  810 --------------------
>   4 files changed, 824 insertions(+), 810 deletions(-)
>   create mode 100644 design/XFS_Filesystem_Structure/delayed_logging.asciidoc
>   delete mode 100644 design/xfs-delayed-logging-design.asciidoc
>
>
> diff --git a/design/XFS_Filesystem_Structure/delayed_logging.asciidoc b/design/XFS_Filesystem_Structure/delayed_logging.asciidoc
> new file mode 100644
> index 0000000..e9a336f
> --- /dev/null
> +++ b/design/XFS_Filesystem_Structure/delayed_logging.asciidoc
> @@ -0,0 +1,808 @@
> += Delayed Logging
> +
> +== Introduction to Re-logging in XFS
> +
> +XFS logging is a combination of logical and physical logging. Some objects,
> +such as inodes and dquots, are logged in logical format where the details
> +logged are made up of the changes to in-core structures rather than on-disk
> +structures. Other objects - typically buffers - have their physical changes
> +logged. The reason for these differences is to reduce the amount of log space
> +required for objects that are frequently logged. Some parts of inodes are more
> +frequently logged than others, and inodes are typically more frequently logged
> +than any other object (except maybe the superblock buffer) so keeping the
> +amount of metadata logged low is of prime importance.
> +
> +The reason that this is such a concern is that XFS allows multiple separate
> +modifications to a single object to be carried in the log at any given time.
> +This allows the log to avoid needing to flush each change to disk before
> +recording a new change to the object. XFS does this via a method called
> +"re-logging". Conceptually, this is quite simple - all it requires is that any
> +new change to the object is recorded with a *new copy* of all the existing
> +changes in the new transaction that is written to the log.
> +
> +That is, if we have a sequence of changes A through to F, and the object was
> +written to disk after change D, we would see in the log the following series
> +of transactions, their contents and the log sequence number (LSN) of the
> +transaction:
> +
> +....
> +	Transaction		Contents	LSN
> +	   A			   A		   X
> +	   B			  A+B		  X+n
> +	   C			 A+B+C		 X+n+m
> +	   D			A+B+C+D		X+n+m+o
> +	    <object written to disk>
> +	   E			   E		   Y (> X+n+m+o)
> +	   F			  E+F		  Y+p
> +....
> +
> +In other words, each time an object is relogged, the new transaction contains
> +the aggregation of all the previous changes currently held only in the log.
> +
> +This relogging technique also allows objects to be moved forward in the log so
> +that an object being relogged does not prevent the tail of the log from ever
> +moving forward.  This can be seen in the table above by the changing
> +(increasing) LSN of each subsequent transaction - the LSN is effectively a
> +direct encoding of the location in the log of the transaction.
> +
> +This relogging is also used to implement long-running, multiple-commit
> +transactions.  These transactions are known as rolling transactions, and require
> +a special log reservation known as a permanent transaction reservation. A
> +typical example of a rolling transaction is the removal of extents from an
> +inode which can only be done at a rate of two extents per transaction because
> +of reservation size limitations. Hence a rolling extent removal transaction
> +keeps relogging the inode and btree buffers as they get modified in each
> +removal operation. This keeps them moving forward in the log as the operation
> +progresses, ensuring that the current operation never gets blocked by itself
> +if the log wraps around.
> +
> +Hence it can be seen that the relogging operation is fundamental to the correct
> +working of the XFS journalling subsystem. From the above description, most
> +people should be able to see why XFS metadata operations write so much to
> +the log - repeated operations to the same objects write the same changes to
> +the log over and over again. Worse is the fact that objects tend to get
> +dirtier as they get relogged, so each subsequent transaction is writing more
> +metadata into the log.
> +
> +Another feature of the XFS transaction subsystem is that most transactions are
> +asynchronous. That is, they don't commit to disk until either a log buffer is
> +filled (a log buffer can hold multiple transactions) or a synchronous operation
> +forces the log buffers holding the transactions to disk. This means that XFS is
> +doing aggregation of transactions in memory - batching them, if you like - to
> +minimise the impact of the log IO on transaction throughput.
> +
> +The limitation on asynchronous transaction throughput is the number and size of
> +log buffers made available by the log manager. By default there are 8 log
> +buffers available and the size of each is 32kB - the size can be increased up
> +to 256kB by use of a mount option.
> +
> +Effectively, this gives us the maximum bound of outstanding metadata changes
> +that can be made to the filesystem at any point in time - if all the log
> +buffers are full and under IO, then no more transactions can be committed until
> +the current batch completes. It is now common for a single modern CPU core to
> +be able to issue enough transactions to keep the log buffers full and under
> +IO permanently. Hence the XFS journalling subsystem can be considered to be IO
> +bound.
> +
> +== Delayed Logging Concepts
> +
> +The key thing to note about the asynchronous logging combined with the
> +relogging technique XFS uses is that we can be relogging changed objects
> +multiple times before they are committed to disk in the log buffers. If we
> +return to the previous relogging example, it is entirely possible that
> +transactions A through D are committed to disk in the same log buffer.
> +
> +That is, a single log buffer may contain multiple copies of the same object,
> +but only one of those copies needs to be there - the last one "D", as it
> +contains all the changes from the previous copies. In other words, we have one
> +necessary copy in the log buffer, and three stale copies that are simply
> +wasting space. When we are doing repeated operations on the same set of
> +objects, these "stale objects" can be over 90% of the space used in the log
> +buffers. It is clear that reducing the number of stale objects written to the
> +log would greatly reduce the amount of metadata we write to the log, and this
> +is the fundamental goal of delayed logging.
> +
> +From a conceptual point of view, XFS is already doing relogging in memory (where
> +memory == log buffer), only it is doing it extremely inefficiently. It is using
> +logical to physical formatting to do the relogging because there is no
> +infrastructure to keep track of logical changes in memory prior to physically
> +formatting the changes in a transaction to the log buffer. Hence we cannot avoid
> +accumulating stale objects in the log buffers.
> +
> +Delayed logging is the name we've given to keeping and tracking transactional
> +changes to objects in memory outside the log buffer infrastructure. Because of
> +the relogging concept fundamental to the XFS journalling subsystem, this is
> +actually relatively easy to do - all the changes to logged items are already
> +tracked in the current infrastructure. The big problem is how to accumulate
> +them and get them to the log in a consistent, recoverable manner.
> +Describing the problems and how they have been solved is the focus of this
> +document.
> +
> +One of the key changes that delayed logging makes to the operation of the
> +journalling subsystem is that it disassociates the amount of outstanding
> +metadata changes from the size and number of log buffers available. In other
> +words, instead of there only being a maximum of 2MB of transaction changes not
> +written to the log at any point in time, there may be a much greater amount
> +being accumulated in memory. Hence the potential for loss of metadata on a
> +crash is much greater than for the existing logging mechanism.
> +
> +It should be noted that this does not change the guarantee that log recovery
> +will result in a consistent filesystem. What it does mean is that as far as the
> +recovered filesystem is concerned, there may be many thousands of transactions
> +that simply did not occur as a result of the crash. This makes it even more
> +important that applications that care about their data use fsync() where they
> +need to ensure application level data integrity is maintained.
> +
> +It should be noted that delayed logging is not an innovative new concept that
> +warrants rigorous proofs to determine whether it is correct or not. The method
> +of accumulating changes in memory for some period before writing them to the
> +log is used effectively in many filesystems including ext3 and ext4. Hence
> +no time is spent in this document trying to convince the reader that the
> +concept is sound. Instead it is simply considered a "solved problem" and as
> +such implementing it in XFS is purely an exercise in software engineering.
> +
> +The fundamental requirements for delayed logging in XFS are simple:
> +
> +	. Reduce the amount of metadata written to the log by at least
> +	   an order of magnitude.
> +	. Supply sufficient statistics to validate Requirement #1.
> +	. Supply sufficient new tracing infrastructure to be able to debug
> +	   problems with the new code.
> +	. No on-disk format change (metadata or log format).
> +	. Enable and disable with a mount option.
> +	. No performance regressions for synchronous transaction workloads.
> +
> +== Delayed Logging Design
> +
> +=== Storing Changes
> +
> +The problem with accumulating changes at a logical level (i.e. just using the
> +existing log item dirty region tracking) is that when it comes to writing the
> +changes to the log buffers, we need to ensure that the object we are formatting
> +is not changing while we do this. This requires locking the object to prevent
> +concurrent modification. Hence flushing the logical changes to the log would
> +require us to lock every object, format them, and then unlock them again.
> +
> +This introduces lots of scope for deadlocks with transactions that are already
> +running. For example, a transaction has object A locked and modified, but needs
> +the delayed logging tracking lock to commit the transaction. However, the
> +flushing thread has the delayed logging tracking lock already held, and is
> +trying to get the lock on object A to flush it to the log buffer. This appears
> +to be an unsolvable deadlock condition, and it was solving this problem that
> +was the barrier to implementing delayed logging for so long.
> +
> +The solution is relatively simple - it just took a long time to recognise it.
> +Put simply, the current logging code formats the changes to each item into a
> +vector array that points to the changed regions in the item. The log write code
> +simply copies the memory these vectors point to into the log buffer during
> +transaction commit while the item is locked in the transaction. Instead of
> +using the log buffer as the destination of the formatting code, we can use an
> +allocated memory buffer big enough to fit the formatted vector.
> +
> +If we then copy the vector into the memory buffer and rewrite the vector to
> +point to the memory buffer rather than the object itself, we now have a copy of
> +the changes in a format that is compatible with the log buffer writing code
> +and that does not require us to lock the item to access it. This formatting and
> +rewriting can all be done while the object is locked during transaction commit,
> +resulting in a vector that is transactionally consistent and can be accessed
> +without needing to lock the owning item.
> +
> +Hence we avoid the need to lock items when we need to flush outstanding
> +asynchronous transactions to the log. The differences between the existing
> +formatting method and the delayed logging formatting can be seen in the
> +diagram below.
> +
> +Current format log vector:
> +....
> +Object    +---------------------------------------------+
> +Vector 1      +----+
> +Vector 2                    +----+
> +Vector 3                                   +----------+
> +....
> +
> +After formatting:
> +
> +....
> +Log Buffer    +-V1-+-V2-+----V3----+
> +....
> +
> +Delayed logging vector:
> +
> +....
> +Object    +---------------------------------------------+
> +Vector 1      +----+
> +Vector 2                    +----+
> +Vector 3                                   +----------+
> +....
> +
> +After formatting:
> +
> +....
> +Memory Buffer +-V1-+-V2-+----V3----+
> +Vector 1      +----+
> +Vector 2           +----+
> +Vector 3                +----------+
> +....
> +
> +The memory buffer and associated vector need to be passed as a single object,
> +but still need to be associated with the parent object so if the object is
> +relogged we can replace the current memory buffer with a new memory buffer that
> +contains the latest changes.
> +
> +The reason for keeping the vector around after we've formatted the memory
> +buffer is to support splitting vectors across log buffer boundaries correctly.
> +If we don't keep the vector around, we do not know where the region boundaries
> +are in the item, so we'd need a new encapsulation method for regions in the log
> +buffer writing (i.e. double encapsulation). This would be an on-disk format
> +change and as such is not desirable.  It also means we'd have to write the log
> +region headers in the formatting stage, which is problematic as there is per
> +region state that needs to be placed into the headers during the log write.
> +
> +Hence we need to keep the vector, but by attaching the memory buffer to it and
> +rewriting the vector addresses to point at the memory buffer we end up with a
> +self-describing object that can be passed to the log buffer write code to be
> +handled in exactly the same manner as the existing log vectors are handled.
> +Hence we avoid needing a new on-disk format to handle items that have been
> +relogged in memory.
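> +
> +As an illustrative sketch only - the structure and function names here are
> +hypothetical and simplified rather than the actual kernel definitions - the
> +formatting step described above boils down to copying each changed region into
> +a private buffer and redirecting the vector at the copy:
> +
> +----
> +#include <string.h>
> +
> +/* One contiguous changed region within a logged item. */
> +struct log_iovec {
> +        void *addr;             /* initially points into the item itself */
> +        int len;
> +};
> +
> +/* Self-describing, formatted copy of an item's changes. */
> +struct log_vector {
> +        struct log_vector *next;        /* chained together for checkpoint writes */
> +        int count;                      /* number of changed regions */
> +        struct log_iovec *iovecs;
> +        char *buf;                      /* allocated memory buffer */
> +};
> +
> +/*
> + * Called while the item is locked during transaction commit: copy each
> + * changed region into the private buffer and rewrite the iovec to point
> + * at the copy.  After this, the vector can be written to the log buffers
> + * without taking the item lock again.
> + */
> +static void log_vector_format(struct log_vector *lv)
> +{
> +        char *p = lv->buf;
> +        int i;
> +
> +        for (i = 0; i < lv->count; i++) {
> +                memcpy(p, lv->iovecs[i].addr, lv->iovecs[i].len);
> +                lv->iovecs[i].addr = p;
> +                p += lv->iovecs[i].len;
> +        }
> +}
> +----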
> +
> +
> +=== Tracking Changes
> +
> +Now that we can record transactional changes in memory in a form that allows
> +them to be used without limitations, we need to be able to track and accumulate
> +them so that they can be written to the log at some later point in time.  The
> +log item is the natural place to store this vector and buffer, and it also
> +makes sense to use it to track committed objects, as it will always exist once
> +the object has been included in a transaction.
> +
> +The log item is already used to track the log items that have been written to
> +the log but not yet written to disk. Such log items are considered "active"
> +and as such are stored in the Active Item List (AIL), which is an LSN-ordered
> +doubly linked list. Items are inserted into this list during log buffer IO
> +completion, after which they are unpinned and can be written to disk. An object
> +that is in the AIL can be relogged, which causes the object to be pinned again
> +and then moved forward in the AIL when the log buffer IO completes for that
> +transaction.
> +
> +Essentially, this shows that an item that is in the AIL can still be modified
> +and relogged, so any tracking must be separate to the AIL infrastructure. As
> +such, we cannot reuse the AIL list pointers for tracking committed items, nor
> +can we store state in any field that is protected by the AIL lock. Hence the
> +committed item tracking needs its own locks, lists and state fields in the log
> +item.
> +
> +Similar to the AIL, tracking of committed items is done through a new list
> +called the Committed Item List (CIL).  The list tracks log items that have been
> +committed and have formatted memory buffers attached to them. It tracks objects
> +in transaction commit order, so when an object is relogged it is removed from
> +its place in the list and re-inserted at the tail. This is entirely arbitrary
> +and done to make it easy for debugging - the last items in the list are the
> +ones that are most recently modified. Ordering of the CIL is not necessary for
> +transactional integrity (as discussed in the next section) so the ordering is
> +done for convenience/sanity of the developers.
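> +
> +A kernel-style sketch of the insertion rule (again with made-up names, not the
> +real code) - a relogged item simply migrates to the tail of the CIL:
> +
> +----
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +struct cil {
> +        spinlock_t lock;                /* protects the item list */
> +        struct list_head items;         /* commit-ordered log items */
> +};
> +
> +struct cil_log_item {
> +        /*
> +         * Initialised with INIT_LIST_HEAD() and removed with list_del_init()
> +         * at checkpoint time, so list_empty() means "not on the CIL".
> +         */
> +        struct list_head cil_entry;
> +        struct log_vector *lv;          /* current formatted changes */
> +};
> +
> +/* Insert a freshly formatted item, or move a relogged one to the tail. */
> +static void cil_insert_item(struct cil *cil, struct cil_log_item *lip,
> +                            struct log_vector *new_lv)
> +{
> +        spin_lock(&cil->lock);
> +        lip->lv = new_lv;                       /* replaces any older copy */
> +        if (!list_empty(&lip->cil_entry))
> +                list_del(&lip->cil_entry);      /* relogged: reposition */
> +        list_add_tail(&lip->cil_entry, &cil->items);
> +        spin_unlock(&cil->lock);
> +}
> +----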
> +
> +
> +=== Checkpoints
> +
> +When we have a log synchronisation event, commonly known as a "log force",
> +all the items in the CIL must be written into the log via the log buffers.
> +We need to write these items in the order that they exist in the CIL, and they
> +need to be written as an atomic transaction. The need for all the objects to be
> +written as an atomic transaction comes from the requirements of relogging and
> +log replay - all the changes in all the objects in a given transaction must
> +either be completely replayed during log recovery, or not replayed at all. If
> +a transaction is not replayed because it is not complete in the log, then
> +no later transactions should be replayed, either.
> +
> +To fulfill this requirement, we need to write the entire CIL in a single log
> +transaction. Fortunately, the XFS log code has no fixed limit on the size of a
> +transaction, nor does the log replay code. The only fundamental limit is that
> +the transaction cannot be larger than just under half the size of the log.  The
> +reason for this limit is that to find the head and tail of the log, there must
> +be at least one complete transaction in the log at any given time. If a
> +transaction is larger than half the log, then there is the possibility that a
> +crash during the write of such a transaction could partially overwrite the
> +only complete previous transaction in the log. This will result in a recovery
> +failure and an inconsistent filesystem and hence we must enforce the maximum
> +size of a checkpoint to be slightly less than half the log.
> +
> +Apart from this size requirement, a checkpoint transaction looks no different
> +to any other transaction - it contains a transaction header, a series of
> +formatted log items and a commit record at the tail. From a recovery
> +perspective, the checkpoint transaction is also no different - just a lot
> +bigger with a lot more items in it. The worst case effect of this is that we
> +might need to tune the recovery transaction object hash size.
> +
> +Because the checkpoint is just another transaction and all the changes to log
> +items are stored as log vectors, we can use the existing log buffer writing
> +code to write the changes into the log. To do this efficiently, we need to
> +minimise the time we hold the CIL locked while writing the checkpoint
> +transaction. The current log write code enables us to do this easily with the
> +way it separates the writing of the transaction contents (the log vectors) from
> +the transaction commit record, but tracking this requires us to have a
> +per-checkpoint context that travels through the log write process through to
> +checkpoint completion.
> +
> +Hence a checkpoint has a context that tracks the state of the current
> +checkpoint from initiation to checkpoint completion. A new context is initiated
> +at the same time a checkpoint transaction is started. That is, when we remove
> +all the current items from the CIL during a checkpoint operation, we move all
> +those changes into the current checkpoint context. We then initialise a new
> +context and attach that to the CIL for aggregation of new transactions.
> +
> +This allows us to unlock the CIL immediately after transfer of all the
> +committed items and effectively allow new transactions to be issued while we
> +are formatting the checkpoint into the log. It also allows concurrent
> +checkpoints to be written into the log buffers in the case of log force heavy
> +workloads, just like the existing transaction commit code does. This, however,
> +requires that we strictly order the commit records in the log so that
> +checkpoint sequence order is maintained during log replay.
> +
> +Because an item may be written into a checkpoint transaction at the same time
> +another transaction modifies the item and inserts the log item into the new
> +CIL, the checkpoint transaction commit code cannot use log items to store the
> +list of log vectors that need to be written into the transaction.
> +Hence log vectors need to be able to be chained together to allow them to be
> +detached from the log items. That is, when the CIL is flushed the memory
> +buffer and log vector attached to each log item needs to be attached to the
> +checkpoint context so that the log item can be released. In diagrammatic form,
> +the CIL would look like this before the flush:
> +
> +----
> +	CIL Head
> +	   |
> +	   V
> +	Log Item <-> log vector 1	-> memory buffer
> +	   |				-> vector array
> +	   V
> +	Log Item <-> log vector 2	-> memory buffer
> +	   |				-> vector array
> +	   V
> +	......
> +	   |
> +	   V
> +	Log Item <-> log vector N-1	-> memory buffer
> +	   |				-> vector array
> +	   V
> +	Log Item <-> log vector N	-> memory buffer
> +					-> vector array
> +----
> +
> +And after the flush the CIL head is empty, and the checkpoint context log
> +vector list would look like:
> +
> +----
> +	Checkpoint Context
> +	   |
> +	   V
> +	log vector 1	-> memory buffer
> +	   |		-> vector array
> +	   |		-> Log Item
> +	   V
> +	log vector 2	-> memory buffer
> +	   |		-> vector array
> +	   |		-> Log Item
> +	   V
> +	......
> +	   |
> +	   V
> +	log vector N-1	-> memory buffer
> +	   |		-> vector array
> +	   |		-> Log Item
> +	   V
> +	log vector N	-> memory buffer
> +			-> vector array
> +			-> Log Item
> +----
> +
> +Once this transfer is done, the CIL can be unlocked and new transactions can
> +start, while the checkpoint flush code works over the log vector chain to
> +commit the checkpoint.
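> +
> +Building on the earlier sketches (and assuming the CIL structure also carries
> +a flush lock, a pointer to the current context and a list of committing
> +contexts - see the fuller locking sketch in the scalability section below),
> +the push step could look roughly like this; all names are hypothetical:
> +
> +----
> +struct cil_context {
> +        unsigned long sequence;         /* checkpoint sequence number */
> +        struct log_vector *lv_chain;    /* detached, formatted vectors */
> +        struct list_head committing;    /* entry on cil->committing */
> +        u64 commit_lsn;                 /* set once the commit record is written */
> +};
> +
> +/* Gather the current CIL contents into @ctx and install a new context. */
> +static void cil_push_prepare(struct cil *cil, struct cil_context *ctx)
> +{
> +        struct cil_context *new_ctx = alloc_new_context();
> +        struct cil_log_item *lip, *n;
> +
> +        down_write(&cil->flush_lock);   /* hold out transaction commits */
> +
> +        list_for_each_entry_safe(lip, n, &cil->items, cil_entry) {
> +                /* chain the vector to the context, release the item */
> +                lip->lv->next = ctx->lv_chain;
> +                ctx->lv_chain = lip->lv;
> +                lip->lv = NULL;
> +                list_del_init(&lip->cil_entry);
> +        }
> +
> +        new_ctx->sequence = ctx->sequence + 1;  /* monotonic sequence */
> +        cil->current_ctx = new_ctx;
> +        list_add_tail(&ctx->committing, &cil->committing);
> +
> +        up_write(&cil->flush_lock);
> +        /* New commits may now proceed while @ctx->lv_chain is written out. */
> +}
> +----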
> +
> +Once the checkpoint is written into the log buffers, the checkpoint context is
> +attached to the log buffer that the commit record was written to along with a
> +completion callback. Log IO completion will call that callback, which can then
> +run transaction committed processing for the log items (i.e. insert into AIL
> +and unpin) in the log vector chain and then free the log vector chain and
> +checkpoint context.
> +
> +Discussion Point: I am uncertain as to whether the log item is the most
> +efficient way to track vectors, even though it seems like the natural way to do
> +it. The fact that we walk the log items (in the CIL) just to chain the log
> +vectors and break the link between the log item and the log vector means that
> +we take a cache line hit for the log item list modification, then another for
> +the log vector chaining. If we track by the log vectors, then we only need to
> +break the link between the log item and the log vector, which means we should
> +dirty only the log item cachelines. Normally I wouldn't be concerned about one
> +vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
> +vectors in one checkpoint transaction. I'd guess this is a "measure and
> +compare" situation that can be done after a working and reviewed implementation
> +is in the dev tree....
> +
> +=== Checkpoint Sequencing
> +
> +One of the key aspects of the XFS transaction subsystem is that it tags
> +committed transactions with the log sequence number of the transaction commit.
> +This allows transactions to be issued asynchronously even though there may be
> +future operations that cannot be completed until that transaction is fully
> +committed to the log. In the rare case that a dependent operation occurs (e.g.
> +re-using a freed metadata extent for a data extent), a special, optimised log
> +force can be issued to force the dependent transaction to disk immediately.
> +
> +To do this, transactions need to record the LSN of the commit record of the
> +transaction. This LSN comes directly from the log buffer the transaction is
> +written into. While this works just fine for the existing transaction
> +mechanism, it does not work for delayed logging because transactions are not
> +written directly into the log buffers. Hence some other method of sequencing
> +transactions is required.
> +
> +As discussed in the checkpoint section, delayed logging uses per-checkpoint
> +contexts, and as such it is simple to assign a sequence number to each
> +checkpoint. Because the switching of checkpoint contexts must be done
> +atomically, it is simple to ensure that each new context has a monotonically
> +increasing sequence number assigned to it without the need for an external
> +atomic counter - we can just take the current context sequence number and add
> +one to it for the new context.
> +
> +Then, instead of assigning a log buffer LSN to the transaction commit LSN
> +during the commit, we can assign the current checkpoint sequence. This allows
> +operations that track transactions that have not yet completed to know what
> +checkpoint sequence needs to be committed before they can continue. As a
> +result, the code that forces the log to a specific LSN now needs to ensure that
> +the log forces to a specific checkpoint.
> +
> +To ensure that we can do this, we need to track all the checkpoint contexts
> +that are currently committing to the log. When we flush a checkpoint, the
> +context gets added to a "committing" list which can be searched. When a
> +checkpoint commit completes, it is removed from the committing list. Because
> +the checkpoint context records the LSN of the commit record for the checkpoint,
> +we can also wait on the log buffer that contains the commit record, thereby
> +using the existing log force mechanisms to execute synchronous forces.
> +
> +It should be noted that the synchronous forces may need to be extended with
> +mitigation algorithms similar to the current log buffer code to allow
> +aggregation of multiple synchronous transactions if there are already
> +synchronous transactions being flushed. Investigation of the performance of the
> +current design is needed before making any decisions here.
> +
> +The main concern with log forces is to ensure that all the previous checkpoints
> +are also committed to disk before the one we need to wait for. Therefore we
> +need to check that all the prior contexts in the committing list are also
> +complete before waiting on the one we need to complete. We do this
> +synchronisation in the log force code so that we don't need to wait anywhere
> +else for such serialisation - it only matters when we do a log force.
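> +
> +A simplified sketch of that sequencing logic, with hypothetical helper names
> +(cil_push(), wait_for_commit_record() and log_force_lsn() stand in for the
> +real mechanisms):
> +
> +----
> +/*
> + * Force the log up to and including checkpoint @sequence.  Every committing
> + * context at or below @sequence must have written its commit record before
> + * the log can be forced to the LSN of that record.
> + */
> +static void cil_force_sequence(struct cil *cil, unsigned long sequence)
> +{
> +        struct cil_context *ctx;
> +        u64 commit_lsn = 0;
> +
> +        if (cil->current_ctx->sequence <= sequence)
> +                cil_push(cil);          /* flush the current context as well */
> +
> +restart:
> +        spin_lock(&cil->lock);
> +        list_for_each_entry(ctx, &cil->committing, committing) {
> +                if (ctx->sequence > sequence)
> +                        continue;
> +                if (!ctx->commit_lsn) {
> +                        /* commit record not yet written: sleep, then rescan */
> +                        spin_unlock(&cil->lock);
> +                        wait_for_commit_record(cil, ctx);
> +                        goto restart;
> +                }
> +                if (ctx->commit_lsn > commit_lsn)
> +                        commit_lsn = ctx->commit_lsn;
> +        }
> +        spin_unlock(&cil->lock);
> +
> +        if (commit_lsn)
> +                log_force_lsn(commit_lsn);      /* existing log force path */
> +}
> +----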
> +
> +The only remaining complexity is that a log force now also has to handle the
> +case where the forcing sequence number is the same as the current context. That
> +is, we need to flush the CIL and potentially wait for it to complete. This is a
> +simple addition to the existing log forcing code to check the sequence numbers
> +and push if required. Indeed, placing the current sequence checkpoint flush in
> +the log force code enables the current mechanism for issuing synchronous
> +transactions to remain untouched (i.e. commit an asynchronous transaction, then
> +force the log at the LSN of that transaction) and so the higher level code
> +behaves the same regardless of whether delayed logging is being used or not.
> +
> +=== Checkpoint Log Space Accounting
> +
> +The big issue for a checkpoint transaction is the log space reservation for the
> +transaction. We don't know how big a checkpoint transaction is going to be
> +ahead of time, nor how many log buffers it will take to write out, nor how
> +many split log vector regions are going to be used. We can track the
> +amount of log space required as we add items to the commit item list, but we
> +still need to reserve the space in the log for the checkpoint.
> +
> +A typical transaction reserves enough space in the log for the worst case space
> +usage of the transaction. The reservation accounts for log record headers,
> +transaction and region headers, headers for split regions, buffer tail padding,
> +etc. as well as the actual space for all the changed metadata in the
> +transaction. While some of this is fixed overhead, much of it is dependent on
> +the size of the transaction and the number of regions being logged (the number
> +of log vectors in the transaction).
> +
> +An example of the differences would be logging directory changes versus logging
> +inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
> +there are lots of transactions that only contain an inode core and an inode log
> +format structure. That is, two vectors totaling roughly 150 bytes. If we modify
> +10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
> +vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
> +comparison, if we are logging full directory buffers, they are typically 4KB
> +each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a
> +buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
> +space.  From this, it should be obvious that a static log space reservation is
> +not particularly flexible and that it is difficult to select the "optimal
> +value" for all workloads.
> +
> +Further, if we are going to use a static reservation, which bit of the entire
> +reservation does it cover? We account for space used by the transaction
> +reservation by tracking the space currently used by the object in the CIL and
> +then calculating the increase or decrease in space used as the object is
> +relogged. This allows for a checkpoint reservation to only have to account for
> +log buffer metadata used such as log header records.
> +
> +However, even using a static reservation for just the log metadata is
> +problematic. Typically log record headers use at least 16KB of log space per
> +1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
> +large enough to handle arbitrary sized checkpoint transactions. This
> +reservation needs to be made before the checkpoint is started, and we need to
> +be able to reserve the space without sleeping.  For an 8MB checkpoint, we need a
> +reservation of around 150KB, which is a non-trivial amount of space.
> +
> +A static reservation needs to manipulate the log grant counters - we can take a
> +permanent reservation on the space, but we still need to make sure we refresh
> +the write reservation (the actual space available to the transaction) after
> +every checkpoint transaction completion. Unfortunately, if this space is not
> +available when required, then the regrant code will sleep waiting for it.
> +
> +The problem with this is that it can lead to deadlocks as we may need to commit
> +checkpoints to be able to free up log space (refer back to the description of
> +rolling transactions for an example of this).  Hence we *must* always have
> +space available in the log if we are to use static reservations, and that is
> +very difficult and complex to arrange. It is possible to do, but there is a
> +simpler way.
> +
> +The simpler way of doing this is tracking the entire log space used by the
> +items in the CIL and using this to dynamically calculate the amount of log
> +space required by the log metadata. If this log metadata space changes as a
> +result of a transaction commit inserting a new memory buffer into the CIL, then
> +the difference in space required is removed from the transaction that causes
> +the change. Transactions at this level will *always* have enough space
> +available in their reservation for this as they have already reserved the
> +maximal amount of log metadata space they require, and such a delta reservation
> +will always be less than or equal to the maximal amount in the reservation.
> +
> +Hence we can grow the checkpoint transaction reservation dynamically as items
> +are added to the CIL and avoid the need for reserving and regranting log space
> +up front. This avoids deadlocks and removes a blocking point from the
> +checkpoint flush code.
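> +
> +In sketch form (the field names are invented and the log record header
> +overhead is ignored for brevity), the accounting delta is just:
> +
> +----
> +/*
> + * Called during transaction commit after formatting the item: account the
> + * change in the item's formatted size against the checkpoint context, and
> + * take that delta out of the committing transaction's own reservation.
> + */
> +static void cil_account_item(struct cil_context *ctx, struct transaction *tp,
> +                             struct cil_log_item *lip, int new_size)
> +{
> +        int delta = new_size - lip->accounted_size;
> +
> +        lip->accounted_size = new_size;
> +        ctx->space_used += delta;
> +
> +        /*
> +         * The committing transaction has already reserved the maximal space
> +         * for this item, so the delta is always covered by its reservation.
> +         */
> +        tp->reservation_used += delta;
> +}
> +----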
> +
> +As mentioned earlier, transactions can't grow to more than half the size of the
> +log. Hence as part of the reservation growing, we need to also check the size
> +of the reservation against the maximum allowed transaction size. If we reach
> +the maximum threshold, we need to push the CIL to the log. This is effectively
> +a "background flush" and is done on demand. This is identical to
> +a CIL push triggered by a log force, except that there is no waiting for the
> +checkpoint commit to complete. This background push is checked and executed by
> +transaction commit code.
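> +
> +The background push trigger then becomes a simple check in the transaction
> +commit path; the threshold and helper shown here (cil_push_async() and the
> +limit field) are purely illustrative:
> +
> +----
> +/* Push the CIL in the background once it has grown past some fraction of
> + * the log - the actual threshold must stay well under half the log size. */
> +static void cil_push_background(struct cil *cil)
> +{
> +        if (cil->current_ctx->space_used > cil->background_push_limit)
> +                cil_push_async(cil);    /* no waiting for checkpoint completion */
> +}
> +----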
> +
> +If the transaction subsystem goes idle while we still have items in the CIL,
> +they will be flushed by the periodic log force issued by the xfssyncd. This log
> +force will push the CIL to disk, and if the transaction subsystem stays idle,
> +allow the idle log to be covered (effectively marked clean) in exactly the same
> +manner that is done for the existing logging method. A discussion point is
> +whether this log force needs to be done more frequently than the current rate
> +which is once every 30s.
> +
> +
> +=== Log Item Pinning
> +
> +Currently log items are pinned during transaction commit while the items are
> +still locked. This happens just after the items are formatted, though it could
> +be done any time before the items are unlocked. The result of this mechanism is
> +that items get pinned once for every transaction that is committed to the log
> +buffers. Hence items that are relogged in the log buffers will have a pin count
> +for every outstanding transaction they were dirtied in. When each of these
> +transactions is completed, they will unpin the item once. As a result, the item
> +only becomes unpinned when all the transactions complete and there are no
> +pending transactions. Thus the pinning and unpinning of a log item is symmetric
> +as there is a 1:1 relationship with transaction commit and log item completion.
> +
> +For delayed logging, however, we have an asymmetric transaction commit to
> +completion relationship. Every time an object is relogged in the CIL it goes
> +through the commit process without a corresponding completion being registered.
> +That is, we now have a many-to-one relationship between transaction commit and
> +log item completion. The result of this is that pinning and unpinning of the
> +log items becomes unbalanced if we retain the "pin on transaction commit, unpin
> +on transaction completion" model.
> +
> +To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
> +insertion into the CIL, unpin on checkpoint completion". In other words, the
> +pinning and unpinning becomes symmetric around a checkpoint context. We have to
> +pin the object the first time it is inserted into the CIL - if it is already in
> +the CIL during a transaction commit, then we do not pin it again. Because there
> +can be multiple outstanding checkpoint contexts, we can still see elevated pin
> +counts, but as each checkpoint completes the pin count will retain the correct
> +value according to its context.
> +
> +Just to make matters slightly more complex, this checkpoint level context
> +for the pin count means that the pinning of an item must take place under the
> +CIL commit/flush lock. If we pin the object outside this lock, we cannot
> +guarantee which context the pin count is associated with. This is because
> +whether we pin the item depends on whether the item is already present in the
> +current CIL. If we don't lock the CIL first before we check and pin the
> +object, we have a race with the CIL being flushed between the check and the pin
> +(or not pinning, as the case may be). Hence we must hold the CIL flush/commit
> +lock to guarantee that we pin the items correctly.
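> +
> +Extending the earlier CIL insertion sketch, the rule is simply "pin on first
> +insertion, while the CIL commit/flush lock is held" (pin_item() is a
> +hypothetical helper):
> +
> +----
> +/* Caller holds the CIL commit/flush lock across formatting and insertion. */
> +static void cil_insert_and_pin(struct cil *cil, struct cil_log_item *lip,
> +                               struct log_vector *new_lv)
> +{
> +        spin_lock(&cil->lock);
> +        if (list_empty(&lip->cil_entry))
> +                pin_item(lip);                  /* first time in this context */
> +        else
> +                list_del(&lip->cil_entry);      /* relogged: no extra pin */
> +        lip->lv = new_lv;
> +        list_add_tail(&lip->cil_entry, &cil->items);
> +        spin_unlock(&cil->lock);
> +}
> +
> +/* Checkpoint completion later drops the single pin taken above. */
> +----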
> +
> +=== Concurrent Scalability
> +
> +A fundamental requirement for the CIL is that accesses through transaction
> +commits must scale to many concurrent commits. The current transaction commit
> +code does not break down even when there are transactions coming from 2048
> +processors at once. The current transaction code does not go any faster than if
> +there was only one CPU using it, but it does not slow down either.
> +
> +As a result, the delayed logging transaction commit code needs to be designed
> +for concurrency from the ground up. It is obvious that there are serialisation
> +points in the design - the three important ones are:
> +
> +	. Locking out new transaction commits while flushing the CIL
> +	. Adding items to the CIL and updating item space accounting
> +	. Checkpoint commit ordering
> +
> +Looking at the transaction commit and CIL flushing interactions, it is clear
> +that we have a many-to-one interaction here. That is, the only restriction on
> +the number of concurrent transactions that can be trying to commit at once is
> +the amount of space available in the log for their reservations. The practical
> +limit here is in the order of several hundred concurrent transactions for a
> +128MB log, which means that it is generally one per CPU in a machine.
> +
> +A transaction commit needs to hold out a flush for a relatively long period of
> +time - the pinning of log items needs to be done
> +while we are holding out a CIL flush, so at the moment that means it is held
> +across the formatting of the objects into memory buffers (i.e. while memcpy()s
> +are in progress). Ultimately a two pass algorithm where the formatting is done
> +separately to the pinning of objects could be used to reduce the hold time of
> +the transaction commit side.
> +
> +Because of the number of potential transaction commit side holders, the lock
> +really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
> +want every other CPU in the machine spinning on the CIL lock. Given that
> +flushing the CIL could involve walking a list of tens of thousands of log
> +items, it will get held for a significant time and so spin contention is a
> +significant concern. Preventing lots of CPUs spinning doing nothing is the
> +main reason for choosing a sleeping lock even though nothing in either the
> +transaction commit or CIL flush side sleeps with the lock held.
> +
> +It should also be noted that CIL flushing is also a relatively rare operation
> +compared to transaction commit for asynchronous transaction workloads - only
> +time will tell if using a read-write semaphore for exclusion will limit
> +transaction commit concurrency due to cache line bouncing of the lock on the
> +read side.
> +
> +The second serialisation point is on the transaction commit side where items
> +are inserted into the CIL. Because transactions can enter this code
> +concurrently, the CIL needs to be protected separately from the above
> +commit/flush exclusion. It also needs to be an exclusive lock but it is only
> +held for a very short time and so a spin lock is appropriate here. It is
> +possible that this lock will become a contention point, but given the short
> +hold time once per transaction I think that contention is unlikely.
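> +
> +Pulling the serialisation points together, a fuller version of the CIL
> +structure sketched earlier might look something like this (field names are
> +invented, not the real layout):
> +
> +----
> +#include <linux/list.h>
> +#include <linux/rwsem.h>
> +#include <linux/spinlock.h>
> +#include <linux/wait.h>
> +
> +struct cil {
> +        /*
> +         * 1. Commit/flush exclusion: transaction commits take this shared,
> +         *    the CIL push takes it exclusive.  A sleeping lock, because the
> +         *    push holds it across a walk of tens of thousands of items.
> +         */
> +        struct rw_semaphore flush_lock;
> +
> +        /* 2. CIL insertion and space accounting: short hold, so a spinlock. */
> +        spinlock_t lock;
> +        struct list_head items;                 /* commit-ordered log items */
> +        struct cil_context *current_ctx;
> +
> +        /* 3. Checkpoint commit ordering: contexts still being committed. */
> +        struct list_head committing;
> +        wait_queue_head_t commit_wait;          /* broadcast wakeups */
> +};
> +----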
> +
> +The final serialisation point is the checkpoint commit record ordering code
> +that is run as part of the checkpoint commit and log force sequencing. The code
> +path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
> +an ordering loop after writing all the log vectors into the log buffers but
> +before writing the commit record. This loop walks the list of committing
> +checkpoints and needs to block waiting for checkpoints to complete their commit
> +record write. As a result it needs a lock and a wait variable. Log force
> +sequencing also requires the same lock, list walk, and blocking mechanism to
> +ensure completion of checkpoints.
> +
> +These two sequencing operations can use the same mechanism even though the
> +events they are waiting for are different. The checkpoint commit record
> +sequencing needs to wait until checkpoint contexts contain a commit LSN
> +(obtained through completion of a commit record write) while log force
> +sequencing needs to wait until previous checkpoint contexts are removed from
> +the committing list (i.e. they've completed). A simple wait variable and
> +broadcast wakeups (thundering herds) have been used to implement these two
> +serialisation queues. They use the same lock as the CIL, too. If we see too
> +much contention on the CIL lock, or too many context switches as a result of
> +the broadcast wakeups, these operations can be put under a new spinlock and
> +given separate wait lists to reduce lock contention and the number of processes
> +woken by the wrong event.
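> +
> +A sketch of the commit record ordering loop (hypothetical names once more;
> +the real wakeup and memory ordering details are omitted):
> +
> +----
> +/*
> + * Before writing our own commit record, wait for every earlier committing
> + * checkpoint to have written its commit record, so that commit records land
> + * in the log in checkpoint sequence order.
> + */
> +static void cil_order_commit_record(struct cil *cil, struct cil_context *ctx)
> +{
> +        struct cil_context *prev;
> +
> +restart:
> +        spin_lock(&cil->lock);
> +        list_for_each_entry(prev, &cil->committing, committing) {
> +                if (prev->sequence >= ctx->sequence)
> +                        break;          /* list is sequence ordered */
> +                if (!prev->commit_lsn) {
> +                        spin_unlock(&cil->lock);
> +                        /* woken by a broadcast when any commit record hits the log */
> +                        wait_event(cil->commit_wait, prev->commit_lsn != 0);
> +                        goto restart;
> +                }
> +        }
> +        spin_unlock(&cil->lock);
> +}
> +----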
> +
> +
> +=== Lifecycle Changes
> +
> +The existing log item life cycle is as follows:
> +
> +----
> +	1. Transaction allocate
> +	2. Transaction reserve
> +	3. Lock item
> +	4. Join item to transaction
> +		If not already attached,
> +			Allocate log item
> +			Attach log item to owner item
> +		Attach log item to transaction
> +	5. Modify item
> +		Record modifications in log item
> +	6. Transaction commit
> +		Pin item in memory
> +		Format item into log buffer
> +		Write commit LSN into transaction
> +		Unlock item
> +		Attach transaction to log buffer
> +
> +	<log buffer IO dispatched>
> +	<log buffer IO completes>
> +
> +	7. Transaction completion
> +		Mark log item committed
> +		Insert log item into AIL
> +			Write commit LSN into log item
> +		Unpin log item
> +	8. AIL traversal
> +		Lock item
> +		Mark log item clean
> +		Flush item to disk
> +
> +	<item IO completion>
> +
> +	9. Log item removed from AIL
> +		Moves log tail
> +		Item unlocked
> +----
> +
> +Essentially, steps 1-6 operate independently from step 7, which is also
> +independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
> +at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
> +at the same time. If the log item is in the AIL or between steps 6 and 7
> +and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
> +are entered and completed is the object considered clean.
> +
> +With delayed logging, there are new steps inserted into the life cycle:
> +
> +----
> +	1. Transaction allocate
> +	2. Transaction reserve
> +	3. Lock item
> +	4. Join item to transaction
> +		If not already attached,
> +			Allocate log item
> +			Attach log item to owner item
> +		Attach log item to transaction
> +	5. Modify item
> +		Record modifications in log item
> +	6. Transaction commit
> +		Pin item in memory if not pinned in CIL
> +		Format item into log vector + buffer
> +		Attach log vector and buffer to log item
> +		Insert log item into CIL
> +		Write CIL context sequence into transaction
> +		Unlock item
> +
> +	<next log force>
> +
> +	7. CIL push
> +		lock CIL flush
> +		Chain log vectors and buffers together
> +		Remove items from CIL
> +		unlock CIL flush
> +		write log vectors into log
> +		sequence commit records
> +		attach checkpoint context to log buffer
> +
> +	<log buffer IO dispatched>
> +	<log buffer IO completes>
> +
> +	8. Checkpoint completion
> +		Mark log item committed
> +		Insert item into AIL
> +			Write commit LSN into log item
> +		Unpin log item
> +	9. AIL traversal
> +		Lock item
> +		Mark log item clean
> +		Flush item to disk
> +	<item IO completion>
> +	10. Log item removed from AIL
> +		Moves log tail
> +		Item unlocked
> +----
> +
> +From this, it can be seen that the only life cycle differences between the two
> +logging methods are in the middle of the life cycle - they still have the same
> +beginning and end and execution constraints. The only differences are in the
> +committing of the log items to the log itself and the completion processing.
> +Hence delayed logging should not introduce any constraints on log item
> +behaviour, allocation or freeing that don't already exist.
> +
> +As a result of this zero-impact "insertion" of delayed logging infrastructure
> +and the design of the internal structures to avoid on-disk format changes, we
> +can basically switch between delayed logging and the existing mechanism with a
> +mount option. Fundamentally, there is no reason why the log manager would not
> +be able to swap methods automatically and transparently depending on load
> +characteristics, but this should not be necessary if delayed logging works as
> +designed.
> +
> +EOF.
> diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
> index e13d705..558a04c 100644
> --- a/design/XFS_Filesystem_Structure/docinfo.xml
> +++ b/design/XFS_Filesystem_Structure/docinfo.xml
> @@ -169,4 +169,18 @@
>   			</simplelist>
>   		</revdescription>
>   	</revision>
> +	<revision>
> +		<revnumber>3.141592</revnumber>
> +		<date>May 2018</date>
> +		<author>
> +			<firstname>Darrick</firstname>
> +			<surname>Wong</surname>
> +			<email>darrick.wong@oracle.com</email>
> +		</author>
> +		<revdescription>
> +			<simplelist>
> +				<member>Incorporate Dave Chinner's log design document.</member>
> +			</simplelist>
> +		</revdescription>
> +	</revision>
>   </revhistory>
> diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> index 8d54935..7bdfade 100644
> --- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> +++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
> @@ -48,6 +48,8 @@ include::overview.asciidoc[]
>   
>   include::metadata_integrity.asciidoc[]
>   
> +include::delayed_logging.asciidoc[]
> +
>   include::reflink.asciidoc[]
>   
>   include::reconstruction.asciidoc[]
> diff --git a/design/xfs-delayed-logging-design.asciidoc b/design/xfs-delayed-logging-design.asciidoc
> deleted file mode 100644
> index e54e786..0000000
> --- a/design/xfs-delayed-logging-design.asciidoc
> +++ /dev/null
> @@ -1,810 +0,0 @@
> -= XFS Delayed Logging Design
> -Dave Chinner, <dchinner@redhat.com>
> -v1.0, Feb 2014: Initial conversion to asciidoc
> -
> -== Introduction to Re-logging in XFS
> -
> -XFS logging is a combination of logical and physical logging. Some objects,
> -such as inodes and dquots, are logged in logical format where the details
> -logged are made up of the changes to in-core structures rather than on-disk
> -structures. Other objects - typically buffers - have their physical changes
> -logged. The reason for these differences is to reduce the amount of log space
> -required for objects that are frequently logged. Some parts of inodes are more
> -frequently logged than others, and inodes are typically more frequently logged
> -than any other object (except maybe the superblock buffer) so keeping the
> -amount of metadata logged low is of prime importance.
> -
> -The reason that this is such a concern is that XFS allows multiple separate
> -modifications to a single object to be carried in the log at any given time.
> -This allows the log to avoid needing to flush each change to disk before
> -recording a new change to the object. XFS does this via a method called
> -"re-logging". Conceptually, this is quite simple - all it requires is that any
> -new change to the object is recorded with a *new copy* of all the existing
> -changes in the new transaction that is written to the log.
> -
> -That is, if we have a sequence of changes A through to F, and the object was
> -written to disk after change D, we would see in the log the following series
> -of transactions, their contents and the log sequence number (LSN) of the
> -transaction:
> -
> -....
> -	Transaction		Contents	LSN
> -	   A			   A		   X
> -	   B			  A+B		  X+n
> -	   C			 A+B+C		 X+n+m
> -	   D			A+B+C+D		X+n+m+o
> -	    <object written to disk>
> -	   E			   E		   Y (> X+n+m+o)
> -	   F			  E+F		  Y+p
> -....
> -
> -In other words, each time an object is relogged, the new transaction contains
> -the aggregation of all the previous changes currently held only in the log.
> -
> -This relogging technique also allows objects to be moved forward in the log so
> -that an object being relogged does not prevent the tail of the log from ever
> -moving forward.  This can be seen in the table above by the changing
> -(increasing) LSN of each subsequent transaction - the LSN is effectively a
> -direct encoding of the location in the log of the transaction.
> -
> -This relogging is also used to implement long-running, multiple-commit
> -transactions.  These transaction are known as rolling transactions, and require
> -a special log reservation known as a permanent transaction reservation. A
> -typical example of a rolling transaction is the removal of extents from an
> -inode which can only be done at a rate of two extents per transaction because
> -of reservation size limitations. Hence a rolling extent removal transaction
> -keeps relogging the inode and btree buffers as they get modified in each
> -removal operation. This keeps them moving forward in the log as the operation
> -progresses, ensuring that current operation never gets blocked by itself if the
> -log wraps around.
> -
> -Hence it can be seen that the relogging operation is fundamental to the correct
> -working of the XFS journalling subsystem. From the above description, most
> -people should be able to see why the XFS metadata operations writes so much to
> -the log - repeated operations to the same objects write the same changes to
> -the log over and over again. Worse is the fact that objects tend to get
> -dirtier as they get relogged, so each subsequent transaction is writing more
> -metadata into the log.
> -
> -Another feature of the XFS transaction subsystem is that most transactions are
> -asynchronous. That is, they don't commit to disk until either a log buffer is
> -filled (a log buffer can hold multiple transactions) or a synchronous operation
> -forces the log buffers holding the transactions to disk. This means that XFS is
> -doing aggregation of transactions in memory - batching them, if you like - to
> -minimise the impact of the log IO on transaction throughput.
> -
> -The limitation on asynchronous transaction throughput is the number and size of
> -log buffers made available by the log manager. By default there are 8 log
> -buffers available and the size of each is 32kB - the size can be increased up
> -to 256kB by use of a mount option.
> -
> -Effectively, this gives us the maximum bound of outstanding metadata changes
> -that can be made to the filesystem at any point in time - if all the log
> -buffers are full and under IO, then no more transactions can be committed until
> -the current batch completes. It is now common for a single current CPU core to
> -be to able to issue enough transactions to keep the log buffers full and under
> -IO permanently. Hence the XFS journalling subsystem can be considered to be IO
> -bound.
> -
> -== Delayed Logging Concepts
> -
> -The key thing to note about the asynchronous logging combined with the
> -relogging technique XFS uses is that we can be relogging changed objects
> -multiple times before they are committed to disk in the log buffers. If we
> -return to the previous relogging example, it is entirely possible that
> -transactions A through D are committed to disk in the same log buffer.
> -
> -That is, a single log buffer may contain multiple copies of the same object,
> -but only one of those copies needs to be there - the last one "D", as it
> -contains all the changes from the previous changes. In other words, we have one
> -necessary copy in the log buffer, and three stale copies that are simply
> -wasting space. When we are doing repeated operations on the same set of
> -objects, these "stale objects" can be over 90% of the space used in the log
> -buffers. It is clear that reducing the number of stale objects written to the
> -log would greatly reduce the amount of metadata we write to the log, and this
> -is the fundamental goal of delayed logging.
> -
> -From a conceptual point of view, XFS is already doing relogging in memory (where
> -memory == log buffer), only it is doing it extremely inefficiently. It is using
> -logical to physical formatting to do the relogging because there is no
> -infrastructure to keep track of logical changes in memory prior to physically
> -formatting the changes in a transaction to the log buffer. Hence we cannot avoid
> -accumulating stale objects in the log buffers.
> -
> -Delayed logging is the name we've given to keeping and tracking transactional
> -changes to objects in memory outside the log buffer infrastructure. Because of
> -the relogging concept fundamental to the XFS journalling subsystem, this is
> -actually relatively easy to do - all the changes to logged items are already
> -tracked in the current infrastructure. The big problem is how to accumulate
> -them and get them to the log in a consistent, recoverable manner.
> -Describing the problems and how they have been solved is the focus of this
> -document.
> -
> -One of the key changes that delayed logging makes to the operation of the
> -journalling subsystem is that it disassociates the amount of outstanding
> -metadata changes from the size and number of log buffers available. In other
> -words, instead of there only being a maximum of 2MB of transaction changes not
> -written to the log at any point in time, there may be a much greater amount
> -being accumulated in memory. Hence the potential for loss of metadata on a
> -crash is much greater than for the existing logging mechanism.
> -
> -It should be noted that this does not change the guarantee that log recovery
> -will result in a consistent filesystem. What it does mean is that as far as the
> -recovered filesystem is concerned, there may be many thousands of transactions
> -that simply did not occur as a result of the crash. This makes it even more
> -important that applications that care about their data use fsync() where they
> -need to ensure application level data integrity is maintained.
> -
> -It should be noted that delayed logging is not an innovative new concept that
> -warrants rigorous proofs to determine whether it is correct or not. The method
> -of accumulating changes in memory for some period before writing them to the
> -log is used effectively in many filesystems including ext3 and ext4. Hence
> -no time is spent in this document trying to convince the reader that the
> -concept is sound. Instead it is simply considered a "solved problem" and as
> -such implementing it in XFS is purely an exercise in software engineering.
> -
> -The fundamental requirements for delayed logging in XFS are simple:
> -
> -	. Reduce the amount of metadata written to the log by at least
> -	   an order of magnitude.
> -	. Supply sufficient statistics to validate Requirement #1.
> -	. Supply sufficient new tracing infrastructure to be able to debug
> -	   problems with the new code.
> -	. No on-disk format change (metadata or log format).
> -	. Enable and disable with a mount option.
> -	. No performance regressions for synchronous transaction workloads.
> -
> -== Delayed Logging Design
> -
> -=== Storing Changes
> -
> -The problem with accumulating changes at a logical level (i.e. just using the
> -existing log item dirty region tracking) is that when it comes to writing the
> -changes to the log buffers, we need to ensure that the object we are formatting
> -is not changing while we do this. This requires locking the object to prevent
> -concurrent modification. Hence flushing the logical changes to the log would
> -require us to lock every object, format them, and then unlock them again.
> -
> -This introduces lots of scope for deadlocks with transactions that are already
> -running. For example, a transaction has object A locked and modified, but needs
> -the delayed logging tracking lock to commit the transaction. However, the
> -flushing thread has the delayed logging tracking lock already held, and is
> -trying to get the lock on object A to flush it to the log buffer. This appears
> -to be an unsolvable deadlock condition, and it was solving this problem that
> -was the barrier to implementing delayed logging for so long.
> -
> -The solution is relatively simple - it just took a long time to recognise it.
> -Put simply, the current logging code formats the changes to each item into a
> -vector array that points to the changed regions in the item. The log write code
> -simply copies the memory these vectors point to into the log buffer during
> -transaction commit while the item is locked in the transaction. Instead of
> -using the log buffer as the destination of the formatting code, we can use an
> -allocated memory buffer big enough to fit the formatted vector.
> -
> -If we then copy the vector into the memory buffer and rewrite the vector to
> -point to the memory buffer rather than the object itself, we now have a copy of
> -the changes in a format that is compatible with the log buffer writing code and
> -that does not require us to lock the item to access it. This formatting and
> -rewriting can all be done while the object is locked during transaction commit,
> -resulting in a vector that is transactionally consistent and can be accessed
> -without needing to lock the owning item.
> -
> -Hence we avoid the need to lock items when we need to flush outstanding
> -asynchronous transactions to the log. The differences between the existing
> -formatting method and the delayed logging formatting can be seen in the
> -diagram below.
> -
> -Current format log vector:
> -....
> -Object    +---------------------------------------------+
> -Vector 1      +----+
> -Vector 2                    +----+
> -Vector 3                                   +----------+
> -....
> -
> -After formatting:
> -
> -....
> -Log Buffer    +-V1-+-V2-+----V3----+
> -....
> -
> -Delayed logging vector:
> -
> -....
> -Object    +---------------------------------------------+
> -Vector 1      +----+
> -Vector 2                    +----+
> -Vector 3                                   +----------+
> -....
> -
> -After formatting:
> -
> -....
> -Memory Buffer +-V1-+-V2-+----V3----+
> -Vector 1      +----+
> -Vector 2           +----+
> -Vector 3                +----------+
> -....
> -
> -The memory buffer and associated vector need to be passed as a single object,
> -but still need to be associated with the parent object so if the object is
> -relogged we can replace the current memory buffer with a new memory buffer that
> -contains the latest changes.
> -
> -The reason for keeping the vector around after we've formatted the memory
> -buffer is to support splitting vectors across log buffer boundaries correctly.
> -If we don't keep the vector around, we do not know where the region boundaries
> -are in the item, so we'd need a new encapsulation method for regions in the log
> -buffer writing (i.e. double encapsulation). This would be an on-disk format
> -change and as such is not desirable.  It also means we'd have to write the log
> -region headers in the formatting stage, which is problematic as there is per
> -region state that needs to be placed into the headers during the log write.
> -
> -Hence we need to keep the vector, but by attaching the memory buffer to it and
> -rewriting the vector addresses to point at the memory buffer we end up with a
> -self-describing object that can be passed to the log buffer write code to be
> -handled in exactly the same manner as the existing log vectors are handled.
> -Hence we avoid needing a new on-disk format to handle items that have been
> -relogged in memory.
> -
> -
> -=== Tracking Changes
> -
> -Now that we can record transactional changes in memory in a form that allows
> -them to be used without limitations, we need to be able to track and accumulate
> -them so that they can be written to the log at some later point in time.  The
> -log item is the natural place to store this vector and buffer, and it also makes
> -sense for it to be the object used to track committed objects as it will always
> -exist once the object has been included in a transaction.
> -
> -The log item is already used to track the log items that have been written to
> -the log but not yet written to disk. Such log items are considered "active"
> -and as such are stored in the Active Item List (AIL) which is an LSN-ordered
> -doubly linked list. Items are inserted into this list during log buffer IO
> -completion, after which they are unpinned and can be written to disk. An object
> -that is in the AIL can be relogged, which causes the object to be pinned again
> -and then moved forward in the AIL when the log buffer IO completes for that
> -transaction.
> -
> -Essentially, this shows that an item that is in the AIL can still be modified
> -and relogged, so any tracking must be separate to the AIL infrastructure. As
> -such, we cannot reuse the AIL list pointers for tracking committed items, nor
> -can we store state in any field that is protected by the AIL lock. Hence the
> -committed item tracking needs its own locks, lists and state fields in the log
> -item.
> -
> -Similar to the AIL, tracking of committed items is done through a new list
> -called the Committed Item List (CIL).  The list tracks log items that have been
> -committed and have formatted memory buffers attached to them. It tracks objects
> -in transaction commit order, so when an object is relogged it is removed from
> -its place in the list and re-inserted at the tail. This is entirely arbitrary
> -and done to make it easy for debugging - the last items in the list are the
> -ones that are most recently modified. Ordering of the CIL is not necessary for
> -transactional integrity (as discussed in the next section) so the ordering is
> -done for convenience/sanity of the developers.
> -
> -
> -=== Checkpoints
> -
> -When we have a log synchronisation event, commonly known as a "log force",
> -all the items in the CIL must be written into the log via the log buffers.
> -We need to write these items in the order that they exist in the CIL, and they
> -need to be written as an atomic transaction. The need for all the objects to be
> -written as an atomic transaction comes from the requirements of relogging and
> -log replay - all the changes in all the objects in a given transaction must
> -either be completely replayed during log recovery, or not replayed at all. If
> -a transaction is not replayed because it is not complete in the log, then
> -no later transactions should be replayed, either.
> -
> -To fulfill this requirement, we need to write the entire CIL in a single log
> -transaction. Fortunately, the XFS log code has no fixed limit on the size of a
> -transaction, nor does the log replay code. The only fundamental limit is that
> -the transaction cannot be larger than just under half the size of the log.  The
> -reason for this limit is that to find the head and tail of the log, there must
> -be at least one complete transaction in the log at any given time. If a
> -transaction is larger than half the log, then there is the possibility that a
> -crash during the write of such a transaction could partially overwrite the
> -only complete previous transaction in the log. This will result in a recovery
> -failure and an inconsistent filesystem and hence we must enforce the maximum
> -size of a checkpoint to be slightly less than half the log.
> -
> -Apart from this size requirement, a checkpoint transaction looks no different
> -to any other transaction - it contains a transaction header, a series of
> -formatted log items and a commit record at the tail. From a recovery
> -perspective, the checkpoint transaction is also no different - just a lot
> -bigger with a lot more items in it. The worst case effect of this is that we
> -might need to tune the recovery transaction object hash size.
> -
> -Because the checkpoint is just another transaction and all the changes to log
> -items are stored as log vectors, we can use the existing log buffer writing
> -code to write the changes into the log. To do this efficiently, we need to
> -minimise the time we hold the CIL locked while writing the checkpoint
> -transaction. The current log write code enables us to do this easily with the
> -way it separates the writing of the transaction contents (the log vectors) from
> -the transaction commit record, but tracking this requires us to have a
> -per-checkpoint context that travels through the log write process through to
> -checkpoint completion.
> -
> -Hence a checkpoint has a context that tracks the state of the current
> -checkpoint from initiation to checkpoint completion. A new context is initiated
> -at the same time a checkpoint transaction is started. That is, when we remove
> -all the current items from the CIL during a checkpoint operation, we move all
> -those changes into the current checkpoint context. We then initialise a new
> -context and attach that to the CIL for aggregation of new transactions.
> -
> -This allows us to unlock the CIL immediately after transfer of all the
> -committed items and effectively allow new transactions to be issued while we
> -are formatting the checkpoint into the log. It also allows concurrent
> -checkpoints to be written into the log buffers in the case of log force heavy
> -workloads, just like the existing transaction commit code does. This, however,
> -requires that we strictly order the commit records in the log so that
> -checkpoint sequence order is maintained during log replay.
> -
> -To ensure that we can be writing an item into a checkpoint transaction at
> -the same time another transaction modifies the item and inserts the log item
> -into the new CIL, the checkpoint transaction commit code cannot use log items
> -to store the list of log vectors that need to be written into the transaction.
> -Hence log vectors need to be able to be chained together to allow them to be
> -detached from the log items. That is, when the CIL is flushed the memory
> -buffer and log vector attached to each log item needs to be attached to the
> -checkpoint context so that the log item can be released. In diagrammatic form,
> -the CIL would look like this before the flush:
> -
> -----
> -	CIL Head
> -	   |
> -	   V
> -	Log Item <-> log vector 1	-> memory buffer
> -	   |				-> vector array
> -	   V
> -	Log Item <-> log vector 2	-> memory buffer
> -	   |				-> vector array
> -	   V
> -	......
> -	   |
> -	   V
> -	Log Item <-> log vector N-1	-> memory buffer
> -	   |				-> vector array
> -	   V
> -	Log Item <-> log vector N	-> memory buffer
> -					-> vector array
> -----
> -
> -And after the flush the CIL head is empty, and the checkpoint context log
> -vector list would look like:
> -
> -----
> -	Checkpoint Context
> -	   |
> -	   V
> -	log vector 1	-> memory buffer
> -	   |		-> vector array
> -	   |		-> Log Item
> -	   V
> -	log vector 2	-> memory buffer
> -	   |		-> vector array
> -	   |		-> Log Item
> -	   V
> -	......
> -	   |
> -	   V
> -	log vector N-1	-> memory buffer
> -	   |		-> vector array
> -	   |		-> Log Item
> -	   V
> -	log vector N	-> memory buffer
> -			-> vector array
> -			-> Log Item
> -----
> -
> -Once this transfer is done, the CIL can be unlocked and new transactions can
> -start, while the checkpoint flush code works over the log vector chain to
> -commit the checkpoint.
> -
> -Once the checkpoint is written into the log buffers, the checkpoint context is
> -attached to the log buffer that the commit record was written to along with a
> -completion callback. Log IO completion will call that callback, which can then
> -run transaction committed processing for the log items (i.e. insert into AIL
> -and unpin) in the log vector chain and then free the log vector chain and
> -checkpoint context.
> -
> -Discussion Point: I am uncertain as to whether the log item is the most
> -efficient way to track vectors, even though it seems like the natural way to do
> -it. The fact that we walk the log items (in the CIL) just to chain the log
> -vectors and break the link between the log item and the log vector means that
> -we take a cache line hit for the log item list modification, then another for
> -the log vector chaining. If we track by the log vectors, then we only need to
> -break the link between the log item and the log vector, which means we should
> -dirty only the log item cachelines. Normally I wouldn't be concerned about one
> -vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
> -vectors in one checkpoint transaction. I'd guess this is a "measure and
> -compare" situation that can be done after a working and reviewed implementation
> -is in the dev tree....
> -
> -=== Checkpoint Sequencing
> -
> -One of the key aspects of the XFS transaction subsystem is that it tags
> -committed transactions with the log sequence number of the transaction commit.
> -This allows transactions to be issued asynchronously even though there may be
> -future operations that cannot be completed until that transaction is fully
> -committed to the log. In the rare case that a dependent operation occurs (e.g.
> -re-using a freed metadata extent for a data extent), a special, optimised log
> -force can be issued to force the dependent transaction to disk immediately.
> -
> -To do this, transactions need to record the LSN of the commit record of the
> -transaction. This LSN comes directly from the log buffer the transaction is
> -written into. While this works just fine for the existing transaction
> -mechanism, it does not work for delayed logging because transactions are not
> -written directly into the log buffers. Hence some other method of sequencing
> -transactions is required.
> -
> -As discussed in the checkpoint section, delayed logging uses per-checkpoint
> -contexts, and as such it is simple to assign a sequence number to each
> -checkpoint. Because the switching of checkpoint contexts must be done
> -atomically, it is simple to ensure that each new context has a monotonically
> -increasing sequence number assigned to it without the need for an external
> -atomic counter - we can just take the current context sequence number and add
> -one to it for the new context.
> -
> -Then, instead of assigning a log buffer LSN to the transaction commit LSN
> -during the commit, we can assign the current checkpoint sequence. This allows
> -operations that track transactions that have not yet completed to know what
> -checkpoint sequence needs to be committed before they can continue. As a
> -result, the code that forces the log to a specific LSN now needs to ensure that
> -the log forces to a specific checkpoint.
> -
> -To ensure that we can do this, we need to track all the checkpoint contexts
> -that are currently committing to the log. When we flush a checkpoint, the
> -context gets added to a "committing" list which can be searched. When a
> -checkpoint commit completes, it is removed from the committing list. Because
> -the checkpoint context records the LSN of the commit record for the checkpoint,
> -we can also wait on the log buffer that contains the commit record, thereby
> -using the existing log force mechanisms to execute synchronous forces.
> -
> -It should be noted that the synchronous forces may need to be extended with
> -mitigation algorithms similar to the current log buffer code to allow
> -aggregation of multiple synchronous transactions if there are already
> -synchronous transactions being flushed. Investigation of the performance of the
> -current design is needed before making any decisions here.
> -
> -The main concern with log forces is to ensure that all the previous checkpoints
> -are also committed to disk before the one we need to wait for. Therefore we
> -need to check that all the prior contexts in the committing list are also
> -complete before waiting on the one we need to complete. We do this
> -synchronisation in the log force code so that we don't need to wait anywhere
> -else for such serialisation - it only matters when we do a log force.
> -
> -The only remaining complexity is that a log force now also has to handle the
> -case where the forcing sequence number is the same as the current context. That
> -is, we need to flush the CIL and potentially wait for it to complete. This is a
> -simple addition to the existing log forcing code to check the sequence numbers
> -and push if required. Indeed, placing the current sequence checkpoint flush in
> -the log force code enables the current mechanism for issuing synchronous
> -transactions to remain untouched (i.e. commit an asynchronous transaction, then
> -force the log at the LSN of that transaction) and so the higher level code
> -behaves the same regardless of whether delayed logging is being used or not.
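> -
> -In outline, and purely as an illustration of the logic rather than the real
> -implementation, a force to a checkpoint sequence looks something like:
> -
> -----
> -#include <stddef.h>
> -
> -struct cil_ctx { unsigned long seq; int committed; struct cil_ctx *next; };
> -
> -static struct cil_ctx *committing_list;	/* checkpoints being written    */
> -static unsigned long   current_seq = 1;	/* sequence of the open context */
> -
> -/* Stubs standing in for the real CIL push and commit-record wait code. */
> -static void push_cil(void) { }
> -static void wait_for_commit(struct cil_ctx *c) { c->committed = 1; }
> -
> -/*
> - * Force the log up to checkpoint sequence 'seq': flush the CIL if that
> - * sequence is still the open context, then wait until it and every prior
> - * checkpoint on the committing list has completed its commit record.
> - */
> -static void force_to_seq(unsigned long seq)
> -{
> -	struct cil_ctx *c;
> -
> -	if (seq == current_seq)
> -		push_cil();
> -
> -	for (c = committing_list; c != NULL; c = c->next)
> -		if (c->seq <= seq && !c->committed)
> -			wait_for_commit(c);
> -}
> -----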
> -
> -=== Checkpoint Log Space Accounting
> -
> -The big issue for a checkpoint transaction is the log space reservation for the
> -transaction. We don't know how big a checkpoint transaction is going to be
> -ahead of time, nor how many log buffers it will take to write out, nor the
> -number of split log vector regions that are going to be used. We can track the
> -amount of log space required as we add items to the commit item list, but we
> -still need to reserve the space in the log for the checkpoint.
> -
> -A typical transaction reserves enough space in the log for the worst case space
> -usage of the transaction. The reservation accounts for log record headers,
> -transaction and region headers, headers for split regions, buffer tail padding,
> -etc. as well as the actual space for all the changed metadata in the
> -transaction. While some of this is fixed overhead, much of it is dependent on
> -the size of the transaction and the number of regions being logged (the number
> -of log vectors in the transaction).
> -
> -An example of the differences would be logging directory changes versus logging
> -inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
> -there are lots of transactions that only contain an inode core and an inode log
> -format structure. That is, two vectors totaling roughly 150 bytes. If we modify
> -10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
> -vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
> -comparison, if we are logging full directory buffers, they are typically 4KB
> -each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a
> -buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
> -space.  From this, it should be obvious that a static log space reservation is
> -not particularly flexible and that it is difficult to select the "optimal value" for
> -all workloads.
> -
> -Further, if we are going to use a static reservation, which bit of the entire
> -reservation does it cover? We account for space used by the transaction
> -reservation by tracking the space currently used by the object in the CIL and
> -then calculating the increase or decrease in space used as the object is
> -relogged. This allows for a checkpoint reservation to only have to account for
> -log buffer metadata used such as log header records.
> -
> -However, even using a static reservation for just the log metadata is
> -problematic. Typically log record headers use at least 16KB of log space per
> -1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
> -large enough to handle arbitrary sized checkpoint transactions. This
> -reservation needs to be made before the checkpoint is started, and we need to
> -be able to reserve the space without sleeping.  For an 8MB checkpoint, we need a
> -reservation of around 150KB, which is a non-trivial amount of space.
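> -
> -Roughly, that reservation estimate works out as follows:
> -
> -....
> -	1MB of checkpoint data / 32KB per log record   = 32 record headers
> -	32 record headers x 512 bytes                  = 16KB per 1MB logged
> -	8MB checkpoint x 16KB per MB                   = 128KB of record headers
> -	+ transaction/region headers, padding, etc.   ~= 150KB reservation
> -....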
> -
> -A static reservation needs to manipulate the log grant counters - we can take a
> -permanent reservation on the space, but we still need to make sure we refresh
> -the write reservation (the actual space available to the transaction) after
> -every checkpoint transaction completion. Unfortunately, if this space is not
> -available when required, then the regrant code will sleep waiting for it.
> -
> -The problem with this is that it can lead to deadlocks as we may need to commit
> -checkpoints to be able to free up log space (refer back to the description of
> -rolling transactions for an example of this).  Hence we *must* always have
> -space available in the log if we are to use static reservations, and that is
> -very difficult and complex to arrange. It is possible to do, but there is a
> -simpler way.
> -
> -The simpler way of doing this is tracking the entire log space used by the
> -items in the CIL and using this to dynamically calculate the amount of log
> -space required by the log metadata. If this log metadata space changes as a
> -result of a transaction commit inserting a new memory buffer into the CIL, then
> -the difference in space required is removed from the transaction that causes
> -the change. Transactions at this level will *always* have enough space
> -available in their reservation for this as they have already reserved the
> -maximal amount of log metadata space they require, and such a delta reservation
> -will always be less than or equal to the maximal amount in the reservation.
> -
> -Hence we can grow the checkpoint transaction reservation dynamically as items
> -are added to the CIL and avoid the need for reserving and regranting log space
> -up front. This avoids deadlocks and removes a blocking point from the
> -checkpoint flush code.
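> -
> -A sketch of that accounting with simplified stand-in types (growth only; the
> -real code handles more cases and overheads):
> -
> -----
> -#include <stddef.h>
> -
> -struct log_item	{ size_t cil_space; };		/* bytes accounted so far     */
> -struct cil_ctx	{ size_t space_used; };		/* running checkpoint total   */
> -struct xact	{ size_t reservation; };	/* the committing transaction */
> -
> -/*
> - * Charge only the growth of this item's formatted size to the committing
> - * transaction; it always has at least this much left because it reserved
> - * for the worst case up front.
> - */
> -static void account_item(struct cil_ctx *ctx, struct xact *tp,
> -			 struct log_item *lip, size_t new_size)
> -{
> -	size_t delta = new_size > lip->cil_space ?
> -		       new_size - lip->cil_space : 0;
> -
> -	tp->reservation -= delta;
> -	ctx->space_used += delta;
> -	lip->cil_space = new_size;
> -}
> -----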
> -
> -As mentioned earlier, transactions can't grow to more than half the size of the
> -log. Hence as part of the reservation growing, we need to also check the size
> -of the reservation against the maximum allowed transaction size. If we reach
> -the maximum threshold, we need to push the CIL to the log. This is effectively
> -a "background flush" and is done on demand. This is identical to
> -a CIL push triggered by a log force, only that there is no waiting for the
> -checkpoint commit to complete. This background push is checked and executed by
> -transaction commit code.
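> -
> -For illustration, the threshold check at transaction commit might look like
> -this (the exact fraction is an implementation choice; the hard limit is half
> -the log):
> -
> -----
> -#include <stddef.h>
> -
> -#define LOG_SIZE		(128u * 1024 * 1024)
> -#define CIL_SPACE_LIMIT		(LOG_SIZE / 8)	/* well under half the log */
> -
> -static void push_cil_async(void) { }		/* stub: flush, don't wait */
> -
> -static void maybe_background_push(size_t space_used)
> -{
> -	if (space_used >= CIL_SPACE_LIMIT)
> -		push_cil_async();
> -}
> -----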
> -
> -If the transaction subsystem goes idle while we still have items in the CIL,
> -they will be flushed by the periodic log force issued by the xfssyncd. This log
> -force will push the CIL to disk, and if the transaction subsystem stays idle,
> -allow the idle log to be covered (effectively marked clean) in exactly the same
> -manner that is done for the existing logging method. A discussion point is
> -whether this log force needs to be done more frequently than the current rate
> -which is once every 30s.
> -
> -
> -=== Log Item Pinning
> -
> -Currently log items are pinned during transaction commit while the items are
> -still locked. This happens just after the items are formatted, though it could
> -be done any time before the items are unlocked. The result of this mechanism is
> -that items get pinned once for every transaction that is committed to the log
> -buffers. Hence items that are relogged in the log buffers will have a pin count
> -for every outstanding transaction they were dirtied in. When each of these
> -transactions is completed, they will unpin the item once. As a result, the item
> -only becomes unpinned when all the transactions complete and there are no
> -pending transactions. Thus the pinning and unpinning of a log item is symmetric
> -as there is a 1:1 relationship with transaction commit and log item completion.
> -
> -For delayed logging, however, we have an asymmetric transaction commit to
> -completion relationship. Every time an object is relogged in the CIL it goes
> -through the commit process without a corresponding completion being registered.
> -That is, we now have a many-to-one relationship between transaction commit and
> -log item completion. The result of this is that pinning and unpinning of the
> -log items becomes unbalanced if we retain the "pin on transaction commit, unpin
> -on transaction completion" model.
> -
> -To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
> -insertion into the CIL, unpin on checkpoint completion". In other words, the
> -pinning and unpinning becomes symmetric around a checkpoint context. We have to
> -pin the object the first time it is inserted into the CIL - if it is already in
> -the CIL during a transaction commit, then we do not pin it again. Because there
> -can be multiple outstanding checkpoint contexts, we can still see elevated pin
> -counts, but as each checkpoint completes the pin count will retain the correct
> -value according to its context.
> -
> -Just to make matters slightly more complex, this checkpoint level context
> -for the pin count means that the pinning of an item must take place under the
> -CIL commit/flush lock. If we pin the object outside this lock, we cannot
> -guarantee which context the pin count is associated with. This is because of
> -the fact pinning the item is dependent on whether the item is present in the
> -current CIL or not. If we don't lock the CIL first before we check and pin the
> -object, we have a race with CIL being flushed between the check and the pin
> -(or not pinning, as the case may be). Hence we must hold the CIL flush/commit
> -lock to guarantee that we pin the items correctly.
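> -
> -A minimal sketch of that rule, assuming a simple mutex stands in for the CIL
> -flush/commit lock:
> -
> -----
> -#include <pthread.h>
> -
> -struct log_item { int in_current_cil; int pin_count; };
> -
> -static pthread_mutex_t cil_flush_lock = PTHREAD_MUTEX_INITIALIZER;
> -
> -/*
> - * Pin an item only on its first insertion into the current CIL context,
> - * and do it under the CIL flush/commit lock so a concurrent flush cannot
> - * slip in between the membership check and the pin.
> - */
> -static void cil_pin_item(struct log_item *lip)
> -{
> -	pthread_mutex_lock(&cil_flush_lock);
> -	if (!lip->in_current_cil) {
> -		lip->pin_count++;	/* dropped at checkpoint completion */
> -		lip->in_current_cil = 1;
> -	}
> -	pthread_mutex_unlock(&cil_flush_lock);
> -}
> -----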
> -
> -=== Concurrent Scalability
> -
> -A fundamental requirement for the CIL is that accesses through transaction
> -commits must scale to many concurrent commits. The current transaction commit
> -code does not break down even when there are transactions coming from 2048
> -processors at once. The current transaction code does not go any faster than if
> -there was only one CPU using it, but it does not slow down either.
> -
> -As a result, the delayed logging transaction commit code needs to be designed
> -for concurrency from the ground up. It is obvious that there are serialisation
> -points in the design - the three important ones are:
> -
> -	. Locking out new transaction commits while flushing the CIL
> -	. Adding items to the CIL and updating item space accounting
> -	. Checkpoint commit ordering
> -
> -Looking at the transaction commit and CIL flushing interactions, it is clear
> -that we have a many-to-one interaction here. That is, the only restriction on
> -the number of concurrent transactions that can be trying to commit at once is
> -the amount of space available in the log for their reservations. The practical
> -limit here is in the order of several hundred concurrent transactions for a
> -128MB log, which means that it is generally one per CPU in a machine.
> -
> -The amount of time a transaction commit needs to hold out a flush is a
> -relatively long period of time - the pinning of log items needs to be done
> -while we are holding out a CIL flush, so at the moment that means it is held
> -across the formatting of the objects into memory buffers (i.e. while memcpy()s
> -are in progress). Ultimately a two pass algorithm where the formatting is done
> -separately to the pinning of objects could be used to reduce the hold time of
> -the transaction commit side.
> -
> -Because of the number of potential transaction commit side holders, the lock
> -really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
> -want every other CPU in the machine spinning on the CIL lock. Given that
> -flushing the CIL could involve walking a list of tens of thousands of log
> -items, it will get held for a significant time and so spin contention is a
> -significant concern. Preventing lots of CPUs spinning doing nothing is the
> -main reason for choosing a sleeping lock even though nothing in either the
> -transaction commit or CIL flush side sleeps with the lock held.
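> -
> -A userspace read-write lock models the intended exclusion (this is a sketch
> -of the locking shape only, not the real implementation):
> -
> -----
> -#include <pthread.h>
> -
> -static pthread_rwlock_t cil_ctx_lock = PTHREAD_RWLOCK_INITIALIZER;
> -
> -/* Many transaction commits can proceed concurrently... */
> -static void commit_to_cil(void)
> -{
> -	pthread_rwlock_rdlock(&cil_ctx_lock);
> -	/* pin items, format into memory buffers, insert into the CIL */
> -	pthread_rwlock_unlock(&cil_ctx_lock);
> -}
> -
> -/* ...but a CIL flush holds out every committer while contexts switch. */
> -static void flush_cil(void)
> -{
> -	pthread_rwlock_wrlock(&cil_ctx_lock);
> -	/* detach vectors, swap in a new checkpoint context */
> -	pthread_rwlock_unlock(&cil_ctx_lock);
> -}
> -----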
> -
> -It should also be noted that CIL flushing is also a relatively rare operation
> -compared to transaction commit for asynchronous transaction workloads - only
> -time will tell if using a read-write semaphore for exclusion will limit
> -transaction commit concurrency due to cache line bouncing of the lock on the
> -read side.
> -
> -The second serialisation point is on the transaction commit side where items
> -are inserted into the CIL. Because transactions can enter this code
> -concurrently, the CIL needs to be protected separately from the above
> -commit/flush exclusion. It also needs to be an exclusive lock but it is only
> -held for a very short time and so a spin lock is appropriate here. It is
> -possible that this lock will become a contention point, but given the short
> -hold time once per transaction I think that contention is unlikely.
> -
> -The final serialisation point is the checkpoint commit record ordering code
> -that is run as part of the checkpoint commit and log force sequencing. The code
> -path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
> -an ordering loop after writing all the log vectors into the log buffers but
> -before writing the commit record. This loop walks the list of committing
> -checkpoints and needs to block waiting for checkpoints to complete their commit
> -record write. As a result it needs a lock and a wait variable. Log force
> -sequencing also requires the same lock, list walk, and blocking mechanism to
> -ensure completion of checkpoints.
> -
> -These two sequencing operations can use the same mechanism even though the
> -events they are waiting for are different. The checkpoint commit record
> -sequencing needs to wait until checkpoint contexts contain a commit LSN
> -(obtained through completion of a commit record write) while log force
> -sequencing needs to wait until previous checkpoint contexts are removed from
> -the committing list (i.e. they've completed). A simple wait variable and
> -broadcast wakeups (thundering herds) have been used to implement these two
> -serialisation queues. They use the same lock as the CIL, too. If we see too
> -much contention on the CIL lock, or too many context switches as a result of
> -the broadcast wakeups these operations can be put under a new spinlock and
> -given separate wait lists to reduce lock contention and the number of processes
> -woken by the wrong event.
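> -
> -In rough form, the commit record ordering loop with its broadcast wakeup
> -might look like this (illustrative only):
> -
> -----
> -#include <pthread.h>
> -
> -struct cil_ctx {
> -	unsigned long	seq;
> -	unsigned long	commit_lsn;	/* 0 until the record is written */
> -	struct cil_ctx	*next;
> -};
> -
> -static struct cil_ctx	*committing_list;
> -static pthread_mutex_t	cil_lock = PTHREAD_MUTEX_INITIALIZER;
> -static pthread_cond_t	commit_wait = PTHREAD_COND_INITIALIZER;
> -
> -/* Wait until every earlier checkpoint has written its commit record, so
> - * the records land in the log in checkpoint sequence order. */
> -static void wait_for_prior_commits(unsigned long my_seq)
> -{
> -	struct cil_ctx *c;
> -
> -	pthread_mutex_lock(&cil_lock);
> -restart:
> -	for (c = committing_list; c != NULL; c = c->next) {
> -		if (c->seq < my_seq && c->commit_lsn == 0) {
> -			pthread_cond_wait(&commit_wait, &cil_lock);
> -			goto restart;	/* the list may have changed */
> -		}
> -	}
> -	pthread_mutex_unlock(&cil_lock);
> -}
> -----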
> -
> -
> -=== Lifecycle Changes
> -
> -The existing log item life cycle is as follows:
> -
> -----
> -	1. Transaction allocate
> -	2. Transaction reserve
> -	3. Lock item
> -	4. Join item to transaction
> -		If not already attached,
> -			Allocate log item
> -			Attach log item to owner item
> -		Attach log item to transaction
> -	5. Modify item
> -		Record modifications in log item
> -	6. Transaction commit
> -		Pin item in memory
> -		Format item into log buffer
> -		Write commit LSN into transaction
> -		Unlock item
> -		Attach transaction to log buffer
> -
> -	<log buffer IO dispatched>
> -	<log buffer IO completes>
> -
> -	7. Transaction completion
> -		Mark log item committed
> -		Insert log item into AIL
> -			Write commit LSN into log item
> -		Unpin log item
> -	8. AIL traversal
> -		Lock item
> -		Mark log item clean
> -		Flush item to disk
> -
> -	<item IO completion>
> -
> -	9. Log item removed from AIL
> -		Moves log tail
> -		Item unlocked
> -----
> -
> -Essentially, steps 1-6 operate independently from step 7, which is also
> -independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
> -at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
> -at the same time. If the log item is in the AIL or between steps 6 and 7
> -and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
> -are entered and completed is the object considered clean.
> -
> -With delayed logging, there are new steps inserted into the life cycle:
> -
> -----
> -	1. Transaction allocate
> -	2. Transaction reserve
> -	3. Lock item
> -	4. Join item to transaction
> -		If not already attached,
> -			Allocate log item
> -			Attach log item to owner item
> -		Attach log item to transaction
> -	5. Modify item
> -		Record modifications in log item
> -	6. Transaction commit
> -		Pin item in memory if not pinned in CIL
> -		Format item into log vector + buffer
> -		Attach log vector and buffer to log item
> -		Insert log item into CIL
> -		Write CIL context sequence into transaction
> -		Unlock item
> -
> -	<next log force>
> -
> -	7. CIL push
> -		lock CIL flush
> -		Chain log vectors and buffers together
> -		Remove items from CIL
> -		unlock CIL flush
> -		write log vectors into log
> -		sequence commit records
> -		attach checkpoint context to log buffer
> -
> -	<log buffer IO dispatched>
> -	<log buffer IO completes>
> -
> -	8. Checkpoint completion
> -		Mark log item committed
> -		Insert item into AIL
> -			Write commit LSN into log item
> -		Unpin log item
> -	9. AIL traversal
> -		Lock item
> -		Mark log item clean
> -		Flush item to disk
> -	<item IO completion>
> -	10. Log item removed from AIL
> -		Moves log tail
> -		Item unlocked
> -----
> -
> -From this, it can be seen that the only life cycle differences between the two
> -logging methods are in the middle of the life cycle - they still have the same
> -beginning and end and execution constraints. The only differences are in the
> -committing of the log items to the log itself and the completion processing.
> -Hence delayed logging should not introduce any constraints on log item
> -behaviour, allocation or freeing that don't already exist.
> -
> -As a result of this zero-impact "insertion" of delayed logging infrastructure
> -and the design of the internal structures to avoid on disk format changes, we
> -can basically switch between delayed logging and the existing mechanism with a
> -mount option. Fundamentally, there is no reason why the log manager would not
> -be able to swap methods automatically and transparently depending on load
> -characteristics, but this should not be necessary if delayed logging works as
> -designed.
> -
> -EOF.
>

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/design/XFS_Filesystem_Structure/delayed_logging.asciidoc b/design/XFS_Filesystem_Structure/delayed_logging.asciidoc
new file mode 100644
index 0000000..e9a336f
--- /dev/null
+++ b/design/XFS_Filesystem_Structure/delayed_logging.asciidoc
@@ -0,0 +1,808 @@ 
+= Delayed Logging
+
+== Introduction to Re-logging in XFS
+
+XFS logging is a combination of logical and physical logging. Some objects,
+such as inodes and dquots, are logged in logical format where the details
+logged are made up of the changes to in-core structures rather than on-disk
+structures. Other objects - typically buffers - have their physical changes
+logged. The reason for these differences is to reduce the amount of log space
+required for objects that are frequently logged. Some parts of inodes are more
+frequently logged than others, and inodes are typically more frequently logged
+than any other object (except maybe the superblock buffer) so keeping the
+amount of metadata logged low is of prime importance.
+
+The reason that this is such a concern is that XFS allows multiple separate
+modifications to a single object to be carried in the log at any given time.
+This allows the log to avoid needing to flush each change to disk before
+recording a new change to the object. XFS does this via a method called
+"re-logging". Conceptually, this is quite simple - all it requires is that any
+new change to the object is recorded with a *new copy* of all the existing
+changes in the new transaction that is written to the log.
+
+That is, if we have a sequence of changes A through to F, and the object was
+written to disk after change D, we would see in the log the following series
+of transactions, their contents and the log sequence number (LSN) of the
+transaction:
+
+....
+	Transaction		Contents	LSN
+	   A			   A		   X
+	   B			  A+B		  X+n
+	   C			 A+B+C		 X+n+m
+	   D			A+B+C+D		X+n+m+o
+	    <object written to disk>
+	   E			   E		   Y (> X+n+m+o)
+	   F			  E+F		  Y+p
+....
+
+In other words, each time an object is relogged, the new transaction contains
+the aggregation of all the previous changes currently held only in the log.
+
+This relogging technique also allows objects to be moved forward in the log so
+that an object being relogged does not prevent the tail of the log from ever
+moving forward.  This can be seen in the table above by the changing
+(increasing) LSN of each subsequent transaction - the LSN is effectively a
+direct encoding of the location in the log of the transaction.
+
+This relogging is also used to implement long-running, multiple-commit
+transactions.  These transactions are known as rolling transactions, and require
+a special log reservation known as a permanent transaction reservation. A
+typical example of a rolling transaction is the removal of extents from an
+inode which can only be done at a rate of two extents per transaction because
+of reservation size limitations. Hence a rolling extent removal transaction
+keeps relogging the inode and btree buffers as they get modified in each
+removal operation. This keeps them moving forward in the log as the operation
+progresses, ensuring that the current operation never gets blocked by itself if the
+log wraps around.
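+
+As a rough illustration of that pattern (using made-up names rather than the
+real XFS transaction API), a rolling extent removal boils down to a loop that
+logs the items, commits, and immediately starts the next transaction on the
+same permanent reservation:
+
+----
+/* Illustrative sketch only - simplified stand-in types, not kernel code. */
+struct xact      { int seq; };		/* a transaction in the rolling chain */
+struct inode_rec { int extents; };	/* the inode being truncated          */
+
+static void log_and_roll(struct xact *tp, struct inode_rec *ip)
+{
+	/* relog the inode (and btree buffers) so they keep moving forward
+	 * in the log, then commit and start the next transaction on the
+	 * same permanent reservation */
+	tp->seq++;
+}
+
+static void remove_all_extents(struct inode_rec *ip)
+{
+	struct xact tp = { .seq = 0 };
+
+	while (ip->extents > 0) {
+		ip->extents -= (ip->extents >= 2) ? 2 : 1;  /* max two per txn */
+		log_and_roll(&tp, ip);
+	}
+}
+----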
+
+Hence it can be seen that the relogging operation is fundamental to the correct
+working of the XFS journalling subsystem. From the above description, most
+people should be able to see why XFS metadata operations write so much to
+the log - repeated operations to the same objects write the same changes to
+the log over and over again. Worse is the fact that objects tend to get
+dirtier as they get relogged, so each subsequent transaction is writing more
+metadata into the log.
+
+Another feature of the XFS transaction subsystem is that most transactions are
+asynchronous. That is, they don't commit to disk until either a log buffer is
+filled (a log buffer can hold multiple transactions) or a synchronous operation
+forces the log buffers holding the transactions to disk. This means that XFS is
+doing aggregation of transactions in memory - batching them, if you like - to
+minimise the impact of the log IO on transaction throughput.
+
+The limitation on asynchronous transaction throughput is the number and size of
+log buffers made available by the log manager. By default there are 8 log
+buffers available and the size of each is 32kB - the size can be increased up
+to 256kB by use of a mount option.
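+
+For example, assuming the usual XFS mount options (check the documentation for
+the kernel in use), the log buffer count and size are tuned with something
+like:
+
+----
+# mount -o logbufs=8,logbsize=256k /dev/sdb1 /mnt/scratch
+----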
+
+Effectively, this gives us the maximum bound of outstanding metadata changes
+that can be made to the filesystem at any point in time - if all the log
+buffers are full and under IO, then no more transactions can be committed until
+the current batch completes. It is now common for a single current CPU core to
+be able to issue enough transactions to keep the log buffers full and under
+IO permanently. Hence the XFS journalling subsystem can be considered to be IO
+bound.
+
+== Delayed Logging Concepts
+
+The key thing to note about the asynchronous logging combined with the
+relogging technique XFS uses is that we can be relogging changed objects
+multiple times before they are committed to disk in the log buffers. If we
+return to the previous relogging example, it is entirely possible that
+transactions A through D are committed to disk in the same log buffer.
+
+That is, a single log buffer may contain multiple copies of the same object,
+but only one of those copies needs to be there - the last one "D", as it
+contains all the previous changes. In other words, we have one
+necessary copy in the log buffer, and three stale copies that are simply
+wasting space. When we are doing repeated operations on the same set of
+objects, these "stale objects" can be over 90% of the space used in the log
+buffers. It is clear that reducing the number of stale objects written to the
+log would greatly reduce the amount of metadata we write to the log, and this
+is the fundamental goal of delayed logging.
+
+From a conceptual point of view, XFS is already doing relogging in memory (where
+memory == log buffer), only it is doing it extremely inefficiently. It is using
+logical to physical formatting to do the relogging because there is no
+infrastructure to keep track of logical changes in memory prior to physically
+formatting the changes in a transaction to the log buffer. Hence we cannot avoid
+accumulating stale objects in the log buffers.
+
+Delayed logging is the name we've given to keeping and tracking transactional
+changes to objects in memory outside the log buffer infrastructure. Because of
+the relogging concept fundamental to the XFS journalling subsystem, this is
+actually relatively easy to do - all the changes to logged items are already
+tracked in the current infrastructure. The big problem is how to accumulate
+them and get them to the log in a consistent, recoverable manner.
+Describing the problems and how they have been solved is the focus of this
+document.
+
+One of the key changes that delayed logging makes to the operation of the
+journalling subsystem is that it disassociates the amount of outstanding
+metadata changes from the size and number of log buffers available. In other
+words, instead of there only being a maximum of 2MB of transaction changes not
+written to the log at any point in time, there may be a much greater amount
+being accumulated in memory. Hence the potential for loss of metadata on a
+crash is much greater than for the existing logging mechanism.
+
+It should be noted that this does not change the guarantee that log recovery
+will result in a consistent filesystem. What it does mean is that as far as the
+recovered filesystem is concerned, there may be many thousands of transactions
+that simply did not occur as a result of the crash. This makes it even more
+important that applications that care about their data use fsync() where they
+need to ensure application level data integrity is maintained.
+
+It should be noted that delayed logging is not an innovative new concept that
+warrants rigorous proofs to determine whether it is correct or not. The method
+of accumulating changes in memory for some period before writing them to the
+log is used effectively in many filesystems including ext3 and ext4. Hence
+no time is spent in this document trying to convince the reader that the
+concept is sound. Instead it is simply considered a "solved problem" and as
+such implementing it in XFS is purely an exercise in software engineering.
+
+The fundamental requirements for delayed logging in XFS are simple:
+
+	. Reduce the amount of metadata written to the log by at least
+	   an order of magnitude.
+	. Supply sufficient statistics to validate Requirement #1.
+	. Supply sufficient new tracing infrastructure to be able to debug
+	   problems with the new code.
+	. No on-disk format change (metadata or log format).
+	. Enable and disable with a mount option.
+	. No performance regressions for synchronous transaction workloads.
+
+== Delayed Logging Design
+
+=== Storing Changes
+
+The problem with accumulating changes at a logical level (i.e. just using the
+existing log item dirty region tracking) is that when it comes to writing the
+changes to the log buffers, we need to ensure that the object we are formatting
+is not changing while we do this. This requires locking the object to prevent
+concurrent modification. Hence flushing the logical changes to the log would
+require us to lock every object, format them, and then unlock them again.
+
+This introduces lots of scope for deadlocks with transactions that are already
+running. For example, a transaction has object A locked and modified, but needs
+the delayed logging tracking lock to commit the transaction. However, the
+flushing thread has the delayed logging tracking lock already held, and is
+trying to get the lock on object A to flush it to the log buffer. This appears
+to be an unsolvable deadlock condition, and it was solving this problem that
+was the barrier to implementing delayed logging for so long.
+
+The solution is relatively simple - it just took a long time to recognise it.
+Put simply, the current logging code formats the changes to each item into a
+vector array that points to the changed regions in the item. The log write code
+simply copies the memory these vectors point to into the log buffer during
+transaction commit while the item is locked in the transaction. Instead of
+using the log buffer as the destination of the formatting code, we can use an
+allocated memory buffer big enough to fit the formatted vector.
+
+If we then copy the vector into the memory buffer and rewrite the vector to
+point to the memory buffer rather than the object itself, we now have a copy of
+the changes in a format that is compatible with the log buffer writing code and
+that does not require us to lock the item to access it. This formatting and
+rewriting can all be done while the object is locked during transaction commit,
+resulting in a vector that is transactionally consistent and can be accessed
+without needing to lock the owning item.
+
+Hence we avoid the need to lock items when we need to flush outstanding
+asynchronous transactions to the log. The differences between the existing
+formatting method and the delayed logging formatting can be seen in the
+diagram below.
+
+Current format log vector:
+....
+Object    +---------------------------------------------+
+Vector 1      +----+
+Vector 2                    +----+
+Vector 3                                   +----------+
+....
+
+After formatting:
+
+....
+Log Buffer    +-V1-+-V2-+----V3----+
+....
+
+Delayed logging vector:
+
+....
+Object    +---------------------------------------------+
+Vector 1      +----+
+Vector 2                    +----+
+Vector 3                                   +----------+
+....
+
+After formatting:
+
+....
+Memory Buffer +-V1-+-V2-+----V3----+
+Vector 1      +----+
+Vector 2           +----+
+Vector 3                +----------+
+....
+
+The memory buffer and associated vector need to be passed as a single object,
+but still need to be associated with the parent object so if the object is
+relogged we can replace the current memory buffer with a new memory buffer that
+contains the latest changes.
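+
+A minimal sketch of what such a formatted copy might look like, using
+simplified stand-in types rather than the actual kernel structures:
+
+----
+#include <stddef.h>
+#include <string.h>
+
+/* One contiguous dirty region of the item. */
+struct log_iovec {
+	void	*addr;		/* redirected at the copy, not the object */
+	size_t	len;
+};
+
+/* Formatted copy of an item's changes, attached to the log item. */
+struct log_vector {
+	struct log_vector	*next;	/* chained to a checkpoint later   */
+	int			nvecs;
+	struct log_iovec	*vecs;
+	char			*buf;	/* memory buffer holding the copy  */
+};
+
+/* Copy each dirty region into 'buf' and point the vector at the copy,
+ * producing a self-describing object that needs no further locking. */
+static void format_to_memory(struct log_vector *lv)
+{
+	char *p = lv->buf;
+	int i;
+
+	for (i = 0; i < lv->nvecs; i++) {
+		memcpy(p, lv->vecs[i].addr, lv->vecs[i].len);
+		lv->vecs[i].addr = p;
+		p += lv->vecs[i].len;
+	}
+}
+----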
+
+The reason for keeping the vector around after we've formatted the memory
+buffer is to support splitting vectors across log buffer boundaries correctly.
+If we don't keep the vector around, we do not know where the region boundaries
+are in the item, so we'd need a new encapsulation method for regions in the log
+buffer writing (i.e. double encapsulation). This would be an on-disk format
+change and as such is not desirable.  It also means we'd have to write the log
+region headers in the formatting stage, which is problematic as there is per
+region state that needs to be placed into the headers during the log write.
+
+Hence we need to keep the vector, but by attaching the memory buffer to it and
+rewriting the vector addresses to point at the memory buffer we end up with a
+self-describing object that can be passed to the log buffer write code to be
+handled in exactly the same manner as the existing log vectors are handled.
+Hence we avoid needing a new on-disk format to handle items that have been
+relogged in memory.
+
+
+=== Tracking Changes
+
+Now that we can record transactional changes in memory in a form that allows
+them to be used without limitations, we need to be able to track and accumulate
+them so that they can be written to the log at some later point in time.  The
+log item is the natural place to store this vector and buffer, and it also makes
+sense for it to be the object used to track committed objects as it will always
+exist once the object has been included in a transaction.
+
+The log item is already used to track the log items that have been written to
+the log but not yet written to disk. Such log items are considered "active"
+and as such are stored in the Active Item List (AIL) which is an LSN-ordered
+doubly linked list. Items are inserted into this list during log buffer IO
+completion, after which they are unpinned and can be written to disk. An object
+that is in the AIL can be relogged, which causes the object to be pinned again
+and then moved forward in the AIL when the log buffer IO completes for that
+transaction.
+
+Essentially, this shows that an item that is in the AIL can still be modified
+and relogged, so any tracking must be separate to the AIL infrastructure. As
+such, we cannot reuse the AIL list pointers for tracking committed items, nor
+can we store state in any field that is protected by the AIL lock. Hence the
+committed item tracking needs its own locks, lists and state fields in the log
+item.
+
+Similar to the AIL, tracking of committed items is done through a new list
+called the Committed Item List (CIL).  The list tracks log items that have been
+committed and have formatted memory buffers attached to them. It tracks objects
+in transaction commit order, so when an object is relogged it is removed from
+its place in the list and re-inserted at the tail. This is entirely arbitrary
+and done to make it easy for debugging - the last items in the list are the
+ones that are most recently modified. Ordering of the CIL is not necessary for
+transactional integrity (as discussed in the next section) so the ordering is
+done for convenience/sanity of the developers.
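+
+In outline - again with illustrative types rather than the real ones - the
+insert path just moves a relogged item back to the tail so the list stays in
+commit order:
+
+----
+#include <pthread.h>
+
+struct list_node { struct list_node *prev, *next; };
+
+struct log_item {
+	struct list_node	cil_entry;	/* separate from AIL linkage */
+	int			in_cil;
+};
+
+static struct list_node	cil_head = { &cil_head, &cil_head };
+static pthread_mutex_t	cil_list_lock = PTHREAD_MUTEX_INITIALIZER;
+
+static void list_del(struct list_node *n)
+{
+	n->prev->next = n->next;
+	n->next->prev = n->prev;
+}
+
+static void list_add_tail(struct list_node *n, struct list_node *head)
+{
+	n->prev = head->prev;
+	n->next = head;
+	head->prev->next = n;
+	head->prev = n;
+}
+
+/* First commit inserts the item; a relog moves it to the tail. */
+static void cil_insert(struct log_item *lip)
+{
+	pthread_mutex_lock(&cil_list_lock);
+	if (lip->in_cil)
+		list_del(&lip->cil_entry);
+	list_add_tail(&lip->cil_entry, &cil_head);
+	lip->in_cil = 1;
+	pthread_mutex_unlock(&cil_list_lock);
+}
+----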
+
+
+=== Checkpoints
+
+When we have a log synchronisation event, commonly known as a "log force",
+all the items in the CIL must be written into the log via the log buffers.
+We need to write these items in the order that they exist in the CIL, and they
+need to be written as an atomic transaction. The need for all the objects to be
+written as an atomic transaction comes from the requirements of relogging and
+log replay - all the changes in all the objects in a given transaction must
+either be completely replayed during log recovery, or not replayed at all. If
+a transaction is not replayed because it is not complete in the log, then
+no later transactions should be replayed, either.
+
+To fulfill this requirement, we need to write the entire CIL in a single log
+transaction. Fortunately, the XFS log code has no fixed limit on the size of a
+transaction, nor does the log replay code. The only fundamental limit is that
+the transaction cannot be larger than just under half the size of the log.  The
+reason for this limit is that to find the head and tail of the log, there must
+be at least one complete transaction in the log at any given time. If a
+transaction is larger than half the log, then there is the possibility that a
+crash during the write of such a transaction could partially overwrite the
+only complete previous transaction in the log. This will result in a recovery
+failure and an inconsistent filesystem and hence we must enforce the maximum
+size of a checkpoint to be slightly less than half the log.
+
+Apart from this size requirement, a checkpoint transaction looks no different
+to any other transaction - it contains a transaction header, a series of
+formatted log items and a commit record at the tail. From a recovery
+perspective, the checkpoint transaction is also no different - just a lot
+bigger with a lot more items in it. The worst case effect of this is that we
+might need to tune the recovery transaction object hash size.
+
+Because the checkpoint is just another transaction and all the changes to log
+items are stored as log vectors, we can use the existing log buffer writing
+code to write the changes into the log. To do this efficiently, we need to
+minimise the time we hold the CIL locked while writing the checkpoint
+transaction. The current log write code enables us to do this easily with the
+way it separates the writing of the transaction contents (the log vectors) from
+the transaction commit record, but tracking this requires us to have a
+per-checkpoint context that travels through the log write process through to
+checkpoint completion.
+
+Hence a checkpoint has a context that tracks the state of the current
+checkpoint from initiation to checkpoint completion. A new context is initiated
+at the same time a checkpoint transaction is started. That is, when we remove
+all the current items from the CIL during a checkpoint operation, we move all
+those changes into the current checkpoint context. We then initialise a new
+context and attach that to the CIL for aggregation of new transactions.
+
+This allows us to unlock the CIL immediately after transfer of all the
+committed items and effectively allow new transactions to be issued while we
+are formatting the checkpoint into the log. It also allows concurrent
+checkpoints to be written into the log buffers in the case of log force heavy
+workloads, just like the existing transaction commit code does. This, however,
+requires that we strictly order the commit records in the log so that
+checkpoint sequence order is maintained during log replay.
+
+To ensure that we can be writing an item into a checkpoint transaction at
+the same time another transaction modifies the item and inserts the log item
+into the new CIL, the checkpoint transaction commit code cannot use log items
+to store the list of log vectors that need to be written into the transaction.
+Hence log vectors need to be able to be chained together to allow them to be
+detached from the log items. That is, when the CIL is flushed the memory
+buffer and log vector attached to each log item needs to be attached to the
+checkpoint context so that the log item can be released. In diagrammatic form,
+the CIL would look like this before the flush:
+
+----
+	CIL Head
+	   |
+	   V
+	Log Item <-> log vector 1	-> memory buffer
+	   |				-> vector array
+	   V
+	Log Item <-> log vector 2	-> memory buffer
+	   |				-> vector array
+	   V
+	......
+	   |
+	   V
+	Log Item <-> log vector N-1	-> memory buffer
+	   |				-> vector array
+	   V
+	Log Item <-> log vector N	-> memory buffer
+					-> vector array
+----
+
+And after the flush the CIL head is empty, and the checkpoint context log
+vector list would look like:
+
+----
+	Checkpoint Context
+	   |
+	   V
+	log vector 1	-> memory buffer
+	   |		-> vector array
+	   |		-> Log Item
+	   V
+	log vector 2	-> memory buffer
+	   |		-> vector array
+	   |		-> Log Item
+	   V
+	......
+	   |
+	   V
+	log vector N-1	-> memory buffer
+	   |		-> vector array
+	   |		-> Log Item
+	   V
+	log vector N	-> memory buffer
+			-> vector array
+			-> Log Item
+----
+
+Once this transfer is done, the CIL can be unlocked and new transactions can
+start, while the checkpoint flush code works over the log vector chain to
+commit the checkpoint.
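+
+A sketch of that transfer, with hypothetical names and simplified structures
+(the real push code differs in detail):
+
+----
+#include <stddef.h>
+
+struct log_vector { struct log_vector *next; /* formatted copy, as above */ };
+
+struct log_item {
+	struct log_item		*cil_next;	/* simplified CIL linkage  */
+	struct log_vector	*lv;		/* attached formatted copy */
+};
+
+struct cil_ctx {
+	unsigned long		sequence;	/* checkpoint sequence     */
+	struct log_vector	*lv_chain;	/* vectors detached below  */
+};
+
+/*
+ * Called with the CIL flush lock held: strip each item of its formatted
+ * copy and chain the copy to the checkpoint context so the item can be
+ * relogged into the next context.  (Prepending reverses the order here
+ * for brevity; the real push preserves CIL order.)
+ */
+static void cil_detach_vectors(struct log_item **cil, struct cil_ctx *ctx)
+{
+	struct log_item *lip;
+
+	while ((lip = *cil) != NULL) {
+		*cil = lip->cil_next;
+		lip->cil_next = NULL;
+		lip->lv->next = ctx->lv_chain;
+		ctx->lv_chain = lip->lv;
+		lip->lv = NULL;
+	}
+}
+----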
+
+Once the checkpoint is written into the log buffers, the checkpoint context is
+attached to the log buffer that the commit record was written to along with a
+completion callback. Log IO completion will call that callback, which can then
+run transaction committed processing for the log items (i.e. insert into AIL
+and unpin) in the log vector chain and then free the log vector chain and
+checkpoint context.
+
+Discussion Point: I am uncertain as to whether the log item is the most
+efficient way to track vectors, even though it seems like the natural way to do
+it. The fact that we walk the log items (in the CIL) just to chain the log
+vectors and break the link between the log item and the log vector means that
+we take a cache line hit for the log item list modification, then another for
+the log vector chaining. If we track by the log vectors, then we only need to
+break the link between the log item and the log vector, which means we should
+dirty only the log item cachelines. Normally I wouldn't be concerned about one
+vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
+vectors in one checkpoint transaction. I'd guess this is a "measure and
+compare" situation that can be done after a working and reviewed implementation
+is in the dev tree....
+
+=== Checkpoint Sequencing
+
+One of the key aspects of the XFS transaction subsystem is that it tags
+committed transactions with the log sequence number of the transaction commit.
+This allows transactions to be issued asynchronously even though there may be
+future operations that cannot be completed until that transaction is fully
+committed to the log. In the rare case that a dependent operation occurs (e.g.
+re-using a freed metadata extent for a data extent), a special, optimised log
+force can be issued to force the dependent transaction to disk immediately.
+
+To do this, transactions need to record the LSN of the commit record of the
+transaction. This LSN comes directly from the log buffer the transaction is
+written into. While this works just fine for the existing transaction
+mechanism, it does not work for delayed logging because transactions are not
+written directly into the log buffers. Hence some other method of sequencing
+transactions is required.
+
+As discussed in the checkpoint section, delayed logging uses per-checkpoint
+contexts, and as such it is simple to assign a sequence number to each
+checkpoint. Because the switching of checkpoint contexts must be done
+atomically, it is simple to ensure that each new context has a monotonically
+increasing sequence number assigned to it without the need for an external
+atomic counter - we can just take the current context sequence number and add
+one to it for the new context.
+
+Then, instead of assigning a log buffer LSN to the transaction commit LSN
+during the commit, we can assign the current checkpoint sequence. This allows
+operations that track transactions that have not yet completed to know what
+checkpoint sequence needs to be committed before they can continue. As a
+result, the code that forces the log to a specific LSN now needs to ensure that
+the log forces to a specific checkpoint.
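+
+In code terms this is a one line change at commit time - again with invented
+names, something along the lines of:
+
+----
+#include <stdint.h>
+
+struct cil_ctx     { uint64_t sequence; };
+struct cil         { struct cil_ctx *ctx; };
+struct transaction { uint64_t commit_seq; };
+
+/* On commit, tag the transaction with the current checkpoint sequence. */
+static void cil_commit(struct cil *cil, struct transaction *tp)
+{
+	/* ... format the transaction's items and insert them into the CIL ... */
+	tp->commit_seq = cil->ctx->sequence;	/* replaces the log buffer LSN */
+}
+----
+
+A dependent operation then forces the log to that recorded sequence rather
+than to an LSN.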
+
+To ensure that we can do this, we need to track all the checkpoint contexts
+that are currently committing to the log. When we flush a checkpoint, the
+context gets added to a "committing" list which can be searched. When a
+checkpoint commit completes, it is removed from the committing list. Because
+the checkpoint context records the LSN of the commit record for the checkpoint,
+we can also wait on the log buffer that contains the commit record, thereby
+using the existing log force mechanisms to execute synchronous forces.
+
+It should be noted that the synchronous forces may need to be extended with
+mitigation algorithms similar to the current log buffer code to allow
+aggregation of multiple synchronous transactions if there are already
+synchronous transactions being flushed. Investigation of the performance of the
+current design is needed before making any decisions here.
+
+The main concern with log forces is to ensure that all the previous checkpoints
+are also committed to disk before the one we need to wait for. Therefore we
+need to check that all the prior contexts in the committing list are also
+complete before waiting on the one we need to complete. We do this
+synchronisation in the log force code so that we don't need to wait anywhere
+else for such serialisation - it only matters when we do a log force.
+
+The only remaining complexity is that a log force now also has to handle the
+case where the forcing sequence number is the same as the current context. That
+is, we need to flush the CIL and potentially wait for it to complete. This is a
+simple addition to the existing log forcing code to check the sequence numbers
+and push if required. Indeed, placing the current sequence checkpoint flush in
+the log force code enables the current mechanism for issuing synchronous
+transactions to remain untouched (i.e. commit an asynchronous transaction, then
+force the log at the LSN of that transaction) and so the higher level code
+behaves the same regardless of whether delayed logging is being used or not.
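+
+Putting the pieces of this section together, a log force to a checkpoint
+sequence might be sketched as follows. This is a user-space approximation:
+the committing list handling, the push and the wait primitives are all
+stand-ins for the real machinery.
+
+----
+#include <stdint.h>
+#include <stdbool.h>
+
+struct cil_ctx {
+	uint64_t	sequence;
+	bool		completed;	/* commit record written and I/O done */
+	struct cil_ctx	*next;		/* committing list link */
+};
+
+struct cil {
+	uint64_t	current_seq;	/* sequence of the still-open context */
+	struct cil_ctx	*committing;	/* contexts pushed but not yet complete */
+};
+
+/* Stand-ins for the real CIL push and the log buffer force/wait path. */
+static void cil_push(struct cil *cil) { (void)cil; }
+static void wait_for_ctx(struct cil_ctx *ctx)
+{
+	(void)ctx;	/* sleeps on the commit record's log buffer in real code */
+}
+
+/*
+ * Force the log up to and including checkpoint 'seq'.  If the target is the
+ * context still open in the CIL it must be pushed first; every committing
+ * context at or below the target sequence must then complete before the
+ * force can be reported as done.
+ */
+static void log_force_seq(struct cil *cil, uint64_t seq)
+{
+	if (seq == cil->current_seq)
+		cil_push(cil);		/* flush the CIL for the current sequence */
+
+	for (struct cil_ctx *ctx = cil->committing; ctx; ctx = ctx->next) {
+		if (ctx->sequence <= seq && !ctx->completed)
+			wait_for_ctx(ctx);
+	}
+}
+----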
+
+=== Checkpoint Log Space Accounting
+
+The big issue for a checkpoint transaction is the log space reservation for the
+transaction. We don't know how big a checkpoint transaction is going to be
+ahead of time, nor how many log buffers it will take to write out, nor how
+many split log vector regions are going to be used. We can track the
+amount of log space required as we add items to the commit item list, but we
+still need to reserve the space in the log for the checkpoint.
+
+A typical transaction reserves enough space in the log for the worst case space
+usage of the transaction. The reservation accounts for log record headers,
+transaction and region headers, headers for split regions, buffer tail padding,
+etc. as well as the actual space for all the changed metadata in the
+transaction. While some of this is fixed overhead, much of it is dependent on
+the size of the transaction and the number of regions being logged (the number
+of log vectors in the transaction).
+
+An example of the differences would be logging directory changes versus logging
+inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
+there are lots of transactions that only contain an inode core and an inode log
+format structure. That is, two vectors totaling roughly 150 bytes. If we modify
+10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
+vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
+comparison, if we are logging full directory buffers, they are typically 4KB
+each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a
+buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
+space.  From this, it should be obvious that a static log space reservation is
+not particularly flexible and that it is difficult to select the "optimal
+value" for all workloads.
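+
+The back of the envelope arithmetic behind those two figures can be checked
+with a few lines of C; the byte counts below are the rough values quoted
+above, not exact on-disk sizes:
+
+----
+#include <stdio.h>
+
+int main(void)
+{
+	double metadata = 1.5e6;	/* ~1.5MB of changed metadata either way */
+
+	/* inode case: 10,000 transactions, two vectors each (core + log format) */
+	double inode_vec_overhead = 10000 * 2 * 12.0;	/* 20,000 vectors, 12 bytes each */
+
+	/* directory case: ~400 4KB buffers, two vectors per buffer */
+	double dir_vec_overhead = 400 * 2 * 12.0;	/* ~800 vectors */
+
+	printf("inode workload:     ~%.2f MB logged\n",
+	       (metadata + inode_vec_overhead) / 1e6);	/* ~1.74MB */
+	printf("directory workload: ~%.2f MB logged\n",
+	       (metadata + dir_vec_overhead) / 1e6);	/* ~1.51MB */
+	return 0;
+}
+----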
+
+Further, if we are going to use a static reservation, which bit of the entire
+reservation does it cover? We account for space used by the transaction
+reservation by tracking the space currently used by the object in the CIL and
+then calculating the increase or decrease in space used as the object is
+relogged. This allows for a checkpoint reservation to only have to account for
+log buffer metadata used such as log header records.
+
+However, even using a static reservation for just the log metadata is
+problematic. Typically log record headers use at least 16KB of log space per
+1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
+large enough to handle arbitrary sized checkpoint transactions. This
+reservation needs to be made before the checkpoint is started, and we need to
+be able to reserve the space without sleeping.  For an 8MB checkpoint, we need a
+reservation of around 150KB, which is a non-trivial amount of space.
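+
+That figure is easy to sanity check: 8MB of checkpoint data split into 32KB
+log records needs 256 record headers of 512 bytes each, i.e. 128KB, before
+any transaction and region header overhead is added.
+
+----
+#include <stdio.h>
+
+int main(void)
+{
+	unsigned long checkpoint = 8UL << 20;	/* an 8MB checkpoint */
+	unsigned long record	 = 32UL << 10;	/* 32KB per log record */
+	unsigned long header	 = 512;		/* header per log record */
+
+	unsigned long records = checkpoint / record;	/* 256 records */
+	printf("log record header space: %luKB\n",
+	       (records * header) >> 10);		/* 128KB */
+	/* transaction/region headers and padding push this towards ~150KB */
+	return 0;
+}
+----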
+
+A static reservation needs to manipulate the log grant counters - we can take a
+permanent reservation on the space, but we still need to make sure we refresh
+the write reservation (the actual space available to the transaction) after
+every checkpoint transaction completion. Unfortunately, if this space is not
+available when required, then the regrant code will sleep waiting for it.
+
+The problem with this is that it can lead to deadlocks as we may need to commit
+checkpoints to be able to free up log space (refer back to the description of
+rolling transactions for an example of this).  Hence we *must* always have
+space available in the log if we are to use static reservations, and that is
+very difficult and complex to arrange. It is possible to do, but there is a
+simpler way.
+
+The simpler way of doing this is tracking the entire log space used by the
+items in the CIL and using this to dynamically calculate the amount of log
+space required by the log metadata. If this log metadata space changes as a
+result of a transaction commit inserting a new memory buffer into the CIL, then
+the difference in space required is removed from the transaction that causes
+the change. Transactions at this level will *always* have enough space
+available in their reservation for this as they have already reserved the
+maximal amount of log metadata space they require, and such a delta reservation
+will always be less than or equal to the maximal amount in the reservation.
+
+Hence we can grow the checkpoint transaction reservation dynamically as items
+are added to the CIL and avoid the need for reserving and regranting log space
+up front. This avoids deadlocks and removes a blocking point from the
+checkpoint flush code.
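+
+A sketch of that accounting, using the same kind of invented names as the
+earlier fragments and a single mutex standing in for the CIL insertion lock,
+might look like this (the real code also accounts for per-item list and log
+header overhead, which is omitted here):
+
+----
+#include <stddef.h>
+#include <pthread.h>
+
+struct log_vector  { size_t bytes; };		/* size of the formatted copy */
+struct log_item    { struct log_vector *lv; };	/* lv == NULL: not in the CIL */
+struct cil_ctx     { long space_used; };	/* checkpoint log space so far */
+struct transaction { long reservation; };	/* unused transaction reservation */
+
+static pthread_mutex_t cil_lock = PTHREAD_MUTEX_INITIALIZER;
+
+/*
+ * Account checkpoint log space as items are committed.  The committing
+ * transaction has already reserved the worst case space for its own changes,
+ * so the delta charged here can never exceed what it still holds.
+ */
+static void cil_insert_item(struct cil_ctx *ctx, struct log_item *li,
+			    struct log_vector *new_lv, struct transaction *tp)
+{
+	long old = li->lv ? (long)li->lv->bytes : 0;
+	long delta = (long)new_lv->bytes - old;
+
+	pthread_mutex_lock(&cil_lock);
+	if (delta > 0)
+		tp->reservation -= delta;	/* charge the committing transaction */
+	ctx->space_used += delta;		/* may shrink if the item got smaller */
+	li->lv = new_lv;			/* relogging replaces the old buffer */
+	pthread_mutex_unlock(&cil_lock);
+}
+----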
+
+As mentioned earlier, transactions can't grow to more than half the size of the
+log. Hence as part of the reservation growing, we need to also check the size
+of the reservation against the maximum allowed transaction size. If we reach
+the maximum threshold, we need to push the CIL to the log. This is effectively
+a "background flush" and is done on demand. This is identical to
+a CIL push triggered by a log force, except that there is no waiting for the
+checkpoint commit to complete. This background push is checked and executed by
+transaction commit code.
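+
+The check itself is cheap and can be sketched as a couple of lines run at the
+tail of transaction commit; the threshold shown is simply the "half the log"
+limit discussed above and the push function is a stand-in for queueing an
+asynchronous CIL flush:
+
+----
+#include <stddef.h>
+
+struct cil_ctx { size_t space_used; };
+struct cil     { struct cil_ctx *ctx; size_t log_size; };
+
+/* Stand-in: queue a CIL push without waiting for the checkpoint to complete. */
+static void cil_push_background(struct cil *cil) { (void)cil; }
+
+/* Called at the end of every transaction commit. */
+static void cil_push_if_needed(struct cil *cil)
+{
+	size_t limit = cil->log_size / 2;	/* maximum checkpoint size */
+
+	if (cil->ctx->space_used >= limit)
+		cil_push_background(cil);	/* background flush, no waiting */
+}
+----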
+
+If the transaction subsystem goes idle while we still have items in the CIL,
+they will be flushed by the periodic log force issued by the xfssyncd. This log
+force will push the CIL to disk, and if the transaction subsystem stays idle,
+allow the idle log to be covered (effectively marked clean) in exactly the same
+manner that is done for the existing logging method. A discussion point is
+whether this log force needs to be done more frequently than the current
+rate, which is once every 30s.
+
+
+=== Log Item Pinning
+
+Currently log items are pinned during transaction commit while the items are
+still locked. This happens just after the items are formatted, though it could
+be done any time before the items are unlocked. The result of this mechanism is
+that items get pinned once for every transaction that is committed to the log
+buffers. Hence items that are relogged in the log buffers will have a pin count
+for every outstanding transaction they were dirtied in. When each of these
+transactions is completed, it will unpin the item once. As a result, the item
+only becomes unpinned when all the transactions complete and there are no
+pending transactions. Thus the pinning and unpinning of a log item is symmetric
+as there is a 1:1 relationship between transaction commit and log item
+completion.
+
+For delayed logging, however, we have an asymmetric transaction commit to
+completion relationship. Every time an object is relogged in the CIL it goes
+through the commit process without a corresponding completion being registered.
+That is, we now have a many-to-one relationship between transaction commit and
+log item completion. The result of this is that pinning and unpinning of the
+log items becomes unbalanced if we retain the "pin on transaction commit, unpin
+on transaction completion" model.
+
+To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
+insertion into the CIL, unpin on checkpoint completion". In other words, the
+pinning and unpinning becomes symmetric around a checkpoint context. We have to
+pin the object the first time it is inserted into the CIL - if it is already in
+the CIL during a transaction commit, then we do not pin it again. Because there
+can be multiple outstanding checkpoint contexts, we can still see elevated pin
+counts, but as each checkpoint completes the pin count will retain the correct
+value according to its context.
+
+Just to make matters slightly more complex, this checkpoint level context
+for the pin count means that the pinning of an item must take place under the
+CIL commit/flush lock. If we pin the object outside this lock, we cannot
+guarantee which context the pin count is associated with. This is because
+whether or not we pin the item depends on whether the item is already present
+in the current CIL. If we don't hold the CIL lock before we check and pin the
+object, we have a race with the CIL being flushed between the check and the
+pin (or not pinning, as the case may be). Hence we must hold the CIL
+flush/commit lock to guarantee that we pin the items correctly.
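+
+A sketch of the commit-side insertion with the pin check done under the CIL
+lock follows; the names are invented and the "in_cil" flag is a simplification
+of how CIL membership is actually tracked:
+
+----
+#include <stdbool.h>
+#include <pthread.h>
+
+struct log_item {
+	bool	in_cil;		/* already aggregated into the current CIL? */
+	int	pin_count;
+};
+
+static pthread_mutex_t cil_lock = PTHREAD_MUTEX_INITIALIZER;
+
+/*
+ * Pin an item only the first time it is inserted into the CIL.  The check
+ * and the pin must both happen under the CIL lock; otherwise the CIL could
+ * be flushed between them and the pin would belong to the wrong context.
+ */
+static void cil_insert_and_pin(struct log_item *li)
+{
+	pthread_mutex_lock(&cil_lock);
+	if (!li->in_cil) {
+		li->pin_count++;	/* one pin per checkpoint context */
+		li->in_cil = true;	/* relogging while in the CIL won't re-pin */
+	}
+	/* ... attach or replace the item's formatted log vector here ... */
+	pthread_mutex_unlock(&cil_lock);
+}
+----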
+
+=== Concurrent Scalability
+
+A fundamental requirement for the CIL is that accesses through transaction
+commits must scale to many concurrent commits. The current transaction commit
+code does not break down even when there are transactions coming from 2048
+processors at once. The current transaction code does not go any faster than if
+there was only one CPU using it, but it does not slow down either.
+
+As a result, the delayed logging transaction commit code needs to be designed
+for concurrency from the ground up. It is obvious that there are serialisation
+points in the design - the three important ones are:
+
+	. Locking out new transaction commits while flushing the CIL
+	. Adding items to the CIL and updating item space accounting
+	. Checkpoint commit ordering
+
+Looking at the transaction commit and CIL flushing interactions, it is clear
+that we have a many-to-one interaction here. That is, the only restriction on
+the number of concurrent transactions that can be trying to commit at once is
+the amount of space available in the log for their reservations. The practical
+limit here is in the order of several hundred concurrent transactions for a
+128MB log, which means that it is generally one per CPU in a machine.
+
+A transaction commit needs to hold out a flush for a relatively long period
+of time - the pinning of log items needs to be done
+while we are holding out a CIL flush, so at the moment that means it is held
+across the formatting of the objects into memory buffers (i.e. while memcpy()s
+are in progress). Ultimately a two pass algorithm where the formatting is done
+separately to the pinning of objects could be used to reduce the hold time of
+the transaction commit side.
+
+Because of the number of potential transaction commit side holders, the lock
+really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
+want every other CPU in the machine spinning on the CIL lock. Given that
+flushing the CIL could involve walking a list of tens of thousands of log
+items, it will get held for a significant time and so spin contention is a
+significant concern. Preventing lots of CPUs spinning doing nothing is the
+main reason for choosing a sleeping lock even though nothing in either the
+transaction commit or CIL flush side sleeps with the lock held.
+
+It should also be noted that CIL flushing is a relatively rare operation
+compared to transaction commit for asynchronous transaction workloads - only
+time will tell if using a read-write semaphore for exclusion will limit
+transaction commit concurrency due to cache line bouncing of the lock on the
+read side.
+
+The second serialisation point is on the transaction commit side where items
+are inserted into the CIL. Because transactions can enter this code
+concurrently, the CIL needs to be protected separately from the above
+commit/flush exclusion. It also needs to be an exclusive lock but it is only
+held for a very short time and so a spin lock is appropriate here. It is
+possible that this lock will become a contention point, but given the short
+hold time once per transaction I think that contention is unlikely.
+
+The final serialisation point is the checkpoint commit record ordering code
+that is run as part of the checkpoint commit and log force sequencing. The code
+path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
+an ordering loop after writing all the log vectors into the log buffers but
+before writing the commit record. This loop walks the list of committing
+checkpoints and needs to block waiting for checkpoints to complete their commit
+record write. As a result it needs a lock and a wait variable. Log force
+sequencing also requires the same lock, list walk, and blocking mechanism to
+ensure completion of checkpoints.
+
+These two sequencing operations can use the same mechanism even though the
+events they are waiting for are different. The checkpoint commit record
+sequencing needs to wait until checkpoint contexts contain a commit LSN
+(obtained through completion of a commit record write) while log force
+sequencing needs to wait until previous checkpoint contexts are removed from
+the committing list (i.e. they've completed). A simple wait variable and
+broadcast wakeups (thundering herds) have been used to implement these two
+serialisation queues. They use the same lock as the CIL, too. If we see too
+much contention on the CIL lock, or too many context switches as a result of
+the broadcast wakeups, these operations can be put under a new spinlock and
+given separate wait lists to reduce lock contention and the number of processes
+woken by the wrong event.
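+
+A sketch of the commit record ordering loop, using a pthread condition
+variable to model the broadcast wakeups described above (names and locking
+granularity are illustrative, not the kernel implementation):
+
+----
+#include <stdint.h>
+#include <pthread.h>
+
+struct cil_ctx {
+	uint64_t	sequence;
+	uint64_t	commit_lsn;	/* 0 until the commit record is written */
+	struct cil_ctx	*next;		/* committing list link */
+};
+
+static pthread_mutex_t cil_lock = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t  commit_wait = PTHREAD_COND_INITIALIZER;
+
+/* Stand-in: write this checkpoint's commit record, returning its LSN. */
+static uint64_t write_commit_record(struct cil_ctx *ctx)
+{
+	(void)ctx;
+	return 1;
+}
+
+/*
+ * Called after a checkpoint's log vectors have been written: wait until all
+ * earlier committing checkpoints have written their commit records, write
+ * ours, then wake anyone waiting behind us.  Commit records therefore hit
+ * the log in checkpoint sequence order, preserving replay ordering.
+ */
+static void order_commit_record(struct cil_ctx *committing, struct cil_ctx *ctx)
+{
+	pthread_mutex_lock(&cil_lock);
+restart:
+	for (struct cil_ctx *c = committing; c; c = c->next) {
+		if (c->sequence < ctx->sequence && c->commit_lsn == 0) {
+			pthread_cond_wait(&commit_wait, &cil_lock);
+			goto restart;		/* re-check the whole list */
+		}
+	}
+	pthread_mutex_unlock(&cil_lock);
+
+	uint64_t lsn = write_commit_record(ctx);
+
+	pthread_mutex_lock(&cil_lock);
+	ctx->commit_lsn = lsn;			/* now visible to later checkpoints */
+	pthread_cond_broadcast(&commit_wait);	/* thundering herd wakeup */
+	pthread_mutex_unlock(&cil_lock);
+}
+----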
+
+
+=== Lifecycle Changes
+
+The existing log item life cycle is as follows:
+
+----
+	1. Transaction allocate
+	2. Transaction reserve
+	3. Lock item
+	4. Join item to transaction
+		If not already attached,
+			Allocate log item
+			Attach log item to owner item
+		Attach log item to transaction
+	5. Modify item
+		Record modifications in log item
+	6. Transaction commit
+		Pin item in memory
+		Format item into log buffer
+		Write commit LSN into transaction
+		Unlock item
+		Attach transaction to log buffer
+
+	<log buffer IO dispatched>
+	<log buffer IO completes>
+
+	7. Transaction completion
+		Mark log item committed
+		Insert log item into AIL
+			Write commit LSN into log item
+		Unpin log item
+	8. AIL traversal
+		Lock item
+		Mark log item clean
+		Flush item to disk
+
+	<item IO completion>
+
+	9. Log item removed from AIL
+		Moves log tail
+		Item unlocked
+----
+
+Essentially, steps 1-6 operate independently from step 7, which is also
+independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
+at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
+at the same time. If the log item is in the AIL or between steps 6 and 7
+and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
+are entered and completed is the object considered clean.
+
+With delayed logging, there are new steps inserted into the life cycle:
+
+----
+	1. Transaction allocate
+	2. Transaction reserve
+	3. Lock item
+	4. Join item to transaction
+		If not already attached,
+			Allocate log item
+			Attach log item to owner item
+		Attach log item to transaction
+	5. Modify item
+		Record modifications in log item
+	6. Transaction commit
+		Pin item in memory if not pinned in CIL
+		Format item into log vector + buffer
+		Attach log vector and buffer to log item
+		Insert log item into CIL
+		Write CIL context sequence into transaction
+		Unlock item
+
+	<next log force>
+
+	7. CIL push
+		lock CIL flush
+		Chain log vectors and buffers together
+		Remove items from CIL
+		unlock CIL flush
+		write log vectors into log
+		sequence commit records
+		attach checkpoint context to log buffer
+
+	<log buffer IO dispatched>
+	<log buffer IO completes>
+
+	8. Checkpoint completion
+		Mark log item committed
+		Insert item into AIL
+			Write commit LSN into log item
+		Unpin log item
+	9. AIL traversal
+		Lock item
+		Mark log item clean
+		Flush item to disk
+	<item IO completion>
+	10. Log item removed from AIL
+		Moves log tail
+		Item unlocked
+----
+
+From this, it can be seen that the only life cycle differences between the two
+logging methods are in the middle of the life cycle - they still have the same
+beginning and end and execution constraints. The only differences are in the
+committing of the log items to the log itself and the completion processing.
+Hence delayed logging should not introduce any constraints on log item
+behaviour, allocation or freeing that don't already exist.
+
+As a result of this zero-impact "insertion" of delayed logging infrastructure
+and the design of the internal structures to avoid on-disk format changes, we
+can basically switch between delayed logging and the existing mechanism with a
+mount option. Fundamentally, there is no reason why the log manager would not
+be able to swap methods automatically and transparently depending on load
+characteristics, but this should not be necessary if delayed logging works as
+designed.
+
+EOF.
diff --git a/design/XFS_Filesystem_Structure/docinfo.xml b/design/XFS_Filesystem_Structure/docinfo.xml
index e13d705..558a04c 100644
--- a/design/XFS_Filesystem_Structure/docinfo.xml
+++ b/design/XFS_Filesystem_Structure/docinfo.xml
@@ -169,4 +169,18 @@ 
 			</simplelist>
 		</revdescription>
 	</revision>
+	<revision>
+		<revnumber>3.141592</revnumber>
+		<date>May 2018</date>
+		<author>
+			<firstname>Darrick</firstname>
+			<surname>Wong</surname>
+			<email>darrick.wong@oracle.com</email>
+		</author>
+		<revdescription>
+			<simplelist>
+				<member>Incorporate Dave Chinner's log design document.</member>
+			</simplelist>
+		</revdescription>
+	</revision>
 </revhistory>
diff --git a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
index 8d54935..7bdfade 100644
--- a/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
+++ b/design/XFS_Filesystem_Structure/xfs_filesystem_structure.asciidoc
@@ -48,6 +48,8 @@  include::overview.asciidoc[]
 
 include::metadata_integrity.asciidoc[]
 
+include::delayed_logging.asciidoc[]
+
 include::reflink.asciidoc[]
 
 include::reconstruction.asciidoc[]
diff --git a/design/xfs-delayed-logging-design.asciidoc b/design/xfs-delayed-logging-design.asciidoc
deleted file mode 100644
index e54e786..0000000
--- a/design/xfs-delayed-logging-design.asciidoc
+++ /dev/null
@@ -1,810 +0,0 @@ 
-= XFS Delayed Logging Design
-Dave Chinner, <dchinner@redhat.com>
-v1.0, Feb 2014: Initial conversion to asciidoc
-
-== Introduction to Re-logging in XFS
-
-XFS logging is a combination of logical and physical logging. Some objects,
-such as inodes and dquots, are logged in logical format where the details
-logged are made up of the changes to in-core structures rather than on-disk
-structures. Other objects - typically buffers - have their physical changes
-logged. The reason for these differences is to reduce the amount of log space
-required for objects that are frequently logged. Some parts of inodes are more
-frequently logged than others, and inodes are typically more frequently logged
-than any other object (except maybe the superblock buffer) so keeping the
-amount of metadata logged low is of prime importance.
-
-The reason that this is such a concern is that XFS allows multiple separate
-modifications to a single object to be carried in the log at any given time.
-This allows the log to avoid needing to flush each change to disk before
-recording a new change to the object. XFS does this via a method called
-"re-logging". Conceptually, this is quite simple - all it requires is that any
-new change to the object is recorded with a *new copy* of all the existing
-changes in the new transaction that is written to the log.
-
-That is, if we have a sequence of changes A through to F, and the object was
-written to disk after change D, we would see in the log the following series
-of transactions, their contents and the log sequence number (LSN) of the
-transaction:
-
-....
-	Transaction		Contents	LSN
-	   A			   A		   X
-	   B			  A+B		  X+n
-	   C			 A+B+C		 X+n+m
-	   D			A+B+C+D		X+n+m+o
-	    <object written to disk>
-	   E			   E		   Y (> X+n+m+o)
-	   F			  E+F		  Y+p
-....
-
-In other words, each time an object is relogged, the new transaction contains
-the aggregation of all the previous changes currently held only in the log.
-
-This relogging technique also allows objects to be moved forward in the log so
-that an object being relogged does not prevent the tail of the log from ever
-moving forward.  This can be seen in the table above by the changing
-(increasing) LSN of each subsequent transaction - the LSN is effectively a
-direct encoding of the location in the log of the transaction.
-
-This relogging is also used to implement long-running, multiple-commit
-transactions.  These transaction are known as rolling transactions, and require
-a special log reservation known as a permanent transaction reservation. A
-typical example of a rolling transaction is the removal of extents from an
-inode which can only be done at a rate of two extents per transaction because
-of reservation size limitations. Hence a rolling extent removal transaction
-keeps relogging the inode and btree buffers as they get modified in each
-removal operation. This keeps them moving forward in the log as the operation
-progresses, ensuring that current operation never gets blocked by itself if the
-log wraps around.
-
-Hence it can be seen that the relogging operation is fundamental to the correct
-working of the XFS journalling subsystem. From the above description, most
-people should be able to see why the XFS metadata operations writes so much to
-the log - repeated operations to the same objects write the same changes to
-the log over and over again. Worse is the fact that objects tend to get
-dirtier as they get relogged, so each subsequent transaction is writing more
-metadata into the log.
-
-Another feature of the XFS transaction subsystem is that most transactions are
-asynchronous. That is, they don't commit to disk until either a log buffer is
-filled (a log buffer can hold multiple transactions) or a synchronous operation
-forces the log buffers holding the transactions to disk. This means that XFS is
-doing aggregation of transactions in memory - batching them, if you like - to
-minimise the impact of the log IO on transaction throughput.
-
-The limitation on asynchronous transaction throughput is the number and size of
-log buffers made available by the log manager. By default there are 8 log
-buffers available and the size of each is 32kB - the size can be increased up
-to 256kB by use of a mount option.
-
-Effectively, this gives us the maximum bound of outstanding metadata changes
-that can be made to the filesystem at any point in time - if all the log
-buffers are full and under IO, then no more transactions can be committed until
-the current batch completes. It is now common for a single current CPU core to
-be to able to issue enough transactions to keep the log buffers full and under
-IO permanently. Hence the XFS journalling subsystem can be considered to be IO
-bound.
-
-== Delayed Logging Concepts
-
-The key thing to note about the asynchronous logging combined with the
-relogging technique XFS uses is that we can be relogging changed objects
-multiple times before they are committed to disk in the log buffers. If we
-return to the previous relogging example, it is entirely possible that
-transactions A through D are committed to disk in the same log buffer.
-
-That is, a single log buffer may contain multiple copies of the same object,
-but only one of those copies needs to be there - the last one "D", as it
-contains all the changes from the previous changes. In other words, we have one
-necessary copy in the log buffer, and three stale copies that are simply
-wasting space. When we are doing repeated operations on the same set of
-objects, these "stale objects" can be over 90% of the space used in the log
-buffers. It is clear that reducing the number of stale objects written to the
-log would greatly reduce the amount of metadata we write to the log, and this
-is the fundamental goal of delayed logging.
-
-From a conceptual point of view, XFS is already doing relogging in memory (where
-memory == log buffer), only it is doing it extremely inefficiently. It is using
-logical to physical formatting to do the relogging because there is no
-infrastructure to keep track of logical changes in memory prior to physically
-formatting the changes in a transaction to the log buffer. Hence we cannot avoid
-accumulating stale objects in the log buffers.
-
-Delayed logging is the name we've given to keeping and tracking transactional
-changes to objects in memory outside the log buffer infrastructure. Because of
-the relogging concept fundamental to the XFS journalling subsystem, this is
-actually relatively easy to do - all the changes to logged items are already
-tracked in the current infrastructure. The big problem is how to accumulate
-them and get them to the log in a consistent, recoverable manner.
-Describing the problems and how they have been solved is the focus of this
-document.
-
-One of the key changes that delayed logging makes to the operation of the
-journalling subsystem is that it disassociates the amount of outstanding
-metadata changes from the size and number of log buffers available. In other
-words, instead of there only being a maximum of 2MB of transaction changes not
-written to the log at any point in time, there may be a much greater amount
-being accumulated in memory. Hence the potential for loss of metadata on a
-crash is much greater than for the existing logging mechanism.
-
-It should be noted that this does not change the guarantee that log recovery
-will result in a consistent filesystem. What it does mean is that as far as the
-recovered filesystem is concerned, there may be many thousands of transactions
-that simply did not occur as a result of the crash. This makes it even more
-important that applications that care about their data use fsync() where they
-need to ensure application level data integrity is maintained.
-
-It should be noted that delayed logging is not an innovative new concept that
-warrants rigorous proofs to determine whether it is correct or not. The method
-of accumulating changes in memory for some period before writing them to the
-log is used effectively in many filesystems including ext3 and ext4. Hence
-no time is spent in this document trying to convince the reader that the
-concept is sound. Instead it is simply considered a "solved problem" and as
-such implementing it in XFS is purely an exercise in software engineering.
-
-The fundamental requirements for delayed logging in XFS are simple:
-
-	. Reduce the amount of metadata written to the log by at least
-	   an order of magnitude.
-	. Supply sufficient statistics to validate Requirement #1.
-	. Supply sufficient new tracing infrastructure to be able to debug
-	   problems with the new code.
-	. No on-disk format change (metadata or log format).
-	. Enable and disable with a mount option.
-	. No performance regressions for synchronous transaction workloads.
-
-== Delayed Logging Design
-
-=== Storing Changes
-
-The problem with accumulating changes at a logical level (i.e. just using the
-existing log item dirty region tracking) is that when it comes to writing the
-changes to the log buffers, we need to ensure that the object we are formatting
-is not changing while we do this. This requires locking the object to prevent
-concurrent modification. Hence flushing the logical changes to the log would
-require us to lock every object, format them, and then unlock them again.
-
-This introduces lots of scope for deadlocks with transactions that are already
-running. For example, a transaction has object A locked and modified, but needs
-the delayed logging tracking lock to commit the transaction. However, the
-flushing thread has the delayed logging tracking lock already held, and is
-trying to get the lock on object A to flush it to the log buffer. This appears
-to be an unsolvable deadlock condition, and it was solving this problem that
-was the barrier to implementing delayed logging for so long.
-
-The solution is relatively simple - it just took a long time to recognise it.
-Put simply, the current logging code formats the changes to each item into an
-vector array that points to the changed regions in the item. The log write code
-simply copies the memory these vectors point to into the log buffer during
-transaction commit while the item is locked in the transaction. Instead of
-using the log buffer as the destination of the formatting code, we can use an
-allocated memory buffer big enough to fit the formatted vector.
-
-If we then copy the vector into the memory buffer and rewrite the vector to
-point to the memory buffer rather than the object itself, we now have a copy of
-the changes in a format that is compatible with the log buffer writing code.
-that does not require us to lock the item to access. This formatting and
-rewriting can all be done while the object is locked during transaction commit,
-resulting in a vector that is transactionally consistent and can be accessed
-without needing to lock the owning item.
-
-Hence we avoid the need to lock items when we need to flush outstanding
-asynchronous transactions to the log. The differences between the existing
-formatting method and the delayed logging formatting can be seen in the
-diagram below.
-
-Current format log vector:
-....
-Object    +---------------------------------------------+
-Vector 1      +----+
-Vector 2                    +----+
-Vector 3                                   +----------+
-....
-
-After formatting:
-
-....
-Log Buffer    +-V1-+-V2-+----V3----+
-....
-
-Delayed logging vector:
-
-....
-Object    +---------------------------------------------+
-Vector 1      +----+
-Vector 2                    +----+
-Vector 3                                   +----------+
-....
-
-After formatting:
-
-....
-Memory Buffer +-V1-+-V2-+----V3----+
-Vector 1      +----+
-Vector 2           +----+
-Vector 3                +----------+
-....
-
-The memory buffer and associated vector need to be passed as a single object,
-but still need to be associated with the parent object so if the object is
-relogged we can replace the current memory buffer with a new memory buffer that
-contains the latest changes.
-
-The reason for keeping the vector around after we've formatted the memory
-buffer is to support splitting vectors across log buffer boundaries correctly.
-If we don't keep the vector around, we do not know where the region boundaries
-are in the item, so we'd need a new encapsulation method for regions in the log
-buffer writing (i.e. double encapsulation). This would be an on-disk format
-change and as such is not desirable.  It also means we'd have to write the log
-region headers in the formatting stage, which is problematic as there is per
-region state that needs to be placed into the headers during the log write.
-
-Hence we need to keep the vector, but by attaching the memory buffer to it and
-rewriting the vector addresses to point at the memory buffer we end up with a
-self-describing object that can be passed to the log buffer write code to be
-handled in exactly the same manner as the existing log vectors are handled.
-Hence we avoid needing a new on-disk format to handle items that have been
-relogged in memory.
-
-
-=== Tracking Changes
-
-Now that we can record transactional changes in memory in a form that allows
-them to be used without limitations, we need to be able to track and accumulate
-them so that they can be written to the log at some later point in time.  The
-log item is the natural place to store this vector and buffer, and also makes sense
-to be the object that is used to track committed objects as it will always
-exist once the object has been included in a transaction.
-
-The log item is already used to track the log items that have been written to
-the log but not yet written to disk. Such log items are considered "active"
-and as such are stored in the Active Item List (AIL) which is a LSN-ordered
-double linked list. Items are inserted into this list during log buffer IO
-completion, after which they are unpinned and can be written to disk. An object
-that is in the AIL can be relogged, which causes the object to be pinned again
-and then moved forward in the AIL when the log buffer IO completes for that
-transaction.
-
-Essentially, this shows that an item that is in the AIL can still be modified
-and relogged, so any tracking must be separate to the AIL infrastructure. As
-such, we cannot reuse the AIL list pointers for tracking committed items, nor
-can we store state in any field that is protected by the AIL lock. Hence the
-committed item tracking needs it's own locks, lists and state fields in the log
-item.
-
-Similar to the AIL, tracking of committed items is done through a new list
-called the Committed Item List (CIL).  The list tracks log items that have been
-committed and have formatted memory buffers attached to them. It tracks objects
-in transaction commit order, so when an object is relogged it is removed from
-it's place in the list and re-inserted at the tail. This is entirely arbitrary
-and done to make it easy for debugging - the last items in the list are the
-ones that are most recently modified. Ordering of the CIL is not necessary for
-transactional integrity (as discussed in the next section) so the ordering is
-done for convenience/sanity of the developers.
-
-
-=== Checkpoints
-
-When we have a log synchronisation event, commonly known as a "log force",
-all the items in the CIL must be written into the log via the log buffers.
-We need to write these items in the order that they exist in the CIL, and they
-need to be written as an atomic transaction. The need for all the objects to be
-written as an atomic transaction comes from the requirements of relogging and
-log replay - all the changes in all the objects in a given transaction must
-either be completely replayed during log recovery, or not replayed at all. If
-a transaction is not replayed because it is not complete in the log, then
-no later transactions should be replayed, either.
-
-To fulfill this requirement, we need to write the entire CIL in a single log
-transaction. Fortunately, the XFS log code has no fixed limit on the size of a
-transaction, nor does the log replay code. The only fundamental limit is that
-the transaction cannot be larger than just under half the size of the log.  The
-reason for this limit is that to find the head and tail of the log, there must
-be at least one complete transaction in the log at any given time. If a
-transaction is larger than half the log, then there is the possibility that a
-crash during the write of a such a transaction could partially overwrite the
-only complete previous transaction in the log. This will result in a recovery
-failure and an inconsistent filesystem and hence we must enforce the maximum
-size of a checkpoint to be slightly less than a half the log.
-
-Apart from this size requirement, a checkpoint transaction looks no different
-to any other transaction - it contains a transaction header, a series of
-formatted log items and a commit record at the tail. From a recovery
-perspective, the checkpoint transaction is also no different - just a lot
-bigger with a lot more items in it. The worst case effect of this is that we
-might need to tune the recovery transaction object hash size.
-
-Because the checkpoint is just another transaction and all the changes to log
-items are stored as log vectors, we can use the existing log buffer writing
-code to write the changes into the log. To do this efficiently, we need to
-minimise the time we hold the CIL locked while writing the checkpoint
-transaction. The current log write code enables us to do this easily with the
-way it separates the writing of the transaction contents (the log vectors) from
-the transaction commit record, but tracking this requires us to have a
-per-checkpoint context that travels through the log write process through to
-checkpoint completion.
-
-Hence a checkpoint has a context that tracks the state of the current
-checkpoint from initiation to checkpoint completion. A new context is initiated
-at the same time a checkpoint transaction is started. That is, when we remove
-all the current items from the CIL during a checkpoint operation, we move all
-those changes into the current checkpoint context. We then initialise a new
-context and attach that to the CIL for aggregation of new transactions.
-
-This allows us to unlock the CIL immediately after transfer of all the
-committed items and effectively allow new transactions to be issued while we
-are formatting the checkpoint into the log. It also allows concurrent
-checkpoints to be written into the log buffers in the case of log force heavy
-workloads, just like the existing transaction commit code does. This, however,
-requires that we strictly order the commit records in the log so that
-checkpoint sequence order is maintained during log replay.
-
-To ensure that we can be writing an item into a checkpoint transaction at
-the same time another transaction modifies the item and inserts the log item
-into the new CIL, then checkpoint transaction commit code cannot use log items
-to store the list of log vectors that need to be written into the transaction.
-Hence log vectors need to be able to be chained together to allow them to be
-detached from the log items. That is, when the CIL is flushed the memory
-buffer and log vector attached to each log item needs to be attached to the
-checkpoint context so that the log item can be released. In diagrammatic form,
-the CIL would look like this before the flush:
-
-----
-	CIL Head
-	   |
-	   V
-	Log Item <-> log vector 1	-> memory buffer
-	   |				-> vector array
-	   V
-	Log Item <-> log vector 2	-> memory buffer
-	   |				-> vector array
-	   V
-	......
-	   |
-	   V
-	Log Item <-> log vector N-1	-> memory buffer
-	   |				-> vector array
-	   V
-	Log Item <-> log vector N	-> memory buffer
-					-> vector array
-----
-
-And after the flush the CIL head is empty, and the checkpoint context log
-vector list would look like:
-
-----
-	Checkpoint Context
-	   |
-	   V
-	log vector 1	-> memory buffer
-	   |		-> vector array
-	   |		-> Log Item
-	   V
-	log vector 2	-> memory buffer
-	   |		-> vector array
-	   |		-> Log Item
-	   V
-	......
-	   |
-	   V
-	log vector N-1	-> memory buffer
-	   |		-> vector array
-	   |		-> Log Item
-	   V
-	log vector N	-> memory buffer
-			-> vector array
-			-> Log Item
-----
-
-Once this transfer is done, the CIL can be unlocked and new transactions can
-start, while the checkpoint flush code works over the log vector chain to
-commit the checkpoint.
-
-Once the checkpoint is written into the log buffers, the checkpoint context is
-attached to the log buffer that the commit record was written to along with a
-completion callback. Log IO completion will call that callback, which can then
-run transaction committed processing for the log items (i.e. insert into AIL
-and unpin) in the log vector chain and then free the log vector chain and
-checkpoint context.
-
-Discussion Point: I am uncertain as to whether the log item is the most
-efficient way to track vectors, even though it seems like the natural way to do
-it. The fact that we walk the log items (in the CIL) just to chain the log
-vectors and break the link between the log item and the log vector means that
-we take a cache line hit for the log item list modification, then another for
-the log vector chaining. If we track by the log vectors, then we only need to
-break the link between the log item and the log vector, which means we should
-dirty only the log item cachelines. Normally I wouldn't be concerned about one
-vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
-vectors in one checkpoint transaction. I'd guess this is a "measure and
-compare" situation that can be done after a working and reviewed implementation
-is in the dev tree....
-
-=== Checkpoint Sequencing
-
-One of the key aspects of the XFS transaction subsystem is that it tags
-committed transactions with the log sequence number of the transaction commit.
-This allows transactions to be issued asynchronously even though there may be
-future operations that cannot be completed until that transaction is fully
-committed to the log. In the rare case that a dependent operation occurs (e.g.
-re-using a freed metadata extent for a data extent), a special, optimised log
-force can be issued to force the dependent transaction to disk immediately.
-
-To do this, transactions need to record the LSN of the commit record of the
-transaction. This LSN comes directly from the log buffer the transaction is
-written into. While this works just fine for the existing transaction
-mechanism, it does not work for delayed logging because transactions are not
-written directly into the log buffers. Hence some other method of sequencing
-transactions is required.
-
-As discussed in the checkpoint section, delayed logging uses per-checkpoint
-contexts, and as such it is simple to assign a sequence number to each
-checkpoint. Because the switching of checkpoint contexts must be done
-atomically, it is simple to ensure that each new context has a monotonically
-increasing sequence number assigned to it without the need for an external
-atomic counter - we can just take the current context sequence number and add
-one to it for the new context.
-
-Then, instead of assigning a log buffer LSN to the transaction commit LSN
-during the commit, we can assign the current checkpoint sequence. This allows
-operations that track transactions that have not yet completed know what
-checkpoint sequence needs to be committed before they can continue. As a
-result, the code that forces the log to a specific LSN now needs to ensure that
-the log forces to a specific checkpoint.
-
-To ensure that we can do this, we need to track all the checkpoint contexts
-that are currently committing to the log. When we flush a checkpoint, the
-context gets added to a "committing" list which can be searched. When a
-checkpoint commit completes, it is removed from the committing list. Because
-the checkpoint context records the LSN of the commit record for the checkpoint,
-we can also wait on the log buffer that contains the commit record, thereby
-using the existing log force mechanisms to execute synchronous forces.
-
-It should be noted that the synchronous forces may need to be extended with
-mitigation algorithms similar to the current log buffer code to allow
-aggregation of multiple synchronous transactions if there are already
-synchronous transactions being flushed. Investigation of the performance of the
-current design is needed before making any decisions here.
-
-The main concern with log forces is to ensure that all the previous checkpoints
-are also committed to disk before the one we need to wait for. Therefore we
-need to check that all the prior contexts in the committing list are also
-complete before waiting on the one we need to complete. We do this
-synchronisation in the log force code so that we don't need to wait anywhere
-else for such serialisation - it only matters when we do a log force.
-
-The only remaining complexity is that a log force now also has to handle the
-case where the forcing sequence number is the same as the current context. That
-is, we need to flush the CIL and potentially wait for it to complete. This is a
-simple addition to the existing log forcing code to check the sequence numbers
-and push if required. Indeed, placing the current sequence checkpoint flush in
-the log force code enables the current mechanism for issuing synchronous
-transactions to remain untouched (i.e. commit an asynchronous transaction, then
-force the log at the LSN of that transaction) and so the higher level code
-behaves the same regardless of whether delayed logging is being used or not.
-
-=== Checkpoint Log Space Accounting
-
-The big issue for a checkpoint transaction is the log space reservation for the
-transaction. We don't know how big a checkpoint transaction is going to be
-ahead of time, nor how many log buffers it will take to write out, nor the
-number of split log vector regions are going to be used. We can track the
-amount of log space required as we add items to the commit item list, but we
-still need to reserve the space in the log for the checkpoint.
-
-A typical transaction reserves enough space in the log for the worst case space
-usage of the transaction. The reservation accounts for log record headers,
-transaction and region headers, headers for split regions, buffer tail padding,
-etc. as well as the actual space for all the changed metadata in the
-transaction. While some of this is fixed overhead, much of it is dependent on
-the size of the transaction and the number of regions being logged (the number
-of log vectors in the transaction).
-
-An example of the differences would be logging directory changes versus logging
-inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then
-there are lots of transactions that only contain an inode core and an inode log
-format structure. That is, two vectors totaling roughly 150 bytes. If we modify
-10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
-vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
-comparison, if we are logging full directory buffers, they are typically 4KB
-each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a
-buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
-space.  From this, it should be obvious that a static log space reservation is
-not particularly flexible and is difficult to select the "optimal value" for
-all workloads.
-
-Further, if we are going to use a static reservation, which bit of the entire
-reservation does it cover? We account for space used by the transaction
-reservation by tracking the space currently used by the object in the CIL and
-then calculating the increase or decrease in space used as the object is
-relogged. This allows for a checkpoint reservation to only have to account for
-log buffer metadata used such as log header records.
-
-However, even using a static reservation for just the log metadata is
-problematic. Typically log record headers use at least 16KB of log space per
-1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
-large enough to handle arbitrary sized checkpoint transactions. This
-reservation needs to be made before the checkpoint is started, and we need to
-be able to reserve the space without sleeping.  For a 8MB checkpoint, we need a
-reservation of around 150KB, which is a non-trivial amount of space.
-
-A static reservation needs to manipulate the log grant counters - we can take a
-permanent reservation on the space, but we still need to make sure we refresh
-the write reservation (the actual space available to the transaction) after
-every checkpoint transaction completion. Unfortunately, if this space is not
-available when required, then the regrant code will sleep waiting for it.
-
-The problem with this is that it can lead to deadlocks as we may need to commit
-checkpoints to be able to free up log space (refer back to the description of
-rolling transactions for an example of this).  Hence we *must* always have
-space available in the log if we are to use static reservations, and that is
-very difficult and complex to arrange. It is possible to do, but there is a
-simpler way.
-
-The simpler way of doing this is tracking the entire log space used by the
-items in the CIL and using this to dynamically calculate the amount of log
-space required by the log metadata. If this log metadata space changes as a
-result of a transaction commit inserting a new memory buffer into the CIL, then
-the difference in space required is removed from the transaction that causes
-the change. Transactions at this level will *always* have enough space
-available in their reservation for this as they have already reserved the
-maximal amount of log metadata space they require, and such a delta reservation
-will always be less than or equal to the maximal amount in the reservation.
-
-Hence we can grow the checkpoint transaction reservation dynamically as items
-are added to the CIL and avoid the need for reserving and regranting log space
-up front. This avoids deadlocks and removes a blocking point from the
-checkpoint flush code.
-
-As mentioned early, transactions can't grow to more than half the size of the
-log. Hence as part of the reservation growing, we need to also check the size
-of the reservation against the maximum allowed transaction size. If we reach
-the maximum threshold, we need to push the CIL to the log. This is effectively
-a "background flush" and is done on demand. This is identical to
-a CIL push triggered by a log force, only that there is no waiting for the
-checkpoint commit to complete. This background push is checked and executed by
-transaction commit code.
-
-If the transaction subsystem goes idle while we still have items in the CIL,
-they will be flushed by the periodic log force issued by the xfssyncd. This log
-force will push the CIL to disk, and if the transaction subsystem stays idle,
-allow the idle log to be covered (effectively marked clean) in exactly the same
-manner that is done for the existing logging method. A discussion point is
-whether this log force needs to be done more frequently than the current rate
-which is once every 30s.
-
-
-=== Log Item Pinning
-
-Currently log items are pinned during transaction commit while the items are
-still locked. This happens just after the items are formatted, though it could
-be done any time before the items are unlocked. The result of this mechanism is
-that items get pinned once for every transaction that is committed to the log
-buffers. Hence items that are relogged in the log buffers will have a pin count
-for every outstanding transaction they were dirtied in. When each of these
-transactions is completed, they will unpin the item once. As a result, the item
-only becomes unpinned when all the transactions complete and there are no
-pending transactions. Thus the pinning and unpinning of a log item is symmetric
-as there is a 1:1 relationship with transaction commit and log item completion.
-
-For delayed logging, however, we have an asymmetric transaction commit to
-completion relationship. Every time an object is relogged in the CIL it goes
-through the commit process without a corresponding completion being registered.
-That is, we now have a many-to-one relationship between transaction commit and
-log item completion. The result of this is that pinning and unpinning of the
-log items becomes unbalanced if we retain the "pin on transaction commit, unpin
-on transaction completion" model.
-
-To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
-insertion into the CIL, unpin on checkpoint completion". In other words, the
-pinning and unpinning becomes symmetric around a checkpoint context. We have to
-pin the object the first time it is inserted into the CIL - if it is already in
-the CIL during a transaction commit, then we do not pin it again. Because there
-can be multiple outstanding checkpoint contexts, we can still see elevated pin
-counts, but as each checkpoint completes the pin count will retain the correct
-value according to it's context.
-
-Just to make matters more slightly more complex, this checkpoint level context
-for the pin count means that the pinning of an item must take place under the
-CIL commit/flush lock. If we pin the object outside this lock, we cannot
-guarantee which context the pin count is associated with. This is because of
-the fact pinning the item is dependent on whether the item is present in the
-current CIL or not. If we don't pin the CIL first before we check and pin the
-object, we have a race with CIL being flushed between the check and the pin
-(or not pinning, as the case may be). Hence we must hold the CIL flush/commit
-lock to guarantee that we pin the items correctly.
-
-=== Concurrent Scalability
-
-A fundamental requirement for the CIL is that accesses through transaction
-commits must scale to many concurrent commits. The current transaction commit
-code does not break down even when there are transactions coming from 2048
-processors at once. The current transaction code does not go any faster than if
-there was only one CPU using it, but it does not slow down either.
-
-As a result, the delayed logging transaction commit code needs to be designed
-for concurrency from the ground up. It is obvious that there are serialisation
-points in the design - the three important ones are:
-
-	. Locking out new transaction commits while flushing the CIL
-	. Adding items to the CIL and updating item space accounting
-	. Checkpoint commit ordering
-
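-These three serialisation points map onto three separate pieces of state. One
-possible, purely illustrative, way of grouping them is shown below; the field
-names are invented and the pthread types merely stand in for the kernel
-primitives discussed in the rest of this section.
-
------
-#include <pthread.h>
-
-struct cil_sketch {
-	/* 1. Taken shared by transaction commit, exclusive by the CIL flush. */
-	pthread_rwlock_t	ctx_lock;
-
-	/*
-	 * 2. Protects CIL list insertion and item space accounting.  The
-	 * checkpoint ordering code (point 3) reuses this lock for its list
-	 * walks.  A mutex stands in for the kernel spinlock here.
-	 */
-	pthread_mutex_t		cil_lock;
-
-	/*
-	 * 3. Commit record ordering and log force waiters sleep on this and
-	 * are woken by broadcast when a checkpoint makes progress.
-	 */
-	pthread_cond_t		commit_wait;
-};
------
-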
-Looking at the transaction commit and CIL flushing interactions, it is clear
-that we have a many-to-one interaction here. That is, the only restriction on
-the number of concurrent transactions that can be trying to commit at once is
-the amount of space available in the log for their reservations. The practical
-limit here is on the order of several hundred concurrent transactions for a
-128MB log, which in practice means roughly one committer per CPU in a machine.
-
-A transaction commit needs to hold out a flush for a relatively long time - the
-pinning of log items needs to be done while we are holding out a CIL flush, so
-at the moment the lock is held across the formatting of the objects into memory
-buffers (i.e. while memcpy()s are in progress). Ultimately, a two-pass
-algorithm, where the formatting is done separately from the pinning of objects,
-could be used to reduce the hold time on the transaction commit side.
-
-Because of the number of potential transaction commit side holders, the lock
-really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
-want every other CPU in the machine spinning on the CIL lock. Given that
-flushing the CIL could involve walking a list of tens of thousands of log
-items, it will get held for a significant time and so spin contention is a
-significant concern. Preventing lots of CPUs spinning doing nothing is the
-main reason for choosing a sleeping lock even though nothing in either the
-transaction commit or CIL flush side sleeps with the lock held.
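-
-The flush side of this exclusion might look something like the sketch below,
-again with a pthreads rwlock standing in for the sleeping read-write semaphore.
-The helper names are invented and their bodies elided; the point is only that
-the potentially long walk of the CIL happens under the exclusive lock, while
-the log writes happen after it has been dropped.
-
------
-#include <pthread.h>
-#include <stddef.h>
-
-struct cil_item_list;			/* opaque in this sketch */
-
-static pthread_rwlock_t cil_ctx_lock = PTHREAD_RWLOCK_INITIALIZER;
-
-/* Hypothetical helpers; real bodies elided. */
-static struct cil_item_list *chain_vectors_and_empty_cil(void) { return NULL; }
-static void write_vectors_to_log(struct cil_item_list *items) { (void)items; }
-
-void cil_flush(void)
-{
-	struct cil_item_list *items;
-
-	/*
-	 * Exclusive acquisition holds out every transaction commit (which
-	 * takes the lock shared) while every item in the CIL is walked,
-	 * its log vectors chained and the list emptied.
-	 */
-	pthread_rwlock_wrlock(&cil_ctx_lock);
-	items = chain_vectors_and_empty_cil();
-	pthread_rwlock_unlock(&cil_ctx_lock);
-
-	/* The log buffer writes happen after the lock is dropped. */
-	write_vectors_to_log(items);
-}
------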
-
-It should also be noted that CIL flushing is a relatively rare operation
-compared to transaction commit for asynchronous transaction workloads - only
-time will tell whether using a read-write semaphore for exclusion will limit
-transaction commit concurrency due to cache line bouncing of the lock on the
-read side.
-
-The second serialisation point is on the transaction commit side where items
-are inserted into the CIL. Because transactions can enter this code
-concurrently, the CIL needs to be protected separately from the above
-commit/flush exclusion. It also needs to be an exclusive lock, but it is only
-held for a very short time and so a spin lock is appropriate here. It is
-possible that this lock will become a contention point, but given the short
-hold time once per transaction, I think that contention is unlikely.
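-
-A sketch of that insertion path is shown below; a pthreads mutex stands in for
-the spin lock, and the list layout and names are invented for illustration
-only.
-
------
-#include <pthread.h>
-#include <stddef.h>
-
-struct item {
-	struct item	*next;
-	size_t		formatted_len;	/* size of the formatted log vector */
-};
-
-struct cil_head {
-	pthread_mutex_t	lock;		/* stands in for the CIL spin lock */
-	struct item	*items;		/* items in the current checkpoint */
-	size_t		used_bytes;	/* space accounting for the checkpoint */
-};
-
-/* Insert an item and account for its space; held only for a few stores. */
-void cil_insert(struct cil_head *cil, struct item *ip)
-{
-	pthread_mutex_lock(&cil->lock);
-	ip->next = cil->items;
-	cil->items = ip;
-	cil->used_bytes += ip->formatted_len;
-	pthread_mutex_unlock(&cil->lock);
-}
------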
-
-The final serialisation point is the checkpoint commit record ordering code
-that is run as part of the checkpoint commit and log force sequencing. The code
-path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
-an ordering loop after writing all the log vectors into the log buffers but
-before writing the commit record. This loop walks the list of committing
-checkpoints and needs to block waiting for checkpoints to complete their commit
-record write. As a result it needs a lock and a wait variable. Log force
-sequencing also requires the same lock, list walk, and blocking mechanism to
-ensure completion of checkpoints.
-
-These two sequencing operations can use the same mechanism even though the
-events they are waiting for are different. The checkpoint commit record
-sequencing needs to wait until checkpoint contexts contain a commit LSN
-(obtained through completion of a commit record write) while log force
-sequencing needs to wait until previous checkpoint contexts are removed from
-the committing list (i.e. they've completed). A simple wait variable and
-broadcast wakeups (thundering herds) have been used to implement these two
-serialisation queues. They use the same lock as the CIL, too. If we see too
-much contention on the CIL lock, or too many context switches as a result of
-the broadcast wakeups, these operations can be put under a new spinlock and
-given separate wait lists to reduce lock contention and the number of processes
-woken by the wrong event.
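-
-The following sketch models the ordering loop and the broadcast wakeup in
-userspace. The committing list, the sequence numbers and all of the names are
-invented for illustration; in particular the removal of completed contexts
-from the list (which the log force side waits on) is not modelled here.
-
------
-#include <pthread.h>
-#include <stdbool.h>
-
-struct ckpt_ctx {
-	struct ckpt_ctx	*next;		/* committing list */
-	unsigned long	sequence;	/* checkpoint sequence number */
-	bool		commit_written;	/* commit record written yet? */
-};
-
-static pthread_mutex_t	committing_lock = PTHREAD_MUTEX_INITIALIZER;
-static pthread_cond_t	committing_wait = PTHREAD_COND_INITIALIZER;
-static struct ckpt_ctx	*committing_list;
-
-/* Block until every checkpoint older than @ctx has written its commit record. */
-void wait_for_prior_commit_records(struct ckpt_ctx *ctx)
-{
-	struct ckpt_ctx *p;
-
-	pthread_mutex_lock(&committing_lock);
-restart:
-	for (p = committing_list; p != NULL; p = p->next) {
-		if (p->sequence >= ctx->sequence)
-			continue;
-		if (!p->commit_written) {
-			pthread_cond_wait(&committing_wait, &committing_lock);
-			goto restart;	/* list may have changed while asleep */
-		}
-	}
-	pthread_mutex_unlock(&committing_lock);
-}
-
-/* Called once @ctx has written its commit record: wake every waiter. */
-void commit_record_written(struct ckpt_ctx *ctx)
-{
-	pthread_mutex_lock(&committing_lock);
-	ctx->commit_written = true;
-	pthread_cond_broadcast(&committing_wait);
-	pthread_mutex_unlock(&committing_lock);
-}
------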
-
-
-=== Lifecycle Changes
-
-The existing log item life cycle is as follows:
-
-----
-	1. Transaction allocate
-	2. Transaction reserve
-	3. Lock item
-	4. Join item to transaction
-		If not already attached,
-			Allocate log item
-			Attach log item to owner item
-		Attach log item to transaction
-	5. Modify item
-		Record modifications in log item
-	6. Transaction commit
-		Pin item in memory
-		Format item into log buffer
-		Write commit LSN into transaction
-		Unlock item
-		Attach transaction to log buffer
-
-	<log buffer IO dispatched>
-	<log buffer IO completes>
-
-	7. Transaction completion
-		Mark log item committed
-		Insert log item into AIL
-			Write commit LSN into log item
-		Unpin log item
-	8. AIL traversal
-		Lock item
-		Mark log item clean
-		Flush item to disk
-
-	<item IO completion>
-
-	9. Log item removed from AIL
-		Moves log tail
-		Item unlocked
-----
-
-Essentially, steps 1-6 operate independently from step 7, which is also
-independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
-at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
-at the same time. If the log item is in the AIL or between steps 6 and 7
-and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
-are entered and completed is the object considered clean.
-
-With delayed logging, there are new steps inserted into the life cycle:
-
-----
-	1. Transaction allocate
-	2. Transaction reserve
-	3. Lock item
-	4. Join item to transaction
-		If not already attached,
-			Allocate log item
-			Attach log item to owner item
-		Attach log item to transaction
-	5. Modify item
-		Record modifications in log item
-	6. Transaction commit
-		Pin item in memory if not pinned in CIL
-		Format item into log vector + buffer
-		Attach log vector and buffer to log item
-		Insert log item into CIL
-		Write CIL context sequence into transaction
-		Unlock item
-
-	<next log force>
-
-	7. CIL push
-		lock CIL flush
-		Chain log vectors and buffers together
-		Remove items from CIL
-		unlock CIL flush
-		write log vectors into log
-		sequence commit records
-		attach checkpoint context to log buffer
-
-	<log buffer IO dispatched>
-	<log buffer IO completes>
-
-	8. Checkpoint completion
-		Mark log item committed
-		Insert item into AIL
-			Write commit LSN into log item
-		Unpin log item
-	9. AIL traversal
-		Lock item
-		Mark log item clean
-		Flush item to disk
-	<item IO completion>
-	10. Log item removed from AIL
-		Moves log tail
-		Item unlocked
-----
-
-From this, it can be seen that the only life cycle differences between the two
-logging methods are in the middle of the life cycle - they still have the same
-beginning and end and execution constraints. The only differences are in the
-committing of the log items to the log itself and the completion processing.
-Hence delayed logging should not introduce any constraints on log item
-behaviour, allocation or freeing that don't already exist.
-
-As a result of this zero-impact "insertion" of delayed logging infrastructure
-and the design of the internal structures to avoid on-disk format changes, we
-can basically switch between delayed logging and the existing mechanism with a
-mount option. Fundamentally, there is no reason why the log manager would not
-be able to swap methods automatically and transparently depending on load
-characteristics, but this should not be necessary if delayed logging works as
-designed.
-
-EOF.