From patchwork Tue May 14 00:59:43 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Zhiyong Wu
X-Patchwork-Id: 2561561
Return-Path:
X-Original-To: patchwork-linux-btrfs@patchwork.kernel.org
Delivered-To: patchwork-process-083081@patchwork1.kernel.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
 by patchwork1.kernel.org (Postfix) with ESMTP id 76A703FD4E
 for ; Tue, 14 May 2013 01:00:52 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1756060Ab3ENBAD (ORCPT ); Mon, 13 May 2013 21:00:03 -0400
Received: from e8.ny.us.ibm.com ([32.97.182.138]:53239 "EHLO
 e8.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1755728Ab3ENA77 (ORCPT );
 Mon, 13 May 2013 20:59:59 -0400
Received: from /spool/local by e8.ny.us.ibm.com with IBM ESMTP SMTP
 Gateway: Authorized Use Only! Violators will be prosecuted
 for from ; Mon, 13 May 2013 20:59:58 -0400
Received: from d01dlp03.pok.ibm.com (9.56.250.168)
 by e8.ny.us.ibm.com (192.168.1.108) with IBM ESMTP SMTP
 Gateway: Authorized Use Only! Violators will be prosecuted;
 Mon, 13 May 2013 20:59:55 -0400
Received: from d01relay05.pok.ibm.com (d01relay05.pok.ibm.com [9.56.227.237])
 by d01dlp03.pok.ibm.com (Postfix) with ESMTP id 2C927C90044;
 Mon, 13 May 2013 20:59:54 -0400 (EDT)
Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215])
 by d01relay05.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP
 id r4E0xsvX318816; Mon, 13 May 2013 20:59:54 -0400
Received: from d01av01.pok.ibm.com (loopback [127.0.0.1])
 by d01av01.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP
 id r4E0xrAu012156; Mon, 13 May 2013 20:59:54 -0400
Received: from us.ibm.com (f17.cn.ibm.com [9.115.122.140])
 by d01av01.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVin) with SMTP
 id r4E0xlxX011940; Mon, 13 May 2013 20:59:49 -0400
Received: by us.ibm.com (sSMTP sendmail emulation);
 Tue, 14 May 2013 09:00:48 +0800
From: zwu.kernel@gmail.com
To: viro@zeniv.linux.org.uk
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-btrfs@vger.kernel.org, sekharan@us.ibm.com, linuxram@us.ibm.com,
 david@fromorbit.com, dsterba@suse.cz, gregkh@linuxfoundation.org,
 paulmck@linux.vnet.ibm.com, chris.mason@fusionio.com, Zhi Yong Wu
Subject: [PATCH v2 11/12] VFS hot tracking: add documentation
Date: Tue, 14 May 2013 08:59:43 +0800
Message-Id: <1368493184-5939-12-git-send-email-zwu.kernel@gmail.com>
X-Mailer: git-send-email 1.7.11.7
In-Reply-To: <1368493184-5939-1-git-send-email-zwu.kernel@gmail.com>
References: <1368493184-5939-1-git-send-email-zwu.kernel@gmail.com>
X-TM-AS-MML: No
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 13051400-9360-0000-0000-00001227627B
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-btrfs@vger.kernel.org

From: Zhi Yong Wu

Add Documentation for VFS hot tracking feature

Signed-off-by: Chandra Seetharaman
Signed-off-by: Zhi Yong Wu
---
 Documentation/filesystems/00-INDEX         |   2 +
 Documentation/filesystems/hot_tracking.txt | 256 +++++++++++++++++++++++++++++
 2 files changed, 258 insertions(+)
 create mode 100644 Documentation/filesystems/hot_tracking.txt

diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
index 8042050..2454472 100644
--- a/Documentation/filesystems/00-INDEX
+++ b/Documentation/filesystems/00-INDEX
@@ -122,3 +122,5 @@ xfs.txt
 	- info and mount options for the XFS filesystem.
 xip.txt
 	- info on execute-in-place for file mappings.
+hot_tracking.txt
+	- info on hot data tracking in VFS layer
diff --git a/Documentation/filesystems/hot_tracking.txt b/Documentation/filesystems/hot_tracking.txt
new file mode 100644
index 0000000..9ea3fa8
--- /dev/null
+++ b/Documentation/filesystems/hot_tracking.txt
@@ -0,0 +1,256 @@
+Hot Data Tracking
+
+April, 2013 Zhi Yong Wu
+
+CONTENTS
+
+1. Introduction
+2. Motivation
+3. The Design
+4. How to Calc Frequency of Reads/Writes & Temperature
+5. Git Development Tree
+6. Usage Example
+
+
+1. Introduction
+
+  This feature adds support for tracking data temperature
+information in the VFS layer. Essentially, this means maintaining some key
+stats (like number of reads/writes, last read/write time, frequency of
+reads/writes), then distilling those numbers down to a single
+"temperature" value that reflects what data is "hot". A filesystem
+can use this information to move hot data from slow devices to fast
+devices.
+
+  The long-term goal of the feature is to allow some filesystems,
+e.g. Btrfs, to intelligently utilize SSDs in a heterogeneous volume.
+Incidentally, this project has been motivated by
+the Project Ideas page on the Btrfs wiki.
+
+
+2. Motivation
+
+  This is essentially the traditional cache argument: SSD is fast and
+expensive; HDD is cheap but slow. ZFS, for example, can already take
+advantage of SSD caching. Btrfs should also be able to take advantage of
+hybrid storage without many broad, sweeping changes to existing code.
+
+  The overall goal of enabling hot data relocation to SSD has been
+motivated by the Project Ideas page on the Btrfs wiki at
+.
+The work divides into two parts: the VFS provides the hot data tracking
+function, while a specific FS provides the hot data relocation function.
+As the first step toward this goal, this feature provides the first part
+of the functionality.
+
+
+3. The Design
+
+The design includes the following parts:
+
+  * Hooks in existing vfs functions to track data access frequency
+
+  * New rb-trees for tracking access frequency of inodes and sub-file
+ranges
+    The relationship between super_block and rb-trees is as below:
+hot_info.hot_inode_tree
+    Each FS instance can find its hot tracking info via s_hot_root.
+    hot_info has hot_inode_tree, which holds per-inode hot information;
+each inode item has a hot_range_tree, which holds per-range information.
+
+  * A list of hot inodes and hot ranges, organized by temperature
+
+  * A debugfs interface for dumping data from the rb-trees
+
+  * A work queue for updating inode heat info
+
+  * A mount option for enabling temperature tracking (-o hot_track;
+disabled by default)
+  * An ioctl to retrieve the frequency information collected for a certain
+file
+  * Ioctls to enable/disable frequency tracking per inode.
+
+Their relationship is as follows:
+
+  * hot_info.hot_inode_tree indexes hot_inode_items, one per inode
+
+  * hot_inode_item contains access frequency data for that inode
+
+  * hot_inode_item holds a heat list node to link the access frequency
+data for that inode
+
+  * hot_inode_item.hot_range_tree indexes hot_range_items for that inode
+
+  * hot_range_item contains access frequency data for that range
+
+  * hot_range_item holds a heat list node to index the access
+frequency data for that range
+
+  * hot_info.heat_inode_map indexes per-inode heat list nodes
+
+  * hot_info.heat_range_map indexes per-range heat list nodes
+
+  How about some ascii art? :) Just looking at the hot inode item case
+(the range item case is the same pattern, though), we have:
+
+                         super_block
+                              |
+                              V
+                          hot_info
+                              |
+    +-------------------------+----------------------------------------+
+    |                         |                                        |
+    |                         |                                        |
+    V                         V                                        V
+heat_inode_map         hot_inode_tree                           heat_range_map
+    |                         |                                        |
+    |                         V                                        |
+    |          +-------hot_comm_item--------+                          |
+    |          |       frequency data       |                          |
++---+          |         list_head          |                          |
+|              V                            V                          |
+|  ...<--hot_comm_item-->...    ...<--hot_comm_item-->...              |
+        frequency data               frequency data                    |
+           list_head                    list_head                      |
+        hot_range_tree               hot_range_tree                    |
+                                            |                          |
+                                            V                          |
+                             +-------hot_comm_item--------+            |
+                             |       frequency data       |            |
+                             |         list_head          |            +---+
+                             V            ^ |             V                |
+                   <--hot_comm_item-->... | | ...<--hot_comm_item-->...    |
+                     frequency data                frequency data
+                        list_head                     list_head
+
+
+4. How to Calc Frequency of Reads/Writes & Temperature
+
+1.) hot_rw_freq_calc()
+
+  This function does the actual work of updating the frequency numbers.
+FREQ_POWER determines how many atime deltas we keep track of (as a power of 2).
+So, setting it to anything above 16ish is probably overkill. Also,
+the higher the power, the more bits get right shifted out of the timestamp,
+reducing precision, so take note of that as well.
+
+  FREQ_POWER, defined immediately below, determines how heavily to weight
+the current frequency numbers against the newest access. For example, a value
+of 4 means that the new access information will be weighted 1/16th (i.e. 2^-4)
+as heavily as the existing frequency info. In essence, this is a kludged-
+together version of a weighted average, since we can't afford to keep all of
+the information that it would take to get a _real_ weighted average.
+
+2.) hot_temp_calc()
+
+  The following explains what exactly comprises a unit of heat.
+Each of six heat values is calculated and combined in order to form an
+overall temperature for the data:
+
+  * NRR - number of reads since mount
+  * NRW - number of writes since mount
+  * LTR - time elapsed since last read (ns)
+  * LTW - time elapsed since last write (ns)
+  * AVR - average delta between recent reads (ns)
+  * AVW - average delta between recent writes (ns)
+
+  These values are divided (right-shifted) according to the *_DIVIDER_POWER
+values defined below to bring the numbers into a reasonable range. You can
+modify these values to fit your needs. However, each heat unit is a u32 and
+thus maxes out at 2^32 - 1. Therefore, you must choose your dividers quite
+carefully or else they could max out or be stuck at zero quite easily.
+(E.g., if you chose AVR_DIVIDER_POWER = 0, nothing less than 4s of atime
+delta would bring the temperature above zero, ever.)
+
+  Finally, each value is added to the overall temperature between 0 and 8
+times, depending on its *_COEFF_POWER value. Note that the coefficients are
+also actually implemented with shifts, so take care to treat these values
+as powers of 2. (I.e., 0 means we'll add it to the temp once; 1 = 2x, etc.)
+
+  * AVR/AVW cold unit = 2^X ns of average delta
+  * AVR/AVW heat unit = HEAT_MAX_VALUE - cold unit
+
+  E.g., data with an average delta between 0 and 2^X ns will have a cold
+value of 0, which means a heat value equal to HEAT_MAX_VALUE.
+
+  This function is responsible for distilling the six heat
+criteria (which are described in detail in hot_tracking.h) down into a single
+temperature value for the data, which is an integer between 0
+and HEAT_MAX_VALUE.
+
+  To accomplish this, the raw values from the hot_freq_data structure
+are shifted in order to make the temperature calculation more
+or less sensitive to each value.
+
+  Once this calibration has happened, we do some additional normalization and
+make sure that everything fits nicely in a u32. From there, we take a very
+rudimentary kind of "average" of each of the values, where the *_COEFF_POWER
+values act as weights for the average.
+
+  Finally, we use the MAP_BITS value, which determines the size of the
+heat list array, to normalize the temperature to the proper granularity.
+
+
+5. Git Development Tree
+
+  This feature is still under development and review, so if you're interested,
+you can pull from the git repository at the following location:
+
+  https://github.com/wuzhy/kernel.git hot_tracking
+  git://github.com/wuzhy/kernel.git hot_tracking
+
+
+6. Usage Example
+
+1.) To use hot tracking, you should mount like this:
+
+$ mount -o hot_track /dev/sdb /mnt
+[ 1505.894078] device label test devid 1 transid 29 /dev/sdb
+[ 1505.952977] btrfs: disk space caching is enabled
+[ 1506.069678] vfs: turning on hot data tracking
+
+2.) Mount debugfs first:
+
+$ mount -t debugfs none /sys/kernel/debug
+$ ls -l /sys/kernel/debug/hot_track/
+total 0
+drwxr-xr-x 2 root root 0 Aug 8 04:40 sdb
+$ ls -l /sys/kernel/debug/hot_track/sdb
+total 0
+-rw-r--r-- 1 root root 0 Aug 8 04:40 inode_stat
+-rw-r--r-- 1 root root 0 Aug 8 04:40 extent_stat
+
+3.) View information about hot tracking from debugfs:
+
+$ echo "hot tracking test" > /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_stat
+inode 279, reads 0, writes 1, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/extent_stat
+inode 279, extent 0+1048576, reads 0, writes 1, temp 64
+
+$ echo "hot data tracking test" >> /mnt/file
+$ cat /sys/kernel/debug/hot_track/sdb/inode_stat
+inode 279, reads 0, writes 2, temp 109
+$ cat /sys/kernel/debug/hot_track/sdb/extent_stat
+inode 279, extent 0+1048576, reads 0, writes 2, temp 64
+
+4.) Check the temperature sorting result of some inodes:
+
+$ cat /sys/kernel/debug/hot_track/loop0/inode_spot
+inode 5248773, reads 0, writes 244, temp 111
+inode 878523, reads 0, writes 1, temp 109
+inode 878524, reads 0, writes 1, temp 109
+
+5.) Tune some hot tracking parameters as below:
+
+$ cat /proc/sys/fs/hot-age-interval
+300
+$ echo 360 > /proc/sys/fs/hot-age-interval
+$ cat /proc/sys/fs/hot-age-interval
+360
+$ cat /proc/sys/fs/hot-update-interval
+300
+$ echo 360 > /proc/sys/fs/hot-update-interval
+$ cat /proc/sys/fs/hot-update-interval
+360
+
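
Note on section 4 of hot_tracking.txt: the shift-based weighted average
described for hot_rw_freq_calc() and the AVR/AVW cold/heat unit conversion
described for hot_temp_calc() can be sketched in user-space C as below. This
is a minimal sketch under stated assumptions, not the code from this series:
the sketch_* function names, the SKETCH_HEAT_MAX value (taken from the
"u32 maxes out at 2^32 - 1" remark) and the parameter layout are all
hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for HEAT_MAX_VALUE; the real definition may differ. */
#define SKETCH_HEAT_MAX UINT32_MAX

/*
 * Fold one new access delta into a running average using shifts only:
 *   avg' = (avg * (2^power - 1) + new_delta) / 2^power
 * so the newest sample carries a weight of 2^-power.
 * Hypothetical helper, not a function from the patch.
 */
static uint64_t sketch_freq_update(uint64_t avg_delta, uint64_t new_delta,
				   unsigned int power)
{
	assert(power > 0 && power < 32);
	/* Keep the left shift below from overflowing 64 bits. */
	assert(avg_delta <= (UINT64_MAX >> power));
	assert(new_delta <= (UINT64_MAX >> power));

	return ((avg_delta << power) - avg_delta + new_delta) >> power;
}

/*
 * Turn an average delta (ns) into a heat unit:
 *   cold = avg_delta_ns / 2^divider_power, clamped to the u32 range
 *   heat = SKETCH_HEAT_MAX - cold
 * Small deltas (frequent access) give heat near the maximum; deltas
 * too large for the chosen divider clamp to a heat of zero.
 */
static uint32_t sketch_delta_heat(uint64_t avg_delta_ns,
				  unsigned int divider_power)
{
	uint64_t cold;

	assert(divider_power < 64);
	cold = avg_delta_ns >> divider_power;
	if (cold > SKETCH_HEAT_MAX)
		cold = SKETCH_HEAT_MAX;

	return (uint32_t)(SKETCH_HEAT_MAX - cold);
}
```

With a power of 4, an existing average of 1600 ns combined with a new delta
of 0 ns moves only one sixteenth of the way toward the new sample. With a
divider power of 0, any average delta above roughly 4.29 s clamps to zero
heat, which is the pitfall the AVR_DIVIDER_POWER remark warns about.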